Remove Outliers - SQL Interview Query 11 | SQL Problem Level 'MEDIUM'

Показать описание

30DaySQLQueryChallenge is a series of 30 videos covering 30 SQL Interview Queries. This is the ELEVENTH video in this series. This video series aims to provide 30 SQL Queries that can be asked during SQL Interviews. I will explain how to solve and address such SQL queries during interviews in these videos.

Let's follow the below routine to make the best use of it:
1. Watch the YouTube video (first half) to understand the problem statement.

2. Go to my discord server (link below), download the dataset for each problem, and try solving it yourself.

3. Share your solution on Discord and discuss different solutions and issues on my Discord server.

4. Watch the second half of my YouTube video to find my solution to the problem.

5. Share it with your contacts and spread the knowledge.

DOWNLOAD the Dataset from below:

Timeline:
00:00 Intro
00:10 Understanding Problem Statement
03:02 Solution to the SQL Problem
06:40 Quick understanding of SQL Frame Clause
10:27 Continue with the solution to the SQL Problem

Thanks for participating in this challenge!

Good luck and Happy Learning!

Рекомендации по теме

Комментарии

Not a full proof solution as one partition might have more than one distinct outliers..Not sure how rank function will help there

arnabsarkar

Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance:

WITH cte1 AS (
SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg
FROM hotel_ratings),
cte2 AS (
SELECT *, ABS (rating - r_avg) AS diff
FROM cte1
)
SELECT hotel, year, rating
FROM cte2
WHERE diff <= 1
ORDER BY hotel DESC, year;

The RANK function is not necessary in this case because it is only suitable for small datasets where outliers can be easily identified visually. For simple visual identification, using ABS(rating - r_avg) AS diff <= 1 is enough.

When it comes to large datasets with more than 2 outliers, this solution is not suitable. I prefer using z-scores for a more universal approach:

WITH cte1 AS (
SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg,
STDDEV_SAMP(rating) OVER (PARTITION BY hotel) AS std_dev
FROM input_11
),
cte2 AS (
SELECT *, (rating - r_avg)/std_dev AS z_scores
FROM cte1
)
SELECT hotel, year, rating FROM cte2
WHERE ABS(z_scores) < 1.4
ORDER BY hotel DESC, year;

camomile

my approach :
with cte as( select *, avg(rating)over(partition by hotel )as avg
from hotel_ratings),
cte2 as (select *, abs(avg-rating) as abs from cte),

cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2)
select hotel, year, rating from cte3
where exclude<>abs
order by hotel desc

cqlhfmc

Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.

vensandy_Data_Analyst

Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example:
If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist.

Right solution:
Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers.
If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate
That being said thank you for your continuous efforts in helping us build strong SQL foundations.

Tusharchitrakar

with cte as (
SELECT *,
round(avg(rating) over(partition by hotel order by year
range between unbounded preceding and unbounded following ), 2) as avg_rating
FROM hotel),
cte1 as (select *,
abs(avg_rating-rating) as diif
from cte),
cte2 as (select *,
dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc
) as rn
from cte1)
select hotel, year, rating from cte2
where rn>1

parmoddhiman

with cte as(select *, rank() over(order by rating asc)as max_rnk, rank() over(order by rating desc)as min_rnk from hotel_ratings)
select hotel, year, rating from cte where max_rnk <>8 and max_rnk<>1 and min_rnk <>1 order by hotel
donot know whether it will be dynamic or not...

anirbanbiswas

with cte as
( select *, avg(rating)over(partition by hotel ) as avg_rating, round(STDDEV(rating)over(partition by hotel), 4) as std
from hotel_ratings),
cte1 as
(select *, abs((rating - avg_rating)/std) as out_detect
from cte)
select hotel, year, rating from cte1 where out_detect < 1.5
order by hotel desc, year

nileshpandey

with cte as
(select hotel, year, rating, average_rating, average_rating_difference,
over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference
from (select *,
over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating,
- AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference
hotel_ratings)x )
select hotel, year, rating
from cte
where average_rating_difference <> max_difference

vaibhavbatra

WITH AVG_RATING_CTE AS
(SELECT *,
BY HOTEL ORDER BY YEAR
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 1) AS AVG_RATING
FROM HOTEL_RATINGS)
, DIFF_CTE AS(SELECT
HOTEL, YEAR, RATING, AVG_RATING, ABS(AVG_RATING - RATING) AS DIFF
FROM AVG_RATING_CTE)
, MAX_CTE AS(
SELECT
HOTEL, YEAR, RATING, DIFF, MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX
FROM DIFF_CTE
)
SELECT HOTEL, YEAR, RATING
FROM MAX_CTE
WHERE DIFF <> MAX
ORDER BY HOTEL, YEAR

vivekshirsat

@techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??

AjinkyaGhadge-wl

@TFQ Generalized solution that will work for many outliers

WITH cte AS (
SELECT hotel, year, rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9, 1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev
FROM hotel_ratings
),
cte1 AS (
SELECT *, ABS(rating - avg_rating) AS diff_rating, cast(avg_rating + std_dev as decimal(9, 1)) as positive, cast(avg_rating - std_dev as decimal(9, 1)) as negative from cte

), cte2
as
(
Select *, CASE WHEN rating <= positive and rating >=negative THEN 1 else 0 end as flag from cte1
)
select hotel, year, rating from cte2
where flag <> 0
order by hotel desc, year asc

navaneeth

This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause.
select avg(rating) over(partition by hotel)
from hotel_ratings
This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.

gphanisrinivasful

sir, please put a video on different star pattern forming with mysql .

cfzfigc

with cte as(
select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings
),
cte2 as
(select *, abs(rating-rating2) as did from cte),
cte3 as
(select *, rank() over(partition by hotel order by did desc) as a from cte2)
select hotel, year, rating from cte3
where a>1
order by hotel desc, year;

gouthamstar

SQL server -
WITH CTE AS
(SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings),
CTE_FINAL AS
(SELECT *, CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE)
SELECT hotel, year, rating FROM CTE_FINAL WHERE flag=0

DataSailor

This is my alternate solution:

select * from hotel_ratings where rating <
(select avg(rating) from hotel_ratings where hotel='Radisson Blu')
and hotel='Radisson Blu'
union
select * from hotel_ratings where rating >
(select avg(rating) from hotel_ratings where hotel='InterContinental')
and hotel='InterContinental' order by hotel desc

Ajaysaini

Oracle SQL | Solution that will work for many outliers.

WITH cte AS (
SELECT
hotel,
year,
rating,
AVG(rating) OVER (PARTITION BY hotel) AS average,
rating - AVG(rating) OVER (PARTITION BY hotel) AS difference
FROM
hotel_ratings
),
result_table AS (
SELECT
hotel,
year,
rating,
difference,
AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff,
CASE
WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel)
THEN 'remove'
ELSE 'keep'
END AS flag
FROM
cte
)
SELECT
hotel,
year,
rating
FROM
result_table
WHERE
flag = 'keep'
ORDER BY
hotel DESC,
year;

ratheeshg

with cte as(
select hotel, avg(rating) as rat
from hotel_ratings
group by hotel)
, cte2 as(select h.hotel, h.year, rating, rank() over (partition by hotel order by abs(rating -rat) desc) as r
from cte c
inner join hotel_ratings h
on c.hotel=h.hotel)
select * from cte2 where r>1;

dumdum

with cte as(
select
*, avg(rating) over(partition by hotel ) avg_rating, abs(rating - avg(rating) over(partition by hotel )) diff_rating
from hotel_ratings),
cte2 as (SELECT *, max(diff_rating) over(partition by hotel) max_diff from cte
)
select hotel, year, rating from cte2
where diff_rating<>max_diff
order by hotel desc, year ;

abhijitvernekar

Remove Outliers - SQL Interview Query 11 | SQL Problem Level 'MEDIUM'

Remove Outliers - SQL Interview Query 11 | SQL Problem Level 'MEDIUM'

Find Outliers in SQL | Data Analyst SQL Interview Question

Find Outliers in SQL

Detecting and Removing Outliers in SQL at Snowflake

SQL Analysis Project: Identifying Outliers and Department Statistics

Totally didn't remove outliers 😎 #physics #science #phd #phdlife #research #analisis

T-SQL Example: An Outliers Effect On Average

SQL Interview Question - Compensation Outliers

How to handle outliers in A/B Testing - Che Sharma - The Data Scientist Show #068

Extract ZIP Codes | String Contains Substring | Data Analyst SQL Interview Questions

How do you handle outliers in data analysis? | Data Analyst Interview Questions and Answers

Outliers Removal from the Data in regression

What are techniques to identify outliers and how to handle them? | Machine learning Interview

Outliers Detection and Treatment | Data Analytics Interview Questions and Answers | Beginner level

Tricky SQL Interview Problem with a Simple Solution | Data Analytics

Watch Brent Tune Queries - SQLSaturday Oslo

Finding Outliers with Python is Easy

Finding an outlier in a dataset using Python

Data Analyst Interview Questions And Answers | Data Analytics Interview Questions and Answers

Remove outliers using percentile and z-score method

Data Science Interview Question - 1 | Outlier

Null Values, Outliers, & EDA (Python) | Data Analyst sample interview question

11 years later ❤️ @shrads

Bing Lecture - Learn SQL from Real Jobs : Outlier by Using Calculated Column