Remove Outliers - SQL Interview Query 11 | SQL Problem Level 'MEDIUM'

preview_player
Показать описание
30DaySQLQueryChallenge is a series of 30 videos covering 30 SQL Interview Queries. This is the ELEVENTH video in this series. This video series aims to provide 30 SQL Queries that can be asked during SQL Interviews. I will explain how to solve and address such SQL queries during interviews in these videos.

Let's follow the below routine to make the best use of it:
1. Watch the YouTube video (first half) to understand the problem statement.

2. Go to my discord server (link below), download the dataset for each problem, and try solving it yourself.

3. Share your solution on Discord and discuss different solutions and issues on my Discord server.

4. Watch the second half of my YouTube video to find my solution to the problem.

5. Share it with your contacts and spread the knowledge.

DOWNLOAD the Dataset from below:

Timeline:
00:00 Intro
00:10 Understanding Problem Statement
03:02 Solution to the SQL Problem
06:40 Quick understanding of SQL Frame Clause
10:27 Continue with the solution to the SQL Problem

Thanks for participating in this challenge!

Good luck and Happy Learning!
Рекомендации по теме
Комментарии
Автор

Not a full proof solution as one partition might have more than one distinct outliers..Not sure how rank function will help there

arnabsarkar
Автор

Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance:

WITH cte1 AS (
SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg
FROM hotel_ratings),
cte2 AS (
SELECT *, ABS (rating - r_avg) AS diff
FROM cte1
)
SELECT hotel, year, rating
FROM cte2
WHERE diff <= 1
ORDER BY hotel DESC, year;

The RANK function is not necessary in this case because it is only suitable for small datasets where outliers can be easily identified visually. For simple visual identification, using ABS(rating - r_avg) AS diff <= 1 is enough.

When it comes to large datasets with more than 2 outliers, this solution is not suitable. I prefer using z-scores for a more universal approach:

WITH cte1 AS (
SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg,
STDDEV_SAMP(rating) OVER (PARTITION BY hotel) AS std_dev
FROM input_11
),
cte2 AS (
SELECT *, (rating - r_avg)/std_dev AS z_scores
FROM cte1
)
SELECT hotel, year, rating FROM cte2
WHERE ABS(z_scores) < 1.4
ORDER BY hotel DESC, year;

camomile
Автор

my approach :
with cte as( select *, avg(rating)over(partition by hotel )as avg
from hotel_ratings),
cte2 as (select *, abs(avg-rating) as abs from cte),

cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2)
select hotel, year, rating from cte3
where exclude<>abs
order by hotel desc

cqlhfmc
Автор

Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.

vensandy_Data_Analyst
Автор

Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example:
If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist.

Right solution:
Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers.
If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate
That being said thank you for your continuous efforts in helping us build strong SQL foundations.

Tusharchitrakar
Автор

with cte as (
SELECT *,
round(avg(rating) over(partition by hotel order by year
range between unbounded preceding and unbounded following ), 2) as avg_rating
FROM hotel),
cte1 as (select *,
abs(avg_rating-rating) as diif
from cte),
cte2 as (select *,
dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc
) as rn
from cte1)
select hotel, year, rating from cte2
where rn>1

parmoddhiman
Автор

with cte as(select *, rank() over(order by rating asc)as max_rnk, rank() over(order by rating desc)as min_rnk from hotel_ratings)
select hotel, year, rating from cte where max_rnk <>8 and max_rnk<>1 and min_rnk <>1 order by hotel
donot know whether it will be dynamic or not...

anirbanbiswas
Автор

with cte as
( select *, avg(rating)over(partition by hotel ) as avg_rating, round(STDDEV(rating)over(partition by hotel), 4) as std
from hotel_ratings),
cte1 as
(select *, abs((rating - avg_rating)/std) as out_detect
from cte)
select hotel, year, rating from cte1 where out_detect < 1.5
order by hotel desc, year

nileshpandey
Автор

with cte as
(select hotel, year, rating, average_rating, average_rating_difference,
over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference
from (select *,
over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating,
- AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference
hotel_ratings)x )
select hotel, year, rating
from cte
where average_rating_difference <> max_difference

vaibhavbatra
Автор

WITH AVG_RATING_CTE AS
(SELECT *,
BY HOTEL ORDER BY YEAR
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 1) AS AVG_RATING
FROM HOTEL_RATINGS)
, DIFF_CTE AS(SELECT
HOTEL, YEAR, RATING, AVG_RATING, ABS(AVG_RATING - RATING) AS DIFF
FROM AVG_RATING_CTE)
, MAX_CTE AS(
SELECT
HOTEL, YEAR, RATING, DIFF, MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX
FROM DIFF_CTE
)
SELECT HOTEL, YEAR, RATING
FROM MAX_CTE
WHERE DIFF <> MAX
ORDER BY HOTEL, YEAR

vivekshirsat
Автор

@techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??

AjinkyaGhadge-wl
Автор

@TFQ Generalized solution that will work for many outliers

WITH cte AS (
SELECT hotel, year, rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9, 1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev
FROM hotel_ratings
),
cte1 AS (
SELECT *, ABS(rating - avg_rating) AS diff_rating, cast(avg_rating + std_dev as decimal(9, 1)) as positive, cast(avg_rating - std_dev as decimal(9, 1)) as negative from cte

), cte2
as
(
Select *, CASE WHEN rating <= positive and rating >=negative THEN 1 else 0 end as flag from cte1
)
select hotel, year, rating from cte2
where flag <> 0
order by hotel desc, year asc

navaneeth
Автор

This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause.
select avg(rating) over(partition by hotel)
from hotel_ratings
This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.

gphanisrinivasful
Автор

sir, please put a video on different star pattern forming with mysql .

cfzfigc
Автор

with cte as(
select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings
),
cte2 as
(select *, abs(rating-rating2) as did from cte),
cte3 as
(select *, rank() over(partition by hotel order by did desc) as a from cte2)
select hotel, year, rating from cte3
where a>1
order by hotel desc, year;

gouthamstar
Автор

SQL server -
WITH CTE AS
(SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings),
CTE_FINAL AS
(SELECT *, CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE)
SELECT hotel, year, rating FROM CTE_FINAL WHERE flag=0

DataSailor
Автор

This is my alternate solution:

select * from hotel_ratings where rating <
(select avg(rating) from hotel_ratings where hotel='Radisson Blu')
and hotel='Radisson Blu'
union
select * from hotel_ratings where rating >
(select avg(rating) from hotel_ratings where hotel='InterContinental')
and hotel='InterContinental' order by hotel desc

Ajaysaini
Автор

Oracle SQL | Solution that will work for many outliers.

WITH cte AS (
SELECT
hotel,
year,
rating,
AVG(rating) OVER (PARTITION BY hotel) AS average,
rating - AVG(rating) OVER (PARTITION BY hotel) AS difference
FROM
hotel_ratings
),
result_table AS (
SELECT
hotel,
year,
rating,
difference,
AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff,
CASE
WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel)
THEN 'remove'
ELSE 'keep'
END AS flag
FROM
cte
)
SELECT
hotel,
year,
rating
FROM
result_table
WHERE
flag = 'keep'
ORDER BY
hotel DESC,
year;

ratheeshg
Автор

with cte as(
select hotel, avg(rating) as rat
from hotel_ratings
group by hotel)
, cte2 as(select h.hotel, h.year, rating, rank() over (partition by hotel order by abs(rating -rat) desc) as r
from cte c
inner join hotel_ratings h
on c.hotel=h.hotel)
select * from cte2 where r>1;

dumdum
Автор

with cte as(
select
*, avg(rating) over(partition by hotel ) avg_rating, abs(rating - avg(rating) over(partition by hotel )) diff_rating
from hotel_ratings),
cte2 as (SELECT *, max(diff_rating) over(partition by hotel) max_diff from cte
)
select hotel, year, rating from cte2
where diff_rating<>max_diff
order by hotel desc, year ;

abhijitvernekar