tiger analytics interview questions and answers in pyspark | #interview

preview_player
Показать описание
tiger analytics interview questions and answers in pyspark | #interview | #dataengineers | #datascience | #dataanalytics

"tiger analytics interview questions and answers in pyspark "

Create DataFrame Code :
======================

flights_data = [(1,'Flight1' , 'Delhi' , 'Hyderabad'),
(1,'Flight2' , 'Hyderabad' , 'Kochi'),
(1,'Flight3' , 'Kochi' , 'Mangalore'),
(2,'Flight1' , 'Mumbai' , 'Ayodhya'),
(2,'Flight2' , 'Ayodhya' , 'Gorakhpur')
]

_schema = "cust_id int, flight_id string , origin string , destination string"

top interview question and answer in pyspark :

#fang #pyspark #sql #interview #dataengineers #dataanalytics #datascience #StrataScratch #Facebook #data #dataengineeringinterview #codechallenge #datascientist #pyspark #CodingInterview
#dsafordataguy #dewithdhairy #DEwithDhairy #dhiarjgupta #leetcode #topinterviewquestion
Рекомендации по теме
Комментарии
Автор

SQL Version : Ankit Bansal ( Have done little Twist )

DEwithDhairy
Автор

ans=df_flight.groupBy("cust_id").agg(first("origin"), last("destination"))
display(ans)
directly we can use this

siddharthchoudhary
Автор

Well this question is good one. i solved it before using recursive cte in SQL

jhonsen
Автор

Thank you so much for sharing the knowledge, like and subscribed

mahendranath
Автор

ms sql server query
with cte as(
select *,
ROW_NUMBER() over(partition by id order by fly) as r
from flight
)
select distinct id,
first_value(origin) over(partition by id order by r ) as start,
LAST_VALUE(dest) over(partition by id order by r
range between unbounded preceding and unbounded following) as endp
from cte

yptjces
Автор

Great i am just curious to know how much time usually they give to solve it. More you pracitce faster you will solve but depends how good you are. Secondly what is the use of aggregate and max in the final result if i am grouping by cust_id and comparing when start == rn = 1 as origin and end == rn == 3 as destination in both origin and dest col.

jhonsen
Автор

Below is my query, which works for any number of flights:

with cte1 as (
select *, row_number() over(partition by cid order by fid asc) as rn from flights),

cte2 as(
select cid, min(rn) as mi, max(rn) as mx from cte1 GROUP by cid)

select ori.cid, ori.origin, des.destination from
(
select c1.* from cte1 c1
inner join cte2 c2 on c1.cid=c2.cid and c1.rn=c2.mi) ori
join (
select c1.* from cte1 c1
inner join cte2 c2 on c1.cid=c2.cid and c1.rn=c2.mx) des
on ori.cid=des.cid;

saheedahmed
Автор

I'm little new to these type of questions, I have used another method to solve the same, can you tell me why this approach is not suited ?

Select a.cust_id, a.origin, b.destination from (Select o.cust_id, o.origin from travelling_details o
where origin not In ( Select t.destination from travelling_details t
where t.cust_id = o.cust_id)) a join
(Select o.cust_id, o.destination from travelling_details o
where o.destination not In ( Select t.origin from travelling_details t
where t.cust_id = o.cust_id)) b
on a.cust_id = b.cust_id;

hariprasad
Автор

one doubt: how we ensure flight id describe the order of connectivity

ashish.barmer
Автор

Your solution will not work if flight10, flight11...flight15 will be there

AbhishekKumar_
Автор

WITH RankedFlights AS (
SELECT
customer_id,
origin,
destination,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY flight) AS row_num,
MAX(ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY flight)) AS max_row_num
FROM
flights_data
)
SELECT
customer_id,
MAX(CASE WHEN row_num = 1 THEN origin END) AS origin,
MAX(CASE WHEN row_num = max_row_num THEN destination END) AS destination
FROM
RankedFlights
GROUP BY
customer_id;

Maheshwaripremierleague