pyspark scenario based interview questions and answers | #pyspark | #interview | #data

Create DataFrame Code :
====================
data = [ ('c1', 'New York', 'Lima'),
('c1', 'London', 'New York'),
('c1', 'Lima', 'Sao Paulo'),
('c1', 'Sao Paulo', 'New Delhi'),
('c2', 'Mumbai', 'Hyderabad'),
('c2', 'Surat', 'Pune'),
('c2', 'Hyderabad', 'Surat'),
('c3', 'Kochi', 'Kurnool'),
('c3', 'Lucknow', 'Agra'),
('c3', 'Agra', 'Jaipur'),
('c3', 'Jaipur', 'Kochi')]

schema = "customer string , start_location string , end_location string"
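
For completeness, here is how the DataFrame itself can be built from the data and schema above (a minimal sketch, assuming an active SparkSession named spark). The task the solutions below address is to find each customer's overall journey: the start location that never appears as an end location, and the end location that never appears as a start location.

df = spark.createDataFrame(data, schema)
df.show()

# expected final answer for the sample data:
# c1: London -> New Delhi, c2: Mumbai -> Pune, c3: Lucknow -> Kurnool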

Comments :
==========

Thanks for the informative stuff.
Instead of specifying all the conditions in the join, we can specify just one condition (the AND/OR conditions are not required).
It works and fetches the expected output.
Cheers!!

Tech.S
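
A hypothetical sketch of such a single-condition anti-join (this works here because location names in the sample data are unique across customers, so matching on the location columns alone is enough):

starts = df.select('customer', 'start_location')
ends = df.select('customer', 'end_location')
# journey start: a start_location that never shows up as an end_location
journey_start = starts.join(ends, starts.start_location == ends.end_location, 'leftanti')
# journey end: an end_location that never shows up as a start_location
journey_end = ends.join(starts, ends.end_location == starts.start_location, 'leftanti')
journey_start.join(journey_end, ['customer'], 'inner').show()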

At the end, after finding the unique records, can we use collect_list with a groupBy on customer and then use the list indexes as the start and end locations in withColumn?

siddharthchoudhary
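
That could work, with one caution: collect_list gives no ordering guarantee, so the indexes are only safe if the list is made deterministic first. A hypothetical sketch, assuming a unique_df that holds the once-occurring rows tagged with a type column of 'start'/'end' (such as the df3 built in a later comment below):

from pyspark.sql.functions import collect_list, sort_array, struct, col

# sort_array orders the structs by type ('end' < 'start'), so the indexes are deterministic
journeys = (unique_df.groupBy('customer')
    .agg(sort_array(collect_list(struct('type', 'location'))).alias('locs'))
    .select('customer',
            col('locs').getItem(1)['location'].alias('start_location'),
            col('locs').getItem(0)['location'].alias('end_location')))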

here is my solution

from pyspark.sql.functions import concat, col

# creating two dataframes for start and end locations
df1 = df.select('customer', 'start_location').alias('a')
df2 = df.select('customer', 'end_location').alias('b')
# start locations that never appear as an end location (the journey start)
df3 = df1.join(df2, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
# end locations that never appear as a start location (the journey end)
df4 = df2.join(df1, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
# final output
df5 = df3.join(df4, ['customer'], 'inner')

torrentdownloada
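
One caveat with the concat-based join condition above: concatenating customer and location without a separator can in principle produce false matches between different (customer, location) pairs; joining on the two columns separately, or concatenating with a delimiter, avoids that.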

with t1 AS (select customer, start_loc from travel_data where start_loc not in (select end_loc from travel_data))
, t2 AS (select customer, end_loc from travel_data where end_loc not in (select start_loc from travel_data))
select t1.customer, t1.start_loc, t2.end_loc from t2 join t1 on t2.customer=t1.customer

VikasChavan-vc
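
To run this SQL against the DataFrame above, the data must first be registered as a view; a minimal sketch (the columns are renamed to match the query, and the uncorrelated NOT IN subqueries rely on location names being unique across customers, as they are in the sample data):

df.withColumnRenamed('start_location', 'start_loc') \
  .withColumnRenamed('end_location', 'end_loc') \
  .createOrReplaceTempView('travel_data')

spark.sql("""
with t1 as (select customer, start_loc from travel_data where start_loc not in (select end_loc from travel_data)),
     t2 as (select customer, end_loc from travel_data where end_loc not in (select start_loc from travel_data))
select t1.customer, t1.start_loc, t2.end_loc from t2 join t1 on t2.customer = t1.customer
""").show()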

from pyspark.sql.functions import count, first, lit, col

# tag each location as a start or an end, then stack both sets; a location that
# occurs only once per customer is the journey start or end, and its surviving
# "type" tag tells which (alphabetical order cannot, e.g. c3: Kurnool < Lucknow)
df1 = df.select("customer", col("start_location").alias("location")).withColumn("type", lit("start"))
df2 = df.select("customer", col("end_location").alias("location")).withColumn("type", lit("end"))
df3 = df1.union(df2).groupBy("customer", "location").agg(count("location").alias("count"), first("type").alias("type")).filter("count == 1")
df3.filter("type = 'start'").alias("a").join(df3.filter("type = 'end'").alias("b"), ["customer"], "inner").selectExpr("customer", "a.location as start_location", "b.location as end_location").display()

prabhatgupta

No udf, no join, no subquery:

from pyspark.sql.functions import collect_set, array_except

# the journey start is the only start location never used as an end, and vice versa
(df.groupBy("customer")
.agg(collect_set("start_location").alias("start_list"), collect_set("end_location").alias("end_list"))
.withColumn("start_location", array_except("start_list", "end_list").getItem(0))
.withColumn("end_location", array_except("end_list", "start_list").getItem(0))
.drop("start_list", "end_list"))

tradingwithk

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import StringType

# return the element of x that never occurs in y
def loc(x, y):
    a = [i for i in x if i not in y]
    return a[0]

loc_udf = udf(loc, StringType())
df1 = df.groupBy('customer').agg(collect_list('start_location').alias('start_list'),
                                 collect_list('end_location').alias('end_list'))
display(df1)
df2 = df1.withColumn('start', loc_udf(df1.start_list, df1.end_list)).withColumn('end', loc_udf(df1.end_list, df1.start_list)).drop(*('start_list', 'end_list'))
display(df2)

KAVURURAMANUJAM
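
This UDF version produces the same result as the array_except approach above; the main difference is that the built-in functions run inside the JVM, while a Python UDF adds serialization overhead between the JVM and the Python workers.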