pyspark scenario based interview questions and answers | #pyspark | #interview | #data

Create DataFrame Code :
====================
data = [ ('c1', 'New York', 'Lima'),
('c1', 'London', 'New York'),
('c1', 'Lima', 'Sao Paulo'),
('c1', 'Sao Paulo', 'New Delhi'),
('c2', 'Mumbai', 'Hyderabad'),
('c2', 'Surat', 'Pune'),
('c2', 'Hyderabad', 'Surat'),
('c3', 'Kochi', 'Kurnool'),
('c3', 'Lucknow', 'Agra'),
('c3', 'Agra', 'Jaipur'),
('c3', 'Jaipur', 'Kochi')]

schema = "customer string , start_location string , end_location string"
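
For completeness, here is how the DataFrame itself can be built from the data and schema above (a minimal sketch, assuming an active SparkSession named spark). The task the solutions below address is to find each customer's overall journey: the start location that never appears as an end location, and the end location that never appears as a start location.

df = spark.createDataFrame(data, schema)
df.show()

# expected final answer for the sample data:
# c1: London -> New Delhi, c2: Mumbai -> Pune, c3: Lucknow -> Kurnool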

Comments :
==========

Thanks for the informative stuff.
Instead of specifying all the conditions in the join, we can specify just one condition (the AND/OR conditions are not required).
It works and fetches the expected output.
Cheers!!

Tech.S
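
A hypothetical sketch of such a single-condition anti-join (this works here because location names in the sample data are unique across customers, so matching on the location columns alone is enough):

starts = df.select('customer', 'start_location')
ends = df.select('customer', 'end_location')
# journey start: a start_location that never shows up as an end_location
journey_start = starts.join(ends, starts.start_location == ends.end_location, 'leftanti')
# journey end: an end_location that never shows up as a start_location
journey_end = ends.join(starts, ends.end_location == starts.start_location, 'leftanti')
journey_start.join(journey_end, ['customer'], 'inner').show()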

At the end, after finding the unique records, can we use collect_list with a groupBy on customer and then use the list indexes as the start and end locations in withColumn?

siddharthchoudhary
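
That could work, with one caution: collect_list gives no ordering guarantee, so the indexes are only safe if the list is made deterministic first. A hypothetical sketch, assuming a unique_df that holds the once-occurring rows tagged with a type column of 'start'/'end' (such as the df3 built in a later comment below):

from pyspark.sql.functions import collect_list, sort_array, struct, col

# sort_array orders the structs by type ('end' < 'start'), so the indexes are deterministic
journeys = (unique_df.groupBy('customer')
    .agg(sort_array(collect_list(struct('type', 'location'))).alias('locs'))
    .select('customer',
            col('locs').getItem(1)['location'].alias('start_location'),
            col('locs').getItem(0)['location'].alias('end_location')))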

here is my solution

from pyspark.sql.functions import concat, col

# creating two dataframes for start and end locations
df1 = df.select('customer', 'start_location').alias('a')
df2 = df.select('customer', 'end_location').alias('b')
# start locations that never appear as an end location (the journey start)
df3 = df1.join(df2, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
# end locations that never appear as a start location (the journey end)
df4 = df2.join(df1, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
# final output
df5 = df3.join(df4, ['customer'], 'inner')

torrentdownloada
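
One caveat with the concat-based join condition above: concatenating customer and location without a separator can in principle produce false matches between different (customer, location) pairs; joining on the two columns separately, or concatenating with a delimiter, avoids that.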

with t1 AS (select customer, start_loc from travel_data where start_loc not in (select end_loc from travel_data))
, t2 AS (select customer, end_loc from travel_data where end_loc not in (select start_loc from travel_data))
select t1.customer, t1.start_loc, t2.end_loc from t2 join t1 on t2.customer=t1.customer

VikasChavan-vc
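
To run this SQL against the DataFrame above, the data must first be registered as a view; a minimal sketch (the columns are renamed to match the query, and the uncorrelated NOT IN subqueries rely on location names being unique across customers, as they are in the sample data):

df.withColumnRenamed('start_location', 'start_loc') \
  .withColumnRenamed('end_location', 'end_loc') \
  .createOrReplaceTempView('travel_data')

spark.sql("""
with t1 as (select customer, start_loc from travel_data where start_loc not in (select end_loc from travel_data)),
     t2 as (select customer, end_loc from travel_data where end_loc not in (select start_loc from travel_data))
select t1.customer, t1.start_loc, t2.end_loc from t2 join t1 on t2.customer = t1.customer
""").show()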

from pyspark.sql.functions import count, first, lit, col

# tag each location as a start or an end, then stack both sets; a location that
# occurs only once per customer is the journey start or end, and its surviving
# "type" tag tells which (alphabetical order cannot, e.g. c3: Kurnool < Lucknow)
df1 = df.select("customer", col("start_location").alias("location")).withColumn("type", lit("start"))
df2 = df.select("customer", col("end_location").alias("location")).withColumn("type", lit("end"))
df3 = df1.union(df2).groupBy("customer", "location").agg(count("location").alias("count"), first("type").alias("type")).filter("count == 1")
df3.filter("type = 'start'").alias("a").join(df3.filter("type = 'end'").alias("b"), ["customer"], "inner").selectExpr("customer", "a.location as start_location", "b.location as end_location").display()

prabhatgupta

No udf, no join, no subquery:

from pyspark.sql.functions import collect_set, array_except

# the journey start is the only start location never used as an end, and vice versa
(df.groupBy("customer")
.agg(collect_set("start_location").alias("start_list"), collect_set("end_location").alias("end_list"))
.withColumn("start_location", array_except("start_list", "end_list").getItem(0))
.withColumn("end_location", array_except("end_list", "start_list").getItem(0))
.drop("start_list", "end_list"))

tradingwithk

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import StringType

# return the element of x that never occurs in y
def loc(x, y):
    a = [i for i in x if i not in y]
    return a[0]

loc_udf = udf(loc, StringType())
df1 = df.groupBy('customer').agg(collect_list('start_location').alias('start_list'),
                                 collect_list('end_location').alias('end_list'))
display(df1)
df2 = df1.withColumn('start', loc_udf(df1.start_list, df1.end_list)).withColumn('end', loc_udf(df1.end_list, df1.start_list)).drop(*('start_list', 'end_list'))
display(df2)

KAVURURAMANUJAM
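
This UDF version produces the same result as the array_except approach above; the main difference is that the built-in functions run inside the JVM, while a Python UDF adds serialization overhead between the JVM and the Python workers.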