Most Important PySpark Question from a Deutsche Bank Interview | PySpark Join |

customer_data = [(1, 5), (2, 6), (3, 5), (3, 6), (1, 6)]
customer_schema = "customer_id int, product_key int"

product_data = [(5,), (6,)]
product_schema = "product_key int"
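
The task implied by the comments below is the classic one: find the customers who bought every product in the product table. A minimal runnable sketch of the setup and the countDistinct approach, assuming a standard SparkSession (the solution shape is inferred from the discussion, not copied verbatim from the video):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.getOrCreate()

customer_df = spark.createDataFrame(customer_data, customer_schema)
product_df = spark.createDataFrame(product_data, product_schema)

# A customer qualifies when their distinct product count matches the
# total number of products in the product table
total_products = product_df.count()
result_df = (customer_df
    .groupBy("customer_id")
    .agg(countDistinct("product_key").alias("product_count"))
    .filter(col("product_count") == total_products)
    .select("customer_id"))
result_df.show()  # expected: customer_id 1 and 3

As the thread below points out, this only holds when every product_key in customer_df also exists in product_df; otherwise a join against product_df is needed first.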

Databricks-PySpark RealTime Scenarios Interview Question Series

Project Link:

#pysparkinterview #pysparkforbeginners
Comments

Was searching for the true purpose of countDistinct in a similar case... thanks Sagar

_Sujoy_Das

What if we have 5, 6, 7, 8 in the product table and the customer table is the same as mentioned? I don't think countDistinct will work in that case.

df = customer_df.withColumn('flag', when(col('product_key').isin(product_df.select('product_key').rdd.flatMap(lambda x: x).collect()), 1).otherwise(0)).distinct()

This should work in most cases. Thanks for bringing such questions to us, Sagar.

chetanphalak
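
For reference, a self-contained version of the approach above, with the imports it needs (a sketch; collect() pulls the product keys to the driver, which is only reasonable for a small product table):

from pyspark.sql.functions import col, when

# Collect the valid product keys to the driver
valid_keys = product_df.select('product_key').rdd.flatMap(lambda x: x).collect()

# Flag each purchase row: 1 if its product exists in the product table, else 0
flagged_df = (customer_df
    .withColumn('flag', when(col('product_key').isin(valid_keys), 1).otherwise(0))
    .distinct())

Note that this flags individual purchase rows; an aggregation over the flag per customer_id would still be needed to answer "bought all products".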

I remember there is a concept of WHERE EXISTS with a correlated subquery, which works like a "for all" expression. I will try solving it that way as well.

ashishagarwal
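
Spark SQL does support correlated EXISTS subqueries in WHERE clauses, but the same "for all" requirement (relational division) can also be expressed without any correlated subquery as a double anti-join in the DataFrame API; a sketch of that equivalent (the names are mine, not from the comment):

# Every (customer, product) pair that would exist if the customer bought all products
customers = customer_df.select("customer_id").distinct()
expected_pairs = customers.crossJoin(product_df)

# Pairs the customer is missing: expected pairs with no matching purchase
missing = expected_pairs.join(customer_df, ["customer_id", "product_key"], "left_anti")

# Customers with no missing product bought them all
bought_all = customers.join(missing, "customer_id", "left_anti")
bought_all.show()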

Hey Sagar, can you give a discount to buy your course?

devamurugansankaran

Bro, where do you collect such questions from? Please share the resources to practice more such questions.

surajpatil

Hey @sagar,
Thank you for posting such questions.
Can you please recheck whether this works for all related scenarios, or only for the particular DataFrames you are using in this example?

I tried it and found that a simple inner join can get us the required result without the countDistinct function, and also that the solution shared above does not work for all scenarios. For example, I tried tweaking your DataFrames with different values like 7, 8, 9, 10, 11 as product keys in customer_df while keeping product_key in product_df as 5, 6, and the logic fails there.

I might be missing something. Please do correct me in case I am ignoring anything important.

My solution, which works for all possible scenarios:
Final_df = customer_df.join(product_df, on='product_key',

saurabh
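
The comment's last line is cut off, so here is one plausible completion of the join-first idea, written as an assumption rather than the commenter's exact code: join on product_key so only valid products are counted, then compare the distinct count per customer.

from pyspark.sql.functions import col, countDistinct

final_df = (customer_df
    .join(product_df, on='product_key', how='inner')   # keeps only valid product keys
    .groupBy('customer_id')
    .agg(countDistinct('product_key').alias('cnt'))
    .filter(col('cnt') == product_df.count())
    .select('customer_id'))
final_df.show()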

Hi Sir, my solution:

df =
df.filter(col('count') >= product_df.count()).show()

rawat
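
The first line of this solution appears truncated; given the filter that follows, a plausible reconstruction (my assumption: a groupBy with countDistinct aliased as 'count', not necessarily the commenter's original code) is:

from pyspark.sql.functions import col, countDistinct

# Reconstructed aggregation (assumption): distinct products bought per customer
df = customer_df.groupBy('customer_id').agg(countDistinct('product_key').alias('count'))
df.filter(col('count') >= product_df.count()).show()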