109. Databricks | PySpark | Coding Interview Question: PySpark and Spark SQL

Azure Databricks Learning: Coding Interview Exercise: PySpark and Spark SQL
=================================================================================

Coding exercises are very common in big data interviews, so it is important to develop coding skills before appearing for Spark/Databricks interviews.

In this video, I have explained a coding scenario: finding the start and end dates of consecutive data buckets. Watch the video to get a better understanding.
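
For reference, here is a minimal PySpark sketch of one way to solve this kind of bucketing scenario (the sample data, column names, and the row_number difference trick are assumptions for illustration, not necessarily the exact code shown in the video):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one row per day with a Won/Lost status
data = [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost"),
        ("2020-06-04", "Lost"), ("2020-06-05", "Won")]
df = (spark.createDataFrame(data, ["event_date", "event_status"])
           .withColumn("event_date", F.to_date("event_date")))

# Gaps-and-islands: the difference of the two row numbers is constant within
# each consecutive run of the same status, so it serves as a bucket id.
w_all = Window.orderBy("event_date")
w_status = Window.partitionBy("event_status").orderBy("event_date")

buckets = (df
    .withColumn("bucket", F.row_number().over(w_all) - F.row_number().over(w_status))
    .groupBy("event_status", "bucket")
    .agg(F.min("event_date").alias("event_start_date"),
         F.max("event_date").alias("event_end_date"))
    .orderBy("event_start_date")
    .drop("bucket"))

buckets.show()
```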

#CodingInterviewQuestion, #ApacheSparkInterview, #SparkCodingExercise, #DatabricksCodingInterview,#SparkWindowFunctions,#SparkDevelopment,#DatabricksDevelopment, #DatabricksPyspark,#PysparkTips, #DatabricksTutorial, #AzureDatabricks, #Databricks, #Databricksforbeginners,#datascientists, #datasciencecommunity,#bigdataengineers,#machinelearningengineers
Comments

First time I have seen scenario-based and interview-based solutions in YouTube videos. Thanks for your commitment and for sharing the knowledge.

prasadtelu

Please continue this series; it will be very helpful to crack the interview. Thanks for starting this series.

adiityagupta-wutz

Thanks, sir. Please create a playlist of coding questions that are frequently asked.

prabhatgupta

Can I get the code copy-pasted in the description, or maybe a link to the notebook?

harithad

One more suggestion: please put the dataset in the description.

prabhatgupta

Here is my SQL query for the same:

declare @Event_Table table (Event_date date, Event_status varchar(8));

insert into @Event_Table
select getdate() + value,
       case when value < 3 then 'Won'
            when value > 3 and value < 7 then 'Lost'
            else 'Won' end
from generate_series(1, 10, 1);

with cte as
(
    select *,
           row_number() over (order by Event_date)
             - row_number() over (order by Event_status, Event_date) as GroupId
    from @Event_Table
)
select GroupId,
       min(Event_status) as Event_status,
       min(Event_date)   as Start_date,
       max(Event_date)   as End_Date,
       count(1)          as Consecutive_Events
from cte
group by GroupId;

landchennai

Thanks for this video. But I am curious why you didn't directly use min/max with group by, which would have fetched the same result:
```
from pyspark.sql import functions as F

result = df.withColumn("event_date", F.to_date("event_date")) \
    .groupBy("event_status") \
    .agg(
        F.min("event_date").alias("event_start_date"),
        F.max("event_date").alias("event_end_date")
    ) \
    .orderBy("event_start_date")

result.show()
```

jinsonfernandez

Shouldn't the change-event flag be 1 in the first row at 08:10, since there is no previous value that matches the first row's event status? Why is it coming out as 0?
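
For context, a minimal sketch of why a lag-based change flag comes out as 0 on the first row (the column names and expressions are assumptions for illustration, not the video's exact code): lag() returns NULL for the first row, a comparison against NULL is not true, so the otherwise(0) branch is taken; explicitly treating the NULL as a change yields 1 instead.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost")],
    ["event_date", "event_status"],
)

w = Window.orderBy("event_date")
prev = F.lag("event_status").over(w)

# lag() yields NULL on the first row; NULL != 'Won' is not true,
# so this version returns 0 for the first row.
df = df.withColumn("flag_first_row_0",
                   F.when(F.col("event_status") != prev, 1).otherwise(0))

# Treating the NULL from lag() as a change makes the first row 1.
df = df.withColumn("flag_first_row_1",
                   F.when(prev.isNull() | (F.col("event_status") != prev), 1).otherwise(0))

df.show()
```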

DataEngineering-niot

Hi sir, could you please share the notebook and dataset in the description? It will be helpful for our practice. Thanks in advance.

arrooow

Hi sir, could you please share the notebook and the GitHub repository link to access the code?

namratachavan

This solution will work only when the dates are in order with respect to the events. I tried jumbling them, and it didn't work.

saurabh

How do you explain a data engineering project in an interview?

Tushar

This solution will not work if you have data like this; maybe some tweak will be needed:

data = [
    ("2020-06-01", "Won"),
    ("2020-06-02", "Won"),
    ("2020-06-03", "Won"),
    ("2020-06-03", "Lost"),
    ("2020-06-04", "Lost"),
    ("2020-06-05", "Lost"),
    ("2020-06-06", "Lost"),
    ("2020-06-07", "Won"),
]

roshniagrawal

I did it something like this, by using a default date, a running row number, and datediff:

from pyspark.sql.functions import to_date, row_number, asc, date_add, lit, datediff, min, max, col
from pyspark.sql.window import Window

eventDF.withColumn("event_date", to_date("event_date", "dd-MM-yyyy")) \
    .withColumn("rank", row_number().over(Window.partitionBy("event_status").orderBy(asc("event_date")))) \
    .withColumn("startDate", date_add(to_date(lit("1900-01-01")), col("rank"))) \
    .withColumn("datediff", datediff("event_date", "startDate")) \
    .groupBy("datediff", "event_status") \
    .agg(min("event_date").alias("start_date"), max("event_date").alias("end_date")) \
    .sort("start_date").show()

starmscloud