109. Databricks | PySpark | Coding Interview Question: PySpark and Spark SQL

Azure Databricks Learning: Coding Interview Exercise: PySpark and Spark SQL
=================================================================================

Coding exercises are very common in big data interviews, so it is important to develop coding skills before appearing for Spark/Databricks interviews.

In this video, I have explained a coding scenario: finding the start and end dates of consecutive data buckets. Watch the video to get a better understanding.
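
For reference, here is a minimal PySpark sketch of one way to solve this kind of bucketing scenario (the sample data, column names, and the row_number difference trick are assumptions for illustration, not necessarily the exact code shown in the video):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one row per day with a Won/Lost status
data = [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost"),
        ("2020-06-04", "Lost"), ("2020-06-05", "Won")]
df = (spark.createDataFrame(data, ["event_date", "event_status"])
           .withColumn("event_date", F.to_date("event_date")))

# Gaps-and-islands: the difference of the two row numbers is constant within
# each consecutive run of the same status, so it serves as a bucket id.
w_all = Window.orderBy("event_date")
w_status = Window.partitionBy("event_status").orderBy("event_date")

buckets = (df
    .withColumn("bucket", F.row_number().over(w_all) - F.row_number().over(w_status))
    .groupBy("event_status", "bucket")
    .agg(F.min("event_date").alias("event_start_date"),
         F.max("event_date").alias("event_end_date"))
    .orderBy("event_start_date")
    .drop("bucket"))

buckets.show()
```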

#CodingInterviewQuestion, #ApacheSparkInterview, #SparkCodingExercise, #DatabricksCodingInterview,#SparkWindowFunctions,#SparkDevelopment,#DatabricksDevelopment, #DatabricksPyspark,#PysparkTips, #DatabricksTutorial, #AzureDatabricks, #Databricks, #Databricksforbeginners,#datascientists, #datasciencecommunity,#bigdataengineers,#machinelearningengineers
Comments

First time I have seen scenario-based and interview-based solutions in YouTube videos. Thanks for your commitment and for sharing the knowledge.

prasadtelu

Please continue this series; it will be very helpful to crack the interview. Thanks for starting this series.

adiityagupta-wutz

Thanks, sir. Please create a playlist of coding questions that are frequently asked.

prabhatgupta

Can I get the code copy-pasted in the description, or maybe a link to the notebook?

harithad

One more suggestion: please put the dataset in the description.

prabhatgupta

Here is my SQL query for the same:

declare @Event_Table table (Event_date date, Event_status varchar(8));

insert into @Event_Table
select getdate() + value,
       case when value < 3 then 'Won'
            when value > 3 and value < 7 then 'Lost'
            else 'Won' end
from generate_series(1, 10, 1);

with cte as
(
    select *,
           row_number() over (order by Event_date)
             - row_number() over (order by Event_status, Event_date) as GroupId
    from @Event_Table
)
select GroupId,
       min(Event_status) as Event_status,
       min(Event_date)   as Start_date,
       max(Event_date)   as End_Date,
       count(1)          as Consecutive_Events
from cte
group by GroupId;

landchennai

Thanks for this video. But I am curious why you didn't directly use min/max with group by, which would have fetched the same result:
```
from pyspark.sql import functions as F

result = df.withColumn("event_date", F.to_date("event_date")) \
    .groupBy("event_status") \
    .agg(
        F.min("event_date").alias("event_start_date"),
        F.max("event_date").alias("event_end_date")
    ) \
    .orderBy("event_start_date")

result.show()
```

jinsonfernandez

Shouldn't the change-event flag be 1 in the first row at 08:10, since there is no previous value that matches the first row's event status? Why is it coming out as 0?
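
For context, a minimal sketch of why a lag-based change flag comes out as 0 on the first row (the column names and expressions are assumptions for illustration, not the video's exact code): lag() returns NULL for the first row, a comparison against NULL is not true, so the otherwise(0) branch is taken; explicitly treating the NULL as a change yields 1 instead.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost")],
    ["event_date", "event_status"],
)

w = Window.orderBy("event_date")
prev = F.lag("event_status").over(w)

# lag() yields NULL on the first row; NULL != 'Won' is not true,
# so this version returns 0 for the first row.
df = df.withColumn("flag_first_row_0",
                   F.when(F.col("event_status") != prev, 1).otherwise(0))

# Treating the NULL from lag() as a change makes the first row 1.
df = df.withColumn("flag_first_row_1",
                   F.when(prev.isNull() | (F.col("event_status") != prev), 1).otherwise(0))

df.show()
```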

DataEngineering-niot

Hi sir, could you please share the notebook and dataset in the description? It will be helpful for our practice. Thanks in advance.

arrooow

Hi sir, could you please share the notebook and the GitHub repository link to access the code?

namratachavan

This solution will work only when the dates are in order with respect to the events. I tried jumbling them, and it didn't work.

saurabh

How do you explain a data engineering project in an interview?

Tushar

This solution will not work if you have data like this; maybe some tweak will be needed:

data = [
    ("2020-06-01", "Won"),
    ("2020-06-02", "Won"),
    ("2020-06-03", "Won"),
    ("2020-06-03", "Lost"),
    ("2020-06-04", "Lost"),
    ("2020-06-05", "Lost"),
    ("2020-06-06", "Lost"),
    ("2020-06-07", "Won"),
]

roshniagrawal

I did it something like this, by using a default date, a running row number, and datediff:

from pyspark.sql.functions import to_date, row_number, asc, date_add, lit, datediff, min, max, col
from pyspark.sql.window import Window

eventDF.withColumn("event_date", to_date("event_date", "dd-MM-yyyy")) \
    .withColumn("rank", row_number().over(Window.partitionBy("event_status").orderBy(asc("event_date")))) \
    .withColumn("startDate", date_add(to_date(lit("1900-01-01")), col("rank"))) \
    .withColumn("datediff", datediff("event_date", "startDate")) \
    .groupBy("datediff", "event_status") \
    .agg(min("event_date").alias("start_date"), max("event_date").alias("end_date")) \
    .sort("start_date").show()

starmscloud