Data Engineer Mock Interview | SQL | PySpark | Project & Scenario based Interview Questions


I have trained over 20,000 professionals in the field of Data Engineering in the last 5 years.

๐–๐š๐ง๐ญ ๐ญ๐จ ๐Œ๐š๐ฌ๐ญ๐ž๐ซ ๐’๐๐‹? ๐‹๐ž๐š๐ซ๐ง ๐’๐๐‹ ๐ญ๐ก๐ž ๐ซ๐ข๐ ๐ก๐ญ ๐ฐ๐š๐ฒ ๐ญ๐ก๐ซ๐จ๐ฎ๐ ๐ก ๐ญ๐ก๐ž ๐ฆ๐จ๐ฌ๐ญ ๐ฌ๐จ๐ฎ๐ ๐ก๐ญ ๐š๐Ÿ๐ญ๐ž๐ซ ๐œ๐จ๐ฎ๐ซ๐ฌ๐ž - ๐’๐๐‹ ๐‚๐ก๐š๐ฆ๐ฉ๐ข๐จ๐ง๐ฌ ๐๐ซ๐จ๐ ๐ซ๐š๐ฆ!

"๐€ 8 ๐ฐ๐ž๐ž๐ค ๐๐ซ๐จ๐ ๐ซ๐š๐ฆ ๐๐ž๐ฌ๐ข๐ ๐ง๐ž๐ ๐ญ๐จ ๐ก๐ž๐ฅ๐ฉ ๐ฒ๐จ๐ฎ ๐œ๐ซ๐š๐œ๐ค ๐ญ๐ก๐ž ๐ข๐ง๐ญ๐ž๐ซ๐ฏ๐ข๐ž๐ฐ๐ฌ ๐จ๐Ÿ ๐ญ๐จ๐ฉ ๐ฉ๐ซ๐จ๐๐ฎ๐œ๐ญ ๐›๐š๐ฌ๐ž๐ ๐œ๐จ๐ฆ๐ฉ๐š๐ง๐ข๐ž๐ฌ ๐›๐ฒ ๐๐ž๐ฏ๐ž๐ฅ๐จ๐ฉ๐ข๐ง๐  ๐š ๐ญ๐ก๐จ๐ฎ๐ ๐ก๐ญ ๐ฉ๐ซ๐จ๐œ๐ž๐ฌ๐ฌ ๐š๐ง๐ ๐š๐ง ๐š๐ฉ๐ฉ๐ซ๐จ๐š๐œ๐ก ๐ญ๐จ ๐ฌ๐จ๐ฅ๐ฏ๐ž ๐š๐ง ๐ฎ๐ง๐ฌ๐ž๐ž๐ง ๐๐ซ๐จ๐›๐ฅ๐ž๐ฆ."

๐‡๐ž๐ซ๐ž ๐ข๐ฌ ๐ก๐จ๐ฐ ๐ฒ๐จ๐ฎ ๐œ๐š๐ง ๐ซ๐ž๐ ๐ข๐ฌ๐ญ๐ž๐ซ ๐Ÿ๐จ๐ซ ๐ญ๐ก๐ž ๐๐ซ๐จ๐ ๐ซ๐š๐ฆ -

30 INTERVIEWS IN 30 DAYS - BIG DATA INTERVIEW SERIES

This mock interview series is launched as a community initiative under Data Engineers Club, aimed at aiding the community's growth and development.

Links to the free SQL & Python series developed by me are given below -

Don't miss out - Subscribe to the channel for more such informative interviews and unlock the secrets to success in this thriving field!

Social Media Links:

Discussed Questions (Timestamps):
1:30 Introduction
3:29 When you are processing data with a Databricks PySpark job, what is the sink for your pipeline?
4:58 Are you incorporating fact and dimension tables, or any schema, in your project's database design?
5:50 What amount of data are you dealing with in your day-to-day pipeline?
6:33 What are the different types of triggers in ADF?
7:45 What is incremental load? How can you implement it through ADF?
10:03 Difference between a Data Lake and a Data Warehouse?
11:41 What is columnar storage in a data warehouse?
13:38 What were some challenges encountered during your project, and how were they resolved? Describe the strategies implemented to optimize your pipeline.
16:18 Optimizations related to Databricks or PySpark?
20:41 What is a broadcast join? What exactly happens when we broadcast a table?
23:01 SQL coding question
35:46 PySpark coding question

Tags
#mockinterview #bigdata #career #dataengineering #data #datascience #dataanalysis #productbasedcompanies #interviewquestions #apachespark #google #interview #faang #companies #amazon #walmart #flipkart #microsoft #azure #databricks #jobs
Comments

Please provide the interview feedback in a few minutes at the end to help more with this.

ShubhamYadav-gqfe
ะะฒั‚ะพั€

For incremental load, why do we go with MERGE or UPSERT? MERGE or UPSERT is what we use to implement SCD types. For incremental load, what we want is to copy newly arrived data into ADLS, for which we keep track of some reference key through which we can recognize the new data. For example, in an Order fact table, let's say it is Order_ID, which keeps on increasing whenever we get a new order. A sketch of this idea is given below.
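
A minimal PySpark sketch of this watermark idea, assuming Delta storage; the paths SRC_PATH and SINK_PATH and the Order_ID column are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

SRC_PATH = "/mnt/source/orders"   # hypothetical source location
SINK_PATH = "/mnt/adls/orders"    # hypothetical ADLS sink

# High-water mark: the largest Order_ID already copied to the sink.
try:
    last_id = (spark.read.format("delta").load(SINK_PATH)
               .agg(F.max("Order_ID")).first()[0]) or 0
except Exception:
    last_id = 0  # first run: the sink does not exist yet

# Copy only the newly arrived orders and append them to the sink.
new_rows = (spark.read.format("delta").load(SRC_PATH)
            .filter(F.col("Order_ID") > last_id))
new_rows.write.format("delta").mode("append").save(SINK_PATH)

In ADF, the same pattern is typically a Lookup activity that reads the stored watermark, a Copy activity whose source query filters on it, and a final step that updates the watermark.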

rajnarayanshriwas
ะะฒั‚ะพั€

Please attach a link (in view mode) to the list of questions asked in the mock interview in the description.

harshitgoel
ะะฒั‚ะพั€

Sir, please make videos on topics like "someone working in tech support for the past 5 years and now moving to Data Engineering" - what should they write in their resume, in the experience section? Should they apply as a fresher, or something else?

AA
ะะฒั‚ะพั€

Good initiative. This is quite helpful on how to answer scenario-based questions, with an example. Thank you sir, Ankur and Praroop! 🙌

PradyutJoshi
ะะฒั‚ะพั€

great content! very insightful questions and answers!

yifeichen
ะะฒั‚ะพั€

Great video for new data engineers like me.

WadieGamer
ะะฒั‚ะพั€

Please make videos for freshers as well, because these days no one is looking for freshers for data engineering roles...

gopalgaihre
ะะฒั‚ะพั€

Hi folks, below is the solution to the PySpark problem written in Scala:

import org.apache.spark.sql.functions.{col, split, when}

df.withColumn("LOCATION-CODE", split(split(col("REF-ID"), "-")(1), "_")(0))
  .withColumn("LOCATION",
    when(col("LOCATION-CODE") === "CHN", "CHENNAI")
      .when(col("LOCATION-CODE") === "HYD", "HYDERABAD")
      .when(col("LOCATION-CODE") === "AP", "ANDHRA PRADESH")
      .when(col("LOCATION-CODE") === "PUNE", "PUNE"))
  .show()

RahulSaini-ngpo
ะะฒั‚ะพั€

Please also make some videos regarding what kinds of problems data engineers face in their day-to-day work.

BooksWala
ะะฒั‚ะพั€

Hi Sir, thanks for this series, very insightful. Just a query: do the majority of interviews go all the way to the coding part, or is it mostly theory only? Or is it a mix of both?

NabaKrPaul-ikoy
ะะฒั‚ะพั€

Thank you so much Sumit sir, it's really helpful.

jgtnwbd
ะะฒั‚ะพั€

Sir, please make a complete video on SQL and more mock interviews too.

Raghavendraginka
ะะฒั‚ะพั€

Solution for the PySpark problem:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def location_f(loc):
    if loc == 'CHN':
        return 'CHENNAI'
    elif loc == 'AP':
        return 'ANDHRA PRADESH'
    elif loc == 'HYD':
        return 'HYDERABAD'
    else:
        return loc

re_location = F.udf(location_f, StringType())

# "DIV-CHN_101" splits into ['', 'CHN', '101']
df1 = df.withColumn('ref_id1', F.split('ref_id', r'DIV-|_')).drop('ref_id')

df2 = (df1.withColumn('ref_id', F.col('ref_id1')[2])
          .withColumn('location', re_location(F.col('ref_id1')[1])))

df3 = df2.select('name', 'ref_id', 'salary', 'location')

df3.show()
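
For anyone trying this locally, here is a hypothetical input matching the splitting logic above (the "DIV-<CODE>_<ID>" ref_id format and the sample rows are assumptions, not from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('Ravi', 'DIV-CHN_101', 50000),
     ('Priya', 'DIV-HYD_102', 60000)],
    ['name', 'ref_id', 'salary'])

# df3.show() would then print:
# +-----+------+------+---------+
# | name|ref_id|salary| location|
# +-----+------+------+---------+
# | Ravi|   101| 50000|  CHENNAI|
# |Priya|   102| 60000|HYDERABAD|
# +-----+------+------+---------+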

karthikeyanr
ะะฒั‚ะพั€

Hi Sir, request you to please upload more Data Engineer mock interview videos.

swapnildande
ะะฒั‚ะพั€

Please make an interview session for freshers.

ashwinigadekar
ะะฒั‚ะพั€

Can you make a video for the AWS cloud, like this one for Azure?

saurabhgavande
ะะฒั‚ะพั€

from pyspark.sql.functions import col, split, when

# The splits must be nested: an alias ("l") cannot be referenced
# later within the same select.
df_new = df.select(col("name"), col("refid"), col("salary"),
                   split(split(col("refid"), "-")[1], "_")[0].alias("loc"))

final_result_df = df_new.withColumn("location", when(col("loc") == "CHN", "CHENNAI")
                                    .when(col("loc") == "HYD", "HYDERABAD")
                                    .when(col("loc") == "AP", "ANDHRA PRADESH")
                                    .when(col("loc") == "PUN", "PUNE")).drop("loc")

vishaldeshatwad