How to use GROUPBY and AGG functions in PySpark in Microsoft Fabric (Day 9 of 30)

Learn Apache Spark in Microsoft Fabric in the 30 days of September.

Spark is the engine behind both the Data Engineering AND the Data Science experience in Microsoft Fabric, so in September I'll be walking you through Apache Spark: what it is, why you should learn it, how to use it, and how it integrates into Microsoft Fabric.

No previous Spark knowledge is required, though some basic Python would be useful!

#pyspark #microsoftfabric #apachespark

Here's the schedule:

Timeline
0:00 Intro
1:02 Loading in some data
2:36 The simplest GroupBy
3:32 Renaming aggregate column names
5:34 Multiple aggregate functions
6:38 Grouping and filtering in same function
7:50 GroupBy multiple columns
8:36 Summary

--BROWSE MY OTHER FABRIC PLAYLISTS--

--LINKEDIN--

--ABOUT WILL--
Hi, I'm Will! I'm hugely passionate about data and using it to create a better world. I currently work as a Consultant, focusing on Data Strategy, Data Engineering and Business Intelligence (within the Microsoft/Azure/Fabric environment). I have previously worked as a Data Scientist. I started Learn Microsoft Fabric to share my learnings on how Microsoft Fabric works and help you build your career and build meaningful things in Fabric.

--SUBSCRIBE--
Not subscribed yet? You should! There are lots of new videos in the pipeline covering all aspects of Microsoft Fabric.
Comments
Author

Will, I have a question about some strange behavior I am seeing in a Fabric notebook. I am executing the following code, a simple group by (though my df has around 300 million rows), and then displaying df_agg, which should have fewer than 1,000 rows.
df_agg = df.groupBy("DATE", "CATEGORY").agg(count(col("ID")).alias("ID_COUNT"), countDistinct(col("ID")).alias("DISTINCT_ID_COUNT"))
display(df_agg)
What I am seeing is that calling display(df_agg) recomputes the groupBy statement, i.e. I am performing the groupBy + aggregate process twice instead of once. Is that normal? Am I doing something wrong?

mkj-md
Author

Will, I could not find the property-sales-extended.csv file in your GitHub?

etrasher