How to use PySpark DataFrame API? | DataFrame Operations on Spark

In this tutorial we continue with PySpark. In the previous session we covered the setup, learned the basics of PySpark, and explored a few of the features it offers, such as the DataFrame API and Spark SQL. In this session we will explore these features further before we dive into building data pipelines with PySpark (the Spark API).
Spark is a distributed engine designed for processing large amounts of data, and it scales beyond a single machine. If you run into pandas out-of-memory errors because of your data size, it is time to explore Spark: it is built for large datasets and is the engine behind AWS Glue.
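
As a quick orientation, here is a minimal sketch of starting a local SparkSession and creating a DataFrame. The app name and the toy rows are illustrative assumptions, not taken from the video:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession; "pyspark-intro" is a placeholder name.
spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# A toy DataFrame standing in for a dataset too large for pandas.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.show()
```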


#apachespark #pyspark #dataframe

Topics covered in this video (a consolidated code sketch of these operations follows the list):
0:00 - Introduction to PySpark
0:28 - Spark in current context of Data
1:16 - Spark DataFrame API
2:22 - Jupyter Notebook
3:00 - Read Data from Database
4:06 - DataFrame API Operations - Rename and Select
4:35 - Sort DataFrame
5:14 - Filter Operation in DataFrame and Spark SQL
7:40 - DataFrame & SQL Join & Aggregate Operation
9:22 - Create new Columns based on condition
11:06 - Replace Null & Drop Columns
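
The sketch below strings the listed operations together in one place. It is a hedged illustration: the JDBC URL, credentials, table names (orders, customers), and column names are all hypothetical, not taken from the video, and the JDBC read assumes the PostgreSQL driver jar is on Spark's classpath. Note that Spark's built-in relational source is JDBC.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-ops").getOrCreate()

# Read Data from Database: Spark's built-in relational source is JDBC.
# URL, tables, and credentials here are hypothetical placeholders.
jdbc_url = "jdbc:postgresql://localhost:5432/shop"
orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", "spark")
    .option("password", "secret")
    .load()
)
customers = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "customers")
    .option("user", "spark")
    .option("password", "secret")
    .load()
)

# Rename and Select.
orders = orders.withColumnRenamed("cust_id", "customer_id")
slim = orders.select("order_id", "customer_id", "amount")

# Sort the DataFrame by amount, descending.
slim.orderBy(F.col("amount").desc()).show(5)

# Filter, first with the DataFrame API, then with Spark SQL on a temp view.
big = slim.filter(F.col("amount") > 100)
slim.createOrReplaceTempView("orders_v")
big_sql = spark.sql("SELECT * FROM orders_v WHERE amount > 100")

# Join and Aggregate: total and count of orders per customer.
joined = slim.join(customers, on="customer_id", how="inner")
per_customer = joined.groupBy("customer_id").agg(
    F.sum("amount").alias("total_amount"),
    F.count("order_id").alias("order_count"),
)

# Create a new column based on a condition.
flagged = per_customer.withColumn(
    "tier",
    F.when(F.col("total_amount") > 1000, "gold").otherwise("standard"),
)

# Replace nulls and drop a column.
cleaned = flagged.fillna({"total_amount": 0}).drop("order_count")
cleaned.show()
```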
Comments
Great video! Does the database connection always need to be JDBC, or can it be ODBC as well?

breadandcheese
Hi, thanks for the session on this. I'm trying to create a dashboard on the Spark UI using PySpark. Is that possible?

yuvan
Thank you for your videos. It would be nice if you could also show Kafka implementation examples.

nbkurup