How to process a large dataset with pandas | Avoid out-of-memory issues while loading data into pandas

In this tutorial, we cover how to handle large datasets with pandas. I have received a few questions about handling a dataset that is larger than the available memory of the computer. How can we process such datasets with pandas?
My first suggestion would be to filter the data before loading it into a pandas DataFrame. Second, use a distributed engine designed for big data; some examples are Dask, Apache Flink, Kafka, and Spark. We are covering Spark in a recent series. These systems use a cluster of computers, called nodes, to process data and can handle terabytes of data depending on the available nodes.
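To illustrate the first suggestion, here is a minimal sketch of filtering before loading, assuming a hypothetical PostgreSQL database with a table named sales and a column order_date; pushing the WHERE clause to the database means pandas only ever sees the rows you actually need.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table names; adjust them to your own database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")

# Filter on the database server so only the rows of interest reach pandas.
query = """
    SELECT order_id, customer_id, amount
    FROM sales
    WHERE order_date >= '2023-01-01'
"""
df = pd.read_sql(query, engine)
print(df.shape)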
Anyway, let's say we have a medium-sized dataset in a relational database and we want to process it with pandas. How can we safely load it into pandas?
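Below is a minimal sketch of the client-side batching idea, again assuming the same hypothetical sales table: passing chunksize to pd.read_sql returns an iterator of DataFrames, so each batch can be processed and discarded before the next one is fetched, keeping memory usage bounded.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust to your own database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")

total = 0
# chunksize makes read_sql yield DataFrames of up to 50,000 rows instead of one big frame.
for chunk in pd.read_sql("SELECT customer_id, amount FROM sales", engine, chunksize=50_000):
    # Process each batch independently, e.g. aggregate it and let it go out of scope.
    total += chunk["amount"].sum()

print(total)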
#pandas #memorymanagement #batchprocessing
#ETL #Python #SQL
Topics covered in this video:
0:00 - Introduction to Pandas large data handling
0:19 - Recommendation for large datasets
0:58 - Why memory error occurs?
1:26 - Pandas batching or a server-side cursor as a solution
1:49 - Simple example with Jupyter Notebook
3:04 - Method Two: Batch processing on the client
4:56 - Method Three: Batch processing on the server
6:19 - Pandas-dev PR for server-side cursor
6:36 - Pandas batching overview and summary
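Following up on the server-side batching mentioned in the chapter list above, here is a rough sketch, assuming the same hypothetical sales table and a PostgreSQL backend: SQLAlchemy's stream_results execution option asks the driver for a server-side cursor, so rows are streamed to the client in batches rather than materialized all at once.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust to your own database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")

# stream_results=True requests a server-side cursor (a named cursor with psycopg2),
# so the driver fetches rows in batches instead of loading the full result set.
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql("SELECT customer_id, amount FROM sales", conn, chunksize=50_000):
        print(len(chunk))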