How to work with big data files (5GB+) in Python Pandas!

In this video, we quickly go over how to work with large CSV/Excel files in Python Pandas. Instead of trying to load the full file at once, you should load the data in chunks. This is especially useful for files that are a gigabyte or larger. Let me know if you have any questions :).
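For reference, a minimal sketch of the chunked approach described above (the file name and chunk size are placeholders, not the exact values from the video):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the entire file into memory at once.
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # Each chunk is a normal DataFrame you can process on its own.
    print(chunk.shape)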

Source code on Github:

Raw data used (from Kaggle):

I want to start uploading data science tips & exercises to this channel more frequently. What should I make videos on??

-------------------------
Follow me on social media!

-------------------------

Practice your Python Pandas data science skills with problems on StrataScratch!

Join the Python Army to get access to perks!

*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.

-------------------------
Video timeline!
0:00 - Overview
1:25 - What not to do
2:16 - Python code to load in large CSV file (read_csv & chunksize)
8:00 - Finalizing our data
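
Putting the timeline steps together, the chunked workflow in the video roughly looks like this; the file name, column names, and aggregation are illustrative rather than the exact code from the video:

import pandas as pd

chunk_summaries = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # Reduce each chunk to a small per-group summary before moving on.
    summary = (
        chunk.groupby(["brand", "category_code", "event_type"])
        .size()
        .reset_index(name="count")
    )
    chunk_summaries.append(summary)

# Combine the per-chunk summaries, then aggregate once more so groups
# that appeared in several chunks are merged into a single row.
output = (
    pd.concat(chunk_summaries)
    .groupby(["brand", "category_code", "event_type"], as_index=False)["count"]
    .sum()
)
print(output.head())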
Comments

It was so fascinating at the end of the video to see how that huge amount of data was compressed down to such a manageable size.

Hossein

Glad you are back, my man. I am currently in a data science bootcamp and you are way better than some of my teachers ;)

michaelhaag

In my 3 years in the field of data science, this is the best course I've ever watched.
Thank you, brother. Keep going.

mjacfardk

If (and only if) you only want to read a few columns, just specify the ones you want to process from the CSV by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* call. Took about 38 seconds to read on an M1 MacBook Air.

fruitfcker
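
For anyone who wants to try the tip above, a minimal sketch (the file name is a placeholder):

import pandas as pd

# usecols tells the parser to keep only these columns, which cuts both
# memory use and read time compared to loading every column.
df = pd.read_csv(
    "large_file.csv",
    usecols=["brand", "category_code", "event_type"],
)
print(df.shape)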

It was quick and straight to the point. Very good one, thanks.

ahmetsenol

Never used chunksize in read_csv before, it helps a lot! Great tip, thanks.

dhssb

Since you are working with Python, another approach would be to import the data into a SQLite database, then create some aggregate tables and views ...

CaribouDataScience
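
A rough sketch of the SQLite suggestion above, streaming the CSV into a table with pandas' to_sql (the file, database, and table names are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect("events.db")

# Load the CSV into SQLite chunk by chunk so it never has to fit in memory.
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    chunk.to_sql("events", conn, if_exists="append", index=False)

# Aggregations can then be pushed down to SQL queries or views.
counts = pd.read_sql_query(
    "SELECT brand, category_code, event_type, COUNT(*) AS n "
    "FROM events GROUP BY brand, category_code, event_type",
    conn,
)
conn.close()
print(counts.head())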

No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!

jacktrainer

OMG, this is gold. Thank you for sharing!

firasinuraya

Great short video! Nice job, and thanks!

elu

Great video! Hope you start making more soon

andydataguy

Thanks, Keith. Please do more videos on EDA in Python.

rishigupta

Thanks for the great lesson! Wondering how the performance compares between output = pd.concat([output, summary]) vs output.append(summary)?

spicytuna
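
On the performance question above: growing a single DataFrame inside the loop (with either pd.concat or append) copies the accumulated data on every iteration, so collecting the per-chunk summaries in a plain list and concatenating once at the end is usually noticeably faster. A hedged sketch:

import pandas as pd

summaries = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    summaries.append(chunk.groupby("brand").size().reset_index(name="count"))

# One concat at the end: each row is copied once, instead of the whole
# accumulated frame being copied on every loop iteration.
output = pd.concat(summaries, ignore_index=True)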

Pandas has capabilities I didn't even know about. Keith secretly knows everything.

AshishSingh-

How would I go about it if it were a JSON Lines (jsonl) data file?

agnesmunee
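
For the JSON Lines question above, read_json supports the same chunked pattern when lines=True (the file name and chunk size are placeholders):

import pandas as pd

# With lines=True each line is parsed as one JSON record; adding chunksize
# makes read_json return an iterator of DataFrames, just like read_csv.
reader = pd.read_json("large_file.jsonl", lines=True, chunksize=100_000)
for chunk in reader:
    print(chunk.shape)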

Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you do.

DataAnalystVictoria
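
A likely explanation for the comment above: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions it raises an error and the list-plus-pd.concat pattern is the one to use. A small illustration:

import pandas as pd

df1 = pd.DataFrame({"brand": ["a"], "count": [1]})
df2 = pd.DataFrame({"brand": ["b"], "count": [2]})

# Works on pandas < 2.0, fails on pandas >= 2.0 where append was removed:
# combined = df1.append(df2)

# Version-independent equivalent:
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)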

Why not groupby(...).size() instead of groupby(...).sum() on the column of 1's?

CS_nb
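
For the question above: the two approaches give the same counts, size() just avoids creating the helper column of 1's. A tiny example (the column name is borrowed from the dataset used in the video):

import pandas as pd

df = pd.DataFrame({"event_type": ["view", "view", "cart"]})

# The column-of-1's approach.
df["count"] = 1
by_sum = df.groupby("event_type")["count"].sum()

# Same counts without the helper column.
by_size = df.groupby("event_type").size()

print(by_sum)   # cart 1, view 2
print(by_size)  # cart 1, view 2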

I get an error message on this one. It says 'DataFrame' object is not callable. Why is that, and how do I solve it? Thanks.
for chunk in df:
    details = chunk[['brand', 'category_code', 'event_type']]
    display(details.head())
    break

manyes
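
A hedged guess at the error above: "'DataFrame' object is not callable" usually means a DataFrame was called like a function somewhere (for example df('brand') instead of df['brand']), rather than a problem with the loop itself. The loop works when df is the chunk iterator returned by read_csv with chunksize, roughly like this:

import pandas as pd

# df must be the chunk iterator returned by passing chunksize to read_csv;
# a plain DataFrame here would iterate over column names, not chunks.
df = pd.read_csv("large_file.csv", chunksize=1_000_000)

for chunk in df:
    details = chunk[["brand", "category_code", "event_type"]]
    print(details.head())  # display() also works inside Jupyter
    break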

This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating chunk by chunk makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.

rokaskarabevicius
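
One hedged workaround for the caveat above, assuming the data has a unique key column (hypothetically called event_id here) whose values fit in memory: track the keys seen so far and drop cross-chunk duplicates before aggregating.

import pandas as pd

seen = set()          # all key values encountered so far
summaries = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # Drop rows duplicated within this chunk or already seen in earlier chunks.
    # "event_id" is a hypothetical unique key column.
    chunk = chunk.drop_duplicates(subset="event_id")
    chunk = chunk[~chunk["event_id"].isin(seen)]
    seen.update(chunk["event_id"])
    summaries.append(chunk.groupby("brand").size().reset_index(name="count"))

output = pd.concat(summaries).groupby("brand", as_index=False)["count"].sum()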

I have tried and followed each step, but it gives this error:
OverflowError: signed integer is greater than maximum

machinelearning