How to work with big data files (5GB+) in Python Pandas!

In this video, we quickly go over how to work with large CSV/Excel files in Python Pandas. Instead of trying to load the full file at once, you should load the data in chunks. This is especially useful for files that are a gigabyte or larger. Let me know if you have any questions :).
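For reference, a minimal sketch of the chunked approach described above (the file name and chunk size are placeholders, not the exact values from the video):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the entire file into memory at once.
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # Each chunk is a normal DataFrame you can process on its own.
    print(chunk.shape)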

Source code on Github:

Raw data used (from Kaggle):

I want to start uploading data science tips & exercises to this channel more frequently. What should I make videos on??

-------------------------
Follow me on social media!

-------------------------

Practice your Python Pandas data science skills with problems on StrataScratch!

Join the Python Army to get access to perks!

*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.

-------------------------
Video timeline!
0:00 - Overview
1:25 - What not to do
2:16 - Python code to load in large CSV file (read_csv & chunksize)
8:00 - Finalizing our data
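
Putting the timeline steps together, the chunked workflow in the video roughly looks like this; the file name, column names, and aggregation are illustrative rather than the exact code from the video:

import pandas as pd

chunk_summaries = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # Reduce each chunk to a small per-group summary before moving on.
    summary = (
        chunk.groupby(["brand", "category_code", "event_type"])
        .size()
        .reset_index(name="count")
    )
    chunk_summaries.append(summary)

# Combine the per-chunk summaries, then aggregate once more so groups
# that appeared in several chunks are merged into a single row.
output = (
    pd.concat(chunk_summaries)
    .groupby(["brand", "category_code", "event_type"], as_index=False)["count"]
    .sum()
)
print(output.head())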
Comments

It was so fascinating at the end of the video to see how that huge amount of data was compressed down to such a manageable size.

Hossein

Glad you are back, my man. I am currently in a data science bootcamp and you are way better than some of my teachers ;)

michaelhaag

In my 3 years in the field of data science, this is the best course I've ever watched.
Thank you, brother. Keep going.

mjacfardk

If (and only if) you only want to read a few columns, just specify the ones you want to process from the CSV by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* call. Took about 38 seconds to read on an M1 MacBook Air.

fruitfcker
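
For anyone who wants to try the tip above, a minimal sketch (the file name is a placeholder):

import pandas as pd

# usecols tells the parser to keep only these columns, which cuts both
# memory use and read time compared to loading every column.
df = pd.read_csv(
    "large_file.csv",
    usecols=["brand", "category_code", "event_type"],
)
print(df.shape)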

It was quick and straight to the point. Very good one, thanks.

ahmetsenol

Never used chunksize in read_csv before, it helps a lot! Great tip, thanks.

dhssb

Since you are working with Python, another approach would be to import the data into a SQLite database, then create some aggregate tables and views ...

CaribouDataScience
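
A rough sketch of the SQLite suggestion above, streaming the CSV into a table with pandas' to_sql (the file, database, and table names are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect("events.db")

# Load the CSV into SQLite chunk by chunk so it never has to fit in memory.
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    chunk.to_sql("events", conn, if_exists="append", index=False)

# Aggregations can then be pushed down to SQL queries or views.
counts = pd.read_sql_query(
    "SELECT brand, category_code, event_type, COUNT(*) AS n "
    "FROM events GROUP BY brand, category_code, event_type",
    conn,
)
conn.close()
print(counts.head())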

No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!

jacktrainer

OMG, this is gold. Thank you for sharing!

firasinuraya

Great short video! Nice job, and thanks!

elu

Great video! Hope you start making more soon

andydataguy

Thanks, Keith. Please do more videos on EDA in Python.

rishigupta

Thanks for the great lesson! Wondering how the performance compares between output = pd.concat([output, summary]) vs output.append(summary)?

spicytuna
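
On the performance question above: growing a single DataFrame inside the loop (with either pd.concat or append) copies the accumulated data on every iteration, so collecting the per-chunk summaries in a plain list and concatenating once at the end is usually noticeably faster. A hedged sketch:

import pandas as pd

summaries = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    summaries.append(chunk.groupby("brand").size().reset_index(name="count"))

# One concat at the end: each row is copied once, instead of the whole
# accumulated frame being copied on every loop iteration.
output = pd.concat(summaries, ignore_index=True)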

Pandas has capabilities I didn't even know about. Keith secretly knows everything.

AshishSingh-

How would I go about it if it were a JSON Lines (jsonl) data file?

agnesmunee
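
For the JSON Lines question above, read_json supports the same chunked pattern when lines=True (the file name and chunk size are placeholders):

import pandas as pd

# With lines=True each line is parsed as one JSON record; adding chunksize
# makes read_json return an iterator of DataFrames, just like read_csv.
reader = pd.read_json("large_file.jsonl", lines=True, chunksize=100_000)
for chunk in reader:
    print(chunk.shape)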

Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you do.

DataAnalystVictoria
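
A likely explanation for the comment above: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions it raises an error and the list-plus-pd.concat pattern is the one to use. A small illustration:

import pandas as pd

df1 = pd.DataFrame({"brand": ["a"], "count": [1]})
df2 = pd.DataFrame({"brand": ["b"], "count": [2]})

# Works on pandas < 2.0, fails on pandas >= 2.0 where append was removed:
# combined = df1.append(df2)

# Version-independent equivalent:
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)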

Why not groupby(...).size() instead of groupby(...).sum() on the column of 1's?

CS_nb
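
For the question above: the two approaches give the same counts, size() just avoids creating the helper column of 1's. A tiny example (the column name is borrowed from the dataset used in the video):

import pandas as pd

df = pd.DataFrame({"event_type": ["view", "view", "cart"]})

# The column-of-1's approach.
df["count"] = 1
by_sum = df.groupby("event_type")["count"].sum()

# Same counts without the helper column.
by_size = df.groupby("event_type").size()

print(by_sum)   # cart 1, view 2
print(by_size)  # cart 1, view 2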

I get an error message on this one. It says 'DataFrame' object is not callable. Why is that, and how do I solve it? Thanks.
for chunk in df:
    details = chunk[['brand', 'category_code', 'event_type']]
    display(details.head())
    break

manyes
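
A hedged guess at the error above: "'DataFrame' object is not callable" usually means a DataFrame was called like a function somewhere (for example df('brand') instead of df['brand']), rather than a problem with the loop itself. The loop works when df is the chunk iterator returned by read_csv with chunksize, roughly like this:

import pandas as pd

# df must be the chunk iterator returned by passing chunksize to read_csv;
# a plain DataFrame here would iterate over column names, not chunks.
df = pd.read_csv("large_file.csv", chunksize=1_000_000)

for chunk in df:
    details = chunk[["brand", "category_code", "event_type"]]
    print(details.head())  # display() also works inside Jupyter
    break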

This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating chunk by chunk makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.

rokaskarabevicius
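
One hedged workaround for the caveat above, assuming the data has a unique key column (hypothetically called event_id here) whose values fit in memory: track the keys seen so far and drop cross-chunk duplicates before aggregating.

import pandas as pd

seen = set()          # all key values encountered so far
summaries = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # Drop rows duplicated within this chunk or already seen in earlier chunks.
    # "event_id" is a hypothetical unique key column.
    chunk = chunk.drop_duplicates(subset="event_id")
    chunk = chunk[~chunk["event_id"].isin(seen)]
    seen.update(chunk["event_id"])
    summaries.append(chunk.groupby("brand").size().reset_index(name="count"))

output = pd.concat(summaries).groupby("brand", as_index=False)["count"].sum()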

I have tried and followed each step, but it gives this error:
OverflowError: signed integer is greater than maximum

machinelearning