Read Giant Datasets Fast - 3 Tips For Better Data Science Skills

We've learned how to work with data. But how about massive amounts of data? As in: files with millions of rows, tens of gigabytes in size, and ages spent staring at your computer waiting for everything to load?
Luckily, in this tutorial, I will show you how to work with a gigantic dataset of Amazon Best Seller Products that has over 2 million rows and takes up 11GB on disk 😱😱😱
A huge shoutout to Bright Data for supplying it and helping this video come to life!
⭐ You can get a free sample of this dataset here:

Additionally, I will demonstrate that slight improvements to your code can make a huge impact on processing speed - regardless of how strong and powerful your computer is!!
For this, we will compare the performance across 2 different systems:
🖥️ my custom-built new-gen PC
💻 my poor old laptop (yes, the one that is held together by scotch tape and is barely operational 😅)

You will see that well-written code can even make my old laptop run like a supercomputer! 💪💪💪 #python #datasets #brightdata #data #ecommerce #datascience #pandas #pythonprogramming

📽️ RELATED TUTORIALS 📽️
----------------------------------------------
⭐ Anaconda Guide For Beginners (Install Jupyter Notebook):
⭐ Pandas Guide For Beginners:
⭐ For Loop For Beginners:

⏰ TIME STAMPS ⏰
----------------------------------------------
00:00 - intro
01:05 - intro to working with professional data platforms
03:38 - complexity of loading very large datasets
06:43 - focus on relevant data ⭐
09:09 - load data in small chunks ⭐
10:25 - access and change data chunks values
12:19 - save modified data into a new csv file ⭐ (the three ⭐ tips are sketched in code below)
14:49 - Thanks for watching! 😀
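
A minimal code sketch of the three ⭐ tips above, assuming the dataset is a plain CSV read with pandas (the file path, column names, and rating filter are placeholders, not the actual Bright Data schema):

import pandas as pd

SOURCE = "amazon_best_sellers.csv"        # placeholder file name
OUTPUT = "best_sellers_clean.csv"
COLUMNS = ["title", "price", "rating"]    # tip 1: only load the columns you actually need

# tip 2: read the file in manageable chunks instead of all at once
reader = pd.read_csv(SOURCE, usecols=COLUMNS, chunksize=100_000)
for i, chunk in enumerate(reader):
    # modify each chunk here, e.g. keep only highly rated products
    chunk = chunk[chunk["rating"] >= 4.5]

    # tip 3: append every processed chunk to a new CSV,
    # writing the header only for the first chunk
    chunk.to_csv(OUTPUT, mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)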

🤝 Connect with me 🤝
----------------------------------------------
🔗 Github:
🔗 Discord:
🔗 LinkedIn:
🔗 Twitter:
🔗 Blog:

💳 Credits 💳
----------------------------------------------
⭐ Beautiful titles, transitions, sound FX, and music:
⭐ Beautiful icons:
⭐ Beautiful graphics:
Comments

It would have been complete if you had shown how to save to a new file when the loading was done in chunks.

xr

You can time things in a Jupyter notebook using a built-in magic command. If you wanna time a single-line statement, prefix it with %time. If you wanna time the whole cell, put %%time at the start. Similarly, there are magic commands %timeit and %%timeit, which will run your code multiple times and report the fastest time.

phsopher
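
A quick sketch of those magics as they would run in a Jupyter/IPython cell (the file and column names are placeholders; magics are notebook syntax, not plain Python):

import pandas as pd

# %time runs the statement once and prints the wall/CPU time
%time df = pd.read_csv("amazon_best_sellers.csv", usecols=["title", "price"])

# %timeit runs the expression many times and reports timing statistics
%timeit df["price"].mean()

# %%time or %%timeit on the first line of a cell times the whole cell instead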

Your enthusiasm shines through as usual. I know there have been some difficult times since the introduction of OpenAI etc., but you must not stop doing what you're doing, because thousands of people are relying on you, your wonderful teaching skills, and your Python (amongst other) knowledge. Thank you.

shanesteven

I need to get more into Data Science and Machine Learning processes, and videos like this help me a lot. Thanks for that!

danielschwan

Saving in pickle or Feather format instead of CSV will be much faster and use less memory.

pd
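
A small sketch of those formats, assuming a DataFrame df is already in memory and pyarrow is installed for Feather support (file names and columns are placeholders):

import pandas as pd

df = pd.DataFrame({"title": ["a", "b"], "price": [9.99, 19.99]})  # stand-in for the real data

# Pickle: binary, preserves dtypes exactly, much faster than a CSV round-trip
df.to_pickle("products.pkl")
df = pd.read_pickle("products.pkl")

# Feather: columnar binary format, requires pyarrow
df.to_feather("products.feather")
df = pd.read_feather("products.feather")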

For me personally, the best channel out there to learn Python!!! I am not kidding! Thank you so much! ❤

mellowbeatz

Beautifully done! Please never stop making these videos :)

faisalee

My approach when downloading very large CSV files is to use
data = requests.get(url, stream=True).iter_lines()

That returns an iterable over the data but doesn't start downloading it at this stage.

The first row will be the headings, so get that with something like
headings = next(data).decode("utf-8").split(", ")

Then loop over the body of the data either with a for loop, list comprehension, or multiprocessing.Pool().map()
and dump each line into a database, then do queries on the database to analyse it.

Or, if it isn't quite so big, put it in a NumPy array and work on it from there.

katrinabryce
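
A hedged, runnable take on the streaming approach above, using the csv module instead of split(",") so quoted fields containing commas survive (the URL is a placeholder):

import csv
import requests

url = "https://example.com/amazon_best_sellers.csv"   # placeholder URL

# stream=True means the body is only downloaded as we iterate over it
response = requests.get(url, stream=True)
lines = response.iter_lines(decode_unicode=True)

reader = csv.reader(line for line in lines if line)   # skip blank keep-alive lines
headings = next(reader)                                # the first row holds the column names

for row in reader:
    record = dict(zip(headings, row))
    # insert record into a database here, or filter/aggregate it on the fly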

Great presentation - your explanations are the best!!

mtmanalyst

Thank you for your great video. But perhaps for 15 GB of data it's better to use Polars instead of Pandas. It has a similar syntax to Pandas, so you don't find yourself on a different planet, and it uses Rust code for faster execution. It is particularly suitable for processing large datasets, as it has built-in support for multi-threaded and multi-core processing.

visualish
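
For reference, a minimal Polars sketch of the same idea - lazy scanning so only the needed columns and rows are ever materialized (file and column names are placeholders):

import polars as pl

result = (
    pl.scan_csv("amazon_best_sellers.csv")     # lazy: nothing is read yet
      .select(["title", "price", "rating"])    # only materialize the needed columns
      .filter(pl.col("rating") >= 4.5)
      .collect()                               # runs the plan using all CPU cores
)
print(result.head())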

Your videos are extremely didactic and easy to understand; they are the most beautiful and elegant projects on YouTube! Congratulations.

jorgevector

I am just learning Python and found some of your videos. You are very good and very clear. I had issues installing Anaconda on my Windows 11 computer. It was very slow and crashed most of the time. I have good hardware, so that was not the issue. I might try it again, as this Jupyter looks good. Thanks for your time and effort making these videos.

marcq

Great vid! Are you going to do any vids on Natural Language Processing (NLP), with tools like spaCy, NLTK, Gensim, or CoreNLP?

d-rey

I was really surprised how fast Python loaded that huge dataset!

LostPlaceChroniken

The most beautiful voice on YouTube, thank you for the well-narrated and well-produced content ;)

ssbrunocode

Wow! Very helpful video. I was dealing with this problem. Love your videos, thanks.

fredoh

I really would suggest using Polars instead of pandas when working with big files... it can be 7 times faster; use timeit to measure the difference. When handling that amount of data, every minute counts. I love pandas but Polars is WAY faster. Cheers. Nice tutorial though. Did you buy the full set? How did you get access to the full dataset?

pieterbosch

Great information and knowledge! And I love your energy!

rayoh

This was beautiful, just in time for my work. Keep it up!

ciscodea

Very clearly explained and with your usual enthusiasm, keep it up! :)

HadiLePanda