Read Giant Datasets Fast - 3 Tips For Better Data Science Skills

We've learned how to work with data. But how about massive amounts of data? As in: files with millions of rows, tens of gigabytes in size, and ages spent staring at your computer waiting for everything to load?
Luckily, in this tutorial, I will show you how to work with a gigantic dataset of Amazon Best Seller Products that has over 2 million rows and takes up 11GB on disk 😱😱😱
A huge shoutout to Bright Data for supplying it and helping this video come to life!
⭐ You can get a free sample of this dataset here:

Additionally, I will demonstrate that slight improvements to your code can make a huge impact on processing speed - regardless of how strong and powerful your computer is!!
For this, we will compare the performance across 2 different systems:
🖥️ my custom-built new-gen PC
💻 my poor old laptop (yes, the one that is held together by scotch tape and is barely operational 😅)

You will see that well-written code can even make my old laptop run like a supercomputer! 💪💪💪 #python #datasets #brightdata #data #ecommerce #datascience #pandas #pythonprogramming

📽️ RELATED TUTORIALS 📽️
----------------------------------------------
⭐ Anaconda Guide For Beginners (Install Jupyter Notebook):
⭐ Pandas Guide For Beginners:
⭐ For Loop For Beginners:

⏰ TIME STAMPS ⏰
----------------------------------------------
00:00 - intro
01:05 - intro to working with professional data platforms
03:38 - complexity of loading very large datasets
06:43 - focus on relevant data ⭐
09:09 - load data in small chunks ⭐
10:25 - access and change data chunks values
12:19 - save modified data into a new csv file ⭐ (the three ⭐ tips are sketched in code below)
14:49 - Thanks for watching! 😀
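
A minimal code sketch of the three ⭐ tips above, assuming the dataset is a plain CSV read with pandas (the file path, column names, and rating filter are placeholders, not the actual Bright Data schema):

import pandas as pd

SOURCE = "amazon_best_sellers.csv"        # placeholder file name
OUTPUT = "best_sellers_clean.csv"
COLUMNS = ["title", "price", "rating"]    # tip 1: only load the columns you actually need

# tip 2: read the file in manageable chunks instead of all at once
reader = pd.read_csv(SOURCE, usecols=COLUMNS, chunksize=100_000)
for i, chunk in enumerate(reader):
    # modify each chunk here, e.g. keep only highly rated products
    chunk = chunk[chunk["rating"] >= 4.5]

    # tip 3: append every processed chunk to a new CSV,
    # writing the header only for the first chunk
    chunk.to_csv(OUTPUT, mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)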

🤝 Connect with me 🤝
----------------------------------------------
🔗 Github:
🔗 Discord:
🔗 LinkedIn:
🔗 Twitter:
🔗 Blog:

💳 Credits 💳
----------------------------------------------
⭐ Beautiful titles, transitions, sound FX, and music:
⭐ Beautiful icons:
⭐ Beautiful graphics:
Comments

It would have been complete if you had shown how to save to a new file when the loading was done in chunks.

xr

You can time things in a Jupyter notebook using a built-in magic command. If you wanna time a single-line statement, prefix it with %time. If you wanna time the whole cell, put %%time at the start. Similarly, there are magic commands %timeit and %%timeit, which will run your code multiple times and report the fastest time.

phsopher
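
A quick sketch of those magics as they would run in a Jupyter/IPython cell (the file and column names are placeholders; magics are notebook syntax, not plain Python):

import pandas as pd

# %time runs the statement once and prints the wall/CPU time
%time df = pd.read_csv("amazon_best_sellers.csv", usecols=["title", "price"])

# %timeit runs the expression many times and reports timing statistics
%timeit df["price"].mean()

# %%time or %%timeit on the first line of a cell times the whole cell instead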

Your enthusiasm shines through as usual. I know there have been some difficult times since the introduction of OpenAI etc., but you must not stop doing what you're doing, because thousands of people are relying on you, your wonderful teaching skills, and your Python (amongst other) knowledge. Thank you.

shanesteven

I need to get more into Data Science and Machine Learning processes, and videos like this help me a lot. Thanks for that!

danielschwan

Saving in pickle or Feather format instead of CSV will be much faster and use less memory.

pd
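
A small sketch of those formats, assuming a DataFrame df is already in memory and pyarrow is installed for Feather support (file names and columns are placeholders):

import pandas as pd

df = pd.DataFrame({"title": ["a", "b"], "price": [9.99, 19.99]})  # stand-in for the real data

# Pickle: binary, preserves dtypes exactly, much faster than a CSV round-trip
df.to_pickle("products.pkl")
df = pd.read_pickle("products.pkl")

# Feather: columnar binary format, requires pyarrow
df.to_feather("products.feather")
df = pd.read_feather("products.feather")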

For me personally, the best channel out there to learn Python!!! I am not kidding! Thank you so much! ❤

mellowbeatz

Beautifully done! Please never stop making these videos :)

faisalee

My approach when downloading very large CSV files is to use
data = requests.get(url, stream=True).iter_lines()

That returns an iterable over the data but doesn't start downloading it at this stage.

The first row will be the headings, so get that with something like
headings = next(data).decode("utf-8").split(", ")

Then loop over the body of the data either with a for loop, list comprehension, or multiprocessing.Pool().map()
and dump each line into a database, then do queries on the database to analyse it.

Or, if it isn't quite so big, put it in a NumPy array and work on it from there.

katrinabryce
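
A hedged, runnable take on the streaming approach above, using the csv module instead of split(",") so quoted fields containing commas survive (the URL is a placeholder):

import csv
import requests

url = "https://example.com/amazon_best_sellers.csv"   # placeholder URL

# stream=True means the body is only downloaded as we iterate over it
response = requests.get(url, stream=True)
lines = response.iter_lines(decode_unicode=True)

reader = csv.reader(line for line in lines if line)   # skip blank keep-alive lines
headings = next(reader)                                # the first row holds the column names

for row in reader:
    record = dict(zip(headings, row))
    # insert record into a database here, or filter/aggregate it on the fly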

Great presentation - your explanations are the best!!

mtmanalyst

Thank you for your great video. But perhaps for 15 GB of data it's better to use Polars instead of Pandas. It has a similar syntax to Pandas, so you don't find yourself on a different planet, and it uses Rust code for faster execution. It is particularly suitable for processing large datasets, as it has built-in support for multi-threaded and multi-core processing.

visualish
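
For reference, a minimal Polars sketch of the same idea - lazy scanning so only the needed columns and rows are ever materialized (file and column names are placeholders):

import polars as pl

result = (
    pl.scan_csv("amazon_best_sellers.csv")     # lazy: nothing is read yet
      .select(["title", "price", "rating"])    # only materialize the needed columns
      .filter(pl.col("rating") >= 4.5)
      .collect()                               # runs the plan using all CPU cores
)
print(result.head())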

Your videos are extremely didactic and easy to understand; they are the most beautiful and elegant projects on YouTube! Congratulations.

jorgevector

I am just learning Python and found some of your videos. You are very good and very clear. I had issues installing Anaconda on my Windows 11 computer. It was very slow and crashed most of the time. I have good hardware, so that was not the issue. I might try it again, as this Jupyter looks good. Thanks for your time and effort making these videos.

marcq

Great vid! Are you going to do any vids on Natural Language Processing (NLP), with tools like spaCy, NLTK, Gensim, or CoreNLP?

d-rey

I was really surprised how fast Python loaded that huge dataset!

LostPlaceChroniken

The most beautiful voice on YouTube, thank you for the well-narrated and well-produced content ;)

ssbrunocode

Wow! Very helpful video. I was dealing with this problem. Love your videos, thanks.

fredoh

I really would suggest using Polars instead of pandas when working with big files... it can be 7 times faster; use timeit to measure the difference. When handling that amount of data, every minute counts. I love pandas but Polars is WAY faster. Cheers. Nice tutorial though. Did you buy the full set? How did you get access to the full dataset?

pieterbosch

Great information and knowledge! And I love your energy!

rayoh

This was beautiful, just in time for my work. Keep it up!

ciscodea

Very clearly explained and with your usual enthusiasm, keep it up! :)

HadiLePanda