Stop wasting memory in your Pandas DataFrame!

Watch how quickly we can reduce your DataFrame's memory usage with just a couple of tips.

00:00 - Intro
00:10 - Initial read_csv
00:49 - Tip 1: usecols
01:58 - Calculating data types
03:32 - Tip 2: dtype
04:23 - How do you cut back on memory usage?
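
Both tips land in a single read_csv call. A minimal sketch, assuming a hypothetical file and hypothetical column names:

import pandas as pd

df = pd.read_csv(
    "sales.csv",
    usecols=["date", "region", "units_sold"],             # Tip 1: load only the columns you need
    dtype={"region": "category", "units_sold": "int32"},  # Tip 2: pick compact dtypes
    parse_dates=["date"],
)

print(df.memory_usage(deep=True).sum())  # total bytes, string contents included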

Theme: Dracula Soft

#jupyterinvscode, #jupyter, #python
Comments

The dtype! I had no idea this was even possible.

BurkeHolland

Just by specifying the correct type for each column, I was able to cut the memory usage by almost 50%. Thanks!!!
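
One way to measure a saving like that yourself; the column names here are hypothetical:

import pandas as pd

# Compare the default-dtype load against an explicitly typed one.
before = pd.read_csv("data.csv").memory_usage(deep=True).sum()
after = pd.read_csv(
    "data.csv", dtype={"region": "category", "units_sold": "int32"}
).memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")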

irodrigoarias

Pretty awesome, I really liked specifying the data types. Reducing memory by an order of magnitude is fantastic.

stargazer

Awesome tips, the data type trick is bonkers 😀

sarangkharpate

I've honestly never had enough data to get a memory error from pandas. I really like this video though, because I do use pandas a bit, and knowing this will help me if I ever work with huge datasets.

ashrafibrahim

So many great tricks once you read the docs

nczioox

Useful! But what if I just want a quick look at a dataset and I don't know its structure? How should I do that?
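
One common approach (not covered in the video): read just a few rows first and let pandas infer the structure. The filename here is hypothetical:

import pandas as pd

# Read only the first 100 rows to inspect an unfamiliar file cheaply.
sample = pd.read_csv("unknown.csv", nrows=100)
sample.info()         # column names, inferred dtypes, non-null counts
print(sample.head())  # first few rows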

ElinLiu

How do you reduce the font size of the cell output (only)?

abhisek

I know it's an old video, but how can I limit the string size? Say I know the max length; how do I avoid over-allocation? More importantly, how do I specify a "non-growing" string (dynamic allocation is performance hell)? Does anyone know?

seventyfive

Great advice, assuming you can read the CSV file all at once. The file I need to read is so big that I can't even open it with pandas directly.
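
For files that don't fit in memory at all, pandas can also read in chunks. A rough sketch, with a hypothetical filename, column, and aggregation:

import pandas as pd

total = 0
# Stream the file in 100,000-row chunks instead of loading it whole.
for chunk in pd.read_csv("huge.csv", usecols=["units_sold"],
                         dtype={"units_sold": "int32"}, chunksize=100_000):
    total += chunk["units_sold"].sum()
print(total)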

manoeljose

I have a problem with my data.
I have lots of DataFrames (Excel files),
each from a different vendor,
all with product descriptions (code, name, size, color, price, etc.).

The problem is
that there is no fixed pattern.
All vendors send me their own Excel files daily,
but they don't all share the same parameters.
For example, some have a color column and others don't.

For context,
I'm using Django.
My goal is to have a Product model with all attributes, but only create or update the fields actually provided by the vendor.

The first time, while creating (bulk create),
I add all fields and set a default for the missing ones.
But when updating, I should only update the fields with new, different values, like price, since descriptions should never change; otherwise it would be a new Product.

I started with simple looping code,
and for a 2,000-row Excel file
it takes 15 minutes to check all the info and handle each field based on preset conditions.
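
Row-by-row loops are usually the bottleneck in pandas. A rough sketch of normalizing each vendor file to a fixed schema without looping; the file, column, and default names are all hypothetical:

import pandas as pd

# The full Product schema, with defaults for fields a vendor may omit.
SCHEMA = ["code", "name", "size", "color", "price"]
DEFAULTS = {"size": "", "color": "unknown", "price": 0.0}

df = pd.read_excel("vendor.xlsx")
# Add any missing columns in one shot, then fill their defaults.
df = df.reindex(columns=SCHEMA).fillna(DEFAULTS)

From there, creates and updates can compare whole columns at once instead of checking fields row by row.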

LcTheSecond

This is very interesting, though if I'm being honest, every time I actually ran into a MemoryError with pandas, it was because I had made a stupid mistake, and these tips wouldn't have helped much. Still, thanks for the tips.

mirimjam

I agree with most of your talk, but the choice of int16 seems a little risky. With a maximum positive value of 32767, it is less than a factor of ten away from the maximum of the sample presented (4611). I would not feel safe when the maximum of the current data and the maximum of the type representing it are within the same order of magnitude, certainly when also running models on future data, which may be quite different from the current dataset. An int32 type therefore seems better, although a uint variant is also applicable here and roughly doubles the usable range, since "units sold" should not be negative, I presume.
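
For reference, the ranges in question can be checked with numpy:

import numpy as np

# Upper bounds of the candidate integer dtypes.
print(np.iinfo(np.int16).max)   # 32767  -- under 10x the sample max of 4611
print(np.iinfo(np.uint16).max)  # 65535  -- unsigned roughly doubles the range
print(np.iinfo(np.int32).max)   # 2147483647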

Kind regards

milanwillaert

So you're telling me I haven't used pandas to its full potential yet?

redmastern