Working with Large Data Sets Made Easy: Understanding Pandas Data Types

In this video, we'll show you how to use the Pandas library to make working with large datasets easy. You'll learn about the different data types that Pandas supports and see some examples of how to use them to optimize your memory usage.

🎓 Courses:

👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!

👀 Code reviewers:
- Yoriz
- Ryan Laursen
- James Dooley
- Dale Hagglund

🔖 Chapters:
0:00 Intro
0:54 About pandas
1:37 Types in pandas
3:57 Type conversion
4:30 Data type inference and conversion
11:50 Optimizing memory using categorical types
16:20 Outro

#arjancodes #softwaredesign #python

DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!
Comments

It's probably worth noting that pandas 2.0 (which is due to be released soon) implements the Apache Arrow backend, which brings it in line with Polars and massively reduces memory usage for non-numeric data types.
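For reference, opting into the Arrow backend looks roughly like this in pandas 2.0 (the file name is hypothetical, and the pyarrow package must be installed):

```python
import pandas as pd

# pandas >= 2.0: ask read_csv for Arrow-backed dtypes instead of numpy ones.
df = pd.read_csv("large_file.csv", dtype_backend="pyarrow")
print(df.dtypes)  # e.g. string[pyarrow] instead of object for text columns
```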

KLM

Great video as always 👍

Two optimizations, if I may 🙏, as I know the lib. Your final code reads the file and detects types twice, which isn't efficient.
You could read it once and avoid any type-inference attempt:
1. You are loading the CSV twice: once with bad types just to get the column names, and then again with `read_csv(..., skiprows=2)`, skipping the first two rows so the values get better type inference. You could instead use `skiprows=[1]` to read the file only once with the desired type inference: when given a list, read_csv skips only the rows at the provided indices (here the second line). This loads the column names and the required values in one go while keeping the improved type inference (the header line with the column names is excluded from type inference by default).
2. You can provide the types directly when reading the CSV by passing a dict, e.g. `read_csv(..., dtype={...})`. This prevents pandas from wasting memory and time trying to infer them. "category" is also a valid pandas type, so you can add it to your dict and read_csv will build the desired dataframe from the start.

Parsing CSVs is pretty slow, and type inference can have a significant memory impact at load time, especially when dealing with big files. Those optimizations will lower your memory footprint and can speed things up a lot.
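A minimal sketch combining both suggestions (the file name and column names are hypothetical):

```python
import pandas as pd

# Read once: skip only the second physical line of the file ([1] is a
# zero-based row index), and hand the types to the parser up front so
# no inference is attempted for these columns.
df = pd.read_csv(
    "data.csv",
    skiprows=[1],
    dtype={"city": "category", "units_sold": "int32"},
)
```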

RomualdMenuet

We have been using "pandera" to define our pandas schemas. It is great for quickly converting the inferred data types to the target types ("coerce"). The key feature is that it can validate a pandas dataframe against the pandera schema and issue a report of errors when validation fails (either before or after coercion). It also supports checking against predefined constraints, rules, unique keys, etc., which significantly boosts confidence in both sourcing of data (inputs) and transforming to generate output datasets. Totally worth looking into!
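For anyone curious, a minimal pandera sketch (the schema and column names are made up):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: coerce each column to its target type, then validate.
schema = pa.DataFrameSchema({
    "postcode": pa.Column(str, coerce=True),
    "price": pa.Column(float, pa.Check.ge(0), coerce=True),
})

df = pd.DataFrame({"postcode": [1234, 5678], "price": ["10.5", "20.0"]})
validated = schema.validate(df, lazy=True)  # lazy=True collects all failures into one report
```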

JeremyLangdon

More pandas/numpy stuff would be appreciated! What's the difference between the pandas object dtype and the string dtype?

iliasaarab

Nice video! One minor thing that might be worth noting is that pandas converted the postcode to int, so the memory increase shown is for int -> categorical. There's a chance categorical would still be more memory efficient than object.

robbiebatley

Worth mentioning that it is possible to get huge savings for numeric data types as well, just by using a smaller byte representation for integers and floats.
Usually, the min/max values being represented are small enough to fit in 32 bits instead of the original 64-bit representation.
int64 -> int32 = 50% less memory

But indeed, I am curious to see how the new backend under pandas 2.0 handles this.
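A quick sketch of that downcast on made-up data, using pd.to_numeric so pandas picks the smallest integer type that fits:

```python
import pandas as pd

s = pd.Series(range(1_000_000))              # int64 by default on most platforms: 8 bytes per value
small = pd.to_numeric(s, downcast="integer") # values fit in int32: 4 bytes per value
print(s.memory_usage(deep=True), small.memory_usage(deep=True))
```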

Victorinoeng

👍 As a coding hobbyist, it was fascinating to watch pandas being used in the terminal rather than in Jupyter notebooks.
Perhaps Arjan could consider a series of videos that actually builds a project?

exoticgolfholidays

High-quality content! Amazing explanation. More pandas, please!

azadnoorani

Just a small tip!
We can use the categorical type if we know that the column holds only a small number of distinct values (for example: Gender -> Male, Female). In these cases, it's ideal for reducing memory usage.
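A small illustration with made-up data:

```python
import pandas as pd

gender = pd.Series(["Male", "Female"] * 500_000)  # object dtype: one Python string per row
as_cat = gender.astype("category")                # small integer codes plus two category labels

print(gender.memory_usage(deep=True))  # tens of MB
print(as_cat.memory_usage(deep=True))  # roughly 1 MB
```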

pradeepgb

Please do one using pandas. Something like a small ML app to crunch numbers. Love your content.

fernandino

This is golden. I used to handle hundreds of GBs of data, and the memory usage is massive; 64 GB of RAM doesn't even cut it. Apparently I have to use Modin.

PalataoArmy

Pandas 2.0.0 is dropping in about 2 weeks with huge improvements

dokick

Pandas is a bit messy. I switched to Polars and haven't looked back.

sambroderick

Thanks Arjan! I didn't know about this categorical data type. Thanks so much!

hcubill

Are there gonna be more videos about FastAPI, or maybe some personal project? I would love to watch how you code something from scratch.

kacperwodarczyk

That was very interesting. Another team where I used to work (not mine - mine was doing JS/PHP) did a lot of pandas, so it's nice to get a grasp of what they were doing... It would be good if you now did a part two with some examples of what can be achieved with pandas once the data is efficiently loaded.

edgeeffect

Just a heads up: when you want to get a certain type using the .astype(int) function, you'll get a numpy integer type such as np.int64, which many things outside of pandas don't like. You'll still have to do int(thing) to convert it back to a normal Python int if your use case needs a regular int outside of pandas.
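A small illustration of that gotcha (the values are made up):

```python
import pandas as pd

s = pd.Series(["1", "2", "3"]).astype(int)
value = s.iloc[0]
print(type(value))       # <class 'numpy.int64'> on most platforms
print(type(int(value)))  # <class 'int'> - safe to use outside pandas
```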

scottmckaygibson

Awesome. I've been using datasets for a while on very large sqlite3 reads, but not with categorical types, so I'll have to see how that works. Thanks.

barrykruyssen

Just passing by to note that it is possible to specify data types in the read methods. It always saves type-inference processing cycles, and sometimes it is almost necessary; e.g., in the zip code column you would lose leading zeros if you let pandas read it as integers. You could always `str.zfill(5)` the strings later, but is that good design?
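A sketch of that idea (the file and column names are hypothetical):

```python
import pandas as pd

# Read zip codes as strings up front; otherwise "01234" silently becomes 1234.
df = pd.read_csv("customers.csv", dtype={"zip_code": "string"})
```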

Moncamonca

But the pandas backend has switched to PyArrow (pandas 2.0), please make a small update!!! :) I'm struggling with data type changes in PyArrow (pandas 2).
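If it helps, converting a column to an Arrow-backed dtype in pandas 2.x looks roughly like this (the column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0, 3.25]})
# pandas >= 2.0 accepts "type[pyarrow]" string aliases for Arrow-backed dtypes.
df["price"] = df["price"].astype("float64[pyarrow]")
print(df.dtypes)
```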

Micro-bit