Exporting CSV files to Parquet with Pandas, Polars, and DuckDB

In this video, we'll learn how to convert bigger-than-memory CSV files to Parquet format. We'll look at how to do this task using Pandas, Polars, and DuckDB.
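
As a rough illustration of the three approaches covered in the video (the file names below are placeholders, and the exact code shown on screen may differ):

import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

# Polars: lazily scan the CSV and stream the result straight to Parquet,
# so the whole file never has to fit in memory.
pl.scan_csv("big_file.csv").sink_parquet("big_file_polars.parquet")

# DuckDB: read the CSV and COPY the result out as Parquet, also streaming.
duckdb.sql(
    "COPY (SELECT * FROM read_csv_auto('big_file.csv')) "
    "TO 'big_file_duckdb.parquet' (FORMAT PARQUET)"
)

# Pandas: read the CSV in chunks and append each chunk to a ParquetWriter,
# since pandas itself has no streaming Parquet export.
writer = None
for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        writer = pq.ParquetWriter("big_file_pandas.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()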

#pandas #python #polars #duckdb

Comments

Many thanks for this nice video. A question about the DuckDB method you presented: when exporting a table from DuckDB to disk in Parquet format using COPY, is it possible to pass a partitioning parameter that specifies keys (Hive style) by which the data would be split?
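
DuckDB's COPY ... TO does support Hive-style partitioned writes via a PARTITION_BY option; a minimal sketch, where the table and column names are made up for illustration:

import duckdb

con = duckdb.connect()
# Writes one directory per distinct (year, month) pair,
# e.g. out/year=2023/month=5/data_0.parquet
con.sql(
    "COPY (SELECT * FROM my_table) "
    "TO 'out' (FORMAT PARQUET, PARTITION_BY (year, month))"
)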

myyouaccounttube

Many thanks. We have a requirement to convert a huge CSV file to Parquet. Is it possible using a C# console program?

PYG

Thank you, Mark!
Can you also explain Parquet datasets?
I have been creating partitioned Parquet datasets using Pandas and Polars.

But I want to know how to read data from such a partitioned Parquet dataset directly into a Polars LazyFrame (not into Pandas, since the data is larger than memory) to do some analytics.

import polars as pl
import pyarrow.parquet as pq

# Read the partitioned Parquet dataset into an Arrow table (eager, in memory).
# A schema can be passed explicitly, but it is optional.
pq_df = pq.read_table(r"C:\Users\test_pl")

# Convert the Arrow table to a Polars DataFrame
pl_df = pl.from_arrow(pq_df)

Is there a better way to do this?
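
One way to scan a Hive-partitioned dataset lazily, keeping the data out of Pandas entirely, is to wrap a pyarrow dataset in a Polars LazyFrame. A sketch using the commenter's path, with hypothetical column names:

import polars as pl
import pyarrow.dataset as ds

# Describe the partitioned directory; partitioning="hive" turns key=value
# folder names into regular columns.
dataset = ds.dataset(r"C:\Users\test_pl", format="parquet", partitioning="hive")

# Nothing is read until .collect(); filters and column selections are pushed
# down into the scan, so only the needed files and columns are loaded.
lf = pl.scan_pyarrow_dataset(dataset)

result = (
    lf.filter(pl.col("year") == 2023)   # hypothetical partition column
      .group_by("category")             # hypothetical column
      .agg(pl.col("value").sum())
      .collect()
)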

kpyoutuber

Pandas works much better on unclean data.
How do you handle the pyarrow headache of data conversion errors?
ArrowInvalid: Could not convert '230' with type str: tried to convert to double

It makes many approaches unusable:

to_parquet()
converting the Pandas DataFrame to Polars
opening the CSV in Data Wrangler,
saving as Parquet in Data Wrangler
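
That ArrowInvalid error usually means a column holds mixed types (numbers stored as strings alongside real numbers), so pyarrow cannot pick a single type when to_parquet() hands the frame over. One workaround is to force the column to one dtype before writing; a sketch with a hypothetical column name "amount":

import pandas as pd

# Read the offending column as string so pandas does not infer a mixed dtype.
df = pd.read_csv("input.csv", dtype={"amount": "string"})

# Coerce to numeric; values that cannot be parsed become NaN instead of raising.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

df.to_parquet("output.parquet", engine="pyarrow")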

guocity

Why not a simple option in Excel to "save as" parquet? Why is this so hard?

Phoenixspin