Convert CSV to Parquet using pySpark in Azure Synapse Analytics

You've got a bunch of CSV files and you've heard of Parquet. How do you convert them for Azure Synapse Analytics? Patrick shows you how using pySpark.

pyspark DataFrame
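
A minimal sketch of the pattern shown in the video, assuming a Synapse notebook (where the spark session is pre-created) and placeholder storage account, container, and folder names:

# Read a CSV from ADLS Gen2 and write it back out as Parquet.
csv_path = "abfss://data@<storageaccount>.dfs.core.windows.net/raw/sales.csv"
parquet_path = "abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales"

df = spark.read.option("header", "true").csv(csv_path)
df.write.mode("overwrite").parquet(parquet_path)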



Want to take your Power BI skills to the next level? We have training courses available to help you with your journey.





#AzureSynapse #pySpark #GuyInACube
Comments

You saved my day ... Now I have to run the notebook in my pipeline.

Many Thanks.

juanpablolopezgonzalez

Those Lego people always seem to ask interesting questions. Cool video, keep it going! :)

dbraben

Hi! So helpful, thanks. I got an error, though: "'DataFrame' object has no attribute 'write'". Could you kindly explain if there is a step in between so I could avoid this error message? Thank you!!

luciannafloriano
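
A common cause of "'DataFrame' object has no attribute 'write'" is that the DataFrame is a pandas DataFrame (e.g. from pd.read_csv) rather than a Spark one; only Spark DataFrames expose .write. A minimal sketch of one fix, assuming that situation and a placeholder output path:

import pandas as pd

pdf = pd.read_csv("/tmp/sales.csv")      # pandas DataFrame: has no .write attribute
sdf = spark.createDataFrame(pdf)         # convert to a Spark DataFrame
sdf.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales")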

So, genuine question: is there a downside to always using Delta? Are there scenarios where you would save as Parquet without Delta?

deanrobinson
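
For context, the code difference between the two formats is small; a sketch with placeholder paths (Synapse Spark pools ship with the Delta Lake libraries):

df = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/sales.csv")

df.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales_parquet")            # plain Parquet
df.write.format("delta").mode("overwrite").save("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales_delta") # Delta: same Parquet data files plus a _delta_log folder

Plain Parquet can still make sense for one-off exports or when the downstream reader doesn't understand the Delta transaction log; Delta adds ACID updates, MERGE, and time travel on top of the same Parquet files.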

I'm unable to load parquet files on my PBI desktop since the new update

djalleleddinebouakaz

Is it worth paying for all this?

Currently just using the CSVs and dropping them in a SharePoint folder for PBI to query. We get new queries at least weekly

CS-clij

How would you do this with multiple CSV files in a storage account folder: convert them all to Parquet and store them in another folder? Thanks!

AmritaOSullivan
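
One approach, with placeholder folder names: point spark.read at the whole folder (or a wildcard), which loads every matching CSV into a single DataFrame as long as they share a schema, then write the result to the other folder as Parquet.

src = "abfss://data@<storageaccount>.dfs.core.windows.net/raw/csv/*.csv"
dst = "abfss://data@<storageaccount>.dfs.core.windows.net/curated/parquet"

df = spark.read.option("header", "true").csv(src)
df.write.mode("overwrite").parquet(dst)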

Can Excel files be converted to Parquet using a Synapse pipeline?

manjunathdharmatti

For anyone curious here, the header option when writing to Parquet isn't actually needed, and I would avoid it, as it's misleading to suggest it has any impact.

mi
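
That matches how the options behave: header is a CSV read option, while Parquet files carry their own schema, so a header option on the write side is simply ignored. A short sketch with placeholder paths:

df = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/sales.csv")   # header needed here
df.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales")             # no header option on the Parquet write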

How do you merge multiple CSV files with different columns into one Parquet file?
Ex: CSV1 - Column1, Column2, Column3
CSV2 - Column1, Column2, Column3, Column4, Column5
The Parquet file should be created with all 5 columns (Column1, Column2, Column3, Column4, Column5)

wenuraliyanage
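
One way to do this in PySpark, assuming Spark 3.1 or later and placeholder paths: read each CSV separately, combine them with unionByName(allowMissingColumns=True) so the missing columns are filled with nulls, and write a single Parquet output with all five columns.

df1 = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/csv1.csv")   # Column1..Column3
df2 = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/csv2.csv")   # Column1..Column5

merged = df1.unionByName(df2, allowMissingColumns=True)
merged.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/merged")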

Create a Parquet file (via PySpark <-- Synapse Spark pool or Databricks) as external SQL table storage?
Honestly I don't understand this stuff and the fashion for it (perhaps because I come from local SQL storage <-- SQL Server)
It feels like an architecture regression to me

christophehervouet

I think you misspelled your file format. You meant CSV —> DELTA! 😉

chadmacumber

Since we're still in the Synapse workspace, I would use a simple Copy activity pipeline; it's cheaper if you don't have a cluster up anyway

But of course this is more powerful and flexible

EmanueleMeazzo