Convert CSV to Parquet using pySpark in Azure Synapse Analytics

You've got a bunch of CSV files and you've heard of Parquet. How do you convert them for Azure Synapse Analytics? Patrick shows you how using pySpark.

pyspark DataFrame
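
A minimal sketch of the pattern shown in the video, assuming a Synapse notebook (where the spark session is pre-created) and placeholder storage account, container, and folder names:

# Read a CSV from ADLS Gen2 and write it back out as Parquet.
csv_path = "abfss://data@<storageaccount>.dfs.core.windows.net/raw/sales.csv"
parquet_path = "abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales"

df = spark.read.option("header", "true").csv(csv_path)
df.write.mode("overwrite").parquet(parquet_path)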



Want to take your Power BI skills to the next level? We have training courses available to help you with your journey.





#AzureSynapse #pySpark #GuyInACube
Comments

You saved my day ... Now I have to run the notebook in my pipeline.

Many Thanks.

juanpablolopezgonzalez

Those Lego people always seem to ask interesting questions. Cool video, keep it going! :)

dbraben

Hi! So helpful, thanks. I got an error, though: "'DataFrame' object has no attribute 'write'". Could you kindly explain if there is a step in between so I could avoid this error message? Thank you!!

luciannafloriano
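
A common cause of "'DataFrame' object has no attribute 'write'" is that the DataFrame is a pandas DataFrame (e.g. from pd.read_csv) rather than a Spark one; only Spark DataFrames expose .write. A minimal sketch of one fix, assuming that situation and a placeholder output path:

import pandas as pd

pdf = pd.read_csv("/tmp/sales.csv")      # pandas DataFrame: has no .write attribute
sdf = spark.createDataFrame(pdf)         # convert to a Spark DataFrame
sdf.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales")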

So, genuine question: is there a downside to always using Delta? Are there scenarios where you would save as Parquet without Delta?

deanrobinson
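
For context, the code difference between the two formats is small; a sketch with placeholder paths (Synapse Spark pools ship with the Delta Lake libraries):

df = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/sales.csv")

df.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales_parquet")            # plain Parquet
df.write.format("delta").mode("overwrite").save("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales_delta") # Delta: same Parquet data files plus a _delta_log folder

Plain Parquet can still make sense for one-off exports or when the downstream reader doesn't understand the Delta transaction log; Delta adds ACID updates, MERGE, and time travel on top of the same Parquet files.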

I'm unable to load parquet files on my PBI desktop since the new update

djalleleddinebouakaz

Is it worth paying for all this?

Currently just using the CSVs and dropping them in a SharePoint folder for PBI to query. We get new queries at least weekly

CS-clij

How would you do this with multiple CSV files in a storage account folder: convert them all to Parquet and store them in another folder? Thanks!

AmritaOSullivan
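
One approach, with placeholder folder names: point spark.read at the whole folder (or a wildcard), which loads every matching CSV into a single DataFrame as long as they share a schema, then write the result to the other folder as Parquet.

src = "abfss://data@<storageaccount>.dfs.core.windows.net/raw/csv/*.csv"
dst = "abfss://data@<storageaccount>.dfs.core.windows.net/curated/parquet"

df = spark.read.option("header", "true").csv(src)
df.write.mode("overwrite").parquet(dst)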

Can Excel files be converted to Parquet using a Synapse pipeline?

manjunathdharmatti

For anyone curious here, the header option when writing to Parquet isn't actually needed, and I would avoid it, as it's misleading to suggest it has any impact.

mi
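
That matches how the options behave: header is a CSV read option, while Parquet files carry their own schema, so a header option on the write side is simply ignored. A short sketch with placeholder paths:

df = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/sales.csv")   # header needed here
df.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/sales")             # no header option on the Parquet write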

How do you merge multiple CSV files with different columns into one Parquet file?
Ex: CSV1 - Column1, Column2, Column3
CSV2 - Column1, Column2, Column3, Column4, Column5
The Parquet file should be created with all 5 columns (Column1, Column2, Column3, Column4, Column5)

wenuraliyanage
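
One way to do this in PySpark, assuming Spark 3.1 or later and placeholder paths: read each CSV separately, combine them with unionByName(allowMissingColumns=True) so the missing columns are filled with nulls, and write a single Parquet output with all five columns.

df1 = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/csv1.csv")   # Column1..Column3
df2 = spark.read.option("header", "true").csv("abfss://data@<storageaccount>.dfs.core.windows.net/raw/csv2.csv")   # Column1..Column5

merged = df1.unionByName(df2, allowMissingColumns=True)
merged.write.mode("overwrite").parquet("abfss://data@<storageaccount>.dfs.core.windows.net/curated/merged")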

Create a Parquet file (via PySpark <-- Synapse Spark pool or Databricks) as external SQL table storage?
Honestly I don't understand this stuff and the fashion for it (perhaps because I come from local SQL storage <-- SQL Server)
It feels like an architecture regression to me

christophehervouet

I think you misspelled your file format. You meant CSV —> DELTA! 😉

chadmacumber

Since we're still in the Synapse workspace, I would use a simple Copy activity pipeline; it's cheaper if you don't have a cluster up anyway

But of course this is more powerful and flexible

EmanueleMeazzo