AWS Glue: Read CSV Files From AWS S3 Without Glue Catalog

This video shows how to read CSV data files stored in AWS S3 from AWS Glue when the data is not defined in the AWS Glue Data Catalog. It uses the create_dynamic_frame_from_options method, sketched below.
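
A minimal sketch of that pattern, assuming a Glue interactive session or job; the S3 path and format options are placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read CSV files directly from S3, bypassing the Glue Data Catalog.
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},  # hypothetical bucket
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)
dyf.printSchema()
```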

#aws, #awsglue
Comments

Thank you. This is very helpful. My use case is to take CSV files from S3, perform data-quality checks, and output them in Parquet format. I was planning to use PySpark in AWS, and I think this is a simple procedure I can follow to do the same.

akshitha
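
A sketch of the pipeline this commenter describes, continuing from the dyf read above; the quality rule, column name, and output path are assumptions:

```python
from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame for the data-quality check.
df = dyf.toDF()

# Example rule (assumption): drop rows where "id" is null.
clean_df = df.filter(df["id"].isNotNull())

# Back to a DynamicFrame, then write out as Parquet.
clean_dyf = DynamicFrame.fromDF(clean_df, glueContext, "clean_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=clean_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # hypothetical bucket
    format="parquet",
)
```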

I am so happy that I found this channel

Diminishstudioz

Hi buddy, this is a nice video, but everyone creates videos on reading and writing from S3.
1. Can you create a video on how to use a Glue Studio notebook (interactive session) to read data from the AWS Glue Catalog and write the results to S3?
2. Please include every step, i.e. what kind of permissions we need to create in order to read and write.
(I am getting a lot of permission-denied errors)


Also, I recommend doing a video on the Athena notebook editor reading data from the Glue Catalog using PySpark.
(Please also include detailed permission steps)

shashankreddy
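
For the first request above, a catalog-backed read in an interactive session looks roughly like the sketch below; the database and table names are placeholders. On permissions: the session role generally needs Glue and Catalog access (e.g. the AWSGlueServiceRole managed policy) plus S3 read/write on the buckets involved, and the caller needs iam:PassRole on that role.

```python
# Read a table already registered in the Glue Data Catalog
# (database and table_name are hypothetical).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# Write the result to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/results/"},  # hypothetical bucket
    format="parquet",
)
```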

Thank you very much for this video playlist. Please upload new videos on multiple conditions.

vvkk-vljw

One question here: why did the id column come through with datatype string instead of int/number? Is there any reason?

VivekKBangaru
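
Most likely because CSV carries no type information, so the reader brings every column in as a string unless it is cast afterwards. A sketch of such a cast with ApplyMapping; the field names are assumptions:

```python
from awsglue.transforms import ApplyMapping

# Cast the string "id" column to int; other listed fields pass through.
typed_dyf = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),  # hypothetical column
    ],
)
typed_dyf.printSchema()
```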

Hi, I'm getting an error while running the first default code. Please share the IAM role used to launch the notebook in AWS Glue.

sumanranjan

Thank you for this awesome explanation. Can I please request a video on how to implement Change Data Capture using Python? And secondly, how to automate Python pipelines to load data into the AWS cloud, say S3. Thanks.

PRI_Vlogs_Australia

Just came across a new scenario: can you please create a UDF in PySpark on AWS Glue? Needed the most.

udaynayak
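
A PySpark UDF works the same inside a Glue job as in plain Spark; a minimal sketch, where the function and column name are made up for illustration:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Register a plain Python function as a Spark UDF.
@udf(returnType=StringType())
def normalize_name(value):
    return value.strip().title() if value else None

df = dyf.toDF()
df = df.withColumn("name", normalize_name(df["name"]))  # "name" is hypothetical
df.show(5)
```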

I am getting an iam:PassRole "failed to start the session" error.
I do have the Glue console full-access policy attached to the IAM role.

himanshusingh-nvwn
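
That error usually means the identity starting the session lacks iam:PassRole on the Glue role itself; the Glue console policy alone does not grant it. A sketch of adding that permission with boto3; the role ARN and user name are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Allow the session user to pass the Glue execution role to the Glue service.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::123456789012:role/MyGlueRole",  # hypothetical
        "Condition": {"StringEquals": {"iam:PassedToService": "glue.amazonaws.com"}},
    }],
}
iam.put_user_policy(
    UserName="my-user",  # hypothetical identity starting the session
    PolicyName="AllowPassGlueRole",
    PolicyDocument=json.dumps(policy),
)
```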

Could you please let me know why you are using GlueContext when you are not using any of the Glue ETL functionality, and why you are using a DynamicFrame when you are not dealing with semi-structured or unstructured data? Any specific reason?

shashankemani

Which is the better option: reading via the Glue Catalog or directly from S3?
I'm working on a project where new data files are loaded into an S3 bucket every day (right now mostly Parquet files, but in the future there may be other formats). When the files are in S3, we trigger an AWS Glue job to read the data (via the Glue Catalog), transform it, and write it to another S3 bucket. But before starting the Glue job, we need to run the related crawlers on the new files (register new partitions, update the schema if there is any change, ...). Because of that, we have to create many crawlers and orchestrate them based on the event of the corresponding file being loaded into S3, and waiting for the crawlers to finish also costs time and money. Do you think we should keep doing that, or just read the files directly from S3? Is there any risk or performance difference between the two methods, or any other recommendation? Thank you very much.

tiktok

Is there any reason to avoid the Catalog? I'm just learning about Glue and I use the Catalog.

I have another question: I've tried to run a crawler on one CSV file in my S3 bucket, but when I check the new table, it doesn't recognize the column names. It shows col0, col1, col2, col3. Do you know why this happens, or how to solve it?

joelluis
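
On the second question above: the built-in CSV classifier only detects a header when it can tell the first row apart from the data rows; otherwise it falls back to col0, col1, ... A custom CSV classifier with the heading option set fixes the crawler, or the header can be declared when reading directly, as in this sketch (the path is a placeholder):

```python
# Declare the header explicitly instead of relying on crawler inference.
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},  # hypothetical path
    format="csv",
    format_options={"withHeader": True},
)
```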

What IAM role should I choose when creating an ETL job in a Jupyter notebook to write this code?

devanshaggarwal

Hello, great video. Thank you.

So, a question: when I run .printSchema(), the notebook returns:

root

++
||
++
++

and when I review the file, it has a header. What happened?
Thank you for your answer.

alejandrasilva

Thank you for this video. I am getting a "glueContext not defined" error, even though when starting a notebook in AWS Glue it is supposed to be set up automatically.

Thank you

malvika
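
That usually means the session boilerplate never actually ran (or the session failed to start); in that case glueContext can be created explicitly, roughly like this:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
```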

Hello there. My CSV has a lot of non-UTF-8 characters. How can I ignore them while loading, since it's throwing the error "unable to parse the file"?

powerspan
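
If the source encoding is known (e.g. Latin-1), one workaround is to read with Spark's CSV reader and an explicit encoding instead of the default UTF-8, dropping rows that still fail to parse; the path and encoding below are assumptions:

```python
df = (
    spark.read
    .option("header", True)
    .option("encoding", "ISO-8859-1")      # assumed source encoding
    .option("mode", "DROPMALFORMED")       # skip rows that still fail to parse
    .csv("s3://my-bucket/input/data.csv")  # hypothetical path
)
```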

Can anyone please help me? I have some non-ASCII characters in my file in S3. How can I remove those junk characters from the file using AWS Glue? Please help.

jomymcet
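
One way to strip non-ASCII characters with plain Spark functions inside a Glue job, then write the cleaned data back to S3; the column name and paths are placeholders:

```python
from pyspark.sql.functions import regexp_replace

df = dyf.toDF()
# Remove any character outside the ASCII range from the "name" column.
clean = df.withColumn("name", regexp_replace("name", "[^\\x00-\\x7F]", ""))
clean.write.mode("overwrite").csv("s3://my-bucket/clean/", header=True)
```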

Sorry, what is wrong with df = spark.read.csv(path)?

bk
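
Nothing, for well-behaved structured CSV; it returns a plain DataFrame. The DynamicFrame route mainly adds Glue-specific features (choice types for inconsistent schemas, job bookmarks, the Glue transforms). The plain-Spark equivalent, for comparison (the path is a placeholder):

```python
# Plain Spark works fine when the schema is consistent; inferSchema even
# gives typed columns, unlike the all-string CSV DynamicFrame read.
df = spark.read.csv("s3://my-bucket/input/", header=True, inferSchema=True)
df.printSchema()
```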

Do you have any course related to this content?

yagnasivasai

How can I update the file and store it again in S3?

patilharss
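
On the last question above: S3 objects cannot be edited in place; the usual pattern is read, transform, and write a new object (or overwrite the prefix). A minimal sketch, continuing from the dyf read above, with a made-up transformation and output path:

```python
# Read, modify, and write back; mode("overwrite") replaces the old output.
df = dyf.toDF()
updated = df.withColumn("id_present", df["id"].isNotNull())  # hypothetical change
updated.write.mode("overwrite").csv("s3://my-bucket/updated/", header=True)
```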