AWS Tutorials - Working with Data Sources in AWS Glue Job


When writing an AWS Glue ETL job, the question arises whether to fetch data from the data source directly or via the Glue Data Catalog entry for that source. The video explains why using the Data Catalog is recommended. It also includes a demo showing data access from S3, PostgreSQL, and Redshift through the Glue Data Catalog.
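The catalog-based read shown in the demo can be sketched as follows. This is a minimal, hedged example assuming a Glue job with a catalog database `demo_db` whose tables were crawled from S3, PostgreSQL, and Redshift; all database, table, and bucket names here are hypothetical.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Reading via the Data Catalog: schema, format, and connection details
# come from the catalog, so the job code looks the same regardless of
# the backing store (S3, PostgreSQL, or Redshift).
s3_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="demo_db", table_name="s3_employees")

pg_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="demo_db", table_name="postgres_employees")

# Redshift reads additionally need a temporary S3 staging directory.
rs_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="demo_db", table_name="redshift_employees",
    redshift_tmp_dir="s3://my-temp-bucket/redshift/")
```

The payoff of going through the catalog is that the job never hard-codes paths, JDBC URLs, or schemas; changing the source only requires re-crawling.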
Comments

Slowly this is getting tough. No doubt it's a detailed, awesome video, but confusion keeps building around Lake Formation — between Lake Formation and the data lake itself.

shamstabrez

Thank you so much for the video; I think it is very useful. I have a few questions — apologies if you have already covered them in other videos. 1. On Glue crawler jobs: assume an ETL ingests a particular data source into the data lake, producing a data file each run. What is your recommendation on the crawler's schedule: run it every time an ETL output file appears, or once a day (if the ingestion frequency is high) to keep the cost low?
2. On Glue crawler connections such as the Redshift connection and the JDBC connection in the tutorial: can a single connection be used by multiple Glue jobs simultaneously, i.e., does each Glue job create its own instance of the connection?
3. In the video, at 38:54, a Glue job populated a table "employmentmini" in the RDS database, but I did not see a primary key created on the table in Postgres anywhere in the notebook code. Does this mean Postgres doesn't enforce a primary key on a table created by a Glue job via a Glue connection?

hsz

Hi AWS-Tutorials, can you please help me with the situation below?
Expected output: S3 to SQL Server (hosted on a Windows server).

Question 1) If multiple files arrive in S3 daily, will the crawler create the same number of tables in the catalog?

Question 2) To store all the daily data arriving in S3, do we need as many ETL jobs as there are catalog tables?

I know it's hard to reply to every question; I'm just hoping you will reply to mine! Thanks in advance ❤️

deepakbhardwaj

I have one question: if using the Data Catalog is the recommended approach, how do you handle a daily load arriving at the data source using a crawler? I'm finding it difficult to handle daily loads with a Glue crawler.

deepakbhutekar

Glue does not provide a way to supply a SQL query when extracting data, so for that we need to use Spark. I'm getting a communications link failure when I try to read MySQL with Spark.
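For the query-pushdown part of this comment, a plain Spark JDBC read can be sketched as below. This is a hedged example: the hostname, database, credentials, and query are placeholders, and it assumes the MySQL Connector/J driver is on the Spark classpath. A "communications link failure" usually means the job cannot reach the database at the network level — worth checking the Glue connection's VPC/subnet, a self-referencing inbound rule on the security group, and that port 3306 is open.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-query-read").getOrCreate()

# The "query" option pushes the SQL down to MySQL, so only the
# selected rows and columns cross the network.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mydb.example.com:3306/sales")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("query", "SELECT id, amount FROM orders WHERE amount > 100")
      .option("user", "etl_user")
      .option("password", "***")
      .load())
```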

SheetalPandrekar

Hi,
if cross-account access is provided but the classification of the table is unknown (it is supposed to be Parquet), how do we handle this?
Without a classification, the job throws the error "No classification for table".

swapnilkulkarni

Thank you for the amazing content. My data source is an RDS PostgreSQL database, and I want to create a connection to it from another AWS account — how can I do this? The data source lives in a different AWS account, and when I try to connect to it from my account it doesn't work. Recommendations would be highly appreciated.

imransadiq

Hi, I am trying to consume a Data Catalog from a different AWS account into the current account, write a transformation that joins both catalogs on a common ID field, and store the resulting catalog in the current AWS account. Here is an example:
AWSAccount1 has DataCatalog1, and AWSAccount2 (the current AWS account) has DataCatalog2.
I want to write a transformation with a join such as
DataCatalog1.Table1.empid = DataCatalog2.Table2.empid
and store the merged catalog as DataCatalog3.Table3 in the current account.
Basically, I want to merge the two data catalogs into a single, bigger Data Catalog.
AWSAccount1 only shares its Data Catalog; we do not know much about the data internals.
Is it possible to do it this way? I hope we can. What steps do I need to achieve this requirement? Your quick help is greatly appreciated. We can do this in Athena, but we want to perform this activity in Glue Studio.

sankarsb

Hi, how about SharePoint as a source? Is that possible with an AWS Glue job? And would it be a JDBC connection or an API?

quezobars

And please elaborate: what are a DynamicFrame and a DataFrame?
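In brief, and sketched under the assumption of a Glue job context (the catalog names below are hypothetical): a DynamicFrame is Glue's schema-flexible record collection — each record carries its own schema, so inconsistent source types surface as "choice" types that can be resolved later — while a Spark DataFrame has one fixed schema and the full Spark SQL API. The two convert both ways:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a catalog table as a DynamicFrame (hypothetical names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="demo_db", table_name="employees")

# Convert to a Spark DataFrame for SQL-style transformations...
df = dyf.toDF()

# ...and back to a DynamicFrame for Glue writers and transforms.
dyf2 = DynamicFrame.fromDF(df, glue_context, "dyf2")
```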

shamstabrez