AWS Glue 101 | Lesson 1: The Glue Data Catalog And Crawlers


00:00 - Intro
00:24 - What is the AWS Glue Data Catalog?
00:36 - What is a metadata repository?
00:53 - What is metadata information?
01:18 - How do we collate the metadata?
01:43 - AWS Crawler
02:01 - When do we use the Data Catalog?
03:32 - Interacting With The Glue Data Catalog
04:12 - What the tutorial will cover
04:34 - Hands on Tutorial
04:52 - S3 configuration
08:16 - Creating a database
08:56 - Setting up a crawler
12:28 - Recap
12:59 - Bonus: Athena

In this series of videos we take a look at AWS Glue. We mix theory with practice as we build a functioning ETL application using the Glue Data Catalog, Crawlers, Glue ETL, Triggers, Workflows, and Dev Endpoints.

In this video we configure our S3 bucket to act as our data repository, ingest data, register that data in the Glue Data Catalog using a crawler, and finally use Athena to query the newly ingested data.
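The crawler step above can be sketched programmatically as well as through the console. Below is a minimal, hypothetical sketch of the configuration a Glue crawler registration takes; the bucket name, database name, and IAM role ARN are placeholders, not values from the video, and the actual boto3 calls are shown commented out since they require AWS credentials:

```python
# Hypothetical Glue crawler configuration for an S3 data store.
# All names (bucket, database, role) are placeholders.
crawler_config = {
    "Name": "customer-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    "DatabaseName": "tutorial_db",  # the catalog database the tables land in
    "Targets": {
        # The crawler scans this prefix and infers a schema per table it finds.
        "S3Targets": [{"Path": "s3://my-tutorial-bucket/customers/"}]
    },
}

# With boto3 installed and credentials configured, the calls would be roughly:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])
```

Once the crawler run completes, the inferred table appears in the named catalog database and can be queried from Athena or used by Glue ETL jobs.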

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

Data integration is the process of preparing and combining data for analytics, machine learning, and application development. It involves multiple tasks, such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. These tasks are often handled by different types of users that each use different products.

AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code. With AWS Glue Elastic Views, application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.
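Since the bonus section of the video queries the crawled table with Athena, here is a hedged sketch of what that looks like programmatically. The database, table, bucket, and result prefix are placeholders; note that Athena requires a query result location because it writes every query execution's output to S3:

```python
# Hypothetical Athena query against the table the crawler created.
# Database/table/bucket names are placeholders, not from the video.
query = "SELECT * FROM tutorial_db.customers LIMIT 10;"

athena_request = {
    "QueryString": query,
    # Athena writes each query execution's results as objects under this
    # prefix, which is why a result location must be configured up front.
    "ResultConfiguration": {
        "OutputLocation": "s3://my-tutorial-bucket/athena-results/"
    },
}

# With boto3 and credentials configured, the call would be roughly:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**athena_request)
```

Results accumulate at the output location (one result set per query execution), so it is common to point it at a dedicated prefix with a lifecycle rule.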

😎 About me
I have spent the last decade immersed in the world of big data, working as a consultant for some of the globe's biggest companies. My journey into data was not the most conventional: I started my career as a performance analyst in professional sport at the top levels of both rugby and football, then transitioned into a career in data and computing. That journey culminated in a Master's degree in software development, alongside a number of professional certifications in AWS and MS SQL Server.
Comments

best video of Glue on Youtube. Thanks Johnny

josephattabenninjr

Keep up the good work, you'll be viral soon.

adityarajora

Johnny, following along step by step with your detailed tutorial. Super helpful :)

jadenguyen

Fantastic 101 Glue session! Good Job Johnny

karamveerhooda

Thanks for the explanation! Something is not yet clear to me: at min 13:29, why is there a need to set the query result location? Does this mean that every Athena query performed on this catalog table is saved at this location, or only the latest query results? Thanks in advance!

Incognitowil

Awesome video, to the point and with a clear explanation.

The_Bold_Statement

Great video and channel! Keep'em coming, buddy!

tommysera

Ty so much for sharing your experience 💜 your insightful content!

nikozerk

So, what happens if, in the same folder (for example your customer folder), we have two CSVs, each with a different schema? Will the crawler create two tables?

katsouranis

Hey Johnny, I have a question. If you produce new data and want new tables with each crawler run, is that possible, or would you need to create a new crawler per external table you want produced?

lotannanweke

Can we read from the catalog, use Glue ETL (Spark), and save into a new Glue Catalog table without using S3 and a crawler? Glue Catalog -> Glue ETL -> Glue Catalog?

mehedeehassan

So if you have multiple csv files that contain different data, do you set up different data stores for each file, or will one data store handle the different schemas?

ws

How do you add column names in camel case in the Glue Catalog?

saibaba