How to Upload On-Premise Database Data to AWS S3 | Build a Data Lake | Python

A data lake is centralized cloud storage in which you can store all of your data, both structured and unstructured, at any scale. This platform is fast becoming the standard for users looking to store and process big data. In this video we cover how to build an AWS S3 data lake from an on-premise SQL Server database. S3 is an easy-to-use data store, and we use it to load large amounts of data for later analysis.
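
The workflow shown in the video boils down to reading a table from SQL Server into pandas and writing it to an S3 bucket with boto3. The exact code in the video may differ; the sketch below uses placeholder server, database, table, and bucket names:

import io

import boto3
import pandas as pd
import pyodbc

# Connect to the on-premise SQL Server (placeholder server/database names).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksDW;Trusted_Connection=yes;"
)

# Read a table into a DataFrame.
df = pd.read_sql("SELECT * FROM DimProduct", conn)

# Serialize to CSV in memory and upload to S3. boto3 picks up the access keys
# of the programmatic-access user created earlier (e.g. via aws configure).
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)

s3 = boto3.client("s3")
s3.put_object(Bucket="my-data-lake-bucket", Key="DimProduct.csv", Body=csv_buffer.getvalue())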

Subscribe to our channel:

---------------------------------------------
Follow me on social media!

---------------------------------------------

#AWS #S3 #DataLake

Topics covered in this video:
0:00 - Intro: data lake from on-premise to AWS S3
1:03 - Create S3 user with programmatic access
2:37 - Create S3 bucket
3:04 - Python setup
3:56 - Read data from SQL Server
5:04 - Load Data to S3 Bucket
6:59 - Code Demo
7:36 - Review S3 Data Lake

Comments

Hey, how can I automate this process, so that after a certain time the code runs automatically and uploads the SQL data to S3?

kshitijbansal
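
One way to automate a run like this is a scheduler around the existing script; the sketch below uses the third-party schedule package, with run_load standing in for the extract-and-upload function from the video (cron or Windows Task Scheduler pointed at main.py would work just as well):

import time

import schedule  # pip install schedule

def run_load():
    # Placeholder: call the existing SQL Server -> S3 load here.
    print("running SQL Server -> S3 load")

# Run the load every day at 02:00 local time.
schedule.every().day.at("02:00").do(run_load)

while True:
    schedule.run_pending()
    time.sleep(60)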

Thanks for the informative content! So how would you deal with large tables, especially for the initial loads? Assuming a table is 200-300 GB, selecting all the data and keeping it in data frames / in-memory objects doesn't look practical, so I believe defining a batch key/partitions on the source side and iterating over them in the code could be a way.

ahmetaslan
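
The batching idea described in the comment above can be sketched with pandas' chunksize option, which streams the query result in fixed-size row batches so only one batch is in memory at a time; each batch is uploaded as its own object (connection details, table, and bucket names are placeholders):

import io

import boto3
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksDW;Trusted_Connection=yes;"
)
s3 = boto3.client("s3")

# Stream the table in 100k-row batches instead of loading it all at once.
for i, chunk in enumerate(pd.read_sql("SELECT * FROM FactSales", conn, chunksize=100_000)):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False)
    s3.put_object(
        Bucket="my-data-lake-bucket",
        Key=f"FactSales/part_{i:05d}.csv",
        Body=buf.getvalue(),
    )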

Thanks for the great video! Instead of CSV, would you recommend uploading the same structured data as Parquet?

uatcloud
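
Parquet is usually a good choice for structured data like this (smaller objects, column pruning, and types preserved). A minimal sketch of uploading a DataFrame as Parquet instead of CSV, assuming pyarrow is installed and using a placeholder bucket name:

import io

import boto3
import pandas as pd

df = pd.DataFrame({"product_id": [1, 2], "name": ["bike", "helmet"]})  # stand-in data

# Write Parquet to an in-memory buffer (pandas delegates to pyarrow).
buf = io.BytesIO()
df.to_parquet(buf, index=False)

s3 = boto3.client("s3")
s3.put_object(Bucket="my-data-lake-bucket", Key="DimProduct.parquet", Body=buf.getvalue())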

Where would you run main.py if you don't want to run it locally?

AnshuJoshi-ohio

I am working with MS SQL; the database is on AWS RDS and contains billions of rows. I want to extract and load the data to S3 via Glue. How can I do it?

mukeshgupta
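
At that scale a Glue job can read the RDS table and write straight to S3 without pulling everything through one machine. A rough sketch of a Glue PySpark job, assuming the table has already been crawled into the Glue Data Catalog; the database, table, and S3 path names are placeholders:

# Sketch of an AWS Glue PySpark job (runs inside Glue, not locally).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the RDS SQL Server table via the Glue Data Catalog (placeholder names).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="dbo_fact_sales",
)

# Write the data to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-bucket/fact_sales/"},
    format="parquet",
)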

I ran into a problem like this:
importing rows 0 to 606 for table DimProduct
Data load error: name 'upload_file_bucket' is not defined
How do I solve this? Thanks.

ihab
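
That NameError usually just means the variable holding the bucket name is referenced before it is assigned, or is spelled differently where it is defined. Without seeing the full script this is only a guess, but the fix is typically along these lines (bucket name is a placeholder):

import boto3

upload_file_bucket = "my-data-lake-bucket"  # define (and spell) this before it is referenced

s3 = boto3.client("s3")
s3.put_object(Bucket=upload_file_bucket, Key="DimProduct.csv", Body="col1,col2\n1,2\n")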

What if the tables are so big that you cannot load all the data locally in Pandas? It would be great to show how to batch the table to S3 using SQL and Pandas or SQL and PySpark. I am currently using a Docker container, but my source is so big that even with all of my Mac's computing power allocated to the container, the transfer still fails with an OOM 137 from Docker.

jasonp
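
One way to keep memory flat regardless of table size is to page through the table on the SQL Server side with ORDER BY ... OFFSET/FETCH and push each page to S3 before reading the next (pandas' chunksize argument achieves much the same thing). A sketch with placeholder names, key column, and page size:

import io

import boto3
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksDW;Trusted_Connection=yes;"
)
s3 = boto3.client("s3")

page_size = 500_000
offset = 0
page = 0
while True:
    # Page through the table server-side; only one page is ever held in memory.
    query = (
        "SELECT * FROM FactSales ORDER BY SalesKey "
        f"OFFSET {offset} ROWS FETCH NEXT {page_size} ROWS ONLY"
    )
    df = pd.read_sql(query, conn)
    if df.empty:
        break
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    s3.put_object(
        Bucket="my-data-lake-bucket",
        Key=f"FactSales/page_{page:05d}.csv",
        Body=buf.getvalue(),
    )
    offset += page_size
    page += 1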

I have 4 different CSV files in S3 and I need to load them into four different tables in Redshift. Can you tell me how this is possible using a Lambda function?

muppallavenkatadri
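
A common pattern is a Lambda that issues one Redshift COPY per file, for example through the Redshift Data API. A rough sketch with placeholder cluster, database, table, bucket, and IAM role names (an S3 event trigger or a schedule would invoke it):

import boto3

redshift = boto3.client("redshift-data")

# Map each S3 file to its target table (all names are placeholders).
FILES_TO_TABLES = {
    "s3://my-data-lake-bucket/customers.csv": "staging.customers",
    "s3://my-data-lake-bucket/orders.csv": "staging.orders",
    "s3://my-data-lake-bucket/products.csv": "staging.products",
    "s3://my-data-lake-bucket/sales.csv": "staging.sales",
}

def lambda_handler(event, context):
    for s3_path, table in FILES_TO_TABLES.items():
        redshift.execute_statement(
            ClusterIdentifier="my-cluster",
            Database="dev",
            DbUser="awsuser",
            Sql=(
                f"COPY {table} FROM '{s3_path}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role' "
                "FORMAT AS CSV IGNOREHEADER 1;"
            ),
        )
    return {"status": "copies submitted"}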

Can I move any type of file using this?

AnanthuS-miqm

I have these two databases from Amazon AWS S3 in my on-premise SQL Server.

Is there a way I can migrate this entire DB on Amazon S3 to Snowflake?

socialawareness
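
Snowflake can load directly from S3 with COPY INTO, so the data does not need to pass through the on-premise server again. A rough sketch using the snowflake-connector-python package; the account, credential, table, and bucket values are all placeholders:

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

# Load CSV files that already sit in S3 straight into a Snowflake table.
conn.cursor().execute(
    """
    COPY INTO my_table
    FROM 's3://my-data-lake-bucket/exports/'
    CREDENTIALS = (AWS_KEY_ID='<key id>' AWS_SECRET_KEY='<secret key>')
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """
)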

Hi, could you show us in detail how to connect to MSSQL?

montassarbenkraiem
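
The connection itself is usually just a pyodbc connection string. A minimal sketch showing Windows (trusted) authentication, with the server, instance, and database names as placeholders; SQL Server authentication would use UID=...;PWD=... instead of Trusted_Connection:

import pyodbc

# Windows (trusted) authentication -- typical for an on-premise server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost\\SQLEXPRESS;"      # server\instance (placeholder)
    "DATABASE=AdventureWorksDW;"
    "Trusted_Connection=yes;"
)

print(conn.cursor().execute("SELECT @@VERSION").fetchone()[0])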

Is it possible to update an existing file on S3 line by line?

KeshavChoudhary-dxxd
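
S3 objects cannot be edited in place; an "update" means reading the object, changing it, and writing the whole object back (or writing a new key, possibly with versioning enabled). A minimal sketch of that read-modify-write pattern with placeholder bucket and key names:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-data-lake-bucket", "DimProduct.csv"

# Read the existing object, append a line, and overwrite it -- S3 has no in-place edits.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
body += "999,New Product\n"
s3.put_object(Bucket=bucket, Key=key, Body=body)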

What's your email? I have some questions for you.

kwabenau

Excellent content, just what I need to know! Looking forward to your Airflow videos next

richardhoppe

Hey Haq, how do we do incremental loads in this pipeline? Do we need to rewrite all the new data into S3 again, or is there a way to make field-level changes/inserts/deletes on existing buckets? I am assuming this is where data lake file formats such as Apache Hudi come into the picture; correct me if I am wrong, and please walk me through the workaround process.

jaswanth
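
A common approach without a table format is a watermark-based incremental extract: remember the last ModifiedDate (or key) already loaded, pull only newer rows, and write them as a new timestamped object, leaving existing objects untouched; updates and deletes are then resolved downstream. Formats such as Apache Hudi, Iceberg, or Delta Lake do support record-level upserts and deletes on S3, but they sit on top of Spark rather than plain boto3. A sketch of the watermark approach with placeholder table, column, and bucket names:

import io
from datetime import datetime, timezone

import boto3
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksDW;Trusted_Connection=yes;"
)
s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"

# The watermark is the last ModifiedDate already loaded; here it is kept as a
# small object in the same bucket and created on the first run.
watermark_key = "watermarks/FactSales.txt"
try:
    last_loaded = s3.get_object(Bucket=bucket, Key=watermark_key)["Body"].read().decode()
except s3.exceptions.NoSuchKey:
    last_loaded = "1900-01-01 00:00:00"

# Pull only rows changed since the last load.
df = pd.read_sql(
    "SELECT * FROM FactSales WHERE ModifiedDate > ?", conn, params=[last_loaded]
)

if not df.empty:
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    s3.put_object(Bucket=bucket, Key=f"FactSales/incremental_{stamp}.csv", Body=buf.getvalue())
    s3.put_object(Bucket=bucket, Key=watermark_key, Body=str(df["ModifiedDate"].max()))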