Realtime Socket Streaming with Apache Spark | End to End Data Engineering Project

In this video, you'll build a real-time data streaming pipeline over a dataset of 7 million records, using a stack that includes a TCP/IP socket, Apache Spark, an OpenAI large language model (LLM), Kafka, and Elasticsearch.

📚 What You'll Learn:
👉 Setting up and configuring a TCP/IP socket for data transmission.
👉 Streaming data from the socket with Apache Spark.
👉 Real-time sentiment analysis with an OpenAI LLM (ChatGPT).
👉 Prompt engineering.
👉 Setting up Kafka for real-time data ingestion and distribution.
👉 Using Elasticsearch for efficient data indexing and search.
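
To make the first two bullets concrete: the "socket source" in this project is simply a TCP server that writes one record per line for Spark to read. Here is a minimal Python sketch of that idea — the port, the field names, and the function name are illustrative assumptions, not the video's exact code:

```python
import json
import socket
import time

def stream_records(records, host="127.0.0.1", port=9999, delay=0.0):
    """Serve newline-delimited JSON records over a TCP socket,
    mimicking the socket source the Spark stream reads from.
    (Port and record shape are placeholder assumptions.)"""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    conn, _ = server.accept()      # block until one client (e.g. Spark) connects
    with conn:
        for record in records:
            # one JSON document per line, newline-terminated
            conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
            time.sleep(delay)      # optional throttle to simulate a live feed
    server.close()
```

On the Spark side, `spark.readStream.format("socket").option("host", "127.0.0.1").option("port", 9999).load()` would consume these lines, and `from_json` would parse each one against a schema.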

✨ Timestamps: ✨
00:00 Introduction
01:10 Creating a Spark master-worker architecture with Docker
10:40 Setting up the TCP/IP socket source stream
23:25 Setting up the Apache Spark stream
42:56 Setting up a Kafka cluster on Confluent Cloud
47:12 Getting keys for the Kafka cluster and Schema Registry
1:12:53 Real-time sentiment analysis with OpenAI LLM (ChatGPT)
1:24:10 Setting up an Elasticsearch deployment on Elastic Cloud
1:30:50 Real-time data indexing on Elasticsearch
1:36:05 Testing and results
1:41:50 Outro

🌟 Please LIKE ❤️ and SUBSCRIBE for more AMAZING content! 🌟

🔗 Useful Links and Resources:

✨ Tags ✨
Data Engineering, Apache Airflow, Kafka, Apache Spark, Cassandra, PostgreSQL, Zookeeper, Docker, Docker Compose, ETL Pipeline, Data Pipeline, Big Data, Streaming Data, Real-time Analytics, Kafka Connect, Spark Master, Spark Worker, Schema Registry, Control Center, Data Streaming, Real-time Data Streaming, OpenAI LLM, Elasticsearch, Data Processing, Data Analytics, TCP/IP, Streaming Solutions, Data Ingestion, Real-time Analysis, Spark Configuration, OpenAI Integration, Kafka Topics, Elasticsearch Indexing, Data Storage, Stream Processing, Machine Learning Integration

✨ Hashtags ✨
#confluent #DataEngineering #TCP #TCPIP #sockets #socketstreaming #Kafka #ApacheSpark #Docker #ETLPipeline #DataPipeline #DataStreaming #OpenAI #Elasticsearch #RealTimeData #BigData #TechTutorial #StreamingAnalytics #MachineLearning #DataFlow #SparkStreaming #DataScience #AIIntegration #RealTimeAnalytics #StreamingData #realtimestreaming #realtime
Comments

Thanks for watching! Hit the LIKE button, SUBSCRIBE and comment for wider reach 🥺🙏

CodeWithYu

Your channel is very helpful and effective. I am learning a lot from you.

judyramphele

Your content is so helpful. I hope your channel grows to a million subscribers.

pratiknarendraraut

I really enjoyed going through this video. Very informative. Thank you very much

ataimebenson

Another awesome piece from you, and a great contribution to the data engineering community.

travelwithshayan

Well done @CodeWithYu. This is elaborate and I love it. I was following along at the beginning but got lost along the way, so I've resolved to watch it several times to understand it better.

adebisiabioduntedvideo

You are definitely one of the best professionals at sharing knowledge properly. You will help all of us boost our data engineering skills.

I've just watched this tutorial twice in order to understand the architecture and workflow correctly before starting to code. It will surely help me bring my portfolio to the next level, and I will mention you, of course.

Regarding this project, where is the Spark DataFrame stored? Is it kept in cache or in the Docker image volume?

About visualization, I have good knowledge of Power BI but not of connecting it with Elasticsearch. I would appreciate any suggestions, although I will explore Elasticsearch-to-Power-BI connectors by myself.

Thanks for your unselfish teaching.
Regards

RafaVeraDataEng

This is awesome. Thanks for the great content.

____prajwal____

Thanks for your videos. Can you make a project using Spark, Kafka, and Jenkins for CI/CD and test automation?

jmagames

Quick question:

Why do you submit the Spark job separately? You initially ran the socket streaming as an independent process, and later you mentioned submitting it to the spark-worker, but eventually you submitted it to the spark-master itself. I just wanted to understand the motive behind that.

Great project though. Keep up the good work.

saikirannukala

Thank you so much, this is so informative

RecaAtoz

Perfect! But I have a question: since you have all of the .py files and other files, why not just run Spark locally?

Sakasiton

Some useful solutions:

For the Schema Registry: scroll down => CLI & Tools => Kafka Connect => create Schema Registry API key => 4 => Generate config => scroll down to the bottom of the code to copy the Schema Registry URL.

ataimebenson

Can I ask how you run the streaming PySpark job? I saw you spinning up Spark with Docker Compose, but how do we submit the PySpark streaming job to the spun-up containers?

HaiDo-bd
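
One common way to do what the question above asks is to exec into the master container and spark-submit against the cluster URL. This is only a sketch, assuming the Compose service is named `spark-master`, the master listens on port 7077, and the job lives at `jobs/spark_streaming.py` (all assumptions; check your own docker-compose file):

```shell
# Submit the streaming job from inside the spark-master container.
# Service name, port, and script path are placeholders for your own setup.
docker exec -it spark-master \
  spark-submit \
    --master spark://spark-master:7077 \
    jobs/spark_streaming.py
```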

Great video, mate! Do you have a video on how to set up Kafka on bare metal?

sclem

Hi, I am a beginner. I love your channel and the knowledge you share. Love from India!
I have a doubt: in the docker-compose file, I'm unsure about the network. Can I use the same network given, code-with-yu, or should I use a different network name for this project?
If I have to use a different one, how do I do that?

DivineSam-wm
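
On the network question above: Docker Compose network names are arbitrary; any name works as long as every service joins the same network. A minimal sketch (the service and network names here are placeholders, not the video's exact file):

```yaml
# Any network name works as long as all services share it.
services:
  spark-master:
    networks:
      - datastream    # placeholder name; rename freely
  spark-worker:
    networks:
      - datastream
networks:
  datastream:
    driver: bridge
```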

Hey Yusuf, amazing video! But I'm facing some errors.

jaysinhpadhiyar

Thanks for your awesome content. About visualization: I want to use Kibana to draw a line chart plotting the ratio of positive to negative reviews in real time, where the x-axis is timestamps and the y-axis is a percentage. How can I do that? Please help me.

anhminh-jtql

In the "Getting Keys for Kafka Cluster and Schema Registry" part, at exactly 50:28, I couldn't find the "create Schema Registry API key" option below "create Kafka Cluster API key" in the second step under the Clients section. What should I do? Many thanks.

yasminemasmoudi

Can you please make a video on how to set up the environment?

ShubhamKumar-zwoq