Master Databricks and Apache Spark Step by Step: Lesson 13 - Using SQL Joins

In this video, you learn how to perform joins using Spark Structured Query Language (SQL). Spark SQL is the most performant way to do data engineering on Databricks and Spark. I'll explain the concepts and demonstrate them with code in a Databricks notebook.
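For reference, a Spark SQL join reads like standard ANSI SQL. A minimal sketch, assuming two hypothetical tables (sales and products) rather than the notebook's actual tables:

-- Join a fact table to a dimension table on a shared key,
-- as you would run it in a Databricks notebook SQL cell.
SELECT s.sale_id,
       s.amount,
       p.product_name
FROM sales AS s
INNER JOIN products AS p
  ON s.product_id = p.product_id;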

Get Master Azure Databricks Step by Step at

Example Notebook for lesson 13 at:

You need to unzip the file and import the notebook into Databricks to run the code.

Video on Creating and Loading the tables used in this video

Video on Dimensional Modeling - with an explanation of Snowflake Schema
Comments

Hello there, I found your presentation interesting. If it helps, I can provide another use case where a cross join is useful.
For instance, you have a table of cars with their attributes and you want to compare them and give a comparison score.

Scoring could be as follows:
- Checking if both cars have the same transmission type (manual or automatic)
- Checking if both cars have the same fuel type
- Comparing the HP (closest gives the highest score)
- And so on...
- For each attribute we can give a score, which adds up to a total between each pair of cars

To do it, you can create the Cartesian product of the cars.
Then we filter the list where car_id_x <> car_id_y (so we don't compare the cars with themselves).
We then obtain the score for each permutation as mentioned.
Then we can order by score DESC.

At the end we obtain a list, ordered by score DESC, of all the other cars for each car!

oliviersac
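A minimal Spark SQL sketch of the cross-join scoring idea described above; the cars table, its columns, and the scoring weights are hypothetical:

-- Hypothetical table cars(car_id, transmission, fuel_type, hp).
-- CROSS JOIN produces every pairing; the WHERE clause drops
-- the rows that pair a car with itself.
SELECT a.car_id AS car_id_x,
       b.car_id AS car_id_y,
       CASE WHEN a.transmission = b.transmission THEN 1 ELSE 0 END
         + CASE WHEN a.fuel_type = b.fuel_type THEN 1 ELSE 0 END
         + 1 / (1 + ABS(a.hp - b.hp))  -- closer horsepower scores higher
         AS score
FROM cars AS a
CROSS JOIN cars AS b
WHERE a.car_id <> b.car_id
ORDER BY score DESC;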

This video helped me get noticed in week 2 of a new job, working in a program new to me, Databricks. You explained these joins very well. Great video!!

moe

I'm learning SQL thanks to you. This video clarified my ideas on how to use SQL joins. Amazing content. It would have been nice to have the same lesson using Python.

gianniprocida

Thank you very much for the series! It was very helpful!!

josegheevarghese

Thanks a lot for this great series on Spark

vinr

Thank you for your videos. They have been so helpful.

rhard

Hi Bryan, you mentioned that the ideal practice is that the system of record should come from the warehouse, and that pulling directly from application production databases is not a best practice.
My question: even to populate the warehouse, you need to pull data from the application databases, right? That can never be avoided. Am I missing anything here?

potnuruavinash

Can you please explain your point at 7:04, "I prefer not to use outer joins..."? I think you said at the beginning of this video that you prefer to use outer joins to identify missing data.

Raaj_ML
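For context, the "identify missing data" pattern the question refers to is usually a left outer join filtered to the unmatched rows. A minimal sketch with hypothetical orders and customers tables:

-- Orders whose customer_id has no match in customers:
-- the outer join fills the right side with NULLs, and the
-- WHERE clause keeps only those unmatched rows.
SELECT o.order_id,
       o.customer_id
FROM orders AS o
LEFT OUTER JOIN customers AS c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;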

Can I ask a question related to Spark streaming? Say we have incoming CSV files and we need to process them, but we need to do all the transformation on the data within a single file and output it as a file. That means each incoming file should have a corresponding outgoing file, with the necessary transformations done to the records in that file only. However, we also need to work on a cluster so that the load can be distributed and files can be processed in parallel. Is this something possible? Thanks

vinr

Hi Bryan,
Could you please create a video on how to append data to an existing table when we receive a new file or additional data?

Thanks,
Sri

kumarpyarasani
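Until such a video exists, the basic append pattern in Spark SQL is INSERT INTO, which adds rows without overwriting existing data. A minimal sketch, assuming the new file has already been loaded into a hypothetical staging table:

-- Append the staged rows to the existing table.
INSERT INTO sales_history
SELECT * FROM new_sales_staging;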