Master Databricks and Apache Spark Step by Step: Lesson 16 - Using SQL Window Functions

In this video, you learn how to use Spark Structured Query Language (SQL) window functions. Spark SQL is the most performant way to do data engineering on Databricks, and window functions expand SQL to include things like cumulative totals, ranked values, and aggregations shown alongside detail rows. They can save you a lot of work. I'll explain the concepts and demonstrate them with code in a Databricks notebook.
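A minimal sketch of those three ideas (a running total, a rank, and a per-group aggregate next to detail rows). The table and column names are invented, and Python's built-in sqlite3 stands in for a Spark session so the snippet is self-contained; the SELECT itself is standard SQL and should run unchanged via spark.sql(...) in a Databricks notebook.

```python
import sqlite3

# Hypothetical sales data; sqlite3 is used only so the example runs anywhere.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("ABC", 100.0), ("ABC", 50.0), ("XYZ", 200.0)])

rows = con.execute("""
    SELECT customer,
           amount,
           -- cumulative total within each customer
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total,
           -- rank of every sale across all customers
           RANK() OVER (ORDER BY amount DESC)                       AS sales_rank,
           -- per-customer aggregate repeated on each detail row
           SUM(amount) OVER (PARTITION BY customer)                 AS customer_total
    FROM sales
    ORDER BY customer, amount
""").fetchall()

for r in rows:
    print(r)
# ('ABC', 50.0, 50.0, 3, 150.0)
# ('ABC', 100.0, 150.0, 2, 150.0)
# ('XYZ', 200.0, 200.0, 1, 200.0)
```

Note how `customer_total` appears on every detail row without collapsing them, which a plain GROUP BY cannot do.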

Get my book Master Azure Databricks Step by Step at

Example Notebook for lesson 16 at:

You need to unzip the file and import the notebook into Databricks to run the code.

Video on Creating and Loading the tables used in this video
Comments

Thanks Bryan, you created a spark in my journey towards Databricks. Love from India.

amarnadhg

Best explanation of window functions. Thanks, Bryan.

vibhaskashyap

Bryan, amazing lesson! Window functions are a part of SQL that I haven't used much, so thanks for taking the time to go through this in the Databricks lessons! A big thank you for all your efforts in sharing your knowledge and experience and creating this series!

getsid

Hello Bryan: Thank you for the great SQL window function lesson in Databricks; I am new to the analytics platform. I am hoping you'll provide a new video focused on customer surveys (e.g., the healthcare customer satisfaction survey required after an encounter/visit). That would be a huge help.

Karen-

@Bryan Cafferky, love this topic! I learned a few key SQL concepts that I didn't know how to tackle before. For the given sales data, what would the query statement look like IF I wanted to show Customer, Year, and the Total, Highest, and Lowest sales amounts for EACH year? In other words, I do NOT want to show the individual transaction rows, only the aggregates for each year for each customer. Thanks in advance.
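Not Bryan's answer from the video, but a sketch of one way to get only the per-customer, per-year aggregates: because no detail rows are wanted, a plain GROUP BY suffices and no window function is needed. The sales table and its columns are invented, and sqlite3 stands in for Spark SQL; the SELECT is standard and should run in a Databricks notebook as-is.

```python
import sqlite3

# Hypothetical sales data with a year column for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("ABC", 2020, 100.0), ("ABC", 2020, 300.0),
    ("ABC", 2021, 50.0),  ("XYZ", 2020, 200.0),
])

# GROUP BY collapses the transaction rows, leaving one row per customer/year.
rows = con.execute("""
    SELECT customer,
           year,
           SUM(amount) AS total,
           MAX(amount) AS highest,
           MIN(amount) AS lowest
    FROM sales
    GROUP BY customer, year
    ORDER BY customer, year
""").fetchall()

for r in rows:
    print(r)
# ('ABC', 2020, 400.0, 300.0, 100.0)
# ('ABC', 2021, 50.0, 50.0, 50.0)
# ('XYZ', 2020, 200.0, 200.0, 200.0)
```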

anthonygonsalvis

Hi Bryan,

I need to write a query; I think it needs to be a correlated subquery.

My scenario is:

1. I have two tables, Table A and Table B.
2. For every row of Table A, I need to run a query on Table B, based on a criterion that comes from a Table A column.
3. The query needs to filter the rows in Table B using the criterion from Table A's column for every row, and after filtering, it has to take the row with the maximum Salary in that result set and return the adjacent Rating value from Table B.

For example:

Table A

Name | Gender
XYZ  | M
ABC  | F
DEF  | M

Table B

Name  | Tenure | Salary | Rating
XYZ-1 | 5      | 5000   | 1
XYZ-2 | 8      | 5500   | 5
XYZ-3 | 4      | 1100   | 2
ABC-1 | 1      | 1200   | 3
ABC-2 | 7      | 1000   | 4
ABC-3 | 8      | 6000   | 1
DEF-1 | 5      | 8000   | 2
DEF-2 | 3      | 1500   | 1
DEF-3 | 1      | 1000   | 5

Query Result:

Name | Gender | Rating
XYZ  | M      | 5
ABC  | F      | 1
DEF  | M      | 2

1st Row Execution:

XYZ is taken as the parameter to filter Table B, so the filtering proceeds in two steps:

1. Keep all the names from Table B that begin with XYZ:

Name  | Tenure | Salary | Rating
XYZ-1 | 5      | 5000   | 1
XYZ-2 | 8      | 5500   | 5
XYZ-3 | 4      | 1100   | 2

2. Keep only the row with the maximum Salary from step 1:

Name  | Tenure | Salary | Rating
XYZ-2 | 8      | 5500   | 5

Output:

Name | Gender | Rating
XYZ  | M      | 5

Kindly explain the query format or syntax to achieve this.
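One possible shape for this query (a sketch, not a verified answer from the video): instead of a correlated subquery, a ROW_NUMBER window partitioned by the Table A name keeps only the max-salary row per group. The table names table_a and table_b are assumptions, and sqlite3 stands in for Spark SQL; the SELECT itself is standard and should also run in Spark SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_a (name TEXT, gender TEXT)")
con.execute("CREATE TABLE table_b (name TEXT, tenure INTEGER, "
            "salary INTEGER, rating INTEGER)")
con.executemany("INSERT INTO table_a VALUES (?, ?)",
                [("XYZ", "M"), ("ABC", "F"), ("DEF", "M")])
con.executemany("INSERT INTO table_b VALUES (?, ?, ?, ?)", [
    ("XYZ-1", 5, 5000, 1), ("XYZ-2", 8, 5500, 5), ("XYZ-3", 4, 1100, 2),
    ("ABC-1", 1, 1200, 3), ("ABC-2", 7, 1000, 4), ("ABC-3", 8, 6000, 1),
    ("DEF-1", 5, 8000, 2), ("DEF-2", 3, 1500, 1), ("DEF-3", 1, 1000, 5),
])

# Join Table B rows to their Table A prefix, number them by salary
# (highest first) within each prefix, and keep only row number 1.
rows = con.execute("""
    SELECT name, gender, rating
    FROM (
        SELECT a.name, a.gender, b.rating,
               ROW_NUMBER() OVER (PARTITION BY a.name
                                  ORDER BY b.salary DESC) AS rn
        FROM table_a a
        JOIN table_b b ON b.name LIKE a.name || '-%'
    ) t
    WHERE rn = 1
    ORDER BY name
""").fetchall()

for r in rows:
    print(r)
# ('ABC', 'F', 1)
# ('DEF', 'M', 2)
# ('XYZ', 'M', 5)
```

The window approach scans Table B once, whereas a correlated subquery would re-query Table B for every Table A row.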

tauseefguard

I have a question about creating the view v_productcatalog: in that view creation, I see that you are joining two dimension tables. Isn't that a snowflake-schema query? Can you please clarify?

JD-xdxp

We are using Azure Synapse Spark pools. Is that more like Databricks or pure Spark?

MathewBurford

A little hard to understand... I need to watch and practice again.

Ron-gndv