Advancing Spark - Understanding Low Shuffle Merge

Back in Databricks Runtime 9.0 we saw the introduction of a preview "Low Shuffle Merge" feature, but it seemed to go fairly unnoticed. In DBR 10.4, it's now enabled by default and a fully GA part of the platform... but what does it actually do?
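
For anyone still on an older runtime, the preview could be switched on manually. A minimal sketch, assuming a Databricks notebook where `spark` is already in scope; the flag name below is the one documented for the DBR 9.x preview, so double-check it against your runtime's docs:

```python
# Low Shuffle Merge is on by default from DBR 10.4; on DBR 9.x it was an
# opt-in preview enabled via this session-level config flag.
spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")
```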

In this video, Simon walks through the theory of low shuffle merge and what you should expect to see happen, both in your runtime executions and in the data layout before and after the change. Make no mistake, it's a real speed boost for many common patterns, so use it if you can!
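
To ground the theory, here's the shape of a Delta MERGE that benefits: when most rows in the touched files are unmatched, low shuffle merge writes those rows back without shuffling them, so their existing ordering and clustering are preserved. The table and column names (`sales`, `updates`, `order_id`) are made up for illustration:

```python
# A typical upsert pattern: only matched rows need processing; with low
# shuffle merge, the unmatched rows in rewritten files skip the shuffle
# and keep their existing layout.
spark.sql("""
    MERGE INTO sales AS t
    USING updates AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```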

And as always, get in touch with Advancing Analytics if you need help on your Lakehouse journey!
Comments

Really useful.
New to Databricks, and your last couple of videos have really helped me understand how it will support some of the key concepts I need, and some of the gotchas that are actually being resolved in the new releases. Thank you.

WastedFury

Apologies - it looks like I wiped out comments when clearing some initial spam. Sorry if anyone's actual comments got dropped!
Simon

AdvancingAnalytics

I've been using it since day 1. It has improved my merges :D

YoussefMrini

Kudos for the whiteboard. You should do it more often!

fb-guer

Thanks for the explanation. I am working on a scenario where the table has 2bn rows (no efficient column to partition on, so I can't use predicate pushdown in the merge) and my batch job runs every hour, loading 1mn rows each time. The merge is now taking upwards of 50 mins. I will try low shuffle merge and also OPTIMIZE with Z-ordering (once daily). Can you suggest any other optimization techniques?

Vikasptl
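
For a scenario like the one above, two techniques often help alongside low shuffle merge: a coarse static predicate on the target in the ON clause so the merge only rewrites recent files, and a periodic Z-order on the merge key so matching rows cluster into fewer files. A hedged sketch; the table and column names (`events`, `hourly_batch`, `event_id`, `event_date`) are hypothetical, and the date predicate only works if late-arriving updates really are bounded:

```python
# Bound the files the merge can touch with a static predicate on the
# target, assuming updates only ever land within the last 7 days
# (a hypothetical bound - adjust to the actual arrival pattern).
spark.sql("""
    MERGE INTO events AS t
    USING hourly_batch AS s
      ON t.event_id = s.event_id
     AND t.event_date >= current_date() - INTERVAL 7 DAYS
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Daily maintenance: co-locate rows by the merge key so each incoming
# batch matches against fewer files.
spark.sql("OPTIMIZE events ZORDER BY (event_id)")
```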

Thanks for the explanation. What device are you using for the whiteboarding part?

ArcaLuiNeo

Very time-consuming explanation method.

tarun