Advancing Spark - Data + AI Summit 2022 Day 2 Recap

The Data + AI Summit 2022 brought a ton of updates, so many that we're only just on to Day 2 of the keynote announcements! Whilst Day 2 is traditionally the "Data Science" updates day, this time we saw announcements across MLflow, Databricks Workflows and Delta Live Tables!

In this video, Simon steps through the Day 2 keynote, available to watch on demand via the DAIS 2022 website. He discusses MLflow 2.0, which brings MLflow Pipelines, serverless model serving and deeper model observability. He then looks briefly at Workflows before digging into some exciting DLT announcements, including Project Enzyme!

As always, feel free to get hold of Advancing Analytics if you need any help on your Lakehouse journey!

Comments

I really like the concept of Delta Live Tables: building the central part of a data platform in a streaming fashion and then letting the source system teams deliver data at whatever latency they can handle. After building data warehouses for 15 years, I'm so tired of the discussions with consumers about how low the latency needs to be, and with source system owners about how little I'm allowed to touch their systems. I'd prefer to have them fight it out between themselves 😉

peterydethomsen
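The pattern described in that comment maps naturally onto a Delta Live Tables pipeline where the central tables are defined as streaming tables, so each source team lands data at its own pace and the pipeline simply picks up whatever has arrived. Below is a minimal sketch, assuming a hypothetical JSON landing zone and table names; in a DLT notebook `spark` is provided for you.

```python
# Minimal DLT sketch: streaming bronze/silver tables that absorb source data
# at whatever latency the source teams can deliver. Paths and names are
# hypothetical, not taken from the video.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders as landed by the source system team")
def bronze_orders():
    # Auto Loader ingests whatever new files have arrived since the last update
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/orders/")          # hypothetical landing zone
    )

@dlt.table(comment="Cleaned orders for downstream consumers")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .where(F.col("order_id").isNotNull())
        .withColumn("ingested_at", F.current_timestamp())
    )
```

How "low latency" the consumers actually get then becomes a pipeline setting (triggered vs continuous execution) rather than an argument with the source system owners.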

Machine learning and data science are really nice but complex fields. With all of that to manage, I scratch my head when companies don't invest in the data engineering and analytics engineering that would let these folks do their jobs. We see it so often, whether for data scientists or even analysts, that the data just isn't in a usable form. As a data/analytics engineer, I find that enabling these people to do their jobs is very gratifying, and that way each one can do what they're good at (and like).

Databricks probably had this recap on data science/ML because that's a big focus for them and a good competitive advantage they have over other solutions that are more focused on data warehousing.

alexischicoine

Enzyme update
That's going to be great if it works, heh, well said. Hopefully it's better than autoscaling. But it's definitely true that when you try to be too smart with incremental changes, it just ends up super complex and potentially slower than rewriting the whole table. I'll usually do something similar to a partition switch, where I rebuild the dates that have changes and use that as the source for a merge (including the portion of existing data that should be deleted). That makes sure I don't do a delete and then an insert, which would leave my table incomplete in between and make a mess of my change data feed. It can definitely make Databricks easier to use though, which is something Delta Live Tables also tries to do, including an automatic job that goes and maintains your tables.

alexischicoine
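For readers who haven't used this "rebuild the changed dates, then merge" pattern, here is a hedged sketch of one way it can look with the Delta Lake Python API. The table, column and helper names (`silver.orders`, `recompute_silver_for_dates`, `updates_df`) are hypothetical, and the delete-in-the-same-merge clause needs a Delta Lake release that supports `whenNotMatchedBySourceDelete`; the point is that updates, inserts and deletes land in a single transaction, so readers and the change data feed never see a half-applied state.

```python
# Sketch: rebuild only the affected dates, then apply them in one MERGE.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# updates_df: hypothetical DataFrame of incoming changes
changed_dates = [row.order_date for row in
                 updates_df.select("order_date").distinct().collect()]

# Hypothetical helper that recomputes the silver rows for those dates in full
rebuilt = recompute_silver_for_dates(changed_dates)

target = DeltaTable.forName(spark, "silver.orders")

(target.alias("t")
 .merge(
     rebuilt.alias("s"),
     "t.order_id = s.order_id AND t.order_date = s.order_date")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 # Rows in the affected dates that no longer exist in the rebuilt data are
 # removed in the same transaction; the condition keeps untouched dates safe.
 .whenNotMatchedBySourceDelete(F.col("t.order_date").isin(changed_dates))
 .execute())
```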

Great video, thoroughly enjoying them. Just my observation, and maybe I'm missing something, but SCD Type 2 tables are supposed to be used in dimensional modelling together with an important concept, namely surrogate keys, so that joining those SCD dimensions to fact tables is simpler, more accurate and more performant, and the history tracking makes sense. The audit fields added to the "target/silver" tables lack this, so the matching fact tables will need more complex logic to match the appropriate records across tables correctly, checking for date relevancy as well. I hope they add this to the "target/silver" tables in the future, especially now that identity columns are supported in Delta tables; it would make migrating SCD2 models from warehouses to lakehouses for BI purposes much easier.

hjgeyer
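Until something like that is built in, one workaround the comment hints at is to maintain your own surrogate key with a Delta identity column and resolve it during the fact load with a date-relevancy join. The sketch below assumes hypothetical table and column names, and borrows the `__START_AT`/`__END_AT` validity columns that DLT's SCD2 output uses (where `__END_AT` is NULL for the current record).

```python
# Sketch: surrogate-keyed SCD2 dimension plus a date-relevancy lookup at fact load.
spark.sql("""
CREATE TABLE IF NOT EXISTS silver.dim_customer (
  customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
  customer_id   STRING,                               -- business key
  customer_name STRING,
  __START_AT    TIMESTAMP,
  __END_AT      TIMESTAMP                             -- NULL for current record
) USING DELTA
""")

# Resolve each fact row to the dimension version valid at the event time,
# so downstream BI joins only need the surrogate key.
fact_with_sk = spark.sql("""
SELECT f.*, d.customer_sk
FROM   staging.fact_orders f
JOIN   silver.dim_customer d
  ON   f.customer_id = d.customer_id
 AND   f.order_ts >= d.__START_AT
 AND  (d.__END_AT IS NULL OR f.order_ts < d.__END_AT)
""")
fact_with_sk.write.mode("append").saveAsTable("gold.fact_orders")
```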

I don't think I've ever come across a need for Enzyme. A table that's being updated with incremental changes is going to be designed by the data engineer specifically to handle those merges/updates. I have a hard time believing you would just accidentally partition by the predicate you use to update. A better, more useful approach would be an algorithm that designs your table to be updated most efficiently by analysing your entire pipeline.

gardnmi