Advancing Spark - JSON Schema Drift with Databricks Autoloader

We've come full circle - the whole idea of lakes was that you could land data without worrying about the schema, but the move towards more managed, governed lakes using Delta has meant we need to apply a schema again... so how do we balance evolving schemas with the need for managed structures?

The new schema drift features in Databricks Autoloader take a decent stab at this problem - when reading from JSON sources, we can now pull the attributes we want into a known schema, but keep everything else as a JSON string that we can extract further details from later. In this week's video, Simon takes a look at the new feature, how it works, and one or two of its limitations.
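
For anyone wanting to try it, here is a minimal sketch of the pattern (not Simon's exact notebook): let Auto Loader track an inferred schema for the JSON files, hint the attributes you care about, and rescue everything else into a JSON string column. Paths, table names and the schema hint are placeholders, and the option names follow the current Databricks docs, so they may differ slightly from the preview shown in the video.

```python
# Hedged sketch: Auto Loader with a hinted schema plus a rescued-data column.
# All paths and names below are illustrative placeholders.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/orders")   # where the inferred schema is tracked
        .option("cloudFiles.schemaHints", "orderId string, amount double")  # attributes we want strongly typed
        .option("cloudFiles.schemaEvolutionMode", "rescue")                 # anything else lands in _rescued_data as JSON
        .load("/mnt/lake/raw/orders")
)

(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/lake/_checkpoints/orders")
        .start("/mnt/lake/bronze/orders")
)
```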

As always, don't forget to like & subscribe!
Comments

Good stuff! The "issue" you stumbled on at 14:40 with multiple json_tuple calls is not a big issue, since json_tuple accepts multiple fields, which means you can extract several fields in one go.
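
For reference, a quick sketch of what that looks like; the column and field names here are made up, and the unparsed column may be named differently depending on the runtime:

```python
# json_tuple can pull several fields out of the same JSON string column in one call.
from pyspark.sql.functions import json_tuple

extracted = df.select(
    "*",
    json_tuple("_unparsed_data", "city", "country", "postcode")
        .alias("city", "country", "postcode"),
)
```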

ArcaLuiNeo

Hello Simon. Thank you for this insightful video. I think you should be able to read inside '_unparsed_data' with dot notation if you create a new DataFrame as a Delta table from table 'newdf', which removes the limitation on the operations you can do.
One possible scenario is reading a JSON file returned by an API, where some fields may or may not be present depending on whether they are populated. Finding the minimum set of fields that are always present and adding the unparsed fields this way doesn't seem to me to be the best of the viable solutions. How would you deal with this particular situation?
Would you suggest a way to use the most complete version of the schema, i.e. the one the JSON file would have if all available fields were populated?
How would it be possible to handle missing fields without compromising all the rest of the good information?
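
One hedged way to approach the scenario described above (the column and field names are illustrative, not from the video): parse the unparsed string with from_json against the fullest schema you expect, so fields that are missing from a given record simply come back as null and the rest of the row is untouched.

```python
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# The "most complete" schema the API response could have; absent fields become null.
full_schema = StructType([
    StructField("customer", StringType(), True),
    StructField("discount", DoubleType(), True),
    StructField("promoCode", StringType(), True),
])

parsed = (
    newdf
        .withColumn("_extra", from_json(col("_unparsed_data"), full_schema))
        .select("*", "_extra.*")      # expose the parsed fields with ordinary dot notation
        .drop("_extra", "_unparsed_data")
)
```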

elbaldos

Great content as always. Just shared your channel with my company :)

anildangol

Hi, thanks for the content. Any idea how we can handle other schema changes, such as dropping a column or changing a data type?

saurabh

Does it work well with an array encapsulated in JSON?

douglasleal

Hi Simon,
I want to watch an ADLS Gen2 folder with Autoloader.
When a file (JSON) is added to this folder, I want to overwrite the Delta table I'm streaming to.
Do you know if this is possible?
So far I wasn't able to find a solution to do that. In "normal" Delta format, I would just use
Cheers,
Marco

marcocaviezel

Simon, how can we contact you? Do you offer training on Databricks?

crazybauns

Is the schema drift feature only available in Databricks?

mighelone