Advancing Spark - JSON Schema Drift with Databricks Autoloader


We've come full circle - the whole idea of lakes was that you could land data without worrying about the schema, but the move towards more managed, governed lakes using Delta has meant we need to apply a schema again... so how do we balance evolving schemas with the need for managed structures?

The new schema drift features in Databricks Autoloader take a decent stab at this problem - when reading from JSON sources, we can now pull the attributes we want into a known schema, but keep everything else as a JSON string that we can then extract further details from. In this week's video, Simon takes a look at the new feature, how it works and one or two of its limitations.
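The idea described above (known attributes go into fixed columns, everything else is kept as a JSON string) can be sketched in plain Python. This is only an illustration of the concept, not the Autoloader API: the field names `id` and `name` and the helper `split_record` are hypothetical, and `_unparsed_data` is simply the column name mentioned in the comments.

```python
import json

# Hypothetical known schema, for illustration only.
KNOWN_FIELDS = ("id", "name")
# Column name borrowed from the comments; an assumption, not a fixed API name.
RESCUED_COLUMN = "_unparsed_data"


def split_record(raw_json, known_fields=KNOWN_FIELDS):
    """Split one JSON record into known columns plus a JSON string of the rest."""
    record = json.loads(raw_json)
    # Known attributes become regular columns (None when absent).
    row = {field: record.get(field) for field in known_fields}
    # Anything outside the known schema is kept, not dropped.
    extras = {k: v for k, v in record.items() if k not in known_fields}
    row[RESCUED_COLUMN] = json.dumps(extras) if extras else None
    return row


row = split_record('{"id": 1, "name": "widget", "colour": "red"}')
print(row)
```

A record that drifts from the known schema still lands successfully; the unexpected `colour` attribute stays available in the string column for later extraction.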

As always, don't forget to like & subscribe!

Advancing Analytics


Comments

Good stuff! The "issue" you stumbled on at 14:40 with multiple json_tuple calls is not a big problem: json_tuple accepts multiple fields, which means you can extract several fields in one go.

ArcaLuiNeo
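To illustrate the point in the comment above: in PySpark, `json_tuple(col, *fields)` from `pyspark.sql.functions` takes any number of field names in a single call. The sketch below mimics that behaviour in plain Python so it runs without a cluster. The helper `extract_fields` is hypothetical, and returning values as strings (with `None` for missing fields) is an approximation of how json_tuple behaves, not a guarantee.

```python
import json


def extract_fields(json_string, *fields):
    """Return the requested top-level fields from a JSON string in one pass."""
    parsed = json.loads(json_string)
    values = []
    for field in fields:
        value = parsed.get(field)
        if value is None or isinstance(value, str):
            # Missing fields stay None; strings are returned unchanged.
            values.append(value)
        else:
            # Non-string values are rendered back to JSON text.
            values.append(json.dumps(value))
    return tuple(values)


# One call, several fields, instead of one call per field.
colour, size = extract_fields('{"colour": "red", "size": 42}', "colour", "size")
print(colour, size)
```

The JSON string is parsed once and all requested fields are read from that single parse, which is the advantage the comment is pointing at.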

Hello Simon. Thank you for this insightful video. I think you should be able to read inside '_unparsed_data' with dot notation if you create a new DataFrame as a Delta table from the table 'newdf', which would remove the limitation on the operations you can do.
One possible scenario is reading a JSON file returned by an API, where some fields are present or absent depending on whether they are populated. Finding the minimum set of fields that are always present and putting everything else into the unparsed column doesn't seem to me the best of the viable solutions. How would you deal with this particular situation?
Would you suggest a way to use the most complete schema, that is, the one the JSON file would have if all available fields were populated?
How would it be possible to handle missing fields without compromising the rest of the good information?

elbaldos

Great content as always. Just shared your channel with my company :)

anildangol

Hi, thanks for the content. Any idea how we can handle other schema changes, such as a dropped column or a change in data type?

saurabh

Does it work well with an array encapsulated in JSON?

douglasleal

Hi Simon,
I want to watch an ADLS Gen2 folder with Autoloader.
When a file (JSON) is added to this folder, I want to overwrite the Delta table I'm streaming to.
Do you know if this is possible?
So far I haven't been able to find a way to do that. In "normal" Delta format, I would just use
Cheers,
Marco

marcocaviezel

Simon, how can we contact you? Do you offer training on Databricks?

crazybauns

Is the schema drift feature only available in Databricks?

mighelone
