Advancing Spark - Identity Columns in Delta

A classic challenge in Data Warehousing is getting your surrogate key patterns right - but without the same tooling in a lakehouse environment, how do we achieve it? We've had several patterns in the past, each with its drawbacks, but now we've got a brand new IDENTITY column type... so how does it size up?

In this video Simon does a quick recap of the existing surrogate key methods within Spark-based ETL processes, before looking through the new Delta Identity functionality!

As always, if you're beginning your lakehouse journey, or need an expert eye to guide you on your way, you can always get in touch with Advancing Analytics.

00:00 - Hello
01:37 - Existing Key Methods
10:36 - New Identity Functionality
15:18 - Testing a larger insert
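
For anyone who wants to see the syntax before diving in, a minimal sketch of an identity column in Spark SQL is below. It assumes a Databricks runtime with identity column support; the database, table, and column names are purely illustrative.

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.dim_customer (
        customer_key  BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
        customer_id   STRING,
        customer_name STRING
    ) USING DELTA
""")

# Inserts leave the identity column out; generated values are unique and increasing,
# though not guaranteed to be consecutive
spark.sql("""
    INSERT INTO demo.dim_customer (customer_id, customer_name)
    VALUES ('C001', 'Advancing Analytics')
""")
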
Comments

That's a great addition to the runtime. Thank you for the awesome vid!

JulesTuffrey

'There's no Will' 😂 😂 😂
Nicely done!

WastedFury

I finally found a good use case for the identity function in Databricks! Typically we use hash keys so we can parallelise our jobs, but I needed to create a unique identifier for an XML output, which was limited to a maximum 30-character string. Our natural keys were all GUIDs and our hash keys were also too long - delta identity to the rescue! Now we have a nice little mapping table from our natural keys to a bigint identity which we use in our XML output :D

MrMikereeve
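
A minimal sketch of the mapping-table pattern described above, assuming a Databricks/Delta runtime with identity support and an incoming view called new_records holding the natural keys; all names are illustrative.

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.key_map (
        surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
        natural_key  STRING
    ) USING DELTA
""")

# Insert only natural keys that are not mapped yet; the identity column is
# never supplied explicitly, so Delta assigns the bigint surrogate
spark.sql("""
    MERGE INTO demo.key_map AS t
    USING new_records AS s
    ON t.natural_key = s.natural_key
    WHEN NOT MATCHED THEN INSERT (natural_key) VALUES (s.natural_key)
""")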

Nice video Simon. You can create the Delta table with the Python library called delta; it works pretty well. You can change the table location there too if you don't want it in the database default location, which will make it an external table.

alexischicoine
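
For reference, a rough sketch of that approach with the delta-spark DeltaTableBuilder API; the path and names are made up, and whether identity columns can be declared through the builder depends on your delta-spark / Databricks runtime version, so only plain columns are shown here.

from delta.tables import DeltaTable

# Creating the table at an explicit path makes it an external table
(DeltaTable.create(spark)
    .tableName("demo.dim_customer")
    .addColumn("customer_id", "STRING")
    .addColumn("customer_name", "STRING")
    .location("/mnt/lake/dim_customer")
    .execute())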

Hi Simon. I see that the new feature works only with Spark SQL. Is there a similar approach using the DataFrame API? I don't want to create a table and then use it to load a DataFrame...

zycbrasil

Some interesting ways to handle IDs in the video. I used monotonically_increasing_id for a while but have moved to using zipWithUniqueId():

from pyspark.sql.types import StructType, StructField, LongType

# colName and offset are defined elsewhere; the unique id is prepended as the first column
new_schema = StructType([StructField(colName, LongType(), True)] + df.schema.fields)
zipped_rdd = df.rdd.zipWithUniqueId()
new_rdd = zipped_rdd.map(lambda row: [row[1] + offset] + list(row[0]))
df_with_ids = spark.createDataFrame(new_rdd, new_schema)

seanp

Any idea if this is coming to Azure Synapse soon?

KristofDeMiddelaer

Delta tables with identity columns lose the ability to be inserted into by several processes simultaneously. Only one process will insert data; the others will get a MetadataChangedException. Are there any workarounds?

isgqsph

Very nice. Does IDENTITY auto-generate the number if I do df.write() instead of an INSERT statement?

surenderraja

Is this new feature only available for Delta tables, or can we use the identity option on Parquet tables? Reading the documentation, I think you can use it for several kinds of tables. Thanks for your super useful videos! I'm a big fan!

guilleromero

Good one!! Will save a lot of coding + execution time :)

rajkumarv

Now that IDENTITY columns have been out for around two years, how did they perform for you once you got the chance to put them through their paces?

NoahPitts

I'm looking for a way now to read from the metadata the details that were set for the identity column. If I start at 100 and increase by 10, for example, that must have been defined at the creation of the table and stored somewhere in the metadata. But how can I get to that information? I already tried information_schema.columns, but for some reason (not the runtime) it does not work in my database - it doesn't recognize information_schema. Is there any other way to get this info from the metadata? Maybe in Python or Scala?
Please let me know.
Otherwise great video. I quite enjoy your style of explaining.

flammenman
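
Not a definitive answer, but two places worth checking (this is an assumption about recent Databricks runtimes rather than documented behaviour): the generated DDL and the extended table description. The table name below is illustrative.

# Assumption: the returned DDL echoes back the GENERATED ALWAYS AS IDENTITY clause
spark.sql("SHOW CREATE TABLE demo.dim_customer").show(truncate=False)

# Extended column/table metadata is another place the identity details may surface
spark.sql("DESCRIBE TABLE EXTENDED demo.dim_customer").show(truncate=False)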

Is this feature only available in Databricks, or can I use Spark SQL without Databricks and make use of the identity column feature with Delta Lake?

Please tell.

YogeshShivakumar

It's a great video. I have a small question: can we add an identity column by altering the table? I have tried a few different ways but it's not working: 'ALTER TABLE db.student ADD COLUMN student_id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 100 INCREMENT BY 1)'

rakeshreddybadam

Any known issues with merges? Using .whenNotMatchedInsertAll() I'm getting the error: AnalysisException: cannot resolve Key in UPDATE clause given columns src.Id....

jefflingen

Does anyone know what kind of function/logic sits behind GENERATED ALWAYS AS IDENTITY? Is it still doing windowing/sorting, or is it hashing? I'm not sure whether this is mentioned somewhere in the docs so far, so I'm wondering what it might be doing under the hood.

sval

Is there no way to reset the identity column? Say, six months from now, our identity column is at a value of 10 billion?

andrewfogarty

What if you did a second pass and ranked the monotonic IDs?

alexischicoine
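
A rough sketch of that two-pass idea, assuming an existing DataFrame df: generate monotonically_increasing_id first, then rank it with a window function to get dense, consecutive keys. Column names are illustrative, and the un-partitioned window pulls all rows through a single partition, so it gets expensive at scale.

from pyspark.sql import functions as F, Window

# Pass 1: sparse but unique ids across partitions
df_sparse = df.withColumn("sparse_id", F.monotonically_increasing_id())

# Pass 2: rank the sparse ids into consecutive surrogate keys
w = Window.orderBy("sparse_id")
df_keyed = df_sparse.withColumn("surrogate_key", F.row_number().over(w))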