Advancing Spark - Identity Columns in Delta

A classic challenge in Data Warehousing is getting your surrogate key patterns right - but without the same tooling in a lakehouse environment, how do we achieve it? We've had several patterns in the past, each with its drawbacks, but now we've got a brand new IDENTITY column type... so how does it size up?

In this video Simon does a quick recap of the existing surrogate key methods within Spark-based ETL processes, before looking through the new Delta Identity functionality!

As always, if you're beginning your lakehouse journey, or need an expert eye to guide you on your way, you can always get in touch with Advancing Analytics.

00:00 - Hello
01:37 - Existing Key Methods
10:36 - New Identity Functionality
15:18 - Testing a larger insert
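
For anyone who wants to see the syntax before diving in, a minimal sketch of an identity column in Spark SQL is below. It assumes a Databricks runtime with identity column support; the database, table, and column names are purely illustrative.

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.dim_customer (
        customer_key  BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
        customer_id   STRING,
        customer_name STRING
    ) USING DELTA
""")

# Inserts leave the identity column out; generated values are unique and increasing,
# though not guaranteed to be consecutive
spark.sql("""
    INSERT INTO demo.dim_customer (customer_id, customer_name)
    VALUES ('C001', 'Advancing Analytics')
""")
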
Comments

That's a great addition to the runtime. Thank you for the awesome vid!

JulesTuffrey

'There's no Will' 😂 😂 😂
Nicely done!

WastedFury

I finally found a good use case for the identity function in Databricks! Typically we use hash keys so we can parallelise our jobs, but I needed to create a unique identifier for an XML output, which was limited to a maximum 30-character string. Our natural keys were all GUIDs and our hash keys were also too long - delta identity to the rescue! Now we have a nice little mapping table from our natural keys to a bigint identity which we use in our XML output :D

MrMikereeve
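
A minimal sketch of the mapping-table pattern described above, assuming a Databricks/Delta runtime with identity support and an incoming view called new_records holding the natural keys; all names are illustrative.

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.key_map (
        surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
        natural_key  STRING
    ) USING DELTA
""")

# Insert only natural keys that are not mapped yet; the identity column is
# never supplied explicitly, so Delta assigns the bigint surrogate
spark.sql("""
    MERGE INTO demo.key_map AS t
    USING new_records AS s
    ON t.natural_key = s.natural_key
    WHEN NOT MATCHED THEN INSERT (natural_key) VALUES (s.natural_key)
""")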

Nice video Simon. You can create the Delta table with the Python library called delta; it works pretty well. You can change the table location there too if you don't want it in the database default location, which will make it an external table.

alexischicoine
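
For reference, a rough sketch of that approach with the delta-spark DeltaTableBuilder API; the path and names are made up, and whether identity columns can be declared through the builder depends on your delta-spark / Databricks runtime version, so only plain columns are shown here.

from delta.tables import DeltaTable

# Creating the table at an explicit path makes it an external table
(DeltaTable.create(spark)
    .tableName("demo.dim_customer")
    .addColumn("customer_id", "STRING")
    .addColumn("customer_name", "STRING")
    .location("/mnt/lake/dim_customer")
    .execute())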

Hi Simon. I see that the new feature works only with Spark SQL. Is there a similar approach using the DataFrame API? I don't want to create a table and then use it to load a DataFrame...

zycbrasil

Some interesting ways to handle IDs in the video. I used monotonically_increasing_id for a while but have moved to using zipWithUniqueId():

from pyspark.sql.types import StructType, StructField, LongType

# colName and offset are defined elsewhere; the unique id is prepended as the first column
new_schema = StructType([StructField(colName, LongType(), True)] + df.schema.fields)
zipped_rdd = df.rdd.zipWithUniqueId()
new_rdd = zipped_rdd.map(lambda row: [row[1] + offset] + list(row[0]))
df_with_ids = spark.createDataFrame(new_rdd, new_schema)

seanp

Any idea if this is coming to Azure Synapse soon?

KristofDeMiddelaer

Delta tables with identity columns lose the ability to be inserted into by several processes simultaneously. Only one process will insert data; the others will get a MetadataChangedException. Are there any workarounds?

isgqsph

Very nice. Does IDENTITY auto-generate the number if I do df.write() instead of an INSERT statement?

surenderraja

Is this new feature only available for Delta tables, or can we use the identity option on Parquet tables? Reading the documentation, I think you can use it for several kinds of tables. Thanks for your super useful videos! I'm a big fan!

guilleromero

Good one!! Will save a lot of coding + execution time :)

rajkumarv

Now that IDENTITY columns have been out for around two years, how did they perform for you once you got the chance to put them through their paces?

NoahPitts

I'm looking for a way now to read from the metadata the details that were set for the identity column. If I start at 100 and increase by 10, for example, that must have been defined at the creation of the table and stored somewhere in the metadata. But how can I get to that information? I already tried information_schema.columns, but for some reason (not the runtime) it does not work in my database - it doesn't recognize information_schema. Is there any other way to get this info from the metadata? Maybe in Python or Scala?
Please let me know.
Otherwise great video. I quite enjoy your style of explaining.

flammenman
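
Not a definitive answer, but two places worth checking (this is an assumption about recent Databricks runtimes rather than documented behaviour): the generated DDL and the extended table description. The table name below is illustrative.

# Assumption: the returned DDL echoes back the GENERATED ALWAYS AS IDENTITY clause
spark.sql("SHOW CREATE TABLE demo.dim_customer").show(truncate=False)

# Extended column/table metadata is another place the identity details may surface
spark.sql("DESCRIBE TABLE EXTENDED demo.dim_customer").show(truncate=False)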

Is this feature only available in Databricks, or can I use Spark SQL without Databricks and make use of the identity column feature with Delta Lake?

Please tell.

YogeshShivakumar

It's a great video. I have a small question: can we add an identity column by altering the table? I have tried a few different ways but it's not working: 'ALTER TABLE db.student ADD COLUMN student_id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 100 INCREMENT BY 1)'

rakeshreddybadam

Any known issues with merges? Using .whenNotMatchedInsertAll() I'm getting the error: AnalysisException: cannot resolve Key in UPDATE clause given columns src.Id....

jefflingen

Does anyone know what kind of function/logic sits behind GENERATED ALWAYS AS IDENTITY? Is it still doing windowing/sorting, or is it hashing? I'm not sure whether this is mentioned somewhere in the docs so far, so I'm wondering what it might be doing under the hood.

sval

Is there no way to reset the identity column? Say, six months from now, our identity column is at a value of 10 billion?

andrewfogarty

What if you did a second pass and ranked the monotonic IDs?

alexischicoine
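
A rough sketch of that two-pass idea, assuming an existing DataFrame df: generate monotonically_increasing_id first, then rank it with a window function to get dense, consecutive keys. Column names are illustrative, and the un-partitioned window pulls all rows through a single partition, so it gets expensive at scale.

from pyspark.sql import functions as F, Window

# Pass 1: sparse but unique ids across partitions
df_sparse = df.withColumn("sparse_id", F.monotonically_increasing_id())

# Pass 2: rank the sparse ids into consecutive surrogate keys
w = Window.orderBy("sparse_id")
df_keyed = df_sparse.withColumn("surrogate_key", F.row_number().over(w))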