To Surrogate Key or Not...

preview_player
Показать описание
We've talked about using a surrogate key in your data warehouse whether that's Azure Synapse Analytics or something else. Patrick looks at why you should consider this even if you aren't using a slowly changing dimension.


*******************

Want to take your Power BI skills to the next level? We have training courses available to help you with your journey.


*******************
LET'S CONNECT!
*******************


***Gear***

#AzureSynapse #DataWarehouse #GuyInACube
Рекомендации по теме
Комментарии
Автор

Surrogate keys are THE most critical piece for good model management. Management of keys pushes you to ensure the proper Schema.

Do it! Do it!

ChrisWagnerDatagod
Автор

Incredible! Keep these SQL theory vids coming. This is high-level stuff. Thanks Patrick for continuing to be there for us enterprise level SQL folks. YOU DA MAN

eziola
Автор

On a related note, I've seen users replace string-based natural keys with integer-based surrogate keys in their Power BI data models, assuming the smaller integer data type would be more performant. However, most of the time, the VertiPaq engine will hash these integers, making them strings anyways, so all the user is really doing is adding complexity to their transformations and compute to their refreshes.

jeffshieldsdev
Автор

and now the question is beacuse i not have options to use identity in DWH in microsoft fabric?

vicenteantoniomagallanesju
Автор

Important to understand when to use them, and also when not to. Sometimes its very easy to overcomplicate your models with an insistence on 'incremental' SKs, even when they are not needed or even costly (e.g. in columnar datasets). I prefer the Super Natural Key / Durable Key model, or even better, having natural dimensional fields actually in the facts for the columnar data compression to work with.

JohnBrown-hxbq
Автор

Nice simple explanation of when & why to use surrogate keys. Particularly the why - too often surrogate keys are presented as being good design always, even when they aren't needed and the natural key will work fine.

Side note:

One habit that doesn't seem to go away, but has no value vs using the natural DATE is creating a YYYYMMDD integer surrogate key for a date (aka Calendar) table.

Bad: The INT has more bytes than DATE - it will cause more IO. So it will be slower, even if only marginally.
Bad: It has no integrity enforcement as a valid date.
Bad: It is awkward to manipulate as a date - e.g. using date functions.

Dates as 4 byte INT has not had value since the introduction of SQL Server 2008, with the 3 byte DATE type.

pauloneufneufneuf
Автор

How about you need to allow to user to add like a budget fact table to the model (Copositive model) then they will need to know the SK value of an Item, if the Item Dimension has NK then they don't, just add it and it works.

etacarl
Автор

Not my first priority. But you hit on the best reason to use it, if your IDs like CustomerID may change or be augmented in future.

MilhouseBS
Автор

Q: i don’t understand the implementation of surrogate keys. I feel like I’m missing an obvious and low compute way of maintaining all the surrogate keys on your facts. No videos I’ve seen discuss this. But it seems every time a new fact record is generated you have to join every related dim on the foreign natural key and update the fact with the dim’s related surrogate key. So that you can later perform joins using the surrogate key. Am I thinking through this correctly?

dunlapww
Автор

Nice video. But i have question. What if customerId was a varchar do we still need to have surrogate key?can you please clarify?

umeshchanchlani
Автор

During regular load then dimensions needs to be loaded first then facts by look-up on dimensions for keys

Vikasptl
Автор

Good vídeo, I'm going to share this with my team! Also the new source system could have overlapping IDs (even thou it is varchar it can contain ints)

DanielRodriguesPT
Автор

Maybe i'm missing some thing, once there is a new line in the customer dimension table how can the fact table get the new CustomerSK?

maherkebaier
Автор

If the data you are processing has or could have more than one source, and you want this representation, the information you are providing, to be used by any other process, then an independent (surrogate) key is needed to represent the composite information.

If you were to represent a multi-source dataset with the key from one of its sources, you would be changing the meaning of the source keyfield and adding a processing dependency.

I would argue that every dataset should have an independent keyset because change is inevitable in real-world processes. Certainly, for repositories that are intended to be long-lived data warehouses, independent (surrogate) keys should be the default for the reasons illustrated here, among others.

optimalchoice
Автор

I don’t think customer id is a natural key. Natural keys are derived from the columns. If the id is chosen arbitrarily, then it’s synthetic

fb-guer