Advancing Spark - Setting up Databricks Unity Catalog Environments

Unity Catalog is a huge part of Databricks' platform; if you're currently using Databricks but don't have UC enabled, you're going to miss out on some pretty huge features in the future! But where do you start? Do you really need a Global AD Admin, and what do they actually have to do? How do you manage dev, test and prod if they all have to share the same metastore?!

In this video Simon takes two existing Databricks workspaces and builds out Unity Catalog from scratch: provisioning account console access, creating a metastore, allocating access for the managed identity, and locking down catalogs to their respective workspaces. Want to get started with UC? Check this video out!
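That last lockdown step can also be scripted against the Unity Catalog REST API rather than clicked through. A rough sketch, assuming the workspace-bindings endpoint; the workspace URL, token, and IDs are hypothetical placeholders:

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1. Flip the catalog from OPEN (visible to every workspace) to ISOLATED
requests.patch(
    f"{HOST}/api/2.1/unity-catalog/catalogs/dev_catalog",
    headers=HEADERS,
    json={"isolation_mode": "ISOLATED"},
).raise_for_status()

# 2. Bind the isolated catalog to the dev workspace only
requests.patch(
    f"{HOST}/api/2.1/unity-catalog/workspace-bindings/catalogs/dev_catalog",
    headers=HEADERS,
    json={"assign_workspaces": [1234567890123456]},  # hypothetical workspace ID
).raise_for_status()
```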

Comments

Awesome, thanks buddy! I was a victim of the never-ending loop; after setting up the account admin role I was able to enable Unity Catalog.

VikramMishra-lb

Good explanation!! I was missing the Global AD access, which many other YouTube videos did not explain. Thank you!

knowwhyt

Thank you for the clear and practical videos! So far they've been helping me build foundational knowledge in Databricks industry standards and how things are/should be done. Looking forward to watching more of your content!

MohammadAwad-qv

Hi Simon, your videos are always on point and something tangible that I have always found very useful. Just one thing: since you have already assigned the Storage Blob Data Contributor RBAC role on the lake for the access connector, you do not need ACLs at the container level. Had you not granted the RBAC role on the lake, you would have needed the ACLs.

saugatmukherjee
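For anyone scripting that role assignment, a sketch using the Azure SDK for Python (azure-identity plus azure-mgmt-authorization). The resource names and the connector's principal ID are placeholders, and the exact parameter shape varies a little between SDK versions:

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

SUB = "<subscription-id>"
client = AuthorizationManagementClient(DefaultAzureCredential(), SUB)

# Scope the grant to the whole storage account ("the lake"),
# so no container-level ACLs are needed
scope = (f"/subscriptions/{SUB}/resourceGroups/rg-lake"
         "/providers/Microsoft.Storage/storageAccounts/mydatalake")

# Storage Blob Data Contributor built-in role (this GUID is fixed across tenants)
role_def = (f"/subscriptions/{SUB}/providers/Microsoft.Authorization"
            "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe")

client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # role assignment names are arbitrary GUIDs
    {
        "role_definition_id": role_def,
        "principal_id": "<access-connector-managed-identity-object-id>",
        "principal_type": "ServicePrincipal",  # managed identities use this type
    },
)
```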

Incredible video Simon, thanks for making it simple and clear!

drummerboieva

Finally a good, solid video that explains it well. Thanks! I would love to see a follow-up where you actually land some data in Bronze and transform it to Silver in development. What does the data look like in the containers? How does the Catalog tab show the data in Databricks? How are they related? I want to know these things but can barely find anyone explaining them well. I basically want to build an enterprise lakehouse from scratch. Thanks!

julius
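In the meantime, a minimal sketch of that Bronze-to-Silver step, assuming a dev catalog with bronze and silver schemas already created; all table names and paths are hypothetical:

```python
from pyspark.sql import functions as F

# Land the raw files as-is into a Bronze managed table
raw = (spark.read.format("json")
       .load("abfss://landing@devlake.dfs.core.windows.net/orders/"))
raw.write.mode("append").saveAsTable("dev.bronze.orders")

# Conform into Silver: deduplicate and type the timestamp column
silver = (spark.table("dev.bronze.orders")
          .dropDuplicates(["order_id"])
          .withColumn("order_ts", F.to_timestamp("order_ts")))
silver.write.mode("overwrite").saveAsTable("dev.silver.orders")
```

In the Catalog tab these show up as dev > bronze > orders and dev > silver > orders, while in the storage container the managed tables are just GUID-named Delta folders under the metastore root, which is why the catalog, not the container, is the place to browse them.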

Hi Simon, why do we need the ACLs as well as the RBAC for the access connector on the metastore ADLS?

LS-rvlk

Thx!! This was just what I needed. And not the first time you read my mind 😂.
Keep up the good work. Very much appreciated!

fredrikovesson

Databricks should really come up with a way to enable Unity Catalog by default for a workspace without having to chase down Global Admins.

akhilannan

Great video, thank you! I believe the ACLs are not needed when you've assigned Storage Blob Data Contributor; I've been able to set it up successfully without them.

TeeVanBee

Thanks Simon.

A decision question: Databricks solution architects have recommended we use managed tables in UC, citing a lot of benefits built into UC for query optimizations, AI-driven optimizations, etc. But the idea of having external tables live in the Azure subscriptions of the various data-producing domain teams seems like the best option. How would you decide which option to use? Or can you do some combination of both?

IsmaelByrd
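The two flavors can coexist in one metastore, so it needn't be either/or. A sketch of both side by side; the catalog, schema, and path are hypothetical:

```python
# Managed: UC owns the files, which is what unlocks the
# platform-side optimization features
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (id INT, amount DOUBLE)
""")

# External: the files stay in the domain team's own storage account,
# registered through an external location that team controls
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders_ext (id INT, amount DOUBLE)
    LOCATION 'abfss://sales@domainlake.dfs.core.windows.net/orders'
""")
```

A common compromise is managed tables for the curated layers where the optimizations matter most, and external tables where a domain team must keep ownership of the files in its own subscription.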

I very much like the features that come with Unity Catalog, but at the same time I find it extremely challenging to implement in a big organization in its current form, due to the 1:1 relation to the AAD tenant. We have one AAD tenant used by multiple business groups that run multiple products. They are from different industries and have little to do with each other. I am an architect on one of those products. We have multiple environments with multiple lakes and Databricks workspaces. Sounds like a good use case for us, right? Well, not so fast.

There are organizational questions that are difficult to answer:
1) Who will be managing the "account"? Our AAD global admins know nothing about Databricks and they don't want to manage this stuff (give permissions, create catalogs, etc.). So it has to be delegated - but to whom? It could be me, but that means I would be able to control access to other business groups' catalogs. Will they agree to that? It also means I'll be dealing with their requests all the time. So some "company-wide Databricks admin" has to be nominated to manage all this stuff. Getting that done is not easy.

2) Who will be hosting and managing the metastore storage account and access connector? Since it's for the entire org, it falls into some "common infra / landing zone" bucket, usually managed by a central infra team. So you need to onboard them.
3) What about automation? I'd like to have an SPN that can, for instance, create catalogs, and use it for my CI/CD. But for now there are no granular permissions at the metastore level - either you are an admin or you are not. Having an "admin" SPN that can create and control access to all catalogs in the metastore (which may belong to multiple business groups) is not only close to impossible to get approved, it's also stupid.

All these problems come down to one question: why does this have to be tied to the AAD tenant? Or why can't we have multiple metastores per region, each product or product group having its own? Then everyone would take care of their own stuff and everyone would be happy!

hellhax
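On point 3, it's worth noting that metastore-level privileges have become somewhat more granular than admin-or-not; for example, CREATE CATALOG can be granted to a CI/CD service principal on its own. A sketch, run by a metastore admin, with a hypothetical application ID:

```python
# Grants only catalog creation, not control over existing catalogs
spark.sql("""
    GRANT CREATE CATALOG ON METASTORE
    TO `11111111-2222-3333-4444-555555555555`
""")
```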

Don't you think that having DEV, QA, and PROD data all in the same data lake could create performance issues? DEV, QA, and PROD data usually have different lifecycles and data volumes, and workspaces could be bound to different VNets, as could the ADLS accounts used as data lakes.
So I would be more comfortable having a separate ADLS data lake for every environment.
If we used external data lakes, would we still have all the features of Unity Catalog available?

wiwiwiii
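Separate lakes per environment sit fine under one metastore: external locations let each environment point at its own ADLS account. A sketch, with hypothetical credential and account names:

```python
# One external location per environment, each backed by its own lake
for env in ("dev", "qa", "prod"):
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {env}_lake
        URL 'abfss://data@{env}lake.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL {env}_access_connector)
    """)
```

Grants, lineage, and auditing work the same on external locations; historically only a handful of managed-table-specific optimizations are given up.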

Thanks Simon, engaging and crystal clear explanation as always!

I would like to ask you a question: how would you design workspaces/catalogs so that data is not replicated 1:1 across the DEV-STG-PRD environments, while still being able to handle the two following scenarios:
1 - A new requirement driven by a data consumer. For example, you need to develop a new gold table.
2 - A new requirement driven by a data producer. For example, you need to add one column to the bronze and silver tables.

For scenario 1, it could be nice to work in the DEV workspace but actually read prd_silver_table, in order to be sure the gold logic works properly (if not on 100% of the data, maybe just the data from the last month/week, with PII anonymized).

For scenario 2, since the column is new it is of course not present in PRD, so it is necessary to import new data from the producers into dev_bronze_table and then dev_silver_table in the DEV workspace. In this case, if you want to run a regression test on a gold table, and once again you would like to run it on a subset of PRD data, how would you approach it?

Thanks anyway for all the material!! :)

AlessandroGattolin
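One hedged sketch for scenario 1: grant the dev team read access to the prod silver objects, and optionally publish an anonymised, time-boxed view on the dev side. All group, table, and column names are hypothetical, and it assumes the prd catalog isn't isolation-bound away from the dev workspace:

```python
# Read-only access to prod silver from the dev workspace
spark.sql("GRANT USE CATALOG ON CATALOG prd TO `dev-engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA prd.silver TO `dev-engineers`")
spark.sql("GRANT SELECT ON TABLE prd.silver.sales TO `dev-engineers`")

# Optional: a reduced, anonymised slice exposed as a dev view
spark.sql("""
    CREATE OR REPLACE VIEW dev.silver.sales_sample AS
    SELECT sha2(customer_email, 256) AS customer_hash, order_ts, amount
    FROM prd.silver.sales
    WHERE order_ts >= current_date() - INTERVAL 30 DAYS
""")
```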

In the case of having 2 storage accounts, 1 for dev and 1 for prod, should I create 1 metastore in each storage account and assign the dev workspace to the dev metastore and the prod workspace to the prod metastore?

lucaslira

Hi Simon,
I'm having trouble sending data to Domo using the pydomo lib. As I am using a UC external location as the source path, os.listdir and the other os functions are not able to read files at the abfss path. Is there any solution to this?

yuvakarthiking
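The usual culprit: Python's os module only sees the driver's local filesystem, and abfss:// is not a POSIX path. Two common workarounds, sketched with hypothetical paths - copy the file to local disk first, or keep it in a UC Volume, whose /Volumes path is FUSE-mounted:

```python
import os

# 1. Copy the file down to local disk, then hand pydomo a normal path
dbutils.fs.cp(
    "abfss://exports@mylake.dfs.core.windows.net/out/report.csv",
    "file:/tmp/report.csv",
)
print(os.path.getsize("/tmp/report.csv"))  # now an ordinary local file

# 2. Or stage the files in a UC Volume; /Volumes/... behaves like a
# local directory, so os.listdir works on it directly
files = os.listdir("/Volumes/main/exports/outbound")
```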

I'm getting an Internal Server Error.

AbhishekYadav-oqp

Why on earth can you not add a group as account admin?

jacovangelder

Informative, but I wouldn't recommend anyone ever enable UC through the UI. Use proper IaC tooling like Terraform, or whatever you prefer.

fb-guer