Advancing Spark - External Tables with Unity Catalog

A very common pattern is for companies to have many different lakes, whether as part of a mesh, or the simple realities of large companies. But with Unity Catalog expecting a single lake for each metastore, how do we manage access to other cloud stores?

In this video, Simon looks at the setup of storage credentials and managed identities, followed by the creation of external, unmanaged tables.
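
As a rough idea of what that looks like in SQL (a minimal sketch rather than the exact commands from the video; the credential, location, container, storage account and table names below are all placeholders, and the storage credential built from the access connector's managed identity is assumed to already exist):

-- Register an external location pointing at the other lake, secured by a managed-identity credential
CREATE EXTERNAL LOCATION IF NOT EXISTS other_lake_landing
URL 'abfss://landing@otherlakeaccount.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL other_lake_credential);

-- Create an external (unmanaged) table whose data lives under that location
CREATE TABLE IF NOT EXISTS main.bronze.sales_external (
  sale_id INT,
  amount  DOUBLE
)
LOCATION 'abfss://landing@otherlakeaccount.dfs.core.windows.net/sales';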

As always, if you're setting up a Lakehouse and need a helping hand, get in touch with Advancing Analytics
Comments

Creating tables as external has saved me many times when deleting tables by mistake. Having that extra step of having to delete the data in storage isn't so bad if the data is important and hard to recover. Then, if you mess up, recreating the external table is very easy. Obviously, if you're in a mature organization where you don't do anything manual in prod, it's not as much of an issue.
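
(As an illustration of that recovery path: if the table definition is dropped but the Delta files are still in storage, re-registering the table is a single statement. The table name and abfss path below are placeholders.)

-- Re-create the dropped external table over the existing Delta files at that path
CREATE TABLE IF NOT EXISTS main.silver.orders
LOCATION 'abfss://data@mylakeaccount.dfs.core.windows.net/silver/orders';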

alexischicoine

Great video!

One suggestion - each external location can/should be used to create many external tables.

Register an external location on a parent folder of your ADLS account, and when you create external tables in child directories, Unity will automatically figure out that you have access to do that!
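
(A sketch of that pattern, with placeholder names and paths: one external location on the parent folder, then tables created in child paths beneath it.)

-- One external location covering the parent folder
CREATE EXTERNAL LOCATION IF NOT EXISTS lake_root
URL 'abfss://data@mylakeaccount.dfs.core.windows.net/lake/'
WITH (STORAGE CREDENTIAL lake_credential);

-- External tables in child directories are authorised through that parent location
CREATE TABLE main.bronze.customers (id INT, name STRING)
LOCATION 'abfss://data@mylakeaccount.dfs.core.windows.net/lake/bronze/customers';

CREATE TABLE main.bronze.orders (order_id INT, customer_id INT)
LOCATION 'abfss://data@mylakeaccount.dfs.core.windows.net/lake/bronze/orders';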

paulroome

I came here looking for tips on how to build a table within the unity game engine. Not what I asked for but a great video nonetheless!

AyyyyyyyyyLmao

Any suggestions as to when to use managed and external tables? Would it be a good idea to use managed for the Bronze/Silver layers and external for the Gold layer?

ShravanKumar-yvbe

In the articles we are not seeing how to update external locations in existing workflows after enabling Unity Catalog. We cannot use DBFS per the recommendations; we need to use external locations. How do we update existing code to point to external locations? We will use the upgrade option in Unity Catalog to migrate external tables, but how do we update workflows to point to external locations?
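
(For illustration only: one common code change is simply swapping the mount-style path for the abfss:// path that a registered external location covers. The paths below are placeholder assumptions, not a migration recipe.)

-- Before UC: reading via a DBFS mount
-- SELECT * FROM delta.`dbfs:/mnt/datalake/bronze/orders`;

-- After UC: the same folder, referenced through its external location path
SELECT * FROM delta.`abfss://data@mylakeaccount.dfs.core.windows.net/bronze/orders`;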

KarthikPaladugu-kzrt

Great videos!
I am relatively new to Databricks and even more so to UC, so your videos have been a really great help! I am interested in implementing it in our project for a client, just to get the permissions and governance stuff out of the way.

But what exactly is the best practice for storing these tables? Is it really better to save them as external tables rather than managed? I was told on the forums that UC would handle the file saves in ADLS plus the actual table registration in the metastore. Yet, by default, it is still a managed table.
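
(To make the default concrete: without a LOCATION clause, UC creates a managed table in its own managed storage and cleans up the files when the table is dropped; adding LOCATION is what makes it external. Placeholder names and paths below.)

-- Managed: UC decides where the files live
CREATE TABLE main.silver.customers_managed (id INT, name STRING);

-- External: you choose the path, and the files survive DROP TABLE
CREATE TABLE main.silver.customers_external (id INT, name STRING)
LOCATION 'abfss://data@mylakeaccount.dfs.core.windows.net/silver/customers';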

RaiOkami

Is there any video on how to set up Unity Catalog?

yashgoel

Hi, late to the party, but I have a question: can I have read-only access to the storage account with the data, so that no one can modify the prod data?

In other words, where is the metadata of the external table saved?
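
(On the read-only part: at the Unity Catalog layer, one way to keep prod data read-only for a group is to grant only read privileges; a minimal sketch with placeholder principal and object names.)

-- Table-level: query but not modify
GRANT SELECT ON TABLE main.gold.sales TO `analysts`;

-- Path-level: list and read files in the external location, but not write them
GRANT READ FILES ON EXTERNAL LOCATION prod_lake TO `analysts`;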

dofa

Is there a solution/update to the issue around 15:00? To me that seems like a deal breaker. I want to expose the same external table to many curated catalogs. Do managed tables have the same limit?

BritonWells

Can you create persistent views from external tables within Unity Catalog?
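
(For what it's worth, a Unity Catalog view over an external table looks the same as one over a managed table; placeholder names below.)

-- Persistent view in Unity Catalog, defined over an external table
CREATE VIEW main.gold.recent_sales AS
SELECT sale_id, amount
FROM main.bronze.sales_external
WHERE amount > 0;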

palanithangaraj

Can you please help me create a metastore, catalog, and table automatically using Python or PowerShell?
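
(The metastore itself is normally created through the account console or account-level APIs rather than plain SQL, but catalogs, schemas and tables can be scripted as SQL statements and run from whatever automation you prefer; a hedged sketch with placeholder names.)

CREATE CATALOG IF NOT EXISTS finance;
CREATE SCHEMA IF NOT EXISTS finance.bronze;
CREATE TABLE IF NOT EXISTS finance.bronze.invoices (invoice_id INT, amount DOUBLE);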

majetisaisowmya

Always love your videos…a couple of questions:

1. How can I provide an external path using Scala saveAsTable() for UC?
2. Wouldn't the use of external tables limit the ability to get and use lineage tracking if you load data from and then save data to external locations (unmanaged tables)?
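
(On the first question: the SQL-side equivalent of writing out to an explicit external path is a CTAS with a LOCATION clause, sketched below with placeholder names; whether the Scala writer behaves identically under UC is worth checking against the docs for your runtime.)

-- Create an external table at an explicit path from a query result
CREATE TABLE main.gold.daily_sales
LOCATION 'abfss://data@mylakeaccount.dfs.core.windows.net/gold/daily_sales'
AS SELECT sale_date, SUM(amount) AS total FROM main.silver.sales GROUP BY sale_date;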

alexanderowens

Is this managed identity for all the workspaces I have in my tenant, or is it for one specific one? If it's the latter, how do we know which workspace an MI belongs to?

rakeshprasad

Really a great video. I'm new to Databricks Unity Catalog and I tried to replicate these steps, but I still get the error "Error in SQL statement: UnauthorizedAccessException: PERMISSION_DENIED: request not authorized"

It seems to me I did whatever I had to do:

I created a Databricks access connector in Azure (which becomes a managed identity)
I created an ADLS Gen2 storage account (data lake with hierarchical namespace) plus a container
On my data lake container I assigned the Storage Blob Data Contributor role to the managed identity above
I created a new Databricks Premium workspace
I created a new metastore in Unity Catalog that "binds" the access connector to the data lake
I bound the metastore to the premium Databricks workspace
I gave my Databricks user Admin permission on the above Databricks workspace
I created a new cluster in the same premium workspace, choosing runtime 11.1 and "single user" access mode
I ran the notebook, which correctly created a new catalog, assigned proper rights to it, created a schema, and confirmed that I am the owner of that schema
The only (but most important) SQL command of the same notebook that fails is the one that tries to create a managed Delta table and insert two records:

CREATE TABLE IF NOT EXISTS
(columnA Int, columnB String) PARTITIONED BY (columnA);
When I run it, it starts working and in fact it starts creating the folder structure for this Delta table in my storage account.

However, it then fails with the following error:

Failed to acquire a SAS token for list on due to PERMISSION_DENIED: request not authorized
Please consider that I didn't have any folder created under the "unity-catalog" container before running the table creation command. So it seems that it can successfully create the folder structure, but after it creates the "table" folder, it can't acquire the SAS token.

So I don't understand, since I am an admin in this workspace, the Databricks managed identity is assigned the contributor role on the storage container, and Databricks actually starts creating the other folders. What else should I configure?
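
(Not a definitive fix, but when chasing this kind of PERMISSION_DENIED it can help to confirm from SQL what the metastore actually knows about the storage path; a small diagnostic sketch, assuming your Databricks SQL version supports these commands, with a placeholder storage account name.)

-- Which external locations does the metastore know about?
SHOW EXTERNAL LOCATIONS;

-- Can this workspace actually list the path the table is being created under?
LIST 'abfss://unity-catalog@mystorageaccount.dfs.core.windows.net/';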

mauromi

The only problem with abfss is that Python-only code like pandas and open() doesn't work with that path. We are currently migrating from mounting the storage account to abfss, and we ran into this limitation.

blackdeckerzr