
Databricks Unity - Target environment

This article proposes a possible target environment on Databricks using the Unity Catalog.

Installation and configuration of the target environment are not part of biGENIUS support.

Unfortunately, we cannot provide any help beyond the example described in this article.

Many other configurations and installations are possible for a Databricks target environment.

Below is a possible target environment setup for the Databricks generator using the Unity Catalog.

The Unity Catalog Property must be set to true:

Setup environment

The Databricks target environment needs at least the following Azure resources set in your Azure Subscription:

  • A Resource Group: some help is available here
  • Inside the Resource Group:
    • 2 Storage Accounts: some help is available here
      • 1 for the Source data (if you are using files as source)
      • 1 for the Target Data Lake
    • An Azure Databricks Service: some help is available here
    • An Access Connector for Azure Databricks: some help is available here

In the following example of the target environment, we will use the following resources:

  • A Resource Group named bg-databricks
  • A Storage Account for our Source Data named bgdatabrickslandingzone1
  • A Storage Account for our Target Data Lake named bgdatabricksdatalake1
  • An Azure Databricks Service named bgaas-unity
  • An Access Connector for Azure Databricks named bgaasunity
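If you prefer to script this part of the setup instead of using the Azure portal, the Resource Group and the two Storage Accounts can also be created with the Azure SDK for Python. The following is only a minimal sketch under assumptions of our own (the westeurope region, the Standard_LRS SKU, and the subscription ID placeholder are not part of this article); the Azure Databricks Service and the Access Connector are still created through the portal:

# Minimal sketch: create the Resource Group and the two Storage Accounts with the Azure SDK for Python.
# Assumptions: azure-identity, azure-mgmt-resource, and azure-mgmt-storage are installed and you are
# logged in to Azure (for example with "az login"); adjust region, SKU, and names to your environment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

subscription_id = "<your-subscription-id>"  # placeholder: your Azure subscription ID
credential = DefaultAzureCredential()

# Resource Group
resource_client = ResourceManagementClient(credential, subscription_id)
resource_client.resource_groups.create_or_update("bg-databricks", {"location": "westeurope"})

# Storage Accounts (hierarchical namespace enabled so they can be used as ADLS Gen2 / Data Lake storage)
storage_client = StorageManagementClient(credential, subscription_id)
for account_name in ("bgdatabrickslandingzone1", "bgdatabricksdatalake1"):
    storage_client.storage_accounts.begin_create(
        "bg-databricks",
        account_name,
        StorageAccountCreateParameters(
            location="westeurope",
            sku=Sku(name="Standard_LRS"),
            kind="StorageV2",
            is_hns_enabled=True,
        ),
    ).result()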

Tools to install

Please install the following tools:

  • Azure Storage Explorer: available here

Target Storage Account

We have chosen to create folders named silver and gold in our Target Storage Account:

  • Open Azure Storage Explorer
  • Connect to your Subscription
  • Open the Target Storage Account
  • Create the folders 

For this example, we have 2 Target folders for our Data Lake:

Source data

There are three ways to provide source data to a Databricks generator:

  • From Parquet and Delta files that exist in a Unity Catalog by using a direct Discovery
    • In a dedicated Project using the Generator Configuration Databricks Data Store, Data Vault, Data Vault and Mart, Dimensional, Dimensional and Mart, or Data Mart
  • From Parquet and Delta files by using the Databricks Stage Files generator as a Linked Project
    • In a dedicated Project using the Generator Configuration Databricks Stage Files
  • From any database accessed through JDBC by using the Databricks Stage JDBC generator as a Linked Project
    • In a dedicated Project using the Generator Configuration Databricks Stage JDBC

Parquet and Delta Files

If your source data is stored in Parquet or Delta files, please:

  • Create a first Project with the Databricks Stage Files generator
  • In this first Project, discover the Parquet and Delta files, create the Stage Model Object, generate, deploy, and load data in the target storage account.
  • Create a second Project with the Databricks Data Store, Data Vault, Data Vault and Mart, Dimensional, Dimensional and Mart, or Data Mart generator.
  • In this second Project, use the first Project Stage Model Object as a source by using the Linked Project feature.

 

Please upload the source Parquet files to the Source Storage Account:

  • Open Azure Storage Explorer
  • Connect to your Subscription
  • Open the Source Storage Account
  • Create one folder per Parquet Source file

The folder name should be identical to the Parquet file name.

For this example, we have 2 Parquet source files, so we need 2 folders:

Upload in each folder the corresponding Parquet Source file, for example, for the SalesOrderDetail folder:

We have the following Delta File containing Credit Card data in our Source Storage Account named bgdatabrickslandingzone1:
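Once the Parquet and Delta files are in place, a quick read from a Databricks notebook (as soon as your cluster can access the Storage Account, see the cluster configuration below) confirms they are reachable. This is only a sketch with assumed names: the container landingzone and the CreditCard folder are not defined in this article, so adapt the paths to your own container and folder names:

# Minimal check (PySpark) that the uploaded source files are readable from Databricks.
# Assumption: a container named "landingzone" in bgdatabrickslandingzone1 and one folder per file.
base = "abfss://landingzone@bgdatabrickslandingzone1.dfs.core.windows.net"

sales_order_detail = spark.read.parquet(f"{base}/SalesOrderDetail/")   # Parquet source
credit_card = spark.read.format("delta").load(f"{base}/CreditCard/")   # Delta source

display(sales_order_detail.limit(10))
display(credit_card.limit(10))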

Database

If your source data is stored in a database such as Microsoft SQL Server or Postgres (or any database you can access through JDBC), please:

  • Create a first Project with the Databricks Stage JDBC generator
  • In this first Project, discover the database tables, create the Stage Model Object, generate, deploy, and load data in the target storage account.
  • Create a second Project with the Databricks Data Store, Data Vault, Data Vault and Mart, Dimensional, Dimensional and Mart, or Data Mart generator.
  • In this second Project, use the first Project Stage Model Object as a source by using the Linked Project feature.

 

The source data come from a JDBC source.

In this example, we will use a Microsoft SQL Server database stored in Azure in a dedicated resource group:

The Azure database is AdventureWorks2019 and contains the data from the SQL Server sample database AdventureWorks2019.

To be able to access the Microsoft SQL Server from Databricks, you should check the box Allow Azure services and resources to access this server in the Server Networking configuration:
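Once networking is configured, you can validate the connection with a direct JDBC read from a Databricks notebook. The sketch below uses assumed placeholders (server name and credential secrets) together with the Sales.CreditCard table of AdventureWorks2019; adapt it to your environment:

# Minimal sketch: read one AdventureWorks2019 table over JDBC from a Databricks notebook.
# Assumptions: <your_server> is your Azure SQL server name and the SQL credentials are stored
# in a Databricks secret scope; the SQL Server JDBC driver bundled with the runtime is used.
jdbc_url = (
    "jdbc:sqlserver://<your_server>.database.windows.net:1433;"
    "database=AdventureWorks2019;encrypt=true"
)

credit_card = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "Sales.CreditCard")
    .option("user", dbutils.secrets.get("<your_scope>", "<sql_user_secret>"))
    .option("password", dbutils.secrets.get("<your_scope>", "<sql_password_secret>"))
    .load()
)

display(credit_card.limit(10))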

Unity Catalog

We must prepare the database and schemas in our Unity Catalog.

For this example, we created:

  • A database named datalakehouse
  • A schema per layer for each Project (ex.: docu_rawvault, docu_businessvault, and docu_mart for a Data Vault Project and docu_stage for a Stage File Linked Project)
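These objects can be created from a notebook or the SQL editor. The following is a minimal sketch using the names of this example (in Unity Catalog terms, the database is created as a catalog); depending on your metastore setup, the CREATE CATALOG statement may additionally require a MANAGED LOCATION clause:

# Minimal sketch: create the catalog and the per-layer schemas used in this example.
# Run on a Unity Catalog-enabled cluster or SQL warehouse with the required privileges.
spark.sql("CREATE CATALOG IF NOT EXISTS datalakehouse")

for schema in ("docu_stage", "docu_rawvault", "docu_businessvault", "docu_mart"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS datalakehouse.{schema}")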

Create a Compute Cluster

To be able to execute the Notebooks from our generated artifacts, please create a Compute Cluster in Databricks:

  • Click on the Compute menu on the left-hand side:
  • Click on the Create compute button:
  • Change the following information:
    • Check "Single node"
    • Databricks runtime version: choose "13.3 LTS (Scala 2.12, Spark 3.4.1)"
    • Node type: choose "Standard_DS3_v2"
    • In Advanced Options, Spark, insert all the configuration needed to authorize your cluster to access the Source and Target Storage Accounts. The following is an example from our environment (it will not work unchanged in yours):
      • Replace <your_scope> with the name of the secret scope storing the secrets
      • Replace <your_secret> with the name of your secret
spark.master local[*, 4]
spark.databricks.cluster.profile singleNode

spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id {{secrets/<your_scope>/<your_secret>}}
spark.hadoop.fs.azure.account.oauth2.client.secret {{secrets/<your_scope>/<your_secret>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint {{secrets/<your_scope>/<your_secret>}}

fs.azure.account.auth.type.bgdatabrickslandingzone1.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.bgdatabrickslandingzone1.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.bgdatabrickslandingzone1.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}
fs.azure.account.oauth2.client.secret.bgdatabrickslandingzone1.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}
fs.azure.account.oauth2.client.endpoint.bgdatabrickslandingzone1.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}

fs.azure.account.auth.type.bgdatabricksdatalake1.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.bgdatabricksdatalake1.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.bgdatabricksdatalake1.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}
fs.azure.account.oauth2.client.secret.bgdatabricksdatalake1.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}
fs.azure.account.oauth2.client.endpoint.bgdatabricksdatalake1.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}

fs.azure.account.auth.type.bfmsampledata.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.bfmsampledata.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.bfmsampledata.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}
fs.azure.account.oauth2.client.secret.bfmsampledata.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}
fs.azure.account.oauth2.client.endpoint.bfmsampledata.dfs.core.windows.net {{secrets/<your_scope>/<your_secret>}}

 

  • If you want to activate the logging:
    • Download the following script: log4j_init_script.sh
    • Import this file in your Databricks workspace (near the artifacts folder)
    • In Advanced Options, Init Scripts, select this file from your workspace:
    • The script is then executed when the cluster starts up.
  • Click on the Create compute button:
  • Wait until the Cluster is available:
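Once the cluster is available, a short notebook check confirms that the secret scope is readable and that both Storage Accounts are reachable through the Spark configuration above. This is a sketch only; the container names landingzone and datalake are assumptions to replace with your own:

# Minimal sketch: verify secret and storage access from a notebook attached to the new cluster.
# Assumptions: container names "landingzone" and "datalake"; replace <your_scope> and <your_secret>.
dbutils.secrets.get(scope="<your_scope>", key="<your_secret>")  # fails immediately if the scope or secret is wrong

display(dbutils.fs.ls("abfss://landingzone@bgdatabrickslandingzone1.dfs.core.windows.net/"))
display(dbutils.fs.ls("abfss://datalake@bgdatabricksdatalake1.dfs.core.windows.net/"))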

Upload Artifacts in Databricks

If your source data comes from a JDBC source, you should add the source schema name in front of the source Model Object name in the biGENIUS-X Dataflow Modeling overview.

For example, transform CreditCard into Sales.CreditCard.

There are two ways to upload the generated artifacts: with the deploy.ps1 helper or manually, both described below.

Please replace the placeholders before uploading the artifacts.

Upload with the deploy.ps1 helper

To use the deploy.ps1 helper to upload the generated artifacts:

  • Install the Databricks CLI
  • You will need the following information:
    • Profile: choose a Databricks CLI profile name
    • Workspace path: 
      • This depends on the location where you want to upload the artifacts
      • For example, we want to deploy them in an Artifacts folder inside our Databricks user
      • To have the path, open the Artifacts folder, click on the 3 dots icon and choose the option Copy URL/path > Full path
    • Cluster name: compute name you created in the previous chapter
    • Host: the URL used to access your Databricks workspace, such as https://XXXX.cloud.databricks.com
  • Open PowerShell and execute the deploy.ps1 script:
    • Enter the profile name:
    • Enter the workspace path:
    • Enter the cluster name:
    • If you have never used the profile before, enter the host:
    • You will be authenticated in the Browser:
    • All the artifacts are uploaded:

Manual upload

To manually upload the generated Artifacts from the biGENIUS-X application to the Databricks Workspace:

  • Click on the Azure Databricks Service in Azure:
  • Then click on the URL:
  • Databricks is opened:
  • Click on the Workspace menu on the left-hand side:
  • Expand the folders Workspace > Users and click on your user:
  • We have chosen to create a folder named Artifacts:

  • Import all the generated artifacts from the Jupyter, Helpers, and LoadControl folders:

It is possible that one or several files are not imported, with an error like the following:

This is caused by Databricks itself.

Simply restart the import for the affected files, and it should then work.
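If the manual import becomes tedious, the same upload can also be scripted with the Databricks SDK for Python. This is an alternative sketch and not part of the generated helpers; the profile name, the local artifacts folder, and the target workspace path are assumptions to adapt:

# Minimal sketch: upload all generated .ipynb artifacts with the Databricks SDK for Python.
# Assumptions: "pip install databricks-sdk", an authenticated CLI profile, a local "artifacts"
# folder containing the generated notebooks, and an Artifacts folder under your Databricks user.
from pathlib import Path

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat

w = WorkspaceClient(profile="<your_profile>")
target = "/Users/<your_user>/Artifacts"

for notebook in Path("artifacts").rglob("*.ipynb"):
    with notebook.open("rb") as source:
        w.workspace.upload(
            f"{target}/{notebook.stem}",
            source,
            format=ImportFormat.JUPYTER,
            overwrite=True,
        )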

Upload the global helper

In the file 500_Deploy_and_Load_Databricks.ipynb, adapt the names XXX_Deployment_LoadControl, XXX_Deployment, XXX_MultithreadingLoadExecution.ipynb, and XXX_SelectResults.ipynb to match the names of your Helper files.

 

If you have already discovered your source data, modeled your project, and generated the artifacts, you are now ready to replace the placeholders in your generated artifacts, deploy them, and then load the data with the following possible load controls, depending on the Generator you are using: