Databricks - Target environment

This article proposes a possible target environment on Databricks.

Installation and configuration of the target environment are not part of biGENIUS support.

Unfortunately, we are not able to provide help beyond the example described in this article.

Many other configurations and installations are possible for a Databricks target environment.

Below is a possible target environment setup for a Databricks generator.

Setup environment

The Databricks target environment needs at least the following Azure resources set up in your Azure Subscription:

  • A Resource Group: some help is available here
  • Inside the Resource Group:
    • 2 Storage Accounts: some help is available here
      • 1 for the Source data
      • 1 for the Target Data Lake
    • An Azure Databricks Service: some help is available here

In this example of the target environment, we will use the following resources:

  • A Resource Group named bg-databricks
  • A Storage Account for our Source Data named bgdatabrickslandingzone1
  • A Storage Account for our Target Data Lake named bgdatabricksdatalake1
  • An Azure Databricks Service named bgaas-testing
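
If you prefer to script this setup, the following sketch creates the Resource Group and the two Storage Accounts with the Azure SDK for Python. It is only a minimal example under assumptions: the region, SKU, and authentication method are placeholders, and it is not part of biGENIUS support.

# Minimal sketch: create the example resources with the Azure SDK for Python.
# The region, SKU, and DefaultAzureCredential authentication are assumptions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<your-subscription-id>"  # placeholder
location = "westeurope"                     # assumption: choose your own region
credential = DefaultAzureCredential()

# Resource Group bg-databricks
resource_client = ResourceManagementClient(credential, subscription_id)
resource_client.resource_groups.create_or_update("bg-databricks", {"location": location})

# The two Storage Accounts (hierarchical namespace enabled for Data Lake Gen2)
storage_client = StorageManagementClient(credential, subscription_id)
for account_name in ("bgdatabrickslandingzone1", "bgdatabricksdatalake1"):
    storage_client.storage_accounts.begin_create(
        "bg-databricks",
        account_name,
        StorageAccountCreateParameters(
            sku=Sku(name="Standard_LRS"),  # assumption: pick the SKU you need
            kind="StorageV2",
            location=location,
            is_hns_enabled=True,
        ),
    ).result()

# The Azure Databricks Service (bgaas-testing) can be created in the Azure portal,
# or scripted with the azure-mgmt-databricks package if preferred.

The hierarchical namespace is enabled on both Storage Accounts because the abfss:// URIs used later in this article require Data Lake Storage Gen2.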

Tools to install

Please install the following tools:

  • Azure Storage Explorer: available here

Target Storage Account

We have chosen to create a folder named claire-datalake in our Target Storage Account:

  • Open Azure Storage Explorer
  • Connect to your Subscription
  • Open the Target Storage Account
  • Create a folder named claire-datalake

For this example, we have one Target folder for our Data Lake:
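
If you prefer to script this step, the following sketch creates the same claire-datalake entry with the azure-storage-file-datalake package. Note that claire-datalake is used later in this article as the container part of the abfss:// path, so it is created here as a file system (container); the authentication method is an assumption.

# Minimal sketch: create the claire-datalake container (file system) in the Target Storage Account.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url="https://bgdatabricksdatalake1.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Equivalent to creating the folder at the root of the Target Storage Account in Storage Explorer
service_client.create_file_system(file_system="claire-datalake")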

Source data

There are three ways to provide source data to the Databricks Data Vault or Data Vault and Mart generators:

  • From Parquet files by using the Databricks Stage Files generator as a Linked Project
  • From any database accessed through JDBC by using the Databricks Stage JDBC generator as a Linked Project
  • From existing Delta files by using a direct Discovery to a Spark External Table that contains the Delta file data

Parquet Files

If your source data are stored in Parquet files, please:

  • Create a first Project with the Databricks Stage Files generator
  • In this first Project, discover the Parquet files, create the Stage Model Object, generate, deploy, and load data in the target storage account.
  • Create a second Project with the Databricks Data Vault or Databricks Data Vault and Mart generator.
  • In this second Project, use the first Project Stage Model Object as a source by using the Linked Project feature.
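
Once the Parquet files are in place, you can quickly check from a Databricks notebook that they are readable from the Source Storage Account. This is only a sketch: the container name and folder are placeholders, and the cluster must already be authorized to access the Storage Account (see the Compute Cluster section below).

# Sketch: verify that the source Parquet files are accessible from Databricks.
# "landingzone" and "creditcard/" are placeholder container and folder names.
df = spark.read.parquet(
    "abfss://landingzone@bgdatabrickslandingzone1.dfs.core.windows.net/creditcard/"
)
df.printSchema()
df.show(10)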

Database

If your source data are stored in a database such as Microsoft SQL Server or Postgres (or any database you can access through JDBC), please:

  • Create a first Project with the Databricks Stage JDBC generator
  • In this first Project, discover the database tables, create the Stage Model Object, generate, deploy, and load data in the target storage account.
  • Create a second Project with the Databricks Data Vault or Databricks Data Vault and Mart generator
  • In this second Project, use the first Project Stage Model Object as a source by using the Linked Project feature.
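
Before running the first Project, you can check from a Databricks notebook that the database is reachable through JDBC. This is only a sketch: the server, database, table, user, and secret names are placeholders, and storing the password in a Databricks secret scope is an assumption rather than a requirement.

# Sketch: read one source table through JDBC to verify connectivity.
# The connection details and the secret scope/name are placeholders.
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-database>"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.CreditCard")  # placeholder table name
    .option("user", "<your-user>")
    .option("password", dbutils.secrets.get("<your_scope>", "<your_secret>"))
    .load()
)
df.show(10)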

Delta Files

You may have existing Delta Files in your target storage account that you want to use as source data.

You should manually create a Spark External Table in your target storage account containing the Delta File data.

For example, we have the following Delta File containing Credit Card data in our target storage account named bgdatabricksdatalake1:

We will create a Spark External table named creditcard_delta in the database stage with the following code:

# Replace claire-datalake@bgdatabricksdatalake1 with your target container and storage account name
# Replace delta/creditcard/ with the path where the Delta file is stored

df = spark.sql("""
  CREATE EXTERNAL TABLE stage.creditcard_delta
    USING DELTA
    LOCATION 'abfss://claire-datalake@bgdatabricksdatalake1.dfs.core.windows.net/delta/creditcard/'
""")

Then, create a Discovery in the Databricks Data Vault or Data Vault and Mart Project using a JSON discovery file generated by the Discovery Companion Application with a Delta file.

Upload Artifacts in Databricks

Please now upload the generated Artifacts from the biGENIUS-X application to the Databricks Workspace.

Please replace the placeholders before uploading the artifacts.

  • Click on the Azure Databricks Service in Azure:
  • Then click on the URL:
  • Databricks is opened:
  • Click on the Workspace menu on the left-hand side:
  • Expand the folders Workspace > Users and click on your user:
  • We have chosen to create a folder named Artifacts:

  • Import all the generated artifacts from the Jupyter and Helpers folders:

It is possible that one or several files are not imported, as shown below:

This is due to Databricks itself.

Just restart the import for the affected files, and it should work.

In the file 500_Deploy_and_Load_DataVault_Databricks.ipynb, adapt the names of XXX_Deployment, XXX_SimpleLoadexecution.ipynb, and XXX_SelectResults.ipynb to match the names of your Helper files.

Create a Compute Cluster

To execute the Notebooks from our generated artifacts, please create a Compute Cluster in Databricks:

  • Click on the Compute menu on the left-hand side:
  • Click on the Create compute button:
  • Change the following information:
    • Check "Single node"
    • Databricks runtime version: choose "13.3 LTS (Scala 2.12, Spark 3.4.1)"
    • In Advanced Options, Spark, insert all the configuration needed to authorize your cluster to access the Source and Target Storage Accounts. The following is an example from our environment (it will not work as-is in yours; a notebook-level alternative is sketched after this list):
      • Replace <your_scope> with the name of your scope storing the secret
      • Replace <your_secret> with your secret's name
    spark.databricks.cluster.profile singleNode
    spark.master local[*, 4]
    spark.hadoop.fs.azure.account.auth.type OAuth
    spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    spark.hadoop.fs.azure.account.oauth2.client.id
    spark.hadoop.fs.azure.account.oauth2.client.secret
    spark.hadoop.fs.azure.account.oauth2.client.endpoint <your_secret>
    fs.azure.account.auth.type.bgdatabrickslandingzone1.dfs.core.windows.net OAuth
    fs.azure.account.oauth.provider.type.bgdatabrickslandingzone1.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.id.bgdatabrickslandingzone1.dfs.core.windows.net
    fs.azure.account.oauth2.client.secret.bgdatabrickslandingzone1.dfs.core.windows.net
    fs.azure.account.oauth2.client.endpoint.bgdatabrickslandingzone1.dfs.core.windows.net <your_secret>
    fs.azure.account.auth.type.bgdatabricksdatalake1.dfs.core.windows.net OAuth
    fs.azure.account.oauth.provider.type.bgdatabricksdatalake1.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.id.bgdatabricksdatalake1.dfs.core.windows.net
    fs.azure.account.oauth2.client.secret.bgdatabricksdatalake1.dfs.core.windows.net
    fs.azure.account.oauth2.client.endpoint.bgdatabricksdatalake1.dfs.core.windows.net <your_secret>
    fs.azure.account.auth.type.bfmsampledata.dfs.core.windows.net OAuth
    fs.azure.account.oauth.provider.type.bfmsampledata.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.id.bfmsampledata.dfs.core.windows.net
    fs.azure.account.oauth2.client.secret.bfmsampledata.dfs.core.windows.net
    fs.azure.account.oauth2.client.endpoint.bfmsampledata.dfs.core.windows.net <your_secret>
  • If you want to activate the logging:
    • Download the following script: log4j_init_script.sh
    • Import this file in your Databricks workspace (near the artifacts folder)
    • In Advanced Options, Init Scripts, select this file from your workspace:
    • It is then applied at cluster startup.
  • Click on the Create compute button:
  • Wait until the Cluster is available:
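
As an alternative to the cluster-level Spark configuration above, the same access can be granted from a notebook with spark.conf.set, reading the service principal credentials from a Databricks secret scope. This is only a sketch under assumptions: the secret names, the tenant id, and the use of the standard Microsoft Entra ID token endpoint are placeholders for your own environment.

# Sketch: authorize access to one Storage Account from a notebook instead of the cluster config.
# <your_scope>, the secret names, and <your-tenant-id> are placeholders.
account = "bgdatabricksdatalake1"  # repeat for each Storage Account you need to access

client_id = dbutils.secrets.get("<your_scope>", "<client-id-secret-name>")
client_secret = dbutils.secrets.get("<your_scope>", "<your_secret>")
tenant_id = "<your-tenant-id>"

spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")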

You're now ready to deploy these artifacts and subsequently load the data based on the Generator you are using.