Databricks - Load data - Apache Airflow Load Control - 1.7
Before loading the data from your Source System(s) into your Databricks Target System, make sure all prerequisite steps have been completed.
You are now ready to load the data with Apache Airflow.
Create the Databricks Jobs
Airflow will use Databricks Jobs to load the data.
biGENIUS-X artifacts contain configuration files to create these Jobs.
Download the following helper: deploy_jobs.ps1.
Copy it into your generated artifacts folder, next to the LoadControl folder:

Then:
- Open PowerShell
- Navigate to your generated artifacts folder:
-- Example
cd C:\XXXXXXXXX\20240219144515
- Execute the PowerShell helper script with the following parameters:
-- Check Apache Airflow - Load Control environment to find the host and the token
-- Replace <yourhost> with the Databricks host
-- Replace <yourtoken> with your PAT (personal access token)
.\deploy_jobs.ps1 -databricksAccountName <yourhost> -jobsFolderName LoadControl -databricksToken <yourtoken>
You should see the following result for each Job creation:
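
If you prefer not to use PowerShell, the Jobs can also be created directly through the Databricks Jobs API. The following Python sketch is only an illustrative equivalent of deploy_jobs.ps1; it assumes that each JSON file in the LoadControl folder holds one job definition in the format expected by the Jobs API 2.1 create endpoint (verify this against your generated artifacts) and that the placeholders are replaced with your own values.

import json
import pathlib
import requests

# Hypothetical values - take the host and the PAT from Apache Airflow - Load Control environment
databricks_host = "https://<yourhost>"
databricks_token = "<yourtoken>"

# Assumption: every *.json file in LoadControl holds the settings of one Databricks Job
for job_file in sorted(pathlib.Path("LoadControl").glob("*.json")):
    job_settings = json.loads(job_file.read_text(encoding="utf-8"))
    response = requests.post(
        f"{databricks_host}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {databricks_token}"},
        json=job_settings,
        timeout=60,
    )
    response.raise_for_status()
    print(f"{job_file.name}: created job_id={response.json()['job_id']}")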

Load the data with Apache Airflow
To load the data with Apache Airflow:
- Navigate to the Airflow homepage at http://localhost:8080

- Enter your credentials
- You should see the following result:

- Configure a connection to your target solution environment by following these steps (a scripted alternative is shown after them):
- Open the menu Admin > Connections

- Click on the plus icon

- Enter the connection details:

- Connection id: Must be databricks_default
- Connection Type: select Databricks
- Host: check Apache Airflow - Load Control environment to find it
- Login: Must be token (as we are using a PAT)
- Password: Must be the PAT value (check Apache Airflow - Load Control environment to find it)
- Check the connection by clicking on the Test button:

- You should see the following message:

- Save the connection by clicking on the Save button:

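As an alternative to the UI steps above, the same connection can be created through the Airflow stable REST API. The Python sketch below is only an illustration; it assumes the REST API is enabled with basic authentication and that the placeholders are replaced with your Airflow credentials, Databricks host, and PAT.

import requests

airflow_url = "http://localhost:8080"
airflow_auth = ("<airflow_user>", "<airflow_password>")  # the credentials used to log in to Airflow

# Same fields as in the Admin > Connections form
response = requests.post(
    f"{airflow_url}/api/v1/connections",
    auth=airflow_auth,
    json={
        "connection_id": "databricks_default",
        "conn_type": "databricks",
        "host": "https://<yourhost>",
        "login": "token",          # literal value token, as a PAT is used
        "password": "<yourtoken>",
    },
    timeout=30,
)
response.raise_for_status()
print("Connection created:", response.json()["connection_id"])
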
- Go back to the DAGs list by clicking on the DAGs menu:

- To launch a data load, click on the Trigger DAG button for the relevant DAG (a scripted alternative is shown after the monitoring steps below):

- The data load starts:

- You can follow the data load by clicking on the DAG, then on the current execution (green bar), and finally on the Graph tab:

- If you want to check the logs of the execution, click on the Audit Log tab:
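
If you prefer to trigger and follow a data load from a script instead of the UI, the Airflow REST API exposes the same operations. This sketch is only an illustration; it assumes the REST API is enabled with basic authentication and that dag_id is replaced with the DAG id shown in your DAGs list (the value below is a placeholder).

import time
import requests

airflow_url = "http://localhost:8080"
airflow_auth = ("<airflow_user>", "<airflow_password>")
dag_id = "<your_dag_id>"  # placeholder - use the DAG id from the DAGs list

# Trigger a new DAG run (equivalent to the Trigger DAG button)
run = requests.post(
    f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns",
    auth=airflow_auth,
    json={"conf": {}},
    timeout=30,
)
run.raise_for_status()
dag_run_id = run.json()["dag_run_id"]

# Poll the run until it finishes (equivalent to watching the green bar / Graph tab)
state = "queued"
while state not in ("success", "failed"):
    time.sleep(30)
    status = requests.get(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}",
        auth=airflow_auth,
        timeout=30,
    )
    status.raise_for_status()
    state = status.json()["state"]
    print("DAG run state:", state)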

- You can now check that your data was correctly loaded by creating a Notebook in Databricks and running the following code:
-- Adapt the database and table names
-- Add as many union branches as you have target tables
df = spark.sql("""
select '`rawvault`.`rdv_sat_creditcard_delta_satellite_result`' as table_name
, count(*) as row_count
, max(bg_loadtimestamp) as max_bg_loadtimestamp
from `rawvault`.`rdv_sat_creditcard_delta_satellite_result`
union
select '`rawvault`.`rdv_hub_salesorderheader_hub_result`' as table_name
, count(*) as row_count
, max(bg_loadtimestamp) as max_bg_loadtimestamp
from `rawvault`.`rdv_hub_salesorderheader_hub_result`
""")
df.show(n=1000, truncate=False)
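
If you have many target tables, you can build the same check from a list of table names instead of writing one union branch per table. The sketch below is only an illustration and assumes the same rawvault schema and the bg_loadtimestamp column used above.

from functools import reduce

# Adapt this list to your own target tables
tables = [
    "`rawvault`.`rdv_sat_creditcard_delta_satellite_result`",
    "`rawvault`.`rdv_hub_salesorderheader_hub_result`",
]

# One count/max-timestamp query per table, then a union of the results
frames = [
    spark.sql(f"""
        select '{table}' as table_name
             , count(*) as row_count
             , max(bg_loadtimestamp) as max_bg_loadtimestamp
        from {table}
    """)
    for table in tables
]
reduce(lambda a, b: a.union(b), frames).show(n=1000, truncate=False)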