Databricks - Artifacts - 1.5

When you download the generated Artifacts, a zip file is created in the default download location set for your Web Browser.

Let's see what this zip file contains now.

Generated Artifacts content

The Generated Artifacts zip file contains the following:

  • A Helpers folder
  • A Jupyter folder
  • A LoadControl folder
  • A ParseError.log file (optional)
  • The placeholder toolkit (the replace_placeholders.ps1 file)
  • The placeholder configuration (the replacement_config.json file)

Helpers folder

The Helpers folder contains Jupyter Notebook files that are helpful for loading data:

See Databricks - Load data with a native load control.

Jupyter folder

The Jupyter folder contains one Jupyter Notebook file per Target Model Object to deploy, and one Jupyter Notebook (Deployment.ipynb) that executes the deployment based on the other files:

Each file contains all the code needed to create the Target Parquet file in the Target Data Lake.

Example of a part of a Jupyter Notebook file:

...
   "source" : [
    "# HubLoader: CreditCard_Hub_hub loader_1\n",
    "\n",
    "spark.sql(\"\"\"\n",
    "INSERT\n",
    "INTO `{documentationsparkdatavault#rawvault#database_name}`.`rdv_hub_creditcard_hub` (\n",
    "     `bg_loadtimestamp`\n",
    "    ,`hub_hk`\n",
    "    ,`bg_sourcesystem`\n",
    "    ,`creditcardid`\n",
    ")\n",
    "SELECT\n",
    "     current_timestamp() AS `bg_loadtimestamp`\n",
    "    ,`bg_source`.`hub_hk` AS `hub_hk`\n",
    "    ,`bg_source`.`bg_sourcesystem` AS `bg_sourcesystem`\n",
    "    ,`bg_source`.`creditcardid` AS `creditcardid`\n",
    "FROM `{documentationsparkdatavault#rawvault#database_name}`.`rdv_hub_creditcard_hub_delta` AS `bg_source`\n",
    "\"\"\")\n",
    "\n"   ]
...

As you can observe, a placeholder is used for the target database name.

Before you deploy the generated code in your target environment, please Replace the placeholders in the Generated Artifacts.
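The Deployment.ipynb notebook described above runs the per-object notebooks in sequence. On Databricks this is typically done with dbutils.notebook.run; the following is only a minimal sketch of that control flow, with a stub runner and illustrative notebook names standing in for the generated code:

```python
# Sketch of a deployment notebook running the generated loader notebooks
# in sequence. On Databricks the call would be
# dbutils.notebook.run(path, timeout_seconds); a stub stands in here so
# the flow can be followed outside a Databricks workspace.

executed = []

def run_notebook(path: str, timeout_seconds: int = 600) -> str:
    """Stand-in for dbutils.notebook.run(path, timeout_seconds)."""
    executed.append(path)
    return "SUCCESS"

# Illustrative notebook names, modeled on the loader shown above.
notebooks = [
    "RDV_HUB_CreditCard_Hub_Loader",
    "RDV_LNK_Customer_Order_Link_Loader",
]

for notebook in notebooks:
    result = run_notebook(notebook)
    if result != "SUCCESS":
        raise RuntimeError(f"Deployment failed at {notebook}: {result}")
```

Stopping at the first non-success result keeps a partial deployment from silently continuing; the generated Deployment.ipynb may handle failures differently.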

LoadControl folder

The LoadControl folder contains one JSON file per Target Model Object, a Python file, and a Jupyter Notebook for multi-threading:

Each JSON file contains everything needed to create the corresponding Job in Databricks.

Example of a part of a JSON file:

...
 "tasks": [
  {
   "task_key": "RDV_HUB_CreditCard_Hub_Loader",
   "run_if": "ALL_SUCCESS",
   "notebook_task": { "notebook_path": "{myfirstprojectspark#databricks_job#notebook_task_path}RDV_HUB_CreditCard_Hub_Loader" },
   "existing_cluster_id": "{myfirstprojectspark#databricks_job#existing_cluster_id}"
  }
],...
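A job definition like the one above is typically registered through the Databricks Jobs API (POST /api/2.1/jobs/create) once the placeholders are resolved. The sketch below only builds the HTTP request, with hypothetical host, token, path, and cluster values in place of the placeholders:

```python
import json
import urllib.request

# Hypothetical workspace values -- substitute your own after replacing
# the placeholders in the generated JSON file.
host = "https://example.cloud.databricks.com"
token = "dapi-REDACTED"

# Job definition modeled on the generated JSON excerpt above.
job_definition = {
    "name": "RDV_HUB_CreditCard_Hub",
    "tasks": [
        {
            "task_key": "RDV_HUB_CreditCard_Hub_Loader",
            "run_if": "ALL_SUCCESS",
            "notebook_task": {
                "notebook_path": "/Artifacts/RDV_HUB_CreditCard_Hub_Loader"
            },
            "existing_cluster_id": "0000-000000-example",
        }
    ],
}

request = urllib.request.Request(
    url=f"{host}/api/2.1/jobs/create",
    data=json.dumps(job_definition).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would submit it; not executed here.
```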

The Python file (XXX_dag.py) contains the code to create the Airflow pipelines that execute the data load.

The Jupyter Notebook file (XXX_MultithreadingLoadExecution.ipynb) contains the code to load the data in several threads simultaneously.
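The multi-threading pattern used by such a notebook can be sketched with Python's concurrent.futures; the loader function and object names below are illustrative stand-ins, not the generated code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_object(name: str) -> str:
    """Stand-in for running one generated loader for a Target Model Object."""
    # In the generated notebook this would trigger the actual Databricks load.
    return f"{name}: loaded"

# Illustrative Target Model Object names.
objects = [
    "RDV_HUB_CreditCard_Hub",
    "RDV_HUB_Customer_Hub",
    "RDV_LNK_Customer_Order_Link",
]

# Run the loads in several threads simultaneously and collect the results
# as each one completes, in whatever order they finish.
results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(load_object, obj): obj for obj in objects}
    for future in as_completed(futures):
        results.append(future.result())
```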

As you can observe, placeholders are used to configure Databricks behavior.

Before you deploy the generated code in your target environment, please Replace the placeholders in the Generated Artifacts.

If the Property Configure LoadControl is set to False, only the XXX_MultithreadingLoadExecution.ipynb file will be generated.

If the Property Deploy LoadControl is set to False, all the files will be generated except the XXX_MultithreadingLoadExecution.ipynb.

ParseError.log file

If any errors occur during the generation of your Artifacts, the ParseError.log file will list them.

Example of error:

ERROR LinkSourceView: Customer_Order_Link_link source view_1 ### TargetName: RDV_LNK_Customer_Order_Link_Source
Invalid amount of dataflows for ArchitectureModelObject name Customer_Order_Link_link source view_1
System.Exception: Invalid amount of dataflows for ArchitectureModelObject name Customer_Order_Link_link source view_1
   at biGENIUS.Implementation.Base.DataflowBuilder.Build(ArchitectureModelObject architectureModelObject) in C:\.sources\bgaas\bigenius.implementation.base\src\biGENIUS.Implementation.Base\StatementBuilder\DataflowBuilder.cs:line 18
   at biGENIUS.Implementation.Datavault.Mssql.SourceViewBuilder.Build(ArchitectureModelObject architectureModelObject) in C:\.sources\bgaas\bigenius.implementation.datavault.mssql\src\biGENIUS.Implementation.Datavault.Mssql\StatementBuilder\SourceViewBuilder.cs:line 22
   at biGENIUS.Implementation.Datavault.Mssql.LinkSourceView.BuildScriptInternal() in C:\.sources\bgaas\bigenius.implementation.datavault.mssql\src\biGENIUS.Implementation.Datavault.Mssql\Template\LinkSourceView.cs:line 16
   at biGENIUS.Implementation.Base.TemplateScriptBase.BuildScript(IGeneratorContext generatorContext, ArchitectureModelObject architectureModelObject, TemplateRenderingContext templateRenderingContext, StatementBuilderContext statementBuilderContext) in C:\.sources\bgaas\bigenius.implementation.base\src\biGENIUS.Implementation.Base\TemplateBase\TemplateScriptBase.cs:line 45
   at biGENIUS.Implementation.Base.GeneratorHookBase.RenderTemplate(IGeneratorContext generatorContext, Assembly assembly, ArchitectureModelObject amo) in C:\.sources\bgaas\bigenius.implementation.base\src\biGENIUS.Implementation.Base\Hook\GeneratorHookBase.cs:line 169

In this case, no Dataflow was configured for the Link named Customer_Order_Link.

So, please define the Dataflow and generate the Artifacts again.

If no errors occur during generation, the ParseError.log file will not be present in the Generated Artifacts zip file.

Placeholder toolkit

It is composed of a PowerShell script: replace_placeholders.ps1.

It uses a placeholder configuration file to replace the connection information needed in the generated artifacts.

For more information, please review Replace the placeholders in the Generated Artifacts.

Placeholder configuration

It is composed of a JSON file: replacement_config.json.

The placeholder toolkit uses it to replace the connection information needed in the generated artifacts.
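The replacement itself boils down to substituting every `{project#group#key}` token found in the configuration. This sketch assumes a flat key-to-value JSON layout, which may differ from the actual replacement_config.json schema:

```python
import json
import re

# Assumed flat configuration shape; the real replacement_config.json
# layout may differ.
config = json.loads(
    '{"documentationsparkdatavault#rawvault#database_name": "raw_vault"}'
)

# Artifact fragment modeled on the notebook excerpt shown earlier.
artifact = (
    "INSERT INTO `{documentationsparkdatavault#rawvault#database_name}`"
    ".`rdv_hub_creditcard_hub`"
)

def replace_placeholders(text: str, values: dict) -> str:
    """Replace every {project#group#key} token that has a configured value.

    Tokens without a configured value are left untouched, so missing
    entries stay visible instead of being silently dropped.
    """
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: values.get(match.group(1), match.group(0)),
        text,
    )

resolved = replace_placeholders(artifact, config)
```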

For more information, please review Replace the placeholders in the Generated Artifacts.