Use DataHub to discover source data

This article shows how to use DataHub, instead of the biGENIUS-X Discovery Application, to discover source metadata.

Installation and configuration of DataHub are not part of biGENIUS support.

Unfortunately, we cannot provide any help beyond the example in this article.

You may not want to use the biGENIUS-X Discovery Application to discover the metadata of your source data.

Another way to produce a discovery file is to use DataHub.

DataHub is an open-source data catalog that serves as a centralized repository for managing and organizing metadata about an organization's data assets.

All the possible source types you can discover are described here.

Install DataHub

We advise creating a Python virtual environment in a new directory. Installing DataHub adds many Python packages to your machine, and a virtual environment keeps them separate from your existing packages.

Create a Python virtual environment

To create a Python virtual environment:

  • Open PowerShell (or a terminal on Linux/macOS)
  • Execute the following commands:
python -m pip install virtualenv
python -m virtualenv venv # create a new venv in ./venv
  • Then activate your new virtual environment:
    • Linux/macOS system:
source ./venv/bin/activate # activate your new venv
    • Windows system:
.\venv\Scripts\activate
  • Install the DataHub client:
pip install acryl-datahub
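
Before going further, it can help to confirm that the client landed in the active virtual environment. The following is a minimal Python sketch, assuming the package is installed under the name acryl-datahub and exposes a datahub command on the PATH, as in the install step above. The helper name check_datahub_install is hypothetical.

```python
import shutil
from importlib import metadata


def check_datahub_install(package="acryl-datahub", command="datahub"):
    """Return (installed version or None, path to the CLI or None)."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    # shutil.which looks the command up on the PATH of the current environment
    return version, shutil.which(command)


if __name__ == "__main__":
    version, cli_path = check_datahub_install()
    print(f"acryl-datahub version: {version or 'not installed'}")
    print(f"datahub command: {cli_path or 'not found on PATH'}")
```

If either value is reported as missing, the virtual environment is probably not activated in the current shell.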

Discover data with DataHub

The example below discovers data from a Microsoft SQL Server database.

Specifications: https://datahubproject.io/docs/generated/ingestion/sources/mssql.

Install the source package

To discover a Microsoft SQL Server database, the following package must be installed:

pip install acryl-datahub[mssql]

You can install packages for other technologies with a similar command.

You can find the [XXX] package name to install on the DataHub website.

Check a recipe example: the value after type: is the package name that replaces XXX:

source:
  type: mssql
  ...

Configure the recipe

The recipe YAML file contains the information needed to connect to the source (source section) and the location of the discovery file to create (sink section).

For a Microsoft SQL Server database, an example is:

source:
  type: mssql
  config:
    # Coordinates
    host_port: localhost:1433
    database: AdventureWorks2019

    # Credentials
    username: training
    password: training

sink:
  type: file
  config:
    filename: ./mssql_example_output.json
  • type (under source): package name, depending on the technology to discover
  • host_port: IP address (or localhost) and port of the SQL Server instance
  • database: name of the database to discover
  • username: user that can connect to the database
  • password: password of that user
  • type (under sink): file, so that the result is written to a file
  • filename: path and file name of the discovery file to create
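
The example recipe stores the password in plain text. DataHub recipes expand environment variables written as ${NAME}, so the credentials can be kept out of the file. The following Python sketch writes such a recipe. The variable names MSSQL_USERNAME and MSSQL_PASSWORD and the helper write_recipe are assumptions chosen for illustration, not names required by DataHub.

```python
from pathlib import Path

# Doubled braces keep ${...} literal in the output; single braces are filled in.
RECIPE_TEMPLATE = """\
source:
  type: mssql
  config:
    # Coordinates
    host_port: {host_port}
    database: {database}

    # Credentials, read from environment variables when the recipe runs
    username: ${{MSSQL_USERNAME}}
    password: ${{MSSQL_PASSWORD}}

sink:
  type: file
  config:
    filename: {output_file}
"""


def write_recipe(path, host_port, database, output_file):
    """Write a recipe file (hypothetical helper) and return its text."""
    text = RECIPE_TEMPLATE.format(
        host_port=host_port, database=database, output_file=output_file
    )
    Path(path).write_text(text, encoding="utf-8")
    return text
```

For instance, write_recipe("mssql_example.yaml", "localhost:1433", "AdventureWorks2019", "./mssql_example_output.json") produces a recipe equivalent to the one above, provided both environment variables are set in the shell before the discovery is started.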

You can download a recipe example here.

Discover the metadata

To start the discovery process, execute the following command:

datahub ingest -c mssql_example.yaml

You should see a result similar to this:

The discovery file was created:

You can download a discovery file example here.

You can now use it in biGENIUS-X to create a Discovery.
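
To get a quick overview of what was discovered before loading the file, the following Python sketch lists the dataset identifiers found in it. It assumes only that the discovery file is valid JSON and that datasets are identified by strings starting with urn:li:dataset:, so it does not depend on the exact record layout. The function names are hypothetical.

```python
import json

DATASET_URN_PREFIX = "urn:li:dataset:"


def collect_dataset_urns(node, found=None):
    """Walk any JSON structure and collect strings that look like dataset URNs."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for value in node.values():
            collect_dataset_urns(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_dataset_urns(item, found)
    elif isinstance(node, str) and node.startswith(DATASET_URN_PREFIX):
        found.add(node)
    return found


def list_datasets(discovery_file):
    """Return the sorted, de-duplicated dataset URNs found in a discovery file."""
    with open(discovery_file, encoding="utf-8") as handle:
        return sorted(collect_dataset_urns(json.load(handle)))
```

Calling list_datasets("./mssql_example_output.json") returns one entry per discovered table or view. An empty list usually means the connection settings or the user's permissions need checking.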