Discover a Parquet file

A Parquet file can be stored with or without partitions.

The discovery configuration differs between the two cases.

Discover a Parquet file without a partition


To use the Discovery Companion to discover a Parquet file that is not stored with partitions, first prepare a source YAML file containing the following:

  • The specifications to access the source system
  • The specifications for the output file

Here is an example:

  • Parquet files are stored in a folder named parquet_files at the same path as the source YAML file.
  • A local output folder for the Discovery JSON file is named outputs.
    The complete path is: ./outputs/parquet_discovery_output.json
source:
  type: s3
  config:
    path_specs:
      - include: "parquet_files/*.parquet"
sink:
  type: file
  config:
    filename: ./outputs/parquet_discovery_output.json
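
Before running the discovery, it can help to check which files the include pattern would pick up. The following is a minimal sketch that uses Python's own glob matching, which is an assumption: the Discovery Companion may resolve the pattern differently, so treat the result as a sanity check only. The function name list_matching_files is hypothetical.

```python
from pathlib import Path


def list_matching_files(base_dir, pattern="parquet_files/*.parquet"):
    """Return the sorted names of the files under base_dir that match pattern.

    base_dir is assumed to be the folder that holds the source YAML file,
    because the include pattern in the example is relative to it.
    """
    return sorted(p.name for p in Path(base_dir).glob(pattern) if p.is_file())
```

For example, calling list_matching_files(".") from the folder that contains the source YAML file lists the Parquet files that the pattern covers. An empty list suggests that the folder name or the pattern needs to be corrected.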

The output folder referenced in your source YAML file must already exist.
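
If you prefer to create the output folder from a script rather than by hand, a minimal sketch could look like this. The helper name ensure_output_folder is hypothetical; it only creates the parent folder of the sink file and does not write the JSON file itself.

```python
from pathlib import Path


def ensure_output_folder(sink_filename):
    """Create the parent folder of the sink file if it is missing.

    sink_filename is the value of sink.config.filename in the source YAML file.
    Returns the folder as a Path object.
    """
    folder = Path(sink_filename).parent
    # parents=True creates intermediate folders; exist_ok=True makes the call repeatable.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

With the example above, ensure_output_folder("./outputs/parquet_discovery_output.json") creates the outputs folder next to the source YAML file when it is run from that location.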

You can download this example YAML file here, or create a .yaml file from the example above. Then adapt the following to your environment:

  • The Parquet file name
  • The output path

Then, you can use it in the Discovery Companion.

Discover a Parquet file with a partition

To use the Discovery Companion to discover a Parquet file that is stored with partitions, first prepare a source YAML file containing the following:

  • The specifications to access the source system
  • The specifications for the output file

Here is an example:

  • Parquet files are stored in a folder named PARQUET, which contains one subfolder per table (for example, BRANCH for the BRANCH table) and, inside each table subfolder, one subfolder per date partition.
  • A local output folder for the Discovery JSON file is named outputs.
    The complete path is: ./outputs/parquet_with_partitions_discovery_output.json
source:
  type: s3
  config:
    path_specs:
      - include: "C:/PROJECTS/Git/test.base.data/BLACKFORESTMARKETS/DATA/PARQUET/{table}/{partition_key[0]}={partition[0]}/*.parquet"
        table_name: "{table}"
    add_partition_columns_to_schema: True
sink:
  type: file
  config:
    filename: ./outputs/parquet_with_partitions_discovery_output.json
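
The include pattern in this example expects every Parquet file below the PARQUET folder to sit at <table>/<key>=<value>/<file>.parquet. The following sketch shows how such a relative path breaks down into a table name and a partition key/value pair. It is an illustration only, not the Discovery Companion's or DataHub's implementation; the function name and the sample path, including the partition key name date, are assumptions.

```python
from pathlib import PurePosixPath


def split_partitioned_path(relative_path):
    """Split '<table>/<key>=<value>/<file>' into (table, key, value, file)."""
    parts = PurePosixPath(relative_path).parts
    # Exactly one table folder, one 'key=value' folder, and one file are expected.
    if len(parts) != 3 or "=" not in parts[1]:
        raise ValueError(f"unexpected layout: {relative_path}")
    table, partition_dir, file_name = parts
    key, value = partition_dir.split("=", 1)
    return table, key, value, file_name


# Hypothetical path, relative to the PARQUET folder.
print(split_partitioned_path("BRANCH/date=2024-01-31/part-0000.parquet"))
# prints ('BRANCH', 'date', '2024-01-31', 'part-0000.parquet')
```

A path that raises ValueError here does not follow the one-level key=value layout that the example pattern describes, so the pattern would need to be adjusted for it.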

The output folder referenced in your source YAML file must already exist.

You can download this example YAML file here, or create a .yaml file from the example above. Then adapt the following to your environment:

  • The Parquet file name
  • The output path

Then, you can use it in the Discovery Companion.

 

All available configuration options are described on the official DataHub website.