Preparing data source files

Qlik Associative Big Data Index creates its index from non-nested Parquet files stored on HDFS, S3, EFS, or Linux file system instances. You need to prepare the data source files in Parquet format and place them in a shared folder that can be accessed from all nodes.

If you want to convert CSV files to Parquet files, prepare the CSV files according to the following format requirements.

Removing reserved characters from field names

When converting CSV to Parquet, remove all reserved characters from column and field names. The following characters are reserved:

  • , (comma)
  • ; (semi-colon)
  • {} (curly brackets)
  • () (round brackets)
  • \n (line feed)
  • \t (tab)
  • = (equal sign)
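
If you convert CSV files with a script, you can strip the reserved characters at the same time. The following sketch uses pandas (which writes Parquet through pyarrow or fastparquet); the file names are placeholders, and the conversion itself is not part of Qlik Associative Big Data Index.

# Convert a CSV file to a non-nested Parquet file, removing the reserved
# characters listed above from every column name before writing.
# File names are illustrative placeholders.
import re
import pandas as pd

RESERVED = r"[,;{}()\n\t=]"   # reserved characters in field names

df = pd.read_csv("lineitem.csv")
df.columns = [re.sub(RESERVED, "", col) for col in df.columns]

df.to_parquet("lineitem.parquet", index=False)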

Mapping supported data types

The table describes how data types supported by Parquet are mapped to Qlik Associative Big Data Index internal data types during the conversion.

Parquet logical type   QABDI internal data type   Comments
UTF8                   std::string
MAP                    Not supported
MAP_KEY_VALUE          Not supported
LIST                   Not supported
ENUM                   Not supported
DECIMAL                double
DATE                   int32_t                    The number of days from the Unix epoch, 1st January 1970
TIME_MILLIS            int32_t                    The number of milliseconds after midnight
TIME_MICROS            int64_t                    The number of microseconds after midnight
TIMESTAMP_MILLIS       int64_t                    The number of milliseconds from the Unix epoch, 00:00:00.000 on 1st January 1970
TIMESTAMP_MICROS       int64_t                    The number of microseconds from the Unix epoch, 00:00:00.000000 on 1st January 1970
UINT8                  uint8_t
UINT16                 uint16_t
UINT32                 uint32_t
UINT64                 uint64_t
INT8                   int8_t
INT16                  int16_t
INT32                  int32_t
INT64                  int64_t
JSON                   Not supported
BSON                   Not supported
INTERVAL               Not supported
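
Before indexing, it can be useful to confirm that your files contain only supported logical types. The sketch below uses pyarrow to print the schema of a Parquet file and flag nested columns; the file name is a placeholder, and this check is independent of Qlik Associative Big Data Index.

# Print the Arrow schema of a Parquet file and flag nested columns
# (MAP, LIST, STRUCT), which are not supported. JSON, BSON and INTERVAL
# columns also need to be converted or dropped. The file name is a placeholder.
import pyarrow.parquet as pq
import pyarrow.types as pat

schema = pq.read_schema("lineitem.parquet")
for field in schema:
    note = "check: nested type" if pat.is_nested(field.type) else "ok"
    print(f"{field.name}: {field.type} ({note})")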

Mapping Parquet files to the source data model

There are two different approaches to map Parquet files to their respective model tables.

  • Use a dataset/table/table set directory structure when you can arrange your Parquet files in this three-level directory hierarchy. We recommend that you use this option.

  • Use a tablemap JSON file when it is not possible to use a three-level directory structure.

Using a dataset/table/table set folder structure

When you organize your Parquet files in a three-level folder hierarchy, the table names are inferred from their respective {table_name}.table directories. This is done during the schema discovery phase of the DataScan option in task_manager.sh.

To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory. This directory contains all of the table-specific {table_name}.table directories, each with its own {table1_set1}.parquet sub-directories that contain the Parquet data files.

{dataset} (directory - root folder of source data, can be named whatever)
+-- {table1_name}.table (directory - the directory name will be used as the table name later in schema generation)
|   +-- {table1_set1}.parquet (directory, one table directory can hold multiple datasets)
|   |   +-- parquet file
|   |   +-- parquet file
|   |   +-- ...
|   |   +-- parquet file
|   +-- {table1_set2}.parquet (directory, another set for the same table)
|       +-- parquet file
|       +-- parquet file
|       +-- ...
|       +-- parquet file
+-- {table2_name}.table (directory, more tables)
+-- ...
+-- {tableN_name}.table
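
If you create the folder structure from a script, something like the following sketch writes one Parquet file per source CSV into the hierarchy shown above. The dataset root, table names, and file names are placeholders.

# Write one Parquet file per source CSV into the
# {dataset}/{table_name}.table/{table_set}.parquet hierarchy shown above.
# All paths and names are placeholders.
from pathlib import Path
import pandas as pd

root = Path("/data/my_dataset")               # the {dataset} root folder

tables = {                                    # table name -> source CSV file
    "lineitem": "lineitem.csv",
    "orders": "orders.csv",
}

for table_name, csv_file in tables.items():
    set_dir = root / f"{table_name}.table" / f"{table_name}_set1.parquet"
    set_dir.mkdir(parents=True, exist_ok=True)
    pd.read_csv(csv_file).to_parquet(set_dir / "part-0001.parquet", index=False)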

Using a tablemap JSON file

If the Parquet files for your tables are not arranged in the three-level structure, but each table's data files are in separate, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.

You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data Scan option of task_manager.sh searches recursively for parquet data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.

Create a JSON file according to the following format and provide the full path to that file as source_data_path in indexing_settings.json.

{"data_set_name": "dataset name", "tables": [ {"name": "table1", "root_dirs": [ {"path": "/path/to/table1/first/root_dir"}, {"path": "/path/to/table1/second/root_dir"} ] }, {"name": "table2", "root_dirs": [ {"path": "/path/to/table2/first/root_dir"}, {"path": "/path/to/table2/second/root_dir"} ] } ] }

Field mapping

You can use a field mappings file to define the table and field names to be mapped. The path to this file is set in the indexing_settings.json configuration file.

Example: Field mapping sample file

{ "field_mappings": [ { "column1": "part.p_part_key", "column2": "partsupp.ps_part_key" }, { "column1": "supplier.s_supp_key", "column2": "partsupp.ps_supp_key" }, { "column1": "supplier.s_nation_key", "column2": "nation.n_nation_key" }, { "column1": "partsupp.ps_part_key", "column2": "lineitem.l_part_key" }, { "column1": "partsupp.ps_supp_key", "column2": "lineitem.l_supp_key" }, { "column1": "customer.c_cust_key", "column2": "order.o_cust_key" }, { "column1": "customer.c_nation_key", "column2": "nation.n_nation_key" }, { "column1": "nation.n_region_key", "column2": "region.r_region_key" }, { "column1": "lineitem.l_order_key", "column2": "order.o_order_key" } ] }
