
Preparing data source files

Qlik Big Data Index uses non-nested Parquet files or Optimized Row Columnar (ORC) files stored on HDFS, S3, EFS, or Linux file system instances as data sources to create the index. You need to prepare the data source files in Parquet or ORC format.

Recommendations

A large number of data source files can affect indexing performance.

If you have a large number of small Parquet files, you can merge them into larger files, as shown in the sketch after this list. We recommend that:

  • The number of rows per Parquet file is a multiple (2x to 10x) of the size of an indexlet (16 777 216 rows).
  • The number of rows per row group is equal to the size of an indexlet (16 777 216 rows).
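
The following is a minimal sketch of one such merge, assuming the pyarrow Python library; the file paths, table name, and directory layout are placeholders and not part of Qlik Big Data Index.

import glob
import pyarrow as pa
import pyarrow.parquet as pq

INDEXLET_ROWS = 16_777_216  # indexlet size in rows

# Read the small files (they must share the same schema) and concatenate them.
tables = [pq.read_table(path) for path in glob.glob("/data/small_files/*.parquet")]
merged = pa.concat_tables(tables)

# Write one larger file with one row group per indexlet.
pq.write_table(merged, "/data/merged/table1_set1.parquet", row_group_size=INDEXLET_ROWS)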

If you have a large number of small ORC files, you can merge them into larger files. We recommend that:

  • The number of rows per ORC file is a multiple (2x to 10x) of the size of an indexlet (16 777 216 rows).
  • The number of rows per row group is equal to the size of an indexlet (16 777 216 rows).

Converting CSV files to Parquet files

If you want to convert CSV files to Parquet files, you need to prepare the CSV files according to the following format requirements.

Removing reserved characters from field names

When converting CSV to Parquet, column/field names should be changed to remove all reserved characters (a conversion sketch follows this list). The following characters are reserved:

  • , (comma)
  • ; (semi-colon)
  • {} (curly brackets)
  • () (round brackets)
  • \n (line feed)
  • \t (tab)
  • = (equal sign)
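
As an illustration only, the following sketch assumes the pandas and pyarrow Python libraries; the file paths are placeholders. It strips the reserved characters from the column names while converting a CSV file to Parquet.

import re
import pandas as pd

RESERVED = r"[,;{}()\n\t=]"  # the reserved characters listed above

df = pd.read_csv("/data/source/table1.csv")
df.columns = [re.sub(RESERVED, "", name) for name in df.columns]
df.to_parquet("/data/parquet/table1.parquet", index=False)  # Parquet output via pyarrow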

Mapping supported data types

The following table describes how data types supported by Parquet are mapped to Qlik Big Data Index internal data types during the conversion.

Parquet logical type | QABDI internal data type | Comments
UTF8 | std::string |
MAP | Not supported |
MAP_KEY_VALUE | Not supported |
LIST | Not supported |
ENUM | Not supported |
DECIMAL | double |
DATE | int32_t | The number of days from the Unix epoch, 1st January 1970
TIME_MILLIS | int32_t | The number of milliseconds after midnight
TIME_MICROS | int64_t | The number of microseconds after midnight
TIMESTAMP_MILLIS | int64_t | The number of milliseconds from the Unix epoch, 00:00:00.000 on 1st January 1970
TIMESTAMP_MICROS | int64_t | The number of microseconds from the Unix epoch, 00:00:00.000000 on 1st January 1970
UINT8 | uint8_t |
UINT16 | uint16_t |
UINT32 | uint32_t |
UINT64 | uint64_t |
INT8 | int8_t |
INT16 | int16_t |
INT32 | int32_t |
INT64 | int64_t |
JSON | Not supported |
BSON | Not supported |
INTERVAL | Not supported |

Mapping Parquet files to the source data model

There are two different approaches to map Parquet files to their respective model tables.

  • Use a dataset/tableset/Parquet directory structure when you can arrange Parquet files in a three-level directory hierarchy. We recommend that you use this option.

  • Use a tablemap JSON file when it is not possible to use a three-level directory structure.

Using a table/table set/Parquet folder structure

If you organize your Parquet files in a three-level folder hierarchy, the table names can be inferred from their respective {table_name}.table directories. This is done during the schema discovery phase in the DataScan option of task_manager.sh.

To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory, which contains all of the table-specific {table_name}.table directories, each with their own {table1_set1}.parquet sub-directories that contain the Parquet data files.

{dataset} (directory - root folder of source data, can be named whatever)
+-- {table1_name}.table (directory - the directory name will be used as the table name later in schema generation)
|   +-- {table1_set1}.parquet (directory, one table directory can hold multiple datasets)
|   |   +-- parquet file
|   |   +-- parquet file
|   |   +-- ...
|   |   +-- parquet file
|   +-- {table1_set2}.parquet (directory, another set for the same table)
|       +-- parquet file
|       +-- parquet file
|       +-- ...
|       +-- parquet file
+-- {table2_name}.table (directory, more tables)
+-- ...
+-- {tableN_name}.table
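
As a hypothetical illustration, the sketch below writes a Parquet data file into this three-level layout, assuming pyarrow; the dataset, table, and set names are placeholders.

import os
import pyarrow as pa
import pyarrow.parquet as pq

dataset_root = "/data/mydataset"  # the {dataset} root directory
set_dir = os.path.join(dataset_root, "lineitem.table", "set1.parquet")
os.makedirs(set_dir, exist_ok=True)

# Write one data file into the {table_name}.table/{set}.parquet directory.
table = pa.table({"l_order_key": [1, 2, 3], "l_part_key": [10, 20, 30]})
pq.write_table(table, os.path.join(set_dir, "part-00000.parquet"))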

Using a tablemap JSON file

If the Parquet files for your tables are in a directory structure that does not match the three-level structure, but each table's data files are in separate, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.

You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data Scan option of task_manager.sh searches recursively for Parquet data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.

Create a JSON file according to the following format and provide the full path to that file as source_data_path in indexing_settings.json.

{"data_set_name": "dataset name", "tables": [ {"name": "table1", "root_dirs": [ {"path": "/path/to/table1/first/root_dir"}, {"path": "/path/to/table1/second/root_dir"} ] }, {"name": "table2", "root_dirs": [ {"path": "/path/to/table2/first/root_dir"}, {"path": "/path/to/table2/second/root_dir"} ] } ] }

Converting CSV files to ORC files

If you want to convert CSV files to ORC files, you need to prepare the CSV files according to the following format requirements.
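
As an illustration only, the following sketch assumes the pyarrow Python library (built with ORC support); the file paths are placeholders. It reads a CSV file and writes it out as a single ORC file.

import pyarrow.csv as pv
import pyarrow.orc as orc

# Column types are inferred from the CSV contents.
table = pv.read_csv("/data/source/table1.csv")
orc.write_table(table, "/data/orc/table1.orc")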

Mapping supported data types

The following table describes how data types supported by ORC are mapped to Qlik Big Data Index internal data types during the conversion.

ORC logical type | QABDI internal data type | Comments
boolean | uint8_t |
tinyint | int8_t |
smallint | int16_t |
int | int32_t |
bigint | int64_t |
float | float |
double | double |
string | std::string |
varchar | std::string |
binary | std::string |
timestamp | int64_t | The number of nanoseconds from the Unix epoch, 00:00:00.000000 on 1st January 1970
date | int32_t | The number of days from the Unix epoch, 1st January 1970
decimal | int32_t |
char | Not supported |
struct | Not supported |
list | Not supported |
map | Not supported |
union | Not supported |

Mapping ORC files to the source data model

There are two different approaches to map ORC files to their respective model tables.

  • Use a dataset/tableset/ORC directory structure when you can arrange ORC files in a three-level directory hierarchy. We recommend that you use this option.

  • Use a tablemap JSON file when it is not possible to use a three-level directory structure.

Using a table/table set/ORC folder structure

If you organize your ORC files in a three-level folder hierarchy, the table names can be inferred from their respective {table_name}.table directories. This is done during the schema discovery phase in the DataScan option of task_manager.sh.

To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory, which contains all of the table-specific {table_name}.table directories, each with their own {table1_set1}.orc sub-directories that contain the ORC data files.

{dataset} (directory - root folder of source data, can be named whatever)
+-- {table1_name}.table (directory - the directory name will be used as the table name later in schema generation)
|   +-- {table1_set1}.orc (directory, one table directory can hold multiple datasets)
|   |   +-- ORC file
|   |   +-- ORC file
|   |   +-- ...
|   |   +-- ORC file
|   +-- {table1_set2}.orc (directory, another set for the same table)
|       +-- ORC file
|       +-- ORC file
|       +-- ...
|       +-- ORC file
+-- {table2_name}.table (directory, more tables)
+-- ...
+-- {tableN_name}.table

Using a tablemap JSON file

If the ORC files for your tables are in a directory structure that does not match the three-level structure, but each table's data files are in separate, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.

You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data Scan option of task_manager.sh searches recursively for ORC data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.

Create a JSON file according to the following format and provide the full path to that file as source_data_path in indexing_settings.json.

{"data_set_name": "dataset name", "tables": [ {"name": "table1", "root_dirs": [ {"path": "/path/to/table1/first/root_dir"}, {"path": "/path/to/table1/second/root_dir"} ] }, {"name": "table2", "root_dirs": [ {"path": "/path/to/table2/first/root_dir"}, {"path": "/path/to/table2/second/root_dir"} ] } ] }

Field mapping

You can use a field mappings file to manually define the table/field names to be mapped. The path to this file is set in the indexing_settings.json configuration file.

Example: Field mapping sample file

{ "field_mappings": [ { "column1": "part.p_part_key", "column2": "partsupp.ps_part_key" }, { "column1": "supplier.s_supp_key", "column2": "partsupp.ps_supp_key" }, { "column1": "supplier.s_nation_key", "column2": "nation.n_nation_key" }, { "column1": "partsupp.ps_part_key", "column2": "lineitem.l_part_key" }, { "column1": "partsupp.ps_supp_key", "column2": "lineitem.l_supp_key" }, { "column1": "customer.c_cust_key", "column2": "order.o_cust_key" }, { "column1": "customer.c_nation_key", "column2": "nation.n_nation_key" }, { "column1": "nation.n_region_key", "column2": "region.r_region_key" }, { "column1": "lineitem.l_order_key", "column2": "order.o_order_key" } ] }
