Preparing data source files
Qlik Big Data Index uses non-nested Parquet files or Optimized Row Columnar (ORC) files stored on HDFS, S3, EFS, or Linux file system instances as data sources to create the index. You need to prepare the data source files in Parquet or ORC format.
Recommendations
A large number of data source files can adversely affect indexing performance.
If you have a large number of small Parquet files, you can merge them into larger files, as illustrated in the sketch after these recommendations. We recommend that:
- The number of rows per Parquet file is a multiple (2x to 10x) of the size of an indexlet (16 777 216 rows).
- The number of rows per row group is equal to the size of an indexlet (16 777 216 rows).
If you have a large number of small ORC files, you can also merge them into larger files. We recommend that:
- The number of rows per ORC file is a multiple (2x to 10x) of the size of an indexlet (16 777 216 rows).
- The number of rows per row group is equal to the size of an indexlet (16 777 216 rows).
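As a minimal sketch of such a merge for Parquet files, assuming PyArrow is available, the input files share a schema, and the data fits in memory (all file paths here are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

INDEXLET_ROWS = 16_777_216  # recommended rows per row group

src_dir = Path("/data/parquet/small_files")          # hypothetical input directory
out_file = "/data/parquet/merged/part-0001.parquet"  # hypothetical output file

# Read and concatenate the small files; for data that does not fit
# in memory, read and write in batches instead.
tables = [pq.read_table(p) for p in sorted(src_dir.glob("*.parquet"))]
merged = pa.concat_tables(tables)

# Write one larger file with indexlet-sized row groups.
pq.write_table(merged, out_file, row_group_size=INDEXLET_ROWS)
```

Small ORC files can be merged along the same lines with the pyarrow.orc module.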
Converting CSV files to Parquet files
If you want to convert CSV files to Parquet files, you need to prepare the CSV files according to the following format requirements.
Removing reserved characters from field names
When converting CSV to Parquet, change the column and field names to remove all reserved characters (see the sketch after this list). The following characters are reserved:
- , (comma)
- ; (semi-colon)
- {} (curly brackets)
- () (round brackets)
- \n (line feed)
- \t (tab)
- = (equal sign)
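As a minimal sketch of such a conversion, assuming pandas with the PyArrow engine and a hypothetical input file, reserved characters can be replaced before the Parquet file is written:

```python
import re
import pandas as pd

# The reserved characters listed above, as a regex character class.
RESERVED = re.compile(r"[,;{}()\n\t=]")

df = pd.read_csv("customers.csv")  # hypothetical input file

# Replace each reserved character with an underscore (one possible policy).
df.columns = [RESERVED.sub("_", name) for name in df.columns]

df.to_parquet("customers.parquet", engine="pyarrow", index=False)
```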
Mapping supported data types
The table describes how data types supported by Parquet are mapped to Qlik Big Data Index internal data types during the conversion.
| Parquet logical type | QABDI internal data type | Comments |
| --- | --- | --- |
| UTF8 | std::string | |
| MAP | Not supported | |
| MAP_KEY_VALUE | Not supported | |
| LIST | Not supported | |
| ENUM | Not supported | |
| DECIMAL | double | |
| DATE | int32_t | The number of days from the Unix epoch, 1st January 1970 |
| TIME_MILLIS | int32_t | The number of milliseconds after midnight |
| TIME_MICROS | int64_t | The number of microseconds after midnight |
| TIMESTAMP_MILLIS | int64_t | The number of milliseconds from the Unix epoch, 00:00:00.000 on 1st January 1970 |
| TIMESTAMP_MICROS | int64_t | The number of microseconds from the Unix epoch, 00:00:00.000000 on 1st January 1970 |
| UINT8 | uint8_t | |
| UINT16 | uint16_t | |
| UINT32 | uint32_t | |
| UINT64 | uint64_t | |
| INT8 | int8_t | |
| INT16 | int16_t | |
| INT32 | int32_t | |
| INT64 | int64_t | |
| JSON | Not supported | |
| BSON | Not supported | |
| INTERVAL | Not supported | |
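As a minimal illustration of the date and time encodings in the table, using hypothetical values, the stored integers can be computed as follows:

```python
from datetime import date, time

# DATE: days since the Unix epoch (1970-01-01), stored as int32_t.
d = date(2024, 3, 15)
date_value = (d - date(1970, 1, 1)).days  # 19797

# TIME_MILLIS: milliseconds after midnight, stored as int32_t.
t = time(13, 45, 30, 250_000)  # 13:45:30.250
time_value = (
    ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000
)  # 49530250
```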
Mapping Parquet files to the source data model
There are two different approaches to map Parquet files to their respective model tables.
- Use a dataset/tableset/Parquet directory structure when you can arrange Parquet files in a three-level directory hierarchy. We recommend that you use this option.
- Use a tablemap JSON file when it is not possible to use a three-level directory structure.
Using a dataset/tableset/Parquet folder structure
If you organize your Parquet files in a three-level folder hierarchy, the table names can be inferred from their respective {table_name}.table directories. This is done during the schema discovery phase in the Data scan option of task_manager.sh.
To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory, which contains all of the table-specific {table_name}.table directories, each with their own {table1_set1}.Parquet subdirectories that contain the Parquet data files.
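For example (all dataset, table, and table set names here are hypothetical), a layout like the following satisfies the three-level requirement:

```
{dataset}/                          <- source_data_path points here
    customers.table/
        customers_set1.Parquet/
            part-00000.parquet
            part-00001.parquet
    orders.table/
        orders_set1.Parquet/
            part-00000.parquet
```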
Using a tablemap JSON file
If the Parquet files for your tables are in a directory structure that does not match the three-level structure but where each table's data files are in different, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.
You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data scan option of task_manager.sh searches recursively for Parquet data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.
Create a JSON file according to the format here and provide the full path to that file as source_data_path in indexing_settings.json.
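As a purely illustrative sketch of such a file (the table names and paths are hypothetical, and every field name other than root_dir, which the text above mentions, is an assumption rather than the product's documented schema), a tablemap might pair each table with the root directories to search:

```
{
  "tables": [
    { "name": "customers", "root_dir": ["/data/parquet/customers"] },
    { "name": "orders", "root_dir": ["/data/parquet/orders", "/archive/2019/orders"] }
  ]
}
```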
Converting CSV files to ORC files
If you want to convert CSV files to ORC files, you need to prepare the CSV files according to the following format requirements.
Mapping supported data types
The table describes how data types supported by ORC are mapped to Qlik Big Data Index internal data types during the conversion.
| ORC logical type | QABDI internal data type | Comments |
| --- | --- | --- |
| boolean | uint8_t | |
| tinyint | int8_t | |
| smallint | int16_t | |
| int | int32_t | |
| bigint | int64_t | |
| float | float | |
| double | double | |
| string | std::string | |
| varchar | std::string | |
| binary | std::string | |
| timestamp | int64_t | The number of nanoseconds from the Unix epoch, 00:00:00.000000000 on 1st January 1970 |
| date | int32_t | The number of days from the Unix epoch, 1st January 1970 |
| decimal | int32_t | |
| char | Not supported | |
| struct | Not supported | |
| list | Not supported | |
| map | Not supported | |
| union | Not supported | |
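As a minimal illustration of the ORC timestamp encoding, using a hypothetical value, the stored int64 can be computed with exact integer arithmetic to avoid floating-point precision loss:

```python
from datetime import datetime, timedelta, timezone

# ORC timestamp: nanoseconds since the Unix epoch, stored as int64_t.
ts = datetime(2024, 3, 15, 13, 45, 30, 250_000, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Floor division of one timedelta by another yields an exact integer.
micros = (ts - epoch) // timedelta(microseconds=1)
nanos = micros * 1_000
```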
Mapping ORC files to the source data model
There are two different approaches to map ORC files to their respective model tables.
- Use a dataset/tableset/ORC directory structure when you can arrange ORC files in a three-level directory hierarchy. We recommend that you use this option.
- Use a tablemap JSON file when it is not possible to use a three-level directory structure.
Using a dataset/tableset/ORC folder structure
If you organize your ORC files in a three-level folder hierarchy, the table names can be inferred from their respective {table_name}.table directories. This is done during the schema discovery phase in the Data scan option of task_manager.sh.
To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory, which contains all of the table-specific {table_name}.table directories, each with their own {table1_set1}.ORC subdirectories that contain the ORC data files.
Using a tablemap JSON file
If the ORC files for your tables are in a directory structure that does not match the three-level structure but where each table's data files are in different, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.
You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data scan option of task_manager.sh searches recursively for ORC data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.
Create a JSON file according to the format here and provide the full path to that file as source_data_path in indexing_settings.json.
Field mapping
You can use a field mappings file to manually define the table and field names to be mapped. The path to this file is set in the indexing_settings.json configuration file.
Example: Field mapping sample file
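The sample file itself is not reproduced in this section. As a purely hypothetical illustration (the structure and all names below are assumptions, not the product's documented format), a field mappings file might associate source names with the names to use in the index:

```
{
  "field_mappings": [
    { "table": "customers", "source_field": "cust;id", "mapped_field": "cust_id" }
  ]
}
```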