Preparing data source files
Qlik Big Data Index uses non-nested Parquet files or Optimized Row Columnar (ORC) files stored on HDFS, S3, EFS, or Linux file system instances as data sources to create the index. You need to prepare the data source files in Parquet or ORC format.
Recommendations
A large number of data source files can adversely affect indexing performance.
If you have a large number of small Parquet files, you can merge them into larger files, as illustrated in the sketch after these recommendations. We recommend that:
- The number of rows per Parquet file is a multiple (2x to 10x) of the size of an indexlet (16 777 216 rows).
- The number of rows per row group is equal to the size of an indexlet (16 777 216 rows).
If you have a large number of small ORC files, you can also merge them into larger files. We recommend that:
- The number of rows per ORC file is a multiple (2x to 10x) of the size of an indexlet (16 777 216 rows).
- The number of rows per row group is equal to the size of an indexlet (16 777 216 rows).
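As a minimal sketch of such a merge for Parquet files, assuming PyArrow is available, the input files share a schema, and the data fits in memory (all file paths here are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

INDEXLET_ROWS = 16_777_216  # recommended rows per row group

src_dir = Path("/data/parquet/small_files")          # hypothetical input directory
out_file = "/data/parquet/merged/part-0001.parquet"  # hypothetical output file

# Read and concatenate the small files; for data that does not fit
# in memory, read and write in batches instead.
tables = [pq.read_table(p) for p in sorted(src_dir.glob("*.parquet"))]
merged = pa.concat_tables(tables)

# Write one larger file with indexlet-sized row groups.
pq.write_table(merged, out_file, row_group_size=INDEXLET_ROWS)
```

Small ORC files can be merged along the same lines with the pyarrow.orc module.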
Converting CSV files to Parquet files
If you want to convert CSV files to Parquet files, you need to prepare the CSV files according to the following format requirements.
Removing reserved characters from field names
When converting CSV to Parquet, change the column and field names to remove all reserved characters (see the sketch after this list). The following characters are reserved:
- , (comma)
- ; (semi-colon)
- {} (curly brackets)
- () (round brackets)
- \n (line feed)
- \t (tab)
- = (equal sign)
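As a minimal sketch of such a conversion, assuming pandas with the PyArrow engine and a hypothetical input file, reserved characters can be replaced before the Parquet file is written:

```python
import re
import pandas as pd

# The reserved characters listed above, as a regex character class.
RESERVED = re.compile(r"[,;{}()\n\t=]")

df = pd.read_csv("customers.csv")  # hypothetical input file

# Replace each reserved character with an underscore (one possible policy).
df.columns = [RESERVED.sub("_", name) for name in df.columns]

df.to_parquet("customers.parquet", engine="pyarrow", index=False)
```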
Mapping supported data types
The table describes how data types supported by Parquet are mapped to Qlik Big Data Index internal data types during the conversion.
| Parquet logical type | QABDI internal data type | Comments |
| --- | --- | --- |
| UTF8 | std::string | |
| MAP | Not supported | |
| MAP_KEY_VALUE | Not supported | |
| LIST | Not supported | |
| ENUM | Not supported | |
| DECIMAL | double | |
| DATE | int32_t | The number of days from the Unix epoch, 1st January 1970 |
| TIME_MILLIS | int32_t | The number of milliseconds after midnight |
| TIME_MICROS | int64_t | The number of microseconds after midnight |
| TIMESTAMP_MILLIS | int64_t | The number of milliseconds from the Unix epoch, 00:00:00.000 on 1st January 1970 |
| TIMESTAMP_MICROS | int64_t | The number of microseconds from the Unix epoch, 00:00:00.000000 on 1st January 1970 |
| UINT8 | uint8_t | |
| UINT16 | uint16_t | |
| UINT32 | uint32_t | |
| UINT64 | uint64_t | |
| INT8 | int8_t | |
| INT16 | int16_t | |
| INT32 | int32_t | |
| INT64 | int64_t | |
| JSON | Not supported | |
| BSON | Not supported | |
| INTERVAL | Not supported | |
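As a minimal illustration of the date and time encodings in the table, using hypothetical values, the stored integers can be computed as follows:

```python
from datetime import date, time

# DATE: days since the Unix epoch (1970-01-01), stored as int32_t.
d = date(2024, 3, 15)
date_value = (d - date(1970, 1, 1)).days  # 19797

# TIME_MILLIS: milliseconds after midnight, stored as int32_t.
t = time(13, 45, 30, 250_000)  # 13:45:30.250
time_value = (
    ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000
)  # 49530250
```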
Mapping Parquet files to the source data model
There are two different approaches to map Parquet files to their respective model tables.
- Use a dataset/tableset/Parquet directory structure when you can arrange Parquet files in a three-level directory hierarchy. We recommend that you use this option.
- Use a tablemap JSON file when it is not possible to use a three-level directory structure.
Using a dataset/tableset/Parquet folder structure
If you organize your Parquet files in a three-level folder hierarchy, the table names can be inferred from their respective {table_name}.table directories. This is done during the schema discovery phase in the Data scan option of task_manager.sh.
To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory, which contains all of the table-specific {table_name}.table directories, each with their own {table1_set1}.Parquet subdirectories that contain the Parquet data files.
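For example (all dataset, table, and table set names here are hypothetical), a layout like the following satisfies the three-level requirement:

```
{dataset}/                          <- source_data_path points here
    customers.table/
        customers_set1.Parquet/
            part-00000.parquet
            part-00001.parquet
    orders.table/
        orders_set1.Parquet/
            part-00000.parquet
```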
Using a tablemap JSON file
If the Parquet files for your tables are in a directory structure that does not match the three-level structure but where each table's data files are in different, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.
You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data scan option of task_manager.sh searches recursively for Parquet data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.
Create a JSON file according to the format here and provide the full path to that file as source_data_path in indexing_settings.json.
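As a purely illustrative sketch of such a file (the table names and paths are hypothetical, and every field name other than root_dir, which the text above mentions, is an assumption rather than the product's documented schema), a tablemap might pair each table with the root directories to search:

```
{
  "tables": [
    { "name": "customers", "root_dir": ["/data/parquet/customers"] },
    { "name": "orders", "root_dir": ["/data/parquet/orders", "/archive/2019/orders"] }
  ]
}
```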
Converting CSV files to ORC files
If you want to convert CSV files to ORC files, you need to prepare the CSV files according to the following format requirements.
Mapping supported data types
The table describes how data types supported by ORC are mapped to Qlik Big Data Index internal data types during the conversion.
| ORC logical type | QABDI internal data type | Comments |
| --- | --- | --- |
| boolean | uint8_t | |
| tinyint | int8_t | |
| smallint | int16_t | |
| int | int32_t | |
| bigint | int64_t | |
| float | float | |
| double | double | |
| string | std::string | |
| varchar | std::string | |
| binary | std::string | |
| timestamp | int64_t | The number of nanoseconds from the Unix epoch, 00:00:00.000000000 on 1st January 1970 |
| date | int32_t | The number of days from the Unix epoch, 1st January 1970 |
| decimal | int32_t | |
| char | Not supported | |
| struct | Not supported | |
| list | Not supported | |
| map | Not supported | |
| union | Not supported | |
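As a minimal illustration of the ORC timestamp encoding, using a hypothetical value, the stored int64 can be computed with exact integer arithmetic to avoid floating-point precision loss:

```python
from datetime import datetime, timedelta, timezone

# ORC timestamp: nanoseconds since the Unix epoch, stored as int64_t.
ts = datetime(2024, 3, 15, 13, 45, 30, 250_000, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Floor division of one timedelta by another yields an exact integer.
micros = (ts - epoch) // timedelta(microseconds=1)
nanos = micros * 1_000
```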
Mapping ORC files to the source data model
There are two different approaches to map ORC files to their respective model tables.
- Use a dataset/tableset/ORC directory structure when you can arrange ORC files in a three-level directory hierarchy. We recommend that you use this option.
- Use a tablemap JSON file when it is not possible to use a three-level directory structure.
Using a dataset/tableset/ORC folder structure
If you organize your ORC files in a three-level folder hierarchy, the table names can be inferred from their respective {table_name}.table directories. This is done during the schema discovery phase in the Data scan option of task_manager.sh.
To use this option, set source_data_path in indexing_settings.json to the path of the root {dataset} directory, which contains all of the table-specific {table_name}.table directories, each with their own {table1_set1}.ORC subdirectories that contain the ORC data files.
Using a tablemap JSON file
If the ORC files for your tables are in a directory structure that does not match the three-level structure but where each table's data files are in different, non-overlapping directories (or directory trees), it can be easier to use a tablemap JSON file.
You need to create a JSON file that explicitly lists the root directories to search for each table. This works even if a single table's files are in multiple, unrelated directories. With this option, the schema discovery phase in the Data scan option of task_manager.sh searches recursively for ORC data files in the directory structures of each root_dir folder associated with each table in the tablemap JSON file.
Create a JSON file according to the format here and provide the full path to that file as source_data_path in indexing_settings.json.
Field mapping
You can use a field mappings file to manually define the table and field names to be mapped. The path to this file is set in the indexing_settings.json configuration file.
Example: Field mapping sample file
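The sample file itself is not reproduced in this section. As a purely hypothetical illustration (the structure and all names below are assumptions, not the product's documented format), a field mappings file might associate source names with the names to use in the index:

```
{
  "field_mappings": [
    { "table": "customers", "source_field": "cust;id", "mapped_field": "cust_id" }
  ]
}
```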