Preparing the index cluster

Qlik Associative Big Data Index creates the index from non-nested Parquet files stored on HDFS, S3, EFS, or a Linux file system. The HDFS, S3 or EFS services are deployed on a Virtual Private Cloud (VPC) using an Amazon Web Services (AWS) environment. Before you start indexing, review the following sections and complete the preparations they describe.

You can use the management console to configure indexing settings, or edit the configuration files directly.

Configuring the indexing settings with the management console

Click Configure in the management console and select Service settings. The following settings are available.

Service settings
Setting Description
Output root folder

Root folder where all output, such as schemas, should be created. This must be set to a shared path that can be accessed across all nodes. The default setting is /home/output.

Symbol server async threads

The number of parallel threads that the symbol server can handle. The default setting is 1.

We recommend that you set this to the number of cores of the machine.

Create column index threads

Setting that affects how much memory is consumed when creating symbols. The default setting is 1.

We recommend that you set this to a value less than:

  • A third of the memory size, in GB, of the machine.
  • The value of the Symbol server async threads setting.

Symbol server output folder

Folder for symbol server output.

Index output folder

Folder for index creation output. The default setting is /home/output/IndexOutput.

Logging settings folder

(Optional) Folder for storing log files of index creation.

When you have configured the indexing settings, you can add a dataset.

Adding a dataset in the management console

To add a dataset to index, click Configure in the management console and select Dataset.

Dataset settings
Setting Description
Dataset name

Name of the dataset that should be processed. The default name is tpch, but you can change it.

Source data path

Path to the folder where your dataset is located.

This must be set to a shared path that can be accessed across all nodes. The default setting is /home/data.

Field mappings file location

(Optional) Field mappings file that defines the table and field names to be mapped for attribute to attribute (A2A) associations in the schema.
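Before adding the dataset, it can help to confirm on each node that the source data path is visible and contains Parquet files. The following is a minimal sketch, not part of the product: the helper name find_parquet_files is hypothetical, it assumes the data files use the .parquet extension, and it only applies to paths on a Linux file system or a mounted share, not to HDFS or S3 locations.

```python
from pathlib import Path


def find_parquet_files(source_data_path):
    """Return the sorted Parquet files found under source_data_path.

    Raises NotADirectoryError if the path is not visible as a folder on
    this node, which usually means the shared path is not mounted here.
    """
    root = Path(source_data_path)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not an accessible folder on this node")
    # Assumption: data files use the .parquet extension.
    return sorted(root.rglob("*.parquet"))
```

Running find_parquet_files("/home/data") on every node and comparing the results is one way to check that the path is shared as required.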

When you have added a dataset, you can create the index.

Changing indexing settings manually

To change the indexing settings manually, edit the configuration file indexing_setting.json, located in the /home/ubuntu/dist/runtime/config folder. The most important settings to update are dataset_name and source_data_path, which point to the dataset that you want to index.
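A manual edit like this can also be scripted. The following is a minimal sketch that loads the file, overrides selected settings, and writes it back. It assumes that indexing_setting.json is a flat JSON object whose top-level keys are the setting names described in this section; that layout is an assumption, so check your own file first. The helper name update_indexing_settings is hypothetical.

```python
import json
from pathlib import Path

# Location given in this section; adjust if your installation differs.
CONFIG_PATH = Path("/home/ubuntu/dist/runtime/config/indexing_setting.json")


def update_indexing_settings(config_path, **changes):
    """Load the settings file, apply the given overrides, and save it.

    Assumes the file holds a flat JSON object keyed by setting name
    (an assumption; the real file layout may differ).
    """
    config_path = Path(config_path)
    settings = json.loads(config_path.read_text(encoding="utf-8"))
    settings.update(changes)
    config_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

For example, update_indexing_settings(CONFIG_PATH, dataset_name="my_dataset", source_data_path="/home/data/my_dataset") would point the indexer at a dataset with those hypothetical values. Back up the file before overwriting it.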

Settings in indexing_setting.json
Setting Description
output_root_folder

Root folder where all output, such as schemas, should be created. This must be set to a shared path that can be accessed across all nodes. The default setting is /home/output.

dataset_name

Name of the dataset that should be processed.

symbol_output_folder

Folder for symbol creation output. The default setting is /home/output/SymbolOutput.

index_output_folder

Folder for index creation output. The default setting is /home/output/IndexOutput.

symbol_server_async_threads

The number of parallel threads that the symbol server can handle. The default setting is 1.

We recommend that you set this to the number of cores of the machine.

create_column_index_threads

Setting that affects how much memory is consumed when creating symbols. The default setting is 1.

We recommend that you set this to a value less than:

  • A third of the memory size, in GB, of the machine.
  • The value of the symbol_server_async_threads setting.

source_data_path

Path to the folder where your dataset is located.

This must be set to a shared path that can be accessed across all nodes. The default setting is /home/data.

field_mappings_file

Field mappings file that defines the table and field names to be mapped for attribute to attribute (A2A) associations in the schema.
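The thread recommendations for symbol_server_async_threads and create_column_index_threads can be turned into a small calculation. The following is a sketch under stated assumptions, not an official sizing tool: it reads the recommendation as "strictly less than both limits", picks the largest whole number that satisfies it, and falls back to the default of 1 on small machines where no such value exists. The function name recommended_thread_settings is hypothetical.

```python
import math


def recommended_thread_settings(cores, memory_gb):
    """Suggest thread settings from the machine's core count and memory in GB."""
    # Recommendation: one symbol server thread per core.
    symbol_threads = max(1, int(cores))
    # Recommendation: stay below a third of the memory in GB and below
    # the symbol server thread count (interpreted here as below both).
    upper_bound = min(memory_gb / 3, symbol_threads)
    # Largest integer strictly less than upper_bound, never below the default of 1.
    column_threads = max(1, math.ceil(upper_bound) - 1)
    return {
        "symbol_server_async_threads": symbol_threads,
        "create_column_index_threads": column_threads,
    }
```

For a machine with 16 cores and 64 GB of memory, this suggests 16 symbol server threads and 15 column index threads. Treat the result as a starting point and adjust it to your workload.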
