Creating the index using scripts

The index is the main element of Qlik Big Data Index, containing indexlets that are persisted data and symbol tables that represent the big data. You need to create the index by executing the supplied shell scripts in the cluster.

Scripts are located in the /home/ubuntu/dist/runtime/scripts/ folder.

Changing indexing settings

You need to change the indexing settings in the configuration file indexing_setting.json, located in the /home/ubuntu/dist/runtime/config folder. The most important settings to update are dataset_name and source_data_path, which point to the data set that you want to index.

Settings in indexing_setting.json

Setting	Description
output_root_folder	Root folder where all output, such as schemas, should be created. This must be set to a shared path that can be accessed across all nodes. The default setting is /home/output.
dataset_name	Name of the data set that should be processed.
symbol_output_folder	Folder for symbol creation output. The default setting is /home/output/SymbolOutput.
index_output_folder	Folder for index creation output. The default setting is /home/output/IndexOutput.
symbol_server_async_threads	The number of parallel threads that the symbol server can handle. The default setting is 1. We recommend that you set this to the number of cores of the machine.
create_column_index_threads	Setting that affects how much memory is consumed when creating symbols. The default setting is 1. We recommend that you set this to a value that is less than both a third of the memory size in GB of the machine and the value of the symbol_server_async_threads setting.
source_data_path	Path to the folder where your data set is located. This must be set to a shared path that can be accessed across all nodes. The default setting is /home/data.
field_mappings_file	Field mappings file that defines the table and field names to be mapped for attribute to attribute (A2A) associations in the schema.
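
The two thread recommendations in the table can be turned into a quick calculation. The following is a minimal sketch and not part of the supplied scripts: the function name suggest_threads is hypothetical, and it assumes that the recommendation for create_column_index_threads means a value below both limits.

```shell
#!/bin/sh
# Hypothetical helper (not part of Qlik Big Data Index): suggest thread
# settings for one machine from its core count and memory size in GB.
# usage: suggest_threads CORES MEMORY_GB
suggest_threads() (
  cores=$1
  mem_gb=$2

  # symbol_server_async_threads: the number of cores of the machine.
  symbol_threads=$cores

  # create_column_index_threads: largest whole number strictly below both
  # a third of the memory size in GB and symbol_server_async_threads.
  mem_limit=$(( (mem_gb + 2) / 3 - 1 ))   # ceil(mem_gb / 3) - 1
  column_threads=$(( symbol_threads - 1 ))
  if [ "$mem_limit" -lt "$column_threads" ]; then
    column_threads=$mem_limit
  fi
  # Never go below the documented default of 1.
  if [ "$column_threads" -lt 1 ]; then
    column_threads=1
  fi

  echo "symbol_server_async_threads=$symbol_threads"
  echo "create_column_index_threads=$column_threads"
)

# Example: a machine with 8 cores and 32 GB of memory.
suggest_threads 8 32
```

For the example machine, the sketch suggests 8 for symbol_server_async_threads and 7 for create_column_index_threads. On Linux you could pass "$(nproc)" as the core count. Treat the result as a starting point and enter the values in indexing_setting.json manually.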

Starting indexing services

You need to start the indexing services on all nodes of the cluster before you can create the index. This is done with the shell script start_indexing_env.sh. If you do not specify any options, the script is executed using the default settings in the indexing configuration file indexing_setting.json.

Syntax:

./start_indexing_env.sh [options...]

Short version	Long version	Description
-h	--help	Print help for the script
-b	--binaryfolder	Specify the folder where the indexer_tool binary is stored.
-o	--outputrootfolder	Specify the root folder for the output of the indexing services results. The IP address (the local IP address or the address specified with --useip) will be appended.
-u	--useip	Specify the IP address to start indexing services on. The default setting is the local IP address. This is only needed when multiple network interfaces are defined on the local machine.
-c	--clusterconfig	Specify the path of the folder containing indexing configuration files.
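
As an illustration of how the options combine, the following sketch wraps the start script in a small function. The function name start_indexing_node and the SCRIPTS_DIR variable are hypothetical, and the paths are only the defaults mentioned on this page. The option letters are taken from the table above.

```shell
#!/bin/sh
# Folder that holds the supplied scripts (default taken from this page).
SCRIPTS_DIR=${SCRIPTS_DIR:-/home/ubuntu/dist/runtime/scripts}

# Hypothetical wrapper around start_indexing_env.sh.
# usage: start_indexing_node CONFIG_DIR OUTPUT_ROOT [IP_ADDRESS]
start_indexing_node() (
  config_dir=$1
  output_root=$2
  node_ip=${3:-}

  # The function body runs in a subshell, so this cd does not change
  # the working directory of the caller.
  cd "$SCRIPTS_DIR" || exit 1

  if [ -n "$node_ip" ]; then
    # -u is only needed when the machine has multiple network interfaces.
    ./start_indexing_env.sh -c "$config_dir" -o "$output_root" -u "$node_ip"
  else
    ./start_indexing_env.sh -c "$config_dir" -o "$output_root"
  fi
)

# Example (run on every node of the cluster):
#   start_indexing_node /home/ubuntu/dist/runtime/config /home/output
```

The wrapper only passes options through. It does not check that the services actually started, so review the script output on each node.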

Creating the index

The index is created by executing the task_manager.sh shell script after the indexing services have been started. You perform this in three steps, using different script options.

Scan the data and generate a schema (option 1).
Register the schema (option 3).
Create the index (option a, or alternatively, options 4, 5 and 6 in sequence).

These three steps are the only options required to create the index, but there are further options available in the script. You can run the script in interactive mode by executing it without specifying an option.

Syntax:

./task_manager.sh [options...]

Short version	Long version	Description
-h	--help	Print help for the script
-b	--binaryfolder	Specify the folder where the indexer_tool binary is stored.
-u	--useip	Specify the IP address to start indexing services on. The default setting is the local IP address. This is only needed when multiple network interfaces are defined on the local machine.
-c	--config	Specify the path of the folder containing indexing configuration files.
-r	--run	Execute the script with a specific task option: ./task_manager.sh -r <task-option>. If you do not use the -r option, the script is executed in interactive mode.
-a	--acceptLicense	Accept the Qlik User License Agreement (QULA).

Example: Running in interactive mode

./task_manager.sh

You can also execute the script with a task option to run a specific operation with the -r option.

Example: Executing the script option 1 only (Scan the data and generate a schema)

./task_manager.sh -r 1
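
The three required steps (options 1, 3 and a) can also be chained with the -r option. The sketch below is an assumption-laden illustration: the function name create_index and the SCRIPTS_DIR variable are hypothetical, and it assumes that task_manager.sh returns a non-zero exit code when a task option fails, which this page does not confirm. It also skips the manual schema review between options 1 and 3, so it only fits cases where the generated schema needs no changes.

```shell
#!/bin/sh
# Folder that holds the supplied scripts (default taken from this page).
SCRIPTS_DIR=${SCRIPTS_DIR:-/home/ubuntu/dist/runtime/scripts}

# Hypothetical helper: run the three required task options in order and
# stop at the first one that reports a failure.
create_index() (
  cd "$SCRIPTS_DIR" || exit 1

  # 1 = scan the data and generate a schema
  # 3 = register the schema
  # a = start index creation (options 4, 5 and 6 in sequence)
  for opt in 1 3 a; do
    echo "Running task option $opt"
    if ! ./task_manager.sh -r "$opt"; then
      echo "Task option $opt failed, stopping" >&2
      exit 1
    fi
  done
)

# Example (after the indexing services have been started):
#   create_index
```

If you need to edit the schema, run option 1 on its own, review the schema file, and then run options 3 and a.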

Refer to the full syntax description of the task_manager.sh task options below.

All task_manager.sh task options

Option	Description
1	Scan the data and generate a schema
2	View the generated schema
3	Register the schema
a	Start index creation. This option executes options 4, 5 and 6 in sequence.
4	Add indexing task
5	Create a column index
6	Create A2A indexlets
l	List task progress
t	List all indexing tasks in JSON format.
r	Resume an indexing task with a defined task id.
s	Scan the data and generate statistics
q	Quit

Scan the data and generate a schema (option 1)

The first step scans the Parquet files in the source data and generates the schema and the data_source configuration in JSON format.

The data scan generates attribute to attribute (A2A) associations in the schema automatically if:

Multiple tables have fields with exactly the same names.
A field mappings file that defines the table/field names to be mapped is set in the indexing_setting.json configuration file.

After you have generated the schema, review the schema file to add or modify A2A associations before you register the schema. The file will be located in the {output_root_folder}/config/indexer/ folder.

The "associations" section defines the field mapping between tables for A2A creation. Each "data" sub-section defines an association of a pair of fields from two different tables.

The "tables" section defines the table/field structure of the data set.

See Schema configuration sample file.

Remember to back up any manual changes. If you run the data scan again, the schema files will be overwritten.
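
A timestamped copy is one simple way to keep manual schema changes safe before a new data scan. The sketch below is only an illustration: the function name backup_schema is hypothetical, and the example path assumes that the schema file is named schema.json and that output_root_folder has its default value /home/output.

```shell
#!/bin/sh
# Hypothetical helper: copy a schema file to a timestamped backup next
# to the original and print the name of the backup file.
# usage: backup_schema PATH_TO_SCHEMA_FILE
backup_schema() (
  schema_file=$1

  if [ ! -f "$schema_file" ]; then
    echo "No schema file found at $schema_file" >&2
    exit 1
  fi

  # Example result: schema.json.20240131-153000.bak
  backup_file="$schema_file.$(date +%Y%m%d-%H%M%S).bak"

  # -p keeps the modification time and permissions of the original.
  cp -p "$schema_file" "$backup_file" || exit 1
  echo "$backup_file"
)

# Example (the path is an assumption, adjust it to your output_root_folder):
#   backup_schema /home/output/config/indexer/schema.json
```

Run the backup before you execute option 1 again, and copy your changes back into the regenerated file before you register the schema.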

View the generated schema (option 2)

You can view the schema that was generated. This option prints the schema.json generated by option 1.

Register the schema (option 3)

This step registers the schema that was generated from the scan of the data source, including any manual updates.

Start index creation (option a)

This step starts the index creation which is performed in three operations in a controlled sequence. You can also perform each operation step by step manually, but they must be performed in the correct sequence, and the previous operation must be completed.

Add the indexing task (option 4)

This operation adds the index task and triggers symbol creation.

Create a column index (option 5)

This operation creates indexlets for all tables and columns.

Create A2A indexlets (option 6)

This operation creates attribute to attribute (A2A) indexlets.

Listing task progress (option l)

You can get a list of all indexing tasks with progress status of symbol and indexlet creation.

Generating source data statistics (option s)

You can scan the source data to generate statistics that can be used when planning scaling of the cluster. This option is not required to create the index.

This operation can be very time-consuming for a large data set.

Indexing services

The following services are started on the Indexing manager, Symbol server and Indexer server nodes when indexing is started. The cluster.json configuration file contains the settings for the IP address and port number of each service. You can use environment variables to override the default port setting.

Service	Default port	Environment variable
Indexing Registry Service	50057	BDI_INDEXING_REGISTRY_PORT
Persistence Service	55010	BDI_PERSISTENCE_MANAGER_PORT
Index Maintenance Service	55003	BDI_INDEX_MAINTENANCE_PORT
Symbol Service	55030	BDI_SYMBOL_SERVER_PORT
Indexer Service	55040	BDI_INDEXER_SERVER_PORT
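
To illustrate the override rule, the following sketch prints the port that each service would use in the current shell: the value of the environment variable if it is set, otherwise the default from the table. The function name effective_ports and the port value 56030 are hypothetical. The sketch treats an empty variable like an unset one, which is an assumption about how the services behave.

```shell
#!/bin/sh
# Hypothetical helper: show the port each indexing service would use,
# based on the environment variables and defaults in the table above.
effective_ports() {
  echo "Indexing Registry Service: ${BDI_INDEXING_REGISTRY_PORT:-50057}"
  echo "Persistence Service: ${BDI_PERSISTENCE_MANAGER_PORT:-55010}"
  echo "Index Maintenance Service: ${BDI_INDEX_MAINTENANCE_PORT:-55003}"
  echo "Symbol Service: ${BDI_SYMBOL_SERVER_PORT:-55030}"
  echo "Indexer Service: ${BDI_INDEXER_SERVER_PORT:-55040}"
}

# Example override: export the variable so that scripts started from
# this shell inherit it. The port number is only an example.
export BDI_SYMBOL_SERVER_PORT=56030
effective_ports
```

An override only takes effect for services started from a shell where the variable is exported, so set it before you run start_indexing_env.sh.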

Starting the QSL cluster

When the indexing cluster is up and running and indexlets have been created, you need to start the QSL cluster services. You start the services using the start_qsl_env.sh script located in the /home/ubuntu/dist/runtime/scripts/QSL_processor folder. This script calls the following scripts in sequence.

start_regex_service.sh

This starts the QSL registry and executor service on the node defined as instances.qsl_processor.qsl_executor in cluster.json. This service listens on port 44000 by default. The port number can be overridden with the environment variable BDI_QSL_REGEXEC_PORT.

start_worker_service.sh

This starts QSL worker services on all nodes specified as instances.qsl_processor.qsl_workers in cluster.json.

You can also start multiple worker services on a local node with the option -w {no_of_workers}, where {no_of_workers} > 1.

start_manager_service.sh

This starts the QSL manager service on the node specified as instances.qsl_processor.qsl_manager in cluster.json. This service listens on port 55000 by default. The port number can be overridden with the environment variable BDI_QSL_MANAGER_PORT.

When the QSL services have been started, the system will adjust the output cache for optimal performance in the background. We recommend that you wait 30 minutes before using the index.
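
Starting the QSL cluster with an overridden manager port could look like the following sketch. The function name start_qsl and the QSL_DIR variable are hypothetical, and the sketch assumes that start_qsl_env.sh can be run without options from its own folder.

```shell
#!/bin/sh
# Folder that holds the QSL scripts (default taken from this page).
QSL_DIR=${QSL_DIR:-/home/ubuntu/dist/runtime/scripts/QSL_processor}

# Hypothetical helper: run start_qsl_env.sh from its own folder.
start_qsl() (
  if [ ! -x "$QSL_DIR/start_qsl_env.sh" ]; then
    echo "start_qsl_env.sh not found or not executable in $QSL_DIR" >&2
    exit 1
  fi

  # The function body runs in a subshell, so this cd does not change
  # the working directory of the caller.
  cd "$QSL_DIR" || exit 1
  ./start_qsl_env.sh
)

# Example: override the default manager port (55000) and start the
# services. The port number 55100 is only an example.
#   export BDI_QSL_MANAGER_PORT=55100
#   start_qsl
```

The exported variable is inherited by start_qsl_env.sh and by the scripts it calls.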
