
Connecting to data streams

The following streaming services are supported in Qlik Open Lakehouse projects. Event data is continuously ingested to ensure near real-time availability for downstream data integration, analytics, and AI, enabling low-latency pipelines that reflect the most current operational activity.

Streaming services such as Apache Kafka and Amazon Kinesis provide durable, high-throughput pipelines for capturing operational events as they occur. Unlike file-based sources that rely on batch ingestion, streaming sources deliver data continuously as events are produced, enabling near-real-time processing without waiting for files to be generated or scheduled. Producers publish structured or semi-structured messages that retain their schema and support partitioning. All updates and deletes for the same record must use the same partition key. Kafka and Kinesis guarantee ordering only within a single partition or shard, not across the entire topic or stream, so using a consistent partition key ensures that changes for a given record are processed in the correct sequence. Qlik also supports Amazon S3 as a streaming source for continuously ingesting event data.
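The per-partition ordering guarantee can be illustrated with a minimal sketch of hash-based partition assignment. The hash function and partition count below are illustrative, not Kafka's actual partitioner; the point is only that a stable hash sends every change for the same key to the same partition.

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic stand-in for a broker's partitioner: any stable hash
    # works, as long as the same key always maps to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All changes for the same record use the same partition key, so they
# land in the same partition and are consumed in production order.
events = [
    {"key": "customer-42", "op": "insert"},
    {"key": "customer-42", "op": "update"},
    {"key": "customer-42", "op": "delete"},
]
partitions = {partition_for(e["key"]) for e in events}
assert len(partitions) == 1  # one key -> one partition -> ordered replay
```

If updates and deletes for a record were published with different keys, they could land in different partitions and be consumed out of order.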

Streaming ingestion versus batch ingestion

Streaming and batch data sources compare as follows:

  • With both sources, events are efficiently ingested every minute, supporting low-latency processing and near-real-time analytics.

  • With non-streaming sources, there is first a full load of the existing data and then changes are ingested. You can also reload the full load data from the source.

  • With streaming sources, there is no clear distinction between the initial load and subsequent events. Qlik can manage retention and also supports partitions.

Information note: Streaming tasks are billed based on compute usage (vCores x runtime) rather than data volume.

In a Qlik Open Lakehouse project, streaming sources can only be used with the Streaming landing task and Streaming transform task:

  • Streaming data is ingested using a Streaming landing task. Instead of processing discrete files, the task reads events as they arrive, lands the data in Amazon S3, and persists events as Avro files. This approach preserves schema evolution, supports complex data types such as structs, and provides efficient storage with optimized query performance while maintaining a continuous ingestion model.

  • When you onboard data from a streaming source, a Streaming transform task is automatically added for each dataset that will be stored in Iceberg format. Optionally, the Streaming transform task can be used to standardize structures, enrich event payloads, or align data with downstream consumption models.

  • A Mirror data task enables datasets from streaming sources to be mirrored to cloud data warehouses, allowing downstream systems to consume streaming events without duplicating data. For more information, see Mirroring data to a cloud data warehouse.

Data type mappings

The initial source schema is based on a sample of the data taken prior to the PREPARE phase when creating your pipeline project, and schema evolution is handled at read time. Mirror tasks and other downstream tasks that do not support STRUCT and ARRAY represent those columns as a JSON type, which can be parsed using SQL.
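The JSON fallback can be sketched as follows. The row below is hypothetical: it shows a nested struct arriving in a downstream task as a JSON-encoded string, which is then parsed back into individual fields (in a cloud data warehouse, the equivalent step would use its JSON functions in SQL).

```python
import json

# Hypothetical row as delivered to a task without STRUCT support:
# the nested "customer" struct arrives as a JSON-encoded string.
row = {"order_id": 1001, "customer": '{"id": 42, "address": {"city": "Lund"}}'}

# Parse the JSON string back into its nested fields.
customer = json.loads(row["customer"])
assert customer["id"] == 42
assert customer["address"]["city"] == "Lund"
```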

The following data type mappings apply to all supported data sources but vary according to the source file type. Note the following:

  • Data types are inferred from a sample of the data being onboarded. For example, if a field contains only integer values in the sample, it is created as INT8 in the streaming landing and transform tasks. If subsequent data includes double-precision fractional values, the landing files contain those values; however, in the Streaming transform task, if the Change field data type setting is set to Ignore, the column remains INT8 and the fractional values are truncated. To avoid unintended truncation, ensure the sample data includes the full range of expected values before onboarding, or configure Change field data type to Stop task during early stages and adjust data types as needed.

  • If a field is added to a struct in the source, it is always added to the landing target. For streaming transformation, the behavior is applied according to the option chosen in Streaming transform task settings > Schema evolution > Add fields to struct (Apply to target, Ignore, Stop task).

  • If a field is missing in a specific record, or an array is empty, the value is treated as null.

  • If a dataset is flattened on an array, and a record arrives where that array is empty or null, the system creates one row with the flattened field set to null; the row is not excluded automatically. To exclude these rows, manually add a filter, for example, array_element IS NOT NULL.

  • The data types displayed in the UI reflect the selected dataset granularity. For flattened arrays, the data type of the individual element is shown rather than the array structure itself.

  • A new attribute cannot be added inside a struct within a nested JSON field, only at the root level.

  • In streaming transform tasks, flattening is supported for only a single level of an array. When flatten is applied to a multi-level array, for example, ARRAY<ARRAY<STRUCT>>, only the outer array is flattened, resulting in ARRAY<STRUCT> rather than a fully flattened STRUCT. Additionally, the current UI allows flattening to be configured only at the column level. As a result, selecting a multi-level array implicitly applies flattening to the first array level only.
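The flattening behavior for empty or null arrays can be sketched as follows. The function and field names are illustrative, not Qlik's implementation; the sketch only demonstrates that one null row is emitted per empty or null array, and that the IS NOT NULL filter is what removes such rows.

```python
# Minimal sketch of flattening a dataset on a single-level array.
# An empty or null array still yields one row, with the flattened
# field set to None; filtering is a separate, explicit step.
def flatten(records, array_field):
    for rec in records:
        values = rec.get(array_field) or [None]  # empty/null -> one null row
        for value in values:
            yield {**{k: v for k, v in rec.items() if k != array_field},
                   "array_element": value}

records = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},       # empty array
    {"id": 3, "tags": None},     # null array
]
rows = list(flatten(records, "tags"))
assert [r["array_element"] for r in rows] == ["a", "b", None, None]

# Equivalent of adding the filter: array_element IS NOT NULL
kept = [r for r in rows if r["array_element"] is not None]
assert [r["id"] for r in kept] == [1, 1]
```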

JSON

In JSON files, the numeric value in the source determines the target data type:

  • INT8 is used for integer values that fit within the supported integer range and do not include a fractional component.

  • REAL8 (DOUBLE) is used when the value contains a fractional component (floating-point number).

  • STRING is used when the numeric value exceeds the maximum supported integer range.

Data types are mapped as follows:

Source data types                   Qlik Talend Data Integration data types
STRING                              STRING
NUMBER (integer within range)       INT8
NUMBER (fractional)                 REAL8
NUMBER (exceeds integer range)      STRING
BOOLEAN                             BOOLEAN
ARRAY                               ARRAY
OBJECT                              STRUCT
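The numeric rules above can be sketched as a small classifier. The exact integer bounds are an assumption (the page does not state them; a 64-bit range is used here for illustration):

```python
# Assumed 64-bit bounds for INT8; the page does not state the exact range.
INT8_MIN, INT8_MAX = -2**63, 2**63 - 1

def json_number_type(value):
    # Map a parsed JSON numeric value to a target type per the rules above.
    if isinstance(value, float):
        return "REAL8"                  # fractional component
    if INT8_MIN <= value <= INT8_MAX:
        return "INT8"                   # integer within supported range
    return "STRING"                     # integer beyond supported range

assert json_number_type(7) == "INT8"
assert json_number_type(7.5) == "REAL8"
assert json_number_type(2**70) == "STRING"
```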

CSV, TSV, REGEX, and SPLIT

By default, all source data types are ingested as strings. Use the Automatically infer types option to map source and target types as follows:

Source data types                         Qlik data types
NUMERIC                                   INT8/REAL8
True/TRUE/true/False/FALSE/false          BOOLEAN
TIMESTAMP                                 DATETIME (timestamps in the format yyyy-MM-dd HH:mm:ss or yyyy-MM-ddTHH:mm:ssz are parsed to a datetime type; if a timezone is included, the value is parsed as a string)
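The timestamp inference rule can be sketched with Python's strptime. The format strings below are my rendering of the patterns above, and the fallback behavior is an assumption based on the note about timezones:

```python
from datetime import datetime

# Formats assumed equivalent to yyyy-MM-dd HH:mm:ss and yyyy-MM-ddTHH:mm:ss.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")

def infer_timestamp(value: str):
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)  # parsed to a datetime type
        except ValueError:
            continue
    return value  # timezone suffix or unrecognized format -> kept as string

assert isinstance(infer_timestamp("2024-05-01 12:30:00"), datetime)
assert isinstance(infer_timestamp("2024-05-01T12:30:00"), datetime)
assert isinstance(infer_timestamp("2024-05-01T12:30:00+02:00"), str)
```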

Parquet

Parquet files support physical and logical data types. Physical data types define how values are stored on disk, such as INT32, DOUBLE, or BYTE_ARRAY. Logical data types provide semantic meaning on top of the physical representation, for example, identifying whether an integer value represents a date. When a logical type is attached to a Parquet column and is supported in Qlik Open Lakehouse (as listed below), the Streaming landing task uses the logical type when defining the target schema, rather than the underlying physical type. This ensures that data is interpreted correctly, preserves intended semantics such as precision, scale, and temporal meaning, and results in more accurate schemas when data is written to downstream formats.
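The "logical type wins when supported" rule can be sketched as a lookup keyed on the physical and logical type pair. The mapping below is a small subset transcribed from the table that follows; the function itself is illustrative, not Qlik's implementation:

```python
# Subset of the Parquet mapping table: (physical, logical) -> target type.
MAPPING = {
    ("INT32", None): "INT8",
    ("INT32", "DATE"): "DATE",
    ("INT64", None): "INT8",
    ("INT64", "TIMESTAMP(MICROS,true)"): "DATETIME",
    ("BYTE_ARRAY", None): "STRING",   # encoded as Base64
    ("BYTE_ARRAY", "STRING"): "STRING",
}

def target_type(physical, logical=None):
    # When a supported logical type is attached, it drives the target
    # schema; otherwise the physical type's default mapping applies.
    return MAPPING[(physical, logical)]

assert target_type("INT32", "DATE") == "DATE"   # logical type wins
assert target_type("INT32") == "INT8"           # physical fallback
```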

Data sourced from Parquet files is mapped as follows:

Source data types        Logical types             Qlik Talend Data Integration data types
BOOLEAN                  -                         BOOLEAN
INT32                    -                         INT8
INT64                    -                         INT8
INT96                    -                         DATETIME
FLOAT                    -                         REAL8
DOUBLE                   -                         REAL8
BYTE_ARRAY               -                         STRING (Encoded as Base64)
FIXED_LEN_BYTE_ARRAY     -                         STRING (Encoded as Base64)
BYTE_ARRAY               STRING                    STRING
BYTE_ARRAY               ENUM                      STRING
INT32                    DECIMAL                   INT8
INT64                    DECIMAL                   INT8
FIXED_LEN_BYTE_ARRAY     DECIMAL                   INT8/REAL8 (Encoded as Base64)
BYTE_ARRAY               DECIMAL                   INT8/REAL8 (Encoded as Base64)
INT32                    DATE                      DATE
INT32                    TIME(MILLIS,true)         INT8
INT64                    TIME(MICROS,true)         TIME
INT64                    TIMESTAMP(MICROS,true)    DATETIME
INT64                    TIMESTAMP(MILLIS,true)    DATETIME
NESTED TYPES             -                         STRUCT
LIST                     -                         ARRAY
MAP                      -                         ARRAY<STRUCT> (array of structs representing key-value pairs)

Avro

The following mappings apply to Avro files with schema registry.

Source data types    Logical types        Qlik Talend Data Integration data types
BOOLEAN              -                    BOOLEAN
INT                  -                    INT8
LONG                 -                    INT8
FLOAT                -                    REAL8
DOUBLE               -                    REAL8
BYTES                -                    STRING
STRING               -                    STRING
RECORD               -                    STRUCT
ENUM                 -                    STRING
ARRAY                -                    ARRAY
MAP                  -                    ARRAY<STRUCT>
UNION                -                    -
FIXED                -                    STRING
BYTES                DECIMAL              DECIMAL
FIXED                DECIMAL              DECIMAL
INT                  DATE                 DATE
INT                  TIME-MILLIS          INT8
INT                  TIME-MICROS          TIME
LONG                 TIMESTAMP-MILLIS     DATETIME
LONG                 TIMESTAMP-MICROS     DATETIME

ORC

The following mappings apply to ORC files.

Source data types    Qlik Talend Data Integration data types
BOOLEAN              BOOLEAN
BYTE                 INT8
SHORT                INT8
INT                  INT8
LONG                 INT8
DATE                 DATE
FLOAT                REAL8
DOUBLE               REAL8
TIMESTAMP            DATETIME
BINARY               STRING
DECIMAL              REAL8
STRING               STRING
VARCHAR              STRING
CHAR                 STRING
LIST                 ARRAY
MAP                  ARRAY<STRUCT> (array of structs representing key-value pairs)
STRUCT               STRUCT
UNION                -

Limitations

The following limitations apply to all data sources:

  • If your files are of different types, which can occur when they originate from multiple sources or versions, the transform task created using a single sample file (for example, during onboarding) does not automatically account for those differences.

  • If you change the data types in the landing task, for example because you need to hash the data, ensure the transform data types match the new data types.

