Connecting to data streams
The following streaming services are supported in Qlik Open Lakehouse projects. Event data is continuously ingested to ensure near real-time availability for downstream data integration, analytics, and AI, enabling low-latency pipelines that reflect the most current operational activity.
Streaming services such as Apache Kafka and Amazon Kinesis provide durable, high-throughput pipelines for capturing operational events as they occur. Unlike file-based sources that rely on batch ingestion, streaming sources deliver data continuously as events are produced, enabling near-real-time processing without waiting for files to be generated or scheduled. Producers publish structured or semi-structured messages that retain their schema and support partitioning. All updates and deletes for the same record must use the same partition key. Kafka and Kinesis guarantee ordering only within a single partition or shard, not across the entire topic or stream, so using a consistent partition key ensures that changes for a given record are processed in the correct sequence. Qlik also supports Amazon S3 as a streaming source for continuously ingesting event data.
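The ordering guarantee above can be illustrated with a small sketch. Kafka's default partitioner actually uses a murmur2 hash rather than the MD5 used here, and the partition count is an assumption, but the property is the same: equal keys always map to the same partition, so all changes for one record are processed in sequence.

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative topic/stream size, not a Qlik setting

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition deterministically.

    Illustrative only: Kafka's default partitioner uses murmur2, not MD5,
    but equal keys always land on the same partition either way.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All changes for the same record use the same key, so they share a
# partition and are delivered in the order they were produced.
events = [
    {"op": "insert", "customer_id": "42"},
    {"op": "update", "customer_id": "42"},
    {"op": "delete", "customer_id": "42"},
]
partitions = {partition_for(e["customer_id"]) for e in events}
assert len(partitions) == 1  # every change routed to a single partition
```

If updates and deletes for a record were keyed inconsistently, they could land on different partitions and be consumed out of order, which is why the same partition key must be used for all changes to a record.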
Streaming ingestion versus batch ingestion
The differences between streaming and batch data sources are as follows:

- With both source types, events are efficiently ingested every minute, supporting low-latency processing and near-real-time analytics.
- With non-streaming sources, there is first a full load of the existing data, after which changes are ingested. You can also reload the full load data from the source.
- With streaming sources, there is no clear distinction between the initial load and subsequent events. Qlik can manage retention and also supports partitions.
In a Qlik Open Lakehouse project, streaming sources can only be used with the Streaming landing task and Streaming transform task:
- Streaming data is ingested using a Streaming landing task. Instead of processing discrete files, the Streaming landing task reads events as they arrive, lands the data in Amazon S3, and persists events as Avro files. This approach preserves schema evolution, supports complex data types such as structs, and provides efficient storage with optimized query performance while maintaining a continuous ingestion model.
- When you onboard data from a streaming source, a Streaming transform task is automatically added for each dataset that will be stored in Iceberg format. Optionally, the Streaming transform task can be used to standardize structures, enrich event payloads, or align data with downstream consumption models.
- A Mirror data task enables datasets from streaming sources to be mirrored to cloud data warehouses, allowing downstream systems to consume streaming events without duplicating data. For more information, see Mirroring data to a cloud data warehouse.
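The kind of payload standardization a transform step performs can be sketched in plain Python. The field names, the nested struct, and the normalization rules below are hypothetical examples, not Qlik APIs: the sketch renames fields, coerces an epoch timestamp to UTC ISO 8601, and flattens a struct field.

```python
from datetime import datetime, timezone

def standardize(event: dict) -> dict:
    """Normalize a raw event payload into a consistent downstream shape.

    Hypothetical example: renames fields, converts the epoch timestamp
    to a UTC ISO 8601 string, and flattens a nested struct -- typical of
    the light standardization applied before downstream consumption.
    """
    ts = datetime.fromtimestamp(event["ts_epoch"], tz=timezone.utc)
    return {
        "customer_id": str(event["custId"]),           # unify naming and typing
        "event_time": ts.isoformat(),                  # standard timestamp format
        "city": event.get("address", {}).get("city"),  # flatten struct field
    }

raw = {"custId": 42, "ts_epoch": 1700000000, "address": {"city": "Lund"}}
clean = standardize(raw)
```

In practice, such rules would be expressed in the transform task itself rather than in application code; the sketch only shows the shape of the work.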
Limitations
The following limitations apply to all data sources:
- If your files are of different types, which can occur when they originate from multiple sources or versions, the transform task, which is created from a single sample file (for example, during onboarding), does not automatically account for those differences.
- If you change the data types in the landing task, for example because you need to hash the data, ensure that the transform data types match the new data types.
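As an illustration of the second limitation: hashing typically turns a source value of any type into a fixed-length string, so the matching transform column must be declared with a string type. The choice of SHA-256 here is an assumption for the example, not Qlik's hashing implementation.

```python
import hashlib

def hash_value(value) -> str:
    """Hash any landed value; the result is always a hex string."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

original = 1234567890           # e.g. an integer column in the source
hashed = hash_value(original)   # now a 64-character hex string

# The transform data type for this column must now be a string,
# not the source's original integer type.
assert isinstance(hashed, str) and len(hashed) == 64
```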