Streaming data
The onboarding process transfers data from the source and stores it in Iceberg tables. Changes from the streaming data sources are continuously applied to the storage tables in near real time.
Onboard data
Data is onboarded within a pipeline project and datasets are stored in the S3 location defined in the project settings.
- In your project, click Create and then Onboard data.
- Add a Task name and an optional Description for the onboarding. Click Next.
- Select the source connection. You can select an existing streaming source connection or create a new connection to the source. For more information, see Connecting to data streams. Click Next and follow the instructions below for your data source.
Selecting data
Apache Kafka and Amazon Kinesis
The list displays the available Kafka topics or Kinesis streams from the host defined in the source connection.
When selecting your topics or streams, you can select specific datasets, or use selection rules to include or exclude groups of datasets:
- Use % as a wildcard to define selection criteria for the datasets.
- %.% defines all datasets in all streams.

If topics are selected using selection rules, you can choose whether to load all datasets into the same target table or to create a separate target table for each source topic:
- By default, the target Iceberg table name is derived from the topic name, formatted to comply with naming conventions (for example, lowercase, spaces removed, dashes replaced with underscores). In Define target dataset name, you can edit the name of the target table.
- When selection rules are used to load multiple topics into a single table, you must provide the target name.
- When selection rules are used and the data is loaded into separate tables (one dataset per topic), the default target names are the topic names. At this stage, you cannot edit the names in the wizard, but you can do this later in the landing task.
- If a rule is configured to select topics for ingestion, any new topics that meet the rule criteria are also landed if the New topic > Add to target option under schema evolution in the landing task settings is checked.

Select one or more datasets, and click Add selected streams. The added datasets appear under Explicitly selected streams. Click Next.
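To illustrate how %-wildcard selection rules and the default target naming described above might behave, here is a minimal Python sketch. The helper names are hypothetical and the product's actual matching and formatting logic may differ in detail:

```python
import re

def rule_to_regex(rule: str) -> str:
    """Convert a %-wildcard selection rule to a regex (% matches any sequence)."""
    return "^" + ".*".join(re.escape(part) for part in rule.split("%")) + "$"

def select_topics(topics, include_rule):
    """Return the topics matched by an include rule such as '%.%' or 'orders%'."""
    pattern = re.compile(rule_to_regex(include_rule))
    return [t for t in topics if pattern.match(t)]

def default_target_name(topic: str) -> str:
    """Default table name: lowercase, spaces removed, dashes replaced with underscores."""
    return topic.lower().replace(" ", "").replace("-", "_")

topics = ["Web-Orders", "web clicks", "inventory"]
print(select_topics(topics, "%"))          # matches every topic
print(default_target_name("Web-Orders"))   # web_orders
```

The same normalization explains why two topics differing only in case or dashes could collide on the same default target name, which is when editing the name in Define target dataset name becomes necessary.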
Amazon S3
The directory browser displays a list of all the directories located in the S3 bucket of your source connection.
- Select the directories to include when landing data:
  - For each directory, in Add path, enter the path and file name pattern:
    - Use * as a wildcard to match any character.
    - To enter a date pattern, use <yyyy> as the four-digit year placeholder, <MM> as the two-digit month placeholder, <dd> as the two-digit day placeholder, and <HH> as the two-digit hour placeholder. For example:
      - MyDir3/<yyyy>_<MM>_<dd>_<HH>_orders.csv
      - MyDir3/<yyyy>/<MM>/<dd>/<HH>_orders.csv
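The path patterns above combine a * wildcard with date placeholders. As a rough sketch of how such a pattern could be matched against S3 object keys (the function and the exact matching semantics are assumptions for illustration, not the product's implementation):

```python
import re

# Date placeholders and the digit patterns they stand for.
PLACEHOLDERS = {"<yyyy>": r"\d{4}", "<MM>": r"\d{2}", "<dd>": r"\d{2}", "<HH>": r"\d{2}"}

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a path pattern with * and date placeholders into a regex."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("*"), ".*")
    for placeholder, digits in PLACEHOLDERS.items():
        regex = regex.replace(re.escape(placeholder), digits)
    return re.compile("^" + regex + "$")

p = pattern_to_regex("MyDir3/<yyyy>_<MM>_<dd>_<HH>_orders.csv")
print(bool(p.match("MyDir3/2024_06_01_13_orders.csv")))  # True
print(bool(p.match("MyDir3/orders.csv")))                # False
```

Note that the two example patterns in the list are not interchangeable: the first expects the date parts in a single file name separated by underscores, while the second expects them as nested directories.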
- Click Preview to open the Preview data dialog. A list of included and excluded files is displayed.
- Click Validate to check the data.
- In Define target dataset name, provide a name to map the source dataset to the target Iceberg table. Click Next.
Selecting the content type
Choose the source events content type.
- In Choose the type of data events, select the type of events you are ingesting. For more information, see Connecting to data streams.

  The content type selected applies to all topics. You must create a new task for each content type you want to ingest.
- Expand Verify the events are correctly loaded to confirm that the data can be parsed. Make sure the data is correct at this stage; otherwise, you need to recreate the pipeline and load the data again. Use Select dataset to examine specific datasets and check any warnings that may affect the loading of the data. Click the eye icon next to any struct column to view its data.
- Click Next.
Setting ingestion properties
Configure the settings for your pipeline:
- Read data from
  - Start from the earliest event: ingest all historical data.
  - Start from now: ingest only new data arriving from the time the pipeline starts.
- Column unnesting
  - Preserve nested columns: no transformations are applied.
  - Unnest into separate columns: nested data is split into separate columns.
- Load settings
  - Append only: generally the best option for event data, as it usually has a short life span and is not updated, for example, Orders.
  - Merge: best suited to data that is updated over time, for example, Customers.
- Target table partition

  The target table partition option applies to all tables in the pipeline. You can override this later at the table level for bespoke partitioning.
  - No partition: tables are created without any partitioning.
  - Partition by event ingestion date: tables are partitioned by the date events are ingested.
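The column unnesting setting determines whether nested structs stay intact or are split into separate top-level columns. A minimal sketch of what unnesting could look like (the column-naming scheme shown here, joining names with underscores, is an assumption for illustration):

```python
def unnest(record: dict, prefix: str = "") -> dict:
    """Flatten nested structs into separate top-level columns.

    Hypothetical illustration of 'Unnest into separate columns'; the
    product's actual naming and type handling may differ.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(unnest(value, name + "_"))
        else:
            flat[name] = value
    return flat

event = {"order_id": 7, "customer": {"id": 42, "address": {"city": "Oslo"}}}
print(unnest(event))
# {'order_id': 7, 'customer_id': 42, 'customer_address_city': 'Oslo'}
```

With Preserve nested columns, the same event would land as two columns, order_id and a struct-typed customer column.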
Click Next.
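The difference between the two load settings can be sketched in a few lines of Python. This is a simplified model, assuming merge upserts rows by a single key column; the actual implementation operates on Iceberg tables, not in-memory lists:

```python
def append_only(table, events):
    """Append only: every incoming event is added as a new row."""
    return table + events

def merge(table, changes, key="id"):
    """Merge: upsert rows by key, so later changes replace earlier versions."""
    rows = {row[key]: row for row in table}
    for change in changes:
        rows[change[key]] = change
    return list(rows.values())

customers = [{"id": 1, "tier": "bronze"}]
print(merge(customers, [{"id": 1, "tier": "gold"}, {"id": 2, "tier": "silver"}]))
# [{'id': 1, 'tier': 'gold'}, {'id': 2, 'tier': 'silver'}]
```

This is why Append only suits short-lived event data such as Orders, while Merge suits entities such as Customers whose rows are updated over time.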
Summary
The summary screen provides a visual display of your pipeline:
- Optionally, for the Streaming landing and Streaming Transform tasks, click Edit name and description to provide new values.
- Select the option for what you want to happen After the pipeline is created.
- When you have configured all the settings, click Create to create the pipeline project.

When the project is displayed, you can prepare and run each task to begin ingesting the data:
- Prepare and run the Streaming landing task. For more information, see Landing streaming data to Qlik Open Lakehouse.
- Prepare and run the Streaming Transform task. For more information, see Storing streaming datasets.