Onboarding data
The first step of creating a data pipeline in a Qlik Open Lakehouse project is onboarding the data. This process involves transferring data from the source and storing datasets in optimized Iceberg tables.
Onboarding is set up in a single operation, but runs in two steps. The data source type, either CDC or streaming, determines which tasks are created in your project:
CDC sources
- Landing the data

  This involves transferring data in continuous mini-batches from the on-premises data source to a landing area, using a Landing data task. For more information, see Landing data from data sources.

  You can also land data to a lakehouse, where the data is stored in S3 file storage.

- Storing datasets

  This involves reading the initial load of landed data, as well as incremental loads, and applying the data in a read-optimized format using a Storage data task.
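The two CDC steps above can be pictured as a land-then-apply loop. The following is a minimal sketch in plain Python, not Qlik code: the landing area, the change record layout (`op`, `id`, `row`), and the function names are all illustrative assumptions, standing in for the Landing and Storage data tasks.

```python
# Illustrative simulation of CDC onboarding. Record layout and names are
# hypothetical, not a Qlik API.

def land_batch(landing_area, batch):
    """Landing step: append a mini-batch of change records as-is."""
    landing_area.append(batch)

def store(landing_area, storage):
    """Storage step: apply landed changes into a read-optimized
    key -> row mapping (latest state per primary key)."""
    for batch in landing_area:
        for change in batch:
            if change["op"] == "delete":
                storage.pop(change["id"], None)
            else:  # "insert" or "update"
                storage[change["id"]] = change["row"]
    landing_area.clear()

landing, stored = [], {}

# The initial full load arrives as insert records.
land_batch(landing, [
    {"op": "insert", "id": 1, "row": {"name": "Ada"}},
    {"op": "insert", "id": 2, "row": {"name": "Grace"}},
])
store(landing, stored)

# An incremental CDC mini-batch with an update and a delete.
land_batch(landing, [
    {"op": "update", "id": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "id": 2},
])
store(landing, stored)

print(stored)  # {1: {'name': 'Ada L.'}}
```

The point of the split is visible here: landing only moves raw change records, while the storage step is what produces the queryable, current-state dataset.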
Streaming sources
- Landing the data

  This involves continuously streaming data from the source to a landing area, using a Streaming landing data task.

- Storing datasets

  This involves reading the initial load of landed data, and applying the data in a read-optimized format using a Storage Transform data task.
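For streaming sources, the same two steps can be sketched as an append-only event log that is later compacted into a read-optimized snapshot. Again this is a plain-Python illustration under assumed names (`ingest`, `compact`, the `key`/`value` event fields), not the actual task implementation.

```python
# Illustrative simulation of streaming onboarding. Event layout and names
# are hypothetical, not a Qlik API.
from typing import Any

landing_log: list[dict[str, Any]] = []

def ingest(event: dict[str, Any]) -> None:
    """Streaming landing: append each event as it arrives."""
    landing_log.append(event)

def compact(log: list[dict[str, Any]]) -> dict[Any, dict[str, Any]]:
    """Storage step: fold the append-only log into the latest state
    per key, producing a read-optimized snapshot."""
    snapshot: dict[Any, dict[str, Any]] = {}
    for event in log:  # log order stands in for event time
        snapshot[event["key"]] = event["value"]
    return snapshot

ingest({"key": "sensor-1", "value": {"temp": 20}})
ingest({"key": "sensor-2", "value": {"temp": 18}})
ingest({"key": "sensor-1", "value": {"temp": 21}})  # later reading wins

snapshot = compact(landing_log)
print(snapshot["sensor-1"])  # {'temp': 21}
```

The contrast with CDC is that the landing side here is a continuous stream of events rather than discrete mini-batches; the storage step still does the work of turning raw landed data into queryable datasets.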
Using onboarded data
When you have onboarded the data, you can use the stored datasets in several ways, including:
- You can use the datasets in an analytics application.
- You can mirror data to one or more cloud data warehouses, including Amazon Redshift and Snowflake, by adding a Mirror data task directly to the Storage data task for CDC sources, or to the Storage Transform data task for streaming sources.

  For more information, see Mirroring data to a cloud data warehouse.
- You can transform data in your cloud data warehouse by creating a cross-project pipeline that consumes data from your onboarding project.
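The mirroring option above is essentially a fan-out from the stored dataset to every configured target. A minimal sketch, with hypothetical target names standing in for warehouses such as Amazon Redshift or Snowflake:

```python
# Illustrative fan-out: copy the stored dataset to each configured mirror
# target. Target names are placeholders, not real connections.

stored_dataset = {1: {"name": "Ada"}, 2: {"name": "Grace"}}

def mirror(source: dict, targets: dict[str, dict]) -> None:
    """Replace each target's contents with the current stored dataset."""
    for table in targets.values():
        table.clear()
        table.update(source)

mirrors = {"redshift": {}, "snowflake": {}}
mirror(stored_dataset, mirrors)

print(mirrors["snowflake"] == stored_dataset)  # True
```

Because the Mirror data task attaches directly to the storage task, each target receives the read-optimized datasets rather than the raw landed data.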