Creating a data pipeline
You can create a data pipeline to perform all your data integration within a data project using data tasks. Onboarding moves data into the project from data sources that are on-premises or in the cloud, and stores the data in ready-to-consume datasets. You can also perform transformations and create data marts to leverage your generated and transformed datasets. The data pipeline can be simple and linear, or it can be a complex pipeline that consumes several data sources and generates many outputs.
All data tasks will be created in the same space as the data project that they belong to.
You can also view lineage to track data and data transformations backwards to the original source, and perform impact analysis which shows the forward-looking, downstream view of data task, dataset, or field dependencies. For more information, see Working with lineage and impact analysis in Data Integration.
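The difference between lineage (the backward view) and impact analysis (the forward view) can be illustrated as two traversals of the same dependency graph. The following is a minimal sketch, not the product's implementation: the graph, the node names, and the functions `lineage` and `impact` are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical pipeline: each key feeds the nodes listed in its value.
# The names are illustrative and do not come from a real data project.
DOWNSTREAM = {
    "source_db": ["landing"],
    "landing": ["storage"],
    "storage": ["transform", "data_mart"],
    "transform": ["data_mart"],
    "data_mart": [],
}


def invert(graph):
    """Build the reverse graph: node -> nodes that feed it directly."""
    reverse = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            reverse[target].append(node)
    return reverse


def reachable(graph, start):
    """Return every node reachable from start by following graph edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def lineage(node):
    """Backward view: everything the node is derived from."""
    return reachable(invert(DOWNSTREAM), node)


def impact(node):
    """Forward view: everything that depends on the node."""
    return reachable(DOWNSTREAM, node)


print(sorted(lineage("transform")))  # ['landing', 'source_db', 'storage']
print(sorted(impact("storage")))     # ['data_mart', 'transform']
```

In this sketch both views use the same breadth-first search; only the direction of the edges changes.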
Onboarding includes landing the data in a staging area, and then storing the datasets in a cloud data warehouse. Landing and Storage data tasks are created in a single step. If needed, you can also perform landing and storage with separate tasks.
Register data that already exists on the data platform to curate and transform data, and create data marts. This lets you use data that is onboarded with tools other than Qlik Cloud Data Integration, for example, Qlik Replicate or Stitch.
Create reusable row-level transformations on the onboarded data based on rules and custom SQL. This creates a Transform data task.
Target data platforms
The data project is associated with a data platform that is used as the target for all output.
For more information about supported data platforms, see Connecting to target platforms.
Example of creating a data project
The following example onboards data, transforms the data, and creates a data mart. This creates a simple linear data pipeline that you could expand by onboarding more data sources, creating more transformations, and adding the generated data tasks to the data mart.
Create a new data project.
Click Add new and then Create data project in the Qlik Cloud Data Integration Home.
Enter a name and a description for the data project, and select a space to create the data project in. All data tasks will be created in the space of the data project that they belong to.
Select Data pipeline in Use case.
Select which data platform to use in the project.
Select a data connection to the cloud data warehouse that you want to use in the project. This will be used to land data files and store datasets and views. If you have not prepared a data connection already, create one with Add connection.
If you selected Google BigQuery, Databricks, or Microsoft Azure Synapse Analytics as the data platform, you also need to connect to a staging area.
If you selected Qlik Cloud as the data platform:
You can store data either in Qlik managed storage or in an Amazon S3 bucket that you manage yourself. If you want to use your own Amazon S3 bucket, you need to select a data connection to that bucket.
In both cases, you also need to select a data connection to an Amazon S3 staging area. If you use the same bucket that you defined in the previous step, make sure that you use another folder in the bucket for staging.
The data project is created, and you can create your data pipeline by adding data tasks.
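The requirement above, that staging must use another folder when it shares a bucket with storage, can be checked before you configure the connections. The following is a minimal sketch under stated assumptions: the `S3Location` class and the function name are hypothetical, and the check only compares the folder paths for equality (it does not consider folders nested inside each other).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Location:
    """Hypothetical description of an S3 location: bucket plus folder."""
    bucket: str
    folder: str


def _normalize(folder):
    # Treat "staging", "/staging" and "staging/" as the same folder.
    return folder.strip("/")


def staging_conflicts_with_storage(storage, staging):
    """Return True if staging and storage use the same bucket and folder."""
    if storage.bucket != staging.bucket:
        return False
    return _normalize(storage.folder) == _normalize(staging.folder)


storage = S3Location(bucket="my-bucket", folder="datasets")
staging = S3Location(bucket="my-bucket", folder="staging/")
print(staging_conflicts_with_storage(storage, staging))  # False
```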
Onboard the data
Click Add new and then Onboard data.
For more information, see Onboarding data.
This will create a landing data task and a storage data task. To start replicating data you need to:
Transforming the data
When the storage data task is created, go back to the data project. You can now perform transformations on the created datasets.
Click ... on the storage data task and select Transform data to create a transformation data task based on this storage data task. For instructions about transformations, see Transforming data.
Creating a data mart
You can create a data mart based on a storage data task or a transformation data task.
Click ... on the data task and select Create data mart to create a data mart data task. For instructions about creating a data mart, see:
When you have performed the first full load of the stored and transformed datasets and data marts, you can use them, for example, in an analytics app. For more information about creating analytics apps, see Creating an analytics app using datasets generated by Qlik Cloud Data Integration.
You could also expand the data pipeline by onboarding more data sources, and combine them in the transformation, or in the data mart.
Operations in a data project
You can perform the same operations that are available for a data task as data project operations. This allows you to orchestrate the operations in the data pipeline.
Turn schedules on and off
Perform design operations
Start and stop execution of data tasks
Delete data tasks
Click Operations to view the status of an operation in progress, or the latest performed operation.
You can stop an operation in progress by clicking Stop operation. Data tasks that are already in progress are not stopped, but any task that has not started yet is canceled.
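The effect of stopping an operation can be summarized as a small state change. The following is a minimal sketch, assuming hypothetical state names ("pending", "running", "completed", "canceled") that are not taken from the product.

```python
def stop_operation(task_states):
    """Return the task states after an operation is stopped.

    Tasks that have not started ("pending") are canceled.
    Tasks in any other state, including "running", are left untouched.
    """
    return {
        name: ("canceled" if state == "pending" else state)
        for name, state in task_states.items()
    }


before = {"landing": "completed", "storage": "running", "transform": "pending"}
print(stop_operation(before))
# {'landing': 'completed', 'storage': 'running', 'transform': 'canceled'}
```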
Turning schedules on and off
You can control the schedules for data tasks at the project level.
Click ..., and then Schedule.
You can turn the schedule on or off for all data tasks, or a selection of tasks. Only tasks with a schedule defined are displayed.
Information note: This option is not available for data projects with Qlik Cloud as the data platform.
For more information about scheduling individual data tasks, see:
Performing design operations
You can perform design operations on all data tasks in the data project, or on a selection of tasks. This makes it easier to control the data tasks in the data project, instead of performing the design operations individually in each task.
Click Validate to validate all tasks, or a selection of tasks. Data tasks that were changed since the last validate operation are preselected.
The data tasks are validated in pipeline order.
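One way to picture pipeline order is as a topological ordering of the task dependencies, where each task comes after the tasks it depends on. The following is a minimal sketch of that interpretation, which is an assumption and not a description of how the product orders tasks internally. The task names and the `pipeline_order` function are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks mapped to the tasks they depend on (their upstream tasks).
DEPENDS_ON = {
    "landing": set(),
    "storage": {"landing"},
    "transform": {"storage"},
    "data_mart": {"storage", "transform"},
}


def pipeline_order(depends_on, selected=None):
    """Return tasks in upstream-first order, optionally limited to a selection."""
    order = list(TopologicalSorter(depends_on).static_order())
    if selected is None:
        return order
    return [task for task in order if task in selected]


print(pipeline_order(DEPENDS_ON))
# ['landing', 'storage', 'transform', 'data_mart']
print(pipeline_order(DEPENDS_ON, selected={"data_mart", "storage"}))
# ['storage', 'data_mart']
```

`graphlib` is part of the Python standard library from version 3.9.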
Click Prepare to prepare all tasks, or a selection of tasks. Data tasks that were changed since the last prepare operation are preselected.
You can choose to recreate datasets that require a structure change that the data platform does not support. This can lead to data loss.
Click ..., and then Recreate to recreate the datasets from source for all tasks, or for a selection of tasks.
Running data tasks
You can initiate the execution of all data tasks in the data project, or of a selection of tasks, instead of running tasks individually. For example, you can run all tasks with a time-based schedule. This will initiate downstream tasks with an event-based schedule.
Click Run to initiate the execution of all tasks, or a selection of tasks. This initiates the run of all selected tasks, and completes as soon as they start executing.
You can select from all tasks that are ready to run. Tasks with a time-based schedule and tasks that use CDC are preselected. Tasks with an event-based schedule are not preselected, as they will be executed when they have data to process.
In a project with Qlik Cloud as the data platform, all landing and storage tasks are preselected.
Information note: All data tasks are executed in parallel. This means that dependency checks may prevent some tasks from running.
Click Stop to stop all tasks, or a selection of tasks.
You can select from tasks that are running.
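The preselection rules for Run described above can be written as a short function. The following is a minimal sketch: the `Task` class, its fields, and the `preselected` function are hypothetical. It also assumes that in a project with Qlik Cloud as the data platform, the landing and storage rule is the only rule applied, which the text above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a data task that is ready to run."""
    name: str
    kind: str                 # for example "landing", "storage", "transform"
    schedule: str = "none"    # "time", "event", or "none"
    uses_cdc: bool = False


def preselected(tasks, platform_is_qlik_cloud=False):
    """Return the names of the tasks that would be preselected for Run."""
    names = []
    for task in tasks:
        if platform_is_qlik_cloud:
            # Assumption: only landing and storage tasks are preselected here.
            if task.kind in ("landing", "storage"):
                names.append(task.name)
        elif task.schedule == "time" or task.uses_cdc:
            # Time-based schedules and CDC tasks are preselected.
            # Event-based tasks run when they have data to process.
            names.append(task.name)
    return names


tasks = [
    Task("land_orders", "landing", uses_cdc=True),
    Task("store_orders", "storage", schedule="event"),
    Task("nightly_transform", "transform", schedule="time"),
]
print(preselected(tasks))
# ['land_orders', 'nightly_transform']
print(preselected(tasks, platform_is_qlik_cloud=True))
# ['land_orders', 'store_orders']
```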
Deleting data tasks
Click Delete to delete all data tasks in the data project, or a selection of tasks.
Changing the view of a data project
There are two views of a data project. You can switch between the views by clicking Pipeline view.
The pipeline view shows the data flow of the data tasks.
You can choose how much information to show for the data tasks by clicking Layers. Toggle on or off the following information:
The card view shows a card with information about each data task.
You can filter on asset type and owner.
Exporting and importing data projects
You can export a data project to a JSON file that contains everything required to reconstruct the data project. The exported JSON file can be imported on the same tenant, or on another tenant. You can use this, for example, to move data projects from one tenant to another, or to make backup copies of data projects.
For more information, see Exporting and importing data pipelines.
Data project settings
You can set properties that are common to the project and all included data tasks.
For more information, see Data project settings.