Creating a data pipeline
You can create a data pipeline to perform all your data integration within a project using data tasks. Onboarding moves data into the project from data sources that are on-premises or in the cloud and stores the data in ready-to-consume datasets. You can also perform transformations and create data marts to leverage your generated and transformed datasets. The data pipeline can be simple and linear, or it can be a complex pipeline that consumes several data sources and generates many outputs.
All data tasks will be created in the same space as the project that they belong to.
You can also view lineage to track data and data transformations backwards to the original source, and perform impact analysis which shows the forward-looking, downstream view of data task, dataset, or field dependencies. For more information, see Working with lineage and impact analysis in Data Integration.
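As a rough illustration of the difference between the two views, the following Python sketch treats a pipeline as a directed graph. Lineage walks the graph backwards to the sources, and impact analysis walks it forwards to the dependents. The node names and the `EDGES` list are hypothetical and are not taken from the product.

```python
from collections import defaultdict

# Hypothetical pipeline: each edge points from an upstream item to a downstream one.
EDGES = [
    ("source.orders", "landing.orders"),
    ("landing.orders", "storage.orders"),
    ("storage.orders", "transform.orders_clean"),
    ("transform.orders_clean", "datamart.sales"),
    ("storage.customers", "datamart.sales"),
]


def _walk(start, neighbours):
    """Return every node reachable from start, excluding start itself."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def lineage(node, edges=EDGES):
    """Upstream view: everything the node was derived from."""
    upstream = defaultdict(set)
    for src, dst in edges:
        upstream[dst].add(src)
    return _walk(node, upstream)


def impact(node, edges=EDGES):
    """Downstream view: everything that depends on the node."""
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
    return _walk(node, downstream)
```

In this sketch, `lineage("storage.orders")` returns the landing item and the original source, while `impact("storage.orders")` returns the transformation and the data mart that would be affected by a change.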
Onboarding data
This includes landing the data to a staging area, and then storing the datasets in a cloud data warehouse. Landing and Storage data tasks are created in a single step. If you need to, you can also perform landing and storage with separate tasks.
Registering data that is already on the data platform
Register data that already exists on the data platform to curate and transform data, and create data marts. This lets you use data that was onboarded with tools other than Qlik Talend Data Integration, for example, Qlik Replicate or Stitch.
Transforming data
Create reusable row-level transformations on the onboarded data based on rules and custom SQL. This creates a Transform data task.
Creating and managing data marts
Create a data mart to leverage your datasets. This creates a Data mart data task.
Target data platforms
The project is associated with a data platform that is used as the target for all output.
For more information about supported data platforms, see Setting up connections to targets.
Example of creating a project
The following example onboards data, transforms the data, and creates a data mart. This creates a simple linear data pipeline that you could expand by onboarding more data sources, creating more transformations, and adding the generated data tasks to the data mart.
-
Create a new project.
In Data Integration > Projects, click Create project.
-
Enter a name and a description for the project, and select a space to create the project in. All data tasks will be created in the space of the project that they belong to.
Information note: If you later enable version control for the project, you will not be able to change the project name while it is under version control.
-
Select Data pipeline in Use case.
-
Select which data platform to use in the project.
-
Select a connection to the cloud data warehouse that you want to use in the project. This will be used to land data files and store datasets and views. If you have not prepared a connection already, create one with Add connection.
If you selected Google BigQuery, Databricks, or Microsoft Azure Synapse Analytics as data platform, you also need to connect to a staging area.
-
If you selected Qlik Cloud as data platform:
You can either store data in Qlik managed storage, or your own managed Amazon S3 bucket. If you want to use your own Amazon S3 bucket, you need to select a connection to that bucket.
In both cases, you also need to select a connection to an Amazon S3 staging area. If you use the same bucket that you defined in the previous step, make sure that you use another folder in the bucket for staging.
-
Click Create.
The project is created, and you can create your data pipeline by adding data tasks.
-
-
Onboard the data
In the project, click Create and then Onboard data.
For more information, see Onboarding data.
This will create a landing data task and a storage data task. To start replicating data you need to:
-
Prepare and run the landing data task.
For more information, see Landing data from data sources.
-
Prepare and run the storage data task.
For more information, see Storing datasets.
-
-
Transforming the data
When the storage data task is created, go back to the project. You can now perform transformations on the created datasets.
Click ... on the storage data task and select Transform data to create a transformation data task based on this storage data task. For instructions about transformations, see Transforming data.
-
Creating a data mart
You can create a data mart based on a storage data task or a transformation data task.
Click ... on the data task and select Create data mart to create a data mart data task. For instructions about creating a data mart, see:
When you have performed the first full load of the stored and transformed datasets and data marts, you can use them in an analytic app, for example. For more information about creating analytics apps, see Creating an analytics app using datasets generated by Qlik Talend Data Integration.
You could also expand the data pipeline by onboarding more data sources, and combine them in the transformation, or in the data mart.
Operations in a data pipeline project
You can perform the same operations that are available for a data task as project operations. This allows you to orchestrate the operations in the data pipeline.
-
Turn schedules on and off
-
Perform design operations
-
Start and stop execution of data tasks
-
Delete data tasks
Click Operations to view the status of an operation in progress, or the latest performed operation.
You can stop an operation in progress by clicking Stop operation. This cancels any data task that has not started yet, but data tasks that are already in progress are not stopped.
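The effect of stopping an operation can be sketched in Python as follows. This is only a model of the behavior described above, not the product's implementation. The `TaskRun` class and the state labels are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    name: str
    state: str  # "pending", "running", or "done" (labels invented for this sketch)


def stop_operation(runs):
    """Cancel tasks that have not started; leave running and finished tasks alone."""
    for run in runs:
        if run.state == "pending":
            run.state = "cancelled"
    return runs


runs = stop_operation([
    TaskRun("landing", "done"),
    TaskRun("storage", "running"),
    TaskRun("transform", "pending"),
])
states = {run.name: run.state for run in runs}
```

Here the storage task keeps running and only the transform task, which had not started, is cancelled.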
Turning schedules on and off
You can control the schedules for data tasks at the project level.
-
Click ..., and then Schedule.
You can turn the schedule on or off for all data tasks, or a selection of tasks. Only tasks with a schedule defined are displayed.
Information note: This option is not available for projects with Qlik Cloud as data platform.
For more information about scheduling individual data tasks, see:
Performing design operations
You can perform design operations on all data tasks in the project, or on a selection of tasks. This makes it easier to control the data tasks in the project than performing the design operations individually in each task.
-
Validate
Click Validate to validate all tasks, or a selection of tasks. Data tasks that were changed since the last validate operation are preselected.
The data tasks are validated in pipeline order.
-
Prepare
Click Prepare to prepare all tasks, or a selection of tasks. Data tasks that were changed since the last prepare operation are preselected.
You can select to recreate datasets that require a structure change not supported by the data platform. This can lead to data loss.
-
Recreate
Click ..., and then Recreate to recreate the datasets from source for all tasks, or for a selection of tasks.
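To illustrate what validating in pipeline order could look like, the following Python sketch orders tasks so that every upstream task comes before the tasks that read from it. It assumes that "pipeline order" means a topological order of the task dependencies, which the documentation does not state explicitly. The task names and the `DEPENDS_ON` mapping are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each task maps to the set of tasks it reads from.
DEPENDS_ON = {
    "landing": set(),
    "storage": {"landing"},
    "transform": {"storage"},
    "data_mart": {"transform"},
}


def pipeline_order(depends_on, selection=None):
    """Return tasks with upstream tasks first, optionally limited to a selection."""
    order = list(TopologicalSorter(depends_on).static_order())
    if selection is None:
        return order
    return [task for task in order if task in selection]
```

Calling `pipeline_order(DEPENDS_ON, {"data_mart", "storage"})` keeps only the selected tasks but still places the storage task before the data mart.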
Running data tasks
You can initiate the execution of all data tasks in the project, or of a selection of tasks, instead of running tasks individually. For example, you can run all tasks with a time-based schedule. This will initiate downstream tasks with an event-based schedule.
-
Run
Click Run to initiate the execution of all tasks, or a selection of tasks. This initiates the run of all selected tasks, and completes as soon as they start executing.
You can select from all tasks that are ready to run. Tasks with a time-based schedule and tasks that use CDC are preselected. Tasks with an event-based schedule are not preselected, as they will be executed when they have data to process.
In a project with Qlik Cloud as data platform, all landing and storage tasks are preselected.
Information note: All data tasks are executed in parallel. This means that dependency checks may prevent some tasks from running.
-
Stop
Click Stop to stop all tasks, or a selection of tasks.
You can select from tasks that are running.
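The preselection rules for Run can be summarized in a short Python sketch. The `Task` class, its fields, and the function name are hypothetical. The sketch also assumes that the Qlik Cloud rule applies in addition to the schedule and CDC rules, which is one possible reading of the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    kind: str                       # "landing", "storage", "transform", or "data_mart"
    schedule: Optional[str] = None  # "time", "event", or None (labels are assumptions)
    uses_cdc: bool = False
    ready: bool = True


def preselected_for_run(tasks, platform_is_qlik_cloud=False):
    """Return the names of the tasks that would be preselected in the Run dialog."""
    selected = []
    for task in tasks:
        if not task.ready:
            continue  # only tasks that are ready to run can be selected at all
        by_schedule = task.schedule == "time" or task.uses_cdc
        by_platform = platform_is_qlik_cloud and task.kind in ("landing", "storage")
        if by_schedule or by_platform:
            selected.append(task.name)
    return selected


example_tasks = [
    Task("landing", "landing", uses_cdc=True),
    Task("storage", "storage", schedule="time"),
    Task("transform", "transform", schedule="event"),
    Task("data_mart", "data_mart", schedule="event", ready=False),
]
```

With `example_tasks`, only the landing and storage tasks are preselected. The transform task is left out because its event-based schedule starts it when upstream data arrives.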
Deleting data tasks
-
Click Delete to delete all data tasks in the project, or a selection of tasks.
Changing the view of a project
There are two different views of a project. You can switch between the views by clicking Pipeline view.
-
The pipeline view shows the data flow of the data tasks.
You can choose how much information to show for the data tasks by clicking Layers. Toggle on or off the following information:
-
Status
-
Data freshness
-
Schedule
-
-
The card view shows a card with information about each data task.
You can filter on asset type and owner.
Viewing data
You can view a sample of the data to see and validate the shape of your data as you are designing your data pipeline.
The following permissions are required:
-
Viewing data is enabled at the tenant level in Administration.
Enable Settings > Feature control > Viewing data in Data Integration.
-
You are assigned the Can view data role in the space where the connection resides.
-
You are assigned the Can view role in the space where the project resides.
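The three requirements above must all be met at the same time. As a minimal sketch, the check can be written as a single Python function. The function and parameter names are hypothetical, and the sketch compares role names literally, so it does not model any broader role that might also grant these permissions.

```python
def can_view_sample_data(tenant_viewing_enabled, connection_space_roles, project_space_roles):
    """Return True only when all three documented requirements are satisfied."""
    return bool(
        tenant_viewing_enabled
        and "Can view data" in connection_space_roles
        and "Can view" in project_space_roles
    )
```

For example, a user with Can view in the project space but without Can view data in the connection space would not be able to view sample data.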
To view sample data in the data pipeline view:
-
Click in the preview banner at the bottom of the pipeline view.
-
Select which data task to preview data for.
A sample of the data is displayed. You can set how many data rows to include in the sample with Number of rows.
Exporting and importing projects
You can export a project to a JSON file that contains everything required to reconstruct the project. The exported JSON file can be imported on the same tenant, or on another tenant. You can use this, for example, to move projects from one tenant to another, or to make backup copies of projects.
For more information, see Exporting and importing data pipelines.
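If you keep exported files as backup copies, a small script can help you name and store them consistently. The following Python sketch copies an exported file to a backup folder under a timestamped name after checking that it parses as JSON. The file names, folder layout, and function name are assumptions, and the sketch does not inspect the contents of the export, because its schema is not described here.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_export(export_path, backup_dir):
    """Copy an exported project file to backup_dir under a timestamped name."""
    export_path = Path(export_path)
    # Fail early if the file is not valid JSON; the schema itself is not checked.
    with export_path.open(encoding="utf-8") as handle:
        json.load(handle)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{export_path.stem}-{stamp}{export_path.suffix}"
    shutil.copy2(export_path, target)
    return target
```

The function returns the path of the copy, so that you can log it or pass it on to another step.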
Project settings
You can set properties that are common to the project and all included data tasks.
-
Click Settings.
For more information, see Data pipeline project settings.