Data movement
Qlik Data Movement helps customers onboard their data rapidly and securely from their on-premise and cloud-based data sources to cloud data warehouses and data lakes. An end-to-end solution for data movement, this service securely accesses data sources, automatically replicates data in real time to cloud targets, and catalogs data sets without manual scripting.
The data movement process in Qlik Talend Data Integration is managed from the Qlik Cloud hub. It initializes and monitors the process of capturing data from enterprise and cloud application data sources.
Data from on-premise systems, or from systems running in a customer's cloud, does not pass through, and is not stored in, Qlik Cloud unless Qlik Cloud is the chosen destination for the data. SaaS application source data is captured by Qlik Cloud and stored transiently while data flows from source to target via Qlik Data Gateway - Data Movement. In the case of cloud data sources, data can be pulled directly from the source without the need for a gateway.
A note on Qlik Data Gateway - Direct Access
This paper will not detail the functionality of Qlik Data Gateway - Direct Access. This gateway has a different purpose and solves different use cases than Qlik Data Gateway - Data Movement. The Direct Access gateway is considered a Qlik Cloud Analytics component: users can connect directly to on-premise data sources from an app in Qlik Cloud Analytics and load data from there. It is not, strictly speaking, a data integration tool, which means it will not be covered here.
Qlik Data Gateway - Data Movement
A challenge for many customers when moving to SaaS is providing access to their on-premise and private cloud data sources without compromising security. Qlik's solution to this is Qlik Data Gateway - Data Movement. This allows customers to access data sources in their data center and private cloud, without exposing them to the public internet.
Qlik Data Gateway - Data Movement is a component controlled from Qlik Cloud, but physically located near your data. It initiates connections to your source and target systems, orchestrating both full loads and change data capture (CDC). For simplicity we will refer to this simply as the Data Movement Gateway going forward.
Source data is onboarded directly into, and persisted to, the target cloud platform by the Data Movement Gateway, removing the need to expose data sources to the internet.
When started, the Data Movement Gateway makes an outbound connection to Qlik Cloud, which then initiates a reverse tunnel back to the gateway for command and control.
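The connectivity pattern described above can be sketched conceptually. The following is a minimal illustration of the outbound "reverse tunnel" idea in plain Python sockets, not Qlik's actual protocol (the command names and message framing here are invented for illustration): the gateway only ever dials out, and the cloud side then issues commands back over that same gateway-initiated connection, so no inbound firewall rule is needed at the customer site.

```python
# Conceptual sketch of the reverse-tunnel pattern (illustrative only,
# not Qlik's actual protocol or commands).
import socket
import threading

replies = []

def cloud_side(server_sock):
    """Stand-in for Qlik Cloud: accept the gateway's outbound connection,
    then send commands back down that same connection."""
    conn, _ = server_sock.accept()
    with conn:
        for command in (b"STATUS\n", b"START_TASK\n"):
            conn.sendall(command)            # command travels "backwards"
            reply = conn.recv(1024)          # over the gateway-initiated link
            replies.append(reply.strip().decode())

server = socket.socket()
server.bind(("127.0.0.1", 0))                # ephemeral port for the demo
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=cloud_side, args=(server,))
t.start()

# The gateway side only ever makes an OUTBOUND connection:
gateway = socket.create_connection(("127.0.0.1", port))
with gateway:
    for _ in range(2):
        command = gateway.recv(1024)         # receive a command over the tunnel
        gateway.sendall(b"OK " + command)    # acknowledge it

t.join()
print(replies)   # ['OK STATUS', 'OK START_TASK']
```

The key point the sketch demonstrates is directional: because the TCP connection is established from the gateway outward, the cloud can control the gateway without the data source network ever accepting inbound connections.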
From on-premise to cloud data warehousing
Delivering data from on-premise data sources to cloud data warehouses is achieved with the Data Movement Gateway:
Source — The data source types available in the Data Movement Gateway determine which sources data can be delivered from. The sources are divided into two types:
Database sources: At the time of writing (April 2024), over 20 source database platforms were supported. See Connecting to databases in the help for an up-to-date list of supported databases.
SaaS sources: At the time of writing, 60 SaaS applications were supported. See Supported SaaS applications in the help.
Target — Many target data warehouse platforms are supported by Qlik Talend Data Integration (see Connecting to cloud data platforms in your data projects in the help). Currently, those targets are:
Snowflake®
Azure Synapse Analytics®
Databricks®
Google BigQuery®
Amazon Redshift®
Microsoft SQL Server®
Microsoft Fabric®
Both staging and storage will happen in the target system when data is delivered via the pipeline. It is possible to use targets in a private cloud; these connections will be proxied via the Data Movement Gateway.
Delivering your data to Qlik Cloud
You can deliver data from on-premise and cloud data sources directly to Qlik Cloud and store as QVD files (Qlik's proprietary file format, designed for fast loading into memory) with the Data Movement Gateway.
Source — The data source types available in the Data Movement Gateway determine which sources data can be delivered from. We regularly add new sources. See Data sources in the help for details on the latest available sources.
Target — There are two options for target storage of these files, Qlik-managed storage and Customer-managed storage:
The Qlik-managed storage option requires customers to bring their own Amazon S3 bucket for the staging area. This storage is configured, maintained, and financed by the customer. Qlik will however provide storage for the storage area once the files have passed staging and are stored at rest. This is recommended if your goal is to make the data available for Qlik Cloud Analytics.
The Customer-managed storage option means the customer brings their own Amazon S3 bucket for both the staging area and the storage area, which means configuring, maintaining, and financing them. This is recommended if you need to make the data available to sources in addition to Qlik Cloud Analytics.
From cloud sources to cloud data warehouses or Qlik Cloud
Delivering data from cloud sources and storing it directly in cloud data warehouses is also possible with Qlik Talend Data Integration. This still requires the Data Movement Gateway for some scenarios. This allows us to support public and private cloud data warehouses and data lakes as source and target.
SaaS Data Loader
For many SaaS platforms and some cloud database platforms, we are able to move data without a customer-deployed gateway. This SaaS data loading capability uses cloud-managed infrastructure and supports:
- Full and incremental loads
- No gateway or driver management required
- All third-party SaaS targets
- Periodic Change Data Capture (frequency varies by edition, increasing in higher tiers)
- Data Movement only (no transformation or pipeline tasks)
Use cases not covered in the list above require a data gateway to be deployed. The SaaS data loader capability is the only method supported in the Starter edition of Qlik Talend Data Integration. In other editions, it is supported in addition to the Data Gateway - Data Movement.
Data Architecture patterns
A cloud data warehouse built by Qlik Talend Data Integration creates artifacts to support a number of key data warehousing patterns. These include:
Landing: Landing contains an up-to-date copy of the raw data "landed" from your sources by the gateway. This can be kept up to date using either CDC or full loads. The Landing zone is designed for internal use, not consumption, and Qlik recommends against using landing tables for any downstream tasks.
Storage: Storage contains tables and external views based on the data in landing. When consuming data, the best practice is to use views which provide improved data concurrency.
Views: Live views include data from change tables that are not yet applied to the current or prior tables. This lets you see data with lower latency and reduces the costs of processing requirements in the target platform. You can access both current data (ODS) and historical data (HDS) using live views. Depending on your project settings, the following views will be available:
Current view
Live view
Changes view
History view
History live view
Tables: The following tables are created:
Current table (ODS): This table contains the replica of the data source updated with changes during the latest apply interval.
Prior table (HDS): This table contains type 2 historical data. It is only generated if History is enabled in the data task settings.
Changes table: This table contains all changes that are not yet applied to the current table. It is only generated if the landing mode Full load and CDC is used.
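The relationship between the Current table, the Changes table, and a live view can be sketched in plain Python. This is a conceptual illustration only, not Qlik's implementation; the dict keys stand in for primary keys and the values for row contents.

```python
# Conceptual sketch (not Qlik's implementation) of how the Current table,
# Changes table, and live view relate.
current = {1: "Alice", 2: "Bob"}       # Current table (ODS): as of last apply
changes = {2: "Robert", 3: "Carol"}    # Changes table: captured, not yet applied

def live_view():
    """Live view: the Current table overlaid with unapplied changes, at read time."""
    return {**current, **changes}

print(live_view())   # {1: 'Alice', 2: 'Robert', 3: 'Carol'}

def apply_changes():
    """The periodic apply merges pending changes into the Current table."""
    current.update(changes)
    changes.clear()

apply_changes()
print(live_view() == current)   # True: nothing is pending after the merge
```

The sketch shows why live views offer lower latency: a change captured into the Changes table is visible through the live view immediately, before any merge into the Current table has run.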
The benefits of delayed merge
In a cloud data warehouse architecture, data is not available for analysis until it has been physically loaded into the data warehouse or data mart. This is often an expensive (in resources, if not cost) and time-consuming operation, and it can prevent meeting the latency end users require. Cloud data warehouses typically have multiple charge metrics, including compute uptime, data scanning, and processing. This often forces a trade-off between latency, cost, and business need, and can make running these loads regularly throughout the day expensive.
Qlik Talend Data Integration, however, provides an alternative solution to this challenge. The combination of delayed merge with the live views created by Qlik Talend Data Integration supports the best of both worlds. Where a use case demands real-time access, a live view provides it by joining the change tables with the previously loaded tables; batch use cases can rely on the current views, which are highly performant. Delayed merge lets users compact data changes at wider intervals while still providing an up-to-date copy along with a full type 2 history.
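A back-of-the-envelope illustration of the trade-off, using assumed numbers rather than Qlik internals: widening the apply interval sharply reduces the number of merge operations the target platform must run per day, while captured changes remain queryable between merges.

```python
# Illustrative arithmetic (assumed intervals, not Qlik internals) for the
# delayed-merge trade-off: fewer merges per day at a wider apply interval.
MINUTES_PER_DAY = 24 * 60

def merges_per_day(apply_interval_minutes):
    """Number of merge (apply) operations the target platform runs per day."""
    return MINUTES_PER_DAY // apply_interval_minutes

frequent = merges_per_day(5)      # low-latency apply every 5 minutes
delayed = merges_per_day(240)     # delayed merge every 4 hours

print(frequent, delayed)   # 288 6
```

With compute-uptime and processing charges accruing per merge, the 48x reduction in merge operations in this example is where the cost saving comes from; the live views are what keep latency acceptable despite the wider interval.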
Data projects
The first step to create a data pipeline in Qlik Talend Data Integration is to create a project. A project defines the use-case, source, and onboarding of the data. This is sufficient for a complete pipeline for some use cases. Qlik Talend Data Integration supports two use cases, Replication and Data Pipeline.
Replication
A data replication task uses change data capture (CDC) to efficiently and securely move data between a source and target system. Replication can be to an RDBMS, Data Lake, or Qlik Cloud target.
Data Pipeline
Data Pipeline projects allow you to modify and enrich the data, join disparate data sets, and transform the data into the formats you need, such as star schemas or data vaults. Data pipelines can be created with our visual no-code designer, with custom SQL tasks, or created for you by generative AI.
Onboarding Data
The initial focus of a data project is onboarding the data. This involves transferring the data continuously from the on-premise or cloud data source and generating datasets in a read-optimized format. Onboarding involves two steps: landing and storing.
Landing – Transferring the data continuously from the data source to a landing area, using a Landing data task.
Storing – Generating datasets based on the landed data, using a Storage data task.
Key Concepts in a data integration project
A data integration project is how we build, run, and monitor data pipelines.
Concept | Relationship to project | Description |
---|---|---|
Data tasks | Component | Data tasks are a fit-for-purpose collection of tables or files and an associated operation on those files. A data task is the main unit of work within a data project. Examples of data tasks include transform and data mart. |
Data spaces | Dependency | Data spaces are governed areas of your Qlik Cloud tenant that are used to manage projects and their data assets. Access to a data space is determined by membership to the space. Access to projects and their data assets inside a data space is determined by roles assigned to members of the space. This means that a user must first be a member of the data space, and second, have the required roles to create, manage, or monitor data assets and resources in a data space. Members with the roles to consume data assets can also use data assets from a data space when building apps in personal, shared, and managed spaces. |
Data Gateway | Dependency | Data Gateway is used by the landing data asset for associating a replication task with it, as well as for control and basic monitoring of this task. |
Data connection | Dependency | A data connection is used by the storage data asset for connecting to AWS S3 buckets or cloud data warehouses, for the purpose of either reading from the staging area or writing to the customer-managed storage area. |
Registered data | Component | Registered data is similar to a data task; however, it does not perform any actions against the data directly. It is designed to expose data landed outside of Qlik Cloud to the data project, so it can then be used in the data pipeline. |