Dataset architecture in a cloud data warehouse
Qlik Talend Data Integration lets you create, manage, and control an analytics data pipeline, delivering data all the way to consumers. When you generate datasets in a cloud data warehouse with Qlik Talend Data Integration, storage tables, change tables and views are generated.
An intuitive and guided user interface helps you build, model and execute data pipelines. Automatically generate schemas for operational data stores (ODS) and historical data stores (HDS) without manual coding.
- Landing

  The landing task in Qlik Talend Data Integration controls the continuous landing of data from the data sources to the landing zone. The example in the diagram describes using Qlik Data Gateway - Data Movement to access data sources using CDC to keep the data up-to-date. You can also use Qlik Cloud source connections to perform full loads that can be scheduled to reload periodically.

  For information about when you need to use Data Movement gateway, see When is Data Movement gateway required?

  Information note: Tables in the landing zone are generated for internal use by Qlik Talend Data Integration Storage data tasks. Do not use the landing tables in any other downstream processes.
- Storage

  The storage task controls when data is applied to the storage tables, and creates and manages the tables and external views.
- External views

  When consuming data, the best practice is to use views. Views provide several benefits over tables, including improved data concurrency.
Using live views
You can access both current data (ODS) and historical data (HDS) using live views. Live views include data from change tables that is not yet applied to the current or prior tables. This lets you see data with lower latency without having to apply changed data frequently. Delaying the merge reduces costs and processing requirements in the target platform.
Another benefit of live views is that the compute tier does not need to run all the time:

- Landing can point to an explicitly assigned warehouse that can be small, as it only performs INSERT operations for fast execution.
- The storage process, running once a day for example, can wake up a large compute layer for processing.
- Latency is improved, as changes no longer need to be applied throughout the day. As soon as newly inserted records are available in the changes table, they appear in the live views, as the sketch after this list illustrates.
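The following is a minimal Python sketch of the merge concept behind live views, under simplifying assumptions (a single table, changes applied in arrival order, hypothetical column values). It illustrates how a live view can combine the current table with not-yet-applied changes; it is not Qlik's actual implementation:

```python
# Conceptual sketch only: a live "view" over a current table plus
# not-yet-applied changes. All data and operation codes are hypothetical.
current = {
    1: {"id": 1, "status": "shipped"},
    2: {"id": 2, "status": "pending"},
}

# Pending changes in arrival order: (key, operation, record).
changes = [
    (2, "U", {"id": 2, "status": "shipped"}),  # update
    (3, "I", {"id": 3, "status": "pending"}),  # insert
    (1, "D", None),                            # delete
]

def live_view(current, changes):
    merged = dict(current)
    for key, op, record in changes:  # overlay pending changes
        if op == "D":
            merged.pop(key, None)
        else:
            merged[key] = record
    return merged

print(live_view(current, changes))
# {2: {'id': 2, 'status': 'shipped'}, 3: {'id': 3, 'status': 'pending'}}
```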
Schemas
Artifacts are generated in an internal schema and a data task schema.
- The internal schema contains the physical data tables.
- The data task schema contains the views that you can use to consume the data.
When a schema is associated with more than one data task, each data task must use a unique prefix for tables and views. You can set the prefix in the data task settings.
Only internal schemas are checked for naming conflicts. For other schemas, you must make sure there are no naming conflicts in the table names. The best practice is to give the internal schema the same name as the data task schema, with _internal appended. This effectively makes sure that every schema and prefix combination is unique.
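A minimal Python sketch of this naming convention, using a hypothetical data task schema named orders and a hypothetical prefix ods_ (both names are illustrative, not defaults):

```python
# Illustrative sketch of the recommended naming convention.
# The schema name "orders" and prefix "ods_" are hypothetical examples.
def internal_schema_name(data_task_schema: str) -> str:
    """Best practice: internal schema = data task schema + "_internal"."""
    return f"{data_task_schema}_internal"

def prefixed_table_name(prefix: str, table_name: str) -> str:
    """Each data task sharing a schema needs a unique prefix."""
    return f"{prefix}{table_name}"

print(internal_schema_name("orders"))            # orders_internal
print(prefixed_table_name("ods_", "customers"))  # ods_customers
```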
Tables
The following tables are created in the internal schema.
- Current table (ODS)

  This table contains the replica of the data source, updated with changes during the latest apply interval.

- Prior table (HDS)

  This table contains type 2 historical data. It is only generated if History is enabled in the data task settings.

  Each time a source table record is updated, a new record is added to the prior table. The history record is a copy of the previous current record, and also includes what was updated and when it was valid.

  Use a history view or a history live view to view historical data. For more information, see History view and History live view.

- Changes table

  This table contains all changes that are not yet applied to the current table. It is only generated if the landing mode Full load and CDC is used.
Views
The following views are created in the data task target schema. Which views are created depends on whether you have enabled live views and history, and whether you use change handling.
- Current view
- Live view
- Changes view
- History view
- History live view
Current view
Naming: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME>_current
The following header columns are added to the table structure.
Field | Type | Description |
---|---|---|
hdr__key_hash | varbinary(20) | Hash of all record primary keys. Hash format is SHA1. Columns are separated by a backspace character. This column is not generated in Data mart data tasks. |
hdr__key_id | int64 | Sequence incremented per record. This column is only generated in Data mart data tasks. |
hdr__from_timestamp | timestamp | Timestamp in UTC. |
hdr__operation | string(1) | Latest operation of this record. |
hdr__inserted_timestamp | timestamp | UTC timestamp of the first time the key was added. When using full load, the start time of the full load. |
hdr__modified_timestamp | timestamp | UTC timestamp of when the last update was applied. |
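As a concrete illustration of the hdr__key_hash description above, here is a minimal Python sketch: SHA1 over all primary key columns, joined by a backspace character. How non-string key values are serialized is an assumption made for illustration; the product's exact encoding may differ.

```python
import hashlib

def key_hash(*primary_key_values) -> bytes:
    """Illustrative sketch of hdr__key_hash: SHA1 over all primary key
    columns, separated by a backspace character (\\x08).

    The serialization of non-string values is an assumption here; the
    actual encoding used by the product may differ.
    """
    joined = "\x08".join(str(value) for value in primary_key_values)
    return hashlib.sha1(joined.encode("utf-8")).digest()  # 20 bytes

# A composite key (order_id, line_number) -- hypothetical example values.
print(key_hash("ORD-1001", 3).hex())
```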
Live view
A live view is created for each selected source table, merging the table with changes from the change table. This gives queries a live view of the data without having to wait for the next apply cycle. The changes merged from the change table view are not transaction consistent between tables.
Live views are only created if you enable Live views in the settings of the data task.
Naming: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME><Suffix for live views>
Field | Type | Description |
---|---|---|
hdr__key_hash | varbinary(20) | Hash of all record primary keys. Hash format is SHA1. Columns are separated by a backspace character. This column is not generated in Data mart data tasks. |
hdr__key_id | int64 | Sequence incremented per record. This column is only generated in Data mart data tasks. |
hdr__from_timestamp | timestamp | Timestamp in UTC. |
hdr__operation | string(1) | Latest operation of this record. |
hdr__inserted_timestamp | timestamp | UTC timestamp of the first time the key was added. When using full load, the start time of the full load. |
hdr__modified_timestamp | timestamp | UTC timestamp of when the last update was applied. |
hdr__store | varchar(10) | Indicates where the record resides. |
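To make the hdr__store column concrete, here is a minimal Python sketch that tags each record in a merged live result with where it came from. The label values "CURRENT" and "CHANGES" are illustrative placeholders, not documented product values:

```python
# Conceptual sketch: tag merged live-view records with their origin.
# The labels "CURRENT" and "CHANGES" are illustrative placeholders.
current = {1: {"id": 1, "status": "shipped"}}
pending = {2: {"id": 2, "status": "pending"}}  # from the change table

live = {}
for key, record in current.items():
    live[key] = {**record, "hdr__store": "CURRENT"}
for key, record in pending.items():
    live[key] = {**record, "hdr__store": "CHANGES"}

for row in live.values():
    print(row)
```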
Changes view
A changes view of the change table in the landing schema is created for each selected source table.
Naming: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME>_changes
The following header fields are added to the table structure.
Field | Type | Description |
---|---|---|
hdr__change_identifier | string(50) | The change identifier is a string with two parts. |
hdr__from_timestamp | timestamp | Timestamp in UTC. |
hdr__to_timestamp | timestamp | Timestamp in UTC. |
hdr__operation | string(1) | Latest operation of this record. |
hdr__timestamp | timestamp | Timestamp in UTC. |
hdr__key_hash | binary(20) | Hash of all record primary keys. This column is not generated in Data mart data tasks. |
hdr__key_id | int64 | Sequence incremented per record. This column is only generated in Data mart data tasks. |
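As an illustration of how a downstream process might consume the changes view incrementally, here is a minimal runnable Python sketch that filters records by a saved hdr__timestamp watermark. The record layout and the watermark handling are assumptions made for illustration, not documented product behavior:

```python
from datetime import datetime, timezone

# Hypothetical rows as they might be read from a changes view.
rows = [
    {"hdr__timestamp": datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc), "id": 1, "op": "I"},
    {"hdr__timestamp": datetime(2024, 5, 1, 11, 30, tzinfo=timezone.utc), "id": 1, "op": "U"},
    {"hdr__timestamp": datetime(2024, 5, 1, 12, 15, tzinfo=timezone.utc), "id": 2, "op": "I"},
]

# Watermark: the hdr__timestamp up to which changes were already consumed.
watermark = datetime(2024, 5, 1, 11, 0, tzinfo=timezone.utc)

new_changes = [r for r in rows if r["hdr__timestamp"] > watermark]
for r in new_changes:
    print(r["id"], r["op"], r["hdr__timestamp"].isoformat())

# Advance the watermark only after successful processing.
if new_changes:
    watermark = max(r["hdr__timestamp"] for r in new_changes)
```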
History view
A history view is generated in the data task schema for each selected source table if History is enabled in the data task settings. The following header fields are added.
Naming: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME><Suffix for history views>
Field | Type | Description |
---|---|---|
hdr__key_hash | binary(20) | Hash of all record primary keys. This column is not generated in Data mart data tasks. |
hdr__key_id | int64 | Sequence incremented per record. This column is only generated in Data mart data tasks. |
hdr__store | varchar(10) | Indicates where the record resides. |
hdr__operation | string(1) | Latest operation of this record. |
hdr__deleted | bit | Indicates whether the record is soft deleted, based on whether hdr__operation is D or d. |
hdr__was_current_from_timestamp | timestamp | UTC timestamp of the first time the record was current. |
hdr__was_current_to_timestamp | timestamp | UTC timestamp of the last time the record was current. |
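To illustrate how hdr__was_current_from_timestamp and hdr__was_current_to_timestamp support point-in-time queries over type 2 history, here is a minimal runnable Python sketch. The rows are hypothetical, and treating the "to" bound as exclusive is an assumption made for illustration:

```python
from datetime import datetime, timezone

# Hypothetical type 2 history rows for one key, as read from a history view.
history = [
    {"status": "pending",
     "hdr__was_current_from_timestamp": datetime(2024, 5, 1, tzinfo=timezone.utc),
     "hdr__was_current_to_timestamp": datetime(2024, 5, 3, tzinfo=timezone.utc)},
    {"status": "shipped",
     "hdr__was_current_from_timestamp": datetime(2024, 5, 3, tzinfo=timezone.utc),
     "hdr__was_current_to_timestamp": datetime(2024, 5, 7, tzinfo=timezone.utc)},
]

def as_of(rows, point_in_time):
    """Return the version that was current at the given time, if any."""
    for row in rows:
        if (row["hdr__was_current_from_timestamp"] <= point_in_time
                < row["hdr__was_current_to_timestamp"]):
            return row
    return None

print(as_of(history, datetime(2024, 5, 4, tzinfo=timezone.utc))["status"])  # shipped
```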
History live view
A history live view is generated in the data task schema for each selected source table, merging the table with changes from the change table. The following header fields are added.
Naming: <EXTERNAL_SCHEMA>.[<PREFIX>]<TABLE_NAME><Suffix for live history views>
Field | Type | Description |
---|---|---|
hdr__key_hash | binary(20) | Hash of all record primary keys. This column is not generated in Data mart data tasks. |
hdr__key_id | int64 | Sequence incremented per record. This column is only generated in Data mart data tasks. |
hdr__store | varchar(10) | Indicates where the record resides. |
hdr__operation | string(1) | Latest operation of this record. |
hdr__deleted | bit | Indicates whether the record is soft deleted, based on whether hdr__operation is D or d. |
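As an illustration of the hdr__deleted flag, here is a minimal Python sketch that derives it from hdr__operation, as described in the table above, and filters soft-deleted records out of a result set. The record layout is hypothetical:

```python
# Hypothetical rows as read from a history live view.
rows = [
    {"id": 1, "hdr__operation": "U"},
    {"id": 2, "hdr__operation": "D"},  # deleted (operation D)
    {"id": 3, "hdr__operation": "d"},  # deleted (operation d)
]

def is_deleted(row) -> bool:
    """hdr__deleted is set when hdr__operation is D or d."""
    return row["hdr__operation"] in ("D", "d")

active = [r for r in rows if not is_deleted(r)]
print([r["id"] for r in active])  # [1]
```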