Settings for cloud storage targets
You can change the default data lake landing settings according to your needs.
General
Update method
You can land data in two different modes. It is not possible to change the mode once the data lake landing task has been prepared.
- Change data capture (CDC): The data lake landing task starts with a full load (during which all of the selected tables are landed). The landed data is then kept up-to-date using CDC (Change Data Capture) technology.
Information note: CDC (Change Data Capture) of DDL operations is not supported. When working with Data Movement gateway, changes are captured from the source in near real-time. When working without Data Movement gateway, changes are captured according to the scheduler settings. For more information, see Scheduling tasks when working without Data Movement gateway.
- Reload: Performs a full load of the data from the selected source tables to the target platform and creates the target tables if necessary. The full load occurs automatically when the task is started, but can also be performed manually or scheduled to occur periodically as needed.
Folder to use
Select one of the following, according to which bucket folder you want the files to be written to:
- Default folder: The default folder format is <your-project-name>/<your-task-name>
- Root folder: The files will be written to the bucket directly.
- Folder: Enter the folder name. The folder will be created during the data lake landing task if it does not exist.
Information note: The folder name cannot include special characters (for example, @, #, !, and so on).
Change data partitioning
In a standard landing task, changes are landed to the target in no particular order. Change data partitioning enables processing of change data from many tables in a consistent fashion. You can define the duration of partitions as well as the partitioning base time, thereby ensuring overall consistency of the partitioned data (i.e. no partial transactions, no order headers without order rows, and so on).
Information about the partitions is recorded in the attrep_cdc_partitions Control Table in the target database. This information can be used to identify partitioned data that needs to be further processed.
The partitioning options are as follows:
- Partition every - Specify the length (in hours and minutes) of each partition.
Information note: It is recommended to specify a partition length of more than one hour. Although a partition length of less than one hour may improve latency, creating many partitions on the target may also impact target performance (especially in systems with large volumes of changes).
If you resume a task from BEFORE the time that the last partition was created, the data lake landing task will write to a partition that has already been closed.
- Partition base time - Partitions are created during a 24-hour period, which is calculated according to the specified "Partitioning base time" on the source database (in UTC). For example, a partition interval of 8 hours with a "Partitioning base time" of 02:00 will create the following partitions: 02:00-10:00, 10:00-18:00, and 18:00-02:00 - but not necessarily in that order. For instance, if a task started at 01:00, the timeframe of the first partition will be 18:00-02:00. Additionally, if a task started in the middle of a partition (for example, at 04:00), its change data will be inserted into the 02:00-10:00 partition (even though no changes were captured before 04:00). See the sketch below.
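To make the window calculation concrete, here is a minimal Python sketch (illustrative only; the helper name and values are not product code) that lists the partition boundaries produced by a given partition base time and interval:

```python
from datetime import datetime, timedelta

def partition_windows(base_time_utc, interval_hours):
    """Illustrative only: list the partition windows (UTC) produced by a
    given 'Partition base time' and 'Partition every' interval."""
    start = datetime.strptime(base_time_utc, "%H:%M")
    windows = []
    for _ in range(24 // interval_hours):
        end = start + timedelta(hours=interval_hours)
        windows.append(f"{start:%H:%M}-{end:%H:%M}")
        start = end
    return windows

# An 8-hour interval with a base time of 02:00 yields the three partitions
# described above: ['02:00-10:00', '10:00-18:00', '18:00-02:00']
print(partition_windows("02:00", 8))
```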
Data uploading
File attributes
Format
You can choose to create the target files in CSV, JSON, or Parquet format.
In a JSON file, each record is represented by a single line, as in the following example:
{ "book_id": 123, "title": "Alice in Wonderland", "price": 6.99, "is_hardcover": false }
{ "book_id": 456, "title": "Winnie the Pooh", "price": 6.49, "is_hardcover": true }
{ "book_id": 789, "title": "The Cat in the Hat", "price": 7.23, "is_hardcover": true }
- If you choose JSON or Parquet format, the following fields will be hidden as they are only relevant to CSV format: Field delimiter, Record delimiter, Null value, Quote character, Quote escape character, and Add metadata header.
- The following fields are relevant for Parquet format only: Parquet version, Parquet timestamp unit, and Parquet maximum LOB size (KB).
For information about data type mappings when using Parquet format and limitations, see Mapping from Qlik Cloud data types to Parquet
Field delimiter
The delimiter that will be used to separate fields (columns) in the target files. The default is a comma.
Example using a comma as a delimiter:
"mike","male"
Delimiters can be standard characters or a hexadecimal (hex) value. Note that the "0x" prefix must be used to denote a hexadecimal delimiter (e.g. 0x01 = SOH). In the Field delimiter, Record delimiter, and Null value fields, the delimiter can consist of concatenated hex values (e.g. 0x0102 = SOHSTX), whereas in the Quote character and Quote escape character fields, it can only be a single hex value.
The hexadecimal number 0x00 is not supported (i.e. only 0x01-0xFF are supported).
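For illustration only, the following Python sketch (the helper name is ours, not a product setting) shows how a delimiter written in hex notation corresponds to the byte values that end up in the files:

```python
def decode_delimiter(value):
    """Illustrative only: return the byte(s) represented by a delimiter
    setting. Standard characters are used as-is; a value prefixed with
    '0x' is read as one or more concatenated hex byte values."""
    if value.lower().startswith("0x"):
        raw = bytes.fromhex(value[2:])
        if 0 in raw:
            raise ValueError("0x00 is not supported")  # per the note above
        return raw
    return value.encode("utf-8")

print(decode_delimiter(","))       # b','          a standard character
print(decode_delimiter("0x01"))    # b'\x01'       SOH
print(decode_delimiter("0x0102"))  # b'\x01\x02'   SOH followed by STX
```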
Null value
The string that will be used to indicate a null value in the target files.
Example (where \n is the record delimiter and @ is the null value):
mike,male,295678
sara,female,@
Record delimiter
The delimiter that will be used to separate records (rows) in the target files. The default is a newline (\n
).
Example:
mike,male\nsara,female\n
Quote character
The character that will be used at the beginning and end of a text column. The default is the double-quote character ("). When a column that contains column delimiters is enclosed in double-quotes, the column delimiter characters are interpreted as actual data, and not as column delimiters.
Example (where @ is the quote character and the first field contains the field delimiter):
@mike,jones@,male
Quote escape character
The character used to escape a quote character in the actual data. The default is the double-quote character (").
Example (where " is the quote character and \ is the escape character):
Parquet version
Select which version to use according to which version the target platform supports. Note that Parquet version 1.0 supports the MICRO timestamp unit only, while Parquet version 2.6 supports both MICRO and NANO timestamp units.
Parquet timestamp unit
When Parquet version is set to 2.6, choose MICRO or NANO. When Parquet version is set to 1.0, only MICRO is supported.
Parquet maximum LOB size (KB)
The default maximum LOB size is 64 KB and the maximum value you can enter in this field is 10,000 KB. Handling LOB columns requires greater resources, which in turn impacts performance. Only increase this value if you are replicating LOB data larger than 64 KB and you need all of the LOB data to be replicated to the target.
Maximum file size
The maximum size a file can reach before it is closed (and optionally compressed). Smaller files might be uploaded faster (depending on the network) and improve performance when used in conjunction with the parallel execution option. However, it is generally considered bad practice to clutter the target with many small files.
Compress files using
Choose one of the compression options to compress the target files or NONE (the default) to leave them uncompressed. Note that the available compression options are determined by the selected file format.
Add metadata header
You can optionally add a header row to the data files. The header row can contain the source column names and/or the intermediate (i.e. Qlik Talend Data Integration) data types.
Example of a target file with a header row when both With column names and With data types are selected:
Position:DECIMAL(38,0),Color:VARCHAR(10)
1,"BLUE"
2,"BROWN"
3,"RED"
...
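If you post-process the landed files yourself, the header row can be split back into column names and intermediate data types. The Python sketch below is illustrative only and assumes a file laid out exactly like the example above:

```python
import csv
import re

def read_with_metadata_header(path):
    """Illustrative only: parse a target CSV whose first row holds
    'name:TYPE' pairs (With column names and With data types selected)."""
    with open(path, newline="") as f:
        header_line = f.readline().rstrip("\r\n")
        data_rows = list(csv.reader(f))
    # Split the header on commas that are not inside parentheses, so that
    # types such as DECIMAL(38,0) are kept intact.
    columns = re.split(r",(?![^()]*\))", header_line)
    names = [column.split(":", 1)[0] for column in columns]
    types = [column.split(":", 1)[1] for column in columns]
    return names, types, data_rows

# For the example above, names would be ['Position', 'Color'] and
# types would be ['DECIMAL(38,0)', 'VARCHAR(10)'].
```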
Change processing
This section describes conditional settings in Change Processing.
Apply/store changes when
- File size reaches: Specify the maximum size of Change Data to accumulate before uploading the file to the target.
- Elapsed time reaches: Specify the maximum time that can elapse before the accumulated changes are uploaded to the target.
Metadata files
When the Create metadata files in the target folder option is selected, for each data file, a matching metadata file with a .dfm extension will be created under the specified target folder. The metadata files provide additional information about the task/data such as the source connector type, the source table name, the number of records in the data file, and so on.
For a full description of the metadata file as well as possible uses, see Metadata file description
Metadata
LOB columns
- Include LOB columns and limit column size to (KB): You can choose to include LOB columns in the task, and set the maximum LOB size. LOBs that are larger than the maximum size will be truncated.
Control tables
Select which of the following Control tables you want to be created on the target platform:
- Replication Status: Provides details about the current landing task including task status, amount of memory consumed by the task, number of changes not yet applied to the data platform and the position in the data source from which the data is currently being read.
- Suspended Tables: Provides a list of suspended tables, and the reason they were suspended.
- Replication History: Provides information about the task history including the number and volume of records processed during a landing task, latency at the end of a CDC task, and more.
- Change Data Partitions: Provides records of partitions created on the target database due to Change data partitioning. You can use this information to identify partitioned data that needs to be further processed.
For a detailed description of each of the Control Tables, see Control tables
Full load
Performance tuning
- Maximum number of tables to load in parallel: Enter the maximum number of tables to load into the target at one time. The default value is 5.
- Transaction consistency timeout (seconds): Enter the number of seconds to wait for open transactions to close before beginning the Full Load operation. The default value is 600 (10 minutes). The full load will start after the timeout value is reached, even if there are transactions that are still open.
Information note: To replicate transactions that were open when Full Load started but were only committed after the timeout value was reached, you need to reload the target tables.
- Commit rate during full load: The maximum number of events that can be transferred together. The default value is 10000.
After full load completes
Create primary key or unique index: Select this option if you want to delay creation of the primary key or unique index on the data platform until after the full load completes.
For initial load
When moving data from a SaaS application source, you can set how to perform the initial full load:
Option | Description |
---|---|
Use cached data | This option lets you use cached data that was read when generating metadata with Full data scan selected. This creates less overhead regarding API use and quotas, as the data is already read from the source. Any changes since the initial data scan can be picked up by Change data capture (CDC). |
Load data from source | This option performs a new load from the data source. This option is useful if: |
Store changes processing
The Change Table header columns provide information about the Change Processing operation such as the type of operation (for example, INSERT), the commit time, and so on. If you do not need this information, you can configure the data task to create the Change Tables without some or all of the header columns, thereby reducing their footprint in the target database. To do this, clear the check boxes for the header columns that you wish to exclude.
Note that you cannot remove additional columns or restore columns while a task is running. To change your initial selection, you first need to stop the task, then modify your selection, and finally reload the target tables.
When Change data partitioning is enabled, an extra header column named "partition_name" is added to the Change Tables and automatically selected in the UI. As this column is required, it cannot be excluded.
For a description of the header columns, see Using Change Tables.
Error handling
Data errors
Data error handling is supported with the Change Data Capture (CDC) update method only.
Data truncation errors
For data truncation errors: Select what you want to happen when a truncation occurs in one or more specific records. You can select one of the following from the list:
- Ignore: The task continues and the error is ignored.
- Suspend table: The task continues, but the table with the error record is moved into an error state and its data is not replicated.
- Stop task: The task is stopped and manual intervention is required.
Other data errors
For other data errors: Select what you want to happen when an error occurs in one or more specific records. You can select one of the following from the list:
- Ignore: The task continues and the error is ignored.
- Suspend table: The task continues, but the table with the error record is moved into an error state and its data is not replicated.
- Stop task: The task is stopped and manual intervention is required.
Escalating data error handling
Escalate error handling when other data errors reach (per table): Select this check box to escalate error handling when the number of non-truncation data errors (per table) reaches the specified amount. Valid values are 1-10,000.
Escalation action: Choose what should happen when error handling is escalated. Note that the available actions are dependent on the action selected from the For other data errors drop-down list described above.
- Suspend table (default): The task continues, but the table with the error record is moved into an error state and its data is not landed.
- Stop task: The task is stopped and manual intervention is required.
Table errors
When encountering a table error: Select one of the following from the drop-down list:
- Suspend table (default): The task continues, but the table with the error record is moved into an error state and its data is not replicated.
- Stop task: The task is stopped and manual intervention is required.
Escalate error handling when table errors reach (per table): Select this check box to escalate error handling when the number of table errors (per table) reaches the specified amount. Valid values are 1-10,000.
Escalation action: The escalation policy for table errors is set to Stop task and cannot be changed.
Environmental errors
- Maximum retry count: Select this option and then specify the maximum number of attempts to retry a task when a recoverable environmental error occurs. After the task has been retried the specified number of times, the task is stopped and manual intervention is required.
To never retry a task, clear the check box or specify "0".
To retry a task an infinite number of times, specify "-1".
- Interval between retry attempts (seconds): Use the counter to select or type the number of seconds that the system waits between attempts to retry a task. Valid values are 0-2,000.
- Increase retry interval for long outages: Select this check box to increase the retry interval for long outages. When this option is enabled, the interval between each retry attempt is doubled, until the Maximum retry interval is reached (and continues retrying according to the specified maximum interval).
- Maximum retry interval (seconds): Use the counter to select or type the number of seconds to wait between attempts to retry a task when the Increase retry interval for long outages option is enabled. Valid values are 0-2,000.
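The combined effect of these two settings can be sketched as follows (the function name and values are illustrative):

```python
def retry_schedule(interval_seconds, max_interval_seconds, attempts):
    """Illustrative only: the wait before each retry attempt when
    'Increase retry interval for long outages' is enabled. The interval
    doubles after every attempt until the maximum interval is reached."""
    waits, current = [], interval_seconds
    for _ in range(attempts):
        waits.append(current)
        current = min(current * 2, max_interval_seconds)
    return waits

# A 5-second interval with a 60-second maximum over 6 attempts:
print(retry_schedule(5, 60, 6))   # [5, 10, 20, 40, 60, 60]
```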
Change processing tuning
Transactional offload tuning
- Offload transactions in progress to disk if:
Transaction data is usually kept in memory until it is fully committed to the source or target. However, transactions that are larger than the allocated memory or that are not committed within the specified time limit will be offloaded to disk.
- Total memory size for all transactions exceeds (MB): The maximum size that all transactions can occupy in memory before being offloaded to disk. The default value is 1024.
- Transaction duration exceeds (seconds): The maximum time that each transaction can stay in memory before being offloaded to disk. The duration is calculated from the time Qlik Talend Data Integration started capturing the transaction. The default value is 60.
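A minimal sketch of the offload decision, using the defaults described above (the function and parameter names are illustrative, not product configuration keys):

```python
def should_offload(total_memory_mb, transaction_seconds,
                   memory_limit_mb=1024, duration_limit_seconds=60):
    """Illustrative only: a transaction is offloaded to disk when the total
    memory used by all transactions exceeds the limit, or when the
    transaction has been in memory longer than the duration limit."""
    return (total_memory_mb > memory_limit_mb
            or transaction_seconds > duration_limit_seconds)

print(should_offload(1500, 10))   # True  - memory limit exceeded
print(should_offload(200, 90))    # True  - duration limit exceeded
print(should_offload(200, 10))    # False - kept in memory
```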
Batch tuning
- Minimum number of changes per transaction: The minimum number of changes to include in each transaction. The default value is 1000.
Information note: The changes will be applied to the target either when the number of changes is equal to or greater than the Minimum number of changes per transaction value OR when the Maximum time to batch transactions before applying (seconds) value described below is reached - whichever occurs first. Because the frequency of changes applied to the target is controlled by these two parameters, changes to the source records may not immediately be reflected in the target records.
- Maximum time to batch transactions before applying (seconds): The maximum time to collect transactions in batches before declaring a timeout. The default value is 1.
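The batching behavior described in the note above can be sketched as follows. The function names are illustrative, and a real implementation would also flush on a timer even when no new changes arrive:

```python
import time

def apply_to_target(batch):
    print(f"applying {len(batch)} changes")   # stand-in for the real apply step

def apply_in_batches(changes, min_changes=1000, max_batch_seconds=1.0):
    """Illustrative only: apply accumulated changes when the batch reaches
    the minimum number of changes OR the maximum batching time elapses,
    whichever occurs first."""
    batch, started = [], time.monotonic()
    for change in changes:
        batch.append(change)
        if len(batch) >= min_changes or time.monotonic() - started >= max_batch_seconds:
            apply_to_target(batch)
            batch, started = [], time.monotonic()
    if batch:
        apply_to_target(batch)   # apply whatever is left at the end

# 2,500 captured changes with the default settings typically yield batches
# of 1000, 1000, and 500.
apply_in_batches(range(2500))
```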
Interval
This option is only available when:
- Using Data Movement gateway
- Landing data from SaaS application sources
- The task is defined with the Change data capture (CDC) update method
Read changes every (Minutes)
Set the interval between reading changes from the source in minutes. The valid range is 1 to 1,440.
Miscellaneous tuning
- Statements cache size (number of statements): The maximum number of prepared statements to store on the server for later execution (when applying changes to the target). The default is 50. The maximum is 200.
- DELETE and INSERT when updating a primary key column: This option requires full supplemental logging to be turned on in the source database.
Schema evolution
Select how to handle the following types of DDL changes in the schema. When you have changed schema evolution settings, you must prepare the task again. The table below describes which actions are available for the supported DDL changes.
DDL change | Apply to target | Ignore | Suspend table | Stop task |
---|---|---|---|---|
Add column | Yes | Yes | Yes | Yes |
Rename column | No | No | Yes | Yes |
Rename table | No | No | Yes | Yes |
Change column data type | No | Yes | Yes | Yes |
Create table (If you used a Selection rule to add datasets that match a pattern, new tables that meet the pattern will be detected and added.) | Yes | Yes | No | No |
Character substitution
You can substitute or delete source characters in the target database and/or you can substitute or delete source characters that are not supported by a selected character set.
- All characters must be specified as Unicode code points.
- Character substitution will also be performed on the Control tables.
- Invalid values will be indicated by a red triangle in the top right of the table cell. Hovering your mouse cursor over the triangle will show the error message.
- Any table-level or global transformations defined for the task will be performed after the character substitution has been completed.
- Substitution actions defined in the Substitute or Delete Source Characters table are performed before the substitution action defined in the Substitute or Delete Source Characters Unsupported by the Selected Character Set table.
- Character substitution does not support LOB data types.
Substituting or deleting source characters
Use the Substitute or delete source characters table to define replacements for specific source characters. This may be useful when the Unicode representation of a character is different on the source and target platforms. For example, on Linux, the minus character in the Shift_JIS character set is represented as U+2212, whereas on Windows it is represented as U+FF0D.
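The example above can be pictured as a simple code-point mapping applied to each source string before it is written to the target. This Python sketch uses only the mapping mentioned in the text plus one illustrative deletion:

```python
# Illustrative only: map source code points to replacement code points,
# or to None to delete the character.
substitutions = {
    0x2212: 0xFF0D,   # MINUS SIGN -> FULLWIDTH HYPHEN-MINUS (the example above)
    0x0007: None,     # delete the BEL control character (illustrative)
}

def substitute_source_characters(value):
    return value.translate(substitutions)

print(substitute_source_characters("price \u2212 100"))   # price － 100
```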
To | Do This |
---|---|
Define substitution actions. | |
Edit the specified source or target character | Click at the end of the row and select Edit. |
Delete entries from the table | Click at the end of the row and select Delete. |
Substituting or deleting source characters unsupported by the selected character set
Use the Unsupported source characters by character set table to define a single replacement character for all characters not supported by the selected character set.
To | Do This |
---|---|
Define or edit a substitution action. | |
Disable the substitution action. | Select the blank entry from the Character Set drop-down list. |
Loading dataset segments in parallel
During full load, you can accelerate the loading of large datasets by splitting the dataset into segments, which will be loaded in parallel. Tables can be split by data ranges, all partitions, all subpartitions, or specific partitions.
For more information, see Loading dataset segments in parallel.
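Conceptually, segment loading works as in the sketch below; the table, column, ranges, and load function are illustrative and do not reflect product internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative data ranges for a hypothetical ORDERS table, segmented on an
# order_id column.
segments = [(1, 1_000_000), (1_000_001, 2_000_000), (2_000_001, 3_000_000)]

def load_segment(bounds):
    low, high = bounds
    # A real full load would read only the rows where low <= order_id <= high
    # and write them to the target; here we just report the segment.
    print(f"loading rows {low}-{high}")

# During full load, the segments are loaded in parallel.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    list(pool.map(load_segment, segments))
```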
More options
These options are not exposed in the UI as they are only relevant to specific versions or environments. Consequently, do not set these options unless explicitly instructed to do so by Qlik Support or product documentation.
To set an option, simply copy the option into the Add feature name field and click Add. Then set the value or enable the option according to the instructions you received.
Scheduling tasks when working without Data Movement gateway
Data Movement gateway is not supported with a Qlik Talend Cloud Starter subscription and is optional with other subscription tiers. When working without Data Movement gateway, near real-time CDC (Change data capture) is not supported. Instead, you keep the target data up-to-date by setting a scheduling interval. The schedule determines how often the target datasets will be updated with changes to the source datasets.

Whereas the schedule determines the update frequency, the dataset type determines the update method. If the source datasets support CDC (Change data capture), only the changes to the source data will be replicated and applied to the corresponding target tables. If the source datasets do not support CDC (for example, Views), changes will be applied by reloading all of the source data to the corresponding target tables.

If some of the source datasets support CDC and some do not, two separate sub-tasks will be created (assuming the Change data capture (CDC) update method is selected): one for reloading the datasets that do not support CDC, and the other for capturing the changes to datasets that do support CDC. In this case, to ensure data consistency, it is strongly recommended to set the same schedule for both sub-tasks.
To change the scheduling:
- Open your data project and then do one of the following:
- In tasks view, click on a data task and select Scheduling.
- In pipeline view, click on a data task and select Scheduling.
- Open the replication task and click the Scheduling toolbar button.
- Change the scheduling settings as needed and then click OK.