Store Changes Settings
When you click Store Changes in the Task Settings dialog box, you can configure the Store Changes Settings for a replication task.
Store changes processing is ON/OFF:
Click this button to toggle Store Changes on or off. The initial setting is determined when Adding tasks. If this option is ON, changes are stored in either Change Tables or an Audit Table.
For more information about storing and applying changes, see Using an Audit table and Working with Change Tables.
Store Changes can be turned on or off at any time without affecting anything in the task. Changes that are processed and not stored as a result of change storage being turned off can be recovered only by setting the task to an earlier point in time.
If Store Changes is ON, use the following options to determine how to store changes. Changes can be stored in Change Tables or in a single Audit table. From the Store changes in drop-down list, choose either Change tables or Audit table according to your needs.
Storing changes in Change Tables
The following section describes the options that are available when storing changes in Change Tables.
-
Suffix: Type a string to use as the suffix for all Change Tables. The default value is __ct.
The Change Table names are the name of the target table with the suffix appended. For example, if you have a table called HR and use the default value, the name of the Change Table will be HR__ct.
For more information, see Working with Change Tables.
-
Header column prefix: Type a string to use as the prefix for all of the Change Table header columns. The default value is header__.
For example, the header column stream_position when using the default value is called header__stream_position.
For more information, see Change Tables.
-
DDL options: Select one of the following options to determine how to handle DDL operations on the source tables:
- Apply to change table - DDL operations to the source tables (such as a column being added) will be applied to the corresponding Replicate Change Tables only.
- Ignore - All DDL operations to source tables will be is ignored.
-
Apply to change table and target table
This is the default option. When this option is selected, DDL operations to the source tables will be applied both to the Replicate Change Tables and to the corresponding target tables.
-
On UPDATE: Select one of the following options to determine how to store UPDATEs to the source tables:
- Store before and after image - To store both the pre-UPDATE data and the post-UPDATE data.
- Store after image only - To store only the post-UPDATE data.
Change table creation:
If Change Table exists when full load starts: Select one of the following from the list to determine how you want to handle loading the Change Tables at full-load startup:
- DROP and CREATE Change Table: The table is dropped and a new table is created in its place.
-
ARCHIVE and CREATE Change Table:A copy of the existing table will be saved to the same schema before the new table is created. The archived table name will be appended with a timestamp, indicating when the archiving operation occurred (e.g. Customers___ct_20170605175601).
Information noteCurrently this option is only available for the Hadoop target endpoint.
- Delete old changes and store new changes in existing Change Table: Data is truncated and added without affecting the table metadata.
- Keep old changes and store new changes in existing Change Table: Data and metadata of the existing Change table are not affected.
Change Data Partitioning
This feature is supported with the following target endpoints only:
-
Hadoop (Hortonworks, Cloudera, and MapR)
-
File
-
Amazon S3
- Databricks (Cloud Storage)
-
Microsoft Azure HDInsight
-
Google Cloud Storage
- Microsoft Azure ADLS
- Cloudera Data Platform (CDP) Private Cloud
- Google Data Proc
- Amazon EMR
- Hortonworks Data Platform (HDP)
In a standard replication task, changes are replicated to the target in no particular order. Change Data Partitioning enables processing of Change Data from many tables in a consistent fashion. You can define the duration of partitions as well as the partitioning base time, thereby ensuring overall consistency of the partitioned data (i.e. no partial transactions, no order headers without order lines, and so on.)
The partitioned data is stored in the Replicate Change Tables. When the Change Data Partitions table is selected (in the Control tables tab), information about the partitions will be recorded in the attrep_cdc_partitions Control Table on the target database. This information can be used to identify partitioned data that needs to be further processed.
The partitioning options are as follows:
- Off - Replicate Change Data without partitioning.
-
Partition every - Specify the length (in hours and minutes) of each partition.
Information noteIt is recommended to specify a partition length in excess of one hour. Although specifying a partition length less than one hour may improve latency, creating many partitions on the target may also impact (target) performance (especially in systems with large volumes of changes).
If you resume a task from BEFORE the time that the last partition was created, Replicate will write to a partition that has already been closed.
- Partition base time - Partitions are created during a 24 hour time period, which is calculated according to the specified “Partitioning base time” on the source database (in UTC time). For example, a partition interval of 8 hours with a “Partitioning base time” time of 02:00 will create the following partitions: 02:00-10:00, 10:00-18:00, 18:00-02:00 - but not necessarily in that order. For instance, if a task started at 01:00, then the timeframe of the first partition will be 18:00-02:00. Additionally, if a task started in the middle of a partition (e.g. at 04:00), its Change Data will be inserted into the 02:00-10:00 partition (even though no changes were captured before 04:00).
-
Speed partition mode
Information noteThis feature is supported with Hadoop-based target endpoints only.
By default, Change Data Partitions for all tables in a replication task are registered on the target at the same time. As soon as the partitions are registered, information about them (such as a partition's start and end time) is also published to the Change Data Partitions Control Table. This ensures data consistency across all of the replicated tables.
In Speed partition mode, rather than waiting for all Change Data files to be uploaded (for all tables), Replicate creates and registers a partition per-table as soon as the partition’s first data file is uploaded (to its table). Creating and registering partitions before all the Change Data has been uploaded is especially useful for consuming applications such as Qlik Compose that need to process changes with a minimum of delay, as the Change Data rapidly becomes available for consumption, even if it is incomplete.
For example, if an endpoint is configured to upload Change Data files every five minutes, each batch of changes can be processed immediately, even if the Partition every interval is set to six hours.
However, despite significantly reducing latency, working in this mode may have some possible drawbacks:
-
Accumulation of a large number of files per partition, which may impact performance on the target.
It is therefore strongly recommended to clean up old partitions using the Partition Retention options (see below) provided for this purpose.
- Inconsistent results may be encountered when querying several tables due to the Change Data arriving on the target at different times.
-
Change Tables that were created before Change Data Partitioning was enabled need to be dropped or renamed so that they can be recreated with the additional "partition_name" column.
Deleting old partitions periodically
Over time, Change Data Partitioning can result in a large number of files and partitions accumulating on the target system, which may significantly impact performance. Therefore, to ensure optimal performance, best practice is to delete old partitions from time to time.
There are two ways of deleting processed partitions: periodically and ad-hoc. This topic explains how to set up periodic partition deletion. For information on ad-hoc partition deletion, see Deleting old partitions manually.
Currently, this feature is supported with the Microsoft Azure Databricks endpoint only.
Setting up periodic deletion is a two-phase process which involves specifying a deletion interval in the UI, and setting a partition retention barrier using the Enterprise Manager API.
To do this:
-
Enable the Partition Retention option.
Any partitions created while this option is disabled will not be deleted when deletion is initiated (either periodic or manual). Therefore, the Partition Retention option should only be disabled if you are absolutely certain that periodic deletion will not be initiated at any point in the future.
-
Specify a partition deletion interval in Day, Hours or Minutes.
To prevent the deletion of open partitions or recently closed partitions, the partition deletion interval must be at least double the Partition every value.
-
Set a retention barrier date by calling the relevant Enterprise Manager API method.
To ensure that only old (i.e. processed) partitions will be deleted, periodic deletion requires a retention barrier date to be set. The retention barrier date should be set by the consuming application each time it finishes processing a partition. A consuming application can be any application (Qlik or otherwise) tasked with processing the partitioned data. There may be several consuming applications, each of which sets its own retention barrier. In this case, partitions will be deleted up to the earliest retention barrier. For example, if Application A sets July 7th, 2020 as a barrier, Application B sets August 7th, 2020 as a barrier, and Application C sets September 7th, 2020 as a barrier, partitions will be deleted up to July 7th, 2020.
As soon as a retention barrier is set, periodic deletion will start to run according to the defined interval, until any of the following occurs:
- The Partition Retention option is disabled
- An API call removes the retention barrier
- All partitions until the retention barrier have been deleted
For more information on setting and removing a partition retention barrier using the API, refer to the Enterprise Manager Developer's Guide.
Limitations and considerations
- If a connection error occurs during the partition deletion operation, the operation will fail (with a warning). The operation will continue from the point of failure the next time that deletion is performed (either periodically or manually).
- If a specific object (partition or storage) cannot be deleted due to other errors (i.e. not connection errors), a warning will be written to the log and deletion of the remaining objects will continue. Partitions that cannot be deleted due to such errors must be deleted directly from the database by the DBA.
- To prevent deletion of open partitions, the last partition in a table will not be deleted, even if meets the criteria for deletion.
- Deletion of partitions in renamed tables is not supported.
- Reloading a target with the Drop and Create tables option (the default) will delete both the tables and their partitions, regardless of any partition deletion policy. However, reloading a target with the Do nothing option, will leave the partitions untouched. In this case, partitions created prior to the target reload cannot be deleted using the Replicate partition deletion tools.
Selecting Change Table header columns
The Change Table header columns provide information about the Change Processing operation such as the type of operation (e.g. INSERT), the commit time, and so on. If you do not need this information, you can configure Replicate to create the Change Tables without some or all of the header columns, thereby reducing their footprint in the target database. To do this, clear the check boxes next to the header columns that you wish to exclude.
Note that you cannot remove additional columns or restore columns while a task is running. To change your initial selection, you first need to stop the task, then modify your selection, and finally reload the target tables.
When Change Data Partitioning is enabled, an extra header column named "partition_name" is added to the Change Tables and automatically selected in the UI. As this column is required, it cannot be excluded.
For a description of the header columns, see Change Tables.
Storing changes in an Audit table
The following section describes the options that are available for storing changes in an Audit table.
- If your source tables contain records exceeding 4096 characters, to prevent truncation of these records, you must turn on the Task Settings' Replicate LOB columns option (even if the source tables do not contain any LOB columns).
- LOB columns with unlimited size are not supported in the CHANGE_RECORD and BU_CHANGE_RECORD columns. The other fields will be recorded but the LOB will have a NULL value.
- For a description of the audit table structure, see Using an Audit table.
-
Audit table schema: Specify a schema if you do not want the Audit table to be created under the target endpoint's default schema.
The default schema are as follows:
Default schema by endpoint Endpoint Default Schema Pivotal Greenplum
Public
Amazon Redshift
Public
Oracle
The connected user’s user name.
Teradata
The endpoint name.
All others
The user’s default schema.
-
Audit table tablespace: This option is only available when the task's target endpoint is Oracle. Enter the tablespace name on the target where you want the Audit table to be created. If you do not enter any information in this field, then the tables will created in the default permanent tablespace.
-
Audit table name: Specify a name for the Audit table.
The default value is attrep__audit_table.
Audit table creation:
If audit table exists when the target is reloaded: Select one of the following to determine how you want to handle the Audit table when the target is reloaded:
- DROP and CREATE audit table: The Audit table is dropped and a new table is created in its place.
-
ARCHIVE and CREATE audit table: A copy of the existing table will be saved to the same schema before the new table is created. The archived table name will be appended with a timestamp, indicating when the archiving operation occurred (e.g. attrep_audit_table_20170605175601).
Information noteCurrently this option is only available for the Hadoop target endpoint.
- Delete old changes and store new changes in existing audit table: Data is truncated and added without affecting the Audit table metadata.
- Keep old changes and store new changes in existing audit table: Data and metadata of the existing Audit table are not affected.
For a description of the audit table structure, see Using an Audit table.