Loading dataset segments in parallel
During full load, you can accelerate the loading of large datasets by splitting the dataset into segments, which will be loaded in parallel. Tables can be split by data ranges, all partitions, all subpartitions, or specific partitions.
Each dataset segment is loaded using a separate sub-task. Therefore, when deciding how to split a dataset, you should always take into account the availability of database and network resources. Splitting a table into too many segments might impact database performance and overburden network capacity. If you choose the Data ranges segmentation method with a partitioned dataset, best practice is to set up the ranges so that each range covers one or more entire partitions. This will expedite data retrieval and minimize the impact on database processing resources.
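Conceptually, each segment becomes an independent sub-task that reads its own slice of the source table. The following Python sketch illustrates the idea with a thread pool; `load_segment` and the segment predicates are hypothetical stand-ins for the work the sub-tasks perform, not part of the product API.

```python
from concurrent.futures import ThreadPoolExecutor

def load_segment(where_clause: str) -> str:
    # Hypothetical worker: a real sub-task would run
    # "SELECT ... FROM source_table WHERE <where_clause>"
    # against the source and write the rows to the target.
    return f"loaded segment: {where_clause}"

# One WHERE predicate per segment (illustrative values only).
segments = [
    "COL1 < 10",
    "COL1 >= 10 AND COL1 < 20",
    "COL1 >= 20",
]

# Each segment is handed to a separate worker and loaded in parallel.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    results = list(pool.map(load_segment, segments))
```

Every extra worker means one more concurrent query against the source, which is why over-segmentation strains database and network resources.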
Supported data sources and targets
The data task must be defined with a combination of the following data source and target connectors.
Supported data source connectors
- IBM DB2 for LUW
- IBM DB2 for z/OS

  Information note: Table segmentation by partitions or subpartitions is not supported with the IBM DB2 for z/OS source connector.
- Microsoft SQL Server (log based)
- MySQL
- Oracle
- PostgreSQL

  Information note: Table segmentation by partitions or subpartitions is not supported with the PostgreSQL source connector.
- SAP (Application)

  Information note: Tables with the SAP (Application) source endpoint are client dependent by default. The MANDT column is automatically taken directly from the connector.
- SAP HANA (Database)
Supported target connectors
- Amazon Redshift
- Amazon S3
- Google Cloud BigQuery
- Google Cloud Storage
- Microsoft Fabric Data Warehouse
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Snowflake
Setting up parallel load
You can split a table using one of two methods: Data ranges, which splits the table into segments according to ranges of column values, or Partitions, which splits the table according to its partitions.
Using the data ranges method
To define segment boundaries by data range:
1. In the Datasets tab, click the menu to the right of the dataset and select Settings.

   The Parallel load dialog opens.
2. Select the Data ranges segmentation method.
3. Click Select columns.

   The Table segmentation columns dialog opens.
4. For all supported data sources, the Unique Index column is automatically selected. Select which additional columns you want to use to delineate the ranges, and then click OK.

   Information note:
   - It is recommended to use columns that do not allow NULLs and that are not updated during normal source database operations (for example, a primary key column or a date column with a fixed date). Using a column that is updated during full load might cause duplicates in the target database.
   - It is recommended to use indexed source columns, as this will optimize data retrieval performance.
   - Up to ten columns can be selected.
   - Records with null values will not be replicated.
   - The following data types cannot be used to define segments by ranges: DOUBLE, FLOAT, and LOB (BLOB, CLOB, NCLOB).
5. Click Add data range to add a data range segment.

   The Add data range dialog opens with your selected columns.
6. For each column, enter the upper data range of the segment.

   Information note: Values in DATE columns must be entered in the format supported by the source. For example, for an Oracle source, the correct format would be:
   - ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS' (specifying YYYY-MM-DD only is also valid)
   - ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF9'
   - ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF9 TZH:TZM'
7. Click OK to save your settings and close the dialog.

   Each of the selected columns will appear with its specified range.
8. Repeat steps 3-6 as required to add columns and data ranges.
9. Click Validate to validate that the data you entered corresponds to the source column data type and that all of the defined segments contain values.
With the Data ranges segmentation method, all of the table data will be replicated, even if data ranges are not defined for all of the columns.
To edit a data range:
- Click the menu at the end of the row and select Edit.
- Edit the data range as required and click OK to save your changes.
To delete a data range:
- Click the menu at the end of the row and select Delete.
- When prompted to confirm the deletion, click Delete.
Usage example
Let's assume that the following segments are defined:
| Column_1 | Column_2 | Column_3 |
|---|---|---|
| 10 | 30 | 105 |
| 20 | 20 | 120 |
| 100 | 12 | 99 |
In this case, the following "WHERE" clauses will be created for each load segment:
- Segment 1: ((COL1 < 10) OR ((COL1 = 10) AND (COL2 < 30)) OR ((COL1 = 10) AND (COL2 = 30) AND (COL3 < 105)))
- Segment 2: NOT ((COL1 < 10) OR ((COL1 = 10) AND (COL2 < 30)) OR ((COL1 = 10) AND (COL2 = 30) AND (COL3 < 105))) AND ((COL1 < 20) OR ((COL1 = 20) AND (COL2 < 20)) OR ((COL1 = 20) AND (COL2 = 20) AND (COL3 < 120)))
- Segment 3: NOT ((COL1 < 20) OR ((COL1 = 20) AND (COL2 < 20)) OR ((COL1 = 20) AND (COL2 = 20) AND (COL3 < 120))) AND ((COL1 < 100) OR ((COL1 = 100) AND (COL2 < 12)) OR ((COL1 = 100) AND (COL2 = 12) AND (COL3 < 99)))
- Segment 4: NOT ((COL1 < 100) OR ((COL1 = 100) AND (COL2 < 12)) OR ((COL1 = 100) AND (COL2 = 12) AND (COL3 < 99)))
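The boundary rows in the table above act as composite upper bounds, and each segment's predicate means "at or above the previous boundary, below the current one". As an illustration (not the product's actual code), this Python sketch generates the same WHERE clauses from the boundary values:

```python
def less_than(cols, vals):
    # Composite "row < boundary" predicate in lexicographic order:
    # (c1 < v1) OR (c1 = v1 AND c2 < v2) OR (c1 = v1 AND c2 = v2 AND c3 < v3) ...
    terms = []
    for j in range(len(cols)):
        eqs = [f"({cols[k]} = {vals[k]})" for k in range(j)]
        cmp = f"({cols[j]} < {vals[j]})"
        terms.append("(" + " AND ".join(eqs + [cmp]) + ")" if eqs else cmp)
    return "(" + " OR ".join(terms) + ")"

def segment_clauses(cols, boundaries):
    # Segment i: NOT below the previous boundary AND below the current one.
    # A final catch-all segment takes every row at or above the last boundary.
    clauses, prev = [], None
    for b in boundaries:
        cur = less_than(cols, b)
        clauses.append(cur if prev is None else f"NOT {prev} AND {cur}")
        prev = cur
    clauses.append(f"NOT {prev}")
    return clauses

cols = ["COL1", "COL2", "COL3"]
boundaries = [(10, 30, 105), (20, 20, 120), (100, 12, 99)]
clauses = segment_clauses(cols, boundaries)
# clauses[0] is Segment 1's predicate; clauses[3] is Segment 4's.
```

The final `NOT (...)` catch-all clause is why all of the table data is replicated even when the defined ranges do not cover every value: any row at or above the last boundary falls into the last segment.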
Using the partitions method
You can define segment boundaries by all of the table's partitions, by all of its subpartitions (if the data source supports subpartitions), or by specific partitions.
- This method can only be used if the dataset is already partitioned.
- If you have a task that existed before this feature was released, and you either have not started the task yet or want to parallel reload one or more of the tables using this method, you first need to refresh the metadata, as described in Registering data that is already on the data platform. (The metadata refresh process described in that topic is the same for all types of data task.)
Using all partitions
To define segment boundaries by all of the table partitions or subpartitions:
1. In the Datasets tab, click the menu to the right of the dataset and select Settings.

   The Parallel load dialog opens.
2. Select the Partitions segmentation method.
3. Select one of the following:
   - Use all main partitions
   - Use all subpartitions

     Information note: This option is disabled if the data source does not support subpartitions.
4. Click OK.
Using specific partitions
To define segment boundaries by specific partitions:
1. In the Datasets tab, click the menu to the right of the dataset and select Settings.

   The Parallel load dialog opens.
2. Select the Partitions segmentation method.
3. Select Specify partitions.

   Information note: When Specify partitions is selected, only the specified partitions will be replicated.
4. Click Add partition.

   The Add partition dialog opens.
5. Specify the name of an existing partition or subpartition. If you are adding a subpartition, select the Subpartition check box.
6. Click OK to save your settings.
7. Repeat steps 4-6 as required to add partitions or subpartitions.
To edit a partition:
- Click the menu at the end of the row and select Edit.
- Edit the partition information as required and click OK to save your changes.
To delete a partition:
- Click the menu at the end of the row and select Delete.
- When prompted to confirm the deletion, click Delete.
Adjusting the number of segments that can be loaded in parallel
You can increase or decrease the number of segments that will be loaded in parallel. The value is inherited from the Full load > Performance tuning > Maximum number of tables to load in parallel field in the data task settings. The current value is displayed in the "Up to <n> segments can be loaded in parallel" string in the Parallel load dialog. Increasing the number might boost performance when a dataset is split into many segments, but will also put a greater strain on database resources and network capacity.
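The effect of this setting can be sketched as a cap on in-flight segment loads: however many segments a dataset has, no more than the configured number run at once. In this hedged Python illustration, `MAX_PARALLEL_SEGMENTS` and `load_segment` are hypothetical stand-ins for the inherited task setting and the per-segment work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_SEGMENTS = 3   # stands in for the inherited task setting
lock = threading.Lock()
active = 0   # loads currently running
peak = 0     # highest concurrency observed

def load_segment(segment_id: int) -> None:
    # Hypothetical segment loader; tracks how many loads run at once.
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.02)  # simulate the I/O-bound load
    with lock:
        active -= 1

# Ten segments, but never more than MAX_PARALLEL_SEGMENTS in flight.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_SEGMENTS) as pool:
    list(pool.map(load_segment, range(10)))
```

Raising the cap shortens the full load when there are many segments, at the cost of more simultaneous source queries and network traffic.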