tRuleSurvivorship properties for Apache Spark Batch

These properties are used to configure tRuleSurvivorship running in the Spark Batch Job framework.

The Spark Batch tRuleSurvivorship component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Properties	Description
Schema and Edit schema	A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word `line` when naming the fields. This component provides two read-only columns: SURVIVOR: this column is of type Boolean. It indicates whether a record is the survivor (true) or not (false). There will be only one survivor for each group . CONFLICT: when more than one record meet a given business rule, this column presents them. When a survivor record is created, the CONFLICT column does not show the conflicting columns if the conflicts have been resolved by the conflict rules. Built-In: You create and store the schema locally for this component only. Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Group identifier	Select the column whose content indicates the required group identifiers from the input schema.
Rule package name	Type in the name of the rule package you want to create with this component.
Generate rules and survivorship flow	Once you have defined all of the rules of a rule package or modified some of them with this component, click the icon to generate this rule package into the Survivorship Rules node of Rules Management under Metadata in the Repository of the Integration perspective of Talend Studio. Information noteNote: This step is necessary to validate these changes and take them into account at runtime. If the rule package of the same name exists already in the Repository, these changes will overwrite it once validated, otherwise the Repository one takes the priority during execution. Information noteWarning: In a rule package, two rules cannot use the same name.
Rule table	Complete this table to create a complete survivor validation flow. Basically, each given rule is defined as an execution step, so in the top-down order within this table, these rules form a sequence and thus a flow takes shape. The columns of this table are: Order: From the list, select the execution order of the rules you are creating so as to define a survivor validation flow. The types of order may be: Sequential: a Sequential rule is an execution step of the survivor validation flow. For example, the first rule on the top of this Rule table will be the first step and from this rule down, the second Sequential rule will be the second step. The first rule on the top must be a Sequential rule. Multi-condition: a Multi-condition rule is an additional rule to a given execution step. It is always added to the last Sequential rule above it in this table and then at this step, both of these two rules become necessary to respect. For example, having defined the first Sequential rule, you define a Multi-condition rule below; then both of them will become the rules of the first step. Multi-target: as each step, once executed, validates a record field value from a given Reference column and select the corresponding value as the best from a given Target column, a Multi-target rule allows you to add one more Target column to the same step. You need to define each Reference column and Target column manually in this table. Rule Name: Type in the name of each rule you are creating. This column is only available to the Sequential rules as they define the steps of the survivor validation flow. Do not use special characters in rule names, otherwise the Job may not run correctly. Rule names are case insensitive. Reference column: Select the column you need to apply a given rule on. They are the columns you have defined in the schema of this component. This column is not available to the Multi-target rules as they define only the Target column. Function: Select the type of validation operation to be performed on a given Reference column. The available types include: None: no validation operation is performed. Most common: it validates the most frequent field value in each duplicates group. Most recent or Most ancient: the former validates the earliest date value and the latter the latest date value in each duplicates group. The relevant reference column must be of the Date type. Most Complete: it validates the field when the record it belongs to has the least empty fields. Longest or Shortest: the former validates the longest field value and the latter the shortest in each duplicates group. Largest or Smallest: the former validates the largest numerical value and the latter the smallest numerical value in a duplicates group. Match regex: it validates the field when this field complies to the regular expression given in the Value column. Expression: it validates the field when it complies to the expression that you enter in the Value column. The expression value must be written with the Drools language. Value: enter the expression of interest corresponding to the Match regex or the Expression function you have selected in the Function column. Target column: when a step is executed, it validates a record field value from a given Reference column and selects the corresponding value as the best from a given Target column. Select this Target column from the schema columns of this component. Ignore blanks: Select the check boxes which correspond to the names of the columns for which you want the blank value to be ignored.
Define conflict rule	Select this check box to be able to create rules to resolve conflicts in the Conflict rule table.
Conflict rule table	Complete this table to create rules to resolve conflicts. The columns of this table are: Rule name: Type in the name of each rule you are creating. Do not use special characters in rule names, otherwise the Job may not run correctly. Conflicting column:When a step is executed, it validates a record field value from a given Reference column and selects the corresponding value as the best from a given Conflicting column. Select this Conflicting column from the schema columns of this component. Function: Select the type of validation operation to be performed on a given Conflicting column. The available types include those in the Rule table and the following ones: Fill empty: This function fills empty fields width the specified value. Remove duplicate: This function removes the field value in the Reference column if the same field value has been survived in the Conflicting column. Not match regex: This function validates the field when this field does not comply to the regular expression given in the Value column. Survive as: When a field value from the Reference column is survived, this function selects the corresponding field value as the best from the Conflicting column. Value: enter the expression of interest corresponding to the Match regex or the Expression function you have selected in the Function column. Reference column: Select the column you need to apply a given conflicting rule on. They are the columns you have defined in the schema of this component. Ignore blanks: Select the check boxes which correspond to the names of the columns for which you want the blank value to be ignored. Disable: Select the check box to disable the corresponding rule.

Advanced settings

Properties	Description
Set the number of partitions by GID	Enter the number of partitions you want to split each group into.

Global Variables

Variables	Description
Global Variables	ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it. For more information about variables, see Using contexts and variables.

Variables

Description

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it.

For more information about variables, see Using contexts and variables.

Usage

Usage guidance	Description
Usage rule	This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.
Spark Connection	In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files: Yarn mode (Yarn client or Yarn cluster): When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab. When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab. When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab. When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration. Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration Apache Spark Batch or tS3Configuration Apache Spark Batch. If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem). This connection is effective on a per-Job basis. When the Job is running on Spark 3.X with Databricks, go to the Databricks cluster and select the Databricks runtime version 10.1 (includes Apache Spark 3.2.0, Scala 2.12) or greater. Earlier versions are not supported. Check that the Scala version is 2.12.13 or greater.

Usage guidance

Description

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:

Yarn mode (Yarn client or Yarn cluster):
- When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.
- When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.
- When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
- When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.
Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration Apache Spark Batch or tS3Configuration Apache Spark Batch.

If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

When the Job is running on Spark 3.X with Databricks, go to the Databricks cluster and select the Databricks runtime version 10.1 (includes Apache Spark 3.2.0, Scala 2.12) or greater. Earlier versions are not supported.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!

Leave your feedback here