tMatchModel properties for Apache Spark Batch
These properties are used to configure tMatchModel running in the Spark Batch Job framework.
The Spark Batch tMatchModel component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Define a storage configuration component |
Select the configuration component to be used to provide the configuration information for the connection to the target file system such as HDFS. If you leave this check box clear, the target file system is the local system. The configuration component to be used must be present in the same Job. For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write the result in a given HDFS system. |
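As a point of reference, the following sketch (not Talend-generated code) shows how a Spark Batch program targets the local file system versus HDFS purely through the path scheme; the host name and paths are placeholders.

```scala
// Illustrative only: the path scheme decides the target file system in Spark.
// Without a storage configuration component, the write goes to the local
// file system, which corresponds to a file:// style path.
import org.apache.spark.sql.SparkSession

object StorageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("storage-sketch").getOrCreate()
    val df = spark.range(10).toDF("id")

    // Local file system (the default when no storage component is defined)
    df.write.mode("overwrite").parquet("file:///tmp/match_model_out")

    // Target HDFS instead, as a tHDFSConfiguration component would set up
    // (host name and port are placeholders)
    df.write.mode("overwrite").parquet("hdfs://namenode:8020/user/talend/match_model_out")
  }
}
```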
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Sync columns to retrieve the schema from the previous component connected in the Job. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
View schema: choose this option to view the schema only.
Change to built-in property: choose this option to change the schema to Built-in for local changes.
Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion.
Built-In: You create and store the schema locally for this component only.
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Matching key |
Select the columns on which you want to base the match computation. |
Matching label column |
Select the column from the input flow which holds the label you set manually on the suspect pairs of records. If you select the Integration with Data Stewardship check box, this list does not appear. In this case, the matching label column is the TDS_ARBITRATION_LEVEL column, which holds the labels set on the suspect pairs of records using Talend Data Stewardship. |
Matching model location |
Select the Save the model on file system check box and, in the Folder field, set the path to the local folder where you want to generate the matching files. If you want to store the model in a specific file system, for example S3 or HDFS, you must use the corresponding configuration component in the Job and select the Define a storage configuration component check box in the Basic settings of this component. The button for browsing does not work with the Spark Local mode; if you are using the other Spark Yarn modes that Talend Studio supports with your distribution, ensure that you have properly configured the connection in a configuration component in the same Job, using the configuration component that matches the file system to be used. |
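Under the hood, saving a model comes down to a Spark ML-style write to the configured folder. A minimal sketch, assuming a Spark ML RandomForestClassificationModel stands in for the matching model (the actual model format produced by tMatchModel is not documented here):

```scala
import org.apache.spark.ml.classification.RandomForestClassificationModel

object ModelSaveSketch {
  def saveModel(model: RandomForestClassificationModel, folder: String): Unit = {
    // A plain local path works in the Spark Local mode; for HDFS or S3 the
    // folder must be a URI that the configured storage component can resolve,
    // for example "hdfs://namenode:8020/models/matching" (placeholder).
    model.write.overwrite().save(folder)
  }
}
```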
Generate feature importance report | Select this check box to generate a report that contains a summary of the model and the settings. For more information, see Feature importance report. You can save the report on the local file system or, through a storage configuration component, on a specific file system such as HDFS or S3. |
Integration with Data Stewardship |
Select this check box to set the connection parameters to the Talend Data Stewardship server. If you select this check box, tMatchModel uses the sample suspect records labeled in a Grouping campaign defined on the Talend Data Stewardship server, which means this component can be used as a standalone component. |
Data Stewardship Configuration |
Available when the Integration with Data Stewardship check box is selected. |
Advanced settings
Max token number for phonetic comparison |
Set the maximum number of tokens to be used in the phonetic comparison. When the number of tokens exceeds the value defined in this field, no phonetic comparison is done on the string. |
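A minimal sketch of this gating rule, assuming the Soundex encoder from Apache Commons Codec as the phonetic encoding (the encoder tMatchModel actually uses is not specified here):

```scala
import org.apache.commons.codec.language.Soundex

object PhoneticSketch {
  private val soundex = new Soundex()

  // Compare two strings phonetically only when neither exceeds maxTokens;
  // returns None when the comparison is skipped.
  def phoneticMatch(a: String, b: String, maxTokens: Int): Option[Boolean] = {
    val tokensA = a.trim.split("\\s+")
    val tokensB = b.trim.split("\\s+")
    if (tokensA.length > maxTokens || tokensB.length > maxTokens) {
      None // too many tokens: no phonetic comparison is done
    } else {
      // Compare the phonetic encodings of the aligned token pairs
      Some(tokensA.zip(tokensB).forall { case (x, y) =>
        soundex.encode(x) == soundex.encode(y)
      })
    }
  }
}
```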
Random Forest hyper-parameters tuning |
Number of trees range: Enter a range for the decision trees you want to build. Each decision tree is trained independently using a random sample of features. Increasing this range can improve the accuracy by decreasing the variance in predictions, but will increase the training time.
Maximum tree-depth range: Enter a range for the decision tree depth at which the training should stop adding new nodes. New nodes represent further tests on features on internal nodes and possible class labels held by leaf nodes. Generally speaking, a deeper decision tree is more expressive and thus potentially more accurate in predictions, but it is also more resource consuming and prone to overfitting. |
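In Spark ML terms, these two ranges correspond to a hyper-parameter grid over numTrees and maxDepth. A minimal sketch, assuming RandomForestClassifier and illustrative range values:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.tuning.ParamGridBuilder

object TuningGridSketch {
  val rf = new RandomForestClassifier()

  // For example, a "number of trees" range of 10 to 50 and a
  // "maximum tree-depth" range of 4 to 8
  val grid = new ParamGridBuilder()
    .addGrid(rf.numTrees, Array(10, 20, 50))
    .addGrid(rf.maxDepth, Array(4, 6, 8))
    .build()
}
```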
Checkpoint Interval |
Set the frequency of checkpoints. It is recommended to leave the default value (10). Before setting a value for this parameter, activate checkpointing and set the checkpoint directory in the Spark Configuration tab of the Run view. For further information about checkpointing, see Logging and checkpointing the activities of your Apache Spark Job. |
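A minimal sketch of how checkpointing is wired up in Spark ML tree training; in a Talend Job the checkpoint directory comes from the Spark Configuration tab rather than code, and the path below is a placeholder:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.RandomForestClassifier

object CheckpointSketch {
  def configure(spark: SparkSession): RandomForestClassifier = {
    // Checkpointing must be activated by setting a checkpoint directory
    spark.sparkContext.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints")
    new RandomForestClassifier()
      .setCacheNodeIds(true)     // node-ID caching is required for checkpointing
      .setCheckpointInterval(10) // the recommended default
  }
}
```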
Cross-validation parameters |
Number of folds: Enter the number of folds into which the input dataset is split; each fold is used in turn as the test dataset while the remaining folds are used for training.
Evaluation metric type: Select a type from the list. For further information, see Precision and recall. |
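In Spark ML terms, the number of folds and the evaluation metric correspond to CrossValidator's numFolds and the evaluator's metric name. A minimal sketch, assuming a multiclass evaluator with weighted precision (weightedRecall is the other common choice):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object CrossValidationSketch {
  val rf = new RandomForestClassifier()

  val cv = new CrossValidator()
    .setEstimator(rf)
    .setEstimatorParamMaps(new ParamGridBuilder().build())
    .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("weightedPrecision"))
    .setNumFolds(5) // each fold serves once as the test set
}
```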
Random Forest parameters |
Subsampling rate: Enter the numeric value to indicate the fraction of the input dataset used for training each tree in the forest. The default value 1.0 is recommended, meaning to take the whole dataset for training.
Subset Strategy: Select the strategy that determines how many features should be considered on each internal node in order to appropriately split this internal node (actually the training set or subset of a feature on this node) into smaller subsets. These subsets are used to build child nodes. Each strategy takes a different number of features into account to find the optimal point among these features for the split. This point could be, for example, the value 35 of the feature age. |
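A minimal sketch of the corresponding Spark ML parameters, assuming RandomForestClassifier; the strategy names in the comment are the values Spark ML accepts:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

object ForestSamplingSketch {
  val rf = new RandomForestClassifier()
    .setSubsamplingRate(1.0)          // 1.0: each tree sees the whole training set
    .setFeatureSubsetStrategy("auto") // e.g. "auto", "all", "sqrt", "log2", "onethird"
}
```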
Max Bins |
Enter the numeric value to indicate the maximum number of bins used for splitting features. The continuous features are automatically transformed to ordered discrete features. |
Minimum information gain |
Enter the minimum information gain expected from a parent node to its child nodes, that is, the impurity of the parent node minus the weighted sum of the impurities of its child nodes. When the information gain of a candidate split is less than this minimum, the split is not made. The default value is 0.0: if no further information is obtained by splitting a given node, the splitting can be stopped. For further information about how the information gain is calculated, see Impurity and Information gain from the Spark documentation. |
Min instance per Node |
Enter the minimum number of training instances a node must have for it to be valid for further splitting. The default value is 1, which means a node stops splitting when it has only 1 row of training data. |
Impurity |
Select the measure used to select the best split from each set of splits.
For further information about how each of the measures is calculated, see Impurity measures from the Spark documentation. |
Set a random seed |
Enter the random seed number to be used for bootstrapping and choosing feature subsets. |
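Taken together, the node-splitting settings above map onto a handful of RandomForestClassifier parameters. A minimal sketch with illustrative values, assuming Spark ML:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

object SplitControlSketch {
  val rf = new RandomForestClassifier()
    .setMaxBins(32)            // maximum number of bins for discretizing continuous features
    .setMinInfoGain(0.0)       // a split must gain at least this much to be kept
    .setMinInstancesPerNode(1) // children with fewer training rows invalidate the split
    .setImpurity("gini")       // impurity measure used to score candidate splits
    .setSeed(12345L)           // fixed seed for reproducible bootstrapping and feature subsets
}
```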
Data Stewardship Configuration |
Available when the Integration with Data Stewardship check box in the Basic settings is selected.
Campaign Name: Displays the technical name of the campaign once it is selected in the Basic settings. However, you can modify the field value, for example to replace it with a context parameter and pass context variables to the Job at runtime. This technical name is always used to identify a campaign when the Job communicates with Talend Data Stewardship, whatever the value in the Campaign field.
Batch Size: Specify the number of records to be processed in each batch. Do not change the default value unless you are facing performance issues: increasing the batch size can improve performance, but setting too high a value could cause Job failures. |
Usage
Usage rule |
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. |
Spark Batch Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files. This connection is effective on a per-Job basis. |