tModelEncoder properties for Apache Spark Batch
These properties are used to configure tModelEncoder running in the Spark Batch Job framework.
The Spark Batch tModelEncoder component belongs to the Machine Learning family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Schema and Edit Schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

An output column must be named differently from every input column: the successive transformations from the input side to the output side take place in the same DataFrame (the Spark term for a schema-based data collection), so the output columns are actually added to that same DataFrame alongside the input columns.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
- View schema: select this option to view the schema only.
- Change to built-in property: select this option to change the schema to Built-in for local changes.
- Update repository connection: select this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion.
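To see why an output column name must not collide with any input column name, consider this minimal sketch in plain Spark ML (not Talend-generated code; the Tokenizer transformer and the sentence/words column names are illustrative assumptions). The transformer appends its output column to the same DataFrame that holds the input columns:

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tModelEncoderSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input data with a single text column.
    val input = Seq("Talend Spark Batch", "feature processing").toDF("sentence")

    // The transformer adds its output column to the same DataFrame,
    // so "words" must not reuse an existing column name.
    val tokenizer = new Tokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")

    val output = tokenizer.transform(input)
    output.printSchema() // both "sentence" and "words" are present

    spark.stop()
  }
}
```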
Transformation table
Complete this table using columns from the input and the output schemas and the feature-processing algorithms to be applied on these columns. The algorithms available in the Transformation column vary depending on the type of the input schema columns to be processed. For further information about the algorithms available for each type of input data, see ML feature-processing algorithms in Talend.
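As a rough illustration of how the applicable algorithm depends on the input column type, the following plain Spark ML sketch (not Talend-generated code; the column names and data are made up) applies StringIndexer to a string column and HashingTF to an array-of-strings column:

```scala
import org.apache.spark.ml.feature.{HashingTF, StringIndexer}
import org.apache.spark.sql.SparkSession

object TransformationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TransformationTableSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical columns: a categorical string and pre-tokenized text.
    val df = Seq(
      ("red",  Array("spark", "batch")),
      ("blue", Array("machine", "learning"))
    ).toDF("color", "words")

    // StringIndexer is applicable to a string column ...
    val indexed = new StringIndexer()
      .setInputCol("color")
      .setOutputCol("colorIndex")
      .fit(df)
      .transform(df)

    // ... while HashingTF expects an array-of-strings column.
    val featurized = new HashingTF()
      .setInputCol("words")
      .setOutputCol("wordFeatures")
      .setNumFeatures(64)
      .transform(indexed)

    featurized.show(truncate = false)
    spark.stop()
  }
}
```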
Usage
Usage rule
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is, traditional Talend data integration Jobs.
Spark Connection
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files. This connection is effective on a per-Job basis.
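For readers curious about what such a connection amounts to outside of Talend Studio, here is a minimal plain-Spark sketch; the cluster manager and the jar location below are placeholder assumptions, and in practice Talend generates and manages these settings for you from the Spark Configuration tab:

```scala
import org.apache.spark.sql.SparkSession

object ConnectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkBatchJob")
      // Placeholder cluster manager; set by the Spark Configuration tab in Talend.
      .master("yarn")
      // Location of the dependent jar files transferred to the file system
      // so that Spark executors can access them (here via the spark.jars property).
      .config("spark.jars", "hdfs:///user/talend/lib/job-dependencies.jar")
      .getOrCreate()

    // ... Job logic ...

    spark.stop()
  }
}
```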