tExtractDelimitedFields properties for Apache Spark Batch
These properties are used to configure tExtractDelimitedFields running in the Spark Batch Job framework.
The Spark Batch tExtractDelimitedFields component belongs to the Processing family.
The component in this framework is available in all Talend products with Big Data and Talend Data Fabric.
Basic settings
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Edit schema to make changes to the schema.
Note: If you make changes, the schema automatically becomes built-in.
|
Built-In: You create and store the schema locally for this component only. |
|
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
|
Prev.Comp.Column list |
Select the column you need to extract data from. |
Die on error |
Select the check box to stop the execution of the Job when an error occurs. |
Field separator |
Enter a character, a string, or a regular expression to separate fields for the transferred data. |
CSV options |
Select this check box to include CSV specific parameters such as Escape char and Text enclosure.
Important: With Spark version 2.0 and onward, special characters must be escaped, that is, "\\" and "\"" instead of "\" and """.
|
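The effect of the Field separator and CSV options settings can be illustrated with a minimal Python sketch. It is not the code that the component generates; the function names and the is_regex flag are hypothetical and only show how a literal separator, a regular-expression separator, and CSV-style text enclosure and escape characters change the way one column value is split into fields.

```python
import csv
import io
import re


def split_fields(value, separator, is_regex=False):
    """Split one column value on a literal string or a regular expression.

    is_regex is a hypothetical flag used only in this sketch.
    """
    pattern = separator if is_regex else re.escape(separator)
    return re.split(pattern, value)


def split_csv(value, separator=",", enclosure='"', escape="\\"):
    """Split one column value using CSV rules (text enclosure, escape char)."""
    reader = csv.reader(
        io.StringIO(value),
        delimiter=separator,
        quotechar=enclosure,
        escapechar=escape,
    )
    return next(reader)


# Literal single-character separator.
print(split_fields("1;Alice;Paris", ";"))
# Regular expression: a pipe surrounded by optional whitespace.
print(split_fields("1 | Alice  |Paris", r"\s*\|\s*", is_regex=True))
# CSV rules: the separator inside the text enclosure is kept as data.
print(split_csv('1,"Smith, Alice",Paris'))
```

Without CSV rules, the last value would be split into four fields, because the comma inside the quoted name would be treated as a separator.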
Advanced settings
Custom Encoding |
You may encounter encoding issues when you process the stored data. In that situation, select this check box to display the Encoding list. Then select the encoding to be used from the list or select Custom and define it manually. |
Advanced separator (for number) |
Select this check box to change the separator used for numbers. By default, the thousands separator is a comma (,) and the decimal separator is a period (.). |
Trim all columns |
Select this check box to remove leading and trailing whitespace from all columns. When this check box is cleared, the Check column to trim table is displayed, which lets you select particular columns to trim. |
Check column to trim |
This table is filled automatically with the schema being used. Select the check box(es) corresponding to the column(s) to be trimmed. |
Check each row structure against schema |
Select this check box to check whether the total number of columns in each row is consistent with the schema. If not consistent, an error message will be displayed on the console. |
Check date |
Select this check box to check the date format strictly against the input schema. |
Decode String for long, int, short, byte Types |
Select this check box if any of your numeric types (long, integer, short, or byte) will be parsed from a hexadecimal or octal string. |
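Several of the advanced settings above describe simple per-field transformations, which the following Python sketch illustrates. It is an approximation for explanation only, not the component's implementation: all function names are hypothetical, the date pattern is an example, and the prefix conventions assumed for hexadecimal (0x) and octal (leading 0) strings are assumptions, not confirmed behavior.

```python
from datetime import datetime


def parse_number(text, thousands=",", decimal="."):
    """Parse a number written with custom thousands and decimal separators."""
    return float(text.replace(thousands, "").replace(decimal, "."))


def trim_columns(row, columns=None):
    """Strip whitespace from every column, or only from the named columns."""
    return {
        name: value.strip() if columns is None or name in columns else value
        for name, value in row.items()
    }


def check_row_structure(fields, schema):
    """Raise an error if the field count does not match the schema."""
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(fields)}")


def is_valid_date(text, pattern="%Y-%m-%d"):
    """Return True only if the text is a real date in the given pattern."""
    try:
        datetime.strptime(text, pattern)
        return True
    except ValueError:
        return False


def decode_integer(text):
    """Parse decimal, hexadecimal (0x prefix), or octal (leading 0) strings."""
    stripped = text.strip()
    sign = -1 if stripped.startswith("-") else 1
    digits = stripped.lstrip("+-")
    if digits.lower().startswith("0x"):
        return sign * int(digits[2:], 16)
    if len(digits) > 1 and digits.startswith("0"):
        return sign * int(digits[1:], 8)
    return sign * int(digits)


# European-style number: period for thousands, comma for decimals.
print(parse_number("1.234,56", thousands=".", decimal=","))
# Trim only the "name" column.
print(trim_columns({"id": " 1 ", "name": " Alice "}, columns={"name"}))
# Hexadecimal and octal strings decoded to integers.
print(decode_integer("0x1F"), decode_integer("017"))
```

In this sketch a row with the wrong number of fields raises an error, whereas the component reports the inconsistency as a message on the console.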
Usage
Usage rule |
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |