Storage
|
To connect to an HDFS installation, select the Define a storage configuration component
check box and then select the name of the component to use from those
available in the drop-down list.
This option requires you to have previously configured the
connection to the HDFS installation to be used, as described in the
documentation
for the tHDFSConfiguration
component.
If you leave the Define a
storage configuration component check box unselected,
you can only convert files locally.
|
Configure Component
|
To configure the component, click the [...] button and, in the Component Configuration window, perform
the following actions.
-
Click the Select button next to the Record Map field, select the map you want to use in the Select a Map dialog box that opens, and then click OK.
This map must have been previously created in
Talend Data Mapper
.
Note that the input and output representations
are those defined in the map, and cannot be changed in the
component.
-
Click Next.
-
Tell the component where each new record begins.
To do this, you need to fully understand the structure of your data.
The exact procedure varies depending on the input representation being used, and you will be presented with one of the following options.
-
Select an appropriate
record delimiter for your data. Note that you must
specify this value without quotes.
-
Separator lets you specify a separator indicator, such as \n, to identify a new record. Supported indicators are \n for a Unix-type new line, \r\n for Windows, \r for Mac, and \t for tab characters.
-
Start/End with lets you specify the initial characters that indicate a new record, such as <root, or the characters that indicate where a record ends. This can also be a regular expression.
Start with also supports new-line and tab indicators: \n for a Unix-type new line, \r\n for Windows, \r for Mac, and \t for tab characters.
-
Sample File: To test the signature with a sample file, click the [...] button, browse to the file you want to use as a sample, click Open, and then click Run to test your sample.
Testing the signature lets you check that the total number of records and their minimum and maximum lengths correspond to what you expect based on your knowledge of the data. This step assumes you have a local subset of your data to use as a sample. For an illustration of what this check computes, see the sketch after this list.
-
If your input representation is COBOL or
Flat with positional and/or binary encoding properties,
define the signature for the input record structure:
-
Input Record root
corresponds to the root element in your input
record.
-
Minimum Record
Size corresponds to the size in bytes
of the smallest record. If you set this value too
low, you may encounter performance issues, since
the component will perform more checks than
necessary when looking for a new record.
-
Maximum Record
Size corresponds to the size in bytes
of the largest record, and is used to determine
how much memory is allocated to read the
input.
-
Sample from Workspace or Sample from File System: To test the signature with a sample file, click the [...] button, and then browse to the file you want to use as a sample.
Testing the signature lets you check that the total number of records and their minimum and maximum lengths correspond to what you expect based on your knowledge of the data. This step assumes you have a local subset of your data to use as a sample.
-
Footer Size
corresponds to the size in bytes of the footer, if
any. At runtime, the footer will be ignored rather
than being mistakenly included in the last record.
Leave this field empty if there is no footer.
-
Click the Next button to open the Signature Parameters window, select the fields that define the signature of your record input structure (that is, the fields that identify where a new record begins), update the Operation and Value columns as appropriate, and then click Next.
-
In the Record Signature Test window that opens, check that your records are correctly delineated by scrolling through them with the Back and Next buttons and performing a visual check, and then click Finish.
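To make the signature test concrete, here is a minimal standalone sketch (not Talend code; the file name sample.dat and the delimiters \n and <root are assumptions chosen for the example). It computes the same figures the test reports, namely the total number of records and their minimum and maximum lengths, for a Separator-style and a Start with-style signature:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.IntSummaryStatistics;

public class SignatureCheck {

    // Print what the signature test reports: record count plus the
    // minimum and maximum record length for a given split of the sample.
    static void report(String label, String[] records) {
        IntSummaryStatistics stats = Arrays.stream(records)
                .mapToInt(String::length)
                .summaryStatistics();
        System.out.printf("%s -> %d records, min length %d, max length %d%n",
                label, stats.getCount(), stats.getMin(), stats.getMax());
    }

    public static void main(String[] args) throws IOException {
        // Load a local sample of the data (file name is a placeholder).
        String content = new String(
                Files.readAllBytes(Paths.get("sample.dat")), StandardCharsets.UTF_8);

        // Separator signature: records are terminated by a Unix new line (\n).
        report("Separator \\n", content.split("\n"));

        // Start with signature: every record begins with "<root".
        // The zero-width lookahead keeps the marker inside each record.
        report("Start with <root", content.split("(?=<root)"));
    }
}

Running this against a local sample gives the same sanity check as the Run button: if the record count or the length range is off, the chosen delimiter does not match the actual record structure.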
|
Input
|
Click the [...]
button to define the path to where the input file is stored.
|
Output
|
Click the [...]
button to define the path to where the output files will be stored.
|
Action
|
From the drop-down list, select:
|
Open Map Editor
|
Click the [...]
button to open the map for editing in the Map
Editor of
Talend Data Mapper
.
For more information, see the Talend Data Mapper User Guide.
|
Die on error
|
This check box is selected by default.
Clear the check box to skip any rows on error and complete the process for error-free rows.
If you clear the check box, you can use either of these options:
-
Connect the tHMapFile component to an output component, for example tAvroOutput, using a Reject connection. In the output component, ensure that you add a fixed metadata with the following columns:
- inputRecord: contains the rejected
input record during the transformation.
- recordId: refers to the record
identifier. For a text or binary input, the recordId
specifies the start offset of the record in the
input file. For an AVRO input, the recordId
specifies the timestamp when the input was
processed.
- errorMessage: contains the
transformation status with details of the cause of
the transformation error.
-
If the check box is unselected, you can retrieve the rejected records in a file. Either of these mechanisms triggers this feature: (1) a context variable (talend_transform_reject_file_path) or (2) a system variable set in the Advanced Job parameters (spark.hadoop.talend.transform.reject.file.path). For example settings, see the sketch at the end of this section.
When you set the file path on the Hadoop Distributed File System (HDFS), no further configuration is needed. When you set the file path on Amazon S3 or any other Hadoop-compatible file system, add the associated Spark advanced configuration parameter.
In case of errors at runtime, tHMapFile checks whether one of these mechanisms exists and, if so, appends the rejected record to the designated file. The reject file contains the concatenation of the rejected records, without any additional metadata.
If the file system you use does not support appending to a file, a separate file is created for each rejection. The file uses the provided file path as the prefix and adds a suffix composed of the record's offset in the input file and the size of the rejected record.
Note: Any errors while trying to store the reject are logged and the processing continues.
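Both mechanisms take the reject file path as their value. For illustration only (the variable and property names are from this section; the HDFS URI is a placeholder), the settings might look like this:

talend_transform_reject_file_path = "hdfs://namenode:8020/user/talend/rejects"
    (context variable, set in the Contexts view of the Job)

spark.hadoop.talend.transform.reject.file.path = "hdfs://namenode:8020/user/talend/rejects"
    (system variable, set in the Advanced Job parameters)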
|
Merge result to single
file
|
By default, tHMapFile creates several part files. Select this check box to merge these files into a single file.
The following options are used to manage the source and
the target files:
-
Merge File Path: enter the path to the file that will contain the merged content from all parts.
-
Remove source
dir: select this check box to remove the source
files after the merge.
-
Override target file: select this check box to overwrite a file that already exists in the target location. This option does not overwrite the folder.
-
Include
Header: select this check box to add the CSV
header to the beginning of the merged file. This option is only
used for CSV outputs. For other representations, it has no
effect on the target file.
Warning: Using this option with an Avro output creates an invalid Avro file. Since each part starts with an Avro schema header, the merged file would contain more than one Avro schema, which is invalid.
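Because each part carries its own schema header, a valid merge has to rewrite the records under a single header rather than concatenate bytes. The following sketch shows one way to do this outside of Talend with the Apache Avro Java API; the part and output file names are placeholders, and it assumes all parts were written with the same schema:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Sketch: merge Avro part files by rewriting records under one schema
// header instead of concatenating bytes. File names are placeholders;
// assumes every part was written with the same schema.
public class AvroMergeSketch {
    public static void main(String[] args) throws IOException {
        File[] parts = { new File("part-00000.avro"), new File("part-00001.avro") };

        // Take the schema from the first part; a byte-level merge would
        // instead leave extra schema headers inside the file body.
        Schema schema;
        try (DataFileReader<GenericRecord> first = new DataFileReader<>(
                parts[0], new GenericDatumReader<GenericRecord>())) {
            schema = first.getSchema();
        }

        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("merged.avro")); // writes the single header
            for (File part : parts) {
                try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                        part, new GenericDatumReader<GenericRecord>())) {
                    for (GenericRecord record : reader) {
                        writer.append(record);
                    }
                }
            }
        }
    }
}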
|