Streaming execution allows you to process inputs that are too large to fit in memory. Without
streaming execution, the entire input of the transformation is loaded into memory before the
transformation is executed, which limits the amount of data that can be transformed to what fits
in the available memory. If you disable streaming execution for input files larger than 500MB,
an error occurs.
Note that this documentation only applies to tHMap and cMap components. For Spark Batch
components such as tHMapFile or tHMapInput, you do not need to enable the streaming
execution.
Component settings
For components that process large files, such as tFileInputRaw, you must select the
Stream the file parameter in the Basic
settings view. This avoids reading the entire input file into a String or
a byte array in memory.
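As a conceptual illustration only (this is not how tFileInputRaw is implemented), the following Python sketch shows the difference that streaming makes: the data is copied one fixed-size chunk at a time, so memory use stays bounded by the chunk size rather than by the file size. The function name copy_in_chunks and the 64 KB chunk size are assumptions made for this example.

```python
import io


def copy_in_chunks(source, sink, chunk_size=64 * 1024):
    """Copy a binary stream to another one chunk at a time.

    Only one chunk is held in memory at any moment, instead of the
    whole content as a single bytes object. Returns the byte count.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the end of the stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# In-memory streams stand in for real files in this example.
source = io.BytesIO(b"x" * 200_000)
sink = io.BytesIO()
copied = copy_in_chunks(source, sink)
```

The same loop works unchanged with files opened in binary mode, which is where the bounded memory use matters.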
You must always use tHMap together with tFileOutputRaw. When you use tHMap, you must
select the InputStream (single column) parameter in the
Basic settings view. This avoids storing the entire output in
memory, and ensures that tFileOutputRaw starts writing without waiting for tHMap to
produce all of its output.
How streaming execution works
The streaming execution of tHMap works by accumulating blocks of input data and then
executing the transformation on each block separately.
To specify that the transformation is to stream, select the Stream
Input property on the SimpleLoop function.
When this property is selected, the transformation (either XQuery or DSQL) is executed once per
block of looping elements. By default, a block contains 1000 looping elements. You can change
this block size by defining a context variable called
transform_streaming_block_count and setting it to a positive numeric value.
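The block-based behavior described above can be sketched in Python as follows. This is a simplified model rather than the tHMap implementation: the function stream_in_blocks and the sample transformation are hypothetical, and the block_count argument only plays the role that transform_streaming_block_count plays in a Job.

```python
from itertools import islice

DEFAULT_BLOCK_COUNT = 1000  # default block size stated in the text


def stream_in_blocks(elements, transform, block_count=DEFAULT_BLOCK_COUNT):
    """Accumulate looping elements into blocks and transform each block.

    At most one block of input is held in memory at a time.
    """
    if block_count <= 0:
        raise ValueError("block_count must be a positive number")
    iterator = iter(elements)
    while True:
        block = list(islice(iterator, block_count))
        if not block:  # input exhausted
            return
        yield from transform(block)


block_sizes = []


def counting_transform(block):
    """Hypothetical transformation: record the block size, uppercase rows."""
    block_sizes.append(len(block))
    return [row.upper() for row in block]


# 2500 input rows are produced lazily and consumed block by block.
rows = (f"row{i}" for i in range(2500))
result = list(stream_in_blocks(rows, counting_transform))
```

With 2500 input elements and the default block size, the transformation runs three times, on blocks of 1000, 1000, and 500 elements.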
Limitations
The following limitations apply when you select the
Stream Input property:
Only one SimpleLoop function in your map may have the
Stream Input property enabled. If you enable more than one,
only the highest level one will be taken into account, and the others are ignored.
The looping element on which you enable the Stream Input
property must not have a looping sibling. This applies to any ancestor of the looping
element too. Therefore, a map with multiple outputs cannot be streamed.
If you select the Stream Input property on the
SimpleLoop function, you cannot use sort keys, since the sort
action cannot be performed while streaming.
If you select the Stream Input property on the
SimpleLoop function, and you also select a distinct child
element, the input is assumed to be already sorted by that child element, so that the
distinct calculation can be done without further sorting.
The following example of an output structure summarizes the information above:
root
|-row(0:*) can stream
|-a
|-b(0:*) can stream (if row does not stream)
|-c
|-d(0:*) cannot stream (has sibling loop h)
|-e
|-f(0:*) cannot stream (parent d has sibling loop h)
|-g(0:*) cannot stream (grand parent d has sibling loop h)
|-h(0:0) cannot stream (has sibling loop d)
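The sibling rule can also be expressed as a small check. The Python sketch below is an illustration written for this page, not part of the product: the Node class, the streamable_loops function, and both sample structures are hypothetical. It reads the rule as stated above, so a loop is eligible only if neither the loop nor any of its ancestors has a looping sibling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    is_loop: bool = False
    children: list = field(default_factory=list)


def streamable_loops(root):
    """Return the names of the loops that satisfy the sibling rule.

    A loop qualifies only if neither it nor any of its ancestors
    has a looping sibling.
    """
    result = []

    def visit(node, ancestors_ok):
        loops = [child for child in node.children if child.is_loop]
        for child in node.children:
            # True if another child of the same parent is a loop.
            has_loop_sibling = any(other is not child for other in loops)
            ok = ancestors_ok and not has_loop_sibling
            if child.is_loop and ok:
                result.append(child.name)
            visit(child, ok)

    visit(root, True)
    return result


# Hypothetical structure with a single chain of nested loops.
nested = Node("root", children=[
    Node("orders", is_loop=True, children=[
        Node("id"),
        Node("lines", is_loop=True),
    ]),
])

# Hypothetical structure with two sibling loops under the root.
siblings = Node("root", children=[
    Node("d", is_loop=True, children=[Node("f", is_loop=True)]),
    Node("h", is_loop=True),
])

nested_result = streamable_loops(nested)
siblings_result = streamable_loops(siblings)
```

In the first structure both orders and lines are eligible, although only one of them can have the property enabled at a time. In the second structure no loop is eligible, because d and h are sibling loops and f inherits the restriction from its parent d.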