
Streaming execution

Streaming execution allows you to process unlimited amounts of data. Without streaming execution, the entire input of the transformation is stored in memory before the transformation is executed, which limits the amount of data that can be transformed to what fits in the available memory. If you disable streaming execution for input files larger than 500 MB, the execution results in an error.

Note that this documentation applies only to the tHMap and cMap components. For Spark Batch components such as tHMapFile or tHMapInput, you do not need to enable streaming execution.

Component settings

  • For components that process large files, such as tFileInputRaw, you must select the Stream the file parameter in the Basic settings view. This avoids reading the entire input file into a String or a byte array in memory:
    tFileInputRaw Basic settings view.
  • For tHMap, you must always use tFileOutputRaw as the output component, and you must select the InputStream (single column) parameter in the tHMap Basic settings view. This avoids storing the entire output in memory, and ensures that tFileOutputRaw starts writing without waiting for tHMap to produce all of its output:
    tHMap Basic settings view.
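
The two settings above serve the same purpose at each end of the Job: the input is read incrementally, and the output is written as soon as it is produced. The following Python sketch illustrates that idea only. It is not Talend code, and the names read_chunks, transform, and write_stream are hypothetical.

```python
import io

def read_chunks(source, chunk_size=4):
    """Yield the input piece by piece instead of loading it into one string."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk

def transform(chunks):
    """Stand-in transformation: upper-case each chunk as it arrives."""
    for chunk in chunks:
        yield chunk.upper()

def write_stream(chunks, sink):
    """Write each result immediately; nothing waits for the full output."""
    written = 0
    for chunk in chunks:
        sink.write(chunk)
        written += 1
    return written

# In-memory streams stand in for the input and output files.
source = io.StringIO("streaming keeps memory flat")
sink = io.StringIO()
count = write_stream(transform(read_chunks(source)), sink)
```

Because every stage is a generator, at most one chunk is held at a time, regardless of the total input size.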

How streaming execution works

The streaming execution of tHMap works by accumulating blocks of input data and then executing the transformation on each block separately.

You specify that the transformation is to stream by checking the Stream Input property on the SimpleLoop function:
Expression SimpleLoop Properties dialog box.
When this property is selected, the transformation (either XQuery or DSQL) is executed once per block of looping elements rather than once for the whole input. By default, a block contains 1000 looping elements. You can change this value by defining a context variable called transform_streaming_block_count and setting it to a positive numeric value:
Context view of a Job.
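
The block mechanism can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the tHMap implementation: a plain dictionary stands in for the Job context, the helper names are hypothetical, and raising an error for a non-positive value is an assumption of this sketch rather than documented behavior.

```python
from itertools import islice

DEFAULT_BLOCK_COUNT = 1000  # default block size stated in this documentation

def resolve_block_count(context):
    """Read transform_streaming_block_count from a dict standing in for the Job context."""
    raw = context.get("transform_streaming_block_count")
    if raw is None:
        return DEFAULT_BLOCK_COUNT
    value = int(raw)
    if value <= 0:
        # Assumption: the sketch rejects non-positive values outright.
        raise ValueError("transform_streaming_block_count must be positive")
    return value

def blocks(elements, block_count):
    """Group looping elements into lists of at most block_count items."""
    iterator = iter(elements)
    while True:
        block = list(islice(iterator, block_count))
        if not block:
            return
        yield block

def run_streaming(elements, transform, context=None):
    """Run the transformation once per block and yield its results in order."""
    block_count = resolve_block_count(context or {})
    for block in blocks(elements, block_count):
        yield from transform(block)

def double_all(block):
    """Stand-in transformation applied to one block at a time."""
    return [x * 2 for x in block]
```

With the default value, 2500 looping elements are processed as three blocks of 1000, 1000, and 500 elements, so only one block is held in memory at a time.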

Limitations

The following limitations apply when you select the Stream Input property:
  • Only one SimpleLoop function in your map may have the Stream Input property enabled. If you enable it on more than one, only the highest-level one is taken into account, and the others are ignored.
  • The looping element on which you enable the Stream Input property must not have a looping sibling. This applies to any ancestor of the looping element too. Therefore, a map with multiple outputs cannot be streamed.
  • If you select the Stream Input property on the SimpleLoop function, you cannot use sort keys, since the sort action cannot be performed while streaming.
  • If you select the Stream Input property on the SimpleLoop function, and you also select a distinct child element, the input is assumed to be already sorted by that child element, so that the distinct calculation can be done without further sorting.
The following example of an output structure summarizes these rules:
root
  |-row(0:*)              can stream
    |-a
    |-b(0:*)              can stream (if row does not stream)
      |-c
      |-d(0:*)            cannot stream (has sibling loop h)
        |-e
        |-f(0:*)          cannot stream (parent d has sibling loop h)
          |-g(0:*)        cannot stream (grandparent d has sibling loop h)
      |-h(0:*)            cannot stream (has sibling loop d)
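
The sibling rule behind the annotations in this tree can be expressed as a short check. The Python sketch below is illustrative only: the Node class and the streamable_loops function are hypothetical, and every element annotated as a loop in the tree, including h, is modeled with a loops flag.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    loops: bool = False
    children: list = field(default_factory=list)

def streamable_loops(root):
    """Return the names of looping elements that satisfy the sibling rule."""
    result = []

    def visit(node, siblings, blocked):
        # blocked becomes True once this node or any ancestor has a looping sibling
        blocked = blocked or any(s.loops for s in siblings if s is not node)
        if node.loops and not blocked:
            result.append(node.name)
        for child in node.children:
            visit(child, node.children, blocked)

    visit(root, [], False)
    return result

# The example structure from this section.
tree = Node("root", children=[
    Node("row", loops=True, children=[
        Node("a"),
        Node("b", loops=True, children=[
            Node("c"),
            Node("d", loops=True, children=[
                Node("e"),
                Node("f", loops=True, children=[Node("g", loops=True)]),
            ]),
            Node("h", loops=True),
        ]),
    ]),
])
```

For this structure, the check returns row and b as candidates. Because only one SimpleLoop function may have Stream Input enabled, you would still pick a single element from that list.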
