Handling Lookups
When implementing a join (including an Inner Join or a Left Outer Join) in a tMap between different data sources, there is always exactly one main flow and one or more lookup flows connected to the tMap. The records of the lookup flow need to be loaded before each record of the main flow is processed. Three lookup loading models are provided to suit different business requirements and performance needs: Load once, Reload at each row, and Reload at each row (cache). A conceptual sketch contrasting the three models follows the list below.
- Load once: loads all the records from the lookup flow once (and only once), either into memory or, if the Store temp data option is set to true, into a local file, before the records of the main flow are processed. This is the default setting and the preferred option when the main flow contains a large set of records to be joined with the lookup flow.
- Reload at each row: loads all the records of the lookup flow for each record of the main flow. This option generally increases the Job execution time because the lookup flow is loaded repeatedly, once per main flow record. However, it is preferred in the following situations:
  - The lookup flow is constantly updated and you want each record of the main flow to be joined against the latest lookup data;
  - The main flow contains very few records while the lookup flow reads a large amount of data from a database table. In this case, loading all the lookup data at once with the Load once option might cause an OutOfMemory exception. Instead, you can use dynamic variable settings, such as a WHERE clause, to filter the lookup flow on the fly as it is loaded, before the join with the main flow is processed; see the query sketch after this list. For an example, refer to Reloading data at each row.
Note that in a Spark Streaming Job, Reload at each row is supported only by Lookup Input components, such as tMongoDBLookupInput.
- Reload at each row (cache): functions like the Reload at each row model in that all the records of the lookup flow are loaded for each record of the main flow, but it cannot be combined with the Store temp data on disk option. The lookup data is cached in memory, and when a new loading occurs, only the records that do not already exist in the cache are loaded, so the same records are never loaded twice. This option optimizes the processing time and helps improve the performance of the tMap component.
Note that in a Spark Streaming Job, Reload at each row (cache) is supported only by Lookup Input components, such as tMongoDBLookupInput.
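To make the three loading models concrete, here is a minimal, hand-written Java sketch of the three strategies. It is not the code tMap generates; Row, fetchLookup, and mainRows are hypothetical stand-ins for the lookup query and the flows.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LookupModels {

    // Hypothetical record type standing in for a row of either flow.
    record Row(int key, String value) {}

    // Simulates querying the lookup source; a null filter returns everything.
    static List<Row> fetchLookup(Integer keyFilter) {
        List<Row> all = List.of(new Row(1, "a"), new Row(2, "b"));
        return keyFilter == null
                ? all
                : all.stream().filter(r -> r.key() == keyFilter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> mainRows = List.of(new Row(1, "x"), new Row(2, "y"), new Row(1, "z"));

        // Load once: fetch the whole lookup a single time, then join each main row.
        Map<Integer, Row> table = new HashMap<>();
        fetchLookup(null).forEach(r -> table.put(r.key(), r));
        for (Row m : mainRows) {
            System.out.println("load once:       " + m + " -> " + table.get(m.key()));
        }

        // Reload at each row: re-query the lookup source for every main row, so
        // each join sees the latest data (and the query can be filtered per row).
        for (Row m : mainRows) {
            System.out.println("reload each row: " + m + " -> " + fetchLookup(m.key()));
        }

        // Reload at each row (cache): same as above, but fetched results stay in
        // memory, so a key that was already seen is not fetched from the source again.
        Map<Integer, List<Row>> cache = new HashMap<>();
        for (Row m : mainRows) {
            System.out.println("reload (cache):  " + m + " -> "
                    + cache.computeIfAbsent(m.key(), LookupModels::fetchLookup));
        }
    }
}
```

The cache variant trades memory for fewer source reads: computeIfAbsent queries the source only the first time a given key is seen.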
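For the dynamic WHERE clause mentioned above, a common pattern under Reload at each row is to store a value from the current main row under a globalMap key in the tMap lookup settings and reference that key in the lookup input component's query, so each reload only fetches the matching rows. The orderId key, the customers table, and the columns below are illustrative only:

```java
// Query field of the lookup DB input component; "orderId" is a globalMap key
// set in the tMap lookup settings, and "customers" is an illustrative table.
"SELECT id, name FROM customers WHERE id = " + (Integer) globalMap.get("orderId")
```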
Note that when your lookup is a database table, the best practice is to open the connection to the database at the beginning of your Job design in order to optimize performance; the plain JDBC sketch below illustrates the idea.
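The sketch creates the connection and the prepared statement once, up front, and only repeats the per-row query. In a Talend Job you would typically achieve the same effect with a connection component (for example, tMysqlConnection) at the start of the design; the URL, credentials, and table here are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SharedConnectionLookup {
    public static void main(String[] args) throws SQLException {
        // Open the connection once, at the start, instead of reconnecting for
        // every lookup reload. Placeholder URL and credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT name FROM customers WHERE id = ?")) {
            int[] mainRowKeys = {1, 2, 1};
            for (int key : mainRowKeys) {
                ps.setInt(1, key); // reload at each row, over the same connection
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(key + " -> " + rs.getString("name"));
                    }
                }
            }
        }
    }
}
```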