tReservoirSampling Standard properties
These properties are used to configure tReservoirSampling running in the Standard Job framework.
The Standard tReservoirSampling component belongs to the Data Quality family.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend Data Services Platform, and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Sync columns to retrieve the schema from the previous component in the Job. |
|
Built-In: You create and store the schema locally for this component only. |
|
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Sample Size |
Set how many rows to sample from the input flow. |
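The component name suggests reservoir sampling, which picks a fixed-size uniform sample in one pass over a flow whose length is not known in advance. The following is a minimal sketch of the classic technique (often called Algorithm R), not Talend's actual implementation. The function name `reservoir_sample` and its parameters are hypothetical; `sample_size` plays the role of the Sample Size setting.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], sample_size: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of at most sample_size rows in one pass.

    Hypothetical helper for illustration only; not part of any Talend API.
    """
    reservoir: List[T] = []
    for index, row in enumerate(rows):
        if index < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Pick a slot in [0, index]; the row is kept only if the slot
            # falls inside the reservoir, i.e. with probability
            # sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = row
    return reservoir


# Sample 10 rows out of a 1000-row input flow.
sample = reservoir_sample(range(1000), 10, random.Random())
# An input shorter than the sample size is returned in full.
short = reservoir_sample(range(3), 10, random.Random())
```

In this sketch, memory use depends only on the sample size, not on the number of input rows.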
Advanced settings
Seed for random generator |
Enter a number to use as the seed if you want to extract the same sample in different executions of the Job. Running the Job again with a different seed value extracts a different data sample. Leave this field empty if you want to extract a different data sample each time you execute the Job. |
tStatCatcher Statistics |
Select this check box to collect log data at the component level. |
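The effect of the seed can be illustrated with any seeded pseudo-random generator. The sketch below uses Python's `random.Random` and its `sample` method as a stand-in for the component's sampler. It does not show how Talend handles the seed internally, and the helper `draw` is hypothetical.

```python
import random

rows = list(range(10_000))


def draw(seed=None, sample_size=5):
    """Draw a sample; seed=None mimics leaving the seed field empty."""
    # With seed=None, Python seeds the generator from OS entropy or the clock,
    # so each call generally yields a different sample.
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)


first = draw(seed=42)
second = draw(seed=42)   # same seed: identical sample
unseeded = draw()        # no seed: sample generally differs per run
print(first == second)
```

A fixed seed is useful when a profiling analysis has to be rerun on exactly the same rows, for example to compare results before and after a change to the analysis.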
Usage
Usage rule |
This component lets you test profiling analyses on a data sample and obtain results similar to those on the full dataset. tReservoirSampling cannot be used in Map/Reduce Jobs for the time being. |