ML feature-processing algorithms in Talend
HashingTF
As a text-processing algorithm, HashingTF converts input data into fixed-length feature vectors to reflect the importance of a term (a word or a sequence of words) by calculating the frequency with which these terms appear in the input data. This algorithm is available in Spark Batch and Spark Streaming Jobs.
In a Spark Batch Job, it is typically used along with the IDF (Inverse document frequency) algorithm to make the weight calculation more reliable. In the context of a Talend Spark Job, you need to put a second tModelEncoder to apply the Inverse document frequency algorithm on the output of the HashingTF computation.
The data must be already segmented before being sent to the HashingTF computation; therefore, if the data to be used has not been segmented, you need to use another tModelEncoder to apply the Tokenizer algorithm or the Regex tokenizer algorithm to prepare the data.
For further details about the HashingTF implementation in Spark, see HashingTF from the Spark documentation.
- type of the input column: Object
- type of the output column: Vector
Parameter | Description |
---|---|
numFeatures | The number of features that define the dimension of the feature vector. For example, you can enter numFeatures=2^20 to define the dimension. If you do not put any parameter, the default value, 2^20, is used. The output vectors are sparse vectors. For example, a document reading "tModelEncoder transforms your data to features." can be transformed to (3,[0,1,2],[1.0,2.0,3.0]), if you put numFeatures=3. For further information about how to read a sparse vector, see Local vector. |
For further information about the Spark API of HashingTF, see ML HashingTF.
It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to create a sentiment analysis model.
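As an illustration only, here is a minimal PySpark sketch of the corresponding Spark transformation; the sample data, column names and the pre-existing spark session are assumptions, since in a Talend Job this computation is configured through tModelEncoder rather than written by hand:

```python
from pyspark.ml.feature import HashingTF

# Illustrative input: one row per document, already tokenized into words.
df = spark.createDataFrame(
    [(["tmodelencoder", "transforms", "your", "data"],)], ["words"])

# numFeatures corresponds to the numFeatures parameter described above (2^20 here).
hashing_tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 20)
hashing_tf.transform(df).show(truncate=False)
```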
Inverse document frequency
As a text-processing algorithm, Inverse document frequency (IDF) is often used to process the output of the HashingTF computation in order to downplay the importance of the terms that appear in too many documents. This algorithm is available in Spark Batch Jobs.
It requires a tModelEncoder component performing the HashingTF computation to provide input data.
For further details about the IDF implementation in Spark, see IDF from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
minDocFreq | The minimum number of documents that should contain a term. This number is the threshold indicating when a term becomes relevant to the IDF computation. For example, if you put minDocFreq=5, when only 4 documents contain a term, this term is considered irrelevant and no IDF is actually applied to it. For further details about the Spark API of this IDF algorithm, see ML feature IDF. |
It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to create a sentiment analysis model.
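For reference, a minimal PySpark sketch of the two-step HashingTF + IDF computation described above (the data, column names and the existing spark session are illustrative assumptions, not the code generated by the Talend Job):

```python
from pyspark.ml.feature import HashingTF, IDF

df = spark.createDataFrame(
    [(["talend", "spark", "batch"],), (["spark", "streaming", "job"],)], ["words"])

# First step: term frequencies (the HashingTF computation).
tf = HashingTF(inputCol="words", outputCol="tf").transform(df)

# Second step: IDF weighting; minDocFreq mirrors the parameter listed above.
idf_model = IDF(inputCol="tf", outputCol="tf_idf", minDocFreq=1).fit(tf)
idf_model.transform(tf).show(truncate=False)
```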
Word2Vector
Word2Vector transforms a document into a feature vector, for use in other learning computations such as text similarity calculation. This algorithm is available in Spark Batch Jobs.
For further details about the Word2Vector implementation in Spark, see Word2Vec from the Spark documentation.
- type of the input column: List
- type of the output column: Vector
Parameter | Description |
---|---|
maxIter | Maximum number of iterations for obtaining the optimal result. For example, maxIter=5. |
minCount | Minimum number of times a token should appear to be included in the vocabulary of the Word2Vector model. The default is minCount=5. |
numPartitions | Number of partitions. |
seed | The random seed number. |
stepSize | The step size for each iteration. This defines the learning rate. |
vectorSize | Size of each feature vector. The default is vectorSize=100, with which 100 numeric values are calculated to identify a document. |
If you need to set several parameters, separate these parameters using semicolons (;), for example, maxIter=5;minCount=4.
For further information about the Spark API of Word2Vector, see Word2Vec.
It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to, for example, find similar user comments about a product.
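A minimal PySpark sketch of the equivalent Spark computation is shown below; the comment data, column names and the existing spark session are assumptions made for illustration only:

```python
from pyspark.ml.feature import Word2Vec

df = spark.createDataFrame(
    [("great product easy to use".split(" "),),
     ("easy to install great support".split(" "),)], ["comment"])

# vectorSize, minCount and maxIter mirror the parameters listed above;
# minCount is lowered here because the sample is tiny.
w2v = Word2Vec(inputCol="comment", outputCol="features",
               vectorSize=100, minCount=1, maxIter=5)
w2v.fit(df).transform(df).show(truncate=False)
```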
CountVectorizer
CountVectorizer extracts the most frequent terms from a collection of text documents and converts these terms into vectors of token counts.
This algorithm is available in Spark Batch Jobs.
It requires a tModelEncoder component performing the Tokenizer or the Regex tokenizer computation to provide input data of the List type.
For further information about the CountVectorizer implementation in Spark, see CountVectorizer.
- type of the input column: List
- type of the output column: Vector
Parameter | Description |
---|---|
minDF | The minimum number of documents in which a term should appear so as to be included in the vocabulary built by CountVectorizer. The default value is minDF=1. If you put a value between 0 and 1, it is interpreted as a fraction of the documents. |
minTF | The threshold used to ignore the rare terms in a document. A term with a frequency less than the value of minTF is ignored. The default value is minTF=1. |
vocabSize | The maximum size of each vocabulary vector built by CountVectorizer. The default value is 2^18. |
For further information about the Spark API of CountVectorizer, see CountVectorizer.
It is often used to process text for text mining with the Classification or the Clustering components.
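For illustration, a minimal PySpark sketch of the corresponding Spark computation (sample tokens, column names and the existing spark session are assumptions, not Talend-generated code):

```python
from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame(
    [(["a", "b", "c"],), (["a", "b", "b", "c", "a"],)], ["tokens"])

# minDF, minTF and vocabSize mirror the parameters listed above.
cv = CountVectorizer(inputCol="tokens", outputCol="features",
                     vocabSize=1 << 18, minDF=1.0, minTF=1.0)
cv.fit(df).transform(df).show(truncate=False)
```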
Binarizer
Using the given threshold, Binarizer transforms a feature into a binary feature whose value is either 1.0 or 0.0.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Binarizer implementation in Spark, see Binarizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
Parameter | Description |
---|---|
threshold | The threshold used to binarize continuous features. The features greater than the threshold are binarized to 1.0 and the features equal to or less than the threshold are binarized to 0.0. The default is threshold=0.0. |
For further information about the Spark API of Binarizer, see ML Binarizer.
It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to, for example, estimate whether a user comment indicates the user's satisfaction or dissatisfaction.
Bucketizer
Bucketizer segments continuous features to a column of feature buckets using the boundary values you define.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Bucketizer implementation in Spark, see Bucketizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
Parameter | Description |
---|---|
splits | The parameter used to segment continuous features into buckets. A bucket is a half-open range [x,y) defined by the boundary values (x and y) you give, except the last bucket, which also includes y. For example, you can put splits=Double.NEGATIVE_INFINITY, -0.5, 0.0, 0.5, Double.POSITIVE_INFINITY to segment values such as -0.5, 0.3, 0.0, 0.2. Double.NEGATIVE_INFINITY and Double.POSITIVE_INFINITY are recommended when you do not know the lower and upper bounds of the target column. |
For further information about the Spark API of Bucketizer, see Bucketizer.
It can be used to prepare categorical data for training classification or clustering models.
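A minimal PySpark sketch of the same bucketing, using the boundary values from the example above (the data, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import Bucketizer

df = spark.createDataFrame([(-0.5,), (0.3,), (0.0,), (0.2,)], ["feature"])

# The same boundary values as in the splits example above;
# float("-inf")/float("inf") stand in for Double.NEGATIVE_INFINITY/POSITIVE_INFINITY.
splits = [float("-inf"), -0.5, 0.0, 0.5, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="feature", outputCol="bucket")
bucketizer.transform(df).show()
```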
Discrete Cosine Transform (DCT)
Discrete Cosine Transform in Spark implements the one-dimensional DCT-II to transform a real-valued vector in the time domain into another real-valued vector of the same length in the frequency domain. That is to say, the input data is converted into a series of cosine waves oscillating at different frequencies.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
It requires input data of the Vector type, which can be provided by:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the DCT implementation in Spark, see DCT.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
inverse | The boolean to indicate whether to perform the inverse DCT calculations (when inverse=true) or the forward DCT calculations (when inverse=false). By default, it is inverse=false. |
For further information about the Spark API of Discrete Cosine Transform, see DCT.
It is widely used to process images and audio for training related classification or clustering models.
MinMaxScaler
MinMaxScaler rescales each feature vector into a fixed range.
It requires input data of the Vector type, which can be provided by:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
This algorithm is available in Spark Batch Jobs.
For further information about the MinMaxScaler implementation in Spark, see MinMaxScaler.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
min | The lower bound of each vector after the transformation. By default, it is min=0. |
max | The upper bound of each vector after the transformation. By default, it is max=1. |
For further information about the Spark API of MinMaxScaler, see MinMaxScaler.
It is used to normalize features to fit within a certain range. This is typically used in image processing, for example, to normalize data about pixel intensities.
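For reference, a minimal PySpark sketch of the equivalent Spark computation (the feature vectors, column names and the existing spark session are assumptions made for illustration):

```python
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 10.0]),), (Vectors.dense([5.0, 20.0]),)], ["features"])

# min and max mirror the parameters listed above.
scaler = MinMaxScaler(inputCol="features", outputCol="scaled", min=0.0, max=1.0)
scaler.fit(df).transform(df).show(truncate=False)
```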
N-gram
N-gram converts a tokenized string (often words) into an array of comma-separated n-grams. Within each n-gram, words are separated by a space. For example, when creating 2-grams, the string Good morning World will be converted to (good morning, morning world).
This algorithm is available in Spark Batch and Spark Streaming Jobs.
It requires a tModelEncoder component performing the Tokenizer or the Regex tokenizer computation to provide input data of the List type.
For further information about the N-gram implementation in Spark, see NGram.
- type of the input column: List
- type of the output column: List
Parameter | Description |
---|---|
n | The minimum length of each n-gram. By default, it is n=1, that is to say, 1-gram or unigram. |
For further information about the Spark API of N-gram, see NGram.
It is often used in natural language processing, such as speech recognition, to prepare data for the related Classification or Clustering models.
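A minimal PySpark sketch reproducing the 2-gram example above (the tokenized input, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import NGram

df = spark.createDataFrame([(["good", "morning", "world"],)], ["words"])

# n=2 reproduces the 2-gram example given above.
ngram = NGram(n=2, inputCol="words", outputCol="bigrams")
ngram.transform(df).show(truncate=False)
# -> [good morning, morning world]
```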
Normalizer
Normalizer normalizes each vector of the input data to have unit norm so as to improve the performance of learning computations.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further information about the Normalizer implementation in Spark, see Normalizer from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
pNormValue | The p-norm value used to standardize the feature vectors from the input flow to unit norm. By default, it is pNormValue=2, meaning to use the Euclidean norm. |
For further information about the Spark API of Normalizer, see Normalizer.
It can be used to normalize the result of the TF-IDF computation in order to eventually improve the performance of text classification (by tLogisticRegressionModel, for example) or text clustering.
One hot encoder
One hot encoder enables the algorithms that expect continuous features to use categorical features by mapping the column of label indices of the categorical features to a column of binary code.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
You can use another tModelEncoder component with the String indexer algorithm to create this column of label indices.
For further information about the OneHotEncoder implementation in Spark, see OneHotEncoder from the Spark documentation.
- type of the input column: Double
- type of the output column: Vector
Parameter | Description |
---|---|
dropLast | The boolean parameter used to determine whether to drop the last category. The default is dropLast=true, meaning that the last category is dropped: the output vector for this category contains only 0 and each vector uses one bit less of storage space. This configuration allows you to save storage space for the output vectors. |
For further information about the Spark API of One hot encoder, see OneHotEncoder.
It can be used to provide feature data to the Classification or the Clustering components, such as tLogisticRegressionModel.
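As a sketch only, here is the equivalent two-step computation in PySpark, using the Spark 3 API in which OneHotEncoder is fitted before transforming; the data, column names and the existing spark session are assumptions:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame([("FR",), ("US",), ("FR",), ("JP",)], ["country"])

# String indexer first builds the column of label indices...
indexed = StringIndexer(inputCol="country",
                        outputCol="country_idx").fit(df).transform(df)

# ...then One hot encoder maps these indices to binary vectors.
encoder = OneHotEncoder(inputCol="country_idx", outputCol="country_vec",
                        dropLast=True)
encoder.fit(indexed).transform(indexed).show(truncate=False)
```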
PCA
PCA implements an orthogonal transformation to convert vectors of correlated features into vectors of linearly uncorrelated features. This can project high-dimensional feature vectors to low-dimensional feature vectors.
This algorithm is available in Spark Batch Jobs.
It requires input data of the Vector type, which can be provided by:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the PCA implementation in Spark, see PCA from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
k | The number of principal components to be generated. This value determines the dimension of the feature vectors to be output. For example, k=3 means that 3-dimensional feature vectors will be output. For further information about PCA and its principal components, see Principal component analysis. |
For further information about the Spark API of PCA, see PCA.
Typically, it can be used to prepare features for resolving clustering problems.
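A minimal PySpark sketch of the projection described above, with k=3 (the vectors, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([1.0, 1.0, 7.0, 0.0, 2.0]),)], ["features"])

# k determines the dimension of the output vectors, as described above.
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pca.fit(df).transform(df).show(truncate=False)
```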
Polynomial expansion
Polynomial expansion expands the input features so as to improve the performance of learning computations.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Polynomial expansion implementation in Spark, see PolynomialExpansion from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
degree | The polynomial degree to expand. A higher-degree expansion of features often means more accuracy in the model you need to create, but note that too high a degree can lead to overfitting in the result of the predictive analysis based on the same model. The default is degree=2, meaning to expand the input features into a 2-degree polynomial space. |
For further information about the Spark API of Polynomial expansion, see Polynomial expansion.
It can be used to process feature data for the Classification or the Clustering components, such as tLogisticRegressionModel.
QuantileDiscretizer
QuantileDiscretizer reads a column of continuous features, analyzes a sample of data from these feature data, and accordingly outputs a column of categorical features that group the data of the continuous features into roughly equal parts.
This algorithm is available in Spark Batch Jobs but is not compatible with Spark 2.2 on EMR 5.8.
For further information about the QuantileDiscretizer implementation in Spark, see QuantileDiscretizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
Parameter | Description |
---|---|
numBuckets | The maximum number of buckets into which you want to group the input data. This value must be greater than or equal to 2. The default value is numBuckets=2. |
For further information about the Spark API of QuantileDiscretizer, see Quantile Discretizer.
It can be used to prepare categorical data for training classification or clustering models.
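For illustration, a minimal PySpark sketch of the same discretization (the numeric sample, column names and the existing spark session are assumptions made for this example):

```python
from pyspark.ml.feature import QuantileDiscretizer

df = spark.createDataFrame([(float(v),) for v in range(10)], ["hour"])

# numBuckets mirrors the parameter listed above: the values are grouped
# into roughly equal parts.
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour",
                                  outputCol="hour_bucket")
discretizer.fit(df).transform(df).show()
```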
Regex tokenizer
Regex tokenizer performs advanced tokenization based on given regex patterns.
For further details about the RegexTokenizer implementation in Spark, see RegexTokenizer from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: String
- type of the output column: Object and List
Parameter | Description |
---|---|
gaps | The boolean parameter used to indicate whether the regex splits the input text on one or more whitespace characters (when gaps=true) or repetitively matches a token (when gaps=false). By default, this parameter is set to true and the default delimiter is \\s+, which matches one or more whitespace characters. |
pattern | The parameter used to set the regex pattern that matches tokens in the input text. |
minTokenLength | The parameter used to filter matched tokens using a minimum length. The default value is 1, so as to avoid returning empty strings. |
If you need to set several parameters, separate these parameters using semicolons (;), for example, gaps=true;minTokenLength=4.
For further information about the Spark API of Regex tokenizer, see RegexTokenizer.
It is often used to process text for text mining with the Classification or the Clustering components, such as tRandomForestModel, in order to create, for example, a spam filtering model.
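A minimal PySpark sketch combining the three parameters listed above (the input text, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import RegexTokenizer

df = spark.createDataFrame([("Talend   builds Spark Jobs",)], ["text"])

# gaps, pattern and minTokenLength mirror the parameters listed above:
# split on whitespace and keep only tokens of at least 4 characters.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens",
                           gaps=True, pattern="\\s+", minTokenLength=4)
tokenizer.transform(df).show(truncate=False)
```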
Tokenizer
Tokenizer breaks input text (often sentences) into individual terms (often words). Note that these words are all converted to lowercase.
For further details about the Tokenizer implementation in Spark, see Tokenizer from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: String
- type of the output column: Object and List
You do not need to set any additional parameters for Tokenizer.
For further information about the Spark API of Tokenizer, see Tokenizer.
It is often used to process text for text mining with the Classification or the Clustering components, such as tRandomForestModel, in order to create, for example, a spam filtering model.
SQLTransformer
SQLTransformer allows you to implement feature transformation using Spark SQL statements. It is subject to the limitations indicated in the Spark documentation. For further information about these limitations and the SQLTransformer implementation in Spark, see SQLTransformer from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: All types. You need to select the column to be used in your SQL statement.
- type of the output column: All types. You need to define it depending on your SQL statement.
Parameter | Description |
---|---|
statement | The Spark SQL statement to be used to select and/or transform input data. |
For further information about the Spark API of SQLTransformer, see SQLTransformer.
It gives you the flexibility to extract and transform data to prepare features for other Machine Learning algorithms, or to directly query the results of other Machine Learning algorithms.
Note that you can also efficiently perform Spark SQL queries using tSqlRow or join data using tMap to prepare the data.
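For illustration, a minimal PySpark sketch of a statement applied to the input dataset; the __THIS__ placeholder is the SQLTransformer convention for the input data, while the sample columns, the SQL expression and the existing spark session are assumptions:

```python
from pyspark.ml.feature import SQLTransformer

df = spark.createDataFrame([(1, 2.0, 3.0), (2, 5.0, 1.0)], ["id", "v1", "v2"])

# __THIS__ stands for the input dataset in the statement parameter.
sql = SQLTransformer(
    statement="SELECT *, v1 + v2 AS v_sum FROM __THIS__ WHERE v1 > 1")
sql.transform(df).show()
```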
Standard scaler
Standard scaler standardizes each input vector to have unit standard deviation (unit variance), a common case of normal distribution. The standardized data can improve the convergence rate and prevent features with very large variances from exerting an overly large influence during model training. For further details about the StandardScaler implementation in Spark, see StandardScaler from the Spark documentation.
This algorithm is available in Spark Batch Jobs.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
withMean | The boolean parameter used to indicate whether to center each vector of feature data with the mean (that is to say, subtract the mean of the feature numbers from each of these numbers) before scaling. Centering the data builds a dense output, so when the input data is sparse, it raises an exception. By default, this parameter is set to false, meaning that no centering occurs. |
withStd | The boolean parameter used to indicate whether to scale the input data to have unit standard deviation. By default, withStd is set to true, meaning that the input feature vectors are normalized to have unit standard deviation. |
If you need to set several parameters, separate these parameters using semicolons (;), for example, withMean=true;withStd=true.
Note that if you set both parameters to false, Standard scaler actually does nothing.
For further information about the Spark API of Standard scaler, see StandardScaler.
It can be used to prepare data for the Classification or the Clustering components, such as tKMeanModel.
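A minimal PySpark sketch of the standardization described above (the vectors, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 10.0]),), (Vectors.dense([3.0, 30.0]),)], ["features"])

# withMean and withStd mirror the parameters listed above;
# the input here is dense, so centering with the mean is safe.
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
scaler.fit(df).transform(df).show(truncate=False)
```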
StopWordsRemover
StopWordsRemover filters out stop words from the input word strings. It requires a tModelEncoder component performing the Tokenizer or the Regex tokenizer computation to provide input data of the List type.
For further details about the StopWordsRemover implementation in Spark, see StopWordsRemover from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: List
- type of the output column: List
Parameter | Description |
---|---|
caseSensitive | The boolean to indicate whether filtering out stop words is case-sensitive. By default, the value is caseSensitive=false. |
stopWords | It defines the list of stop words to be used for the filtering. By default, it is stopWords=English. You can consult the list of these default stop words on stop_words. If you need to use a custom list of stop words, you can directly enter them with this parameter, for example, stopWords=the,stop,words. |
For further information about what a stop word is, see Stop words on Wikipedia.
For further information about the Spark API of StopWordsRemover, see StopWordsRemover.
It removes the most common words, which often do not carry much meaning, in order to reduce noise as much as possible in text processing.
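For reference, a minimal PySpark sketch of the filtering described above; leaving stopWords unset keeps the default English list (the tokens, column names and the existing spark session are assumptions):

```python
from pyspark.ml.feature import StopWordsRemover

df = spark.createDataFrame([(["the", "quick", "brown", "fox"],)], ["words"])

# caseSensitive mirrors the parameter listed above; the default English
# stop-word list is used because stopWords is not set.
remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                           caseSensitive=False)
remover.transform(df).show(truncate=False)
```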
String indexer
String indexer generates indices for categorical features (string-type labels). These indices can be used by other algorithms such as One hot encoder to build equivalent continuous features. The indices are ordered by frequency and the most frequent label gets the index 0.
For further details about the StringIndexer implementation in Spark, see StringIndexer from the Spark documentation.
This algorithm is available in Spark Batch Jobs.
- type of the input column: String
- type of the output column: Double
You do not need to set any additional parameters for String indexer.
For further information about the Spark API of String indexer, see StringIndexer.
String indexer, along with One hot encoder, enables algorithms that expect continuous features to use categorical features.
Vector indexer
Vector indexer identifies categorical feature columns based on your definition of the maxCategories parameter and indexes the categories from each of the identified columns, starting from 0. The other columns are declared as continuous feature columns and are not indexed. For further details about the VectorIndexer implementation in Spark, see VectorIndexer from the Spark documentation.
This algorithm is available in Spark Batch Jobs.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
maxCategories | The parameter used to set the threshold indicating whether a vector column represents categorical features or continuous features. For example, if you put maxCategories=2, the columns that contain more than 2 distinct values will be declared as continuous feature columns and the other columns as categorical feature columns. The default is maxCategories=20. |
For further information about the Spark API of Vector indexer, see VectorIndexer.
Vector indexer gives indices to categorical features so that algorithms such as the Decision Trees computations run by tRandomForestModel can handle the categorical features appropriately.
Vector assembler
Vector assembler combines selected input columns into one single vector column that can be used by other algorithms or machine learning computations that expect vector features. Note that Vector assembler does not re-calculate the features taken from different columns. It only combines these feature columns into one single vector but keeps the features as they are.
When you select Vector assembler, the Input column column of the Transformation table in the Basic settings view of tModelEncoder is deactivated and you need to use the inputCols parameter in the Parameters column to select the input columns to be combined.
For further details about the VectorAssembler implementation in Spark, see VectorAssembler from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: numeric types, boolean type and vector type
- type of the output column: Vector
Parameter | Description |
---|---|
inputCols | The parameter used to indicate the input columns to be combined into one single vector column. For example, you can put inputCols=id,country_code to combine the id column and the country_code column. |
For further information about the Spark API of Vector assembler, see VectorAssembler.
Vector assembler prepares feature vectors for the Logistic Regression computations or the Decision Tree computations run by components such as tLogisticRegressionModel and tRandomForestModel.
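A minimal PySpark sketch of the assembly described above; the numeric sample columns (age, income), the output column name and the existing spark session are assumptions made for illustration:

```python
from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame([(1, 33, 2000.0), (2, 45, 3500.0)],
                           ["id", "age", "income"])

# inputCols mirrors the inputCols parameter listed above: the selected
# columns are concatenated, unchanged, into one vector column.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
assembler.transform(df).show(truncate=False)
```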
ChiSqSelector
ChiSqSelector determines feature relevance to given feature categories based on a Chi-Squared test of independence and then selects the features most relevant to those categories. For further details about the ChiSqSelector implementation in Spark, see ChiSqSelector from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: Vector, Double and List
- type of the output column: Vector
Parameter | Description |
---|---|
featuresCol | The input column that provides the features to be selected by ChiSqSelector. The type of this column should be Vector. |
labelCol | The input column that provides the categories for the features to be used. The type of this input column should be Double. |
numTopFeatures | The number of features that ChiSqSelector defines as the most relevant to the feature categories and therefore selects to output. By default, this parameter is numTopFeatures=50, meaning to select the 50 most relevant features. |
For example, in an analysis of loan validation, a column called features and a column called label have been prepared: the former column carries features about the borrower candidates such as address, age and income, and the latter column the categories that indicate whether to validate the loan for each candidate. You need to put featuresCol=features;labelCol=label so that ChiSqSelector can make use of these columns.
In addition, if you want to select only the top 1 most relevant feature, put numTopFeatures=1 and then the income feature will be selected; if you put numTopFeatures=2 instead, the top 2 most relevant features will be selected, that is to say, the income and the age features.
For further information about the Spark API of ChiSqSelector, see ChiSqSelector.
It can be used to yield the features with the most predictive power for training classification or clustering models.
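A minimal PySpark sketch of the selection described above; the feature vectors, labels, column names and the existing spark session are assumptions made for illustration only:

```python
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

# Illustrative features and labels (for example, borrower data and a
# loan-validation category per row).
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)], ["features", "label"])

# featuresCol, labelCol and numTopFeatures mirror the parameters listed above.
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         labelCol="label", outputCol="selected")
selector.fit(df).transform(df).show(truncate=False)
```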
RFormula
RFormula allows you to generate feature vectors along with their feature labels. It is subject to the limitations indicated in the Spark documentation. For further information about these limitations and the RFormula implementation in Spark, see RFormula from the Spark documentation.
This algorithm is available in Spark Batch Jobs but is not compatible with Spark 2.2 on EMR 5.8.
- type of the input column: String types and numeric types. You need to use an R formula to select the columns to be used.
- type of the output column: Vector for features and Double for labels
Parameter | Description |
---|---|
featuresCol | The output column used to carry the feature data. |
labelCol | The output column used to carry the feature labels. |
formula | The R formula to be applied. |
For example, if you put featuresCol=features;labelCol=label;formula=clicked ~ country + hour in the Parameters column of the Transformation table, you need to add the features column and the label column to the output schema and set the former column to the type Vector and the latter to the type Double. Then during the transformation, the R formula defined using the formula parameter is applied on the input columns, features are generated into the features column and feature labels into the label column. This example is based on the one explained in RFormula from the Spark documentation.
For further information about the Spark API of RFormula, see R Formula.
It allows you to apply R formulas to prepare feature data.
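For reference, a minimal PySpark sketch of the example above (the sample rows, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import RFormula

df = spark.createDataFrame(
    [(7, "US", 18, 1.0), (8, "CA", 12, 0.0)], ["id", "country", "hour", "clicked"])

# The same formula as in the example above; features and label are the
# output columns carrying the feature vectors and the feature labels.
rf = RFormula(formula="clicked ~ country + hour",
              featuresCol="features", labelCol="label")
rf.fit(df).transform(df).show(truncate=False)
```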
VectorSlicer
VectorSlicer reads a feature vector, selects features from this vector based on the values of the indices parameter, and writes the selected features into a new vector in the output column. It requires input data of the Vector type, which can be provided by:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the VectorSlicer implementation in Spark, see VectorSlicer from Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: Vector
- type of the output column: Vector
Parameter | Description |
---|---|
indices | The numeric indices of the features to be selected and output. For example, if you need to select the second and the third features of the following vector, [0.0, 10.0, 0.5], you need to put indices=1,2. Then the vector [10.0, 0.5] will be written in the output column. The features in the output vector are ordered according to their indices. |
For further information about the Spark API of VectorSlicer, see VectorSlicer. Note that the names parameter is not supported by tModelEncoder.
It allows you to be precise in selecting the features you want to use.
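A minimal PySpark sketch reproducing the indices example above (the vector, column names and the existing spark session are illustrative assumptions):

```python
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([0.0, 10.0, 0.5]),)], ["features"])

# indices=[1, 2] selects the second and third features, as in the example above.
slicer = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 2])
slicer.transform(df).show(truncate=False)
# -> [10.0, 0.5]
```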