This step consists of four substeps: transforming the messages into words, calculating
the weight of each word in each message, downplaying the weight of irrelevant words in
each message, and combining feature vectors.
Procedure
-
To transform messages to words:
-
Double-click the tModelEncoder
component labeled Tokenize to open its Component view. This component tokenizes the SMS
messages into words.
-
Click the Sync columns button to
retrieve the schema from the preceding component.
-
Click the [...] button next to
Edit schema to open the schema
editor.
-
On the output side, click the [+]
button to add one row and in the Column
column, rename it to sms_tokenizer_words. This column is
used to carry the tokenized messages.
-
In the Type column, select Object for this
sms_tokenizer_words row.
-
Click OK to validate these
changes.
-
In the Transformations table, add one
row by clicking the [+] button and then
proceed as follows.
- In the Input column column, select the column
that provides data to be transformed to features. In this scenario, it is
sms_contents.
- In the Output column column, select the column
that carries the features. In this scenario, it is
sms_tokenizer_words.
- In the Transformation column, select the
algorithm to be used for the transformation. In this scenario, it is
Regex tokenizer.
- In the Parameters column, enter the parameters
you want to customize for use in the algorithm you have selected. In this
scenario, enter pattern=\\W;minTokenLength=3.
Using this transformation, tModelEncoder splits each input message on non-word
characters (the \\W pattern), keeps only the tokens that contain at least 3
characters, and puts the result of the transformation in the sms_tokenizer_words
column. Thus currency symbols, punctuation, numeric values of fewer than 3 digits,
and words such as a, an or to are excluded from
this column.
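The effect of the pattern and minTokenLength parameters can be illustrated outside the Studio with a minimal Python sketch. It is not the tModelEncoder implementation: the function name regex_tokenize is hypothetical, and lowercasing the message before splitting is an assumption made for this illustration.

```python
import re

def regex_tokenize(message, pattern=r"\W", min_token_length=3):
    """Split a message on every match of `pattern` and keep long-enough tokens."""
    # Assumption of this sketch: the message is lowercased before splitting.
    pieces = re.split(pattern, message.lower())
    # Empty strings and tokens shorter than min_token_length are dropped.
    return [p for p in pieces if len(p) >= min_token_length]

tokens = regex_tokenize("Win a £5 prize now, call to claim!")
print(tokens)
```

In this sketch, the currency symbol, the punctuation, the single digit and the short words a and to do not reach the output. Because digits count as word characters in a regular expression, a longer digit run such as 1000 would be kept as a token.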
-
To calculate the weight of a word in each message:
-
Double-click the tModelEncoder
component labeled tf to open its Component view.
-
Repeat the operations described previously over the tModelEncoder labeled
Tokenize to add the
sms_tf_vect column of the Vector type to the output schema and define the transformation
as displayed in the image above.
In this transformation, tModelEncoder uses HashingTF to convert the already tokenized SMS messages into
fixed-length (15 in this scenario) feature vectors to
reflect the importance of a word in each SMS message.
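The hashing idea behind this transformation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the component's code: the function name hashing_tf is hypothetical, and zlib.crc32 is used here only as a convenient deterministic stand-in for whatever hash function the component applies internally.

```python
import zlib

NUM_FEATURES = 15  # vector length used in this scenario

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map a list of tokens to a fixed-length vector of term counts."""
    vector = [0.0] * num_features
    for token in tokens:
        # Stand-in hash (assumption): any deterministic hash works for the sketch.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector

sms_tf_vect = hashing_tf(["win", "prize", "now", "win"])
print(len(sms_tf_vect), sum(sms_tf_vect))
```

Whatever the number of distinct words, the vector always has 15 entries, so two different words can land in the same position. A larger vector length reduces such collisions at the cost of a wider feature column.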
-
To downplay the weight of the irrelevant words in each message:
-
Double-click the tModelEncoder
component labeled tf_idf to open its Component view.
In this process,
tModelEncoder
reduces the weight of the words that appear in too many
messages, because such a word, for example
the, often brings no meaningful information for
text analysis.
-
Repeat the operations described previously over the tModelEncoder labeled
Tokenize to add the
sms_tf_idf_vect column of the Vector type to the output schema and define the
transformation as displayed in the image above.
In this transformation, tModelEncoder uses Inverse
Document Frequency to downplay the weight of the words that
appear in 5 or more messages.
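To see why frequent words lose weight, the following Python sketch applies a commonly used smoothed form of Inverse Document Frequency, log((N + 1) / (df + 1)), to a few small term-frequency vectors. The function names fit_idf and apply_idf and the sample vectors are made up for this illustration, and the 5-message setting of this scenario is not modeled. The exact formula used by the component is an assumption here.

```python
import math

def fit_idf(tf_vectors):
    """Compute one IDF weight per vector position from a list of TF vectors."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    # Document frequency: in how many messages each position is non-zero.
    doc_freq = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(num_features)]
    # Smoothed IDF (assumed form): positions present everywhere get weight 0.
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]

def apply_idf(tf_vector, idf):
    """Scale each term count by the IDF weight of its position."""
    return [tf * w for tf, w in zip(tf_vector, idf)]

# Made-up TF vectors for three messages (3 positions instead of 15, for brevity).
tf_vectors = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
]
idf = fit_idf(tf_vectors)
tf_idf_vectors = [apply_idf(v, idf) for v in tf_vectors]
print(idf)
```

The first position is non-zero in every message, so its weight drops to zero, while the positions that occur in a single message keep a positive weight.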
-
To combine feature vectors:
-
Double-click the tModelEncoder
component labeled features_assembler to open its
Component view.
-
Repeat the operations described previously over the tModelEncoder labeled Tokenize to add the features_vect column of the
Vector type to the output schema and
define the transformation as displayed in the image above.
Note that the parameter to be put in the Parameters column is inputCols=sms_tf_idf_vect,num_currency,num_numeric,num_exclamation.
In this transformation, tModelEncoder combines all feature vectors into one single
feature column.
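Conceptually, this assembling step concatenates the listed columns, in order, into one flat vector. The Python sketch below illustrates that idea only. The function name assemble_features and the row values are hypothetical, and the TF-IDF vector is shortened to three entries for readability.

```python
def assemble_features(row, input_cols):
    """Concatenate vector and scalar columns of a row into one flat vector."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector column: append every entry.
            features.extend(float(x) for x in value)
        else:
            # Scalar column: append a single entry.
            features.append(float(value))
    return features

# Column order as given in the inputCols parameter of this scenario.
INPUT_COLS = ["sms_tf_idf_vect", "num_currency", "num_numeric", "num_exclamation"]

# Made-up row for illustration only.
row = {
    "sms_tf_idf_vect": [0.0, 0.69, 0.0],
    "num_currency": 1,
    "num_numeric": 2,
    "num_exclamation": 1,
}
features_vect = assemble_features(row, INPUT_COLS)
print(features_vect)
```

The order of the names in inputCols determines the position of each value in the combined vector, so it should stay the same between training and prediction.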