This step consists of four substeps: transforming the messages to words, calculating
            the weight of a word in each message, downplaying the weight of the irrelevant words in
            each message, and combining feature vectors.
      
         Procedure
- 
               To transform messages to words:
               
                  - 
                     Double-click the tModelEncoder
                        component labeled Tokenize to open its Component view. This component tokenizes the SMS
                        messages into words.
                     
                  
 
                  - 
                     Click the Sync columns button to
                        retrieve the schema from the preceding component.
                  
 
                  - 
                     Click the [...] button next to
                        Edit schema to open the schema
                        editor.
                  
 
                  - 
                     On the output side, click the [+]
                        button to add one row and, in the Column
                        column, name it sms_tokenizer_words. This column is
                        used to carry the tokenized messages.
                     
                  
 
                  - 
                     In the Type column, select Object for this
                        sms_tokenizer_words row.
                  
 
                  - 
                     Click OK to validate these
                        changes.
                  
 
                  - 
                     In the Transformations table, add one
                        row by clicking the [+] button and then
                        proceed as follows.
                     
                        
                           - In the Input column column, select the column
                              that provides data to be transformed to features. In this scenario, it is
                              sms_contents.
 
                           - In the Output column column, select the column
                              that carries the features. In this scenario, it is
                              sms_tokenizer_words.
 
                           - In the Transformation column, select the
                              algorithm to be used for the transformation. In this scenario, it is
                              Regex tokenizer.
 
                           - In the Parameters column, enter the parameters
                              you want to customize for use in the algorithm you have selected. In this
                              scenario, enter pattern=\\W;minTokenLength=3.
 
                        
                      
                     Using this transformation, tModelEncoder splits each input message on non-word
                        characters (the \\W pattern), keeps only the tokens that contain at least 3
                        characters and puts the result of the transformation in the
                        sms_tokenizer_words column. Thus currency symbols, punctuation marks,
                        numeric values shorter than three digits and words such as a,
                        an or to are excluded from
                        this column.
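
The effect of the pattern=\\W;minTokenLength=3 parameters can be sketched in plain Python. This is only an illustration of the splitting and filtering logic, not the code that tModelEncoder runs; the function name regex_tokenize and the sample message are hypothetical, and lowercasing the input is an assumption made for the example.

```python
import re

def regex_tokenize(message, pattern=r"\W", min_token_length=3):
    """Split on every match of `pattern`, then drop tokens that are too short."""
    # Assumption: the input is lowercased before splitting.
    tokens = re.split(pattern, message.lower())
    # Empty strings produced by consecutive separators are removed by the length filter.
    return [t for t in tokens if len(t) >= min_token_length]

# Hypothetical sample message, not taken from the scenario data.
sample = "Free entry! Txt WIN to 80 now, it is a £5 prize."
words = regex_tokenize(sample)
print(words)
```

In this sketch, digits count as word characters, so a number survives only when it has at least three digits; punctuation and the currency symbol act as separators and never appear in the output.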
                   
               
             
- 
               To calculate the weight of a word in each message:
               
                  - 
                     Double-click the tModelEncoder
                        component labeled tf to open its Component view.
                     
                  
 
                  - 
                     Repeat the operations described previously for the tModelEncoder labeled
                        Tokenize to add the
                        sms_tf_vect column of the Vector type to the output schema and define the transformation
                        as displayed in the image above.
                     
                        
                        In this transformation, tModelEncoder uses HashingTF to convert the already tokenized SMS messages into
                           fixed-length (15 in this scenario) feature vectors to
                           reflect the importance of a word in each SMS message.
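
As a rough illustration of the hashing idea, the sketch below maps each token to one of 15 slots and counts occurrences per slot. It is a simplified stand-in, not the HashingTF implementation itself: CRC32 is used here only because it is stable and available in the Python standard library, and the function name hashing_tf is hypothetical.

```python
import zlib

def hashing_tf(tokens, num_features=15):
    """Build a fixed-length term-frequency vector with the hashing trick."""
    vector = [0.0] * num_features
    for token in tokens:
        # Assumption: CRC32 stands in for the hash function of the real component.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector

# Hypothetical tokenized message.
tokens = ["free", "entry", "txt", "win", "now", "prize", "win"]
sms_tf_vect = hashing_tf(tokens)
print(sms_tf_vect)
```

Whatever the length of the message, the resulting vector always has 15 entries. Different words can land in the same slot, which is the price paid for the fixed length.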
                      
                   
               
             
- 
               To downplay the weight of the irrelevant words in each message:
               
                  - 
                     Double-click the tModelEncoder
                        component labeled tf_idf to open its Component view. 
                     
                     In this process, tModelEncoder reduces the weight of the words that appear in
                        too many messages, because such a word, for example the, usually brings no
                        meaningful information for text analysis.
                      
                   
                  - 
                     Repeat the operations described previously for the tModelEncoder labeled
                        Tokenize to add the
                        sms_tf_idf_vect column of the Vector type to the output schema and define the
                        transformation as displayed in the image above.
                     
                        
                        In this transformation, tModelEncoder uses Inverse
                              Document Frequency to downplay the weight of the words that
                           appear in 5 or more messages.
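
The weighting principle can be sketched as follows, assuming the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of messages and df is the number of messages in which a term occurs. The function names and the tiny input are hypothetical, and the document-frequency threshold mentioned in this step is deliberately not modeled.

```python
import math

def idf_weights(tf_vectors):
    """Return one IDF weight per vector position, using log((m + 1) / (df + 1))."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    # Document frequency: in how many messages each position is non-zero.
    doc_freq = [sum(1 for vec in tf_vectors if vec[i] > 0) for i in range(num_features)]
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]

def tf_idf(tf_vectors):
    """Scale every term-frequency vector by the IDF weights."""
    weights = idf_weights(tf_vectors)
    return [[tf * w for tf, w in zip(vec, weights)] for vec in tf_vectors]

# Hypothetical term-frequency vectors for three messages and three terms.
tf_vectors = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
weights = idf_weights(tf_vectors)
sms_tf_idf_vect = tf_idf(tf_vectors)
print(weights)
```

The first term occurs in every message, so its weight drops to zero, while the two terms that occur in a single message keep a positive weight.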
                      
                   
               
             
- 
               To combine feature vectors:
               
                  - 
                     Double-click the tModelEncoder
                        component labeled features_assembler to open its
                        Component view.
                     
                  
 
                  - 
                     Repeat the operations described previously for the tModelEncoder labeled Tokenize to add the features_vect column of the
                        Vector type to the output schema and
                        define the transformation as displayed in the image above.
                     
                     Note that the parameter to enter in the Parameters column is inputCols=sms_tf_idf_vect,num_currency,num_numeric,num_exclamation.
                     
                     In this transformation, tModelEncoder combines all feature vectors into one single
                        feature column.
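
Conceptually, the assembling step concatenates the listed input columns, in order, into one flat vector. The sketch below shows this with a plain Python dictionary standing in for one row; the function name assemble and the sample values are hypothetical, and only the column names come from the inputCols parameter above.

```python
def assemble(row, input_cols):
    """Concatenate vector and scalar columns of one row into a single feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector column: append all of its entries.
            features.extend(float(x) for x in value)
        else:
            # Scalar column: append it as a single entry.
            features.append(float(value))
    return features

# Hypothetical row; the vector is shortened to three entries for readability.
row = {
    "sms_tf_idf_vect": [0.0, 0.69, 1.38],
    "num_currency": 1,
    "num_numeric": 2,
    "num_exclamation": 3,
}
input_cols = "sms_tf_idf_vect,num_currency,num_numeric,num_exclamation".split(",")
features_vect = assemble(row, input_cols)
print(features_vect)
```

The order of the names in inputCols determines the order of the values in the combined vector.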