
Creating a classification model using Random Forest

This scenario explains how to create a classification model using Random Forest.

Arranging the data flow

Procedure

  1. In the Integration perspective of Talend Studio, create an empty Spark Batch Job, named rf_model_creation for example, from the Job Designs node in the Repository tree view.
    For further information about how to create a Spark Batch Job, see Creating a Spark Job.
  2. In the workspace, enter the name of the component to be used and select it from the list that appears.
    In this scenario, the components are tHDFSConfiguration, tFileInputDelimited, tRandomForestModel, and four tModelEncoder components.
    It is recommended to give the four tModelEncoder components different labels so that you can easily recognize the task each of them performs. In this scenario, they are labeled Tokenize, tf, tf_idf and features_assembler, respectively.
  3. Connect all the components except tHDFSConfiguration using the Row > Main link.
    A 7-component Job using the tRandomForestModel component.

Configuring the connection to the file system to be used by Spark

See the procedure in the Getting Started Guide.

Reading the training set

Procedure

  1. Double-click tFileInputDelimited to open its Component view.
  2. Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
    tFileInputDelimited uses this configuration to access the training set to be used.
  3. Click the [...] button next to Edit schema to open the schema editor.
  4. Click the [+] button five times to add five rows and in the Column column, rename them to label, sms_contents, num_currency, num_numeric and num_exclamation, respectively.
    The label and sms_contents columns carry the raw data: the SMS text messages are in the sms_contents column, and the labels indicating whether a message is spam are in the label column.
    The other columns are used to carry the features added to the raw datasets as explained previously in this scenario. These three features are the number of currency symbols, the number of numeric values and the number of exclamation marks found in each SMS message.
  5. In the Type column, select Integer for the num_currency, num_numeric and num_exclamation columns.
  6. Click OK to validate these changes.
  7. In the Folder/File field, enter the directory where the training set to be used is stored.
  8. In the Field separator field, enter \t, which is the separator used by the datasets you can download for use in this scenario.
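
The schema and separator defined above can be pictured with a short Python sketch. It is not what tFileInputDelimited runs internally; it only shows how a tab-separated line maps onto the five columns and their types. The sample rows and the read_training_set helper are hypothetical and invented for illustration.

```python
import csv
import io

# Column names and types follow the schema defined in the procedure above.
SCHEMA = [
    ("label", str),
    ("sms_contents", str),
    ("num_currency", int),
    ("num_numeric", int),
    ("num_exclamation", int),
]


def read_training_set(handle):
    """Parse tab-separated lines into dicts typed according to SCHEMA."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip empty lines
        if len(fields) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
        rows.append({name: cast(value) for (name, cast), value in zip(SCHEMA, fields)})
    return rows


# Hypothetical sample rows, not taken from the downloadable datasets.
SAMPLE = (
    "ham\tSee you at lunch tomorrow\t0\t0\t0\n"
    "spam\tWIN $1000 now!!! Call 555\t1\t2\t3\n"
)

rows = read_training_set(io.StringIO(SAMPLE))
```

Each entry of rows is a dict keyed by column name, with the three feature columns already converted to integers.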

Transforming SMS text messages to feature vectors using tModelEncoder

This step requires four substeps: transforming the message to words, calculating the weight of a word in each message, downplaying the weight of the irrelevant words in each message, and combining feature vectors.

Procedure

  1. To transform messages to words:
    1. Double-click the tModelEncoder component labeled Tokenize to open its Component view. This component tokenizes the SMS messages into words.
    2. Click the Sync columns button to retrieve the schema from the preceding component.
    3. Click the [...] button next to Edit schema to open the schema editor.
    4. On the output side, click the [+] button to add one row and in the Column column, rename it to sms_tokenizer_words. This column is used to carry the tokenized messages.
    5. In the Type column, select Object for this sms_tokenizer_words row.
    6. Click OK to validate these changes.
    7. In the Transformations table, add one row by clicking the [+] button and then proceed as follows.
      1. In the Input column column, select the column that provides data to be transformed to features. In this scenario, it is sms_contents.
      2. In the Output column column, select the column that carries the features. In this scenario, it is sms_tokenizer_words.
      3. In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Regex tokenizer.
      4. In the Parameters column, enter the parameters you want to customize for use in the algorithm you have selected. In this scenario, enter pattern=\\W;minTokenLength=3.
      Using this transformation, tModelEncoder splits each input message on non-word characters (the \W pattern), keeps only the tokens that contain at least 3 characters, and puts the result of the transformation in the sms_tokenizer_words column. Thus currency symbols, punctuation marks, and short tokens such as a, an or to, as well as numeric values of fewer than three digits, are excluded from this column.
  2. To calculate the weight of a word in each message:
    1. Double-click the tModelEncoder component labeled tf to open its Component view.
    2. Repeat the operations described previously over the tModelEncoder labeled Tokenize to add the sms_tf_vect column of the Vector type to the output schema and define the transformation as displayed in the image above.

      In this transformation, tModelEncoder uses HashingTF to convert the already tokenized SMS messages into fixed-length (15 in this scenario) feature vectors to reflect the importance of a word in each SMS message.

  3. To downplay the weight of the irrelevant words in each message:
    1. Double-click the tModelEncoder component labeled tf_idf to open its Component view.
      In this process, tModelEncoder reduces the weight of the words that appear in too many messages, because such a word, for example the, often brings no meaningful information for text analysis.
    2. Repeat the operations described previously over the tModelEncoder labeled Tokenize to add the sms_tf_idf_vect column of the Vector type to the output schema and define the transformation as displayed in the image above.

      In this transformation, tModelEncoder uses Inverse Document Frequency to downplay the weight of the words that appear in 5 or more messages.

  4. To combine feature vectors:
    1. Double-click the tModelEncoder component labeled features_assembler to open its Component view.
    2. Repeat the operations described previously over the tModelEncoder labeled Tokenize to add the features_vect column of the Vector type to the output schema and define the transformation as displayed in the image above.
      Note that the parameter to be put in the Parameters column is inputCols=sms_tf_idf_vect,num_currency,num_numeric,num_exclamation.
      In this transformation, tModelEncoder combines all feature vectors into one single feature column.
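
The four substeps above can be sketched in plain Python to show what each transformation does to the data. This is an illustration only, not the code tModelEncoder executes. Several details are assumptions: the function names are hypothetical, zlib.crc32 stands in for the hash function Spark uses, tokens are lowercased as Spark's RegexTokenizer does by default, the IDF weight uses the smoothed formula log((n + 1) / (df + 1)), and the two sample messages are invented.

```python
import math
import re
import zlib

NUM_FEATURES = 15  # length of the hashed vectors in this scenario


def tokenize(text, pattern=r"\W", min_token_length=3):
    """Split on the delimiter pattern and drop tokens shorter than min_token_length."""
    return [t for t in re.split(pattern, text.lower()) if len(t) >= min_token_length]


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Count tokens into a fixed-length vector; a hash picks each token's slot."""
    vector = [0.0] * num_features
    for token in tokens:
        # crc32 is a deterministic stand-in hash (assumption, not Spark's hash).
        vector[zlib.crc32(token.encode("utf-8")) % num_features] += 1.0
    return vector


def fit_idf(tf_vectors):
    """Smoothed inverse document frequency for each slot of the hashed vectors."""
    n = len(tf_vectors)
    size = len(tf_vectors[0])
    doc_freq = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(size)]
    return [math.log((n + 1) / (df + 1)) for df in doc_freq]


def assemble(tf_idf_vector, *numeric_features):
    """Concatenate the TF-IDF vector and the numeric columns into one vector."""
    return list(tf_idf_vector) + [float(x) for x in numeric_features]


# Hypothetical rows: (sms_contents, num_currency, num_numeric, num_exclamation).
messages = [
    ("Free prize!!! Claim $5 now", 1, 1, 3),
    ("Are we still meeting for lunch", 0, 0, 0),
]

token_lists = [tokenize(text) for text, *_ in messages]
tf_vectors = [hashing_tf(tokens) for tokens in token_lists]
idf = fit_idf(tf_vectors)
features_vect = [
    assemble([tf * weight for tf, weight in zip(vec, idf)], *extras)
    for vec, (_, *extras) in zip(tf_vectors, messages)
]
```

Each entry of features_vect has 18 values: the 15 TF-IDF slots followed by num_currency, num_numeric and num_exclamation. Because the hashed vectors are so short, different words can share a slot, which is the usual trade-off of the hashing approach.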

Training the model using Random Forest

Procedure

  1. Double-click tRandomForestModel to open its Component view.
  2. From the Label column list, select the column that provides the classes to be used for classification. In this scenario, it is label, which contains two class names: spam for junk messages and ham for normal messages.
  3. From the Features column list, select the column that provides the feature vectors to be analyzed. In this scenario, it is features_vect, which combines all features.
  4. Select the Save the model on file system check box and in the HDFS folder field that is displayed, enter the directory you want to use to store the generated model.
  5. In the Number of trees in the forest field, enter the number of decision trees you want tRandomForestModel to build. To find a suitable value, run the current Job several times with different numbers of trees, compare the evaluation results of the model created on each run, and keep the number that performs best. In this scenario, enter 20.
    An evaluation Job will be presented in one of the following sections.
  6. Leave the other parameters as is.
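
The idea of trying several tree counts and comparing the results can be illustrated with a toy ensemble in plain Python. This is a deliberately simplified sketch, not the algorithm tRandomForestModel runs: one-level decision stumps stand in for full decision trees, and the feature rows, labels and helper names are all hypothetical. It keeps the two ingredients that characterize the approach: each tree is trained on a bootstrap sample with a random subset of features, and the forest predicts by majority vote.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature_indices):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            left_label = majority(left)  # never empty: threshold is an observed value
            right_label = majority(right) if right else left_label
            errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
            if best is None or errors < best[0]:
                best = (errors, (f, threshold, left_label, right_label))
    return best[1]


def train_forest(rows, labels, num_trees, seed=0):
    """Train num_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    num_features = len(rows[0])
    subset_size = max(1, int(num_features ** 0.5))
    forest = []
    for _ in range(num_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        features = rng.sample(range(num_features), subset_size)
        forest.append(
            train_stump([rows[i] for i in sample], [labels[i] for i in sample], features)
        )
    return forest


def predict(forest, row):
    """Majority vote over the stumps."""
    return majority([left if row[f] <= t else right for f, t, left, right in forest])


def accuracy(forest, rows, labels):
    return sum(predict(forest, r) == l for r, l in zip(rows, labels)) / len(rows)


# Hypothetical rows: (num_currency, num_numeric, num_exclamation).
train_rows = [(0, 0, 0), (0, 1, 0), (1, 2, 3), (2, 1, 4), (0, 0, 1), (1, 3, 2), (0, 2, 0), (3, 2, 5)]
train_labels = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "spam"]
eval_rows = [(0, 1, 1), (2, 2, 3)]
eval_labels = ["ham", "spam"]

# Train one model per candidate tree count and record its evaluation accuracy.
results = {
    n: accuracy(train_forest(train_rows, train_labels, n), eval_rows, eval_labels)
    for n in (5, 20, 50)
}
```

The results dictionary maps each candidate tree count to the accuracy measured on the held-out rows, which is the comparison step 5 asks you to make across runs of the Job.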

Selecting the Spark mode

See the procedure in the Getting Started Guide.

Executing the Job to create the classification model

Procedure

Press F6 to run this Job.

Results

The model file is created in the directory you have specified in the tRandomForestModel component.
