
Training the decision tree model

This section explains how to train your decision tree model and how to run the Job on the Hadoop cluster.

Procedure

  1. Add a tDecisionTreeModel component to the workspace.
  2. Connect tModelEncoder to tDecisionTreeModel with a Main row.
  3. Double-click tDecisionTreeModel to open the Basic settings.
  4. In Storage, select the Define a storage configuration component check box and choose the HDFS storage.
  5. Choose the schema you created earlier.
  6. In Features Column, choose MyFeatures.
  7. In Label Column, choose MyLabels.
  8. In Model location, select the Save the model on file system (only for Spark 1.4 or higher) check box and enter the HDFS path where the model will be saved.
    In this example: /user/puccini/machinelearning/decisiontrees/marketing/decisiontree.model.
  9. Leave the default values for the remaining settings. (A minimal Spark ML sketch of the equivalent training and save step follows this procedure.)
    Configuration of the tDecisionTreeModel component.

    Here is the Job configuration.

    A Job using the tHDFSConfiguration, tFileInputDelimited, tModelEncoder, tDecisionTreeModel components.
  10. Click the Run tab and go to Spark Configuration.
  11. Select the Use local mode check box.
  12. If you want to run the Job on the Hadoop cluster:
    1. Clear the Use local mode check box.
    2. Click Spark Configuration.
    3. Add the following Advanced properties.
      Property                           Value
      "spark.driver.extraJavaOptions"    "-Dhdp.version=2.4.0.0-169"
      "spark.yarn.am.extraJavaOptions"   "-Dhdp.version=2.4.0.0-169"
      The value is specific to your Hadoop distribution and version. This tutorial uses Hortonworks 2.4 V3, whose version string is 2.4.0.0-169; if you use a different distribution or version, enter the corresponding value instead. (A sketch of how these properties map to a Spark configuration also follows the procedure.)
      Important: When running the Job on the cluster, it is crucial to ensure unrestricted communication between the two systems. In this example, make sure that the Hortonworks cluster can communicate with your instance of Talend Studio: even though Spark runs on the cluster, it still needs to reference the Spark drivers shipped with Talend. Likewise, if you deploy a Spark Job to a production environment, it runs from a Talend Job server (edge node), and you need to ensure unrestricted communication between that server and the cluster as well.

      For more information on the ports needed by each service, see the Spark Security documentation.

    4. Select the Advanced settings tab.
    5. Select the Use specific JVM arguments check box.
    6. Add a new JVM argument that indicates the version of Hadoop.
      The new JVM argument is the same string you entered as the Value in the Spark Configuration Advanced properties: "-Dhdp.version=2.4.0.0-169".
    7. Select the Basic Run tab, then click Run.
      When the Job completes, a success message is displayed in the console.
    8. Navigate to the HDFS directory, using Ambari in this example, to verify that the model was created and persisted to HDFS. (A sketch for reloading the saved model in a Spark shell follows the procedure as well.)
      Model created in the HDFS directory.
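
To make the training step more concrete, here is a minimal Spark ML sketch in Scala of what steps 1 through 9 amount to. It is not the code Talend generates: the input file name, the raw column names, and the StringIndexer/VectorAssembler choices are placeholders standing in for whatever tFileInputDelimited and tModelEncoder actually produce in your Job. Only the MyFeatures and MyLabels column names and the model path come from this tutorial.

    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DecisionTreeTraining").getOrCreate()

    // Placeholder input: a delimited file with feature columns and a raw label
    // column (the role of tFileInputDelimited). Path and columns are illustrative.
    val raw = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/user/puccini/machinelearning/decisiontrees/marketing/training.csv")

    // Roughly the role of tModelEncoder: index the raw label into MyLabels and
    // assemble the feature columns into a single vector column MyFeatures.
    val indexed = new StringIndexer()
      .setInputCol("label")                                // placeholder raw label column
      .setOutputCol("MyLabels")
      .fit(raw)
      .transform(raw)
    val training = new VectorAssembler()
      .setInputCols(Array("age", "balance", "duration"))   // placeholder feature columns
      .setOutputCol("MyFeatures")
      .transform(indexed)

    // The tDecisionTreeModel step: train with default hyperparameters and save
    // the model to the HDFS location configured in step 8.
    val model = new DecisionTreeClassifier()
      .setFeaturesCol("MyFeatures")
      .setLabelCol("MyLabels")
      .fit(training)
    model.write.overwrite()
      .save("/user/puccini/machinelearning/decisiontrees/marketing/decisiontree.model")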
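The two Advanced properties added in the Spark Configuration are ordinary Spark settings. For reference only (the Studio dialog sets them for you), here is how they would look if set programmatically on a SparkConf; the hdp.version value is the Hortonworks 2.4 V3 string and must match your own distribution.

    import org.apache.spark.SparkConf

    // Equivalent of the Advanced properties added in the Spark Configuration.
    // Replace the hdp.version value with the one that matches your cluster.
    val conf = new SparkConf()
      .setAppName("DecisionTreeTraining")
      .setMaster("yarn")
      .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.4.0.0-169")
      .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4.0.0-169")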
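Beyond browsing the directory in Ambari, you can also confirm the model is usable by loading it back in a Spark shell. This short sketch assumes the Job wrote a Spark ML decision tree classification model at the path used in step 8.

    import org.apache.spark.ml.classification.DecisionTreeClassificationModel

    // Reload the persisted model from HDFS and print a summary of the learned tree.
    val model = DecisionTreeClassificationModel.load(
      "/user/puccini/machinelearning/decisiontrees/marketing/decisiontree.model")
    println(s"depth=${model.depth}, nodes=${model.numNodes}")
    println(model.toDebugString)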
