Machine Learning 101 - Decision Trees
Overview
This hands-on tutorial demonstrates the basics of developing a machine learning routine using Talend and Spark. Specifically, it uses decision tree learning to classify real-life bank marketing data. Upon completion, you will have a working knowledge of how machine learning is integrated into a Talend workflow, along with some reusable code snippets.
The source data used in this tutorial is the Bank Marketing dataset, retrieved from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science). It is available in the public domain and is attributed to: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014.
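For readers new to decision tree learning, the core idea is to pick the split that best separates the classes. The sketch below illustrates this in plain Python, using Gini impurity as the split criterion. It is only an illustration under stated assumptions, not the code that Talend or Spark runs: the function names `gini` and `best_threshold` are hypothetical, and the toy rows (call durations in seconds and a yes/no outcome) are invented for the example rather than taken from the Bank Marketing dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    Hypothetical helper: tries every distinct value except the largest as a
    threshold and keeps the one with the lowest size-weighted impurity.
    Returns (None, impurity_of_all_labels) if no split improves on no split.
    """
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Invented toy rows (assumption): not real records from the dataset.
durations = [60, 90, 120, 300, 450, 600]
subscribed = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_threshold(durations, subscribed)
print(threshold, impurity)
```

A full decision tree learner repeats this search over every feature and then recurses into each resulting subset until a stopping rule, such as a maximum depth, is reached.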
Prerequisites
- Hortonworks Data Platform (HDP) 2.4 installed and configured. You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM). For more information, see Create HDFS Metadata - Hortonworks.
- Basic knowledge of:
  - Hadoop ecosystem tools and technologies.
  - Hadoop Distributed File System (HDFS) and Spark.
- Working knowledge of Talend Studio and Talend Big Data Platform.
- Talend Big Data Platform installed and configured.