Sampling of Machine Learning Repository data
This section details a sample of the data used in this tutorial.
This tutorial is not intended to teach data science or to detail a formal data analysis, but it is helpful to see a sample of the data.
For more information about this dataset, see UCI Machine Learning Repository.
The datasets contains ten variables, nine independent and one dependent:
- Independent: age, jobtype, maritalstatus, educationlevel, indefault, hasmortgage, haspersonalloan, numcampaigncalls ,priorcampaignoutcome
- Dependent: conversion
The independent variables, also known as feature variables, are used to predict an outcome. The dependent variable, or target variable, is what you want to predict. The sampling of data above demonstrates tuples that contain features and a target variable, both of which are needed to train your decision tree model. This type of training is called supervised learning, because the data contains both an output vector of features and a known output value.
The following steps use the training data to build a decision tree model using Spark's Machine Learning Library (MLlib). In simple terms, the goal is to determine how well the features can predict the target variable conversion using the training data, which comprises 1546 data points.
You also need to understand the overall shape and distribution of the data to ensure downstream assumptions are as accurate as possible. The following are summary statistics for the training dataset used in this article.
The levels (yes, no, failure, etc.) are reported for each categorical variable. For numerical data, the quartiles are reported. The target variable conversion has two levels, yes and no, and you can see that no appears a lot more often than yes. This imbalance presents some challenges when building a classifier model like the decision tree you are building. However, these challenges and associated mitigations are out of scope for this tutorial and will not be discussed. For more information, see Decision tree accuracy: effect of unbalanced data.
What needs to be mentioned is that the model you build will predict (conversion = no) as either being true or false. The interpretation of (conversion = no) as false in the context of the model is that (conversion = yes) is true.