In order to evaluate your machine learning models, you need to be able to make sense of the model scores and metrics. In some cases, understanding how each field and value influences the predicted outcome—why something happens—might be more important than making predictions.
Why model scoring is important
The different model scores help you understand the strengths of the model. This increases your confidence in the usability of the model and shows what improvements can be made. If a score is very high or very low, it could indicate an issue with the data being fed to the model.
Scoring a model is challenging because several metrics describe different things about the model. To know if it is a good model, you need to combine business domain knowledge with an understanding of the various scoring metrics and of the data that the model was trained on. A score that looks terrible in one use case might be a great score, and generate a high return on investment, in another use case.
The most important metric: A car analogy
Which metric is most important? That depends on how you plan to use the model. There is not a single metric that can tell you everything you want to know.
As an analogy, think about buying a car. There are many metrics to consider, such as fuel efficiency, horsepower, torque, weight, and acceleration. We might want them all to be great, but we must make trade-offs depending on how we plan to use the car. A commuter might want a car with high fuel efficiency even if it means low torque, while a boat owner might choose high torque even if it means lower fuel efficiency.
A model can be thought of in the same way. We want all of the metrics to be high (and we might be able to improve them with more data and better features), but there are always constraints and trade-offs to be made. Some scores matter more than others depending on what you intend to do with the model.
Is the model a good fit?
Determining whether a model is a good fit for the use case and ready to be put into production ultimately boils down to the question: "Is the model accurate enough to make a positive return on investment without unacceptable consequences?" The following four questions can help you break it down.
Is the model informing a human decision or automating it?
The required accuracy depends on whether you will use the model to automate or inform decisions. For example, a model can be trained to determine how much money employees should make. In this case, accuracy will probably need to be higher if the model automates the decision than if it only informs the decision. If managers use it to discover whether an employee is underpaid or overpaid, they can use their own discretion to determine whether the model is in error.
Is there a quantifiable cost to a false positive or a false negative?
Are you able to quantify the cost of a false outcome? Take that cost into account when you determine the level of accuracy required to consider the model a good fit.
Using the same example, say that the model is simply informing the decision. However, the manager trusts the model and doesn’t give an employee a pay raise because the model indicates that the employee would be overpaid if a raise were given. The employee then resigns to work elsewhere. What was the cost of losing that employee? If the reverse happened, what would the cost have been of wrongly giving a raise?
How much better is the model than random?
For regression problems, determine what the error would be if you always predicted the average value of the target column. How much better is the model compared to that?
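The comparison above can be sketched as follows. This is only an illustration: it uses mean absolute error as the error measure (the text does not prescribe one), the function names and sample values are hypothetical, and for simplicity the average is taken over the same values being evaluated rather than over separate training data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def compare_to_mean_baseline(actual, predicted):
    """Return (baseline error, model error, relative improvement).

    The baseline always predicts the average of the target values.
    Assumes the target values are not all identical, otherwise the
    baseline error is zero and the ratio is undefined.
    """
    mean_value = sum(actual) / len(actual)
    baseline_error = mean_absolute_error(actual, [mean_value] * len(actual))
    model_error = mean_absolute_error(actual, predicted)
    improvement = 1 - model_error / baseline_error
    return baseline_error, model_error, improvement


# Hypothetical target values and model predictions.
actual = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 79.0]

baseline_error, model_error, improvement = compare_to_mean_baseline(actual, predicted)
print(f"Baseline error: {baseline_error:.1f}")
print(f"Model error: {model_error:.1f}")
print(f"Improvement over baseline: {improvement:.0%}")
```

An improvement close to zero means the model is barely better than always guessing the average. A negative value means it is worse.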
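For classification problems, take the rate of the positive class squared and add it to the rate of the negative class squared. The result is the random accuracy: the accuracy you would expect from guessing each class at random in proportion to how often it occurs. How much better is the model accuracy than that?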
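As a rough sketch, the calculation for a binary classification problem looks like this. The function name, the 30 percent positive rate, and the 80 percent model accuracy are hypothetical values chosen for illustration.

```python
def random_accuracy(positive_rate):
    """Expected accuracy when guessing classes in proportion to their rates."""
    negative_rate = 1.0 - positive_rate
    return positive_rate ** 2 + negative_rate ** 2


# Hypothetical values: 30% of rows are positive, the model is 80% accurate.
positive_rate = 0.30
model_accuracy = 0.80

baseline = random_accuracy(positive_rate)
print(f"Random accuracy: {baseline:.2f}")
print(f"Model accuracy above random: {model_accuracy - baseline:.2f}")
```

With these numbers, random accuracy is 0.09 + 0.49 = 0.58, so the model is 0.22 above random. For perfectly balanced classes, random accuracy is 0.5. The more imbalanced the classes, the higher it becomes.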
Is the model better than making an ultimatum?
Depending on whether there is a cost associated with errors, consider whether the model is better than an ultimatum, that is, a single blanket rule applied to every case. For example, say that a firm offers free consultations that are expensive and time-consuming ($6,000 each) but earns $60,000 when a deal closes. The firm currently operates under the assumption that 100 percent of consultations will close. However, it would make more profit if it could determine which consultations to do and which to skip. How accurate does the model need to be for the firm to use the model output instead of the ultimatum that 100 percent of the deals will close?
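One way to approach that question is to compare profit under the ultimatum with profit when the firm only does the consultations that the model predicts will close. The sketch below uses the $6,000 and $60,000 figures from the example. The function names, the number of prospects, the number of deals that would close, and the model outcomes are hypothetical.

```python
CONSULTATION_COST = 6_000   # cost of one free consultation (from the example)
DEAL_REVENUE = 60_000       # revenue when a deal closes (from the example)


def profit_with_ultimatum(closers, non_closers):
    """Profit when the firm does a consultation for every prospect."""
    consultations = closers + non_closers
    return closers * DEAL_REVENUE - consultations * CONSULTATION_COST


def profit_with_model(true_positives, false_positives):
    """Profit when the firm only consults prospects the model predicts will close.

    true_positives: predicted to close, and would close.
    false_positives: predicted to close, but would not close.
    Deals the model misses (false negatives) earn nothing and cost nothing.
    """
    consultations = true_positives + false_positives
    return true_positives * DEAL_REVENUE - consultations * CONSULTATION_COST


# Hypothetical scenario: 100 prospects, 20 of whom would close a deal.
print(profit_with_ultimatum(closers=20, non_closers=80))

# Hypothetical model A: finds 18 of the 20 closers, with 10 false positives.
print(profit_with_model(true_positives=18, false_positives=10))

# Hypothetical model B: no false positives, but finds only 10 of the 20 closers.
print(profit_with_model(true_positives=10, false_positives=0))
```

In this scenario, model A beats the ultimatum, while model B does worse despite never wasting a consultation. Each missed deal forfeits $54,000 in net profit, while each avoided consultation saves only $6,000, so one missed deal cancels out nine avoided consultations. Accuracy alone therefore does not settle the question. The balance of false negatives and false positives matters.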