A Framework to Think About Forecast Accuracy
Since one of the main values of forecasting is to support better planning and decision-making, it is understandable to think “We want 100% accuracy, and if that’s impossible, then at least 99%!”. This “magical number” threshold, while a good thing to aspire to, can hold you back from capturing immediate value from the current iteration of your forecasting model.
A better framework for thinking about forecast accuracy is being “less wrong” than the alternatives. If the machine learning model is closer to the truth than the human forecast or the baseline model (regardless of whether it is 80% accurate or 99% accurate), then you can act on that information to make “less wrong” decisions. The value of these better decisions adds up over time, whether or not they are perfect.
It turns out that some of the most valuable things to forecast are also the hardest to forecast, because of their inherent volatility and complexity. The math just needs to be done to determine whether the return on the forecasting investment is justified.
The Different Kinds of Accuracy
Common but problematic forecast performance metrics
% Accuracy
If you are forecasting one item, and it carries a lot of importance, this metric may be fine. But if you are forecasting multiple items and want to know how you’re doing across all forecasts in general, % accuracy can be problematic. If the distribution of value across the items you’re forecasting is skewed (many small-value items and a few large-value items), then the average % accuracy can be dominated by the many small-value items.
% Error
For similar reasons, % error is not a great choice if you are forecasting multiple items.
A less wrong performance metric
Weighted Mean Absolute Percent Error (WMAPE)
While this is a lot of words, it’s a simple concept. It measures error, so a lower number is better, and it considers performance across multiple items but weights them according to their value. Let’s imagine you’re forecasting 4 products: 3 of them are small and do $1 a month, and the fourth does $100 a month. Let’s say you have 50% accuracy (50% error) on the 3 small items and 90% accuracy (10% error) on the $100 item. WMAPE weights the $100 item by its higher value, so your final WMAPE is much closer to 10% than it is to 50%.
If you were using simple Mean Percent Error, the final number would be much closer to 50% error. Being more accurate on the big items has the chance to save and make more money than being more accurate on the small items, so the big items should be given more importance in evaluation.
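To make the arithmetic concrete, here is a minimal Python sketch of that four-product example. The specific numbers are the ones above; the weighting logic is the standard WMAPE calculation, total absolute error divided by total actual value.

```python
import numpy as np

# Four-product example: three $1/month items at 50% error,
# one $100/month item at 10% error.
actuals = np.array([1.0, 1.0, 1.0, 100.0])      # monthly dollar value per product
pct_error = np.array([0.50, 0.50, 0.50, 0.10])  # absolute percent error per product
abs_error = actuals * pct_error                 # absolute dollar error per product

# WMAPE: total absolute error divided by total actual value.
# High-value items contribute more to both numerator and denominator.
wmape = abs_error.sum() / actuals.sum()

# Simple (unweighted) mean percent error treats every product equally.
mean_pct_error = pct_error.mean()

print(f"WMAPE:              {wmape:.1%}")           # ~11.2%, close to the big item's 10%
print(f"Mean percent error: {mean_pct_error:.1%}")  # 40.0%, pulled toward the small items
```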
Should expected “accuracy” be one number, or a range you expect to fall in?
If you knew a forecast had a 5% WMAPE, you might confidently make a bold decision based on that forecast. What if you had the additional information that, while the WMAPE was 5% on average, it ranged from 2% to 20%? Understanding that variance in forecast performance might temper a bold decision. Conversely, if the range of WMAPE was 2% to 8%, you might still make the bold decision, but now with more assurance.
How to get a fair estimate of expected “accuracy” so you can meter your decision making
The gold standard for estimating model performance is time series cross validation, as seen below. You train the model on older data and forecast into the future, and you save these multiple back-tests to calculate the WMAPE. The key is that, in any given iteration of the validation, the model should never get to train on data it will be tested on. By preserving the mystery of the future data, you more accurately simulate how the model will perform in production. The “test” set should mirror the distance you want to forecast into the future in production.
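Here is a minimal sketch of that procedure using scikit-learn’s TimeSeriesSplit. The monthly series, the linear model, the number of folds, and the 6-month horizon are all illustrative assumptions, not a recommended setup; the point is that each fold trains only on data that precedes its test window, and that the fold-level WMAPEs give you both an average and a range.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression

def wmape(actual, forecast):
    """Weighted MAPE: total absolute error divided by total actual value."""
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

# Illustrative monthly series with trend, seasonality, and noise (assumption).
rng = np.random.default_rng(42)
n_months = 60
t = np.arange(n_months)
y = 100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, n_months)
X = t.reshape(-1, 1)

# Each fold trains only on months that precede its test window,
# so the model never sees the data it is scored on.
# test_size mirrors the production horizon (here, 6 months ahead).
fold_wmapes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5, test_size=6).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    forecast = model.predict(X[test_idx])
    fold_wmapes.append(wmape(y[test_idx], forecast))

# Report the average and the range, not just a single number.
print("WMAPE per fold:", [f"{w:.1%}" for w in fold_wmapes])
print(f"Mean WMAPE: {np.mean(fold_wmapes):.1%} "
      f"(range {min(fold_wmapes):.1%} to {max(fold_wmapes):.1%})")
```

Reporting the per-fold range alongside the mean is what lets you temper or strengthen a decision the way the previous section describes.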
