Forecasting: Exploratory Data Analysis

jamesyoung3

4 years ago

Forecasting EDA

James Young

Summary

This notebook demonstrates an example exploratory data analaysis that would be conducted at the start of a forecasting project, that is, after the data has been gathered, cleaned, and verified to be accurate.

First, we want to learn what it is we are forecasting. We want to learn the dimensions of the data (unique time series, observations, frequency of observations, and structural relationships). We also want to look at the distribution of data at these different levels. Since forecasting is a time-series task, we also want to visualize the sales of these series over time. While working through this exploration we want to look for trends, patterns, and things that seem out of place and might need further questioning.

The findings of this EDA and next steps in a forecasting project are discussed at the end.

Requirements

These are the packages used in this EDA example.

library(tidyverse)
library(ggplot2)
library(patchwork)

The Data

This data represents the sales of items at Walmart over different states, stores, categories, and departments. Below we start to gain understanding of the structure of this data, with dimensions of 30,490 rows and 1,947 columns. We will continue to explore what these rows and columns represent below.

## [1] 30490  1947

When glimpsing the first 10 columns we see the data is composed of integers and factors. The factors are what make the hierarchical levels and groupings within those levels, and the integers show the start of the daily sales for those factors. The id factors have their counts of unique instances shown in the table below.

## Rows: 30,490
## Columns: 10
## $ id       <fct> HOBBIES_1_001_CA_1_evaluation, HOBBIES_1_002_CA_1_evaluati...
## $ item_id  <fct> HOBBIES_1_001, HOBBIES_1_002, HOBBIES_1_003, HOBBIES_1_004...
## $ dept_id  <fct> HOBBIES_1, HOBBIES_1, HOBBIES_1, HOBBIES_1, HOBBIES_1, HOB...
## $ cat_id   <fct> HOBBIES, HOBBIES, HOBBIES, HOBBIES, HOBBIES, HOBBIES, HOBB...
## $ store_id <fct> CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1...
## $ state_id <fct> CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA...
## $ d_1      <int> 0, 0, 0, 0, 0, 0, 0, 12, 2, 0, 0, 0, 0, 0, 4, 5, 0, 0, 0, ...
## $ d_2      <int> 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, ...
## $ d_3      <int> 0, 0, 0, 0, 0, 0, 0, 0, 7, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0...
## $ d_4      <int> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0...

id’s	item_id’s	dept_id’s	cat_id’s	store_id’s	state_id’s
30490	3049	7	3	10	3

Looking at the last few columns we see that they are more intger type columns. These integer columns each represent a day. The values in these integer columns likely represent units sold rather than dollar value, but it would be useful to confirm with the data owner.

## Rows: 30,490
## Columns: 8
## $ d_1934 <int> 0, 2, 2, 0, 0, 0, 0, 6, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 17, 0,...
## $ d_1935 <int> 0, 1, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, ...
## $ d_1936 <int> 0, 1, 0, 0, 1, 1, 0, 15, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0,...
## $ d_1937 <int> 0, 0, 0, 1, 0, 0, 1, 5, 0, 1, 0, 1, 0, 2, 0, 5, 0, 0, 7, 0, ...
## $ d_1938 <int> 3, 0, 2, 3, 0, 0, 0, 4, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 5, 0, ...
## $ d_1939 <int> 3, 0, 3, 0, 2, 5, 1, 1, 0, 0, 0, 0, 0, 1, 4, 7, 1, 1, 5, 0, ...
## $ d_1940 <int> 0, 0, 0, 2, 1, 2, 1, 40, 1, 0, 0, 1, 1, 1, 5, 12, 2, 1, 14, ...
## $ d_1941 <int> 1, 0, 1, 6, 0, 0, 0, 32, 0, 1, 0, 0, 1, 3, 4, 6, 0, 0, 11, 0...

A quick check for missing values reveals there are not any observations with missing data. That makes this a little easier.

NAs <- data %>% filter(is.na(.))
dim(NAs)

## [1]    0 1947

Checking for zero, the first day alone has over 23,000 of the items have zeros out of the 30,490 total items.

zeros <- data %>% filter(d_1 == 0)
dim(zeros)

## [1] 23511  1947

The function below will allow rolling-up to the desired level to view the distribution of sales between between different groups.

hist_plot <- function(level) {
  p1 <- data %>% 
        group_by(get(level)) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        mutate(mean_units = rowMeans(.[,-1]),
               log_mean_units = log(mean_units)) %>%
        ungroup() %>%
        select(mean_units) %>% 
        ggplot(aes(x=mean_units)) + 
        geom_histogram(color="black", fill="white") +
        labs(title = level)
p2 <- data %>% 
        group_by(get(level)) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        mutate(mean_units = rowMeans(.[,-1]),
               log_mean_units = log(mean_units)) %>%
        ungroup() %>% 
        select(log_mean_units) %>% 
        ggplot(aes(x=log_mean_units)) + 
        geom_histogram(color="black", fill="white") +
        labs(title = level)
p1 + p2
}

Here I use my function to create histograms of sales performance at the desired levels. The mean_units histogram represents the average units sold per day for a given group. The log_mean_units histogram is useful, especially at lower levels, for normalizing the distribution and making it easier to view at that level. At higher levels such as store_id, dept_id, and cat_id, there is not as extreme of a spread as there is at the item_id level. These item_id observations are more granular

hist_plot("store_id")

hist_plot("dept_id")

hist_plot("cat_id")

Looking at Sales Over Time

Since we are trying to forecast future sales, another useful way to look at this data is plotting sales of different groups over time. First, I will transition the data into long format, making it easier to work with. Second, I will create a function that allows us to compress the the observation frequency from daily to any desired interval greater than a day. This can allow smoothing of the noise to see more of the signal.

long <- data %>% 
        group_by(state_id, store_id, cat_id, dept_id) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        pivot_longer(cols = !ends_with("id"), names_to = "day", values_to = "sales") %>% 
        mutate(day = as.numeric(gsub("d_", "", day)))
time_series <- function(group1,  x, var) {
long %>%
group_by(get(group1), state_id, gr=cut(day, breaks= seq(1, 1941, by = x), labels = F) ) %>%
summarise(sales= sum(sales)) %>%
arrange(as.numeric(gr)) %>%
filter(!is.na(gr)) %>%
ungroup() %>%
mutate( group1 = get(group1)) %>%
ggplot(aes(x=gr, y=sales, group=group1, colour= group1 )) +
geom_line() +
facet_grid(state_id~.) +
labs(x = var) +
scale_colour_discrete(group1)
}

State Totals

time_series('state_id', 365, "annual")

time_series('state_id', 180, "bi-annual")

time_series('state_id', 30, "monthly")

time_series('state_id', 7, "weekly")

time_series('state_id', 1, "daily")

Department Totals By State

time_series('dept_id', 365, "annual")

time_series('dept_id', 180, "bi-annual")

time_series('dept_id', 30, "monthly")

time_series('dept_id', 7, "weekly")

time_series('dept_id', 1, "daily")

Category Totals By State

time_series('cat_id', 365, "annual")

time_series('cat_id', 180, "bi-annual")

time_series('cat_id', 30, "monthly")

time_series('cat_id', 7, "weekly")

time_series('cat_id', 1, "daily")

Getting Down to the More Granular Levels

long2 <- data %>% 
        group_by(item_id) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        pivot_longer(cols = !ends_with("id"), names_to = "day", values_to = "sales") %>% 
        mutate(day = as.numeric(gsub("d_", "", day)))
nrow(long2)

## [1] 5918109

long3 <- semi_join(long2, long2 %>% group_by(item_id) %>% 
  summarise(tot = sum(sales)) %>% 
  mutate(rank = rank(tot)) %>% 
  filter(rank < 5) )
time_series2 <- function(group1,   x, var) {
long3 %>% 
group_by(get(group1),  gr=cut(day, breaks= seq(1, 1941, by = x), labels = F) ) %>% 
     summarise(sales= sum(sales)) %>%
     arrange(as.numeric(gr)) %>% 
         filter(!is.na(gr)) %>%
    ungroup() %>% 
    mutate(item_id = `get(group1)`) %>% 
    ggplot(aes(x=gr, y=sales, group=item_id, colour= item_id )) +
    geom_line() +
    labs(x = var) 

}
time_series2('item_id', 7, "weekly")

time_series2('item_id', 30, "monthly")

time_series2('item_id', 180 , "bi-annualy")

Conclusions

Exploratory data analysis is where I like to start in data science projects. Here are the findings from this EDA.

Data Structure

There is a grouping or hierarchical structure to this sales data, depending on how the end-user wants it organized. Unique time series at each level were discussed in the exploration above.
The label we are forecasting seem to be in units sold rather than dollar value, although it would be good to confirm with the data owner.
The measurement rate is daily data, we can compress this data (as we did in some plots) to longer intervals, such as weekly or monthly, but additional challenges and issues would prevent easily interpolating this data to forecast at shorter intervals such as hourly.

Histogram Plot Insights

When plotting histograms of the mean daily units sold, I found that at the more granular levels (item_id) there are a few products that have much higher sales in units than the bulk of items. This is common to have a few high performers and proportionally more low performers. This doesn’t necessarily imply a difference in how well these different performers can be forecast.
The difference in sales between groups decreases as we roll-up to higher levels.

Time Series Plot Insights

By looking at different levels and different time aggregations I was able to visualize trends and seasonality over different time-spans.
We can also notice that in the daily data there seems to be a drop to 0 units sold at reccurent intervals.

Hypothetical Next Steps

Set Expectations and Consider Possibilities
- It is difficult to know ahead of time how accurate a forecast can be, a base model must be built first to find a starting point. Continuing work can be done after the base model is built to further improve upon the starting point.
- Lay out the possibilities… eg. “This is a grouped/hierarchical structure. Based on the observation granularity, there are”x” levels and “x” groups at each level that can be forecasted. We can forecast at these different time horizons (1 month ahead, 1 quarter ahead, etc.). Are any of these possibilities useful?”
- If this is useful then move on to the following steps.
Ask the Questions Developed in the EDA
- Why does data go to zero approximately once every year?
- Do you have spread sheets tracking product lifecycle transitions?
- Do you have additional data such as promotional events that can be linked to their respective items at the appropriate times?
- What does zero mean? Is that missing data or a true zero? If it is missing, what are your thoughts on dealing with it? Do you want to experiment and see what works emperically best on validation data?
Determine the Goals
- Working with the end-user(s), determine…
- What levels are important to forecast?
- Is accuracy at some levels more important than other levels?
- What metric(s) do you want to use to define accuracy? MAPE, MPE, MAE, ME?
- What is/are the desired forecasting period(s)?
- How frequently are forecast updates desired?
- How should the results be delivered (spreadsheet, dashboard, other)?
Creating a base model
- The first step is to get a functioning base model with acceptable accuracy.
- Common first models include models such as ets, hw, ses, arima, arfima, bats, and tbats.
- These models should be tested against each other and also against the control models such as random walk, mean, naive, and/or seasonal naive model to validate that the more sophisticated models consistently deliver better performance than the control models.
- Select the best model between these options to use as a base model in production.
Deliver The Results
- Meeting the agreed upon parameters (time horizon, update frequency, other), deliver the forecasting results.
Continuous Improvement
- After a functioning base model is in production, constructing more accurate models is the next step. This is where creativity (but still with rigorous validation) comes in.
- One approach that frequently improves accuracy is ensembling of models.
- There can also be post-forecast adjustments from the base model outputs such as applying xgboost to the forecasts by taking advantage of structural relationships with other forecasts.
- Filters and flags can be applied to deal with product life cycle changes.
- Newer forecasting models can be tried such as prophet, nnetar, or lstm amongst others.
- External (how is GDP trending?) and internal (item promotion) data can be used as regressors in models like arima and nnetar.
- Building systems that adaptively update based on trailing performance indicators.