Site icon James Young

Forecasting: Exploratory Data Analysis

Forecasting EDA

James Young

Summary

This notebook demonstrates an example exploratory data analaysis that would be conducted at the start of a forecasting project, that is, after the data has been gathered, cleaned, and verified to be accurate.

First, we want to learn what it is we are forecasting. We want to learn the dimensions of the data (unique time series, observations, frequency of observations, and structural relationships). We also want to look at the distribution of data at these different levels. Since forecasting is a time-series task, we also want to visualize the sales of these series over time. While working through this exploration we want to look for trends, patterns, and things that seem out of place and might need further questioning.

The findings of this EDA and next steps in a forecasting project are discussed at the end.

Requirements

These are the packages used in this EDA example.

library(tidyverse)
library(ggplot2)
library(patchwork)

The Data

This data represents the sales of items at Walmart over different states, stores, categories, and departments. Below we start to gain understanding of the structure of this data, with dimensions of 30,490 rows and 1,947 columns. We will continue to explore what these rows and columns represent below.

## [1] 30490  1947

When glimpsing the first 10 columns we see the data is composed of integers and factors. The factors are what make the hierarchical levels and groupings within those levels, and the integers show the start of the daily sales for those factors. The id factors have their counts of unique instances shown in the table below.

## Rows: 30,490
## Columns: 10
## $ id       <fct> HOBBIES_1_001_CA_1_evaluation, HOBBIES_1_002_CA_1_evaluati...
## $ item_id  <fct> HOBBIES_1_001, HOBBIES_1_002, HOBBIES_1_003, HOBBIES_1_004...
## $ dept_id  <fct> HOBBIES_1, HOBBIES_1, HOBBIES_1, HOBBIES_1, HOBBIES_1, HOB...
## $ cat_id   <fct> HOBBIES, HOBBIES, HOBBIES, HOBBIES, HOBBIES, HOBBIES, HOBB...
## $ store_id <fct> CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1, CA_1...
## $ state_id <fct> CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA, CA...
## $ d_1      <int> 0, 0, 0, 0, 0, 0, 0, 12, 2, 0, 0, 0, 0, 0, 4, 5, 0, 0, 0, ...
## $ d_2      <int> 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, ...
## $ d_3      <int> 0, 0, 0, 0, 0, 0, 0, 0, 7, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0...
## $ d_4      <int> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0...
id’sitem_id’sdept_id’scat_id’sstore_id’sstate_id’s
30490304973103

Looking at the last few columns we see that they are more intger type columns. These integer columns each represent a day. The values in these integer columns likely represent units sold rather than dollar value, but it would be useful to confirm with the data owner.

## Rows: 30,490
## Columns: 8
## $ d_1934 <int> 0, 2, 2, 0, 0, 0, 0, 6, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 17, 0,...
## $ d_1935 <int> 0, 1, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, ...
## $ d_1936 <int> 0, 1, 0, 0, 1, 1, 0, 15, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0,...
## $ d_1937 <int> 0, 0, 0, 1, 0, 0, 1, 5, 0, 1, 0, 1, 0, 2, 0, 5, 0, 0, 7, 0, ...
## $ d_1938 <int> 3, 0, 2, 3, 0, 0, 0, 4, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 5, 0, ...
## $ d_1939 <int> 3, 0, 3, 0, 2, 5, 1, 1, 0, 0, 0, 0, 0, 1, 4, 7, 1, 1, 5, 0, ...
## $ d_1940 <int> 0, 0, 0, 2, 1, 2, 1, 40, 1, 0, 0, 1, 1, 1, 5, 12, 2, 1, 14, ...
## $ d_1941 <int> 1, 0, 1, 6, 0, 0, 0, 32, 0, 1, 0, 0, 1, 3, 4, 6, 0, 0, 11, 0...

A quick check for missing values reveals there are not any observations with missing data. That makes this a little easier.

NAs <- data %>% filter(is.na(.))
dim(NAs)
## [1]    0 1947

Checking for zero, the first day alone has over 23,000 of the items have zeros out of the 30,490 total items.

zeros <- data %>% filter(d_1 == 0)
dim(zeros)
## [1] 23511  1947

The function below will allow rolling-up to the desired level to view the distribution of sales between between different groups.

hist_plot <- function(level) {
  p1 <- data %>% 
        group_by(get(level)) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        mutate(mean_units = rowMeans(.[,-1]),
               log_mean_units = log(mean_units)) %>%
        ungroup() %>%
        select(mean_units) %>% 
        ggplot(aes(x=mean_units)) + 
        geom_histogram(color="black", fill="white") +
        labs(title = level)
p2 <- data %>% 
        group_by(get(level)) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        mutate(mean_units = rowMeans(.[,-1]),
               log_mean_units = log(mean_units)) %>%
        ungroup() %>% 
        select(log_mean_units) %>% 
        ggplot(aes(x=log_mean_units)) + 
        geom_histogram(color="black", fill="white") +
        labs(title = level)
p1 + p2
}

Here I use my function to create histograms of sales performance at the desired levels. The mean_units histogram represents the average units sold per day for a given group. The log_mean_units histogram is useful, especially at lower levels, for normalizing the distribution and making it easier to view at that level. At higher levels such as store_id, dept_id, and cat_id, there is not as extreme of a spread as there is at the item_id level. These item_id observations are more granular

hist_plot("store_id")
hist_plot("dept_id")
hist_plot("cat_id")

Looking at Sales Over Time

Since we are trying to forecast future sales, another useful way to look at this data is plotting sales of different groups over time. First, I will transition the data into long format, making it easier to work with. Second, I will create a function that allows us to compress the the observation frequency from daily to any desired interval greater than a day. This can allow smoothing of the noise to see more of the signal.

long <- data %>% 
        group_by(state_id, store_id, cat_id, dept_id) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        pivot_longer(cols = !ends_with("id"), names_to = "day", values_to = "sales") %>% 
        mutate(day = as.numeric(gsub("d_", "", day)))
time_series <- function(group1,  x, var) {
long %>%
group_by(get(group1), state_id, gr=cut(day, breaks= seq(1, 1941, by = x), labels = F) ) %>%
summarise(sales= sum(sales)) %>%
arrange(as.numeric(gr)) %>%
filter(!is.na(gr)) %>%
ungroup() %>%
mutate( group1 = get(group1)) %>%
ggplot(aes(x=gr, y=sales, group=group1, colour= group1 )) +
geom_line() +
facet_grid(state_id~.) +
labs(x = var) +
scale_colour_discrete(group1)
}

State Totals

time_series('state_id', 365, "annual")
time_series('state_id', 180, "bi-annual")
time_series('state_id', 30, "monthly")
time_series('state_id', 7, "weekly")
time_series('state_id', 1, "daily")

Department Totals By State

time_series('dept_id', 365, "annual")
time_series('dept_id', 180, "bi-annual")
time_series('dept_id', 30, "monthly")
time_series('dept_id', 7, "weekly")
time_series('dept_id', 1, "daily")

Category Totals By State

time_series('cat_id', 365, "annual")
time_series('cat_id', 180, "bi-annual")
time_series('cat_id', 30, "monthly")
time_series('cat_id', 7, "weekly")
time_series('cat_id', 1, "daily")

Getting Down to the More Granular Levels

long2 <- data %>% 
        group_by(item_id) %>% 
        summarise(across(where(is.numeric), sum, na.rm = T)) %>% 
        pivot_longer(cols = !ends_with("id"), names_to = "day", values_to = "sales") %>% 
        mutate(day = as.numeric(gsub("d_", "", day)))
nrow(long2)
## [1] 5918109
long3 <- semi_join(long2, long2 %>% group_by(item_id) %>% 
  summarise(tot = sum(sales)) %>% 
  mutate(rank = rank(tot)) %>% 
  filter(rank < 5) )
time_series2 <- function(group1,   x, var) {
long3 %>% 
group_by(get(group1),  gr=cut(day, breaks= seq(1, 1941, by = x), labels = F) ) %>% 
     summarise(sales= sum(sales)) %>%
     arrange(as.numeric(gr)) %>% 
         filter(!is.na(gr)) %>%
    ungroup() %>% 
    mutate(item_id = `get(group1)`) %>% 
    ggplot(aes(x=gr, y=sales, group=item_id, colour= item_id )) +
    geom_line() +
    labs(x = var) 

}
time_series2('item_id', 7, "weekly")
time_series2('item_id', 30, "monthly")
time_series2('item_id', 180 , "bi-annualy")

Conclusions

Exploratory data analysis is where I like to start in data science projects. Here are the findings from this EDA.

Data Structure

Histogram Plot Insights

Time Series Plot Insights

Hypothetical Next Steps

Exit mobile version