Forecasting With Exogenous Regressors
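Consider a multivariate time series X, in which one variable is the target we wish to forecast.
Exogenous regressors Y are additional variables whose values we know a priori or control ourselves, so their future values can be supplied to the model at forecast time.
For example, one can consider the task of predicting the sales of a specific item at a store. Endogenous variables X might include the sales themselves (the target), along with the temperature, the consumer price index, and the unemployment rate. Exogenous variables Y might include holidays and markdowns, which the store knows or sets in advance.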
To be more concrete, let’s show this with some real data.
[1]:
# This is the same dataset used in the custom dataset tutorial
import os
from ts_datasets.forecast import CustomDataset
csv = os.path.join("..", "..", "data", "walmart", "walmart_mini.csv")
dataset = CustomDataset(rootdir=csv, index_cols=["Store", "Dept"], test_frac=0.10)
ts, md = dataset[0]
display(ts)
Date | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|
2010-02-05 | 24924.50 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False |
2010-02-12 | 46039.49 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True |
2010-02-19 | 41595.55 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False |
2010-02-26 | 19403.54 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False |
2010-03-05 | 21827.90 | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-09-28 | 18947.81 | 76.08 | 3.666 | 3666.27 | 7.64 | 1.65 | 1417.96 | 4744.28 | 222.981658 | 6.908 | False |
2012-10-05 | 21904.47 | 68.55 | 3.617 | 8077.89 | NaN | 18.22 | 3617.43 | 3626.14 | 223.181477 | 6.573 | False |
2012-10-12 | 22764.01 | 62.99 | 3.601 | 2086.18 | NaN | 8.11 | 602.36 | 5926.45 | 223.381296 | 6.573 | False |
2012-10-19 | 24185.27 | 67.97 | 3.594 | 950.33 | NaN | 4.93 | 80.25 | 2312.85 | 223.425723 | 6.573 | False |
2012-10-26 | 27390.81 | 69.16 | 3.506 | 2585.85 | 31.75 | 6.00 | 1057.16 | 1305.01 | 223.444251 | 6.573 | False |
143 rows × 11 columns
[2]:
from merlion.utils import TimeSeries
# Get the endogenous variables X and split them into train & test
endog = ts[["Weekly_Sales", "Temperature", "CPI", "Unemployment"]]
train = TimeSeries.from_pd(endog[md.trainval])
test = TimeSeries.from_pd(endog[~md.trainval])
# Get the exogenous variables Y
exog = TimeSeries.from_pd(ts[["IsHoliday", "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]])
Here, our task is to predict the weekly sales. We would like our model to also account for variables which may have an impact on consumer demand (namely temperature, consumer price index, and unemployment), as knowledge of these variables could improve the quality of our sales forecast. This would be a multivariate forecasting problem, which is covered in a separate tutorial.
In principle, we could add markdowns and holidays to the multivariate model. However, as a retailer, we know a priori which days are holidays, and we ourselves control the markdowns. In many cases, we can get better forecasts by providing the future values of these variables in addition to the past values. Moreover, we may wish to model how changing the future markdowns would change the future sales. This is why we should model these variables as exogenous regressors instead.
All Merlion forecasters expose an API that accepts exogenous regressors at both training and inference time, though only some models actually make use of them. Using the feature is as easy as passing the optional argument exog_data to both train() and forecast(). We show how to use the feature for the popular Prophet model below, and demonstrate that adding exogenous regressors can improve the quality of the forecast.
[3]:
from merlion.evaluate.forecast import ForecastMetric
from merlion.models.forecast.prophet import Prophet, ProphetConfig
# Train a model without exogenous data
model = Prophet(ProphetConfig(target_seq_index=0))
model.train(train)
pred, err = model.forecast(test.time_stamps)
smape = ForecastMetric.sMAPE.value(test, pred, target_seq_index=model.target_seq_index)
print(f"sMAPE (w/o exog) = {smape:.2f}")
# Train a model with exogenous data
exog_model = Prophet(ProphetConfig(target_seq_index=0))
exog_model.train(train, exog_data=exog)
exog_pred, exog_err = exog_model.forecast(test.time_stamps, exog_data=exog)
exog_smape = ForecastMetric.sMAPE.value(test, exog_pred, target_seq_index=exog_model.target_seq_index)
print(f"sMAPE (w/ exog) = {exog_smape:.2f}")
12:31:47 - cmdstanpy - INFO - Chain [1] start processing
12:31:47 - cmdstanpy - INFO - Chain [1] done processing
12:31:48 - cmdstanpy - INFO - Chain [1] start processing
12:31:48 - cmdstanpy - INFO - Chain [1] done processing
sMAPE (w/o exog) = 8.21
sMAPE (w/ exog) = 7.67
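To make the reported numbers easier to interpret, here is a minimal pure-Python sketch of sMAPE, assuming the common definition 200/T * sum(|y - y_hat| / (|y| + |y_hat|)), which ranges from 0 to 200 (lower is better). The function name smape, the toy values, and the handling of a zero denominator are assumptions made for illustration; Merlion's ForecastMetric.sMAPE may treat edge cases differently.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error on a 0-200 scale (illustrative sketch)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        denom = abs(y) + abs(y_hat)
        # Assumption: a term with a zero denominator (0 vs. 0) counts as zero error
        total += abs(y - y_hat) / denom if denom else 0.0
    return 200.0 * total / len(actual)


# Toy values, not taken from the Walmart dataset
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 180.0, 300.0]
score = smape(actual, forecast)
print(f"sMAPE = {score:.2f}")
```

Under this definition, the drop from 8.21 to 7.67 above corresponds to a little over half a point on a 200-point scale.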
Before we wrap up this tutorial, we note that the exogenous variables contain a lot of missing data:
[4]:
display(exog.to_pd())
time | IsHoliday | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 |
---|---|---|---|---|---|---|
2010-02-05 | 0.0 | NaN | NaN | NaN | NaN | NaN |
2010-02-12 | 1.0 | NaN | NaN | NaN | NaN | NaN |
2010-02-19 | 0.0 | NaN | NaN | NaN | NaN | NaN |
2010-02-26 | 0.0 | NaN | NaN | NaN | NaN | NaN |
2010-03-05 | 0.0 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... |
2012-09-28 | 0.0 | 3666.27 | 7.64 | 1.65 | 1417.96 | 4744.28 |
2012-10-05 | 0.0 | 8077.89 | NaN | 18.22 | 3617.43 | 3626.14 |
2012-10-12 | 0.0 | 2086.18 | NaN | 8.11 | 602.36 | 5926.45 |
2012-10-19 | 0.0 | 950.33 | NaN | 4.93 | 80.25 | 2312.85 |
2012-10-26 | 0.0 | 2585.85 | 31.75 | 6.00 | 1057.16 | 1305.01 |
143 rows × 6 columns
Behind the scenes, Merlion models will apply an optional exog_transform to the exogenous variables, and they will then resample the exogenous variables to the same timestamps as the endogenous variables. This resampling is achieved using the exog_missing_value_policy and exog_aggregation_policy, which can be specified in the config of any model which accepts exogenous regressors. We can see the default values for each of these parameters by inspecting the config:
[5]:
print(f"Default exog_transform: {exog_model.config.exog_transform}")
print(f"Default exog_missing_value_policy: {exog_model.config.exog_missing_value_policy}")
print(f"Default exog_aggregation_policy: {exog_model.config.exog_aggregation_policy}")
Default exog_transform: Identity()
Default exog_missing_value_policy: MissingValuePolicy.ZFill
Default exog_aggregation_policy: AggregationPolicy.Mean
So in this case, we first apply the identity transform to the exogenous data. Then, we impute missing values by filling them with zeros (ZFill), and we downsample the exogenous data by taking the Mean of any relevant windows.
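The effect of these two defaults can be illustrated with plain pandas. The sketch below is not Merlion's internal implementation: the daily series, the column name MarkDown, and the 7-day windows are hypothetical, and Merlion aligns windows to the endogenous timestamps rather than to fixed bins. It only shows what zero-filling followed by a windowed mean does to gappy data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily exogenous series with gaps (not from the Walmart data)
idx = pd.date_range("2012-10-01", periods=14, freq="D")
values = [10.0, np.nan, 30.0, np.nan, np.nan, 20.0, 10.0,
          np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 70.0]
exog_daily = pd.DataFrame({"MarkDown": values}, index=idx)

# Step 1 (analogous to ZFill): replace missing values with zeros
filled = exog_daily.fillna(0.0)

# Step 2 (analogous to Mean aggregation): average each 7-day window
weekly = filled.resample("7D").mean()

# For contrast: averaging without imputation skips the NaNs entirely
weekly_no_fill = exog_daily.resample("7D").mean()

print(weekly)
print(weekly_no_fill)
```

Zero-filling treats a missing markdown as "no markdown", so each window's mean is pulled toward zero. Averaging only the observed values gives a noticeably different series, which is why the choice of missing value policy can matter.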