Forecasting With Exogenous Regressors

Consider a multivariate time series \(X^{(1)}, \ldots, X^{(t)}\), where each \(X^{(i)} \in \mathbb{R}^d\) is a d-dimensional vector. In multivariate forecasting, our goal is to predict the future values of the k’th univariate \(X_k^{(t+1)}, \ldots, X_k^{(t+h)}\).

Exogenous regressors \(Y^{(i)}\) are a set of additional variables whose values we know a priori. The task of forecasting with exogenous regressors is to predict our target univariate \(X_k^{(t+1)}, \ldots, X_k^{(t+h)}\), conditioned on

- The past values of the time series \(X^{(1)}, \ldots, X^{(t)}\)
- The past values of the exogenous regressors \(Y^{(1)}, \ldots, Y^{(t)}\)
- The future values of the exogenous regressors \(Y^{(t+1)}, \ldots, Y^{(t+h)}\)
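
As a rough illustration of the conditioning sets above, the following sketch slices toy data into the three inputs and the target. The helper split_forecast_inputs and its argument names are hypothetical and are not part of Merlion; the index k is 0-based here, and the data is made up.

```python
# Hypothetical helper (not part of Merlion): shows which slices of the data
# a forecaster with exogenous regressors conditions on, and what it predicts.
def split_forecast_inputs(X, Y, t, h, k=0):
    """X and Y are lists of rows, one row per time step, each of length >= t + h."""
    if len(X) < t + h or len(Y) < t + h:
        raise ValueError("need at least t + h time steps in both X and Y")
    return {
        "past_endog": X[:t],                       # X^(1), ..., X^(t)
        "past_exog": Y[:t],                        # Y^(1), ..., Y^(t)
        "future_exog": Y[t:t + h],                 # Y^(t+1), ..., Y^(t+h)
        "target": [row[k] for row in X[t:t + h]],  # X_k^(t+1), ..., X_k^(t+h)
    }

# Toy data: 5 time steps, d = 2 endogenous variables, 1 exogenous variable
X = [[10.0, 1.0], [11.0, 1.1], [12.0, 1.2], [13.0, 1.3], [14.0, 1.4]]
Y = [[0], [1], [0], [0], [1]]

inputs = split_forecast_inputs(X, Y, t=3, h=2, k=0)
print(inputs["future_exog"])  # [[0], [1]]
print(inputs["target"])       # [13.0, 14.0]
```

Note that the future exogenous values are available to the model at prediction time, while the target values are only used afterwards to evaluate the forecast.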

For example, one can consider the task of predicting the sales of a specific item at a store. Endogenous variables \(X^{(i)} \in \mathbb{R}^4\) may contain the number of units sold (the target univariate), the temperature outside, the consumer price index, and the current unemployment rate. Exogenous variables \(Y^{(i)} \in \mathbb{R}^6\) are variables that the store has control over or prior knowledge of. They may include whether a particular day is a holiday, and various information about the sort of markdowns the store is running.

To be more concrete, let’s show this with some real data.

[1]:

# This is the same dataset used in the custom dataset tutorial
import os
from ts_datasets.forecast import CustomDataset
csv = os.path.join("..", "..", "data", "walmart", "walmart_mini.csv")
dataset = CustomDataset(rootdir=csv, index_cols=["Store", "Dept"], test_frac=0.10)
ts, md = dataset[-1]
display(ts)

	Weekly_Sales	Temperature	Fuel_Price	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5	CPI	Unemployment	IsHoliday
Date
2010-02-05	39602.47	40.19	2.572	NaN	NaN	NaN	NaN	NaN	210.752605	8.324	False
2010-02-12	37984.44	38.49	2.548	NaN	NaN	NaN	NaN	NaN	210.897994	8.324	True
2010-02-19	38889.43	39.69	2.514	NaN	NaN	NaN	NaN	NaN	210.945160	8.324	False
2010-02-26	41137.74	46.10	2.561	NaN	NaN	NaN	NaN	NaN	210.975957	8.324	False
2010-03-05	39883.50	47.17	2.625	NaN	NaN	NaN	NaN	NaN	211.006754	8.324	False
...	...	...	...	...	...	...	...	...	...	...	...
2012-09-28	37104.67	79.45	3.666	7106.05	1.91	1.65	1549.10	3946.03	222.616433	6.565	False
2012-10-05	36361.28	70.27	3.617	6037.76	NaN	10.04	3027.37	3853.40	222.815930	6.170	False
2012-10-12	35332.34	60.97	3.601	2145.50	NaN	33.31	586.83	10421.01	223.015426	6.170	False
2012-10-19	35721.09	68.08	3.594	4461.89	NaN	1.14	1579.67	2642.29	223.059808	6.170	False
2012-10-26	34260.76	69.79	3.506	6152.59	129.77	200.00	272.29	2924.15	223.078337	6.170	False

143 rows × 11 columns

[2]:

from merlion.utils import TimeSeries

# Get the endogenous variables X and split them into train & test
endog = ts[["Weekly_Sales", "Temperature", "CPI", "Unemployment"]]
train = TimeSeries.from_pd(endog[md.trainval])
test = TimeSeries.from_pd(endog[~md.trainval])

# Get the exogenous variables Y
exog = TimeSeries.from_pd(ts[["IsHoliday", "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]])

The earliest univariate starts at 2010-02-05 00:00:00, but the latest univariate starts at 2011-11-11 00:00:00, a difference of 644 days 00:00:00. This is more than 10% of the length of the shortest univariate (350 days 00:00:00). You may want to check that the univariates cover the same window of time.

Here, our task is to predict the weekly sales. We would like our model to also account for variables which may have an impact on consumer demand (i.e. temperature, consumer price index, and unemployment), as knowledge of these variables could improve the quality of our sales forecast. This is a multivariate forecasting problem, which is covered in a separate tutorial.

In principle, we could add markdowns and holidays to the multivariate model. However, as a retailer, we know a priori which days are holidays, and we ourselves control the markdowns. In many cases, we can get better forecasts by providing the future values of these variables in addition to the past values. Moreover, we may wish to model how changing the future markdowns would change the future sales. This is why we should model these variables as exogenous regressors instead.

All Merlion forecasters expose an API which accepts exogenous regressors at both training and inference time, though only some models actually make use of them. Using the feature is as easy as specifying an optional argument exog_data to both train() and forecast(). We show how to use the feature for the popular Prophet model below, and demonstrate that adding exogenous regressors can improve the quality of the forecast.

[3]:

from merlion.evaluate.forecast import ForecastMetric
from merlion.models.forecast.prophet import Prophet, ProphetConfig

# Train a model without exogenous data
model = Prophet(ProphetConfig(target_seq_index=0))
model.train(train)
pred, err = model.forecast(test.time_stamps)
smape = ForecastMetric.sMAPE.value(test, pred, target_seq_index=model.target_seq_index)
print(f"sMAPE (w/o exog) = {smape:.2f}")

# Train a model with exogenous data
exog_model = Prophet(ProphetConfig(target_seq_index=0))
exog_model.train(train, exog_data=exog)
exog_pred, exog_err = exog_model.forecast(test.time_stamps, exog_data=exog)
exog_smape = ForecastMetric.sMAPE.value(test, exog_pred, target_seq_index=exog_model.target_seq_index)
print(f"sMAPE (w/ exog)  = {exog_smape:.2f}")

17:50:59 - cmdstanpy - INFO - Chain [1] start processing
17:50:59 - cmdstanpy - INFO - Chain [1] done processing
17:50:59 - cmdstanpy - INFO - Chain [1] start processing
17:50:59 - cmdstanpy - INFO - Chain [1] done processing

sMAPE (w/o exog) = 3.98
sMAPE (w/ exog)  = 3.18
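
For readers unfamiliar with the metric, here is a minimal sketch of sMAPE (symmetric mean absolute percentage error), assuming the common definition on a 0 to 200 scale, where lower is better. The function smape below is a hypothetical stand-in written for illustration; Merlion's ForecastMetric.sMAPE may handle edge cases such as zero denominators differently.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error on a 0 to 200 scale."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, yhat in zip(actual, predicted):
        denom = abs(y) + abs(yhat)
        # Assumption: a term with a zero denominator counts as zero error
        total += 0.0 if denom == 0 else abs(yhat - y) / denom
    return 200.0 * total / len(actual)

# Toy values, not taken from the Walmart dataset
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
print(f"sMAPE = {smape(actual, predicted):.2f}")  # sMAPE = 4.88
```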

Before we wrap up this tutorial, we note that the exogenous variables contain a lot of missing data:

[4]:

display(exog.to_pd())

	IsHoliday	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5
Date
2010-02-05	False	NaN	NaN	NaN	NaN	NaN
2010-02-12	True	NaN	NaN	NaN	NaN	NaN
2010-02-19	False	NaN	NaN	NaN	NaN	NaN
2010-02-26	False	NaN	NaN	NaN	NaN	NaN
2010-03-05	False	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...
2012-09-28	False	7106.05	1.91	1.65	1549.10	3946.03
2012-10-05	False	6037.76	NaN	10.04	3027.37	3853.40
2012-10-12	False	2145.50	NaN	33.31	586.83	10421.01
2012-10-19	False	4461.89	NaN	1.14	1579.67	2642.29
2012-10-26	False	6152.59	129.77	200.00	272.29	2924.15

143 rows × 6 columns

Behind the scenes, Merlion models will apply an optional exog_transform to the exogenous variables, and they will then resample the exogenous variables to the same timestamps as the endogenous variables. This resampling is achieved using the exog_missing_value_policy and exog_aggregation_policy, which can be specified in the config of any model which accepts exogenous regressors. We can see the default values for each of these parameters by inspecting the config:

[5]:

print(f"Default exog_transform:            {type(exog_model.config.exog_transform).__name__}")
print(f"Default exog_missing_value_policy: {exog_model.config.exog_missing_value_policy}")
print(f"Default exog_aggregation_policy:   {exog_model.config.exog_aggregation_policy}")

Default exog_transform:            MeanVarNormalize
Default exog_missing_value_policy: MissingValuePolicy.ZFill
Default exog_aggregation_policy:   AggregationPolicy.Mean

So in this case, we first apply mean-variance normalization to the exogenous data. Then, we impute missing values by filling them with zeros (ZFill), and we downsample the exogenous data by taking the Mean of any relevant windows.
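
The three steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not Merlion's implementation: the function names are hypothetical, the normalization statistics are assumed to be computed over the observed (non-missing) values using the population variance, and downsampling is simplified to averaging fixed-size consecutive windows.

```python
import math

def mean_var_normalize(values):
    """Standardize using the mean and std of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        raise ValueError("need at least one observed value")
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / len(observed)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant series
    return [(v - mean) / std for v in values]  # NaN entries remain NaN

def zero_fill(values):
    """Replace missing values with zeros."""
    return [0.0 if math.isnan(v) else v for v in values]

def mean_downsample(values, window):
    """Average each consecutive block of `window` values."""
    blocks = [values[i:i + window] for i in range(0, len(values), window)]
    return [sum(block) / len(block) for block in blocks]

# Toy markdown-like series with missing entries (made-up numbers)
raw = [float("nan"), 2.0, 4.0, float("nan"), 6.0, 8.0]
normalized = mean_var_normalize(raw)
filled = zero_fill(normalized)
downsampled = mean_downsample(filled, window=2)
print(filled)
print(downsampled)
```

Under these assumptions, a zero in the normalized series corresponds to the mean of the observed values in the original units, so normalizing first and then zero-filling amounts to imputing each missing value with the column mean.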