Loading Custom Datasets
This notebook will explain how to load custom datasets saved to CSV files, for either anomaly detection or forecasting.
Anomaly Detection Datasets
Let’s first look at a synthetic anomaly detection dataset. Note that this section just provides an alternative implementation of the dataset `ts_datasets.anomaly.Synthetic`. We begin by listing all the CSV files in the relevant directory.
[1]:
import glob
import os
anom_dir = os.path.join("..", "data", "synthetic_anomaly")
csvs = sorted(glob.glob(f"{anom_dir}/*.csv"))
for csv in csvs:
    print(csv)
../data/synthetic_anomaly/horizontal.csv
../data/synthetic_anomaly/horizontal_dip_anomaly.csv
../data/synthetic_anomaly/horizontal_level_anomaly.csv
../data/synthetic_anomaly/horizontal_shock_anomaly.csv
../data/synthetic_anomaly/horizontal_spike_anomaly.csv
../data/synthetic_anomaly/horizontal_trend_anomaly.csv
../data/synthetic_anomaly/seasonal.csv
../data/synthetic_anomaly/seasonal_dip_anomaly.csv
../data/synthetic_anomaly/seasonal_level_anomaly.csv
../data/synthetic_anomaly/seasonal_shock_anomaly.csv
../data/synthetic_anomaly/seasonal_spike_anomaly.csv
../data/synthetic_anomaly/seasonal_trend_anomaly.csv
../data/synthetic_anomaly/upward_downward.csv
../data/synthetic_anomaly/upward_downward_dip_anomaly.csv
../data/synthetic_anomaly/upward_downward_level_anomaly.csv
../data/synthetic_anomaly/upward_downward_shock_anomaly.csv
../data/synthetic_anomaly/upward_downward_spike_anomaly.csv
../data/synthetic_anomaly/upward_downward_trend_anomaly.csv
Let’s visualize what a couple of these CSVs look like.
[2]:
import pandas as pd
from IPython.display import display
for csv in [csvs[0], csvs[8]]:
    print(csv)
    display(pd.read_csv(csv))
../data/synthetic_anomaly/horizontal.csv
 | timestamp | horizontal |
---|---|---|
0 | 0 | 1.928031 |
1 | 300 | -1.156620 |
2 | 600 | -0.390650 |
3 | 900 | 0.400804 |
4 | 1200 | -0.874490 |
... | ... | ... |
9995 | 2998500 | 0.362724 |
9996 | 2998800 | 2.657373 |
9997 | 2999100 | 1.472341 |
9998 | 2999400 | 1.033154 |
9999 | 2999700 | 2.950466 |
10000 rows × 2 columns
../data/synthetic_anomaly/seasonal_level_anomaly.csv
 | timestamp | seasonal | anomaly |
---|---|---|---|
0 | 0 | -0.577883 | 0.0 |
1 | 300 | 1.059779 | 0.0 |
2 | 600 | 1.137609 | 0.0 |
3 | 900 | 0.743360 | 0.0 |
4 | 1200 | 1.998400 | 0.0 |
... | ... | ... | ... |
9995 | 2998500 | -5.388685 | 0.0 |
9996 | 2998800 | -5.017828 | 0.0 |
9997 | 2999100 | -4.196791 | 0.0 |
9998 | 2999400 | -4.234555 | 0.0 |
9999 | 2999700 | -3.111685 | 0.0 |
10000 rows × 3 columns
Each CSV in the dataset has the following important characteristics (a minimal example of writing such a file is sketched just below this list):

- a time column `timestamp` (here, a Unix timestamp expressed in units of seconds);
- a column `anomaly` indicating whether a timestamp is anomalous or not (though this is absent for time series which don’t contain any anomalies);
- one or more columns for the actual data values.
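For reference, a minimal file in this format could be written with pandas as follows. The file name, the column name "value", and all numbers here are purely illustrative:

```python
import pandas as pd

# A tiny series in the expected format: a Unix timestamp column (in seconds),
# one data column, and an anomaly column marking the anomalous points.
df = pd.DataFrame({
    "timestamp": [0, 300, 600, 900, 1200],
    "value": [1.0, 1.2, 0.9, 5.7, 1.1],
    "anomaly": [0, 0, 0, 1, 0],
})
df.to_csv("my_anomaly_series.csv", index=False)
```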
We can create a data loader for all the CSV files in this dataset as follows:
[3]:
from ts_datasets.anomaly import CustomAnomalyDataset
dataset = CustomAnomalyDataset(
    rootdir=anom_dir,  # where the data is stored
    test_frac=0.75,  # use 75% of each time series for testing.
    # overridden if the column `trainval` is in the actual CSV.
    time_unit="s",  # the timestamp column (automatically detected) is in units of seconds
    assume_no_anomaly=True  # if a CSV doesn't have the "anomaly" column, assume it has no anomalies
)
[4]:
print(f"There are {len(dataset)} time series in this dataset.")
time_series, metadata = dataset[3]
There are 18 time series in this dataset.
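Each element of the dataset is a (time series, metadata) pair, so you can enumerate every series with an ordinary loop over its indices. A small sketch, using only the API shown above:

```python
# Enumerate all (time_series, metadata) pairs in the dataset.
for i in range(len(dataset)):
    ts_i, md_i = dataset[i]
    print(f"series {i}: {ts_i.shape[0]} points, columns = {list(ts_i.columns)}")
```

Below we continue with the single series selected above via dataset[3].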
This particular time series is univariate. Its variable is named “horizontal”.
[5]:
display(time_series)
timestamp | horizontal |
---|---|
1970-01-01 00:00:00 | 1.928031 |
1970-01-01 00:05:00 | -1.156620 |
1970-01-01 00:10:00 | -0.390650 |
1970-01-01 00:15:00 | 0.400804 |
1970-01-01 00:20:00 | -0.874490 |
... | ... |
1970-02-04 16:55:00 | 0.362724 |
1970-02-04 17:00:00 | 2.657373 |
1970-02-04 17:05:00 | 1.472341 |
1970-02-04 17:10:00 | 1.033154 |
1970-02-04 17:15:00 | 2.950466 |
10000 rows × 1 columns
The metadata has the same timestamps as the time series. It contains “anomaly” and “trainval” columns. These respectively indicate whether each timestamp is anomalous, and whether each timestamp is for training/validation or testing.
[6]:
display(metadata)
timestamp | anomaly | trainval |
---|---|---|
1970-01-01 00:00:00 | False | True |
1970-01-01 00:05:00 | False | True |
1970-01-01 00:10:00 | False | True |
1970-01-01 00:15:00 | False | True |
1970-01-01 00:20:00 | False | True |
... | ... | ... |
1970-02-04 16:55:00 | False | False |
1970-02-04 17:00:00 | False | False |
1970-02-04 17:05:00 | False | False |
1970-02-04 17:10:00 | False | False |
1970-02-04 17:15:00 | False | False |
10000 rows × 2 columns
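To get a quick visual sense of where the labeled anomalies fall, you could plot the series and mark the anomalous timestamps, e.g. with matplotlib. This is a rough sketch rather than part of the data loader API:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(time_series.index, time_series["horizontal"], lw=0.8, label="horizontal")

# Mark the timestamps that the metadata labels as anomalous.
anom_times = metadata.index[metadata["anomaly"].astype(bool)]
ax.scatter(anom_times, time_series.loc[anom_times, "horizontal"],
           color="red", s=8, zorder=3, label="anomaly")
ax.legend()
plt.show()
```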
[7]:
print(f"{100 - metadata.trainval.mean() * 100}% of the time series is for testing.")
print(f"{metadata.anomaly.mean() * 100}% of the time series is anomalous.")
75.0% of the time series is for testing.
19.57% of the time series is anomalous.
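When you feed this data to a model, a common pattern is to split each series on the trainval flag and (if you are using Merlion) wrap the pieces as TimeSeries objects. A minimal sketch, assuming merlion is installed:

```python
from merlion.utils import TimeSeries

# Split on the trainval flag and wrap each piece as a Merlion TimeSeries.
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])

# The corresponding anomaly labels can be wrapped the same way.
test_labels = TimeSeries.from_pd(metadata.anomaly[~metadata.trainval])
```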
General Purpose (Forecasting) Datasets
Next, let’s load a more general-purpose dataset for forecasting. We will use this opportunity to show some of the more advanced features as well. Here, our dataset consists of a single CSV file which contains many multivariate time series. These time series are collected from a large retailer, and each individual time series corresponds to a different department within a different store. Let’s have a look at the data.
[8]:
csv = os.path.join("..", "data", "walmart", "walmart_mini.csv")
display(pd.read_csv(csv))
 | Store | Dept | Date | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False |
1 | 1 | 1 | 2010-02-12 | 46039.49 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True |
2 | 1 | 1 | 2010-02-19 | 41595.55 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False |
3 | 1 | 1 | 2010-02-26 | 19403.54 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False |
4 | 1 | 1 | 2010-03-05 | 21827.90 | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2855 | 2 | 10 | 2012-09-28 | 37104.67 | 79.45 | 3.666 | 7106.05 | 1.91 | 1.65 | 1549.10 | 3946.03 | 222.616433 | 6.565 | False |
2856 | 2 | 10 | 2012-10-05 | 36361.28 | 70.27 | 3.617 | 6037.76 | NaN | 10.04 | 3027.37 | 3853.40 | 222.815930 | 6.170 | False |
2857 | 2 | 10 | 2012-10-12 | 35332.34 | 60.97 | 3.601 | 2145.50 | NaN | 33.31 | 586.83 | 10421.01 | 223.015426 | 6.170 | False |
2858 | 2 | 10 | 2012-10-19 | 35721.09 | 68.08 | 3.594 | 4461.89 | NaN | 1.14 | 1579.67 | 2642.29 | 223.059808 | 6.170 | False |
2859 | 2 | 10 | 2012-10-26 | 34260.76 | 69.79 | 3.506 | 6152.59 | 129.77 | 200.00 | 272.29 | 2924.15 | 223.078337 | 6.170 | False |
2860 rows × 14 columns
As before, we have a column `Date` indicating the time. Note that in this case, we have a string rather than a timestamp; this is also okay. However, we now also have some index columns `Store` and `Dept`, which are used to distinguish between different time series. We specify these to the data loader.
[9]:
from ts_datasets.forecast import CustomDataset
dataset = CustomDataset(
    rootdir=csv,  # where the data is stored
    index_cols=["Store", "Dept"],  # Individual time series are indexed by store & department
    test_frac=0.25,  # use 25% of each time series for testing.
    # overridden if the column `trainval` is in the actual CSV.
)
[10]:
print(f"There are {len(dataset)} time series in this dataset.")
time_series, metadata = dataset[17]
There are 20 time series in this dataset.
This particular time series is multivariate.
[11]:
display(time_series)
Date | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|
2010-02-05 | 69634.80 | 40.19 | 2.572 | NaN | NaN | NaN | NaN | NaN | 210.752605 | 8.324 | False |
2010-02-12 | 63393.29 | 38.49 | 2.548 | NaN | NaN | NaN | NaN | NaN | 210.897994 | 8.324 | True |
2010-02-19 | 66589.27 | 39.69 | 2.514 | NaN | NaN | NaN | NaN | NaN | 210.945160 | 8.324 | False |
2010-02-26 | 61875.48 | 46.10 | 2.561 | NaN | NaN | NaN | NaN | NaN | 210.975957 | 8.324 | False |
2010-03-05 | 67041.18 | 47.17 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.006754 | 8.324 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-09-28 | 57424.00 | 79.45 | 3.666 | 7106.05 | 1.91 | 1.65 | 1549.10 | 3946.03 | 222.616433 | 6.565 | False |
2012-10-05 | 62955.51 | 70.27 | 3.617 | 6037.76 | NaN | 10.04 | 3027.37 | 3853.40 | 222.815930 | 6.170 | False |
2012-10-12 | 63083.63 | 60.97 | 3.601 | 2145.50 | NaN | 33.31 | 586.83 | 10421.01 | 223.015426 | 6.170 | False |
2012-10-19 | 60502.97 | 68.08 | 3.594 | 4461.89 | NaN | 1.14 | 1579.67 | 2642.29 | 223.059808 | 6.170 | False |
2012-10-26 | 63992.36 | 69.79 | 3.506 | 6152.59 | 129.77 | 200.00 | 272.29 | 2924.15 | 223.078337 | 6.170 | False |
143 rows × 11 columns
The metadata has the same timestamps as the time series. It has a “trainval” column as before, plus index columns “Store” and “Dept”.
[12]:
display(metadata)
Date | trainval | Store | Dept |
---|---|---|---|
2010-02-05 | True | 2 | 8 |
2010-02-12 | True | 2 | 8 |
2010-02-19 | True | 2 | 8 |
2010-02-26 | True | 2 | 8 |
2010-03-05 | True | 2 | 8 |
... | ... | ... | ... |
2012-09-28 | False | 2 | 8 |
2012-10-05 | False | 2 | 8 |
2012-10-12 | False | 2 | 8 |
2012-10-19 | False | 2 | 8 |
2012-10-26 | False | 2 | 8 |
143 rows × 3 columns
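As with the anomaly data, you might split this series on trainval and keep only the target column if you want to do univariate forecasting. Again a sketch, assuming merlion is installed:

```python
from merlion.utils import TimeSeries

# Keep only the forecast target; the other columns could serve as extra
# variables for models that support them.
target = time_series[["Weekly_Sales"]]
train_data = TimeSeries.from_pd(target[metadata.trainval])
test_data = TimeSeries.from_pd(target[~metadata.trainval])
```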
Broader Takeaways
In general, a dataset can contain any number of CSVs stored under a single root directory. Each CSV can contain one or more time series, where the different time series within a single file are indicated by different values of the index column. Note that this works for anomaly detection as well! You just need to make sure that your CSVs all contain the `anomaly` column. In general, all features supported by `CustomDataset` are also supported by `CustomAnomalyDataset`, as long as your CSV files have the `anomaly` column.
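To assemble such a dataset yourself, it is enough to write CSVs of this shape under one root directory. The snippet below is purely illustrative (the directory, file, and column names are all made up): it writes a single CSV containing two series distinguished by an index column.

```python
import os
import pandas as pd

os.makedirs("my_dataset", exist_ok=True)

# Two short series in one file, distinguished by the "series_id" index column.
rows = []
for series_id in ["a", "b"]:
    for t in range(5):
        rows.append({
            "series_id": series_id,
            "timestamp": t * 300,    # Unix timestamp in seconds
            "value": float(t),       # the actual data column
            "anomaly": int(t == 3),  # 1 if this point is anomalous, else 0
        })
pd.DataFrame(rows).to_csv(os.path.join("my_dataset", "series.csv"), index=False)
```

Such a directory could then be loaded with CustomAnomalyDataset(rootdir="my_dataset", index_cols=["series_id"]), mirroring the forecasting example above.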
If you want to use either of the above custom datasets for benchmarking, you can call
python benchmark_anomaly.py --model IsolationForest --retrain_freq 7d \
--dataset CustomAnomalyDataset --data_root data/synthetic_anomaly \
--data_kwargs '{"assume_no_anomaly": true, "test_frac": 0.75}'
or
python benchmark_forecast.py --model AutoETS \
--dataset CustomDataset --data_root data/walmart/walmart_mini.csv \
--data_kwargs '{"test_frac": 0.25, "index_cols": ["Store", "Dept"], "data_cols": ["Weekly_Sales"]}'
Note that in the example above, we specify “data_cols” as “Weekly_Sales”. This indicates that the only column we are modeling is Weekly_Sales. If you wanted to do multivariate prediction, you could also add “Temperature”, “Fuel_Price”, “CPI”, etc. We treat the first of the data columns as the target univariate whose value you wish to forecast.
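Since data_kwargs are simply keyword arguments forwarded to the data loader, the equivalent construction in Python should look roughly like the following. Note that data_cols is assumed here to be accepted by the constructor, as the benchmark invocation above suggests:

```python
from ts_datasets.forecast import CustomDataset

# data_kwargs on the command line are forwarded to the dataset constructor,
# so this should mirror the benchmark_forecast.py invocation above.
dataset = CustomDataset(
    rootdir=csv,  # the walmart_mini.csv path defined earlier in this notebook
    test_frac=0.25,
    index_cols=["Store", "Dept"],
    data_cols=["Weekly_Sales"],  # model only Weekly_Sales, the forecast target
)
```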