Loading Custom Datasets

This notebook will explain how to load custom datasets saved to CSV files, for either anomaly detection or forecasting.

Anomaly Detection Datasets

Let’s first look at a synthetic anomaly detection dataset. Note that this section just provides an alternative implementation of the dataset ts_datasets.anomaly.Synthetic. We begin by listing all the CSV files in the relevant directory.

[1]:
import glob
import os
anom_dir = os.path.join("..", "data", "synthetic_anomaly")
csvs = sorted(glob.glob(f"{anom_dir}/*.csv"))
for csv in csvs:
    print(csv)
../data/synthetic_anomaly/horizontal.csv
../data/synthetic_anomaly/horizontal_dip_anomaly.csv
../data/synthetic_anomaly/horizontal_level_anomaly.csv
../data/synthetic_anomaly/horizontal_shock_anomaly.csv
../data/synthetic_anomaly/horizontal_spike_anomaly.csv
../data/synthetic_anomaly/horizontal_trend_anomaly.csv
../data/synthetic_anomaly/seasonal.csv
../data/synthetic_anomaly/seasonal_dip_anomaly.csv
../data/synthetic_anomaly/seasonal_level_anomaly.csv
../data/synthetic_anomaly/seasonal_shock_anomaly.csv
../data/synthetic_anomaly/seasonal_spike_anomaly.csv
../data/synthetic_anomaly/seasonal_trend_anomaly.csv
../data/synthetic_anomaly/upward_downward.csv
../data/synthetic_anomaly/upward_downward_dip_anomaly.csv
../data/synthetic_anomaly/upward_downward_level_anomaly.csv
../data/synthetic_anomaly/upward_downward_shock_anomaly.csv
../data/synthetic_anomaly/upward_downward_spike_anomaly.csv
../data/synthetic_anomaly/upward_downward_trend_anomaly.csv

Let’s visualize what a couple of these CSVs look like.

[2]:
import pandas as pd
from IPython.display import display

for csv in [csvs[0], csvs[8]]:
    print(csv)
    display(pd.read_csv(csv))
../data/synthetic_anomaly/horizontal.csv
      timestamp  horizontal
0             0    1.928031
1           300   -1.156620
2           600   -0.390650
3           900    0.400804
4          1200   -0.874490
...         ...         ...
9995    2998500    0.362724
9996    2998800    2.657373
9997    2999100    1.472341
9998    2999400    1.033154
9999    2999700    2.950466

10000 rows × 2 columns

../data/synthetic_anomaly/seasonal_level_anomaly.csv
      timestamp   seasonal  anomaly
0             0  -0.577883      0.0
1           300   1.059779      0.0
2           600   1.137609      0.0
3           900   0.743360      0.0
4          1200   1.998400      0.0
...         ...        ...      ...
9995    2998500  -5.388685      0.0
9996    2998800  -5.017828      0.0
9997    2999100  -4.196791      0.0
9998    2999400  -4.234555      0.0
9999    2999700  -3.111685      0.0

10000 rows × 3 columns

Each CSV in the dataset has the following important characteristics:

  • a time column timestamp (here, a Unix timestamp expressed in units of seconds);

  • a column anomaly indicating whether a timestamp is anomalous or not (though this is absent for time series which don’t contain any anomalies);

  • one or more columns for the actual data values.

We can create a data loader for all the CSV files in this dataset as follows:

[3]:
from ts_datasets.anomaly import CustomAnomalyDataset
dataset = CustomAnomalyDataset(
    rootdir=anom_dir,       # where the data is stored
    test_frac=0.75,         # use 75% of each time series for testing.
                            # overridden if the column `trainval` is in the actual CSV.
    time_unit="s",          # the timestamp column (automatically detected) is in units of seconds
    assume_no_anomaly=True  # if a CSV doesn't have the "anomaly" column, assume it has no anomalies
)
[4]:
print(f"There are {len(dataset)} time series in this dataset.")
time_series, metadata = dataset[3]
There are 18 time series in this dataset.
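
Like the other loaders in ts_datasets, the dataset behaves as a sequence of (time_series, metadata) pairs, so you can also iterate over it. A minimal sketch, assuming the usual sequence behavior of ts_datasets loaders:

for ts, md in dataset:
    # ts holds the data values; md holds the anomaly & trainval labels
    assert len(ts) == len(md)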

This particular time series is univariate. Its variable is named “horizontal”.

[5]:
display(time_series)
                     horizontal
timestamp
1970-01-01 00:00:00    1.928031
1970-01-01 00:05:00   -1.156620
1970-01-01 00:10:00   -0.390650
1970-01-01 00:15:00    0.400804
1970-01-01 00:20:00   -0.874490
...                         ...
1970-02-04 16:55:00    0.362724
1970-02-04 17:00:00    2.657373
1970-02-04 17:05:00    1.472341
1970-02-04 17:10:00    1.033154
1970-02-04 17:15:00    2.950466

10000 rows × 1 columns

The metadata has the same timestamps as the time series. It contains “anomaly” and “trainval” columns. These respectively indicate whether each timestamp is anomalous, and whether each timestamp is for training/validation or testing.

[6]:
display(metadata)
                     anomaly  trainval
timestamp
1970-01-01 00:00:00    False      True
1970-01-01 00:05:00    False      True
1970-01-01 00:10:00    False      True
1970-01-01 00:15:00    False      True
1970-01-01 00:20:00    False      True
...                      ...       ...
1970-02-04 16:55:00    False     False
1970-02-04 17:00:00    False     False
1970-02-04 17:05:00    False     False
1970-02-04 17:10:00    False     False
1970-02-04 17:15:00    False     False

10000 rows × 2 columns

[7]:
print(f"{100 - metadata.trainval.mean() * 100}% of the time series is for testing.")
print(f"{metadata.anomaly.mean() * 100}% of the time series is anomalous.")
75.0% of the time series is for testing.
19.57% of the time series is anomalous.
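
Because the metadata shares its index with the time series, the boolean trainval column can be used directly as a mask to recover the train/test split. A minimal sketch:

train = time_series[metadata.trainval]
test = time_series[~metadata.trainval]
print(f"{len(train)} points for training, {len(test)} points for testing")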

General Purpose (Forecasting) Datasets

Next, let’s load a more general-purpose dataset for forecasting. We will use this opportunity to show some of the loader’s more advanced features. Here, our dataset consists of a single CSV file which contains many multivariate time series. These time series are collected from a large retailer, and each individual time series corresponds to a different department within a different store. Let’s have a look at the data.

[8]:
csv = os.path.join("..", "data", "walmart", "walmart_mini.csv")
display(pd.read_csv(csv))
      Store  Dept        Date  Weekly_Sales  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5         CPI  Unemployment  IsHoliday
0         1     1  2010-02-05      24924.50        42.31       2.572        NaN        NaN        NaN        NaN        NaN  211.096358         8.106      False
1         1     1  2010-02-12      46039.49        38.51       2.548        NaN        NaN        NaN        NaN        NaN  211.242170         8.106       True
2         1     1  2010-02-19      41595.55        39.93       2.514        NaN        NaN        NaN        NaN        NaN  211.289143         8.106      False
3         1     1  2010-02-26      19403.54        46.63       2.561        NaN        NaN        NaN        NaN        NaN  211.319643         8.106      False
4         1     1  2010-03-05      21827.90        46.50       2.625        NaN        NaN        NaN        NaN        NaN  211.350143         8.106      False
...     ...   ...         ...           ...          ...         ...        ...        ...        ...        ...        ...         ...           ...        ...
8229      2    30  2012-09-28       3307.90        79.45       3.666    7106.05       1.91       1.65    1549.10    3946.03  222.616433         6.565      False
8230      2    30  2012-10-05       3697.52        70.27       3.617    6037.76        NaN      10.04    3027.37    3853.40  222.815930         6.170      False
8231      2    30  2012-10-12       3085.98        60.97       3.601    2145.50        NaN      33.31     586.83   10421.01  223.015426         6.170      False
8232      2    30  2012-10-19       4043.06        68.08       3.594    4461.89        NaN       1.14    1579.67    2642.29  223.059808         6.170      False
8233      2    30  2012-10-26       3869.88        69.79       3.506    6152.59     129.77     200.00     272.29    2924.15  223.078337         6.170      False

8234 rows × 14 columns

As before, we have a column Date indicating the time. Note that in this case, we have a date string rather than a Unix timestamp; this is also okay. However, we now also have the index columns Store and Dept, which are used to distinguish between different time series. We specify these to the data loader below.
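
First, as a quick sanity check, we can count the distinct (Store, Dept) pairs in the file with pandas; this count should match the number of time series the data loader reports. A minimal sketch:

df = pd.read_csv(csv)
# each unique (Store, Dept) pair corresponds to one time series
print(df.groupby(["Store", "Dept"]).ngroups)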

[9]:
from ts_datasets.forecast import CustomDataset
dataset = CustomDataset(
    rootdir=csv,                  # where the data is stored
    index_cols=["Store", "Dept"], # Individual time series are indexed by store & department
    test_frac=0.25,               # use 25% of each time series for testing.
                                  # overridden if the column `trainval` is in the actual CSV.
)
[10]:
print(f"There are {len(dataset)} time series in this dataset.")
time_series, metadata = dataset[52]
There are 58 time series in this dataset.

This particular time series is multivariate.

[11]:
display(time_series)
            Weekly_Sales  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5         CPI  Unemployment  IsHoliday
Date
2010-02-05      16827.50        40.19       2.572        NaN        NaN        NaN        NaN        NaN  210.752605         8.324      False
2010-02-12      19286.00        38.49       2.548        NaN        NaN        NaN        NaN        NaN  210.897994         8.324       True
2010-02-19      17803.75        39.69       2.514        NaN        NaN        NaN        NaN        NaN  210.945160         8.324      False
2010-02-26      13153.25        46.10       2.561        NaN        NaN        NaN        NaN        NaN  210.975957         8.324      False
2010-03-05      14656.50        47.17       2.625        NaN        NaN        NaN        NaN        NaN  211.006754         8.324      False
...                  ...          ...         ...        ...        ...        ...        ...        ...         ...           ...        ...
2012-09-28      11893.45        79.45       3.666    7106.05       1.91       1.65    1549.10    3946.03  222.616433         6.565      False
2012-10-05      16415.05        70.27       3.617    6037.76        NaN      10.04    3027.37    3853.40  222.815930         6.170      False
2012-10-12      15992.38        60.97       3.601    2145.50        NaN      33.31     586.83   10421.01  223.015426         6.170      False
2012-10-19      13573.30        68.08       3.594    4461.89        NaN       1.14    1579.67    2642.29  223.059808         6.170      False
2012-10-26      12962.63        69.79       3.506    6152.59     129.77     200.00     272.29    2924.15  223.078337         6.170      False

143 rows × 11 columns

The metadata has the same timestamps as the time series. It has a “trainval” column as before, plus index columns “Store” and “Dept”.

[12]:
display(metadata)
            trainval  Store  Dept
Date
2010-02-05      True      2    25
2010-02-12      True      2    25
2010-02-19      True      2    25
2010-02-26      True      2    25
2010-03-05      True      2    25
...              ...    ...   ...
2012-09-28     False      2    25
2012-10-05     False      2    25
2012-10-12     False      2    25
2012-10-19     False      2    25
2012-10-26     False      2    25

143 rows × 3 columns
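
If you want to feed this data into a Merlion model, you can convert the pandas DataFrames with TimeSeries.from_pd. A minimal sketch, reusing the trainval mask from the metadata:

from merlion.utils import TimeSeries

# convert the training portion of the data to a Merlion TimeSeries
train_data = TimeSeries.from_pd(time_series[metadata.trainval])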

Broader Takeaways

In general, a dataset can contain any number of CSVs stored under a single root directory, and each CSV can contain one or more time series, where different time series within a single file are indicated by different values of the index column. This works for anomaly detection as well: all features supported by CustomDataset are also supported by CustomAnomalyDataset, as long as each of your CSV files contains the anomaly column.
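
For example, a single labeled CSV containing many time series could be loaded as follows (the file name and the series_id index column here are hypothetical):

dataset = CustomAnomalyDataset(
    rootdir="my_labeled_data.csv",  # hypothetical CSV holding multiple time series
    index_cols=["series_id"],       # hypothetical index column distinguishing them
    test_frac=0.5,
    assume_no_anomaly=True,         # treat CSVs lacking an "anomaly" column as anomaly-free
)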

If you want to use either of the above custom datasets for benchmarking, you can call

python benchmark_anomaly.py --model IsolationForest --retrain_freq 7d \
    --dataset CustomAnomalyDataset --data_root data/synthetic_anomaly \
    --data_kwargs '{"assume_no_anomaly": true, "test_frac": 0.75}'

or

python benchmark_forecast.py --model AutoETS \
    --dataset CustomDataset --data_root data/walmart/walmart_mini.csv \
    --data_kwargs '{"test_frac": 0.25, "index_cols": ["Store", "Dept"], "data_cols": ["Weekly_Sales"]}'

Note that in the example above, we specify “data_cols” as “Weekly_Sales”. This indicates that we want only the Weekly_Sales column to be modeled when benchmarking.