Loading Custom Datasets
This notebook will explain how to load custom datasets saved to CSV files, for either anomaly detection or forecasting.
Anomaly Detection Datasets
Let’s first look at a synthetic anomaly detection dataset. Note that this section just provides an alternative implementation of the dataset `ts_datasets.anomaly.Synthetic`. We begin by listing all the CSV files in the relevant directory.
[1]:
import glob
import os
anom_dir = os.path.join("..", "data", "synthetic_anomaly")
csvs = sorted(glob.glob(f"{anom_dir}/*.csv"))
for csv in csvs:
    print(csv)
../data/synthetic_anomaly/horizontal.csv
../data/synthetic_anomaly/horizontal_dip_anomaly.csv
../data/synthetic_anomaly/horizontal_level_anomaly.csv
../data/synthetic_anomaly/horizontal_shock_anomaly.csv
../data/synthetic_anomaly/horizontal_spike_anomaly.csv
../data/synthetic_anomaly/horizontal_trend_anomaly.csv
../data/synthetic_anomaly/seasonal.csv
../data/synthetic_anomaly/seasonal_dip_anomaly.csv
../data/synthetic_anomaly/seasonal_level_anomaly.csv
../data/synthetic_anomaly/seasonal_shock_anomaly.csv
../data/synthetic_anomaly/seasonal_spike_anomaly.csv
../data/synthetic_anomaly/seasonal_trend_anomaly.csv
../data/synthetic_anomaly/upward_downward.csv
../data/synthetic_anomaly/upward_downward_dip_anomaly.csv
../data/synthetic_anomaly/upward_downward_level_anomaly.csv
../data/synthetic_anomaly/upward_downward_shock_anomaly.csv
../data/synthetic_anomaly/upward_downward_spike_anomaly.csv
../data/synthetic_anomaly/upward_downward_trend_anomaly.csv
Let’s visualize what a couple of these CSVs look like.
[2]:
import pandas as pd
from IPython.display import display
for csv in [csvs[0], csvs[8]]:
    print(csv)
    display(pd.read_csv(csv))
../data/synthetic_anomaly/horizontal.csv
 | timestamp | horizontal |
---|---|---|
0 | 0 | 1.928031 |
1 | 300 | -1.156620 |
2 | 600 | -0.390650 |
3 | 900 | 0.400804 |
4 | 1200 | -0.874490 |
... | ... | ... |
9995 | 2998500 | 0.362724 |
9996 | 2998800 | 2.657373 |
9997 | 2999100 | 1.472341 |
9998 | 2999400 | 1.033154 |
9999 | 2999700 | 2.950466 |
10000 rows × 2 columns
../data/synthetic_anomaly/seasonal_level_anomaly.csv
 | timestamp | seasonal | anomaly |
---|---|---|---|
0 | 0 | -0.577883 | 0.0 |
1 | 300 | 1.059779 | 0.0 |
2 | 600 | 1.137609 | 0.0 |
3 | 900 | 0.743360 | 0.0 |
4 | 1200 | 1.998400 | 0.0 |
... | ... | ... | ... |
9995 | 2998500 | -5.388685 | 0.0 |
9996 | 2998800 | -5.017828 | 0.0 |
9997 | 2999100 | -4.196791 | 0.0 |
9998 | 2999400 | -4.234555 | 0.0 |
9999 | 2999700 | -3.111685 | 0.0 |
10000 rows × 3 columns
Each CSV in the dataset has the following important characteristics (a minimal example of writing such a file is sketched just below this list):

- a time column `timestamp` (here, a Unix timestamp expressed in units of seconds);
- a column `anomaly` indicating whether a timestamp is anomalous or not (though this is absent for time series which don’t contain any anomalies);
- one or more columns for the actual data values.
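For reference, a minimal file in this format could be written with pandas as follows. The file name, the column name "value", and all numbers here are purely illustrative:

```python
import pandas as pd

# A tiny series in the expected format: a Unix timestamp column (in seconds),
# one data column, and an anomaly column marking the anomalous points.
df = pd.DataFrame({
    "timestamp": [0, 300, 600, 900, 1200],
    "value": [1.0, 1.2, 0.9, 5.7, 1.1],
    "anomaly": [0, 0, 0, 1, 0],
})
df.to_csv("my_anomaly_series.csv", index=False)
```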
We can create a data loader for all the CSV files in this dataset as follows:
[3]:
from ts_datasets.anomaly import CustomAnomalyDataset
dataset = CustomAnomalyDataset(
    rootdir=anom_dir,  # where the data is stored
    test_frac=0.75,  # use 75% of each time series for testing.
    # overridden if the column `trainval` is in the actual CSV.
    time_unit="s",  # the timestamp column (automatically detected) is in units of seconds
    assume_no_anomaly=True  # if a CSV doesn't have the "anomaly" column, assume it has no anomalies
)
[4]:
print(f"There are {len(dataset)} time series in this dataset.")
time_series, metadata = dataset[3]
There are 18 time series in this dataset.
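Each element of the dataset is a (time series, metadata) pair, so you can enumerate every series with an ordinary loop over its indices. A small sketch, using only the API shown above:

```python
# Enumerate all (time_series, metadata) pairs in the dataset.
for i in range(len(dataset)):
    ts_i, md_i = dataset[i]
    print(f"series {i}: {ts_i.shape[0]} points, columns = {list(ts_i.columns)}")
```

Below we continue with the single series selected above via dataset[3].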
This particular time series is univariate. Its variable is named “horizontal”.
[5]:
display(time_series)
timestamp | horizontal |
---|---|
1970-01-01 00:00:00 | 1.928031 |
1970-01-01 00:05:00 | -1.156620 |
1970-01-01 00:10:00 | -0.390650 |
1970-01-01 00:15:00 | 0.400804 |
1970-01-01 00:20:00 | -0.874490 |
... | ... |
1970-02-04 16:55:00 | 0.362724 |
1970-02-04 17:00:00 | 2.657373 |
1970-02-04 17:05:00 | 1.472341 |
1970-02-04 17:10:00 | 1.033154 |
1970-02-04 17:15:00 | 2.950466 |
10000 rows × 1 columns
The metadata has the same timestamps as the time series. It contains “anomaly” and “trainval” columns. These respectively indicate whether each timestamp is anomalous, and whether each timestamp is for training/validation or testing.
[6]:
display(metadata)
timestamp | anomaly | trainval |
---|---|---|
1970-01-01 00:00:00 | False | True |
1970-01-01 00:05:00 | False | True |
1970-01-01 00:10:00 | False | True |
1970-01-01 00:15:00 | False | True |
1970-01-01 00:20:00 | False | True |
... | ... | ... |
1970-02-04 16:55:00 | False | False |
1970-02-04 17:00:00 | False | False |
1970-02-04 17:05:00 | False | False |
1970-02-04 17:10:00 | False | False |
1970-02-04 17:15:00 | False | False |
10000 rows × 2 columns
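To get a quick visual sense of where the labeled anomalies fall, you could plot the series and mark the anomalous timestamps, e.g. with matplotlib. This is a rough sketch rather than part of the data loader API:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(time_series.index, time_series["horizontal"], lw=0.8, label="horizontal")

# Mark the timestamps that the metadata labels as anomalous.
anom_times = metadata.index[metadata["anomaly"].astype(bool)]
ax.scatter(anom_times, time_series.loc[anom_times, "horizontal"],
           color="red", s=8, zorder=3, label="anomaly")
ax.legend()
plt.show()
```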
[7]:
print(f"{100 - metadata.trainval.mean() * 100}% of the time series is for testing.")
print(f"{metadata.anomaly.mean() * 100}% of the time series is anomalous.")
75.0% of the time series is for testing.
19.57% of the time series is anomalous.
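When you feed this data to a model, a common pattern is to split each series on the trainval flag and (if you are using Merlion) wrap the pieces as TimeSeries objects. A minimal sketch, assuming merlion is installed:

```python
from merlion.utils import TimeSeries

# Split on the trainval flag and wrap each piece as a Merlion TimeSeries.
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])

# The corresponding anomaly labels can be wrapped the same way.
test_labels = TimeSeries.from_pd(metadata.anomaly[~metadata.trainval])
```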
General Purpose (Forecasting) Datasets
Next, let’s load a more general-purpose dataset for forecasting. We will use this opportunity to show some of the more advanced features as well. Here, our dataset consists of a single CSV file which contains many multivariate time series. These time series are collected from a large retailer, and each individual time series corresponds to a different department within a different store. Let’s have a look at the data.
[8]:
csv = os.path.join("..", "data", "walmart", "walmart_mini.csv")
display(pd.read_csv(csv))
 | Store | Dept | Date | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False |
1 | 1 | 1 | 2010-02-12 | 46039.49 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True |
2 | 1 | 1 | 2010-02-19 | 41595.55 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False |
3 | 1 | 1 | 2010-02-26 | 19403.54 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False |
4 | 1 | 1 | 2010-03-05 | 21827.90 | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2855 | 2 | 10 | 2012-09-28 | 37104.67 | 79.45 | 3.666 | 7106.05 | 1.91 | 1.65 | 1549.10 | 3946.03 | 222.616433 | 6.565 | False |
2856 | 2 | 10 | 2012-10-05 | 36361.28 | 70.27 | 3.617 | 6037.76 | NaN | 10.04 | 3027.37 | 3853.40 | 222.815930 | 6.170 | False |
2857 | 2 | 10 | 2012-10-12 | 35332.34 | 60.97 | 3.601 | 2145.50 | NaN | 33.31 | 586.83 | 10421.01 | 223.015426 | 6.170 | False |
2858 | 2 | 10 | 2012-10-19 | 35721.09 | 68.08 | 3.594 | 4461.89 | NaN | 1.14 | 1579.67 | 2642.29 | 223.059808 | 6.170 | False |
2859 | 2 | 10 | 2012-10-26 | 34260.76 | 69.79 | 3.506 | 6152.59 | 129.77 | 200.00 | 272.29 | 2924.15 | 223.078337 | 6.170 | False |
2860 rows × 14 columns
As before, we have a column `Date` indicating the time. Note that in this case, we have a string rather than a timestamp; this is also okay. However, we now also have some index columns `Store` and `Dept`, which are used to distinguish between different time series. We specify these to the data loader.
[9]:
from ts_datasets.forecast import CustomDataset
dataset = CustomDataset(
    rootdir=csv,  # where the data is stored
    index_cols=["Store", "Dept"],  # Individual time series are indexed by store & department
    test_frac=0.25,  # use 25% of each time series for testing.
    # overridden if the column `trainval` is in the actual CSV.
)
[10]:
print(f"There are {len(dataset)} time series in this dataset.")
time_series, metadata = dataset[17]
There are 20 time series in this dataset.
This particular time series is multivariate.
[11]:
display(time_series)
Date | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
---|---|---|---|---|---|---|---|---|---|---|---|
2010-02-05 | 69634.80 | 40.19 | 2.572 | NaN | NaN | NaN | NaN | NaN | 210.752605 | 8.324 | False |
2010-02-12 | 63393.29 | 38.49 | 2.548 | NaN | NaN | NaN | NaN | NaN | 210.897994 | 8.324 | True |
2010-02-19 | 66589.27 | 39.69 | 2.514 | NaN | NaN | NaN | NaN | NaN | 210.945160 | 8.324 | False |
2010-02-26 | 61875.48 | 46.10 | 2.561 | NaN | NaN | NaN | NaN | NaN | 210.975957 | 8.324 | False |
2010-03-05 | 67041.18 | 47.17 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.006754 | 8.324 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-09-28 | 57424.00 | 79.45 | 3.666 | 7106.05 | 1.91 | 1.65 | 1549.10 | 3946.03 | 222.616433 | 6.565 | False |
2012-10-05 | 62955.51 | 70.27 | 3.617 | 6037.76 | NaN | 10.04 | 3027.37 | 3853.40 | 222.815930 | 6.170 | False |
2012-10-12 | 63083.63 | 60.97 | 3.601 | 2145.50 | NaN | 33.31 | 586.83 | 10421.01 | 223.015426 | 6.170 | False |
2012-10-19 | 60502.97 | 68.08 | 3.594 | 4461.89 | NaN | 1.14 | 1579.67 | 2642.29 | 223.059808 | 6.170 | False |
2012-10-26 | 63992.36 | 69.79 | 3.506 | 6152.59 | 129.77 | 200.00 | 272.29 | 2924.15 | 223.078337 | 6.170 | False |
143 rows × 11 columns
The metadata has the same timestamps as the time series. It has a “trainval” column as before, plus index columns “Store” and “Dept”.
[12]:
display(metadata)
Date | trainval | Store | Dept |
---|---|---|---|
2010-02-05 | True | 2 | 8 |
2010-02-12 | True | 2 | 8 |
2010-02-19 | True | 2 | 8 |
2010-02-26 | True | 2 | 8 |
2010-03-05 | True | 2 | 8 |
... | ... | ... | ... |
2012-09-28 | False | 2 | 8 |
2012-10-05 | False | 2 | 8 |
2012-10-12 | False | 2 | 8 |
2012-10-19 | False | 2 | 8 |
2012-10-26 | False | 2 | 8 |
143 rows × 3 columns
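As with the anomaly data, you might split this series on trainval and keep only the target column if you want to do univariate forecasting. Again a sketch, assuming merlion is installed:

```python
from merlion.utils import TimeSeries

# Keep only the forecast target; the other columns could serve as extra
# variables for models that support them.
target = time_series[["Weekly_Sales"]]
train_data = TimeSeries.from_pd(target[metadata.trainval])
test_data = TimeSeries.from_pd(target[~metadata.trainval])
```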
Broader Takeaways
In general, a dataset can contain any number of CSVs stored under a single root directory. Each CSV can contain one or more time series, where the different time series within a single file are indicated by different values of the index column. Note that this works for anomaly detection as well! You just need to make sure that your CSVs all contain the `anomaly` column. In general, all features supported by `CustomDataset` are also supported by `CustomAnomalyDataset`, as long as your CSV files have the `anomaly` column.
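To assemble such a dataset yourself, it is enough to write CSVs of this shape under one root directory. The snippet below is purely illustrative (the directory, file, and column names are all made up): it writes a single CSV containing two series distinguished by an index column.

```python
import os
import pandas as pd

os.makedirs("my_dataset", exist_ok=True)

# Two short series in one file, distinguished by the "series_id" index column.
rows = []
for series_id in ["a", "b"]:
    for t in range(5):
        rows.append({
            "series_id": series_id,
            "timestamp": t * 300,    # Unix timestamp in seconds
            "value": float(t),       # the actual data column
            "anomaly": int(t == 3),  # 1 if this point is anomalous, else 0
        })
pd.DataFrame(rows).to_csv(os.path.join("my_dataset", "series.csv"), index=False)
```

Such a directory could then be loaded with CustomAnomalyDataset(rootdir="my_dataset", index_cols=["series_id"]), mirroring the forecasting example above.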
If you want to use either of the above custom datasets for benchmarking, you can call
python benchmark_anomaly.py --model IsolationForest --retrain_freq 7d \
--dataset CustomAnomalyDataset --data_root data/synthetic_anomaly \
--data_kwargs '{"assume_no_anomaly": true, "test_frac": 0.75}'
or
python benchmark_forecast.py --model AutoETS \
--dataset CustomDataset --data_root data/walmart/walmart_mini.csv \
--data_kwargs '{"test_frac": 0.25, "index_cols": ["Store", "Dept"], "data_cols": ["Weekly_Sales"]}'
Note that in the example above, we specify “data_cols” as “Weekly_Sales”. This indicates that the only column we are modeling is Weekly_Sales. If you wanted to do multivariate prediction, you could also add “Temperature”, “Fuel_Price”, “CPI”, etc. We treat the first of the data columns as the target univariate whose value you wish to forecast.
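Since data_kwargs are simply keyword arguments forwarded to the data loader, the equivalent construction in Python should look roughly like the following. Note that data_cols is assumed here to be accepted by the constructor, as the benchmark invocation above suggests:

```python
from ts_datasets.forecast import CustomDataset

# data_kwargs on the command line are forwarded to the dataset constructor,
# so this should mirror the benchmark_forecast.py invocation above.
dataset = CustomDataset(
    rootdir=csv,  # the walmart_mini.csv path defined earlier in this notebook
    test_frac=0.25,
    index_cols=["Store", "Dept"],
    data_cols=["Weekly_Sales"],  # model only Weekly_Sales, the forecast target
)
```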