ts_datasets: Easy Data Loading

ts_datasets implements Python classes that load numerous time series datasets as standardized pandas.DataFrames. It has two sub-modules: ts_datasets.anomaly for time series anomaly detection, and ts_datasets.forecast for time series forecasting. To install the package, run pip install -e ts_datasets/ from the root directory of Merlion. You can then load a dataset (e.g. the “realAWSCloudwatch” split of the Numenta Anomaly Benchmark or the “Hourly” subset of the M4 dataset) by calling

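from ts_datasets.anomaly import NAB
from ts_datasets.forecast import M4

# path_to_NAB and path_to_M4 are placeholders for the directories
# that hold the raw files of each dataset.
anom_dataset = NAB(subset="realAWSCloudwatch", rootdir=path_to_NAB)
forecast_dataset = M4(subset="Hourly", rootdir=path_to_M4)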

If you install this package in editable mode (i.e. with the -e flag, as in pip install -e ts_datasets/), there is no need to specify a rootdir for any of the data loaders.

The core features of general data loaders (e.g. for forecasting) are outlined in the API doc for ts_datasets.base.BaseDataset, and the features for time series anomaly detection data loaders are outlined in the API doc for ts_datasets.anomaly.TSADBaseDataset.

The easiest way to load a custom dataset is to use either the ts_datasets.forecast.CustomDataset or ts_datasets.anomaly.CustomAnomalyDataset classes. Please review the tutorial to get started.

anomaly

Datasets for time series anomaly detection (TSAD).

forecast

Datasets for time series forecasting.

Submodules

ts_datasets.base module

class ts_datasets.base.BaseDataset

Bases: object

Base dataset class for storing time series as pd.DataFrames. Each dataset supports the following features:

  1. __getitem__: you may call ts, metadata = dataset[i]. ts is a time-indexed pandas DataFrame, with each column representing a different variable (in the case of multivariate time series). metadata is a dict or pd.DataFrame with the same index as ts, with different keys indicating different dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.

  2. __len__: Calling len(dataset) will return the number of time series in the dataset.

  3. __iter__: You may iterate over the pandas representations of the time series in the dataset with for ts, metadata in dataset: ...

Note

For each time series, the metadata will always have the key trainval, which is a pd.Series of bool indicating whether each timestamp of the time series belongs to the training/validation split (if True) or the test split (if False).
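The three features and the trainval key can be illustrated with a small stand-in. The sketch below does not use ts_datasets itself: ToyDataset is a hypothetical class that only mimics the interface described above, and the column name value and the half/half split are assumptions chosen for illustration.

```python
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in mimicking the BaseDataset interface (not part of ts_datasets)."""

    def __init__(self):
        self.time_series = []
        self.metadata = []
        for n in (6, 8):
            index = pd.date_range("2021-01-01", periods=n, freq="D")
            ts = pd.DataFrame({"value": range(n)}, index=index)
            # Assumption: the first half of each series is train/validation.
            md = pd.DataFrame({"trainval": [i < n // 2 for i in range(n)]}, index=index)
            self.time_series.append(ts)
            self.metadata.append(md)

    def __len__(self):
        # Number of time series in the dataset.
        return len(self.time_series)

    def __getitem__(self, i):
        # Return the i-th time series together with its metadata.
        return self.time_series[i], self.metadata[i]

    def __iter__(self):
        return (self[i] for i in range(len(self)))


dataset = ToyDataset()

# __getitem__: fetch one series, then split it with the boolean trainval column.
ts, metadata = dataset[0]
train = ts[metadata["trainval"]]
test = ts[~metadata["trainval"]]

# __iter__: count train/validation and test timestamps for every series.
split_sizes = [
    (int(md["trainval"].sum()), int((~md["trainval"]).sum())) for _, md in dataset
]
```

The same boolean-mask split should apply to a real loader's output, since metadata shares its index with ts.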

time_series: list

A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
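One way to picture the lazy-reading behavior is the sketch below. It is an assumption about how such a loader could work, not the library's actual implementation: LazyToyDataset, the CSV layout, and the value column are all hypothetical.

```python
import os
import tempfile

import pandas as pd


class LazyToyDataset:
    """Hypothetical loader whose time_series holds filenames (not part of ts_datasets)."""

    def __init__(self, filenames):
        # Only the paths are stored; no file is read at construction time.
        self.time_series = list(filenames)

    def __len__(self):
        return len(self.time_series)

    def __getitem__(self, i):
        # The file is read only when the i-th series is requested.
        df = pd.read_csv(self.time_series[i], index_col=0, parse_dates=True)
        metadata = df[["trainval"]]
        ts = df.drop(columns="trainval")
        return ts, metadata

    def __iter__(self):
        # Iteration goes through __getitem__, so each file is read on demand.
        return (self[i] for i in range(len(self)))


# Write three small CSV files to a temporary directory for the demonstration.
tmpdir = tempfile.mkdtemp()
paths = []
for k in range(3):
    index = pd.date_range("2021-01-01", periods=4, freq="D")
    frame = pd.DataFrame(
        {"value": [k] * 4, "trainval": [True, True, False, False]}, index=index
    )
    path = os.path.join(tmpdir, f"series_{k}.csv")
    frame.to_csv(path)
    paths.append(path)

lazy = LazyToyDataset(paths)
first_ts, first_md = lazy[0]
shapes = [ts.shape for ts, _ in lazy]
```

Storing paths instead of DataFrames keeps memory use proportional to a single series at a time, at the cost of re-reading a file on every access.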

metadata: list

A list containing the metadata for all individual time series in the dataset.

describe()