ts_datasets: Easy Data Loading

ts_datasets implements Python classes that load numerous time series datasets and convert them into standardized pandas.DataFrame objects. It has two sub-modules: ts_datasets.anomaly for time series anomaly detection, and ts_datasets.forecast for time series forecasting. Install the package by calling pip install -e ts_datasets/ from the root directory of Merlion. Then, you can load a dataset (e.g. the “realAWSCloudwatch” split of the Numenta Anomaly Benchmark or the “Hourly” subset of the M4 dataset) by calling

from ts_datasets.anomaly import NAB
from ts_datasets.forecast import M4
anom_dataset = NAB(subset="realAWSCloudwatch", rootdir=path_to_NAB)
forecast_dataset = M4(subset="Hourly", rootdir=path_to_M4)

If you install this package in editable mode (i.e. pass the -e flag, as in pip install -e ts_datasets/), there is no need to specify a rootdir for any of the data loaders.

The core features of general data loaders (e.g. for forecasting) are outlined in the API doc for ts_datasets.base.BaseDataset, and the features for time series anomaly detection data loaders are outlined in the API doc for ts_datasets.anomaly.TSADBaseDataset.

This package implements Python classes that convert numerous time series datasets (both open source and internal) into standardized pd.DataFrame objects.

anomaly

Datasets for time series anomaly detection (TSAD).

forecast

Datasets for time series forecasting.

Subpackages

Submodules

ts_datasets.base module

class ts_datasets.base.BaseDataset

Bases: object

Base dataset class for storing time series as pd.DataFrame objects. Each dataset supports the following features:

  1. __getitem__: You may call ts, metadata = dataset[i]. ts is a time-indexed pd.DataFrame, with each column representing a different variable (in the case of multivariate time series). metadata is a dict or pd.DataFrame with the same index as ts, whose keys hold dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.

  2. __len__: Calling len(dataset) will return the number of time series in the dataset.

  3. __iter__: You may iterate over the pandas representations of the time series in the dataset with for ts, metadata in dataset: ...
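The three features above can be illustrated with a small stand-in class. This is only a sketch of the access pattern, not the library's implementation: ToyDataset, its synthetic values, and the column name value are hypothetical and exist only for this example.

```python
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in that mimics the three features listed above."""

    def __init__(self):
        self.time_series = []
        self.metadata = []
        # Build two short synthetic series so the example needs no data files.
        for offset in (0.0, 10.0):
            index = pd.date_range("2021-01-01", periods=6, freq="D")
            ts = pd.DataFrame({"value": [offset + i for i in range(6)]}, index=index)
            # Metadata shares the index of the time series it describes.
            md = pd.DataFrame({"trainval": [True] * 4 + [False] * 2}, index=index)
            self.time_series.append(ts)
            self.metadata.append(md)

    def __len__(self):
        # Number of time series in the dataset.
        return len(self.time_series)

    def __getitem__(self, i):
        # Return the (time series, metadata) pair at position i.
        return self.time_series[i], self.metadata[i]

    def __iter__(self):
        # Yield (time series, metadata) pairs in order.
        return (self[i] for i in range(len(self)))


dataset = ToyDataset()
ts, metadata = dataset[0]
n_series = len(dataset)
lengths = [len(ts_i) for ts_i, _ in dataset]
```

A real loader such as NAB or M4 would be used the same way, with dataset replaced by the loaded object.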

Note

For each time series, the metadata will always have the key trainval, which is a pd.Series of bool indicating whether each timestamp of the time series belongs to the training/validation split (if True) or the test split (if False).
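Because trainval is a boolean series aligned with the time series index, it can be used directly as a mask. The sketch below uses synthetic data in place of a real dataset; the column name value and the dates are assumptions made for illustration.

```python
import pandas as pd

# Synthetic stand-ins for the (ts, metadata) pair returned by a dataset.
index = pd.date_range("2021-01-01", periods=5, freq="D")
ts = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)
metadata = pd.DataFrame({"trainval": [True, True, True, False, False]}, index=index)

# Rows where trainval is True form the training/validation split.
train = ts[metadata["trainval"]]
# Rows where trainval is False form the test split.
test = ts[~metadata["trainval"]]
```

The same two lines apply when metadata is a dict, since metadata["trainval"] is a pd.Series in either case.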

time_series: list

A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
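One way to picture the lazy case is a dataset whose time_series attribute holds file paths and reads each file only when it is accessed. This is a sketch under assumptions, not the library's code: LazyToyDataset, the CSV layout, and the temporary files are hypothetical, and metadata is omitted to keep the example short.

```python
import os
import tempfile

import pandas as pd


class LazyToyDataset:
    """Hypothetical dataset that stores filenames and reads them on access."""

    def __init__(self, filenames):
        # Only the paths are kept in memory, not the data itself.
        self.time_series = list(filenames)

    def __len__(self):
        return len(self.time_series)

    def __getitem__(self, i):
        # The file is read here, at access time.
        return pd.read_csv(self.time_series[i], index_col=0, parse_dates=True)

    def __iter__(self):
        # Iteration reads one file at a time.
        return (self[i] for i in range(len(self)))


# Write two small CSV files to a temporary directory for the demonstration.
tmpdir = tempfile.mkdtemp()
paths = []
for k in range(2):
    index = pd.date_range("2021-01-01", periods=3, freq="D")
    path = os.path.join(tmpdir, f"series_{k}.csv")
    pd.DataFrame({"value": [k, k + 1, k + 2]}, index=index).to_csv(path)
    paths.append(path)

lazy = LazyToyDataset(paths)
first = lazy[0]  # triggers the read of the first file only
```

Holding paths instead of DataFrames keeps memory usage flat regardless of how many series the dataset contains, at the cost of re-reading a file on each access.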

metadata: list

A list containing the metadata for all individual time series in the dataset.

describe()