ts_datasets: Easy Data Loading
ts_datasets implements Python classes that convert numerous time series datasets
into standardized pandas.DataFrame objects. The sub-modules are ts_datasets.anomaly
for time series anomaly detection and ts_datasets.forecast for time series forecasting.
Simply install the package by calling pip install -e ts_datasets/ from the root directory of Merlion.
Then, you can load a dataset (e.g. the “realAWSCloudwatch” split of the Numenta Anomaly Benchmark
or the “Hourly” subset of the M4 dataset) by calling:
from ts_datasets.anomaly import NAB
from ts_datasets.forecast import M4
anom_dataset = NAB(subset="realAWSCloudwatch", rootdir=path_to_NAB)
forecast_dataset = M4(subset="Hourly", rootdir=path_to_M4)
If you install this package in editable mode (i.e. specify -e when calling pip install -e ts_datasets/),
there is no need to specify a rootdir for any of the data loaders.
The core features of general data loaders (e.g. for forecasting) are outlined in the API doc for
ts_datasets.base.BaseDataset, and the features for time series anomaly detection data loaders
are outlined in the API doc for ts_datasets.anomaly.TSADBaseDataset.
The easiest way to load a custom dataset is to use either the ts_datasets.forecast.CustomDataset or
ts_datasets.anomaly.CustomAnomalyDataset classes. Please review the tutorial
to get started.
- ts_datasets.anomaly: Datasets for time series anomaly detection (TSAD).
- ts_datasets.forecast: Datasets for time series forecasting.
ts_datasets.base module
- class ts_datasets.base.BaseDataset
Bases: object

Base dataset class for storing time series as pd.DataFrames. Each dataset supports the following features:

- __getitem__: you may call ts, metadata = dataset[i]. ts is a time-indexed pandas DataFrame, with each column representing a different variable (in the case of multivariate time series). metadata is a dict or pd.DataFrame with the same index as ts, with different keys indicating different dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.
- __len__: Calling len(dataset) will return the number of time series in the dataset.
- __iter__: You may iterate over the pandas representations of the time series in the dataset with for ts, metadata in dataset: ...
Note
For each time series, the metadata will always have the key trainval, which is a pd.Series of bool indicating whether each timestamp of the time series should be training/validation (if True) or testing (if False).

- time_series: list
  A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
  A list containing the metadata for all individual time series in the dataset.
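
The interface described above can be illustrated with a minimal sketch. ToyDataset and split_trainval are hypothetical names introduced only for this example and are not part of ts_datasets. The toy class merely mimics the documented __getitem__, __len__, and __iter__ behaviour on synthetic data, and it assumes the metadata is a pd.DataFrame with a boolean trainval column.

```python
import numpy as np
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in that mimics the documented BaseDataset interface."""

    def __init__(self):
        self.time_series = []  # one DataFrame per time series
        self.metadata = []     # one metadata DataFrame per time series
        for n in (10, 20):
            index = pd.date_range("2021-01-01", periods=n, freq="D")
            ts = pd.DataFrame({"value": np.arange(n, dtype=float)}, index=index)
            # Assumption for this sketch: the first half of each series is train/validation.
            md = pd.DataFrame({"trainval": np.arange(n) < n // 2}, index=index)
            self.time_series.append(ts)
            self.metadata.append(md)

    def __getitem__(self, i):
        return self.time_series[i], self.metadata[i]

    def __len__(self):
        return len(self.time_series)

    def __iter__(self):
        return (self[i] for i in range(len(self)))


def split_trainval(ts, metadata):
    """Split one series into train/validation and test parts using the trainval flags."""
    mask = metadata["trainval"]
    return ts[mask], ts[~mask]


dataset = ToyDataset()

# Index a single series, then split it with its own metadata.
ts, metadata = dataset[0]
train, test = split_trainval(ts, metadata)

# Iterate over every (ts, metadata) pair in the dataset.
sizes = [tuple(len(part) for part in split_trainval(t, m)) for t, m in dataset]
```

If a real loader's metadata follows the same convention, split_trainval should apply to its (ts, metadata) pairs unchanged, because the boolean mask shares the index of ts.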
- describe()