ts_datasets: Easy Data Loading
ts_datasets implements Python classes that load numerous time series datasets into standardized pandas.DataFrame objects. The sub-modules are ts_datasets.anomaly for time series anomaly detection and ts_datasets.forecast for time series forecasting.
Simply install the package by calling pip install -e ts_datasets/ from the root directory of Merlion.
Then, you can load a dataset (e.g. the “realAWSCloudwatch” split of the Numenta Anomaly Benchmark or the “Hourly” subset of the M4 dataset) by calling:
from ts_datasets.anomaly import NAB
from ts_datasets.forecast import M4
anom_dataset = NAB(subset="realAWSCloudwatch", rootdir=path_to_NAB)
forecast_dataset = M4(subset="Hourly", rootdir=path_to_M4)
If you install this package in editable mode (i.e. specify -e when calling pip install -e ts_datasets/), there is no need to specify a rootdir for any of the data loaders.
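One way to handle the optional rootdir is sketched below. This is a minimal illustration, assuming ts_datasets is installed; loader_kwargs and load_examples are hypothetical helper names that are not part of the package. The rootdir argument is simply left out when no path is given, so an editable install can fall back to its own default location.

```python
def loader_kwargs(subset, rootdir=None):
    """Build keyword arguments for a ts_datasets data loader.

    Hypothetical helper: rootdir is omitted entirely when it is None,
    which is the case the editable install is meant to cover.
    """
    kwargs = {"subset": subset}
    if rootdir is not None:
        kwargs["rootdir"] = rootdir
    return kwargs


def load_examples(nab_root=None, m4_root=None):
    """Load the two example datasets, with or without explicit paths."""
    # Imported lazily so this sketch can be read and imported
    # even on a machine where ts_datasets is not installed.
    from ts_datasets.anomaly import NAB
    from ts_datasets.forecast import M4

    anom_dataset = NAB(**loader_kwargs("realAWSCloudwatch", nab_root))
    forecast_dataset = M4(**loader_kwargs("Hourly", m4_root))
    return anom_dataset, forecast_dataset
```

With an editable install, load_examples() can be called with no arguments; otherwise, pass the directories where the NAB and M4 data are stored.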
The core features of general data loaders (e.g. for forecasting) are outlined in the API doc for ts_datasets.base.BaseDataset, and the features for time series anomaly detection data loaders are outlined in the API doc for ts_datasets.anomaly.TSADBaseDataset.
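As a rough sketch of how that common interface might be used, the hypothetical helper below (summarize_dataset is not part of ts_datasets) relies only on two documented behaviors: iterating over a dataset yields (ts, metadata) pairs, and every metadata object carries a trainval key. The toy list at the end merely stands in for a real data loader.

```python
def summarize_dataset(dataset):
    """Return (index, n_timestamps, n_trainval, n_test) for each time series."""
    summary = []
    for i, (ts, metadata) in enumerate(dataset):
        # trainval holds one bool per timestamp: True means train/validation.
        n_trainval = sum(bool(flag) for flag in metadata["trainval"])
        summary.append((i, len(ts), n_trainval, len(ts) - n_trainval))
    return summary


# Toy stand-in: plain lists instead of DataFrames, purely for illustration.
toy_dataset = [
    ([0.1, 0.2, 0.3, 0.4], {"trainval": [True, True, True, False]}),
    ([1.0, 2.0], {"trainval": [True, False]}),
]
print(summarize_dataset(toy_dataset))  # [(0, 4, 3, 1), (1, 2, 1, 1)]
```

Because the helper only uses iteration, len() and key lookup, the same call should work on a loaded dataset such as the NAB or M4 objects above.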
The easiest way to load a custom dataset is to use either the ts_datasets.forecast.CustomDataset or ts_datasets.anomaly.CustomAnomalyDataset classes. Please review the tutorial to get started.
ts_datasets.anomaly | Datasets for time series anomaly detection (TSAD).
ts_datasets.forecast | Datasets for time series forecasting.
Subpackages
Submodules
datasets.base module
- class ts_datasets.base.BaseDataset
  Bases: object
  Base dataset class for storing time series as pd.DataFrame objects. Each dataset supports the following features:
  - __getitem__: you may call ts, metadata = dataset[i]. ts is a time-indexed pandas DataFrame, with each column representing a different variable (in the case of multivariate time series). metadata is a dict or pd.DataFrame with the same index as ts, with different keys indicating different dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.
  - __len__: calling len(dataset) will return the number of time series in the dataset.
  - __iter__: you may iterate over the pandas representations of the time series in the dataset with for ts, metadata in dataset: ...
  Note
  For each time series, the metadata will always have the key trainval, which is a pd.Series of bool indicating whether each timestamp of the time series should be training/validation (if True) or testing (if False).
  - time_series: list
    A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
  - metadata: list
    A list containing the metadata for all individual time series in the dataset.
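The trainval key described in the note can be used to split a single time series, as in the following sketch. It assumes only pandas; split_train_test is a hypothetical helper rather than part of ts_datasets, and the small DataFrames are made-up stand-ins for one (ts, metadata) pair returned by dataset[i].

```python
import pandas as pd


def split_train_test(ts, metadata):
    """Split one time series using the boolean trainval metadata."""
    trainval = metadata["trainval"]
    train = ts.loc[trainval]   # timestamps flagged True: training/validation
    test = ts.loc[~trainval]   # timestamps flagged False: testing
    return train, test


# Made-up stand-in for ts, metadata = dataset[i]
index = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)
metadata = pd.DataFrame(
    {"trainval": [True, True, True, True, False, False]}, index=index
)

train, test = split_train_test(ts, metadata)
print(len(train), len(test))  # 4 2
```

Since metadata["trainval"] works the same way on a dict and on a pd.DataFrame, the helper does not need to know which of the two a given data loader returns.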
- describe()