ts_datasets.anomaly package
Datasets for time series anomaly detection (TSAD). All the time series in these datasets have anomaly labels.
- ts_datasets.anomaly.get_dataset(dataset_name, rootdir=None, **kwargs)
- Parameters
dataset_name (str) – the name of the dataset to load, formatted as <name> or <name>_<subset>, e.g. IOpsCompetition or NAB_realAWSCloudwatch
rootdir (Optional[str]) – the directory where the desired dataset is stored. Not required if the package ts_datasets is installed in editable mode, i.e. with the flag -e.
kwargs – keyword arguments for the data loader you are trying to load.
- Return type
TSADBaseDataset
- Returns
the data loader for the desired dataset (and subset)
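For example, assuming ts_datasets is installed in editable mode (so that rootdir may be omitted), a minimal usage sketch looks like this; the dataset name "NAB_realAWSCloudwatch" is just one illustrative choice:

    from ts_datasets.anomaly import get_dataset

    # "<name>_<subset>" selects a subset of a dataset (here, NAB's realAWSCloudwatch subset);
    # a bare "<name>" such as "IOpsCompetition" loads the full dataset.
    dataset = get_dataset("NAB_realAWSCloudwatch")
    print(len(dataset))        # number of time series in the dataset

    ts, metadata = dataset[0]  # first time series and its per-timestamp metadata
    print(ts.head())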
- class ts_datasets.anomaly.TSADBaseDataset
Bases: BaseDataset
Base dataset class for storing time series intended for anomaly detection.
Each dataset supports the following features:
- __getitem__: you may call ts, metadata = dataset[i]. ts is a time-indexed pandas DataFrame, with each column representing a different variable (in the case of multivariate time series). metadata is a dict or pd.DataFrame with the same index as ts, with different keys indicating different dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.
- __len__: calling len(dataset) will return the number of time series in the dataset.
- __iter__: you may iterate over the pandas representations of the time series in the dataset with for ts, metadata in dataset: ...
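As a sketch of the typical access pattern (using IOpsCompetition as an illustrative subclass, and assuming its data is available at the default rootdir), the metadata keys trainval and anomaly described in the notes below can be used to split each series:

    from ts_datasets.anomaly import IOpsCompetition

    dataset = IOpsCompetition()

    for ts, metadata in dataset:                  # __iter__ yields (time series, metadata) pairs
        train = ts[metadata["trainval"]]          # training/validation section
        test = ts[~metadata["trainval"]]          # test section
        test_labels = metadata["anomaly"][~metadata["trainval"]]  # boolean anomaly labels for the test section
        break                                     # inspect only the first time series here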
Note
For each time series, the metadata will always have the key trainval, which is a pd.Series of bool indicating whether each timestamp of the time series should be used for training/validation (if True) or testing (if False).
Note
For each time series, the metadata will always have the key anomaly, which is a pd.Series of bool indicating whether each timestamp is anomalous.
- property max_lead_sec
The maximum number of seconds an anomaly may be detected early, for this dataset.
None signifies no early detections allowed, or that the user may override this value with something better suited for their purposes.
- property max_lag_sec
The maximum number of seconds after the start of an anomaly, that we consider detections to be accurate (and not ignored for being too late).
None signifies that any detection in the window is acceptable, or that the user may override this value with something better suited for their purposes.
- describe()
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.CustomAnomalyDataset(rootdir, test_frac=0.5, assume_no_anomaly=False, time_col=None, time_unit='s', data_cols=None, index_cols=None)
Bases: CustomDataset, TSADBaseDataset
Wrapper to load a custom dataset for anomaly detection. Please review the tutorial <tutorials/CustomDataset> to get started.
- Parameters
rootdir – Filename of a single CSV, or a directory containing many CSVs. Each CSV must contain 1 or more time series.
test_frac – If we don’t find a column “trainval” in the time series, this is the fraction of each time series which we use for testing.
assume_no_anomaly – If we don’t find a column “anomaly” in the time series, we assume there are no anomalies in the data if this value is True, and we throw an exception if this value is False.
time_col – Name of the column used to index time. We use the first non-index, non-metadata column if none is given.
time_unit – If the time column is numerical, we assume it is a timestamp expressed in this unit.
data_cols – Name of the columns to fetch from the dataset. If None, use all non-time, non-index columns.
index_cols – If a CSV file contains multiple time series, these are the columns used to index those time series. For example, a CSV file may contain time series of sales for many (store, department) pairs. In this case, index_cols may be ["Store", "Dept"]. The values of the index columns will be added to the metadata of the data loader. (See the example sketch after this list.)
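A minimal sketch of loading a custom dataset, assuming a hypothetical CSV data/sales.csv with columns timestamp, Store, Dept, Sales and anomaly (all of these names are illustrative, not part of the package):

    from ts_datasets.anomaly import CustomAnomalyDataset

    dataset = CustomAnomalyDataset(
        rootdir="data/sales.csv",      # single CSV containing many time series (illustrative path)
        test_frac=0.25,                # last 25% of each series is used for testing
        assume_no_anomaly=False,       # require an "anomaly" column in the CSV
        time_col="timestamp",
        index_cols=["Store", "Dept"],  # one time series per (Store, Dept) pair
    )
    ts, metadata = dataset[0]
    print(len(dataset), ts.shape)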
- property metadata_cols
- check_ts_for_metadata(ts, col)
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.IOpsCompetition(rootdir=None)
Bases: TSADBaseDataset
Wrapper to load the dataset used for the final round of the IOPs competition (http://iops.ai/competition_detail/?competition_id=5).
The dataset contains 29 time series of KPIs gathered from large tech companies (Alibaba, Sogou, Tencent, Baidu, and eBay). These time series are sampled at either 1min or 5min intervals, and are split into train and test sections.
Note that the original competition prohibited algorithms which directly hard-coded the KPI ID to set model parameters. So training a new model for each time series was against competition rules. They did, however, allow algorithms which analyzed each time series (in an automated way), and used the results of that automated analysis to perform algorithm/model selection.
- Parameters
rootdir – The root directory at which the dataset can be found.
- property max_lag_sec
The IOps competition allows anomalies to be detected up to 35min after they start. We are currently not using this, but we are leaving the override here as a placeholder, if we want to change it later.
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.NAB(subset='all', rootdir=None)
Bases: TSADBaseDataset
Wrapper to load datasets found in the Numenta Anomaly Benchmark (https://github.com/numenta/NAB).
NAB contains a range of datasets, which are categorized by their domains.
- Parameters
subset – One of the elements in valid_subsets.
rootdir – The root directory at which the dataset can be found.
- valid_subsets = ['all', 'artificial', 'artificialWithAnomaly', 'realAWSCloudwatch', 'realAdExchange', 'realKnownCause', 'realTraffic', 'realTweets']
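For instance, a single domain may be loaded by passing one of the subsets above (a sketch, assuming the NAB data is available at, or can be downloaded to, the default rootdir):

    from ts_datasets.anomaly import NAB

    nab = NAB(subset="realKnownCause")
    for ts, metadata in nab:
        # series length and number of anomalous timestamps
        print(len(ts), int(metadata["anomaly"].sum()))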
- static load_labels(datafile, label_list, freq)
- property max_lead_sec
The anomalies in the NAB dataset are already windows which permit early detection. So we explicitly disallow any earlier detection.
- download(rootdir, subsets)
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.Synthetic(subset='anomaly', rootdir=None)
Bases: TSADBaseDataset
Wrapper to load a synthetically generated dataset. The dataset was generated using three base time series, each of which was separately injected with shocks, spikes, dips and level shifts, making a total of 15 time series (including the base time series without anomalies). Subsets are defined by the base time series used (“horizontal”, “seasonal”, “upward_downward”), or the type of injected anomaly (“shock”, “spike”, “dip”, “level”). The “anomaly” subset refers to all time series with injected anomalies (12), while “base” refers to all time series without them (3).
- base_ts_subsets = ['horizontal', 'seasonal', 'upward_downward']
- anomaly_subsets = ['shock', 'spike', 'dip', 'level', 'trend']
- valid_subsets = ['anomaly', 'all', 'base', 'horizontal', 'seasonal', 'upward_downward', 'shock', 'spike', 'dip', 'level', 'trend']
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.UCR(rootdir=None)
Bases: TSADBaseDataset
Data loader for the Hexagon ML/UC Riverside Time Series Anomaly Archive.
See here for details.
Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping Chen, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, & Hexagon-ML (2019). The UCR Time Series Classification Archive. URL https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- download(rootdir)
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.SMD(subset='all', rootdir=None)
Bases: TSADBaseDataset
The Server Machine Dataset (SMD) is a 5-week-long dataset collected from a large Internet company and made publicly available. It contains data from 28 server machines, each monitored by 33 metrics. SMD is divided into a training set and a testing set of equal size.
- filename = 'ServerMachineDataset'
- url = 'https://www.dropbox.com/s/x53ph5cru62kv0f/ServerMachineDataset.tar.gz?dl=1'
- valid_subsets = ['machine-1-1', 'machine-1-2', 'machine-1-3', 'machine-1-4', 'machine-1-5', 'machine-1-6', 'machine-1-7', 'machine-1-8', 'machine-2-1', 'machine-2-2', 'machine-2-3', 'machine-2-4', 'machine-2-5', 'machine-2-6', 'machine-2-7', 'machine-2-8', 'machine-2-9', 'machine-3-1', 'machine-3-2', 'machine-3-3', 'machine-3-4', 'machine-3-5', 'machine-3-6', 'machine-3-7', 'machine-3-8', 'machine-3-9', 'machine-3-10', 'machine-3-11']
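A brief sketch of loading a single machine's multivariate series (assuming the archive is available at, or can be downloaded to, the default rootdir):

    from ts_datasets.anomaly import SMD

    smd = SMD(subset="machine-1-1")
    ts, metadata = smd[0]
    print(ts.shape)  # (num_timestamps, num_metrics) for this machine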
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.SMAP(subset=None, rootdir=None)
Bases: TSADBaseDataset
Soil Moisture Active Passive (SMAP) satellite and Mars Science Laboratory (MSL) rover datasets. SMAP and MSL are two real-world public datasets expert-labeled by NASA.
- url = 'https://www.dropbox.com/s/uv9ojw353qwzqht/SMAP.tar.gz?dl=1'
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.
- class ts_datasets.anomaly.MSL(subset=None, rootdir=None)
Bases: TSADBaseDataset
Soil Moisture Active Passive (SMAP) satellite and Mars Science Laboratory (MSL) rover datasets. SMAP and MSL are two real-world public datasets expert-labeled by NASA.
- url = 'https://www.dropbox.com/s/uv9ojw353qwzqht/SMAP.tar.gz?dl=1'
- time_series: list
A list of all individual time series contained in the dataset. Iterating over the dataset will iterate over this list. Note that for some large datasets, time_series may be a list of filenames, which are read lazily either during iteration, or whenever __getitem__ is invoked.
- metadata: list
A list containing the metadata for all individual time series in the dataset.