merlion.evaluate package
This sub-package implements utilities and metrics for evaluating the performance of time series models on different tasks.
- merlion.evaluate.base: Base class for an automated model evaluation framework.
- merlion.evaluate.anomaly: Metrics and utilities for evaluating time series anomaly detection models.
- merlion.evaluate.forecast: Metrics and utilities for evaluating forecasting models in a continuous sense.
merlion.evaluate.base
Base class for an automated model evaluation framework.
- class merlion.evaluate.base.EvaluatorConfig(train_window=None, retrain_freq=None, cadence=None)
Bases:
object
Abstract class which defines an evaluator config.
- Parameters
  - train_window (Optional[float]) – the maximum duration of data we would like to train the model on. None means no limit.
  - retrain_freq (Optional[float]) – the frequency at which we want to re-train the model. None means we only train the model once on the initial training data.
  - cadence (Optional[float]) – the frequency at which we want to obtain predictions from the model. None means that we obtain a new prediction at the same frequency as the model’s predictive horizon. 0 means that we obtain a new prediction at every timestamp.
- property train_window: Optional[Union[Timedelta, DateOffset]]
- Returns
the maximum duration of data we would like to train the model on. None means no limit.
- property retrain_freq: Optional[Union[Timedelta, DateOffset]]
- Returns
the frequency at which we want to re-train the model. None means we only train the model once on the initial training data.
- property cadence: Union[Timedelta, DateOffset]
- Returns
the cadence at which we are having our model produce new predictions. Defaults to the retraining frequency if not explicitly provided.
- property horizon: DateOffset
- Returns
the horizon our model is predicting into the future. Equal to the prediction cadence by default.
- to_dict()
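A minimal construction sketch follows. It assumes the durations are given as floats in seconds (consistent with the seconds-based options in the task-specific configs below); the concrete evaluator configs accept the same keyword arguments plus their own task-specific ones.

```python
from merlion.evaluate.base import EvaluatorConfig

# Hedged sketch: train on at most 30 days of history, re-train daily, and ask the
# model for new predictions every hour. Durations are assumed to be in seconds.
config = EvaluatorConfig(
    train_window=30 * 86400,  # maximum training duration
    retrain_freq=86400,       # re-train once per day
    cadence=3600,             # obtain new predictions hourly
)
print(config.train_window, config.retrain_freq, config.cadence)
print(config.to_dict())
```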
- class merlion.evaluate.base.EvaluatorBase(model, config)
Bases:
object
An evaluator simulates the live deployment of a model on historical data. It trains a model on an initial time series, and then re-trains that model at a specified frequency.
The EvaluatorBase.get_predict method returns the train & test predictions of a model, as if it were being trained incrementally on the test data in the manner described above.
Subclasses define slightly different protocols for different tasks, e.g. anomaly detection vs. forecasting.
- Parameters
  - model (ModelBase) – the model to evaluate.
  - config (EvaluatorConfig) – the evaluation configuration.
- config_class
alias of
EvaluatorConfig
- property train_window
- property retrain_freq
- property cadence
- property horizon
- default_train_kwargs()
- Return type
dict
- default_retrain_kwargs()
- Return type
dict
- get_predict(train_vals, test_vals, exog_data=None, train_kwargs=None, retrain_kwargs=None)
Initialize the model by training it on an initial set of train data. Get the model’s predictions on the test data, retraining the model as appropriate.
- Parameters
  - train_vals (TimeSeries) – initial training data
  - test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare it to the ground truth
  - exog_data (Optional[TimeSeries]) – any exogenous data (only used for some models)
  - train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process
  - retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings
- Return type
Tuple[Any, Union[TimeSeries, List[TimeSeries]]]
- Returns
(train_result, result). train_result is the output of training the model on train_vals (None if pretrained is True). result is the model’s predictions on test_vals, and is specific to each evaluation task.
- abstract evaluate(ground_truth, predict, metric)
Given the ground truth time series & the model’s prediction (as produced by EvaluatorBase.get_predict), compute the specified evaluation metric. If no metric is specified, return the appropriate score accumulator for the task. Implementation is task-specific.
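As a hedged sketch of the protocol shared by all subclasses, the snippet below uses the anomaly-detection subclass documented in the next section; `model`, `train_ts`, `test_ts`, and `test_labels` are placeholders for a Merlion anomaly detector and TimeSeries you already have, not part of this API.

```python
from merlion.evaluate.anomaly import TSADEvaluator, TSADEvaluatorConfig, TSADMetric

# Hedged sketch of the generic evaluation protocol. `model`, `train_ts`, `test_ts`,
# and `test_labels` are placeholders; retrain_freq is assumed to be in seconds.
evaluator = TSADEvaluator(model=model, config=TSADEvaluatorConfig(retrain_freq=86400))
train_result, test_result = evaluator.get_predict(train_vals=train_ts, test_vals=test_ts)
score = evaluator.evaluate(ground_truth=test_labels, predict=test_result, metric=TSADMetric.F1)
```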
merlion.evaluate.anomaly
Metrics and utilities for evaluating time series anomaly detection models.
- class merlion.evaluate.anomaly.ScoreType(value)
Bases:
Enum
The algorithm to use to compute true/false positives/negatives. See the technical report for more details on each score type. Merlion’s preferred default is revised point-adjusted.
- Pointwise = 0
- PointAdjusted = 1
- RevisedPointAdjusted = 2
- class merlion.evaluate.anomaly.TSADScoreAccumulator(num_tp_anom=0, num_tp_pointwise=0, num_tp_point_adj=0, num_fn_anom=0, num_fn_pointwise=0, num_fn_point_adj=0, num_fp=0, num_tn=0, tp_score=0.0, fp_score=0.0, tp_detection_delays=None, tp_anom_durations=None, anom_durations=None)
Bases:
object
Accumulator which maintains summary statistics describing an anomaly detection algorithm’s performance. Can be used to compute many different time series anomaly detection metrics.
- precision(score_type=ScoreType.RevisedPointAdjusted)
- recall(score_type=ScoreType.RevisedPointAdjusted)
- f1(score_type=ScoreType.RevisedPointAdjusted)
- f_beta(score_type=ScoreType.RevisedPointAdjusted, beta=1.0)
- mean_time_to_detect()
- mean_detected_anomaly_duration()
- mean_anomaly_duration()
- nab_score(tp_weight=1.0, fp_weight=0.11, fn_weight=1.0, tn_weight=0.0)
Computes the NAB score, given the accumulated performance metrics and the specified weights for different types of errors. The score is described in section II.C of https://arxiv.org/pdf/1510.03336.pdf. At a high level, this score is a cost-sensitive, recency-weighted accuracy measure for time series anomaly detection.
NAB uses the following profiles for benchmarking (https://github.com/numenta/NAB/blob/master/config/profiles.json):
- standard (default): tp_weight = 1.0, fp_weight = 0.11, fn_weight = 1.0
- reward low false positive rate: tp_weight = 1.0, fp_weight = 0.22, fn_weight = 1.0
- reward low false negative rate: tp_weight = 1.0, fp_weight = 0.11, fn_weight = 2.0
Note that tn_weight is ignored.
- Parameters
tp_weight – relative weight of true positives.
fp_weight – relative weight of false positives.
fn_weight – relative weight of false negatives.
tn_weight – relative weight of true negatives. Ignored, but included for completeness.
- Returns
NAB score
- merlion.evaluate.anomaly.accumulate_tsad_score(ground_truth, predict, max_early_sec=None, max_delay_sec=None, metric=None)
Computes the components required to compute multiple different types of performance metrics for time series anomaly detection.
- Parameters
  - ground_truth (Union[TimeSeries, UnivariateTimeSeries]) – A time series indicating whether each time step corresponds to an anomaly.
  - predict (Union[TimeSeries, UnivariateTimeSeries]) – A time series with the anomaly score predicted for each time step. Detections correspond to nonzero scores.
  - max_early_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur before the actual incident. If None, no early detections are allowed. Note that None is the same as 0.
  - max_delay_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur after the start of the actual incident (but before the end of the actual incident). If None, we allow any detection during the duration of the incident. Note that None differs from 0, because 0 means that we only permit detections that are early or exactly on time!
  - metric – A function which takes a TSADScoreAccumulator as input and returns a float. The TSADScoreAccumulator object is returned if metric is None.
- Return type
Union[TSADScoreAccumulator, float]
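A hedged usage sketch: `labels` and `scores` are assumed to be TimeSeries of ground-truth anomaly labels and predicted anomaly scores covering the same time range; all calls use only the accumulator methods documented above.

```python
from merlion.evaluate.anomaly import accumulate_tsad_score, ScoreType

# Hedged sketch: omit `metric` to get the accumulator back and compute several metrics from it.
acc = accumulate_tsad_score(ground_truth=labels, predict=scores, max_delay_sec=3600)
print(acc.precision(ScoreType.RevisedPointAdjusted))
print(acc.recall(ScoreType.RevisedPointAdjusted))
print(acc.f1(ScoreType.Pointwise))
print(acc.nab_score())                # standard NAB profile
print(acc.nab_score(fp_weight=0.22))  # "reward low false positive rate" profile
```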
- class merlion.evaluate.anomaly.TSADMetric(value)
Bases:
Enum
Enumeration of evaluation metrics for time series anomaly detection. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predicted, **kwargs).
- MeanTimeToDetect = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.mean_time_to_detect>)
- F1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
- Precision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
- Recall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
- PointwiseF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.Pointwise: 0>))
- PointwisePrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.Pointwise: 0>))
- PointwiseRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.Pointwise: 0>))
- PointAdjustedF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.PointAdjusted: 1>))
- PointAdjustedPrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.PointAdjusted: 1>))
- PointAdjustedRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.PointAdjusted: 1>))
- NABScore = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.nab_score>)
- NABScoreLowFN = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fn_weight=2.0))
- NABScoreLowFP = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fp_weight=0.22))
- F2 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=2.0))
- F5 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=5.0))
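Because each enum value is a partial of accumulate_tsad_score, it can be called directly. A hedged sketch, again assuming `labels` and `scores` are TimeSeries of ground-truth labels and predicted scores:

```python
from merlion.evaluate.anomaly import TSADMetric

# Hedged sketch: call the underlying partial via .value with positional arguments.
f1 = TSADMetric.F1.value(labels, scores)
nab = TSADMetric.NABScore.value(labels, scores)
mttd = TSADMetric.MeanTimeToDetect.value(labels, scores)
```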
- class merlion.evaluate.anomaly.TSADEvaluatorConfig(max_early_sec=None, max_delay_sec=None, **kwargs)
Bases:
EvaluatorConfig
Configuration class for a TSADEvaluator.
- Parameters
  - max_early_sec (Optional[float]) – the maximum number of seconds we allow an anomaly to be detected early.
  - max_delay_sec (Optional[float]) – if an anomaly is detected more than this many seconds after its start, it is not counted as being detected.
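A brief construction sketch, assuming the inherited keyword arguments (retrain_freq, cadence) accept seconds as floats:

```python
from merlion.evaluate.anomaly import TSADEvaluatorConfig

# Hedged sketch: count a detection only if it occurs within one hour of the anomaly's
# start, re-train daily, and score every timestamp (cadence=0).
config = TSADEvaluatorConfig(max_delay_sec=3600, retrain_freq=86400, cadence=0)
```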
- class merlion.evaluate.anomaly.TSADEvaluator(model, config)
Bases:
EvaluatorBase
Simulates the live deployment of an anomaly detection model.
- Parameters
model – the model to evaluate.
config – the evaluation configuration.
- config_class
alias of
TSADEvaluatorConfig
- property max_early_sec
- property max_delay_sec
- default_retrain_kwargs()
- Return type
dict
- get_predict(train_vals, test_vals, exog_data=None, train_kwargs=None, retrain_kwargs=None, post_process=True)
Initialize the model by training it on an initial set of train data. Simulate real-time anomaly detection by the model, while re-training it at the desired frequency.
- Parameters
  - train_vals (TimeSeries) – initial training data
  - test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare it to the ground truth
  - exog_data (Optional[TimeSeries]) – any exogenous data (only used for some models)
  - train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process. Typically, you will want to provide the key "anomaly_labels" here if you have training data with labeled anomalies, as well as the key "post_rule_train_config" if you want to use a custom training config for the model’s post-rule.
  - retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings. Typically, you will not supply this argument.
  - post_process – whether to apply the model’s post-rule on the returned results.
- Return type
Tuple[TimeSeries, TimeSeries]
- Returns
(train_result, result). train_result is a TimeSeries of the model’s anomaly scores on train_vals. result is a TimeSeries of the model’s anomaly scores on test_vals.
- evaluate(ground_truth, predict, metric=None)
- Parameters
  - ground_truth (TimeSeries) – TimeSeries of ground truth anomaly labels
  - predict (TimeSeries) – TimeSeries of predicted anomaly scores
  - metric (Optional[TSADMetric]) – the TSADMetric we wish to evaluate.
- Return type
Union[TSADScoreAccumulator, float]
- Returns
the value of the evaluation metric, if one is given. A TSADScoreAccumulator otherwise.
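Putting the pieces together, a hedged end-to-end sketch: `model` is a placeholder for an untrained Merlion anomaly detector, and `train_ts`, `test_ts`, `train_labels`, and `test_labels` are placeholders for your own TimeSeries data; durations are assumed to be in seconds.

```python
from merlion.evaluate.anomaly import TSADEvaluator, TSADEvaluatorConfig, TSADMetric

# Hedged sketch: simulate live deployment, re-training weekly and allowing detections
# up to one hour after each anomaly starts.
evaluator = TSADEvaluator(
    model=model,
    config=TSADEvaluatorConfig(max_delay_sec=3600, retrain_freq=7 * 86400),
)
train_scores, test_scores = evaluator.get_predict(
    train_vals=train_ts,
    test_vals=test_ts,
    train_kwargs={"anomaly_labels": train_labels},  # optional labeled anomalies
)
print(evaluator.evaluate(ground_truth=test_labels, predict=test_scores, metric=TSADMetric.F1))
print(evaluator.evaluate(ground_truth=test_labels, predict=test_scores))  # TSADScoreAccumulator
```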
merlion.evaluate.forecast
Metrics and utilities for evaluating forecasting models in a continuous sense.
- class merlion.evaluate.forecast.ForecastScoreAccumulator(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, target_seq_index=None)
Bases:
object
Accumulator which maintains summary statistics describing a forecasting algorithm’s performance. Can be used to compute many different forecasting metrics.
- Parameters
  - ground_truth (Union[UnivariateTimeSeries, TimeSeries]) – ground truth time series
  - predict (Union[UnivariateTimeSeries, TimeSeries]) – predicted time series
  - insample (optional) – time series used for training the model. This value is used for computing MASE and MSIS.
  - periodicity (optional) – periodicity m. m=1 indicates a non-seasonal time series, whereas m>1 indicates a seasonal time series. This value is used for computing MASE and MSIS.
  - ub (optional) – upper bound of the 95% prediction interval. This value is used for computing MSIS.
  - lb (optional) – lower bound of the 95% prediction interval. This value is used for computing MSIS.
  - target_seq_index (optional) – the index of the target sequence, for multivariate time series.
- check_before_eval()
- mae()
Mean Absolute Error (MAE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[\frac{1}{T}\sum_{t=1}^T{(|y_t - \hat{y}_t|)}.\]
- marre()
Mean Absolute Ranged Relative Error (MARRE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
- rmse()
Root Mean Squared Error (RMSE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[\sqrt{\frac{1}{T}\sum_{t=1}^T{(y_t - \hat{y}_t)^2}}.\]
- smape()
symmetric Mean Absolute Percentage Error (sMAPE). For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[200 \cdot \frac{1}{T} \sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
- rmspe()
Root Mean Squared Percent Error (RMSPE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[100 \cdot \sqrt{\frac{1}{T}\sum_{t=1}^T\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}.\]
- mase()
Mean Absolute Scaled Error (MASE). For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), with in-sample time series \(x\) of length \(N\) and periodicity \(m\), it is computed as
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
- msis()
Mean Scaled Interval Score (MSIS). This metric evaluates the quality of 95% prediction intervals. For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), with lower and upper prediction interval bounds \(L\) and \(U\), and in-sample time series \(x\) of length \(N\) with periodicity \(m\), it is computed as
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t<L_t] + 100\cdot(y_t - U_t)[y_t > U_t]}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
- merlion.evaluate.forecast.accumulate_forecast_score(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, metric=None, target_seq_index=None)
- Return type
Union[ForecastScoreAccumulator, float]
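A hedged usage sketch: `y_true` and `y_pred` are assumed to be TimeSeries over the same timestamps.

```python
from merlion.evaluate.forecast import accumulate_forecast_score, ForecastScoreAccumulator

# Hedged sketch: omit `metric` to get the accumulator back and compute several metrics.
acc = accumulate_forecast_score(ground_truth=y_true, predict=y_pred)
print(acc.rmse(), acc.smape())

# Passing a metric function returns the float directly.
mae = accumulate_forecast_score(ground_truth=y_true, predict=y_pred,
                                metric=ForecastScoreAccumulator.mae)
```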
- class merlion.evaluate.forecast.ForecastMetric(value)
Bases:
Enum
Enumeration of evaluation metrics for time series forecasting. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predict, **kwargs). Here, ground_truth is the original time series, and predict is the result returned by a ForecastEvaluator.
- MAE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mae>)
Mean Absolute Error (MAE) is formulated as:
\[\frac{1}{T}\sum_{t=1}^T{(|y_t - \hat{y}_t|)}.\]
- MARRE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.marre>)
Mean Absolute Ranged Relative Error (MARRE) is formulated as:
\[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
- RMSE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmse>)
Root Mean Squared Error (RMSE) is formulated as:
\[\sqrt{\frac{1}{T}\sum_{t=1}^T{(y_t - \hat{y}_t)^2}}.\]
- sMAPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.smape>)
symmetric Mean Absolute Percentage Error (sMAPE) is formulated as:
\[200 \cdot \frac{1}{T}\sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
- RMSPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmspe>)
Root Mean Squared Percent Error (RMSPE) is formulated as:
\[100 \cdot \sqrt{\frac{1}{T}\sum_{t=1}^T\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}.\]
- MASE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mase>)
Mean Absolute Scaled Error (MASE) is formulated as:
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
- MSIS = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.msis>)
Mean Scaled Interval Score (MSIS) is formulated as:
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t<L_t] + 100\cdot(y_t - U_t)[y_t > U_t]}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
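As with TSADMetric, each enum value is callable. A hedged sketch, where `y_true` is the original test TimeSeries, `y_pred` is a forecast aligned with it, and `train_ts`, `upper`, and `lower` are placeholder in-sample data and prediction-interval bounds:

```python
from merlion.evaluate.forecast import ForecastMetric

# Hedged sketch: evaluate point forecasts and 95% prediction intervals.
smape = ForecastMetric.sMAPE.value(ground_truth=y_true, predict=y_pred)
msis = ForecastMetric.MSIS.value(ground_truth=y_true, predict=y_pred,
                                 insample=train_ts, periodicity=24, ub=upper, lb=lower)
```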
- class merlion.evaluate.forecast.ForecastEvaluatorConfig(horizon=None, **kwargs)
Bases:
EvaluatorConfig
Configuration class for a ForecastEvaluator
- Parameters
  - horizon (Optional[float]) – the model’s prediction horizon. Whenever the model makes a prediction, it will predict horizon seconds into the future.
- property horizon: Optional[Union[Timedelta, DateOffset]]
- Returns
the horizon our model is predicting into the future. Defaults to the retraining frequency.
- property cadence: Optional[Union[Timedelta, DateOffset]]
- Returns
the cadence at which we are having our model produce new predictions. Defaults to the predictive horizon if there is one, and the retraining frequency otherwise.
- class merlion.evaluate.forecast.ForecastEvaluator(model, config)
Bases:
EvaluatorBase
Simulates the live deployment of a forecasting model.
- Parameters
model – the model to evaluate.
config – the evaluation configuration.
- config_class
alias of
ForecastEvaluatorConfig
- property horizon
- property cadence
- evaluate(ground_truth, predict, metric=ForecastMetric.sMAPE)
- Parameters
  - ground_truth (TimeSeries) – the series of test data
  - predict (Union[TimeSeries, List[TimeSeries]]) – the series of predicted values
  - metric (ForecastMetric) – the evaluation metric.
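Finally, a hedged end-to-end forecasting sketch: `model` stands in for an untrained Merlion forecaster, `train_ts` and `test_ts` for your own TimeSeries, and durations are assumed to be in seconds (forecast 24 hours ahead, re-train weekly).

```python
from merlion.evaluate.forecast import ForecastEvaluator, ForecastEvaluatorConfig, ForecastMetric

# Hedged sketch: simulate live deployment of a forecaster and score its rolling forecasts.
evaluator = ForecastEvaluator(
    model=model,
    config=ForecastEvaluatorConfig(horizon=24 * 3600, retrain_freq=7 * 86400),
)
train_pred, test_pred = evaluator.get_predict(train_vals=train_ts, test_vals=test_ts)
smape = evaluator.evaluate(ground_truth=test_ts, predict=test_pred, metric=ForecastMetric.sMAPE)
```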