merlion.evaluate package

This sub-package implements utilities and metrics for evaluating the performance of time series models on different tasks.

base

Base class for an automated model evaluation framework.

anomaly

Metrics and utilities for evaluating time series anomaly detection models.

forecast

Metrics and utilities for evaluating forecasting models in a continuous sense.

merlion.evaluate.base

Base class for an automated model evaluation framework.

class merlion.evaluate.base.EvaluatorConfig(train_window=None, retrain_freq=None, cadence=None)

Bases: object

Abstract class which defines an evaluator config.

Parameters
  • train_window (Optional[float]) – the maximum duration of data we would like to train the model on. None means no limit.

  • retrain_freq (Optional[float]) – the frequency at which we want to re-train the model. None means we only train the model once on the initial training data.

  • cadence (Optional[float]) – the frequency at which we want to obtain predictions from the model. None means that we obtain a new prediction at the same frequency as the model’s predictive horizon. 0 means that we obtain a new prediction at every timestamp.

property train_window: Optional[Union[Timedelta, DateOffset]]
Returns

the maximum duration of data we would like to train the model on. None means no limit.

property retrain_freq: Optional[Union[Timedelta, DateOffset]]
Returns

the frequency at which we want to re-train the model. None means we only train the model on the initial training data.

property cadence: Union[Timedelta, DateOffset]
Returns

the cadence at which we are having our model produce new predictions. Defaults to the retraining frequency if not explicitly provided.

property horizon: DateOffset
Returns

the horizon our model is predicting into the future. Equal to the prediction cadence by default.

to_dict()
class merlion.evaluate.base.EvaluatorBase(model, config)

Bases: object

An evaluator simulates the live deployment of a model on historical data. It trains a model on an initial time series, and then re-trains that model at a specified frequency.

The EvaluatorBase.get_predict method returns the train & test predictions of a model, as if it were being trained incrementally on the test data in the manner described above.

Subclasses define slightly different protocols for different tasks, e.g. anomaly detection vs. forecasting.
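The train-then-retrain protocol described above can be sketched in a few lines. This is an illustrative toy, not Merlion's implementation: the "model" is just the mean of its training data, timestamps are plain seconds, and the names fit_mean and simulate_deployment are hypothetical.

```python
from statistics import mean


def fit_mean(history, train_window=None):
    """Fit a toy 'model': the mean of the values inside the training window."""
    if train_window is not None:
        cutoff = history[-1][0] - train_window
        history = [(t, y) for t, y in history if t > cutoff]
    return mean(y for _, y in history)


def simulate_deployment(train, test, retrain_freq=None, train_window=None):
    """Replay `test` in time order, retraining every `retrain_freq` seconds.

    `train` and `test` are lists of (seconds, value) pairs.
    Returns one (seconds, prediction) pair per test point.
    """
    history = list(train)
    model = fit_mean(history, train_window)
    last_trained = history[-1][0]
    predictions = []
    for t, y in test:
        if retrain_freq is not None and t - last_trained >= retrain_freq:
            model = fit_mean(history, train_window)  # only data seen so far
            last_trained = t
        predictions.append((t, model))
        history.append((t, y))  # ground truth becomes available after predicting
    return predictions


train = [(0, 1.0), (1, 1.0), (2, 1.0)]
test = [(3, 5.0), (4, 5.0), (5, 5.0), (6, 5.0)]
once = simulate_deployment(train, test)                     # retrain_freq=None
rolling = simulate_deployment(train, test, retrain_freq=2)  # retrain every 2 s
```

With retrain_freq=None the model never sees the shift in the test data. With retrain_freq=2 its predictions drift toward the new level as test points are absorbed into the training history.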

Parameters
  • model – the model to evaluate.

  • config – the evaluation configuration.

config_class

alias of EvaluatorConfig

property train_window
property retrain_freq
property cadence
property horizon
default_train_kwargs()
Return type

dict

default_retrain_kwargs()
Return type

dict

get_predict(train_vals, test_vals, exog_data=None, train_kwargs=None, retrain_kwargs=None)

Initialize the model by training it on an initial set of train data. Get the model’s predictions on the test data, retraining the model as appropriate.

Parameters
  • train_vals (TimeSeries) – initial training data

  • test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare it to the ground truth

  • exog_data (Optional[TimeSeries]) – any exogenous data (only used for some models)

  • train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process

  • retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings

Return type

Tuple[Any, Union[TimeSeries, List[TimeSeries]]]

Returns

(train_result, result). train_result is the output of training the model on train_vals (None if pretrained is True). result is the model’s predictions on test_vals, and is specific to each evaluation task.

abstract evaluate(ground_truth, predict, metric)

Given the ground truth time series & the model’s prediction (as produced by EvaluatorBase.get_predict), compute the specified evaluation metric. If no metric is specified, return the appropriate score accumulator for the task. Implementation is task-specific.

merlion.evaluate.anomaly

Metrics and utilities for evaluating time series anomaly detection models.

class merlion.evaluate.anomaly.ScoreType(value)

Bases: Enum

The algorithm to use to compute true/false positives/negatives. See the technical report for more details on each score type. Merlion’s preferred default is revised point-adjusted.

Pointwise = 0
PointAdjusted = 1
RevisedPointAdjusted = 2
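As a rough illustration of how the three score types differ, the sketch below counts true positives, false positives and false negatives on binary label and detection lists. It assumes the definitions commonly given for these schemes: pointwise counts every timestamp, point-adjusted credits a whole anomaly window once any point in it is detected, and revised point-adjusted counts one true positive or false negative per window. The helper names are hypothetical, and the early/delay tolerances used by the library are ignored.

```python
def segments(labels):
    """Return [start, end) index pairs for each run of consecutive 1s."""
    segs, start = [], None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs


def count_tp_fp_fn(labels, detections, mode):
    """Count (tp, fp, fn) under one of three counting schemes."""
    # False positives are counted pointwise in all three schemes.
    fp = sum(1 for l, d in zip(labels, detections) if d and not l)
    tp = fn = 0
    for s, e in segments(labels):
        hits = sum(1 for d in detections[s:e] if d)
        length = e - s
        if mode == "pointwise":
            tp, fn = tp + hits, fn + (length - hits)
        elif mode == "point_adjusted":
            tp, fn = (tp + length, fn) if hits else (tp, fn + length)
        elif mode == "revised_point_adjusted":
            tp, fn = (tp + 1, fn) if hits else (tp, fn + 1)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return tp, fp, fn


labels =     [0, 1, 1, 1, 0, 0, 1, 1, 0]
detections = [0, 0, 1, 0, 1, 0, 0, 0, 0]
pointwise = count_tp_fp_fn(labels, detections, "pointwise")
adjusted = count_tp_fp_fn(labels, detections, "point_adjusted")
revised = count_tp_fp_fn(labels, detections, "revised_point_adjusted")
```

On this input the first anomaly window is hit once and the second is missed. Point-adjusted scoring inflates the true positive count to the full window length, while the revised variant counts each window exactly once.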
class merlion.evaluate.anomaly.TSADScoreAccumulator(num_tp_anom=0, num_tp_pointwise=0, num_tp_point_adj=0, num_fn_anom=0, num_fn_pointwise=0, num_fn_point_adj=0, num_fp=0, num_tn=0, tp_score=0.0, fp_score=0.0, tp_detection_delays=None, tp_anom_durations=None, anom_durations=None)

Bases: object

Accumulator which maintains summary statistics describing an anomaly detection algorithm’s performance. Can be used to compute many different time series anomaly detection metrics.

precision(score_type=ScoreType.RevisedPointAdjusted)
recall(score_type=ScoreType.RevisedPointAdjusted)
f1(score_type=ScoreType.RevisedPointAdjusted)
f_beta(score_type=ScoreType.RevisedPointAdjusted, beta=1.0)
mean_time_to_detect()
mean_detected_anomaly_duration()
mean_anomaly_duration()
nab_score(tp_weight=1.0, fp_weight=0.11, fn_weight=1.0, tn_weight=0.0)

Computes the NAB score, given the accumulated performance metrics and the specified weights for different types of errors. The score is described in section II.C of https://arxiv.org/pdf/1510.03336.pdf. At a high level, this score is a cost-sensitive, recency-weighted accuracy measure for time series anomaly detection.

NAB uses the following profiles for benchmarking (https://github.com/numenta/NAB/blob/master/config/profiles.json):

  • standard (default) - tp_weight = 1.0, fp_weight = 0.11, fn_weight = 1.0

  • reward low false positive rate - tp_weight = 1.0, fp_weight = 0.22, fn_weight = 1.0

  • reward low false negative rate - tp_weight = 1.0, fp_weight = 0.11, fn_weight = 2.0

Note that tn_weight is ignored.

Parameters
  • tp_weight – relative weight of true positives.

  • fp_weight – relative weight of false positives.

  • fn_weight – relative weight of false negatives.

  • tn_weight – relative weight of true negatives. Ignored, but included for completeness.

Returns

NAB score

merlion.evaluate.anomaly.accumulate_tsad_score(ground_truth, predict, max_early_sec=None, max_delay_sec=None, metric=None)

Computes the components required to compute multiple different types of performance metrics for time series anomaly detection.

Parameters
  • ground_truth (Union[TimeSeries, UnivariateTimeSeries]) – A time series indicating whether each time step corresponds to an anomaly.

  • predict (Union[TimeSeries, UnivariateTimeSeries]) – A time series with the anomaly score predicted for each time step. Detections correspond to nonzero scores.

  • max_early_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur before the actual incidence. If None, no early detections are allowed. Note that None is the same as 0.

  • max_delay_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur after the start of the actual incident (but before the end of the actual incident). If None, we allow any detection during the duration of the incident. Note that None differs from 0 because 0 means that we only permit detections that are early or exactly on time!

  • metric – A function which takes a TSADScoreAccumulator as input and returns a float. The TSADScoreAccumulator object is returned if metric is None.

Return type

Union[TSADScoreAccumulator, float]
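The interaction between max_early_sec and max_delay_sec can be sketched as a single window check. This is a simplified reading of the parameter descriptions above, not the library's code: the function name is hypothetical, all times are in seconds, and treating the end of the incident as inclusive is an assumption.

```python
def detection_counts(t_detect, start, end, max_early_sec=None, max_delay_sec=None):
    """Return True if a detection at `t_detect` is credited to the anomaly [start, end]."""
    # None behaves like 0: no early detections are allowed.
    early = 0 if max_early_sec is None else max_early_sec
    earliest = start - early
    # None allows any detection during the incident; otherwise cap the delay,
    # but never extend past the end of the incident.
    latest = end if max_delay_sec is None else min(end, start + max_delay_sec)
    return earliest <= t_detect <= latest


start, end = 100, 200
during = detection_counts(150, start, end)                     # inside the incident
early = detection_counts(90, start, end)                       # too early by default
early_ok = detection_counts(90, start, end, max_early_sec=20)  # within tolerance
late = detection_counts(150, start, end, max_delay_sec=0)      # delay of 50 s rejected
on_time = detection_counts(100, start, end, max_delay_sec=0)   # exactly on time
```

The last two calls show why max_delay_sec=0 is stricter than None: only detections at or before the start of the incident are accepted.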

class merlion.evaluate.anomaly.TSADMetric(value)

Bases: Enum

Enumeration of evaluation metrics for time series anomaly detection. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predict, **kwargs).

MeanTimeToDetect = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.mean_time_to_detect>)
F1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
Precision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
Recall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
PointwiseF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.Pointwise: 0>))
PointwisePrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.Pointwise: 0>))
PointwiseRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.Pointwise: 0>))
PointAdjustedF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.PointAdjusted: 1>))
PointAdjustedPrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.PointAdjusted: 1>))
PointAdjustedRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.PointAdjusted: 1>))
NABScore = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.nab_score>)
NABScoreLowFN = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fn_weight=2.0))
NABScoreLowFP = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fp_weight=0.22))
F2 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=2.0))
F5 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=5.0))
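F1, F2 and F5 above differ only in the beta passed to f_beta. As a reminder of what beta does, here is a sketch of the standard F-beta formula computed from raw counts. The function below is a hypothetical stand-alone helper, not TSADScoreAccumulator.f_beta, and returning 0.0 when a denominator is zero is an assumption.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta score from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    # beta > 1 weights recall more heavily than precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# precision = 0.75, recall = 0.6
f1 = f_beta(tp=3, fp=1, fn=2, beta=1.0)
f2 = f_beta(tp=3, fp=1, fn=2, beta=2.0)
f5 = f_beta(tp=3, fp=1, fn=2, beta=5.0)
```

Because recall is lower than precision in this example, the score moves toward the recall value as beta grows.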
class merlion.evaluate.anomaly.TSADEvaluatorConfig(max_early_sec=None, max_delay_sec=None, **kwargs)

Bases: EvaluatorConfig

Configuration class for a TSADEvaluator.

Parameters
  • max_early_sec (Optional[float]) – the maximum number of seconds we allow an anomaly to be detected early.

  • max_delay_sec (Optional[float]) – if an anomaly is detected more than this many seconds after its start, it is not counted as being detected.

class merlion.evaluate.anomaly.TSADEvaluator(model, config)

Bases: EvaluatorBase

Simulates the live deployment of an anomaly detection model.

Parameters
  • model – the model to evaluate.

  • config – the evaluation configuration.

config_class

alias of TSADEvaluatorConfig

property max_early_sec
property max_delay_sec
default_retrain_kwargs()
Return type

dict

get_predict(train_vals, test_vals, exog_data=None, train_kwargs=None, retrain_kwargs=None, post_process=True)

Initialize the model by training it on an initial set of train data. Simulate real-time anomaly detection by the model, while re-training it at the desired frequency.

Parameters
  • train_vals (TimeSeries) – initial training data

  • test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare it to the ground truth

  • exog_data (Optional[TimeSeries]) – any exogenous data (only used for some models)

  • train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process. Typically, you will want to provide the key “anomaly_labels” here, if you have training data with labeled anomalies, as well as the key “post_rule_train_config”, if you want to use a custom training config for the model’s post-rule.

  • retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings. Typically, you will not supply this argument.

  • post_process – whether to apply the model’s post-rule on the returned results.

Return type

Tuple[TimeSeries, TimeSeries]

Returns

(train_result, result). train_result is a TimeSeries of the model’s anomaly scores on train_vals. result is a TimeSeries of the model’s anomaly scores on test_vals.

evaluate(ground_truth, predict, metric=None)
Parameters
  • ground_truth (TimeSeries) – TimeSeries of ground truth anomaly labels

  • predict (TimeSeries) – TimeSeries of predicted anomaly scores

  • metric (Optional[TSADMetric]) – the TSADMetric we wish to evaluate.

Return type

Union[TSADScoreAccumulator, float]

Returns

the value of the evaluation metric, if one is given. A TSADScoreAccumulator otherwise.

merlion.evaluate.forecast

Metrics and utilities for evaluating forecasting models in a continuous sense.

class merlion.evaluate.forecast.ForecastScoreAccumulator(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, target_seq_index=None)

Bases: object

Accumulator which maintains summary statistics describing a forecasting algorithm’s performance. Can be used to compute many different forecasting metrics.

Parameters
  • ground_truth (Union[UnivariateTimeSeries, TimeSeries]) – ground truth time series

  • predict (Union[UnivariateTimeSeries, TimeSeries]) – predicted time series

  • insample (optional) – time series used for training the model. This value is used for computing MASE and MSIS.

  • periodicity (optional) – periodicity m. m=1 indicates a non-seasonal time series, whereas m>1 indicates a seasonal time series. This value is used for computing MASE and MSIS.

  • ub (optional) – upper bound of the 95% prediction interval. This value is used for computing MSIS.

  • lb (optional) – lower bound of the 95% prediction interval. This value is used for computing MSIS.

  • target_seq_index (optional) – the index of the target sequence, for multivariate time series.

check_before_eval()
mae()

Mean Absolute Error (MAE)

For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as

\[\frac{1}{T}\sum_{t=1}^T{(|y_t - \hat{y}_t|)}.\]
marre()

Mean Absolute Ranged Relative Error (MARRE)

For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as

\[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
rmse()

Root Mean Squared Error (RMSE)

For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as

\[\sqrt{\frac{1}{T}\sum_{t=1}^T{(y_t - \hat{y}_t)^2}}.\]
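The three formulas above (MAE, MARRE, RMSE) translate directly into plain Python. The functions below are a minimal sketch on lists of floats and are not the ForecastScoreAccumulator methods themselves.

```python
import math


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def marre(y, y_hat):
    """Mean absolute error relative to the range of the ground truth, in percent."""
    value_range = max(y) - min(y)
    return 100 * sum(abs((a - b) / value_range) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


y = [10.0, 20.0, 30.0, 40.0]
y_hat = [12.0, 18.0, 33.0, 37.0]
mae_value = mae(y, y_hat)      # absolute errors are 2, 2, 3, 3
marre_value = marre(y, y_hat)  # same errors, scaled by the range 40 - 10 = 30
rmse_value = rmse(y, y_hat)    # squared errors are 4, 4, 9, 9
```

RMSE is never smaller than MAE on the same data, since squaring weights the larger errors more heavily.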
smape()

symmetric Mean Absolute Percentage Error (sMAPE). For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as

\[200 \cdot \frac{1}{T} \sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
rmspe()

Root Mean Squared Percent Error (RMSPE)

For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as

\[100 \cdot \sqrt{\frac{1}{T}\sum_{t=1}^T\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}.\]
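The two percentage-based metrics, sMAPE and RMSPE, can be sketched in the same way. In RMSPE the square applies to the whole relative error. These are illustrative stand-alone functions; they assume no ground truth value is zero and that no pair has both values equal to zero.

```python
import math


def smape(y, y_hat):
    """Symmetric mean absolute percentage error, on a 0 to 200 scale."""
    total = sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(y, y_hat))
    return 200 * total / len(y)


def rmspe(y, y_hat):
    """Root mean squared percent error: the relative error is squared as a whole."""
    total = sum(((a - b) / a) ** 2 for a, b in zip(y, y_hat))
    return 100 * math.sqrt(total / len(y))


y = [100.0, 200.0]
y_hat = [110.0, 180.0]
smape_value = smape(y, y_hat)
rmspe_value = rmspe(y, y_hat)  # both relative errors have magnitude 0.1
```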
mase()

Mean Absolute Scaled Error (MASE)

For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), and an in-sample (training) time series \(x\) of length \(N\) with periodicity \(m\), it is computed as

\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
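MASE divides the forecast MAE by the in-sample MAE of a seasonal naive forecast, which predicts each value with the value m steps earlier. A minimal sketch of the formula above, using a hypothetical stand-alone function on plain lists:

```python
def mase(y, y_hat, insample, periodicity=1):
    """Forecast MAE scaled by the in-sample MAE of a seasonal naive forecast."""
    m, n = periodicity, len(insample)
    # Denominator: mean absolute m-step difference over the training series.
    scale = sum(abs(insample[t] - insample[t - m]) for t in range(m, n)) / (n - m)
    forecast_mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)
    return forecast_mae / scale


insample = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [6.0, 5.0]
y_hat = [5.0, 7.0]
mase_m1 = mase(y, y_hat, insample, periodicity=1)  # scale = (2 + 1 + 3 + 1) / 4
mase_m2 = mase(y, y_hat, insample, periodicity=2)  # scale = (1 + 2 + 2) / 3
```

A value below 1 means the forecast beats the naive baseline's in-sample error on average.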
msis()

Mean Scaled Interval Score (MSIS)

This metric evaluates the quality of 95% prediction intervals. For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), with lower and upper prediction interval bounds \(L\) and \(U\), and an in-sample (training) time series \(x\) of length \(N\) with periodicity \(m\), it is computed as

\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} \left[ (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t < L_t] + 100 \cdot (y_t - U_t)[y_t > U_t] \right]}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
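The MSIS formula can be sketched as follows. The function is hypothetical and follows the formula as printed in this document, reading the sum as covering the interval width and both penalty terms. The penalty factor is exposed as a parameter defaulting to the printed value of 100; whether the library's implementation uses exactly this constant is not confirmed here.

```python
def msis(y, lb, ub, insample, periodicity=1, penalty=100.0):
    """Interval width plus penalties for misses, scaled by seasonal naive in-sample MAE."""
    m, n = periodicity, len(insample)
    scale = sum(abs(insample[t] - insample[t - m]) for t in range(m, n)) / (n - m)
    total = 0.0
    for y_t, l_t, u_t in zip(y, lb, ub):
        total += u_t - l_t                     # interval width
        if y_t < l_t:
            total += penalty * (l_t - y_t)     # truth fell below the interval
        if y_t > u_t:
            total += penalty * (y_t - u_t)     # truth fell above the interval
    return total / len(y) / scale


insample = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [10.0, 20.0, 30.0]
lb = [8.0, 21.0, 25.0]
ub = [12.0, 25.0, 29.0]
msis_value = msis(y, lb, ub, insample)  # per-step terms are 4, 104, 104
```

Narrow intervals are rewarded only as long as they still contain the ground truth; each miss is charged in proportion to how far outside the interval the true value lies.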
merlion.evaluate.forecast.accumulate_forecast_score(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, metric=None, target_seq_index=None)
Return type

Union[ForecastScoreAccumulator, float]

class merlion.evaluate.forecast.ForecastMetric(value)

Bases: Enum

Enumeration of evaluation metrics for time series forecasting. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predict, **kwargs). Here, ground_truth is the original time series, and predict is the result returned by a ForecastEvaluator.

MAE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mae>)

Mean Absolute Error (MAE) is formulated as:

\[\frac{1}{T}\sum_{t=1}^T{(|y_t - \hat{y}_t|)}.\]
MARRE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.marre>)

Mean Absolute Ranged Relative Error (MARRE) is formulated as:

\[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
RMSE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmse>)

Root Mean Squared Error (RMSE) is formulated as:

\[\sqrt{\frac{1}{T}\sum_{t=1}^T{(y_t - \hat{y}_t)^2}}.\]
sMAPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.smape>)

symmetric Mean Absolute Percentage Error (sMAPE) is formulated as:

\[200 \cdot \frac{1}{T}\sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
RMSPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmspe>)

Root Mean Squared Percent Error (RMSPE) is formulated as:

\[100 \cdot \sqrt{\frac{1}{T}\sum_{t=1}^T\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}.\]
MASE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mase>)

Mean Absolute Scaled Error (MASE) is formulated as:

\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
MSIS = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.msis>)

Mean Scaled Interval Score (MSIS) is formulated as:

\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} \left[ (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t < L_t] + 100 \cdot (y_t - U_t)[y_t > U_t] \right]}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
class merlion.evaluate.forecast.ForecastEvaluatorConfig(horizon=None, **kwargs)

Bases: EvaluatorConfig

Configuration class for a ForecastEvaluator

Parameters

horizon (Optional[float]) – the model’s prediction horizon. Whenever the model makes a prediction, it will predict horizon seconds into the future.

property horizon: Optional[Union[Timedelta, DateOffset]]
Returns

the horizon our model is predicting into the future. Defaults to the retraining frequency.

property cadence: Optional[Union[Timedelta, DateOffset]]
Returns

the cadence at which we are having our model produce new predictions. Defaults to the predictive horizon if there is one, and the retraining frequency otherwise.
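The fallback chain described for horizon and cadence can be written out explicitly. This sketch only mirrors the property descriptions above: the function name is hypothetical, values are plain seconds rather than Timedelta or DateOffset objects, and the library's actual resolution logic may differ in detail.

```python
def resolve_forecast_timing(horizon=None, cadence=None, retrain_freq=None):
    """Return (horizon, cadence) after applying the documented defaults."""
    # Horizon falls back to the retraining frequency.
    resolved_horizon = horizon if horizon is not None else retrain_freq
    # Cadence falls back to the horizon if one was given, else the retraining frequency.
    if cadence is not None:
        resolved_cadence = cadence  # 0 is a valid explicit cadence, so test against None
    elif horizon is not None:
        resolved_cadence = horizon
    else:
        resolved_cadence = retrain_freq
    return resolved_horizon, resolved_cadence


hourly = resolve_forecast_timing(horizon=3600, retrain_freq=86400)
daily = resolve_forecast_timing(retrain_freq=86400)
every_step = resolve_forecast_timing(horizon=3600, cadence=0)
```

Comparing against None instead of relying on truthiness matters here, because a cadence of 0 is meaningful: it requests a new prediction at every timestamp.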

class merlion.evaluate.forecast.ForecastEvaluator(model, config)

Bases: EvaluatorBase

Simulates the live deployment of a forecaster model.

Parameters
  • model – the model to evaluate.

  • config – the evaluation configuration.

config_class

alias of ForecastEvaluatorConfig

property horizon
property cadence
evaluate(ground_truth, predict, metric=ForecastMetric.sMAPE)
Parameters