merlion.evaluate package
This subpackage implements utilities and metrics for evaluating the performance of time series models on different tasks.
merlion.evaluate.base: Base class for an automated model evaluation framework.
merlion.evaluate.anomaly: Metrics and utilities for evaluating time series anomaly detection models.
merlion.evaluate.forecast: Metrics and utilities for evaluating forecasting models in a continuous sense.
Submodules
merlion.evaluate.base module
Base class for an automated model evaluation framework.
 class merlion.evaluate.base.EvaluatorConfig(train_window=None, retrain_freq=None, cadence=None)
Bases: object
Abstract class which defines an evaluator config.
 Parameters
train_window (Optional[float]) – the maximum duration of data we would like to train the model on. None means no limit.
retrain_freq (Optional[float]) – the frequency at which we want to retrain the model. None means we only train the model once on the initial training data.
cadence (Optional[float]) – the frequency at which we want to obtain predictions from the model. None means that we obtain a new prediction at the same frequency as the model’s predictive horizon (set by the alert condition). 0 means that we obtain a new prediction at every timestamp.
 property cadence
 Returns
the cadence (interval, in number of seconds) at which we are having our model produce new predictions. Defaults to the retraining frequency if not explicitly provided.
 property horizon
 Returns
the horizon (number of seconds) our model is predicting into the future. Equal to the prediction cadence by default.
 to_dict()
 class merlion.evaluate.base.EvaluatorBase(model, config)
Bases: object
An evaluator simulates the live deployment of a model on historical data. It trains a model on an initial time series, and then retrains that model at a specified frequency.
The EvaluatorBase.get_predict method returns the train & test predictions of a model, as if it were being trained incrementally on the test data in the manner described above.
Subclasses define slightly different protocols for different tasks, e.g. anomaly detection vs. forecasting.
 Parameters
model (ModelBase) – the model to evaluate.
config (EvaluatorConfig) – the evaluation configuration.
 config_class
alias of EvaluatorConfig
 property train_window
 property retrain_freq
 property cadence
 property horizon
 default_train_kwargs()
 Return type
dict
 default_retrain_kwargs()
 Return type
dict
 get_predict(train_vals, test_vals, train_kwargs=None, retrain_kwargs=None)
Initialize the model by training it on an initial set of train data. Get the model’s predictions on the test data, retraining the model as appropriate.
 Parameters
train_vals (TimeSeries) – initial training data.
test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare them to the ground truth.
train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process.
retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings.
 Return type
Tuple[Any, Union[TimeSeries, List[TimeSeries]]]
 Returns
(train_result, result). train_result is the output of training the model on train_vals. result is the model’s predictions on test_vals, and is specific to each evaluation task.
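The train-then-retrain protocol described for get_predict can be illustrated with a small self-contained sketch. It does not use Merlion: MeanModel, recent and simulate_deployment are hypothetical names, time series are plain lists of (timestamp, value) pairs, and the sketch assumes a strictly positive cadence (it does not model cadence=0).

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (timestamp in seconds, value)


class MeanModel:
    """Hypothetical toy model: always predicts the mean of its training data."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.n_fits = 0

    def train(self, data: List[Point]) -> None:
        self.mean = sum(v for _, v in data) / len(data)
        self.n_fits += 1

    def predict(self, stamps: List[float]) -> List[Point]:
        return [(t, self.mean) for t in stamps]


def recent(data: List[Point], train_window: Optional[float]) -> List[Point]:
    """Keep only the last `train_window` seconds of data (None keeps everything)."""
    if train_window is None:
        return data
    cutoff = data[-1][0] - train_window
    return [p for p in data if p[0] >= cutoff]


def simulate_deployment(
    model: MeanModel,
    train: List[Point],
    test: List[Point],
    cadence: float,
    retrain_freq: Optional[float] = None,
    train_window: Optional[float] = None,
) -> List[Point]:
    """Train once, then walk through the test data, retraining every `retrain_freq` seconds."""
    if cadence <= 0:
        raise ValueError("this sketch requires cadence > 0")
    history = list(train)
    model.train(recent(history, train_window))
    if not test:
        return []
    t, t_end = test[0][0], test[-1][0]
    next_retrain = None if retrain_freq is None else t + retrain_freq
    preds: List[Point] = []
    while t <= t_end:
        # Retrain on everything observed so far once the retraining time is reached.
        if next_retrain is not None and t >= next_retrain:
            model.train(recent(history, train_window))
            next_retrain += retrain_freq
        # Predict the next `cadence` seconds, then reveal the true values.
        chunk = [p for p in test if t <= p[0] < t + cadence]
        preds.extend(model.predict([ts for ts, _ in chunk]))
        history.extend(chunk)
        t += cadence
    return preds
```

With retrain_freq=None the model is trained only once, which matches the documented meaning of that setting.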
 abstract evaluate(ground_truth, predict, metric)
Given the ground truth time series & the model’s prediction (as produced by EvaluatorBase.get_predict), compute the specified evaluation metric. If no metric is specified, return the appropriate score accumulator for the task. Implementation is task-specific.
merlion.evaluate.anomaly module
Metrics and utilities for evaluating time series anomaly detection models.
 class merlion.evaluate.anomaly.ScoreType(value)
Bases: Enum
The algorithm to use to compute true/false positives/negatives. See the technical report for more details on each score type. Merlion’s preferred default is revised point-adjusted.
 Pointwise = 0
 PointAdjusted = 1
 RevisedPointAdjusted = 2
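As a rough illustration of how the score types differ, the sketch below counts true positives, false positives and false negatives in the pointwise way and in the commonly used point-adjusted way (if any point of an anomaly segment is flagged, the whole segment counts as detected). The function names are hypothetical, inputs are plain 0/1 lists, and Merlion's own implementation may differ in detail; the revised point-adjusted variant is not sketched here.

```python
from typing import List, Tuple


def anomaly_segments(labels: List[int]) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs for each contiguous run of 1s."""
    segs, start = [], None
    for i, lab in enumerate(labels):
        if lab and start is None:
            start = i
        elif not lab and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs


def pointwise_counts(labels: List[int], preds: List[int]) -> Tuple[int, int, int]:
    """Count (tp, fp, fn), treating every time step independently."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    return tp, fp, fn


def point_adjusted_counts(labels: List[int], preds: List[int]) -> Tuple[int, int, int]:
    """Count (tp, fp, fn) after marking every point of a detected segment as detected."""
    adjusted = list(preds)
    for start, end in anomaly_segments(labels):
        if any(preds[start:end]):
            adjusted[start:end] = [1] * (end - start)
    return pointwise_counts(labels, adjusted)
```

Point adjustment never lowers the true positive count, so point-adjusted recall is at least as high as pointwise recall on the same predictions.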
 class merlion.evaluate.anomaly.TSADScoreAccumulator(num_tp_anom=0, num_tp_pointwise=0, num_tp_point_adj=0, num_fn_anom=0, num_fn_pointwise=0, num_fn_point_adj=0, num_fp=0, num_tn=0, tp_score=0.0, fp_score=0.0, tp_detection_delays=None, tp_anom_durations=None, anom_durations=None)
Bases: object
Accumulator which maintains summary statistics describing an anomaly detection algorithm’s performance. Can be used to compute many different time series anomaly detection metrics.
 precision(score_type=ScoreType.RevisedPointAdjusted)
 recall(score_type=ScoreType.RevisedPointAdjusted)
 f1(score_type=ScoreType.RevisedPointAdjusted)
 f_beta(score_type=ScoreType.RevisedPointAdjusted, beta=1.0)
 mean_time_to_detect()
 mean_detected_anomaly_duration()
 mean_anomaly_duration()
 nab_score(tp_weight=1.0, fp_weight=0.11, fn_weight=1.0, tn_weight=0.0)
Computes the NAB score, given the accumulated performance metrics and the specified weights for different types of errors. The score is described in section II.C of https://arxiv.org/pdf/1510.03336.pdf. At a high level, this score is a cost-sensitive, recency-weighted accuracy measure for time series anomaly detection.
NAB uses the following profiles for benchmarking (https://github.com/numenta/NAB/blob/master/config/profiles.json):
standard (default): tp_weight = 1.0, fp_weight = 0.11, fn_weight = 1.0
reward low false positive rate: tp_weight = 1.0, fp_weight = 0.22, fn_weight = 1.0
reward low false negative rate: tp_weight = 1.0, fp_weight = 0.11, fn_weight = 2.0
Note that tn_weight is ignored.
 Parameters
tp_weight – relative weight of true positives.
fp_weight – relative weight of false positives.
fn_weight – relative weight of false negatives.
tn_weight – relative weight of true negatives. Ignored, but included for completeness.
 Returns
NAB score
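The three profiles listed above can be kept as keyword-argument dictionaries whose keys match the nab_score parameters. The sketch below also shows the normalization used in the NAB paper, 100 * (raw - null) / (perfect - null), where null is the raw score of a detector that never fires and perfect is the raw score of an ideal detector. NAB_PROFILES and normalize_nab are hypothetical names, and whether nab_score applies exactly this normalization internally is an assumption.

```python
# Weight profiles copied from the list above; the dictionary name is hypothetical.
NAB_PROFILES = {
    "standard": {"tp_weight": 1.0, "fp_weight": 0.11, "fn_weight": 1.0},
    "reward_low_fp": {"tp_weight": 1.0, "fp_weight": 0.22, "fn_weight": 1.0},
    "reward_low_fn": {"tp_weight": 1.0, "fp_weight": 0.11, "fn_weight": 2.0},
}


def normalize_nab(raw: float, null: float, perfect: float) -> float:
    """Rescale a raw score so the null detector maps to 0 and a perfect one to 100."""
    if perfect == null:
        raise ValueError("perfect and null scores must differ")
    return 100.0 * (raw - null) / (perfect - null)
```

Given an accumulator acc, a call such as acc.nab_score(**NAB_PROFILES["reward_low_fp"]) would select the profile that penalizes false positives more heavily.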
 merlion.evaluate.anomaly.accumulate_tsad_score(ground_truth, predict, max_early_sec=None, max_delay_sec=None, metric=None)
Computes the components required to compute multiple different types of performance metrics for time series anomaly detection.
 Parameters
ground_truth (TimeSeries) – A time series indicating whether each time step corresponds to an anomaly.
predict (TimeSeries) – A time series with the anomaly score predicted for each time step. Detections correspond to nonzero scores.
max_early_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur before the actual incident. If None, no early detections are allowed. Note that None is the same as 0.
max_delay_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur after the start of the actual incident (but before the end of the actual incident). If None, we allow any detection during the duration of the incident. Note that None differs from 0, because 0 means that we only permit detections that are early or exactly on time.
metric – A function which takes a TSADScoreAccumulator as input and returns a float. The TSADScoreAccumulator object is returned if metric is None.
 Return type
Union[TSADScoreAccumulator, float]
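The meaning of max_early_sec and max_delay_sec can be summarized as a window check on the detection time. The helper below is a hypothetical sketch of the rules stated in the parameter descriptions, not Merlion's implementation; all times are in seconds.

```python
from typing import Optional


def detection_counts(
    t_detect: float,
    t_start: float,
    t_end: float,
    max_early_sec: Optional[float] = None,
    max_delay_sec: Optional[float] = None,
) -> bool:
    """Return True if a detection at t_detect is credited to the anomaly [t_start, t_end]."""
    # None is the same as 0 for early detections: nothing before t_start is accepted.
    early = 0.0 if max_early_sec is None else max_early_sec
    earliest = t_start - early
    # None allows any detection before the anomaly ends; a number caps the delay.
    latest = t_end if max_delay_sec is None else min(t_end, t_start + max_delay_sec)
    return earliest <= t_detect <= latest
```

For an anomaly spanning 100 to 200 seconds, max_delay_sec=0 accepts a detection at 100 but rejects one at 150, while max_delay_sec=None accepts both.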
 class merlion.evaluate.anomaly.TSADMetric(value)
Bases: Enum
Enumeration of evaluation metrics for time series anomaly detection. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predicted, **kwargs).
 MeanTimeToDetect = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.mean_time_to_detect>)
 F1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
 Precision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
 Recall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
 PointwiseF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.Pointwise: 0>))
 PointwisePrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.Pointwise: 0>))
 PointwiseRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.Pointwise: 0>))
 PointAdjustedF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.PointAdjusted: 1>))
 PointAdjustedPrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.PointAdjusted: 1>))
 PointAdjustedRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.PointAdjusted: 1>))
 NABScore = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.nab_score>)
 NABScoreLowFN = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fn_weight=2.0))
 NABScoreLowFP = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fp_weight=0.22))
 F2 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=2.0))
 F5 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=5.0))
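Each value above is built by nesting functools.partial: the outer partial fixes the metric argument of the accumulation function, and the inner one fixes options such as beta. The toy sketch below reproduces that pattern with a plain dictionary instead of an Enum. ToyAccumulator, accumulate and METRICS are hypothetical stand-ins that use pointwise counts only.

```python
from functools import partial
from typing import Callable, List, Optional, Union


class ToyAccumulator:
    """Hypothetical stand-in holding pointwise confusion counts."""

    def __init__(self, tp: int, fp: int, fn: int) -> None:
        self.tp, self.fp, self.fn = tp, fp, fn

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    def f_beta(self, beta: float = 1.0) -> float:
        p, r = self.precision(), self.recall()
        denom = beta ** 2 * p + r
        return (1 + beta ** 2) * p * r / denom if denom else 0.0


def accumulate(
    ground_truth: List[int],
    predict: List[int],
    metric: Optional[Callable[[ToyAccumulator], float]] = None,
) -> Union[ToyAccumulator, float]:
    """Build the accumulator; return it if metric is None, else the metric's value."""
    pairs = list(zip(ground_truth, predict))
    acc = ToyAccumulator(
        tp=sum(1 for g, p in pairs if g and p),
        fp=sum(1 for g, p in pairs if not g and p),
        fn=sum(1 for g, p in pairs if g and not p),
    )
    return acc if metric is None else metric(acc)


# Name -> partial function of form f(ground_truth, predict).
METRICS = {
    "F1": partial(accumulate, metric=ToyAccumulator.f_beta),
    "F2": partial(accumulate, metric=partial(ToyAccumulator.f_beta, beta=2.0)),
    "Recall": partial(accumulate, metric=ToyAccumulator.recall),
}
```

Calling METRICS["F2"](ground_truth, predict) then behaves like any other two-argument metric function, which is what lets an evaluator treat all metrics uniformly.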
 class merlion.evaluate.anomaly.TSADEvaluatorConfig(max_early_sec=None, max_delay_sec=None, **kwargs)
Bases: EvaluatorConfig
Configuration class for a TSADEvaluator.
 Parameters
max_early_sec (Optional[float]) – the maximum number of seconds we allow an anomaly to be detected early.
max_delay_sec (Optional[float]) – if an anomaly is detected more than this many seconds after its start, it is not counted as being detected.
 class merlion.evaluate.anomaly.TSADEvaluator(model, config)
Bases: EvaluatorBase
Simulates the live deployment of an anomaly detection model.
 Parameters
model – the model to evaluate.
config – the evaluation configuration.
 config_class
alias of TSADEvaluatorConfig
 property max_early_sec
 property max_delay_sec
 default_retrain_kwargs()
 Return type
dict
 get_predict(train_vals, test_vals, train_kwargs=None, retrain_kwargs=None, post_process=True)
Initialize the model by training it on an initial set of train data. Simulate real-time anomaly detection by the model, while retraining it at the desired frequency.
 Parameters
train_vals (TimeSeries) – initial training data.
test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare them to the ground truth.
train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process. Typically, you will want to provide the key “anomaly_labels” here, if you have training data with labeled anomalies, as well as the key “post_rule_train_config”, if you want to use a custom training config for the model’s post-rule.
retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings. Typically, you will not supply this argument.
post_process – whether to apply the model’s post-rule on the returned results.
 Return type
Tuple[TimeSeries, TimeSeries]
 Returns
(train_result, result). train_result is a TimeSeries of the model’s anomaly scores on train_vals. result is a TimeSeries of the model’s anomaly scores on test_vals.
 evaluate(ground_truth, predict, metric=None)
 Parameters
ground_truth (TimeSeries) – TimeSeries of ground truth anomaly labels.
predict (TimeSeries) – TimeSeries of predicted anomaly scores.
metric (Optional[TSADMetric]) – the TSADMetric we wish to evaluate.
 Return type
Union[TSADScoreAccumulator, float]
 Returns
the value of the evaluation metric, if one is given. A TSADScoreAccumulator otherwise.
merlion.evaluate.forecast module
Metrics and utilities for evaluating forecasting models in a continuous sense.
 class merlion.evaluate.forecast.ForecastScoreAccumulator(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None)
Bases: object
Accumulator which maintains summary statistics describing a forecasting algorithm’s performance. Can be used to compute many different forecasting metrics.
 Parameters
ground_truth (TimeSeries) – ground truth time series.
predict (TimeSeries) – predicted time series.
insample (optional) – time series used for training the model. This value is used for computing MASE and MSIS.
periodicity (optional) – periodicity m. m=1 indicates a non-seasonal time series, whereas m>1 indicates a seasonal time series. This value is used for computing MASE and MSIS.
ub (optional) – upper bound of the 95% prediction interval. This value is used for computing MSIS.
lb (optional) – lower bound of the 95% prediction interval. This value is used for computing MSIS.
 check_before_eval()
 mae()
Mean Absolute Error (MAE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[\frac{1}{T}\sum_{t=1}^{T}{\left| y_t - \hat{y}_t \right|}.\]
 marre()
Mean Absolute Ranged Relative Error (MARRE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
 rmse()
Root Mean Squared Error (RMSE)
For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[\sqrt{\frac{1}{T}\sum_{t=1}^{T}{(y_t - \hat{y}_t)^2}}.\]
 smape()
symmetric Mean Absolute Percentage Error (sMAPE). For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as
\[200 \cdot \frac{1}{T} \sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
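The four point-error metrics documented so far (MAE, MARRE, RMSE and sMAPE) can be checked by hand with a few lines of plain Python. This is a minimal sketch that reads the error terms as absolute differences and takes series as equal-length lists of floats; the function names mirror the methods but are standalone helpers, not the accumulator's implementation.

```python
import math
from typing import List


def mae(y: List[float], y_hat: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def marre(y: List[float], y_hat: List[float]) -> float:
    """Mean absolute error relative to the range of the ground truth, in percent."""
    value_range = max(y) - min(y)
    return 100.0 * sum(abs(a - b) / value_range for a, b in zip(y, y_hat)) / len(y)


def rmse(y: List[float], y_hat: List[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def smape(y: List[float], y_hat: List[float]) -> float:
    """Symmetric mean absolute percentage error, on a 0 to 200 scale."""
    return 200.0 * sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(y, y_hat)) / len(y)
```

Note that marre is undefined for a constant ground truth and smape is undefined when both the true and predicted values are zero at some time step; the sketch does not guard against either case.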
 mase()
Mean Absolute Scaled Error (MASE). For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), and in-sample time series \(x\) of length \(N\) and periodicity \(m\), it is computed as
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
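MASE divides the forecast's mean absolute error by the in-sample mean absolute error of a seasonal naive forecast with lag m. A minimal sketch follows, assuming plain lists of floats and N > m; seasonal_naive_scale and mase are hypothetical helper names.

```python
from typing import List


def seasonal_naive_scale(insample: List[float], m: int = 1) -> float:
    """Mean absolute m-step difference of the in-sample series (the MASE denominator)."""
    n = len(insample)
    return sum(abs(insample[t] - insample[t - m]) for t in range(m, n)) / (n - m)


def mase(y: List[float], y_hat: List[float], insample: List[float], m: int = 1) -> float:
    """Mean absolute error of the forecast, scaled by the seasonal naive error."""
    forecast_mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)
    return forecast_mae / seasonal_naive_scale(insample, m)
```

A value below 1 means the forecast beats the in-sample seasonal naive baseline on average.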
 msis()
Mean Scaled Interval Score (MSIS). This metric evaluates the quality of 95% prediction intervals. For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), with lower and upper bounds \(L\) and \(U\) of the prediction intervals, and in-sample time series \(x\) of length \(N\) and periodicity \(m\), it is computed as
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} \left( (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t < L_t] + 100 \cdot (y_t - U_t)[y_t > U_t] \right)}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
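MSIS charges each time step the width of its interval, plus a penalty of 100 times the miss distance whenever the true value falls outside the interval, and then scales by the same seasonal naive error used by MASE. The sketch below follows that reading, with the sum taken over all three terms; msis here is a standalone hypothetical helper working on plain lists.

```python
from typing import List


def msis(
    y: List[float],
    lb: List[float],
    ub: List[float],
    insample: List[float],
    m: int = 1,
    penalty: float = 100.0,
) -> float:
    """Mean scaled interval score with the penalty factor of 100 used in the formula."""
    total = 0.0
    for y_t, l_t, u_t in zip(y, lb, ub):
        total += u_t - l_t                  # interval width
        if y_t < l_t:
            total += penalty * (l_t - y_t)  # truth below the lower bound
        if y_t > u_t:
            total += penalty * (y_t - u_t)  # truth above the upper bound
    n = len(insample)
    # Same denominator as MASE: mean absolute m-step in-sample difference.
    scale = sum(abs(insample[t] - insample[t - m]) for t in range(m, n)) / (n - m)
    return total / len(y) / scale
```

Narrow intervals that still cover the truth give the lowest score; a single miss quickly dominates because of the factor of 100.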
 merlion.evaluate.forecast.accumulate_forecast_score(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, metric=None)
 Return type
Union[ForecastScoreAccumulator, float]
 class merlion.evaluate.forecast.ForecastMetric(value)
Bases: Enum
Enumeration of evaluation metrics for time series forecasting. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predict, **kwargs). Here, ground_truth is the original time series, and predict is the result returned by a ForecastEvaluator.
 MAE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mae>)
Mean Absolute Error (MAE) is formulated as:
\[\frac{1}{T}\sum_{t=1}^{T}{\left| y_t - \hat{y}_t \right|}.\]
 MARRE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.marre>)
Mean Absolute Ranged Relative Error (MARRE) is formulated as:
\[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
 RMSE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmse>)
Root Mean Squared Error (RMSE) is formulated as:
\[\sqrt{\frac{1}{T}\sum_{t=1}^{T}{(y_t - \hat{y}_t)^2}}.\]
 sMAPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.smape>)
symmetric Mean Absolute Percentage Error (sMAPE) is formulated as:
\[200 \cdot \frac{1}{T} \sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
 MASE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mase>)
Mean Absolute Scaled Error (MASE) is formulated as:
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
 MSIS = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.msis>)
Mean Scaled Interval Score (MSIS) is formulated as:
\[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} \left( (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t < L_t] + 100 \cdot (y_t - U_t)[y_t > U_t] \right)}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
 class merlion.evaluate.forecast.ForecastEvaluatorConfig(horizon=None, **kwargs)
Bases: EvaluatorConfig
Configuration class for a ForecastEvaluator.
 Parameters
horizon (Optional[float]) – the model’s prediction horizon. Whenever the model makes a prediction, it will predict horizon seconds into the future.
 property horizon
 Returns
the horizon (number of seconds) our model is predicting into the future. Defaults to the retraining frequency.
 property cadence
 Returns
the cadence (interval, in number of seconds) at which we are having our model produce new predictions. Defaults to the predictive horizon if there is one, and the retraining frequency otherwise.
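The fallback rules stated for horizon and cadence can be written out as a small sketch. ToyForecastConfig is a hypothetical stand-in, not the real ForecastEvaluatorConfig; it only encodes the defaults described in the two Returns entries above.

```python
from typing import Optional


class ToyForecastConfig:
    """Hypothetical stand-in that mirrors the documented default resolution."""

    def __init__(
        self,
        retrain_freq: Optional[float] = None,
        horizon: Optional[float] = None,
        cadence: Optional[float] = None,
    ) -> None:
        self.retrain_freq = retrain_freq
        self._horizon = horizon
        self._cadence = cadence

    @property
    def horizon(self) -> Optional[float]:
        # Defaults to the retraining frequency.
        return self.retrain_freq if self._horizon is None else self._horizon

    @property
    def cadence(self) -> Optional[float]:
        # An explicit cadence wins, then the horizon, then the retraining frequency.
        if self._cadence is not None:
            return self._cadence
        if self._horizon is not None:
            return self._horizon
        return self.retrain_freq
```

For example, a config with retrain_freq=3600 and horizon=600 would produce a new 600-second forecast every 600 seconds while retraining hourly.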
 class merlion.evaluate.forecast.ForecastEvaluator(model, config)
Bases: EvaluatorBase
Simulates the live deployment of a forecaster model.
 Parameters
model – the model to evaluate.
config – the evaluation configuration.
 config_class
alias of ForecastEvaluatorConfig
 property horizon
 property cadence
 evaluate(ground_truth, predict, metric=ForecastMetric.sMAPE)
 Parameters
ground_truth (TimeSeries) – the series of test data.
predict (Union[TimeSeries, List[TimeSeries]]) – the series of predicted values.
metric (ForecastMetric) – the evaluation metric.