merlion.evaluate package
This sub-package implements utilities and metrics for evaluating the performance of time series models on different tasks.
| merlion.evaluate.base | Base class for an automated model evaluation framework. |
| merlion.evaluate.anomaly | Metrics and utilities for evaluating time series anomaly detection models. |
| merlion.evaluate.forecast | Metrics and utilities for evaluating forecasting models in a continuous sense. |
merlion.evaluate.base
Base class for an automated model evaluation framework.
- class merlion.evaluate.base.EvaluatorConfig(train_window=None, retrain_freq=None, cadence=None)
- Bases: object
- Abstract class which defines an evaluator config.
- Parameters
- train_window (Optional[float]) – the maximum duration of data we would like to train the model on. None means no limit.
- retrain_freq (Optional[float]) – the frequency at which we want to re-train the model. None means we only train the model once on the initial training data.
- cadence (Optional[float]) – the frequency at which we want to obtain predictions from the model. None means that we obtain a new prediction at the same frequency as the model’s predictive horizon. 0 means that we obtain a new prediction at every timestamp.
 
 - property train_window: Optional[Union[Timedelta, DateOffset]]
- Returns
- the maximum duration of data we would like to train the model on. None means no limit.
 
 - property retrain_freq: Optional[Union[Timedelta, DateOffset]]
- Returns
- the frequency at which we want to re-train the model. None means we only train the model on the initial training data.
 
 - property cadence: Union[Timedelta, DateOffset]
- Returns
- the cadence at which we are having our model produce new predictions. Defaults to the retraining frequency if not explicitly provided. 
 
 - property horizon: DateOffset
- Returns
- the horizon our model is predicting into the future. Equal to the prediction cadence by default. 
 
 - to_dict()
 
- class merlion.evaluate.base.EvaluatorBase(model, config)
- Bases: object
- An evaluator simulates the live deployment of a model on historical data. It trains a model on an initial time series, and then re-trains that model at a specified frequency.
- The EvaluatorBase.get_predict method returns the train & test predictions of a model, as if it were being trained incrementally on the test data in the manner described above.
- Subclasses define slightly different protocols for different tasks, e.g. anomaly detection vs. forecasting.
- Parameters
- model (ModelBase) – the model to evaluate.
- config (EvaluatorConfig) – the evaluation configuration.
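The simulated deployment described above can be pictured as a schedule of train, retrain and predict events. The sketch below is only an illustration under simplifying assumptions: times are plain numbers of seconds, the helper name simulate_schedule and its arguments are hypothetical, and it requires a positive cadence. It is not Merlion's implementation.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, float, Optional[float]]  # (kind, time, start of training window)


def simulate_schedule(
    t_train_end: float,
    t_test_end: float,
    retrain_freq: Optional[float],
    cadence: float,
    train_window: Optional[float] = None,
) -> List[Event]:
    """List the events of a simulated deployment (hypothetical helper)."""
    if cadence <= 0:
        raise ValueError("this sketch requires a positive cadence")

    def window_start(t: float) -> Optional[float]:
        # None means "no limit": train on all data seen so far.
        return None if train_window is None else t - train_window

    events: List[Event] = [("train", t_train_end, window_start(t_train_end))]
    next_retrain = None if retrain_freq is None else t_train_end + retrain_freq
    t = t_train_end
    while t < t_test_end:
        # Re-train first if a retraining is due, then produce a prediction.
        if next_retrain is not None and t >= next_retrain:
            events.append(("retrain", t, window_start(t)))
            next_retrain += retrain_freq
        events.append(("predict", t, None))
        t += cadence
    return events


schedule = simulate_schedule(
    t_train_end=100.0, t_test_end=160.0,
    retrain_freq=30.0, cadence=10.0, train_window=50.0,
)
for event in schedule:
    print(event)
```

With these inputs the model is trained once at t=100 on the window starting at t=50, predicts every 10 seconds, and is re-trained once at t=130 on the window starting at t=80.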
 
 - config_class
- alias of EvaluatorConfig
 - property train_window
 - property retrain_freq
 - property cadence
 - property horizon
 - default_train_kwargs()
- Return type
- dict
 
 - default_retrain_kwargs()
- Return type
- dict
 
 - get_predict(train_vals, test_vals, exog_data=None, train_kwargs=None, retrain_kwargs=None)
- Initialize the model by training it on an initial set of train data. Get the model’s predictions on the test data, retraining the model as appropriate.
- Parameters
- train_vals (TimeSeries) – initial training data
- test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare them to the ground truth
- exog_data (Optional[TimeSeries]) – any exogenous data (only used for some models)
- train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process
- retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings
 
- Return type
- Tuple[Any, Union[TimeSeries, List[TimeSeries]]]
- Returns
- (train_result, result). train_result is the output of training the model on train_vals (None if pretrained is True). result is the model’s predictions on test_vals, and is specific to each evaluation task.
 
 - abstract evaluate(ground_truth, predict, metric)
- Given the ground truth time series & the model’s prediction (as produced by EvaluatorBase.get_predict), compute the specified evaluation metric. If no metric is specified, return the appropriate score accumulator for the task. Implementation is task-specific. 
 
merlion.evaluate.anomaly
Metrics and utilities for evaluating time series anomaly detection models.
- class merlion.evaluate.anomaly.ScoreType(value)
- Bases: Enum
- The algorithm to use to compute true/false positives/negatives. See the technical report for more details on each score type. Merlion’s preferred default is revised point-adjusted.
 - Pointwise = 0
 - PointAdjusted = 1
 - RevisedPointAdjusted = 2
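To make the difference between the three score types concrete, here is a minimal sketch on binary label and prediction lists. It assumes the commonly used definitions: pointwise compares every timestamp individually; point-adjusted credits every point of an anomaly segment if any point in it is detected; revised point-adjusted counts each anomaly segment as a single true positive or false negative, while false positives are still counted per point. The function names and string keys are hypothetical, and the technical report remains the authoritative description.

```python
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def anomaly_segments(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges of consecutive anomalous labels."""
    segments, i = [], 0
    for is_anom, run in groupby(labels, key=bool):
        n = len(list(run))
        if is_anom:
            segments.append((i, i + n))
        i += n
    return segments


def count_tp_fn_fp(labels: Sequence[int], preds: Sequence[int], score_type: str) -> Dict[str, int]:
    """Count TP/FN/FP under 'pointwise', 'point_adjusted' or 'revised_point_adjusted'."""
    # False positives are counted per timestamp for every score type in this sketch.
    fp = sum(1 for y, p in zip(labels, preds) if p and not y)
    tp = fn = 0
    if score_type == "pointwise":
        tp = sum(1 for y, p in zip(labels, preds) if y and p)
        fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    else:
        for start, end in anomaly_segments(labels):
            detected = any(preds[start:end])
            # Point-adjusted credits the whole segment; revised counts it once.
            weight = (end - start) if score_type == "point_adjusted" else 1
            if detected:
                tp += weight
            else:
                fn += weight
    return {"tp": tp, "fn": fn, "fp": fp}


labels = [0, 1, 1, 1, 0, 0, 1, 1, 0]
preds = [0, 0, 1, 0, 1, 0, 0, 0, 0]
for kind in ("pointwise", "point_adjusted", "revised_point_adjusted"):
    print(kind, count_tp_fn_fp(labels, preds, kind))
```

The same predictions yield different counts under each scheme, which is why every precision, recall and F-score method takes a score_type argument.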
 
- class merlion.evaluate.anomaly.TSADScoreAccumulator(num_tp_anom=0, num_tp_pointwise=0, num_tp_point_adj=0, num_fn_anom=0, num_fn_pointwise=0, num_fn_point_adj=0, num_fp=0, num_tn=0, tp_score=0.0, fp_score=0.0, tp_detection_delays=None, tp_anom_durations=None, anom_durations=None)
- Bases: object
- Accumulator which maintains summary statistics describing an anomaly detection algorithm’s performance. Can be used to compute many different time series anomaly detection metrics.
 - precision(score_type=ScoreType.RevisedPointAdjusted)
 - recall(score_type=ScoreType.RevisedPointAdjusted)
 - f1(score_type=ScoreType.RevisedPointAdjusted)
 - f_beta(score_type=ScoreType.RevisedPointAdjusted, beta=1.0)
 - mean_time_to_detect()
 - mean_detected_anomaly_duration()
 - mean_anomaly_duration()
 - nab_score(tp_weight=1.0, fp_weight=0.11, fn_weight=1.0, tn_weight=0.0)
- Computes the NAB score, given the accumulated performance metrics and the specified weights for different types of errors. The score is described in section II.C of https://arxiv.org/pdf/1510.03336.pdf. At a high level, this score is a cost-sensitive, recency-weighted accuracy measure for time series anomaly detection.
- NAB uses the following profiles for benchmarking (https://github.com/numenta/NAB/blob/master/config/profiles.json):
- standard (default): tp_weight = 1.0, fp_weight = 0.11, fn_weight = 1.0
- reward low false positive rate: tp_weight = 1.0, fp_weight = 0.22, fn_weight = 1.0
- reward low false negative rate: tp_weight = 1.0, fp_weight = 0.11, fn_weight = 2.0
- Note that tn_weight is ignored.
- Parameters
- tp_weight – relative weight of true positives. 
- fp_weight – relative weight of false positives. 
- fn_weight – relative weight of false negatives. 
- tn_weight – relative weight of true negatives. Ignored, but included for completeness. 
 
- Returns
- NAB score 
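The precision, recall, f1 and f_beta methods listed for this accumulator all derive from true positive, false positive and false negative counts. The sketch below uses the standard textbook definitions on plain integers; the free functions and the zero-division conventions are assumptions made for illustration and may differ from the accumulator's own handling of edge cases.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of detections that are correct (0.0 if there are no detections)."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of anomalies that are detected (0.0 if there are no anomalies)."""
    return tp / (tp + fn) if tp + fn > 0 else 0.0


def f_beta(prec: float, rec: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if prec == 0 or rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)


p = precision(tp=3, fp=1)  # 0.75
r = recall(tp=3, fn=2)     # 0.6
print(f"precision={p:.3f} recall={r:.3f}")
print(f"F1={f_beta(p, r):.3f} F2={f_beta(p, r, beta=2.0):.3f}")
```

Because recall is lower than precision in this example, F2 (which weights recall more heavily) comes out below F1.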
 
 
- merlion.evaluate.anomaly.accumulate_tsad_score(ground_truth, predict, max_early_sec=None, max_delay_sec=None, metric=None)
- Computes the components required to compute multiple different types of performance metrics for time series anomaly detection. - Parameters
- ground_truth (Union[TimeSeries, UnivariateTimeSeries]) – A time series indicating whether each time step corresponds to an anomaly.
- predict (Union[TimeSeries, UnivariateTimeSeries]) – A time series with the anomaly score predicted for each time step. Detections correspond to nonzero scores.
- max_early_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur before the actual incidence. If None, no early detections are allowed. Note that None is the same as 0.
- max_delay_sec – The maximum amount of time (in seconds) the anomaly detection is allowed to occur after the start of the actual incident (but before the end of the actual incident). If None, we allow any detection during the duration of the incident. Note that None differs from 0 because 0 means that we only permit detections that are early or exactly on time!
- metric – A function which takes a TSADScoreAccumulator as input and returns a float. The TSADScoreAccumulator object is returned if metric is None.
 
- Return type
- Union[TSADScoreAccumulator, float]
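The interaction between max_early_sec and max_delay_sec described above can be written as an interval check. The helper below is a hypothetical sketch, not part of Merlion: it takes plain timestamps in seconds and treats both ends of the acceptance interval as inclusive, which is an assumption.

```python
from typing import Optional


def is_valid_detection(
    t_detect: float,
    t_start: float,
    t_end: float,
    max_early_sec: Optional[float] = None,
    max_delay_sec: Optional[float] = None,
) -> bool:
    """Check whether a detection at t_detect counts for the anomaly [t_start, t_end]."""
    # None is the same as 0 for early detections: nothing before t_start counts.
    early = 0.0 if max_early_sec is None else max_early_sec
    earliest = t_start - early
    # None means any detection during the incident counts; otherwise the
    # detection must also fall within max_delay_sec of the incident's start.
    latest = t_end if max_delay_sec is None else min(t_end, t_start + max_delay_sec)
    return earliest <= t_detect <= latest


# Anomaly lasting from t=100 to t=200.
print(is_valid_detection(150, 100, 200))                    # mid-incident, no limits
print(is_valid_detection(95, 100, 200))                     # early, not allowed by default
print(is_valid_detection(95, 100, 200, max_early_sec=10))   # early, within tolerance
print(is_valid_detection(150, 100, 200, max_delay_sec=0))   # too late when delay is 0
```

The last call shows why None and 0 differ for max_delay_sec: with 0, only detections at or before the start of the incident are accepted.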
 
- class merlion.evaluate.anomaly.TSADMetric(value)
- Bases: Enum
- Enumeration of evaluation metrics for time series anomaly detection. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predicted, **kwargs).
 - MeanTimeToDetect = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.mean_time_to_detect>)
 - F1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
 - Precision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
 - Recall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.RevisedPointAdjusted: 2>))
 - PointwiseF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.Pointwise: 0>))
 - PointwisePrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.Pointwise: 0>))
 - PointwiseRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.Pointwise: 0>))
 - PointAdjustedF1 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f1>, score_type=<ScoreType.PointAdjusted: 1>))
 - PointAdjustedPrecision = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.precision>, score_type=<ScoreType.PointAdjusted: 1>))
 - PointAdjustedRecall = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.recall>, score_type=<ScoreType.PointAdjusted: 1>))
 - NABScore = functools.partial(<function accumulate_tsad_score>, metric=<function TSADScoreAccumulator.nab_score>)
 - NABScoreLowFN = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fn_weight=2.0))
 - NABScoreLowFP = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.nab_score>, fp_weight=0.22))
 - F2 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=2.0))
 - F5 = functools.partial(<function accumulate_tsad_score>, metric=functools.partial(<function TSADScoreAccumulator.f_beta>, score_type=<ScoreType.RevisedPointAdjusted: 2>, beta=5.0))
 
- class merlion.evaluate.anomaly.TSADEvaluatorConfig(max_early_sec=None, max_delay_sec=None, **kwargs)
- Bases: EvaluatorConfig
- Configuration class for a TSADEvaluator.
- Parameters
- max_early_sec (Optional[float]) – the maximum number of seconds we allow an anomaly to be detected early.
- max_delay_sec (Optional[float]) – if an anomaly is detected more than this many seconds after its start, it is not counted as being detected.
 
 
- class merlion.evaluate.anomaly.TSADEvaluator(model, config)
- Bases: EvaluatorBase
- Simulates the live deployment of an anomaly detection model.
- Parameters
- model – the model to evaluate. 
- config – the evaluation configuration. 
 
 - config_class
- alias of TSADEvaluatorConfig
 - property max_early_sec
 - property max_delay_sec
 - default_retrain_kwargs()
- Return type
- dict
 
 - get_predict(train_vals, test_vals, exog_data=None, train_kwargs=None, retrain_kwargs=None, post_process=True)
- Initialize the model by training it on an initial set of train data. Simulate real-time anomaly detection by the model, while re-training it at the desired frequency.
- Parameters
- train_vals (TimeSeries) – initial training data
- test_vals (TimeSeries) – all data where we want to get the model’s predictions and compare them to the ground truth
- exog_data (Optional[TimeSeries]) – any exogenous data (only used for some models)
- train_kwargs (Optional[dict]) – dict of keyword arguments we want to use for the initial training process. Typically, you will want to provide the key “anomaly_labels” here, if you have training data with labeled anomalies, as well as the key “post_rule_train_config”, if you want to use a custom training config for the model’s post-rule.
- retrain_kwargs (Optional[dict]) – dict of keyword arguments we want to use for all subsequent retrainings. Typically, you will not supply this argument.
- post_process – whether to apply the model’s post-rule on the returned results.
 
- Return type
- Tuple[TimeSeries, TimeSeries]
- Returns
- (train_result, result). train_result is a TimeSeries of the model’s anomaly scores on train_vals. result is a TimeSeries of the model’s anomaly scores on test_vals.
 
 - evaluate(ground_truth, predict, metric=None)
- Parameters
- ground_truth (TimeSeries) – TimeSeries of ground truth anomaly labels
- predict (TimeSeries) – TimeSeries of predicted anomaly scores
- metric (Optional[TSADMetric]) – the TSADMetric we wish to evaluate.
 
- Return type
- Union[TSADScoreAccumulator, float]
- Returns
- the value of the evaluation metric, if one is given. A TSADScoreAccumulator otherwise.
 
 
merlion.evaluate.forecast
Metrics and utilities for evaluating forecasting models in a continuous sense.
- class merlion.evaluate.forecast.ForecastScoreAccumulator(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, target_seq_index=None)
- Bases: object
- Accumulator which maintains summary statistics describing a forecasting algorithm’s performance. Can be used to compute many different forecasting metrics.
- Parameters
- ground_truth (Union[UnivariateTimeSeries, TimeSeries]) – ground truth time series
- predict (Union[UnivariateTimeSeries, TimeSeries]) – predicted time series
- insample (optional) – time series used for training the model. This value is used for computing MASE and MSIS.
- periodicity (optional) – the periodicity m. m=1 indicates a non-seasonal time series, whereas m>1 indicates a seasonal time series. This value is used for computing MASE and MSIS.
- ub (optional) – upper bound of the 95% prediction interval. This value is used for computing MSIS.
- lb (optional) – lower bound of the 95% prediction interval. This value is used for computing MSIS.
- target_seq_index (optional) – the index of the target sequence, for multivariate time series.
 
 - check_before_eval()
 - mae()
- Mean Absolute Error (MAE) - For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as \[\frac{1}{T}\sum_{t=1}^T{(|y_t - \hat{y}_t|)}.\]
 - marre()
- Mean Absolute Ranged Relative Error (MARRE) - For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as \[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
 - rmse()
- Root Mean Squared Error (RMSE) - For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as \[\sqrt{\frac{1}{T}\sum_{t=1}^T{(y_t - \hat{y}_t)^2}}.\]
 - smape()
- symmetric Mean Absolute Percentage Error (sMAPE). For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as \[200 \cdot \frac{1}{T} \sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
 - rmspe()
- Root Mean Squared Percent Error (RMSPE) - For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), it is computed as \[100 \cdot \sqrt{\frac{1}{T}\sum_{t=1}^T\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}.\]
 - mase()
- Mean Absolute Scaled Error (MASE) - For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), and in-sample time series \(x\) of length \(N\) with periodicity \(m\), it is computed as \[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
 - msis()
- Mean Scaled Interval Score (MSIS) - This metric evaluates the quality of 95% prediction intervals. For ground truth time series \(y\) and predicted time series \(\hat{y}\) of length \(T\), with lower and upper bounds of the prediction intervals \(L\) and \(U\), and in-sample time series \(x\) of length \(N\) with periodicity \(m\), it is computed as \[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} \left( (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t<L_t] + 100\cdot(y_t - U_t)[y_t > U_t] \right)}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
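The formulas above translate directly into a few lines of plain Python. The sketch below operates on lists of floats rather than TimeSeries objects, and the function names simply mirror the method names for readability; it is an illustration of the formulas, not Merlion's implementation. The sample data is made up.

```python
import math
from typing import Sequence


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def smape(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return 200 * sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(y, y_hat)) / len(y)


def rmspe(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return 100 * math.sqrt(sum(((a - b) / a) ** 2 for a, b in zip(y, y_hat)) / len(y))


def naive_scale(x: Sequence[float], m: int = 1) -> float:
    """Mean absolute error of the seasonal naive forecast on the in-sample data."""
    return sum(abs(x[t] - x[t - m]) for t in range(m, len(x))) / (len(x) - m)


def mase(y, y_hat, insample, m: int = 1) -> float:
    return mae(y, y_hat) / naive_scale(insample, m)


def msis(y, lb, ub, insample, m: int = 1) -> float:
    # Interval width plus a penalty of 100x for every miss below lb or above ub.
    total = sum(
        (u - l) + 100 * max(l - a, 0.0) + 100 * max(a - u, 0.0)
        for a, l, u in zip(y, lb, ub)
    )
    return total / len(y) / naive_scale(insample, m)


y = [10.0, 20.0, 30.0, 40.0]       # ground truth
y_hat = [12.0, 18.0, 33.0, 37.0]   # forecast
lb = [8.0, 15.0, 31.0, 30.0]       # lower bounds (the third interval misses y)
ub = [14.0, 25.0, 39.0, 45.0]      # upper bounds
insample = [1.0, 3.0, 6.0, 10.0]   # training data

print("MAE", mae(y, y_hat), "RMSE", rmse(y, y_hat))
print("sMAPE", smape(y, y_hat), "RMSPE", rmspe(y, y_hat))
print("MASE", mase(y, y_hat, insample), "MSIS", msis(y, lb, ub, insample))
```

Note that sMAPE and RMSPE divide by values of the series, so they are undefined when the relevant denominators are zero; this sketch does not guard against that case.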
 
- merlion.evaluate.forecast.accumulate_forecast_score(ground_truth, predict, insample=None, periodicity=1, ub=None, lb=None, metric=None, target_seq_index=None)
- Return type
- Union[- ForecastScoreAccumulator,- float]
 
- class merlion.evaluate.forecast.ForecastMetric(value)
- Bases: Enum
- Enumeration of evaluation metrics for time series forecasting. For each value, the name is the metric, and the value is a partial function of form f(ground_truth, predict, **kwargs). Here, ground_truth is the original time series, and predict is the result returned by a ForecastEvaluator.
 - MAE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mae>)
- Mean Absolute Error (MAE) is formulated as: \[\frac{1}{T}\sum_{t=1}^T{(|y_t - \hat{y}_t|)}.\]
 - MARRE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.marre>)
- Mean Absolute Ranged Relative Error (MARRE) is formulated as: \[100 \cdot \frac{1}{T} \sum_{t=1}^{T} {\left| \frac{y_t - \hat{y}_t} {\max_t{y_t} - \min_t{y_t}} \right|}.\]
 - RMSE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmse>)
- Root Mean Squared Error (RMSE) is formulated as: \[\sqrt{\frac{1}{T}\sum_{t=1}^T{(y_t - \hat{y}_t)^2}}.\]
 - sMAPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.smape>)
- symmetric Mean Absolute Percentage Error (sMAPE) is formulated as: \[200 \cdot \frac{1}{T}\sum_{t=1}^{T}{\frac{\left| y_t - \hat{y}_t \right|}{\left| y_t \right| + \left| \hat{y}_t \right|}}.\]
 - RMSPE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.rmspe>)
- Root Mean Squared Percent Error (RMSPE) is formulated as: \[100 \cdot \sqrt{\frac{1}{T}\sum_{t=1}^T\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}.\]
 - MASE = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.mase>)
- Mean Absolute Scaled Error (MASE) is formulated as: \[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T}\left| y_t - \hat{y}_t \right|}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
 - MSIS = functools.partial(<function accumulate_forecast_score>, metric=<function ForecastScoreAccumulator.msis>)
- Mean Scaled Interval Score (MSIS) is formulated as: \[\frac{1}{T}\cdot\frac{\sum_{t=1}^{T} \left( (U_t - L_t) + 100 \cdot (L_t - y_t)[y_t<L_t] + 100\cdot(y_t - U_t)[y_t > U_t] \right)}{\frac{1}{N-m}\sum_{t=m+1}^{N}\left| x_t - x_{t-m} \right|}.\]
 
- class merlion.evaluate.forecast.ForecastEvaluatorConfig(horizon=None, **kwargs)
- Bases: EvaluatorConfig
- Configuration class for a ForecastEvaluator.
- Parameters
- horizon (Optional[float]) – the model’s prediction horizon. Whenever the model makes a prediction, it will predict horizon seconds into the future.
 - property horizon: Optional[Union[Timedelta, DateOffset]]
- Returns
- the horizon our model is predicting into the future. Defaults to the retraining frequency. 
 
 - property cadence: Optional[Union[Timedelta, DateOffset]]
- Returns
- the cadence at which we are having our model produce new predictions. Defaults to the predictive horizon if there is one, and the retraining frequency otherwise. 
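The default rules for horizon and cadence stated above amount to a short fallback chain. The sketch below expresses them with plain numbers of seconds; resolve_horizon and resolve_cadence are hypothetical helper names used for illustration, not Merlion functions.

```python
from typing import Optional


def resolve_horizon(horizon: Optional[float], retrain_freq: Optional[float]) -> Optional[float]:
    """Horizon defaults to the retraining frequency when not given."""
    return horizon if horizon is not None else retrain_freq


def resolve_cadence(
    cadence: Optional[float],
    horizon: Optional[float],
    retrain_freq: Optional[float],
) -> Optional[float]:
    """Cadence defaults to the horizon if there is one, else the retraining frequency."""
    if cadence is not None:  # an explicit 0 is a valid cadence, so compare to None
        return cadence
    if horizon is not None:
        return horizon
    return retrain_freq


print(resolve_horizon(None, 86400.0))          # falls back to retrain_freq
print(resolve_cadence(None, 3600.0, 86400.0))  # falls back to horizon
print(resolve_cadence(None, None, 86400.0))    # falls back to retrain_freq
print(resolve_cadence(0, 3600.0, 86400.0))     # explicit cadence wins
```

Comparing against None rather than relying on truthiness matters here, since a cadence of 0 is meaningful (a new prediction at every timestamp).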
 
 
- class merlion.evaluate.forecast.ForecastEvaluator(model, config)
- Bases: EvaluatorBase
- Simulates the live deployment of a forecasting model.
- Parameters
- model – the model to evaluate. 
- config – the evaluation configuration. 
 
 - config_class
- alias of ForecastEvaluatorConfig
 - property horizon
 - property cadence
 - evaluate(ground_truth, predict, metric=ForecastMetric.sMAPE)
- Parameters
- ground_truth (TimeSeries) – the series of test data
- predict (Union[TimeSeries, List[TimeSeries]]) – the series of predicted values
- metric (ForecastMetric) – the evaluation metric.