utils

Contains various utility files & functions useful for different models.

time_features

Utils for converting pandas datetime to numerical vectors

rolling_window_dataset

A rolling window dataset

early_stopping

Early Stopping

autosarima_utils

Low-level utils for AutoML models.

utils.time_features

Utils for converting pandas datetime to numerical vectors

class merlion.models.utils.time_features.TimeFeature

Bases: object

class merlion.models.utils.time_features.SecondOfMinute

Bases: TimeFeature

Second of minute encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.MinuteOfHour

Bases: TimeFeature

Minute of hour encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.HourOfDay

Bases: TimeFeature

Hour of day encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.DayOfWeek

Bases: TimeFeature

Day of week encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.DayOfMonth

Bases: TimeFeature

Day of month encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.DayOfYear

Bases: TimeFeature

Day of year encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.MonthOfYear

Bases: TimeFeature

Month of year encoded as a value in [-0.5, 0.5]

class merlion.models.utils.time_features.WeekOfYear

Bases: TimeFeature

Week of year encoded as a value in [-0.5, 0.5]
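As a rough illustration of the [-0.5, 0.5] encoding, the sketch below maps an hour of day and a day of week linearly onto that interval. The helper names and the exact scaling (dividing by the largest possible value) are assumptions made for illustration; they are not taken from the Merlion source.

```python
from datetime import datetime


def encode_hour_of_day(ts: datetime) -> float:
    """Map hour 0..23 linearly onto [-0.5, 0.5] (assumed scaling)."""
    return ts.hour / 23.0 - 0.5


def encode_day_of_week(ts: datetime) -> float:
    """Map weekday 0 (Monday)..6 (Sunday) linearly onto [-0.5, 0.5] (assumed scaling)."""
    return ts.weekday() / 6.0 - 0.5


ts = datetime(2023, 1, 1, 23, 0)  # a Sunday at 23:00
features = [encode_hour_of_day(ts), encode_day_of_week(ts)]
```

Centering every feature on zero with the same range keeps the time features on a comparable scale when they are fed to a model together.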

merlion.models.utils.time_features.time_features_from_frequency_str(freq_str)
Parameters

freq_str (str) – Frequency string of the form [multiple][granularity] such as “12H”, “5min”, “1D” etc.

Return type

List[TimeFeature]

Returns

a list of time features that will be appropriate for the given frequency string.
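To make the [multiple][granularity] format concrete, here is a minimal sketch that splits such a string into its two parts. parse_freq_str is a hypothetical helper, not part of Merlion, and treating a missing multiple as 1 is an assumption. The sketch does not show which time features are chosen for each granularity.

```python
import re
from typing import Tuple


def parse_freq_str(freq_str: str) -> Tuple[int, str]:
    """Split a string such as "12H" or "5min" into (multiple, granularity).

    Hypothetical helper: a missing multiple is treated as 1.
    """
    match = re.fullmatch(r"(\d*)([A-Za-z]+)", freq_str.strip())
    if match is None:
        raise ValueError(f"Unrecognized frequency string: {freq_str!r}")
    multiple = int(match.group(1)) if match.group(1) else 1
    return multiple, match.group(2)
```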

merlion.models.utils.time_features.get_time_features(dates, ts_encoding='h')

Convert pandas Datetime to numerical vectors that can be used for training

utils.rolling_window_dataset

A rolling window dataset

class merlion.models.utils.rolling_window_dataset.RollingWindowDataset(data, target_seq_index, n_past, n_future, exog_data=None, shuffle=False, ts_index=False, batch_size=1, flatten=True, ts_encoding=None, valid_fraction=0.0, validation=False, seed=0)

Bases: object

A rolling window dataset which returns (past, future) windows for the whole time series. If ts_index=True is used, a batch size of 1 is employed, and each window returned by the dataset is (past, future), where past and future are both TimeSeries objects. If ts_index=False is used (default option, more efficient), each window returned by the dataset is (past_np, past_time, future_np, future_time):

  • past_np is a numpy array with shape (batch_size, n_past * dim) if flatten is True, otherwise (batch_size, n_past, dim).

  • past_time is a numpy array of times with shape (batch_size, n_past)

  • future_np is a numpy array with shape (batch_size, dim) if target_seq_index is None (autoregressive prediction), or shape (batch_size, n_future) if target_seq_index is specified.

  • future_time is a numpy array of times with shape (batch_size, n_future)
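The window layout described above can be sketched in plain Python, without batching or timestamps. The function name, the stride of one step between windows, and the resulting window count of len(data) - n_past - n_future + 1 are assumptions for illustration, not a description of the class internals.

```python
from typing import Iterator, List, Tuple

Row = List[float]


def rolling_windows(data: List[Row], n_past: int, n_future: int,
                    target_seq_index: int, flatten: bool = True) -> Iterator[Tuple[list, list]]:
    """Yield (past, future) pairs from a list of rows, one window per time step."""
    n_windows = len(data) - n_past - n_future + 1
    for start in range(max(n_windows, 0)):
        past_rows = data[start:start + n_past]
        future_rows = data[start + n_past:start + n_past + n_future]
        # flatten=True concatenates the rows into one vector of length n_past * dim
        past = [v for row in past_rows for v in row] if flatten else past_rows
        # only the target variable is kept for the future labels
        future = [row[target_seq_index] for row in future_rows]
        yield past, future


data = [[float(t), float(10 * t)] for t in range(6)]  # 6 time steps, dim = 2
windows = list(rolling_windows(data, n_past=3, n_future=2, target_seq_index=1))
```

With 6 time steps, n_past=3 and n_future=2, this sketch produces 2 windows; each past vector has n_past * dim = 6 entries and each future vector has n_future = 2 entries.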

Parameters
  • data (Union[TimeSeries, DataFrame]) – time series data in the format of TimeSeries or pandas DataFrame with DatetimeIndex

  • target_seq_index (Optional[int]) – The index of the univariate (amongst all univariates in a general multivariate time series) whose value we would like to use for the future labeling. If target_seq_index = None, it implies that all the sequences are required for the future labeling. In this case, we set n_future = 1 and use the time series for 1-step autoregressive prediction.

  • n_past (int) – number of steps for past

  • n_future (int) – number of steps for future. If target_seq_index = None, we manually set n_future = 1.

  • exog_data (Union[TimeSeries, DataFrame, None]) – exogenous data to use as inputs for the model, but not as outputs to predict. We assume the future values of exogenous variables are known a priori at test time.

  • shuffle (bool) – whether the windows of the time series should be shuffled.

  • ts_index (bool) – whether to keep the original TimeSeries internally for all slicing and output TimeSeries objects. By default, NumPy arrays handle the internal data workflow and NumPy arrays are the output.

  • batch_size (Optional[int]) – the number of windows to return in parallel. If None, return the whole dataset.

  • flatten (bool) – whether the output time series arrays should be flattened to 2 dimensions.

  • ts_encoding (Optional[str]) – whether the timestamp should be encoded to a float vector, which can be used for training deep learning based time series models. If None, the timestamp is not encoded. If not None, it specifies the frequency used for time feature encoding. Options: s (secondly), t (minutely), h (hourly), d (daily), b (business days), w (weekly), m (monthly).

  • valid_fraction (float) – Fraction of the training data split off as the validation set. If valid_fraction = 0 or valid_fraction = 1, we iterate over the entire dataset.

  • validation (Optional[bool]) – Whether the data is from the validation set or not. If validation = None, we iterate over the entire dataset.

property validation

If set to False, we only provide access to the training windows; if set to True, we only provide access to the validation windows. If set to None, we iterate over the entire dataset.

property seed

Random seed used to perturb the training data

property n_windows

Total number of sliding windows

property n_valid

Number of sliding windows in the validation set

property n_train

Number of sliding windows in the training set

property n_points
collate_batch(batch)

utils.early_stopping

Early Stopping

class merlion.models.utils.early_stopping.EarlyStopping(patience=7, delta=0)

Bases: object

Early stopping for deep model training

Parameters
  • patience – Number of epochs with no improvement after which training will be stopped.

  • delta – Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than delta will count as no improvement.
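The interaction between patience and delta can be sketched as follows. SimpleEarlyStopping and its step method are hypothetical stand-ins written for illustration; they do not reproduce the EarlyStopping class or its model-saving methods, and they assume the monitored quantity is a validation loss that should decrease.

```python
class SimpleEarlyStopping:
    """Hypothetical stand-in that tracks a validation loss with patience and delta."""

    def __init__(self, patience: int = 7, delta: float = 0.0):
        self.patience = patience
        self.delta = delta
        self.best_loss = None
        self.counter = 0
        self.early_stop = False

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True once training should stop."""
        if self.best_loss is None or val_loss < self.best_loss - self.delta:
            # improvement of more than delta: remember it and reset the counter
            self.best_loss = val_loss
            self.counter = 0
        else:
            # no sufficient improvement this epoch
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop


stopper = SimpleEarlyStopping(patience=2, delta=0.01)
losses = [1.0, 0.8, 0.795, 0.798, 0.5]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this run the loss at epoch 2 improves on 0.8 by less than delta, so it counts as no improvement, and training stops after the second such epoch, before the final loss of 0.5 is ever seen.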

save_best_state_and_dict(val_loss, model)
load_best_model(model)

utils.autosarima_utils

Low-level utils for AutoML models.

merlion.models.utils.autosarima_utils.diff(x, lag=1, differences=1)

Return suitably lagged and iterated differences from the given 1D or 2D array x
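For a 1D input, lagged and iterated differencing can be sketched in a few lines. lagged_diff is a hypothetical pure-Python helper used only to illustrate the roles of lag and differences; it is not the library function and does not handle 2D arrays.

```python
from typing import List


def lagged_diff(x: List[float], lag: int = 1, differences: int = 1) -> List[float]:
    """Apply x[t] - x[t - lag] to a 1D sequence, repeated `differences` times."""
    out = list(x)
    for _ in range(differences):
        out = [out[t] - out[t - lag] for t in range(lag, len(out))]
    return out
```

Each pass shortens the sequence by lag elements, so differencing a sequence of length n yields n - lag * differences values.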

merlion.models.utils.autosarima_utils.detect_maxiter_sarima_model(y, d, D, m, method, information_criterion, exog=None, **kwargs)

Run a zero model with SARIMA(2; d; 2)(1; D; 1) / ARIMA(2; d; 2) to determine the optimal maxiter.

merlion.models.utils.autosarima_utils.seas_seasonalstationaritytest(x, m)

Estimate the strength of the seasonal component. The idea is described at https://otexts.com/fpp2/seasonal-strength.html. The R implementation uses mstl instead of stl to deal with multiple seasonality.
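The seasonal strength measure from the referenced chapter is F_S = max(0, 1 - Var(R) / Var(S + R)), where S is the seasonal component and R the remainder of a decomposition. The sketch below evaluates that formula for components that have already been computed. The function name is hypothetical and the decomposition step itself (STL or similar) is assumed to happen elsewhere.

```python
from statistics import pvariance
from typing import Sequence


def seasonal_strength(seasonal: Sequence[float], remainder: Sequence[float]) -> float:
    """F_S = max(0, 1 - Var(R) / Var(S + R)) for already-decomposed components."""
    detrended_var = pvariance([s + r for s, r in zip(seasonal, remainder)])
    if detrended_var == 0:
        # a constant detrended series carries no seasonal signal
        return 0.0
    return max(0.0, 1.0 - pvariance(remainder) / detrended_var)
```

Values near 1 indicate that the seasonal component dominates the remainder, while values near 0 indicate little or no seasonality.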

merlion.models.utils.autosarima_utils.nsdiffs(x, m, max_D=1, test='seas')

Estimate the seasonal differencing order D with statistical test

Parameters
  • x – the time series to difference

  • m – the number of seasonal periods

  • max_D – the maximal number of seasonal differencing order allowed

  • test – the type of test of seasonality to use to detect seasonal periodicity

merlion.models.utils.autosarima_utils.KPSS_stationaritytest(xx, alpha=0.05)

The KPSS test is used with the null hypothesis that x has a stationary root against a unit-root alternative. Then the test returns the least number of differences required to pass the test at the level alpha

merlion.models.utils.autosarima_utils.ndiffs(x, alpha=0.05, max_d=2, test='kpss')

Estimate the differencing order d with statistical test

Parameters
  • x – the time series to difference

  • alpha – level of the test, possible values range from 0.01 to 0.1

  • max_d – the maximal number of differencing order allowed

  • test – the type of stationarity test to use to determine the differencing order
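The general idea of estimating d is to keep differencing until a stationarity test accepts the series or max_d is reached. The sketch below shows that loop with a pluggable test. estimate_ndiffs and toy_test are hypothetical names, and toy_test is a deliberately crude stand-in for a real statistical test such as KPSS.

```python
from typing import Callable, List


def estimate_ndiffs(x: List[float],
                    is_stationary: Callable[[List[float]], bool],
                    max_d: int = 2) -> int:
    """Difference x until `is_stationary` accepts it or max_d is reached."""
    d = 0
    series = list(x)
    while d < max_d and not is_stationary(series):
        series = [series[t] - series[t - 1] for t in range(1, len(series))]
        d += 1
    return d


def toy_test(series: List[float]) -> bool:
    """Toy stand-in for a real test: accept series whose two halves have equal means."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    return abs(sum(first) / len(first) - sum(second) / len(second)) < 1e-9


# A quadratic trend needs two rounds of differencing to become constant.
d = estimate_ndiffs([float(t * t) for t in range(10)], toy_test, max_d=3)
```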