merlion.utils package
This package contains various utilities, including the TimeSeries class and utilities for resampling time series.
Submodules
merlion.utils.istat module
- class merlion.utils.istat.IStat(value=None, n=0)
Bases:
object
An abstract base class for computing various statistics incrementally, with emphasis on recency-weighted variants.
- Parameters
value (Optional[float]) – Initial value of the statistic. Defaults to None.
n (int) – Initial sample size. Defaults to 0.
- property n
- property value
- abstract add(x)
Add a new value to update the statistic.
- Parameters
x – new value to add to the sample.
- abstract drop(x)
Drop a value to update the statistic.
- Parameters
x – value to drop from the sample.
- add_batch(batch)
Add a batch of new values to update the statistic.
- Parameters
batch (List[float]) – new values to add to the sample.
- drop_batch(batch)
Drop a batch of values to update the statistic.
- Parameters
batch (List[float]) – values to drop from the sample.
- class merlion.utils.istat.Mean(value=None, n=0)
Bases:
IStat
Class for incrementally computing the mean of a series of numbers.
- Parameters
value (Optional[float]) – Initial value of the statistic. Defaults to None.
n (int) – Initial sample size. Defaults to 0.
- property value
- add(x)
Add a new value to update the statistic.
- Parameters
x – new value to add to the sample.
- drop(x)
Drop a value to update the statistic.
- Parameters
x – value to drop from the sample.
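The add/drop contract can be illustrated with a minimal, standalone sketch of a running mean. The class name RunningMean is hypothetical and this is not Merlion's implementation; it only shows one way a mean can be updated without storing the whole sample.

```python
class RunningMean:
    """Hypothetical stand-in for an incrementally updated mean (not Merlion code)."""

    def __init__(self, value=None, n=0):
        self.value = value
        self.n = n

    def add(self, x):
        # mean_new = mean_old + (x - mean_old) / n_new
        self.n += 1
        if self.value is None:
            self.value = float(x)
        else:
            self.value += (x - self.value) / self.n

    def drop(self, x):
        # Reverse of add: remove x's contribution from the running mean.
        if self.n <= 1:
            self.value, self.n = None, 0
        else:
            self.value = (self.value * self.n - x) / (self.n - 1)
            self.n -= 1


m = RunningMean()
for x in [1.0, 2.0, 3.0, 6.0]:
    m.add(x)
mean_after_add = m.value   # mean of [1, 2, 3, 6]
m.drop(6.0)
mean_after_drop = m.value  # mean of [1, 2, 3]
```

Dropping a value is what makes sliding-window statistics possible: add the newest observation and drop the oldest.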
- class merlion.utils.istat.Variance(ex_value=None, ex2_value=None, n=0, ddof=1)
Bases:
IStat
Class for incrementally computing the variance of a series of numbers.
- Parameters
ex_value (Optional[float]) – Initial value of the first moment (mean).
ex2_value (Optional[float]) – Initial value of the second moment.
n (int) – Initial sample size.
ddof (int) – The delta degrees of freedom to use when correcting the estimate of the variance.
\[\text{Var}(x_i) = \text{E}(x_i^2) - \text{E}(x_i)^2\]
- add(x)
Add a new value to update the statistic.
- Parameters
x – new value to add to the sample.
- drop(x)
Drop a value to update the statistic.
- Parameters
x – value to drop from the sample.
- property true_value
- property corrected_value
- property value
- property sd
- property se
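As a rough illustration of the identity Var(x) = E(x^2) - E(x)^2, the sketch below tracks the two running moments and derives a variance from them. The class name RunningVariance is hypothetical. The property names true_value and corrected_value echo the ones listed above, but their exact semantics in Merlion are not documented here, so the interpretation (uncorrected versus rescaled by n / (n - ddof)) is an assumption.

```python
class RunningVariance:
    """Hypothetical sketch: variance from running first and second moments."""

    def __init__(self, ddof=1):
        self.n = 0
        self.ex = 0.0   # running estimate of E[x]
        self.ex2 = 0.0  # running estimate of E[x^2]
        self.ddof = ddof

    def add(self, x):
        self.n += 1
        self.ex += (x - self.ex) / self.n
        self.ex2 += (x * x - self.ex2) / self.n

    def drop(self, x):
        if self.n <= 1:
            self.n, self.ex, self.ex2 = 0, 0.0, 0.0
            return
        self.ex = (self.ex * self.n - x) / (self.n - 1)
        self.ex2 = (self.ex2 * self.n - x * x) / (self.n - 1)
        self.n -= 1

    @property
    def true_value(self):
        # Assumed meaning: uncorrected variance E[x^2] - E[x]^2, clipped at 0
        # to guard against small negative values from floating-point error.
        return max(0.0, self.ex2 - self.ex ** 2)

    @property
    def corrected_value(self):
        # Assumed meaning: rescale by n / (n - ddof); undefined if n <= ddof.
        if self.n - self.ddof <= 0:
            return None
        return self.true_value * self.n / (self.n - self.ddof)


var = RunningVariance(ddof=1)
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    var.add(x)
population_var = var.true_value
sample_var = var.corrected_value
var.drop(9.0)
var_after_drop = var.true_value
```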
- class merlion.utils.istat.ExponentialMovingAverage(recency_weight=0.1, **kwargs)
Bases:
Mean
Class for incrementally computing the exponential moving average of a series of numbers.
- Parameters
recency_weight (float) – Recency weight to use when updating the exponential moving average.
Letting w be the recency weight,
\[\begin{aligned} \text{EMA}_w(x_0) &= x_0 \\ \text{EMA}_w(x_t) &= w \cdot x_t + (1-w) \cdot \text{EMA}_w(x_{t-1}) \end{aligned}\]
- property recency_weight
- property value
- drop(x)
Exponential Moving Average does not support dropping values
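The recurrence for the exponential moving average can be written out directly. The function below is a hypothetical, standalone sketch that follows the two equations literally; Merlion's own class may differ in details that this page does not describe.

```python
def ema(values, recency_weight=0.1):
    """Return the running EMA after each element (hypothetical helper)."""
    w = recency_weight
    out = []
    for x in values:
        if not out:
            out.append(float(x))                   # EMA_w(x_0) = x_0
        else:
            out.append(w * x + (1 - w) * out[-1])  # EMA_w(x_t)
    return out


ema_series = ema([10.0, 20.0, 20.0], recency_weight=0.5)
```

Because each update folds the previous state into the new one, there is no way to undo the contribution of an arbitrary past value, which is consistent with drop(x) being unsupported.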
- class merlion.utils.istat.RecencyWeightedVariance(recency_weight, **kwargs)
Bases:
Variance
Class for incrementally computing the recency-weighted variance of a series of numbers.
- Parameters
recency_weight (float) – Recency weight to use when updating the recency weighted variance.
Letting w be the recency weight,
\[\text{RWV}_w(x_t) = \text{EMA}_w({x^2_t}) - \text{EMA}_w(x_t)^2\]
- mean_class
alias of ExponentialMovingAverage
- property recency_weight
- drop(x)
Recency Weighted Variance does not support dropping values
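A literal reading of the RWV formula keeps two exponential moving averages, one of x and one of x^2, and subtracts. The helper below is a hypothetical sketch under that reading only; it ignores any ddof correction the library may apply.

```python
def recency_weighted_variance(values, recency_weight):
    """EMA of x^2 minus the squared EMA of x (hypothetical helper)."""
    if not values:
        raise ValueError("need at least one value")
    w = recency_weight
    ema_x = ema_x2 = None
    for x in values:
        if ema_x is None:
            # Both averages start at the first observation.
            ema_x, ema_x2 = float(x), float(x) ** 2
        else:
            ema_x = w * x + (1 - w) * ema_x
            ema_x2 = w * x * x + (1 - w) * ema_x2
    return ema_x2 - ema_x ** 2


rwv = recency_weighted_variance([0.0, 2.0], recency_weight=0.5)
rwv_constant = recency_weighted_variance([3.0, 3.0, 3.0], recency_weight=0.5)
```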
merlion.utils.misc module
- class merlion.utils.misc.AutodocABCMeta(classname, bases, cls_dict)
Bases:
ABCMeta
Metaclass used to ensure that inherited members of an abstract base class also inherit docstrings for inherited methods.
- class merlion.utils.misc.ValIterOrderedDict
Bases:
OrderedDict
OrderedDict whose iterator goes over self.values() instead of self.keys().
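The described behavior can be sketched in a few lines. The class name ValueIterDict is hypothetical, and this is an illustration of the idea rather than a copy of Merlion's source.

```python
from collections import OrderedDict


class ValueIterDict(OrderedDict):
    """Hypothetical sketch: iterating yields values instead of keys."""

    def __iter__(self):
        return iter(self.values())


d = ValueIterDict()
d["a"] = 1
d["b"] = 2
iterated = list(d)     # goes through __iter__, so yields the values
keys = list(d.keys())  # keys remain available explicitly
```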
- merlion.utils.misc.dynamic_import(import_path, alias=None)
Dynamically import a member from the specified module.
- Parameters
import_path (str) – syntax ‘module_name:member_name’, e.g. ‘merlion.transform.normalize:PowerTransform’
alias (Optional[dict]) – dict which maps shortcuts for the registered classes to their full import paths.
- Returns
imported class
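To make the 'module_name:member_name' syntax concrete, here is a small sketch built on importlib. The function name import_member is hypothetical, and the way alias is resolved (a plain dictionary lookup before splitting) is an assumption based on the parameter description above.

```python
import importlib


def import_member(import_path, alias=None):
    """Resolve 'module_name:member_name' to the named object (hypothetical helper)."""
    alias = alias or {}
    # Assumed behavior: a shortcut is first expanded to its full import path.
    import_path = alias.get(import_path, import_path)
    if ":" not in import_path:
        raise ValueError(f"expected 'module_name:member_name', got {import_path!r}")
    module_name, member_name = import_path.split(":", 1)
    module = importlib.import_module(module_name)
    return getattr(module, member_name)


ordered_dict_cls = import_member("collections:OrderedDict")
sqrt = import_member("sqrt", alias={"sqrt": "math:sqrt"})
```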
- merlion.utils.misc.initializer(func)
Decorator for the __init__ method. Automatically assigns the parameters.
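One plausible way to build such a decorator uses inspect.signature to bind the call arguments and copy them onto the instance. The name assign_init_args is hypothetical, and this sketch is not necessarily how Merlion implements initializer.

```python
import functools
import inspect


def assign_init_args(init):
    """Hypothetical decorator: store every __init__ argument as an attribute."""
    sig = inspect.signature(init)

    @functools.wraps(init)
    def wrapper(self, *args, **kwargs):
        bound = sig.bind(self, *args, **kwargs)
        bound.apply_defaults()  # include parameters left at their defaults
        # Skip the first bound argument, which is the instance itself.
        for name, value in list(bound.arguments.items())[1:]:
            setattr(self, name, value)
        return init(self, *args, **kwargs)

    return wrapper


class Point:
    @assign_init_args
    def __init__(self, x, y=0, *, label="p"):
        pass


p = Point(1, label="a")
```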
- class merlion.utils.misc.ProgressBar(total, length=40, decimals=1, fill='█')
Bases:
object
- Parameters
total (int) – total iterations
length (int) – character length of bar
decimals (int) – positive number of decimals in percent complete
fill (str) – bar fill character
- print(iteration, prefix, suffix, end='')
- Parameters
iteration – current iteration
prefix – prefix string
suffix – suffix string
end – end character (e.g. "\r", "\r\n")
merlion.utils.resample module
- class merlion.utils.resample.AlignPolicy(value)
Bases:
Enum
Policies for aligning multiple univariate time series.
- OuterJoin = 0
- InnerJoin = 1
- FixedReference = 2
- FixedGranularity = 3
- class merlion.utils.resample.AggregationPolicy(value)
Bases:
Enum
Aggregation policies. Values are partial functions for pandas.core.resample.Resampler methods.
- Mean = functools.partial(<function AggregationPolicy.<lambda>>)
- Sum = functools.partial(<function AggregationPolicy.<lambda>>)
- Median = functools.partial(<function AggregationPolicy.<lambda>>)
- First = functools.partial(<function AggregationPolicy.<lambda>>)
- Last = functools.partial(<function AggregationPolicy.<lambda>>)
- Min = functools.partial(<function AggregationPolicy.<lambda>>)
- Max = functools.partial(<function AggregationPolicy.<lambda>>)
- class merlion.utils.resample.MissingValuePolicy(value)
Bases:
Enum
Missing value imputation policies. Values are partial functions for pd.Series methods.
- FFill = functools.partial(<function MissingValuePolicy.<lambda>>)
Fill gap with the first value before the gap.
- BFill = functools.partial(<function MissingValuePolicy.<lambda>>)
Fill gap with the first value after the gap.
- Nearest = functools.partial(<function MissingValuePolicy.<lambda>>, method='nearest')
Replace missing value with the value closest to it.
- Interpolate = functools.partial(<function MissingValuePolicy.<lambda>>, method='time')
Fill in missing values by linear interpolation.
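The policies above can be illustrated without pandas. The helpers below are hypothetical, pure-Python sketches of forward fill, backward fill, and time-weighted linear interpolation, with None standing in for a missing value; the library itself delegates to pd.Series methods.

```python
def ffill(values):
    """Fill each gap with the last known value before it."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def bfill(values):
    """Fill each gap with the first known value after it."""
    return ffill(values[::-1])[::-1]


def interpolate_time(times, values):
    """Linear interpolation weighted by the time between known neighbours."""
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    out = []
    for t, v in zip(times, values):
        if v is not None:
            out.append(v)
            continue
        before = [(tk, vk) for tk, vk in known if tk < t]
        after = [(tk, vk) for tk, vk in known if tk > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        else:
            out.append(None)  # leading/trailing gaps stay unfilled in this sketch
    return out


times = [0, 10, 40]
values = [1.0, None, 5.0]
forward = ffill(values)
backward = bfill(values)
interpolated = interpolate_time(times, values)
```

Note that the interpolated value is 2.0 rather than the midpoint 3.0, because the gap sits a quarter of the way (in time) between its neighbours.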
- merlion.utils.resample.to_pd_datetime(timestamp)
Converts a timestamp (or list/iterable of timestamps) to pandas Datetime, truncated at the millisecond.
- merlion.utils.resample.granularity_str_to_seconds(granularity)
Converts a string/float/int granularity (representing a timedelta) to the number of seconds it represents, truncated at the millisecond.
- Return type
Optional[float]
- merlion.utils.resample.get_gcd_timedelta(*time_stamp_lists)
Calculates all timedeltas present in any of the lists of time stamps given, and returns the GCD of all these timedeltas (up to units of milliseconds).
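A sketch of the GCD computation, working in integer milliseconds so that math.gcd applies. The function name is hypothetical; timestamps are assumed to be Unix seconds, and rounding to the nearest millisecond is an assumption of this sketch.

```python
from functools import reduce
from math import gcd


def gcd_timedelta_seconds(*time_stamp_lists):
    """GCD (in seconds) of all gaps between adjacent timestamps (hypothetical helper)."""
    deltas_ms = set()
    for stamps in time_stamp_lists:
        ms = [int(round(t * 1000)) for t in stamps]
        # Collect every non-zero gap between adjacent timestamps.
        deltas_ms.update(b - a for a, b in zip(ms, ms[1:]) if b != a)
    if not deltas_ms:
        return None
    return reduce(gcd, deltas_ms) / 1000


granularity = gcd_timedelta_seconds([0, 60, 180], [0, 90])
```

Here the gaps are 60 s, 120 s and 90 s, so the coarsest grid that contains every timestamp has a 30 s step.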
- merlion.utils.resample.reindex_df(df, reference, missing_value_policy)
Reindexes a Datetime-indexed dataframe df to have the same time stamps as a reference sequence of timestamps. Imputes missing values with the given MissingValuePolicy.
merlion.utils.time_series module
- class merlion.utils.time_series.UnivariateTimeSeries(time_stamps, values, name=None, freq='1h')
Bases:
Series
Please read the tutorial before reading this API doc. This class is a time-indexed pd.Series which represents a univariate time series. For the most part, it supports all the same features as pd.Series, with the following key differences to iteration and indexing:
Iterating over a UnivariateTimeSeries is implemented as
    for timestamp, value in univariate:
        # do stuff...
where timestamp is a Unix timestamp, and value is the corresponding time series value.
Integer index: u[i] yields the tuple (u.time_stamps[i], u.values[i])
Slice index: u[i:j:k] yields a new UnivariateTimeSeries(u.time_stamps[i:j:k], u.values[i:j:k])
The class also supports the following additional features:
univariate.time_stamps returns the list of Unix timestamps, and univariate.values returns the list of the time series values. You may access the pd.DatetimeIndex directly with univariate.index (or its np.ndarray representation with univariate.np_time_stamps), and the np.ndarray of values with univariate.np_values.
univariate.concat(other) will concatenate the UnivariateTimeSeries other to the right end of univariate.
left, right = univariate.bisect(t) will split the univariate at the given timestamp t.
window = univariate.window(t0, tf) will return the subset of the time series occurring between timestamps t0 (inclusive) and tf (non-inclusive).
series = univariate.to_pd() will convert the UnivariateTimeSeries into a regular pd.Series (for compatibility).
univariate = UnivariateTimeSeries.from_pd(series) uses a time-indexed pd.Series to create a UnivariateTimeSeries object directly.
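The indexing, bisect and window semantics described above can be mimicked with plain lists, which may help when reasoning about boundary cases. The helpers below are hypothetical and operate on sorted lists of Unix timestamps; the keyword defaults mirror the bisect and window signatures documented for this class.

```python
from bisect import bisect_left, bisect_right


def bisect_series(time_stamps, values, t, t_in_left=False):
    """Split (timestamp, value) pairs at t; t goes right unless t_in_left."""
    cut = bisect_right(time_stamps, t) if t_in_left else bisect_left(time_stamps, t)
    left = list(zip(time_stamps[:cut], values[:cut]))
    right = list(zip(time_stamps[cut:], values[cut:]))
    return left, right


def window(time_stamps, values, t0, tf, include_tf=False):
    """Pairs with t0 <= timestamp < tf (or <= tf when include_tf is True)."""
    lo = bisect_left(time_stamps, t0)
    hi = bisect_right(time_stamps, tf) if include_tf else bisect_left(time_stamps, tf)
    return list(zip(time_stamps[lo:hi], values[lo:hi]))


time_stamps = [0.0, 3600.0, 7200.0, 10800.0]
values = [1.0, 2.0, 3.0, 4.0]

item = (time_stamps[1], values[1])                            # like u[1]
sliced = list(zip(time_stamps[0:4:2], values[0:4:2]))         # like u[0:4:2]
left, right = bisect_series(time_stamps, values, 3600.0)      # t lands in right
win = window(time_stamps, values, 3600.0, 10800.0)            # tf excluded
win_closed = window(time_stamps, values, 3600.0, 10800.0, include_tf=True)
```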
- __getitem__(i)
- Parameters
i (Union[int, slice]) – integer index or slice
- Return type
Union[Tuple[float, float], UnivariateTimeSeries]
- Returns
(self.time_stamps[i], self.values[i]) if i is an integer. UnivariateTimeSeries(self.time_stamps[i], self.values[i]) if i is a slice.
- __iter__()
The i’th item in the iterator is the tuple (self.time_stamps[i], self.values[i]).
- Parameters
time_stamps (Optional[Sequence[Union[int, float]]]) – a sequence of Unix timestamps. You may specify None if you only have values with no specific time stamps.
values (Sequence[float]) – a sequence of univariate values, where values[i] occurs at time time_stamps[i]
name (Optional[str]) – the name of the univariate time series
freq – if time_stamps is not provided, the univariate is assumed to be sampled at frequency freq. freq may be a string (e.g. "1h"), timedelta, or int/float (in units of seconds).
- property np_time_stamps
- Return type
np.ndarray
- Returns
the numpy representation of this time series’s Unix timestamps
- property np_values
- Return type
np.ndarray
- Returns
the numpy representation of this time series’s values
- property time_stamps
- Return type
List[float]
- Returns
the list of Unix timestamps for the time series
- property values
- Return type
List[float]
- Returns
the list of values for the time series.
- property t0
- Return type
float
- Returns
the first timestamp in the univariate time series.
- property tf
- Return type
float
- Returns
the final timestamp in the univariate time series.
- is_empty()
- Return type
bool
- Returns
True if the univariate is empty, False if not.
- copy(deep=True)
Copies the UnivariateTimeSeries. Simply a wrapper around the pd.Series.copy() method.
- concat(other)
Concatenates the UnivariateTimeSeries other to the right of this one.
- Parameters
other (UnivariateTimeSeries) – another UnivariateTimeSeries
- Return type
UnivariateTimeSeries
- Returns
concatenated univariate time series
- bisect(t, t_in_left=False)
Splits the time series at the point where the given timestamp occurs.
- Parameters
t (float) – a Unix timestamp or datetime object. Everything before time t is in the left split, and everything after time t is in the right split.
t_in_left (bool) – if True, t is in the left split. Otherwise, t is in the right split.
- Return type
Tuple[UnivariateTimeSeries, UnivariateTimeSeries]
- Returns
the left and right splits of the time series.
- window(t0, tf, include_tf=False)
- Parameters
t0 (float) – The timestamp/datetime at the start of the window (inclusive)
tf (float) – The timestamp/datetime at the end of the window (inclusive if include_tf is True, non-inclusive otherwise)
include_tf (bool) – Whether to include tf in the window.
- Return type
UnivariateTimeSeries
- Returns
The subset of the time series occurring between timestamps t0 (inclusive) and tf (included if include_tf is True, excluded otherwise).
- to_dict()
- Return type
Dict[float, float]
- Returns
A dictionary representing the data points in the time series.
- classmethod from_dict(obj, name=None)
- Parameters
obj (Dict[float, float]) – A dictionary of timestamp - value pairs
name – the name to assign the output
- Return type
UnivariateTimeSeries
- Returns
the UnivariateTimeSeries represented by the dictionary obj.
- to_pd()
- Return type
Series
- Returns
A pandas Series representing the time series, indexed by time.
- classmethod from_pd(series, name=None, freq='1h')
- Parameters
series (Series) – a pd.Series. If it has a pd.DatetimeIndex, we will use that index for the timestamps. Otherwise, we will create one at the specified frequency.
name – the name to assign the output
freq – if series is not indexed by time, this is the frequency at which we will assume it is sampled.
- Return type
UnivariateTimeSeries
- Returns
the UnivariateTimeSeries represented by series.
- to_ts()
- Return type
TimeSeries
- Returns
A TimeSeries representing this univariate time series.
- classmethod empty(name=None)
- Return type
UnivariateTimeSeries
- Returns
A Merlion UnivariateTimeSeries that has empty timestamps and values.
- class merlion.utils.time_series.TimeSeries(univariates, *, check_aligned=True)
Bases:
object
Please read the tutorial before reading this API doc. This class represents a general multivariate time series as a wrapper around a number of (optionally named) UnivariateTimeSeries. A TimeSeries object is initialized as time_series = TimeSeries(univariates), where univariates is either a list of UnivariateTimeSeries, or a dictionary mapping string names to their corresponding UnivariateTimeSeries objects.
Because the individual univariates need not be sampled at the same times, an important concept for TimeSeries is alignment. We say that a TimeSeries is aligned if all of its univariates have observations sampled at the exact same set of times.
One may access the UnivariateTimeSeries comprising this TimeSeries in four ways:
Iterate over the individual univariates using
    for var in time_series.univariates:
        # do stuff with each UnivariateTimeSeries var
Access an individual UnivariateTimeSeries by name as time_series.univariates[name]. If you supplied unnamed univariates to the constructor (i.e. using a list), the name of a univariate will just be its index in that list.
Get the list of each univariate’s name with time_series.names.
Iterate over named univariates as
    for name, var in time_series.items():
        # do stuff
Note that this is equivalent to iterating over zip(time_series.names, time_series.univariates).
This class supports the following additional features as well:
Interoperability with pandas
  df = time_series.to_pd() yields a time-indexed pd.DataFrame, where each column (with the appropriate name) corresponds to a variable. Missing values are NaN.
  time_series = TimeSeries.from_pd(df) takes a time-indexed pd.DataFrame and returns a corresponding TimeSeries object (missing values are handled appropriately). The order of time_series.univariates is the order of df.keys().
Automated alignment: aligned = time_series.align() resamples each of time_series.univariates so that they all have the same timestamps. By default, this is done by taking the union of all timestamps present in any individual univariate time series, and imputing missing values via interpolation. See the method documentation for details on how you may configure the alignment policy.
Transparent indexing and iteration for TimeSeries which have all univariates aligned (i.e. they all have the same timestamps)
  Get the length and shape of the time series (equal to the number of observations in each individual univariate). Note that if the time series is not aligned, we will return the length/shape of an equivalent pandas dataframe and emit a warning.
  Index time_series[i] = (times[i], (x1[i], ..., xn[i])) (assuming time_series has n aligned univariates with timestamps times, and xk = time_series.univariates[k-1].values). Slice returns a TimeSeries object and works as one would expect.
  Assuming time_series has n variables, you may iterate with
      for t_i, (x1_i, ..., xn_i) in time_series:
          # do stuff
  Notably, this lets you call times, val_vectors = zip(*time_series)
Time-based queries for any time series
  Get the two sub TimeSeries before and after a timestamp t via left, right = time_series.bisect(t)
  Get the sub TimeSeries between timestamps t0 (inclusive) and tf (non-inclusive) via window = time_series.window(t0, tf)
Concatenation: two TimeSeries may be concatenated (in time) as time_series = time_series_1 + time_series_2.
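The shape of the items produced by indexing and iterating an aligned time series can be sketched with plain lists. The generator name items is hypothetical, and the data below stands in for two aligned univariates; no Merlion objects are involved.

```python
def items(times, *variables):
    """Yield (timestamp, (x1, ..., xn)) for aligned variables (hypothetical helper)."""
    for i, t in enumerate(times):
        yield t, tuple(var[i] for var in variables)


times = [0, 60, 120]
x1 = [1.0, 2.0, 3.0]
x2 = [10.0, 20.0, 30.0]

rows = list(items(times, x1, x2))
second = rows[1]                   # analogous to time_series[1]
ts, val_vectors = zip(*rows)       # analogous to zip(*time_series)
```

The zip(*...) idiom transposes the sequence of (timestamp, vector) pairs into one tuple of timestamps and one tuple of value vectors.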
- __getitem__(i)
Only supported if all individual variable time series are sampled at the same time stamps.
- Parameters
i (Union[int, slice]) – integer index or slice.
- Return type
Union[Tuple[float, Tuple[float]], TimeSeries]
- Returns
If i is an integer, returns the tuple (time_stamps[i], tuple(var.values[i] for var in self.univariates)). If i is a slice, returns the time series TimeSeries([var[i] for var in self.univariates])
- __iter__()
Only supported if all individual variable time series are sampled at the same time stamps. The i’th item of the iterator is the tuple (time_stamps[i], tuple(var.values[i] for var in self.univariates)).
- property names
- Returns
The list of the names of the univariates.
- items()
- Returns
Iterator over (name, univariate) tuples.
- property dim: int
- Return type
int
- Returns
The dimension of the time series (the number of variables).
- property is_aligned: bool
- Return type
bool
- Returns
Whether all individual variable time series are sampled at the same time stamps, i.e. they are aligned.
- property np_time_stamps
- Return type
np.ndarray
- Returns
the numpy representation of this time series’s Unix timestamps
- property time_stamps
- Return type
List[float]
- Returns
the list of Unix timestamps for the time series
- property t0: float
- Return type
float
- Returns
the first timestamp in the time series.
- property tf: float
- Return type
float
- Returns
the final timestamp in the time series.
- is_empty()
- Return type
bool
- Returns
whether the time series is empty
- squeeze()
- Return type
Union[UnivariateTimeSeries, TimeSeries]
- Returns
a UnivariateTimeSeries if the time series only has one univariate, otherwise returns itself, a TimeSeries
- property shape: Tuple[int, int]
- Return type
Tuple[int, int]
- Returns
the shape of this time series, i.e. (self.dim, len(self))
- bisect(t, t_in_left=False)
Splits the time series at the point where the given timestamp t occurs.
- Parameters
t (float) – a Unix timestamp or datetime object. Everything before time t is in the left split, and everything after time t is in the right split.
t_in_left (bool) – if True, t is in the left split. Otherwise, t is in the right split.
- Return type
Tuple[TimeSeries, TimeSeries]
- Returns
the left and right splits of the time series.
- window(t0, tf, include_tf=False)
- Parameters
t0 (float) – The timestamp/datetime at the start of the window (inclusive)
tf (float) – The timestamp/datetime at the end of the window (inclusive if include_tf is True, non-inclusive otherwise)
include_tf (bool) – Whether to include tf in the window.
- Returns
The subset of the time series occurring between timestamps t0 (inclusive) and tf (included if include_tf is True, excluded otherwise).
- Return type
TimeSeries
- to_pd()
- Return type
DataFrame
- Returns
A pandas DataFrame (indexed by time) which represents this time series. Each variable corresponds to a column of the DataFrame. Timestamps which are present for one variable but not another, are represented with NaN.
- classmethod from_pd(df, check_times=True, freq='1h')
- Parameters
df (Union[Series, DataFrame]) – A pandas DataFrame with a DatetimeIndex. Each column corresponds to a different variable of the time series, and the keys of the columns (in sorted order) give the relative order of those variables (in the list self.univariates). Missing values should be represented with NaN. May also be a pandas Series for univariate time series.
check_times – whether to check that all times in the index are unique (up to the millisecond) and sorted.
- Return type
TimeSeries
- Returns
the TimeSeries object corresponding to df.
- classmethod from_ts_list(ts_list, *, check_aligned=True)
- Parameters
ts_list (Iterable[TimeSeries]) – iterable of time series we wish to form a multivariate time series with
check_aligned (bool) – whether to check if the output time series is aligned
- Return type
TimeSeries
- Returns
A multivariate TimeSeries created from all the time series in the inputs.
- align(*, reference=None, granularity=None, origin=None, remove_non_overlapping=True, alignment_policy=None, aggregation_policy=AggregationPolicy.Mean, missing_value_policy=MissingValuePolicy.Interpolate)
Aligns all the univariate time series comprising this multivariate time series so that they all have the same time stamps.
- Parameters
reference (Optional[Sequence[Union[int, float]]]) – A specific set of timestamps we want the resampled time series to contain. Required if alignment_policy is AlignPolicy.FixedReference. Overrides other alignment policies if specified.
granularity (Union[str, int, float, None]) – The granularity (in seconds) of the resampled time series. Defaults to the GCD time difference between adjacent elements of reference (when available) or time_series (otherwise). Ignored if reference is given or alignment_policy is AlignPolicy.FixedReference. Overrides other alignment policies if specified.
origin (Optional[int]) – The first timestamp of the resampled time series. Only used if the alignment policy is AlignPolicy.FixedGranularity.
remove_non_overlapping – If True, we will only keep the portions of the univariates that overlap with each other. For example, if we have 3 univariates which span timestamps [0, 3600], [60, 3660], and [30, 3540], we will only keep timestamps in the range [60, 3540]. If False, we will keep all timestamps produced by the resampling.
alignment_policy (Optional[AlignPolicy]) – The policy we want to use to align the time series.
  AlignPolicy.FixedReference aligns each single-variable time series to reference, a user-specified sequence of timestamps.
  AlignPolicy.FixedGranularity resamples each single-variable time series at the same granularity, aggregating windows and imputing missing values as desired.
  AlignPolicy.OuterJoin returns a time series with the union of all timestamps present in any single-variable time series.
  AlignPolicy.InnerJoin returns a time series with the intersection of all timestamps present in all single-variable time series.
aggregation_policy (AggregationPolicy) – The policy used to aggregate windows of adjacent observations when downsampling.
missing_value_policy (MissingValuePolicy) – The policy used to impute missing values created when upsampling.
- Return type
TimeSeries
- Returns
The resampled multivariate time series.
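The timestamp sets produced by the join policies, and the effect of remove_non_overlapping, can be worked through on the example spans given above. The helper names are hypothetical and only timestamps are computed here; the actual method also resamples and imputes values.

```python
def overlap_range(*time_stamp_lists):
    """Latest start and earliest end across all sorted timestamp lists."""
    start = max(ts[0] for ts in time_stamp_lists)
    end = min(ts[-1] for ts in time_stamp_lists)
    return start, end


def outer_join(*time_stamp_lists):
    """Union of all timestamps, sorted."""
    return sorted(set().union(*time_stamp_lists))


def inner_join(*time_stamp_lists):
    """Timestamps present in every list, sorted."""
    return sorted(set.intersection(*(set(ts) for ts in time_stamp_lists)))


# Three univariates spanning [0, 3600], [60, 3660] and [30, 3540].
a = [0, 1800, 3600]
b = [60, 1800, 3660]
c = [30, 1800, 3540]

lo, hi = overlap_range(a, b, c)
union = outer_join(a, b, c)
intersection = inner_join(a, b, c)
# Keeping only the overlapping portion of the outer join:
union_overlapping = [t for t in union if lo <= t <= hi]
```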
- merlion.utils.time_series.ts_csv_load(file_name, ms=True, n_vars=None)
- Parameters
file_name (str) – a csv file starting with the field timestamp followed by all the variable names.
ms – whether the timestamps are in milliseconds (rather than seconds)
- Return type
TimeSeries
- Returns
A Merlion TimeSeries object.
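The expected file layout (a timestamp column followed by one column per variable) can be sketched with the standard csv module. The function parse_ts_csv and the sample column names cpu and mem are hypothetical; the sketch returns plain lists rather than a TimeSeries and does not model the n_vars argument.

```python
import csv
import io


def parse_ts_csv(text, ms=True):
    """Parse 'timestamp,var1,var2,...' rows into seconds plus per-variable lists."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    names = header[1:]  # everything after the timestamp field is a variable
    times, columns = [], {name: [] for name in names}
    for row in reader:
        t = float(row[0])
        times.append(t / 1000 if ms else t)  # convert milliseconds to seconds
        for name, cell in zip(names, row[1:]):
            columns[name].append(float(cell))
    return times, columns


sample = "timestamp,cpu,mem\n0,0.5,10\n60000,0.7,12\n"
times, columns = parse_ts_csv(sample)
```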
- merlion.utils.time_series.ts_to_csv(time_series, file_name)
- Parameters
time_series (TimeSeries) – the TimeSeries object to write to a csv.
file_name (str) – the name to assign the csv file.
- merlion.utils.time_series.assert_equal_timedeltas(time_series, timedelta=None)
Checks that all time deltas in the time series are equal, either to each other, or a pre-specified timedelta (in seconds).
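The check itself is simple to express on a list of Unix timestamps. The helper below is a hypothetical sketch that returns a boolean instead of asserting, and the tolerance argument tol is an assumption of this sketch.

```python
def has_equal_timedeltas(time_stamps, timedelta=None, tol=1e-9):
    """True if all adjacent gaps match each other, or match timedelta if given."""
    deltas = [b - a for a, b in zip(time_stamps, time_stamps[1:])]
    if not deltas:
        return True  # zero or one timestamp: nothing to compare
    expected = deltas[0] if timedelta is None else timedelta
    return all(abs(d - expected) <= tol for d in deltas)


regular = has_equal_timedeltas([0, 60, 120])
irregular = has_equal_timedeltas([0, 60, 150])
wrong_step = has_equal_timedeltas([0, 60, 120], timedelta=30)
```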