Merlion’s Data Format
This notebook explains how to use Merlion’s UnivariateTimeSeries and TimeSeries classes. These classes are the core data format used throughout the repo. In general, you may think of each TimeSeries as a collection of UnivariateTimeSeries objects, one for each variable.
Let’s start by loading some data using pandas.
[1]:
import os
import pandas as pd
df = pd.read_csv(os.path.join("..", "data", "example.csv"))
print(df)
timestamp_millis kpi kpi_label
0 1583140320000 667.118 0
1 1583140380000 611.751 0
2 1583140440000 599.456 0
3 1583140500000 621.446 0
4 1583140560000 1418.234 0
... ... ... ...
86802 1588376760000 874.214 0
86803 1588376820000 937.929 0
86804 1588376880000 1031.279 0
86805 1588376940000 1099.698 0
86806 1588377000000 935.405 0
[86807 rows x 3 columns]
The column timestamp_millis consists of Unix timestamps (in units of milliseconds), and the column kpi contains the value of the time series metric at each of those timestamps. We will also create a version of this dataframe that is indexed by time:
[2]:
time_idx_df = df.copy()
time_idx_df["timestamp_millis"] = pd.to_datetime(time_idx_df["timestamp_millis"], unit="ms")
time_idx_df = time_idx_df.set_index("timestamp_millis")
print(time_idx_df)
kpi kpi_label
timestamp_millis
2020-03-02 09:12:00 667.118 0
2020-03-02 09:13:00 611.751 0
2020-03-02 09:14:00 599.456 0
2020-03-02 09:15:00 621.446 0
2020-03-02 09:16:00 1418.234 0
... ... ...
2020-05-01 23:46:00 874.214 0
2020-05-01 23:47:00 937.929 0
2020-05-01 23:48:00 1031.279 0
2020-05-01 23:49:00 1099.698 0
2020-05-01 23:50:00 935.405 0
[86807 rows x 2 columns]
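For readers who do not have example.csv at hand, the same conversion can be tried on a small in-memory frame. This is only a sketch: the three rows are copied from the printout above, and the names toy_df and toy_idx_df are introduced here purely for illustration.

```python
import pandas as pd

# Three rows copied from the printout above; toy_df is an illustrative name.
toy_df = pd.DataFrame({
    "timestamp_millis": [1583140320000, 1583140380000, 1583140440000],
    "kpi": [667.118, 611.751, 599.456],
    "kpi_label": [0, 0, 0],
})

# Interpret the integers as milliseconds since the Unix epoch,
# then use the resulting datetimes as the index.
toy_df["timestamp_millis"] = pd.to_datetime(toy_df["timestamp_millis"], unit="ms")
toy_idx_df = toy_df.set_index("timestamp_millis")
print(toy_idx_df)
```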
UnivariateTimeSeries: The Basic Building Block
The most transparent way to initialize a UnivariateTimeSeries is to use its constructor. The constructor takes two required arguments: time_stamps, a list of Unix timestamps (in units of seconds) or datetime objects, and values, a list of the actual time series values. You may optionally provide a name as well.
[3]:
from merlion.utils import UnivariateTimeSeries
kpi = UnivariateTimeSeries(
    time_stamps=df.timestamp_millis/1000,  # timestamps in units of seconds
    values=df.kpi,                         # time series values
    name="kpi"                             # optional: a name for this univariate
)
kpi_label = UnivariateTimeSeries(
    time_stamps=df.timestamp_millis/1000,  # timestamps in units of seconds
    values=df.kpi_label                    # time series values
)
Alternatively, you may initialize a UnivariateTimeSeries directly from a time-indexed pd.Series:
[4]:
kpi_equivalent = UnivariateTimeSeries.from_pd(time_idx_df.kpi)
print(f"Are the two UnivariateTimeSeries equal? {(kpi == kpi_equivalent).all()}")
Are the two UnivariateTimeSeries equal? True
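The two constructions agree because dividing the millisecond timestamps by 1000 and reading them as seconds produces the same datetimes as reading the raw integers as milliseconds. The sketch below checks this with plain pandas on three sample timestamps taken from the data above. It does not use Merlion, and the variable names are illustrative.

```python
import pandas as pd

millis = pd.Series([1583140320000, 1583140380000, 1583140440000])

# Route 1: divide by 1000 and interpret the result as seconds.
from_seconds = pd.to_datetime(millis / 1000, unit="s")
# Route 2: interpret the raw integers as milliseconds.
from_millis = pd.to_datetime(millis, unit="ms")

# Both routes should land on the same datetimes.
same = bool((from_seconds == from_millis).all())
print(same)
```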
We implement the UnivariateTimeSeries as a pd.Series with a DatetimeIndex:
[5]:
print(f"Is {type(kpi).__name__} an instance of pd.Series? "
      f"{isinstance(kpi, pd.Series)}")
Is UnivariateTimeSeries an instance of pd.Series? True
[6]:
print(kpi)
time
2020-03-02 09:12:00 667.118
2020-03-02 09:13:00 611.751
2020-03-02 09:14:00 599.456
2020-03-02 09:15:00 621.446
2020-03-02 09:16:00 1418.234
...
2020-05-01 23:46:00 874.214
2020-05-01 23:47:00 937.929
2020-05-01 23:48:00 1031.279
2020-05-01 23:49:00 1099.698
2020-05-01 23:50:00 935.405
Name: kpi, Length: 86807, dtype: float64
You can also convert a UnivariateTimeSeries back to a regular pd.Series as follows:
[7]:
print(f"type(kpi.to_pd()) = {type(kpi.to_pd())}")
type(kpi.to_pd()) = <class 'pandas.core.series.Series'>
You can access the time stamps (either as Unix timestamps or as datetime objects) and the values independently:
[8]:
# Get the Unix timestamps (first 5 for brevity)
print(kpi.time_stamps[:5])
[1583140320.0, 1583140380.0, 1583140440.0, 1583140500.0, 1583140560.0]
[9]:
# Get the datetimes (this is just the index of the UnivariateTimeSeries,
# since we inherit from pd.Series)
print(kpi.index[:5])
DatetimeIndex(['2020-03-02 09:12:00', '2020-03-02 09:13:00',
'2020-03-02 09:14:00', '2020-03-02 09:15:00',
'2020-03-02 09:16:00'],
dtype='datetime64[ns]', name='time', freq=None)
[10]:
# Get the values
print(kpi.values[:5])
[667.118, 611.751, 599.456, 621.446, 1418.234]
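To see how the two views of time relate, the following sketch converts a DatetimeIndex back to Unix seconds using plain pandas. It is one possible way to do the conversion, not necessarily how Merlion computes time_stamps internally. The names idx, epoch, and unix_seconds are illustrative.

```python
import pandas as pd

# A small DatetimeIndex built from three of the Unix timestamps shown above.
idx = pd.to_datetime([1583140320, 1583140380, 1583140440], unit="s")

# Seconds elapsed since the Unix epoch, as floats.
epoch = pd.Timestamp("1970-01-01")
unix_seconds = ((idx - epoch) / pd.Timedelta(seconds=1)).tolist()
print(unix_seconds)
```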
You may index into a UnivariateTimeSeries to obtain a tuple of (timestamp, value):
[11]:
print(f"kpi[0] = {kpi[0]}")
kpi[0] = (1583140320.0, 667.118)
If you instead use a slice index, you will obtain a new UnivariateTimeSeries:
[12]:
print(f"type(kpi[1:5]) = {type(kpi[1:5])}\n")
print(f"kpi[1:5] = \n\n{kpi[1:5]}")
type(kpi[1:5]) = <class 'merlion.utils.time_series.UnivariateTimeSeries'>
kpi[1:5] =
time
2020-03-02 09:13:00 611.751
2020-03-02 09:14:00 599.456
2020-03-02 09:15:00 621.446
2020-03-02 09:16:00 1418.234
Name: kpi, dtype: float64
Iterating over a UnivariateTimeSeries will iterate over tuples of (timestamp, value):
[13]:
for t, x in kpi[:5]:
    print((t, x))
(1583140320.0, 667.118)
(1583140380.0, 611.751)
(1583140440.0, 599.456)
(1583140500.0, 621.446)
(1583140560.0, 1418.234)
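The indexing contract described above (an integer gives a tuple, a slice gives a new object, iteration gives tuples) can be summarized with a toy class. MiniUnivariate is a hypothetical stand-in written for this illustration. It is not Merlion's implementation, which builds on pd.Series.

```python
from typing import Iterator, List, Tuple, Union


class MiniUnivariate:
    """Hypothetical stand-in that only mimics the indexing contract above."""

    def __init__(self, time_stamps: List[float], values: List[float]):
        if len(time_stamps) != len(values):
            raise ValueError("time_stamps and values must have the same length")
        self.time_stamps = list(time_stamps)
        self.values = list(values)

    def __len__(self) -> int:
        return len(self.values)

    def __getitem__(self, i: Union[int, slice]):
        if isinstance(i, slice):
            # A slice yields a new object of the same type.
            return MiniUnivariate(self.time_stamps[i], self.values[i])
        # An integer yields a (timestamp, value) tuple.
        return (self.time_stamps[i], self.values[i])

    def __iter__(self) -> Iterator[Tuple[float, float]]:
        # Iteration yields (timestamp, value) tuples in order.
        return iter(zip(self.time_stamps, self.values))


mini = MiniUnivariate([1583140320.0, 1583140380.0, 1583140440.0],
                      [667.118, 611.751, 599.456])
first = mini[0]     # a (timestamp, value) tuple
tail = mini[1:]     # a new MiniUnivariate with the last two samples
pairs = list(mini)  # all (timestamp, value) tuples
```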
TimeSeries: Merlion’s Standard Data Class
Because Merlion is a general-purpose library that handles both univariate and multivariate time series, our standard data class is TimeSeries. This class acts as a wrapper around a collection of UnivariateTimeSeries. We choose this format rather than a vector-based approach because it is much more robust to missing values and to different univariates being sampled at different rates.
The most transparent way to initialize a TimeSeries is with its constructor, which takes a collection (list or (ordered) dictionary) of UnivariateTimeSeries as its only argument:
[14]:
from collections import OrderedDict
from merlion.utils import TimeSeries
time_series_list = TimeSeries(univariates=[kpi.copy(), kpi_label.copy()])
time_series_dict = TimeSeries(
    univariates=OrderedDict([("kpi_renamed", kpi.copy()),
                             ("kpi_label", kpi_label.copy())]))
Alternatively, you may initialize a TimeSeries from a pd.DataFrame and convert a TimeSeries to a pd.DataFrame as follows:
[15]:
time_series = TimeSeries.from_pd(time_idx_df)
print(f"type(TimeSeries.from_pd(time_idx_df)) = {type(time_series)}\n")
recovered_time_idx_df = time_series.to_pd()
print("(recovered_time_idx_df == time_idx_df).all()")
print((recovered_time_idx_df == time_idx_df).all())
type(TimeSeries.from_pd(time_idx_df)) = <class 'merlion.utils.time_series.TimeSeries'>
(recovered_time_idx_df == time_idx_df).all()
kpi True
kpi_label True
dtype: bool
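The round trip works because a time-indexed data frame can be viewed as one named series per column. The sketch below illustrates that view with plain pandas on a three-row frame. It does not call Merlion, and frame, columns, and rebuilt are illustrative names.

```python
import pandas as pd

index = pd.to_datetime([1583140320, 1583140380, 1583140440], unit="s")
frame = pd.DataFrame({"kpi": [667.118, 611.751, 599.456],
                      "kpi_label": [0.0, 0.0, 0.0]}, index=index)

# Split the frame into one named series per variable...
columns = {name: frame[name] for name in frame.columns}
# ...and put the pieces back together, preserving the column order.
rebuilt = pd.concat(list(columns.values()), axis=1)

print(rebuilt.equals(frame))
```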
We may access the names of the individual univariates with time_series.names, access a specific univariate via time_series.univariates[name], and iterate over univariates by iterating for univariate in time_series.univariates. Concretely:
[16]:
# When we use a list of univariates, we retain the names of the univariates
# where possible. If a univariate is unnamed, we set its name to its integer
# index in the list of all univariates given. Here, kpi_label was not given an
# explicit name, but the output below shows that it kept the name "kpi_label"
# (presumably inherited from the pd.Series df.kpi_label passed as its values).
print(time_series_list.names)
['kpi', 'kpi_label']
[17]:
# If we pass a dictionary instead of a list, all univariates will have
# their specified names. The order is retained from the OrderedDict.
print(time_series_dict.names)
['kpi_renamed', 'kpi_label']
[18]:
# We can access the KPI like so:
kpi1 = time_series_list.univariates["kpi"]
kpi2 = time_series_dict.univariates["kpi_renamed"]
# kpi1 and kpi2 are the same univariate, just with different names
assert (kpi1 == kpi2).all()
[19]:
# We can iterate over all univariates like so:
for univariate in time_series_dict.univariates:
    print(univariate)
    print()
time
2020-03-02 09:12:00 667.118
2020-03-02 09:13:00 611.751
2020-03-02 09:14:00 599.456
2020-03-02 09:15:00 621.446
2020-03-02 09:16:00 1418.234
...
2020-05-01 23:46:00 874.214
2020-05-01 23:47:00 937.929
2020-05-01 23:48:00 1031.279
2020-05-01 23:49:00 1099.698
2020-05-01 23:50:00 935.405
Name: kpi_renamed, Length: 86807, dtype: float64
time
2020-03-02 09:12:00 0.0
2020-03-02 09:13:00 0.0
2020-03-02 09:14:00 0.0
2020-03-02 09:15:00 0.0
2020-03-02 09:16:00 0.0
...
2020-05-01 23:46:00 0.0
2020-05-01 23:47:00 0.0
2020-05-01 23:48:00 0.0
2020-05-01 23:49:00 0.0
2020-05-01 23:50:00 0.0
Name: kpi_label, Length: 86807, dtype: float64
[20]:
# We can also iterate over all univariates & names like so:
for name, univariate in time_series_dict.items():
    print(f"Univariate {name}")
    print(univariate)
    print()
Univariate kpi_renamed
time
2020-03-02 09:12:00 667.118
2020-03-02 09:13:00 611.751
2020-03-02 09:14:00 599.456
2020-03-02 09:15:00 621.446
2020-03-02 09:16:00 1418.234
...
2020-05-01 23:46:00 874.214
2020-05-01 23:47:00 937.929
2020-05-01 23:48:00 1031.279
2020-05-01 23:49:00 1099.698
2020-05-01 23:50:00 935.405
Name: kpi_renamed, Length: 86807, dtype: float64
Univariate kpi_label
time
2020-03-02 09:12:00 0.0
2020-03-02 09:13:00 0.0
2020-03-02 09:14:00 0.0
2020-03-02 09:15:00 0.0
2020-03-02 09:16:00 0.0
...
2020-05-01 23:46:00 0.0
2020-05-01 23:47:00 0.0
2020-05-01 23:48:00 0.0
2020-05-01 23:49:00 0.0
2020-05-01 23:50:00 0.0
Name: kpi_label, Length: 86807, dtype: float64
Time Series Indexing & Alignment
An important concept of TimeSeries in Merlion is alignment. We call a time series aligned if all of its univariates are sampled at the same time stamps. We illustrate examples of time series that are and aren’t aligned below:
[21]:
aligned = TimeSeries({"kpi": kpi.copy(), "kpi_label": kpi_label.copy()})
print(f"Is aligned? {aligned.is_aligned}")
Is aligned? True
[22]:
not_aligned = TimeSeries({
    "kpi": kpi[1:],              # 2020-03-02 09:13:00 to 2020-05-01 23:50:00
    "kpi_label": kpi_label[:-1]  # 2020-03-02 09:12:00 to 2020-05-01 23:49:00
})
print(f"Is aligned? {not_aligned.is_aligned}")
Is aligned? False
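The definition of alignment can be expressed directly on pandas indices: every series must share exactly the same time stamps. The helper below is a sketch of that check under this definition. same_time_stamps is a hypothetical function, not part of Merlion's API.

```python
import pandas as pd


def same_time_stamps(series_list):
    """Return True if every series shares exactly the same index."""
    first = series_list[0].index
    return all(s.index.equals(first) for s in series_list[1:])


index = pd.to_datetime([1583140320, 1583140380, 1583140440], unit="s")
a = pd.Series([667.118, 611.751, 599.456], index=index, name="kpi")
b = pd.Series([0.0, 0.0, 0.0], index=index, name="kpi_label")

aligned_flag = same_time_stamps([a, b])                    # identical indices
shifted_flag = same_time_stamps([a.iloc[1:], b.iloc[:-1]]) # offset by one sample
print(aligned_flag, shifted_flag)
```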
If your time series is aligned, you may use an integer index to obtain a tuple (timestamp, (value_1, ..., value_k)), or a slice index to obtain a sub-TimeSeries:
[23]:
aligned[0]
[23]:
(1583140320.0, [667.118, 0.0])
[24]:
print(f"type(aligned[1:5]) = {type(aligned[1:5])}\n")
print(f"aligned[1:5] = \n{aligned[1:5]}")
type(aligned[1:5]) = <class 'merlion.utils.time_series.TimeSeries'>
aligned[1:5] =
kpi kpi_label
time
2020-03-02 09:13:00 611.751 0.0
2020-03-02 09:14:00 599.456 0.0
2020-03-02 09:15:00 621.446 0.0
2020-03-02 09:16:00 1418.234 0.0
You may also iterate over an aligned time series as for timestamp, (value_1, ..., value_k) in time_series:
[25]:
for t, (x1, x2) in aligned[:5]:
    print((t, (x1, x2)))
(1583140320.0, (667.118, 0.0))
(1583140380.0, (611.751, 0.0))
(1583140440.0, (599.456, 0.0))
(1583140500.0, (621.446, 0.0))
(1583140560.0, (1418.234, 0.0))
Note that Merlion will raise an error if you try to do any of these things with a time series that isn’t aligned! For example:
[26]:
try:
    not_aligned[0]
except RuntimeError as e:
    print(f"{type(e).__name__}: {e}")
RuntimeError: The univariates comprising this time series are not aligned (they have different time stamps), but alignment is required to index into the time series.
You can still get the length/shape of a misaligned time series, but Merlion will emit a warning.
[27]:
print(len(not_aligned))
/Users/abhatnagar/Desktop/Merlion/merlion/utils/time_series.py:672: UserWarning: The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.
warnings.warn(warning)
The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.
86807
[28]:
print(not_aligned.shape)
The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.
(2, 86807)
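The reported length equals that of the full data set even though each univariate in not_aligned has one sample fewer, because the union of the two sets of time stamps recovers every original time stamp. A five-sample sketch in plain pandas (variable names are illustrative) shows the same effect:

```python
import pandas as pd

index = pd.date_range("2020-03-02 09:12:00", periods=5, freq="min")

left_idx = index[1:]    # drops the first time stamp
right_idx = index[:-1]  # drops the last time stamp

# The union contains every time stamp present in either index.
union = left_idx.union(right_idx)
print(len(left_idx), len(right_idx), len(union))
```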
However, you may call time_series.align() to automatically resample the individual univariates of a time series to make it aligned. By default, this will take the union of all the time stamps present in any of the individual univariates, but this is customizable.
[29]:
print(f"Is not_aligned.align() aligned? {not_aligned.align().is_aligned}")
Is not_aligned.align() aligned? True
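As a rough picture of what aligning on the union of time stamps involves, the sketch below reindexes two misaligned pandas series onto a shared grid and fills the gaps. The fill policy used here (time-based interpolation, then forward and backward fill at the edges) is an assumption chosen for illustration. Merlion's own resampling and imputation behavior may differ.

```python
import pandas as pd

index = pd.date_range("2020-03-02 09:12:00", periods=5, freq="min")
kpi_s = pd.Series([667.118, 611.751, 599.456, 621.446, 1418.234], index=index)
label_s = pd.Series([0.0] * 5, index=index)

# Two misaligned pieces: one misses the first sample, the other the last.
pieces = {"kpi": kpi_s.iloc[1:], "kpi_label": label_s.iloc[:-1]}

# Build the union of all time stamps, then put every series on that grid.
union = pieces["kpi"].index.union(pieces["kpi_label"].index)
resampled = {
    # Assumed fill policy: interpolate in time, then fill the edges.
    name: s.reindex(union).interpolate(method="time").ffill().bfill()
    for name, s in pieces.items()
}

print(all(s.index.equals(union) for s in resampled.values()))
```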
TimeSeries: A Few Useful Features
We provide much more information on the merlion.utils.time_series.TimeSeries class in the API docs, but we highlight two more useful features here. These work regardless of whether a time series is aligned!
You may obtain the subset of a time series between times t0 and tf by calling time_series.window(t0, tf). t0 and tf may be any reasonable format of datetime, or a Unix timestamp.
[30]:
aligned.window("2020-03-05 12:00:00", pd.Timestamp(year=2020, month=4, day=1))
[30]:
kpi kpi_label
time
2020-03-05 12:00:00 1166.819 0.0
2020-03-05 12:01:00 1345.504 0.0
2020-03-05 12:02:00 1061.391 0.0
2020-03-05 12:03:00 1260.874 0.0
2020-03-05 12:04:00 1202.009 0.0
... ... ...
2020-03-31 23:55:00 1154.397 0.0
2020-03-31 23:56:00 1270.292 0.0
2020-03-31 23:57:00 1160.761 0.0
2020-03-31 23:58:00 1082.076 0.0
2020-03-31 23:59:00 1167.297 0.0
[38160 rows x 2 columns]
[31]:
# Note that the first value of the KPI (which is missing in not_aligned) is NaN
not_aligned.window(1583140320, 1583226720)
[31]:
kpi kpi_label
time
2020-03-02 09:12:00 NaN 0.0
2020-03-02 09:13:00 611.751 0.0
2020-03-02 09:14:00 599.456 0.0
2020-03-02 09:15:00 621.446 0.0
2020-03-02 09:16:00 1418.234 0.0
... ... ...
2020-03-03 09:07:00 1132.564 0.0
2020-03-03 09:08:00 1087.037 0.0
2020-03-03 09:09:00 984.432 0.0
2020-03-03 09:10:00 1085.008 0.0
2020-03-03 09:11:00 1020.937 0.0
[1440 rows x 2 columns]
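Judging from the two outputs above, the window includes t0 and excludes tf: the first window ends at 2020-03-31 23:59:00, and the second contains exactly 1440 one-minute samples for a span of 86400 seconds. The sketch below reproduces that half-open behavior on a plain pandas series. window_sketch is a hypothetical helper, and unlike Merlion's method it only accepts datetime-like inputs, not Unix timestamps.

```python
import pandas as pd


def window_sketch(series, t0, tf):
    """Keep samples with t0 <= time < tf (a half-open interval)."""
    t0, tf = pd.Timestamp(t0), pd.Timestamp(tf)
    return series[(series.index >= t0) & (series.index < tf)]


index = pd.date_range("2020-03-02 09:12:00", periods=5, freq="min")
s = pd.Series([667.118, 611.751, 599.456, 621.446, 1418.234], index=index)

# 09:13 is included, 09:15 is excluded.
sub = window_sketch(s, "2020-03-02 09:13:00", "2020-03-02 09:15:00")
print(sub)
```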
You may also bisect a time series into a left and right portion, at any timestamp.
[32]:
left, right = aligned.bisect("2020-05-01")
print(f"Left\n{left}\n")
print()
print(f"Right\n{right}\n")
Left
kpi kpi_label
time
2020-03-02 09:12:00 667.118 0.0
2020-03-02 09:13:00 611.751 0.0
2020-03-02 09:14:00 599.456 0.0
2020-03-02 09:15:00 621.446 0.0
2020-03-02 09:16:00 1418.234 0.0
... ... ...
2020-04-30 23:55:00 1296.091 0.0
2020-04-30 23:56:00 1323.743 0.0
2020-04-30 23:57:00 1203.672 0.0
2020-04-30 23:58:00 1278.720 0.0
2020-04-30 23:59:00 1217.877 0.0
[85376 rows x 2 columns]
Right
kpi kpi_label
time
2020-05-01 00:00:00 1381.110 0.0
2020-05-01 00:01:00 1807.039 0.0
2020-05-01 00:02:00 1833.385 0.0
2020-05-01 00:03:00 1674.412 0.0
2020-05-01 00:04:00 1683.194 0.0
... ... ...
2020-05-01 23:46:00 874.214 0.0
2020-05-01 23:47:00 937.929 0.0
2020-05-01 23:48:00 1031.279 0.0
2020-05-01 23:49:00 1099.698 0.0
2020-05-01 23:50:00 935.405 0.0
[1431 rows x 2 columns]
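In the output above, the left portion ends just before the split time and the right portion starts at it, so every sample lands in exactly one half (85376 + 1431 = 86807 rows). The same split can be sketched with plain pandas. bisect_sketch is a hypothetical helper written for this illustration and is not Merlion's implementation.

```python
import pandas as pd


def bisect_sketch(series, t):
    """Split into samples strictly before t and samples at or after t."""
    t = pd.Timestamp(t)
    return series[series.index < t], series[series.index >= t]


index = pd.date_range("2020-04-30 23:58:00", periods=4, freq="min")
s = pd.Series([1278.720, 1217.877, 1381.110, 1807.039], index=index)

left_part, right_part = bisect_sketch(s, "2020-05-01")
print(len(left_part), len(right_part))
```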
Please refer to the API docs on UnivariateTimeSeries and TimeSeries for more information.