Merlion’s Data Format

This notebook will explain how to use Merlion’s UnivariateTimeSeries and TimeSeries classes. These classes are the core data format used throughout the repo. In general, you may think of each TimeSeries as being a collection of UnivariateTimeSeries objects, one for each variable.

Let’s start by loading some data using pandas.

[1]:
import os
import pandas as pd

df = pd.read_csv(os.path.join("..", "data", "example.csv"))
print(df)
       timestamp_millis       kpi  kpi_label
0         1583140320000   667.118          0
1         1583140380000   611.751          0
2         1583140440000   599.456          0
3         1583140500000   621.446          0
4         1583140560000  1418.234          0
...                 ...       ...        ...
86802     1588376760000   874.214          0
86803     1588376820000   937.929          0
86804     1588376880000  1031.279          0
86805     1588376940000  1099.698          0
86806     1588377000000   935.405          0

[86807 rows x 3 columns]

The column timestamp_millis consists of Unix timestamps (in units of milliseconds), and the column kpi contains the value of the time series metric at each of those timestamps. We will also create a version of this dataframe that is indexed by time:

[2]:
time_idx_df = df.copy()
time_idx_df["timestamp_millis"] = pd.to_datetime(time_idx_df["timestamp_millis"], unit="ms")
time_idx_df = time_idx_df.set_index("timestamp_millis")
print(time_idx_df)
                          kpi  kpi_label
timestamp_millis
2020-03-02 09:12:00   667.118          0
2020-03-02 09:13:00   611.751          0
2020-03-02 09:14:00   599.456          0
2020-03-02 09:15:00   621.446          0
2020-03-02 09:16:00  1418.234          0
...                       ...        ...
2020-05-01 23:46:00   874.214          0
2020-05-01 23:47:00   937.929          0
2020-05-01 23:48:00  1031.279          0
2020-05-01 23:49:00  1099.698          0
2020-05-01 23:50:00   935.405          0

[86807 rows x 2 columns]
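
As a quick sanity check on the units, the conversion from millisecond Unix timestamps to datetimes can be reproduced with the Python standard library alone. This is only a sketch: the helper name millis_to_utc is hypothetical, and it assumes the datetimes printed above are in UTC.

```python
from datetime import datetime, timezone

# First and last values of timestamp_millis from the dataframe printed above
first_ms = 1583140320000
last_ms = 1588377000000


def millis_to_utc(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


first_dt = millis_to_utc(first_ms)
last_dt = millis_to_utc(last_ms)
print(first_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-03-02 09:12:00
print(last_dt.strftime("%Y-%m-%d %H:%M:%S"))   # 2020-05-01 23:50:00

# The first two rows are 60,000 ms apart, i.e. one sample per minute
step_seconds = (1583140380000 - 1583140320000) / 1000
print(step_seconds)  # 60.0
```

The results match the first and last index entries of time_idx_df shown above.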

UnivariateTimeSeries: The Basic Building Block

The most transparent way to initialize a UnivariateTimeSeries is to use its constructor. The constructor takes two required arguments: time_stamps, a sequence of Unix timestamps (in units of seconds) or datetime objects, and values, a sequence of the actual time series values. You may optionally provide a name as well. Because our dataframe stores its timestamps in milliseconds, we divide them by 1000 below.

[3]:
from merlion.utils import UnivariateTimeSeries

kpi = UnivariateTimeSeries(
    time_stamps=df.timestamp_millis/1000,  # timestamps in units of seconds
    values=df.kpi,                         # time series values
    name="kpi"                             # optional: a name for this univariate
)

kpi_label = UnivariateTimeSeries(
    time_stamps=df.timestamp_millis/1000,  # timestamps in units of seconds
    values=df.kpi_label                    # time series values
)

Alternatively, you may initialize a UnivariateTimeSeries directly from a time-indexed pd.Series:

[4]:
kpi_equivalent = UnivariateTimeSeries.from_pd(time_idx_df.kpi)
print(f"Are the two UnivariateTimeSeries equal? {(kpi == kpi_equivalent).all()}")
Are the two UnivariateTimeSeries equal? True

We implement the UnivariateTimeSeries as a pd.Series with a DatetimeIndex:

[5]:
print(f"Is {type(kpi).__name__} an instance of pd.Series? "
      f"{isinstance(kpi, pd.Series)}")
Is UnivariateTimeSeries an instance of pd.Series? True
[6]:
print(kpi)
2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi, Length: 86807, dtype: float64

You can also convert a UnivariateTimeSeries back to a regular pd.Series as follows:

[7]:
print(f"type(kpi.to_pd()) = {type(kpi.to_pd())}")
type(kpi.to_pd()) = <class 'pandas.core.series.Series'>

You can access the time stamps (either as Unix timestamps in seconds or as datetime objects) and the values independently:

[8]:
# Get the Unix timestamps (first 5 for brevity)
print(kpi.time_stamps[:5])
[1583140320.0, 1583140380.0, 1583140440.0, 1583140500.0, 1583140560.0]
[9]:
# Get the datetimes (this is just the index of the UnivariateTimeSeries,
# since we inherit from pd.Series)
print(kpi.index[:5])
DatetimeIndex(['2020-03-02 09:12:00', '2020-03-02 09:13:00',
               '2020-03-02 09:14:00', '2020-03-02 09:15:00',
               '2020-03-02 09:16:00'],
              dtype='datetime64[ns]', freq=None)
[10]:
# Get the values
print(kpi.values[:5])
[667.118, 611.751, 599.456, 621.446, 1418.234]

You may index into a UnivariateTimeSeries to obtain a tuple of (timestamp, value):

[11]:
print(f"kpi[0] = {kpi[0]}")
kpi[0] = (1583140320.0, 667.118)

If you instead use a slice index, you will obtain a new UnivariateTimeSeries:

[12]:
print(f"type(kpi[1:5]) = {type(kpi[1:5])}\n")
print(f"kpi[1:5] = \n\n{kpi[1:5]}")
type(kpi[1:5]) = <class 'merlion.utils.time_series.UnivariateTimeSeries'>

kpi[1:5] =

2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
Name: kpi, dtype: float64

Iterating over a UnivariateTimeSeries will iterate over tuples of (timestamp, value):

[13]:
for t, x in kpi[:5]:
    print((t, x))
(1583140320.0, 667.118)
(1583140380.0, 611.751)
(1583140440.0, 599.456)
(1583140500.0, 621.446)
(1583140560.0, 1418.234)
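
The indexing and iteration behavior described above can be summarized with a small stand-in class. This is a minimal sketch in plain Python: ToyUnivariate is a hypothetical name used only for illustration, and it is not how Merlion implements UnivariateTimeSeries (which, as noted above, inherits from pd.Series).

```python
from typing import Iterator, List, Tuple, Union


class ToyUnivariate:
    """Hypothetical stand-in for illustration only; not Merlion's class."""

    def __init__(self, time_stamps: List[float], values: List[float]):
        if len(time_stamps) != len(values):
            raise ValueError("time_stamps and values must have the same length")
        self.time_stamps = list(time_stamps)
        self.values = list(values)

    def __len__(self) -> int:
        return len(self.values)

    def __getitem__(self, i: Union[int, slice]):
        if isinstance(i, slice):
            # A slice index yields a new series over the selected range
            return ToyUnivariate(self.time_stamps[i], self.values[i])
        # An integer index yields a single (timestamp, value) tuple
        return self.time_stamps[i], self.values[i]

    def __iter__(self) -> Iterator[Tuple[float, float]]:
        # Iteration yields (timestamp, value) tuples in order
        return iter(zip(self.time_stamps, self.values))


toy = ToyUnivariate(
    time_stamps=[1583140320.0, 1583140380.0, 1583140440.0],
    values=[667.118, 611.751, 599.456],
)
first = toy[0]     # a (timestamp, value) tuple
tail = toy[1:]     # a new ToyUnivariate with two points
pairs = list(toy)  # all (timestamp, value) tuples
print(first, len(tail), pairs[1])
```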

TimeSeries: Merlion’s Standard Data Class

Because Merlion is a general-purpose library that handles both univariate and multivariate time series, our standard data class is TimeSeries. This class acts as a wrapper around a collection of UnivariateTimeSeries. We choose this format rather than a vector-based approach because storing each variable separately is much more robust to missing values and to different univariates being sampled at different rates.
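
To make the trade-off concrete, here is a plain-Python sketch comparing per-variable storage with a single matrix when two variables are sampled at different rates. The variable names and values are made up for illustration, and the dictionaries are only a stand-in for real time series objects.

```python
import math

# Two variables sampled at different rates (time stamps in Unix seconds).
# The names and values are illustrative.
fast = {0: 1.0, 60: 2.0, 120: 3.0, 180: 4.0}  # sampled every 60 s
slow = {0: 10.0, 120: 30.0}                    # sampled every 120 s

# Per-variable storage keeps exactly the observations that exist.
per_variable = {"fast": fast, "slow": slow}
n_stored = sum(len(u) for u in per_variable.values())

# A single matrix needs one row per time stamp seen in any variable,
# so the slower variable must be padded with NaN placeholders.
all_times = sorted(set(fast) | set(slow))
matrix = [[u.get(t, math.nan) for u in per_variable.values()] for t in all_times]
n_padding = sum(math.isnan(x) for row in matrix for x in row)

print(n_stored)   # 6 real observations
print(n_padding)  # 2 NaN cells needed by the matrix layout
```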

The most transparent way to initialize a TimeSeries is with its constructor, which takes a collection (list or (ordered) dictionary) of UnivariateTimeSeries as its only argument:

[14]:
from collections import OrderedDict
from merlion.utils import TimeSeries

time_series_list = TimeSeries(univariates=[kpi.copy(), kpi_label.copy()])
time_series_dict = TimeSeries(
    univariates=OrderedDict([("kpi_renamed", kpi.copy()),
                             ("kpi_label", kpi_label.copy())]))

Alternatively, you may initialize a TimeSeries from a pd.DataFrame and convert a TimeSeries to a pd.DataFrame as follows:

[15]:
time_series = TimeSeries.from_pd(time_idx_df)
print(f"type(TimeSeries.from_pd(time_idx_df)) = {type(time_series)}\n")

recovered_time_idx_df = time_series.to_pd()
print("(recovered_time_idx_df == time_idx_df).all()")
print((recovered_time_idx_df == time_idx_df).all())
type(TimeSeries.from_pd(time_idx_df)) = <class 'merlion.utils.time_series.TimeSeries'>

(recovered_time_idx_df == time_idx_df).all()
kpi          True
kpi_label    True
dtype: bool

We may access the names of the individual univariates with time_series.names, access a specific univariate via time_series.univariates[name], and iterate over univariates by iterating for univariate in time_series.univariates. Concretely:

[16]:
# When we use a list of univariates, we retain the names of the univariates
# where possible. If a univariate has no name, we set its name to its integer
# index in the list of all univariates given. Here, kpi_label was not given
# an explicit name, but the output below shows that it is named "kpi_label",
# matching the name of the pd.Series (df.kpi_label) it was built from.
print(time_series_list.names)
['kpi', 'kpi_label']
[17]:
# If we pass a dictionary instead of a list, all univariates will have
# their specified names. The order is retained from the OrderedDict.
print(time_series_dict.names)
['kpi_renamed', 'kpi_label']
[18]:
# We can access the KPI like so:
kpi1 = time_series_list.univariates["kpi"]
kpi2 = time_series_dict.univariates["kpi_renamed"]

# kpi1 and kpi2 are the same univariate, just with different names
assert (kpi1 == kpi2).all()
[19]:
# We can iterate over all univariates like so:
for univariate in time_series_dict.univariates:
    print(univariate)
    print()
2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi_renamed, Length: 86807, dtype: float64

2020-03-02 09:12:00    0.0
2020-03-02 09:13:00    0.0
2020-03-02 09:14:00    0.0
2020-03-02 09:15:00    0.0
2020-03-02 09:16:00    0.0
                      ...
2020-05-01 23:46:00    0.0
2020-05-01 23:47:00    0.0
2020-05-01 23:48:00    0.0
2020-05-01 23:49:00    0.0
2020-05-01 23:50:00    0.0
Name: kpi_label, Length: 86807, dtype: float64

[20]:
# We can also iterate over all univariates & names like so:
for name, univariate in time_series_dict.items():
    print(f"Univariate {name}")
    print(univariate)
    print()
Univariate kpi_renamed
2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi_renamed, Length: 86807, dtype: float64

Univariate kpi_label
2020-03-02 09:12:00    0.0
2020-03-02 09:13:00    0.0
2020-03-02 09:14:00    0.0
2020-03-02 09:15:00    0.0
2020-03-02 09:16:00    0.0
                      ...
2020-05-01 23:46:00    0.0
2020-05-01 23:47:00    0.0
2020-05-01 23:48:00    0.0
2020-05-01 23:49:00    0.0
2020-05-01 23:50:00    0.0
Name: kpi_label, Length: 86807, dtype: float64

Time Series Indexing & Alignment

An important concept of TimeSeries in Merlion is alignment. We call a time series aligned if all of its univariates are sampled at the same time stamps. We illustrate examples of time series that are and aren’t aligned below:

[21]:
aligned = TimeSeries({"kpi": kpi.copy(), "kpi_label": kpi_label.copy()})
print(f"Is aligned? {aligned.is_aligned}")
Is aligned? True
[22]:
not_aligned = TimeSeries({"kpi": kpi[1:],                # 2020-03-02 09:13:00 to 2020-05-01 23:50:00
                          "kpi_label": kpi_label[:-1]})  # 2020-03-02 09:12:00 to 2020-05-01 23:49:00
print(f"Is aligned? {not_aligned.is_aligned}")
Is aligned? False
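
The definition of alignment can be written as a one-line check. The sketch below uses plain lists of Unix timestamps in place of real univariates, and the function name stamps_are_aligned is hypothetical; it only illustrates the definition and is not Merlion's is_aligned implementation.

```python
from typing import Dict, List


def stamps_are_aligned(time_stamps: Dict[str, List[float]]) -> bool:
    """Return True if every univariate has exactly the same time stamps."""
    stamps = list(time_stamps.values())
    return all(s == stamps[0] for s in stamps[1:])


# First four time stamps of the KPI series printed earlier in this notebook
t = [1583140320.0, 1583140380.0, 1583140440.0, 1583140500.0]

same = {"kpi": t, "kpi_label": t}
shifted = {"kpi": t[1:], "kpi_label": t[:-1]}  # mirrors the not_aligned example

same_ok = stamps_are_aligned(same)
shifted_ok = stamps_are_aligned(shifted)
print(same_ok)     # True
print(shifted_ok)  # False
```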

If your time series is aligned, you may use an integer index to obtain a tuple (timestamp, (value_1, ..., value_k)), or a slice index to obtain a sub-TimeSeries:

[23]:
aligned[0]
[23]:
(1583140320.0, (667.118, 0.0))
[24]:
print(f"type(aligned[1:5]) = {type(aligned[1:5])}\n")
print(f"aligned[1:5] = \n{aligned[1:5]}")
type(aligned[1:5]) = <class 'merlion.utils.time_series.TimeSeries'>

aligned[1:5] =
                          kpi  kpi_label
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0

You may also iterate over an aligned time series as for timestamp, (value_1, ..., value_k) in time_series:

[25]:
for t, (x1, x2) in aligned[:5]:
    print((t, (x1, x2)))
(1583140320.0, (667.118, 0.0))
(1583140380.0, (611.751, 0.0))
(1583140440.0, (599.456, 0.0))
(1583140500.0, (621.446, 0.0))
(1583140560.0, (1418.234, 0.0))

Note that Merlion will raise an error if you try to index, slice, or iterate over a time series that isn’t aligned. For example:

[26]:
try:
    not_aligned[0]
except RuntimeError as e:
    print(f"{type(e).__name__}: {e}")
RuntimeError: The univariates comprising this time series are not aligned (they have different time stamps), but alignment is required to index into the time series.

You can still get the length/shape of a misaligned time series, but Merlion will emit a warning.

[27]:
print(len(not_aligned))
The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.
86807
[28]:
print(not_aligned.shape)
The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.
(2, 86807)
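
The reported length is explained by the union rule in the warning. In not_aligned, each univariate is missing one point (the first or the last), yet the union of their time stamps still covers every point. A toy version with six time stamps, using plain Python sets and illustrative values:

```python
# Toy one-minute grid of Unix timestamps (in seconds); the size is illustrative.
grid = [1583140320 + 60 * i for i in range(6)]

kpi_times = grid[1:]     # missing the first time stamp, like kpi[1:]
label_times = grid[:-1]  # missing the last time stamp, like kpi_label[:-1]

union = sorted(set(kpi_times) | set(label_times))
intersection = sorted(set(kpi_times) & set(label_times))

print(len(kpi_times), len(label_times))  # 5 5
print(len(union))                        # 6
print(len(intersection))                 # 4
```

The union has one more element than either univariate, which is the same pattern as the length of 86807 reported above.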

However, you may call time_series.align() to automatically resample the individual univariates of a time series to make it aligned. By default, this will take the union of all the time stamps present in any of the individual univariates, but this is customizable.

[29]:
print(f"Is not_aligned.align() aligned? {not_aligned.align().is_aligned}")
Is not_aligned.align() aligned? True
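
To illustrate what resampling onto the union of time stamps can look like, here is a minimal plain-Python sketch. It fills gaps by linear interpolation and holds the nearest value at the edges. Both of those fill rules, as well as the helper names interpolate_at and align_union, are assumptions of this sketch; Merlion's align() may behave differently and offers its own options.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Univariate = Tuple[List[float], List[float]]  # (sorted time stamps, values)


def interpolate_at(times: List[float], values: List[float], t: float) -> float:
    """Linearly interpolate at time t, holding the edge value outside the range."""
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return values[i]   # t was actually observed
    if i == 0:
        return values[0]   # t is before the first observation
    if i == len(times):
        return values[-1]  # t is after the last observation
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return values[i - 1] + w * (values[i] - values[i - 1])


def align_union(univariates: Dict[str, Univariate]) -> Dict[str, Univariate]:
    """Resample every (non-empty) univariate onto the union of all time stamps."""
    union = sorted({t for times, _ in univariates.values() for t in times})
    return {
        name: (union, [interpolate_at(times, vals, t) for t in union])
        for name, (times, vals) in univariates.items()
    }


toy_not_aligned = {
    "a": ([0.0, 120.0], [1.0, 3.0]),                # no sample at t=60
    "b": ([0.0, 60.0, 120.0], [10.0, 20.0, 30.0]),
}
toy_aligned = align_union(toy_not_aligned)
print(toy_aligned["a"])  # ([0.0, 60.0, 120.0], [1.0, 2.0, 3.0])
```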

TimeSeries: A Few Useful Features

We provide much more information on the merlion.utils.time_series.TimeSeries class in the API docs, but we highlight two more useful features here. These work regardless of whether a time series is aligned!

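You may obtain the subset of a time series between times t0 and tf by calling time_series.window(t0, tf). Each of t0 and tf may be a datetime in any reasonable format (for example a string or a pd.Timestamp) or a Unix timestamp. As the outputs below show, the window includes t0 but excludes tf.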

[30]:
aligned.window("2020-03-05 12:00:00", pd.Timestamp(year=2020, month=4, day=1))
[30]:
                          kpi  kpi_label
2020-03-05 12:00:00  1166.819        0.0
2020-03-05 12:01:00  1345.504        0.0
2020-03-05 12:02:00  1061.391        0.0
2020-03-05 12:03:00  1260.874        0.0
2020-03-05 12:04:00  1202.009        0.0
...                       ...        ...
2020-03-31 23:55:00  1154.397        0.0
2020-03-31 23:56:00  1270.292        0.0
2020-03-31 23:57:00  1160.761        0.0
2020-03-31 23:58:00  1082.076        0.0
2020-03-31 23:59:00  1167.297        0.0

[38160 rows x 2 columns]
[31]:
# Note that the first value of the KPI (which is missing in not_aligned) is NaN
not_aligned.window(1583140320, 1583226720)
[31]:
                          kpi  kpi_label
2020-03-02 09:12:00       NaN        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-03-03 09:07:00  1132.564        0.0
2020-03-03 09:08:00  1087.037        0.0
2020-03-03 09:09:00   984.432        0.0
2020-03-03 09:10:00  1085.008        0.0
2020-03-03 09:11:00  1020.937        0.0

[1440 rows x 2 columns]
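
Judging by the outputs above, the window includes t0 and excludes tf. The sketch below expresses that half-open selection on a plain list of (timestamp, value) pairs. The function is a hypothetical illustration, not Merlion's implementation.

```python
from typing import List, Tuple

Series = List[Tuple[float, float]]


def window(series: Series, t0: float, tf: float) -> Series:
    """Keep the (timestamp, value) pairs with t0 <= timestamp < tf."""
    return [(t, x) for t, x in series if t0 <= t < tf]


# First five KPI observations printed earlier in this notebook
kpi_head = [
    (1583140320.0, 667.118),
    (1583140380.0, 611.751),
    (1583140440.0, 599.456),
    (1583140500.0, 621.446),
    (1583140560.0, 1418.234),
]
sub = window(kpi_head, t0=1583140380, tf=1583140500)
print(sub)  # [(1583140380.0, 611.751), (1583140440.0, 599.456)]

# Row-count check for the second call above: a half-open window of
# 86400 seconds holds 1440 one-minute samples when none are missing.
n_rows = (1583226720 - 1583140320) // 60
print(n_rows)  # 1440
```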

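You may also bisect a time series into a left and a right portion at any timestamp. As the output below shows, the left portion ends strictly before that timestamp and the right portion starts at it.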

[32]:
left, right = aligned.bisect("2020-05-01")
print(f"Left\n{left}\n")
print()
print(f"Right\n{right}\n")
Left
                          kpi  kpi_label
2020-03-02 09:12:00   667.118        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-04-30 23:55:00  1296.091        0.0
2020-04-30 23:56:00  1323.743        0.0
2020-04-30 23:57:00  1203.672        0.0
2020-04-30 23:58:00  1278.720        0.0
2020-04-30 23:59:00  1217.877        0.0

[85376 rows x 2 columns]


Right
                          kpi  kpi_label
2020-05-01 00:00:00  1381.110        0.0
2020-05-01 00:01:00  1807.039        0.0
2020-05-01 00:02:00  1833.385        0.0
2020-05-01 00:03:00  1674.412        0.0
2020-05-01 00:04:00  1683.194        0.0
...                       ...        ...
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0
2020-05-01 23:50:00   935.405        0.0

[1431 rows x 2 columns]
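
The split shown above can be sketched on a plain list of (timestamp, value) pairs. The function name bisect_series is hypothetical and this is not Merlion's implementation; the rule that the split point goes to the right-hand portion is inferred from the output above.

```python
from typing import List, Tuple

Series = List[Tuple[float, float]]


def bisect_series(series: Series, t: float) -> Tuple[Series, Series]:
    """Split into the pairs strictly before t and the pairs at or after t."""
    left = [(ts, x) for ts, x in series if ts < t]
    right = [(ts, x) for ts, x in series if ts >= t]
    return left, right


# First five KPI observations printed earlier in this notebook
kpi_head = [
    (1583140320.0, 667.118),
    (1583140380.0, 611.751),
    (1583140440.0, 599.456),
    (1583140500.0, 621.446),
    (1583140560.0, 1418.234),
]
left_part, right_part = bisect_series(kpi_head, t=1583140440)
print(len(left_part), len(right_part))  # 2 3
print(right_part[0])                    # (1583140440.0, 599.456)

# The two portions printed above account for every row of the series
total_rows = 85376 + 1431
print(total_rows)  # 86807
```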

Please refer to the API docs on UnivariateTimeSeries and TimeSeries for more information.