Data Object

To feed observational data to the causal discovery algorithms in our API, the raw data, a NumPy array along with an optional list of variable names, is used to instantiate a CausalAI data object. Note that any data transformation must be applied to the NumPy array prior to instantiating the data object. For time series and tabular data, \(\texttt{TimeSeriesData}\) and \(\texttt{TabularData}\) must be initialized with the aforementioned data, respectively.
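As a minimal sketch of this order of operations, here is per-column standardization written out in plain NumPy (the library's own transform classes, shown later, do this in a NaN-aware way); the variable names are illustrative:

```python
import numpy as np

raw = np.random.random((100, 2))

# 1. apply any transformation to the raw NumPy array first,
#    e.g. standardize each column to zero mean and unit variance
processed = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# 2. only then instantiate the data object from the transformed array, e.g.
# data_obj = TimeSeriesData(processed, var_names=['A', 'B'])
```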

[1]:
import numpy as np
import math
import matplotlib
from matplotlib import pyplot as plt
import csv
import pandas as pd

Time Series Data

Let’s begin by importing the modules:

[2]:
from causalai.data.time_series import TimeSeriesData
from causalai.data.transforms.time_series import StandardizeTransform, DifferenceTransform, Heterogeneous2DiscreteTransform

We will now instantiate a random NumPy array, define a data object using our time series data class, and look at its important attributes and methods. Let’s say our time series has length 100 and there are 2 variables.

[3]:
data_array = np.random.random((100, 2))

data_obj = TimeSeriesData(data_array)
print(f'This time series object has length {data_obj.length}')
print(f'This time series object has dimensions {data_obj.dim}')
print(f'This time series object has variables with names {data_obj.var_names}')
This time series object has length [100]
This time series object has dimensions 2
This time series object has variables with names [0, 1]

There are a few things to notice:

  1. We assume that both variables are sampled at the same temporal rate (i.e., the same temporal resolution). We currently do not support time series in which different variables have different temporal resolutions.
  2. Since we did not define any variable names, the variables are by default enumerated by their index values.
  3. The data object’s length is returned as a list. We discuss this below under Multi-Data Object.

We can alternatively define variable names by passing it to the data object constructor as follows:

[4]:
data_array = np.random.random((100, 2))
var_names = ['A', 'B']

data_obj = TimeSeriesData(data_array, var_names=var_names)
print(f'This time series object has length {data_obj.length}')
print(f'This time series object has dimensions {data_obj.dim}')
print(f'This time series object has variables with names {data_obj.var_names}')
This time series object has length [100]
This time series object has dimensions 2
This time series object has variables with names ['A', 'B']

Finally, the data array can be retrieved as:

[5]:
data_array_ret, = data_obj.data_arrays

print('\nRetrieving data array from the data object and making sure they are exactly the same:')
assert (data_array_ret==data_array).all()
print(data_array.shape)
print(data_array_ret.shape)

Retrieving data array from the data object and making sure they are exactly the same:
(100, 2)
(100, 2)

Multi-Data Object

In the time series case, there can be use cases where we have multiple disjoint time series for the same dataset. For instance, the first time series may span January to March, and the second July to September. In this case, concatenating the two time series would be incorrect.

To support such use cases in our library, one can pass multiple numpy arrays to the data object constructor as follows:

[6]:
data_array1 = np.random.random((100, 2))
data_array2 = np.random.random((24, 2))
var_names = ['A', 'B']

data_obj = TimeSeriesData(data_array1, data_array2, var_names=var_names)
print(f'This time series object has length {data_obj.length}')
print(f'This time series object has dimensions {data_obj.dim}')
print(f'This time series object has variables with names {data_obj.var_names}')

print('\nRetrieving data array from the data object and making sure they are exactly the same:')
data_array1_ret,data_array2_ret = data_obj.data_arrays
assert (data_array1_ret==data_array1).all()
assert (data_array2_ret==data_array2).all()
print(data_array1.shape, data_array2.shape)
print(data_array1_ret.shape, data_array2_ret.shape)
This time series object has length [100, 24]
This time series object has dimensions 2
This time series object has variables with names ['A', 'B']

Retrieving data array from the data object and making sure they are exactly the same:
(100, 2) (24, 2)
(100, 2) (24, 2)

It should now be apparent that the data object’s length is returned as a list so that one can retrieve the length of each individual time series.

As a side note, all arrays must have the same number of dimensions; otherwise, the constructor will throw an error.

Data object Methods

We list two data object methods that may be useful for users:

  1. var_name2index: takes a variable name as input and returns the index of that variable.
  2. extract_array: extracts the arrays corresponding to the node names X, Y, Z, which are provided as inputs. X and Y are individual nodes, and Z is the set of nodes used as the conditioning set. More explanation below.

First we show below the usage of var_name2index:

[7]:
print(f"The index of variable B is {data_obj.var_name2index('B')}")
The index of variable B is 1

To understand the purpose of the extract_array method, note that in causal discovery, a typical operation is to perform conditional independence (CI) tests, where conditioned on some set of variables Z, we want to perform an independence test between two variables X and Y.

To perform these CI tests, a convenient approach is to list the variables X, Y and the set Z by name and relative time index, and then define a function which returns all instances of the corresponding variable values. For instance, in the example below, we are interested in performing a CI test between the variables X=(B,t) and Y=(A,t-2) conditioned on the variable set Z=[(A, t-1), (B, t-2)], over all values of t in the given time series dataset. We follow the naming conventions below:

  1. X is the variable B at the current time t. Since it is always t, we drop the time index and simply pass the variable name string.
  2. Y is the variable A from time step t-2 relative to X. We drop the character t and specify this choice as (A, -2).
  3. Each time-indexed variable inside the list Z follows the same naming convention as specified above for Y.

[8]:
data_array = np.random.random((5, 2))
var_names = ['A', 'B']
data_obj = TimeSeriesData(data_array, var_names=var_names)

X = 'B'
Y = ('A', -2)
Z = [('A', -1), ('B', -2)]

x,y,z = data_obj.extract_array(X,Y,Z, max_lag=3)

To understand the outputs x,y,z above, we print below the time series and these outputs with each element labeled with their respective variable name and time index.

[9]:

data_array = data_obj.data_arrays[0]
T = data_array.shape[0]
print('data_array = [')
for i in range(data_array.shape[0]):
    print(f'[A(t-{T-i-1}): {data_array[i][0]:.2f}, B(t-{T-i-1}): {data_array[i][1]:.2f}],')
print(']')

T = x.shape[0]
print(f'\nX = {X}\nx = [')
for i in range(x.shape[0]):
    print(f'[{X}(t-{T-i-1}): {x[i]:.2f}],')
print(']')

print(f'\nY = {Y}\ny = [')
for i in range(x.shape[0]):
    print(f'[{Y[0]}(t-{T-i-1-Y[1]}): {y[i]:.2f}],')
print(']')

print(f'\nZ = {Z}\nz = [')
for i in range(x.shape[0]):
    print(f'[{Z[0][0]}(t-{T-i-1-Z[0][1]}): {z[i][0]:.2f}, {Z[1][0]}(t-{T-i-1-Z[1][1]}): {z[i][1]:.2f}],')
print(']')
data_array = [
[A(t-4): 0.08, B(t-4): 0.49],
[A(t-3): 0.44, B(t-3): 0.08],
[A(t-2): 0.40, B(t-2): 0.34],
[A(t-1): 0.76, B(t-1): 0.13],
[A(t-0): 0.54, B(t-0): 0.62],
]

X = B
x = [
[B(t-1): 0.13],
[B(t-0): 0.62],
]

Y = ('A', -2)
y = [
[A(t-3): 0.44],
[A(t-2): 0.40],
]

Z = [('A', -1), ('B', -2)]
z = [
[A(t-2): 0.40, B(t-3): 0.08],
[A(t-1): 0.76, B(t-2): 0.34],
]

Notice that the number of rows in x, y, z are the same, and for any given row index, their values correspond to the variable names and relative time indices specified. These arrays can now be used to perform CI tests. Our causal discovery models use this method internally, but it can also be called directly if needed.

On a final note, if the specified list Z contains nodes whose relative lag exceeds the value of max_lag, they will be ignored. For instance, if Z contains ('A', -4) and max_lag=3, then this node will be removed from Z prior to computing the z array.
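The behavior of extract_array can be sketched conceptually in plain NumPy. In this illustrative sketch (the function name and argument layout are hypothetical, not the library's internals), columns are referenced by index and lags are non-positive integers:

```python
import numpy as np

def extract_xyz(data, x_col, y_node, z_nodes, max_lag):
    """Conceptual sketch of extract_array (illustrative only).

    x_col: column index of X at the current time t (lag 0)
    y_node: (column, lag) pair for Y, with lag <= 0
    z_nodes: list of (column, lag) pairs for the conditioning set Z
    """
    # drop conditioning nodes whose relative lag exceeds max_lag
    z_nodes = [(c, lag) for (c, lag) in z_nodes if -lag <= max_lag]
    T = data.shape[0]
    rows = range(max_lag, T)  # values of t with a full max_lag history
    x = np.array([data[t, x_col] for t in rows])
    y = np.array([data[t + y_node[1], y_node[0]] for t in rows])
    z = np.array([[data[t + lag, c] for (c, lag) in z_nodes] for t in rows])
    return x, y, z

# same configuration as the example above: X='B', Y=('A', -2),
# Z=[('A', -1), ('B', -2)], on a length-5 series with columns [A, B]
data = np.arange(10, dtype=float).reshape(5, 2)
x, y, z = extract_xyz(data, x_col=1, y_node=(0, -2),
                      z_nodes=[(0, -1), (1, -2)], max_lag=3)
```

Running this on the deterministic array above reproduces the row alignment shown earlier: each row of x, y, z corresponds to one value of t for which all lagged values exist.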

Tabular Data

The tabular data object behaves similarly to the time series object. The modules for the tabular case are as follows:

[10]:
from causalai.data.tabular import TabularData
from causalai.data.transforms.tabular import StandardizeTransform, Heterogeneous2DiscreteTransform

Data Pre-processing

The common data pre-processing transforms for both time series and tabular data are StandardizeTransform and Heterogeneous2DiscreteTransform. They can be imported respectively as follows:

  1. Time series:

from causalai.data.transforms.time_series import StandardizeTransform, Heterogeneous2DiscreteTransform

  2. Tabular:

from causalai.data.transforms.tabular import StandardizeTransform, Heterogeneous2DiscreteTransform

They function identically and may even be used interchangeably, but are provided under both the tabular and time_series modules for clarity.

StandardizeTransform: Transforms each column of the data, provided as a NumPy array, to have zero mean and unit variance. Ignores NaNs. Useful for continuous data.

Heterogeneous2DiscreteTransform: If the user data is heterogeneous, i.e., some variables are discrete while others are continuous, the supported causal discovery algorithms will not function properly. In order to support heterogeneous data, the Heterogeneous2DiscreteTransform can be used to make all the variables discrete, and then causal discovery algorithms that support discrete data can be used. The number of states to be used for discretization can be specified in the module.

In addition to the above transforms, for time series data, CausalAI also supports DifferenceTransform, which can be imported as follows:

from causalai.data.transforms.time_series import DifferenceTransform

DifferenceTransform: Transforms time series data by taking the difference between two time steps that are a specified interval apart, given by the argument order. May be used for both continuous and discrete time series data, if required.
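The effect of order can be sketched in plain NumPy. This is the conceptual operation only, not the library's implementation, and the function name is illustrative:

```python
import numpy as np

def difference(data, order=1):
    # subtract each row from the row `order` steps later,
    # so the result has `order` fewer rows than the input
    return data[order:] - data[:-order]

x = np.array([[1.0], [3.0], [6.0], [10.0]])
d1 = difference(x, order=1)  # consecutive differences: [[2.], [3.], [4.]]
d2 = difference(x, order=2)  # differences 2 steps apart: [[5.], [7.]]
```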

StandardizeTransform

Transforms each column of the data to have zero mean and unit variance.

[11]:
from causalai.data.transforms.time_series import StandardizeTransform, Heterogeneous2DiscreteTransform

data_array = np.random.random((100, 2))

StandardizeTransform_ = StandardizeTransform()
StandardizeTransform_.fit(data_array)

data_train_trans = StandardizeTransform_.transform(data_array)


print(f'Dimension-wise mean of the original data array: {data_array.mean(0)}')
print(f'Dimension-wise mean of the transformed data array: {data_train_trans.mean(0)}.'\
      f'\nNotice that this is close to 0.')

print(f'\nDimension-wise standard deviation of the original data array: {data_array.std(0)}')
print(f'Dimension-wise standard deviation of the transformed data array: {data_train_trans.std(0)}.'\
      f' \nNotice that this is close to 1.')


Dimension-wise mean of the original data array: [0.47513212 0.48655998]
Dimension-wise mean of the transformed data array: [4.15223411e-16 2.23154828e-16].
Notice that this is close to 0.

Dimension-wise standard deviation of the original data array: [0.29770807 0.28400914]
Dimension-wise standard deviation of the transformed data array: [0.99999944 0.99999938].
Notice that this is close to 1.

The StandardizeTransform class automatically ignores NaNs in the array:

[12]:
data_array = np.random.random((10, 2))
data_array[:2,0] = math.nan

StandardizeTransform_ = StandardizeTransform()
StandardizeTransform_.fit(data_array)

data_train_trans = StandardizeTransform_.transform(data_array)

print(f'Original Array: ')
print(data_array)

print(f'\nTransformed Array: ')
print(data_train_trans)

print('\nBelow we print the mean and standard deviation of the 0th column after ignoring the 1st 2 elements:')

print(f'\nDimension-wise mean of the original data array: {data_array[2:,0].mean(0)}')
print(f'Dimension-wise mean of the transformed data array: {data_train_trans[2:,0].mean(0)}.'\
      f'\nNotice that this is close to 0.')

print(f'\nDimension-wise standard deviation of the original data array: {data_array[2:,0].std(0)}')
print(f'Dimension-wise standard deviation of the transformed data array: {data_train_trans[2:,0].std(0)}.'\
      f' \nNotice that this is close to 1.')

Original Array:
[[       nan 0.80518464]
 [       nan 0.45221782]
 [0.24987259 0.61744902]
 [0.5178477  0.48176765]
 [0.67053628 0.14881708]
 [0.40713205 0.33657983]
 [0.69268823 0.39474171]
 [0.40225941 0.28154496]
 [0.79705495 0.89939579]
 [0.1331715  0.94285576]]

Transformed Array:
[[        nan  1.04677208]
 [        nan -0.32608451]
 [-1.09081273  0.31657859]
 [ 0.15865713 -0.2111511 ]
 [ 0.8705881  -1.50615494]
 [-0.3575694  -0.77585589]
 [ 0.9738745  -0.54963655]
 [-0.38028876 -0.9899128 ]
 [ 1.46049832  1.41320427]
 [-1.63494715  1.58224086]]

Below we print the mean and standard deviation of the 0th column after ignoring the 1st 2 elements:

Dimension-wise mean of the original data array: 0.4838203392232694
Dimension-wise mean of the transformed data array: -1.3877787807814457e-16.
Notice that this is close to 0.

Dimension-wise standard deviation of the original data array: 0.21447081975778504
Dimension-wise standard deviation of the transformed data array: 0.9999989129916689.
Notice that this is close to 1.

On a final note, the causal discovery algorithms also handle NaN instances automatically.

Heterogeneous2DiscreteTransform

Transforms an array of mixed continuous and discrete variables into a fully discrete array. The values of the already-discrete variables are not affected by the transformation. The number of states to use for discretization can be specified in the module.

[13]:
from causalai.data.transforms.tabular import Heterogeneous2DiscreteTransform

data_c = np.random.randn(10,2)
data_d = np.random.randint(0,2, (10,3))
data_array = np.concatenate([data_c, data_d], axis=1)
var_names = ['c1', 'c2', 'd1', 'd2', 'd3']
print(var_names)
print(data_array)

['c1', 'c2', 'd1', 'd2', 'd3']
[[-0.35585766  0.18792482  0.          1.          1.        ]
 [ 1.16930377  0.2151256   0.          0.          1.        ]
 [ 0.32261274  1.2809729   1.          0.          1.        ]
 [-1.09150846  0.09236801  1.          0.          0.        ]
 [-0.64023739  0.35585544  1.          1.          1.        ]
 [-1.10937773  0.97013573  1.          0.          1.        ]
 [-0.51653727  0.76753388  1.          0.          1.        ]
 [ 0.71953692 -0.49171197  0.          0.          0.        ]
 [ 2.02864175 -0.17647864  0.          1.          0.        ]
 [-0.94696578 -0.39476729  0.          0.          1.        ]]
[14]:
discrete = {'c1': False, 'c2': False, 'd1': True, 'd2': True, 'd3': True}
Heterogeneous2DiscreteTransform_ = Heterogeneous2DiscreteTransform(nstates=5)# specify number of states
Heterogeneous2DiscreteTransform_.fit(data_array, var_names=var_names, discrete=discrete)
data_transformed = Heterogeneous2DiscreteTransform_.transform(data_array)
print(data_transformed)
assert np.all(data_array[:,2:]==data_transformed[:,2:]),\
            f'Something went wrong. Discrete data before and after do not match!'
[[2. 2. 0. 1. 1.]
 [4. 2. 0. 0. 1.]
 [3. 4. 1. 0. 1.]
 [0. 1. 1. 0. 0.]
 [1. 3. 1. 1. 1.]
 [0. 4. 1. 0. 1.]
 [2. 3. 1. 0. 1.]
 [3. 0. 0. 0. 0.]
 [4. 1. 0. 1. 0.]
 [1. 0. 0. 0. 1.]]
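In the output above, each state appears equally often in the continuous columns, which is consistent with equal-frequency (quantile) binning. A rough NumPy sketch of such a scheme follows; the library's exact binning rule is an internal detail and may differ, and the function name here is illustrative:

```python
import numpy as np

def discretize_column(col, nstates=5):
    # rank each value via double argsort, then map ranks to nstates
    # equal-frequency bins (illustrative sketch only)
    ranks = col.argsort().argsort()
    return ranks * nstates // len(col)

col = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0])
states = discretize_column(col, nstates=5)  # each state 0..4 appears twice
```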

DifferenceTransform

Transforms time series data by taking the difference between two time steps that are a specified interval apart, given by the argument order.

[15]:
from causalai.data.transforms.time_series import DifferenceTransform

data_array = np.random.randn(10,2)
print(data_array)
[[ 0.71034335 -0.77817239]
 [ 0.41208121 -0.44224965]
 [ 0.16667321  0.42001276]
 [-0.46039254  0.53315306]
 [-0.8463023  -1.20623272]
 [ 1.12214032  0.55983087]
 [ 0.19491086  1.38217805]
 [-0.80278812  0.86078342]
 [-1.24378886  0.19386542]
 [ 0.26081174 -1.33093553]]
[16]:
DifferenceTransform_ = DifferenceTransform(order=1) # difference b/w consecutive time steps
DifferenceTransform_.transform(data_array)
[16]:
array([[-0.29826214,  0.33592274],
       [-0.245408  ,  0.86226242],
       [-0.62706575,  0.1131403 ],
       [-0.38590976, -1.73938578],
       [ 1.96844262,  1.76606359],
       [-0.92722946,  0.82234718],
       [-0.99769898, -0.52139463],
       [-0.44100073, -0.666918  ],
       [ 1.5046006 , -1.52480095]])
[17]:
DifferenceTransform_ = DifferenceTransform(order=2) # difference b/w every 2 time steps
DifferenceTransform_.transform(data_array)
[17]:
array([[-0.54367014,  1.19818515],
       [-0.87247375,  0.97540272],
       [-1.01297551, -1.62624548],
       [ 1.58253286,  0.0266778 ],
       [ 1.04121316,  2.58841077],
       [-1.92492844,  0.30095255],
       [-1.43869972, -1.18831263],
       [ 1.06359986, -2.19171895]])