Data Generator module

causalai.data.data_generator

causalai.data.data_generator.ConditionalDataGenerator(T, data_type='time_series', noise_fn=None, intervention=None, discrete=False, nstates=10, seed=1)

Generate data that is useful for testing conditional average treatment effect (CATE) estimation in causal inference.

The data is generated using the following structural equation model:

C = noise

W = C + noise

X = C*W + noise

Y = C*X + noise

Note that, for simplicity, the dependence between variables in the generated data is instantaneous only (no time-lagged dependence). Hence this data can be used for both the tabular and the time series case.
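The structural equations above can be sketched directly in NumPy. This is a minimal illustration, not the library's implementation: the helper name generate_conditional_data, the column order C, W, X, Y, standard normal noise, and the use of np.random.default_rng are all assumptions made for the example.

```python
import numpy as np

def generate_conditional_data(T, seed=1):
    """Hypothetical re-creation of the SEM above (not the library code)."""
    rng = np.random.default_rng(seed)
    # Each variable gets its own independent standard normal noise term.
    C = rng.standard_normal(T)
    W = C + rng.standard_normal(T)
    X = C * W + rng.standard_normal(T)
    Y = C * X + rng.standard_normal(T)
    data = np.column_stack([C, W, X, Y])
    var_names = ["C", "W", "X", "Y"]
    # Parents of each variable, read off the equations above.
    graph = {"C": [], "W": ["C"], "X": ["C", "W"], "Y": ["C", "X"]}
    return data, var_names, graph

data, var_names, graph = generate_conditional_data(T=500)
```

Because C multiplies the parent in the equations for X and Y, the effect of X on Y depends on the value of C, which is what makes this data suitable for conditional treatment effect tests.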

Parameters:
  • T (int) -- Number of samples.

  • data_type (str) -- Either time_series or tabular. Specifies whether the causal graph of the generated data is expressed in tabular or time series form (default: time_series).

  • noise_fn (list of callables, optional) -- List of functions, each of which takes t as input and returns a random vector of length t (default: list of np.random.randn).

  • intervention (dict) -- Dictionary of format: {W:np.array, ...} containing only keys of intervened variables with the value being the array of length T with interventional values. Set values to np.nan to leave specific time points of a variable un-intervened.

  • discrete (bool or dict) -- When a bool, it specifies whether all the variables are discrete or all of them are continuous. If True, the generated data is discretized into nstates uniform bins (default: False). Alternatively, if discrete is specified as a dictionary, the keys of this dictionary must be the variable names and the value corresponding to each key must be True or False. A value of False means the variable is continuous; True means it is discrete.

  • nstates (int) -- When discrete is True, nstates specifies the number of bins used to discretize the data (default: 10).

  • seed (int) -- Seed value for random number generation, set for reproducibility (default: 1).

Returns:

A tuple of 3 items--

  • data: Generated data array of shape (T, 4).

  • var_names: List of variable names corresponding to the columns of data

  • graph: The causal graph that was used to generate the data array. graph is a dictionary with variable names as keys and the list of parent nodes of each key as the corresponding values.

Return type:

tuple[ndarray, list, dict]
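The NaN convention of the intervention argument can be illustrated with a short sketch. The helper apply_intervention is hypothetical and only mirrors the documented semantics (use the interventional value where one is given, keep the generated value where the entry is np.nan); it is not the library's internal code.

```python
import numpy as np

def apply_intervention(generated, intervened):
    """Take the interventional value wherever it is not NaN, else the generated one."""
    generated = np.asarray(generated, dtype=float)
    intervened = np.asarray(intervened, dtype=float)
    return np.where(np.isnan(intervened), generated, intervened)

T = 6
rng = np.random.default_rng(1)
C = rng.standard_normal(T)
W_natural = C + rng.standard_normal(T)  # W as the SEM would generate it

# Intervene on W at time points 2..4 only; NaN leaves the rest un-intervened.
w_values = np.full(T, np.nan)
w_values[2:5] = 3.0
intervention = {"W": w_values}

W = apply_intervention(W_natural, intervention["W"])
# Downstream variables are then computed from the intervened W.
X = C * W + rng.standard_normal(T)
```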

causalai.data.data_generator.DataGenerator(sem_dict, T, noise_fn=None, intervention=None, discrete=False, nstates=10, seed=1)

Use structural equation model to generate time series and tabular data.

Parameters:
  • sem_dict (dict) -- Structural equation model (SEM) provided as a dictionary of format: {<child node>: [(<parent node>, coefficient, function), ...]}. Note that each value is a list of tuples of the form (<parent node>, coefficient, function). For time series data, the parent node must be a tuple of size 2, where the 1st element (str) specifies the parent node and the 2nd element is the time lag (must be non-positive). A lag of 0 means that the parent is at the same time step as the child (instantaneous effect). For tabular data, the parent node is simply a str. Finally, the coefficient must be of type float, and function is a Python callable that takes one float argument as input. For example, in sem_dict = {a:[((b, -2), 0.2, func)], b:[((b, -1), 0.5, func),...]}, the item a:[((b,-2), 0.2, func)] implies: v(a,t) = func(v(b, -2)*0.2), where v(a,t) denotes the value of node a at time t, and v(b, -2) denotes the value of node b at time t-2. For a tabular example, in sem_dict = {a:[(b, 0.2, func)], b:[(c, 0.5, func),...]}, the item a:[(b, 0.2, func)] implies: v(a) = func(v(b)*0.2).

  • T (int) -- Number of samples.

  • noise_fn (list of callables, optional) -- List of functions, each of which takes t as input and returns a random vector of length t (default: list of np.random.randn).

  • intervention (dict) -- Dictionary of format: {1:np.array, ...} containing only keys of intervened variables with the value being the array of length T with interventional values. Set values to np.nan to leave specific time points of a variable un-intervened.

  • discrete (bool or dict) -- When a bool, it specifies whether all the variables are discrete or all of them are continuous. If True, the generated data is discretized into nstates uniform bins (default: False). Alternatively, if discrete is specified as a dictionary, the keys of this dictionary must be the variable names and the value corresponding to each key must be True or False. A value of False means the variable is continuous; True means it is discrete.

  • nstates (int) -- When discrete is True, nstates specifies the number of bins used to discretize the data (default: 10).

  • seed (int) -- Seed value for random number generation, set for reproducibility (default: 1).

Returns:

A tuple of 3 items--

  • data: Generated data array of shape (T, number of variables).

  • var_names: List of variable names corresponding to the columns of data

  • graph: The causal graph that was used to generate the data array. graph is a dictionary with variable names as keys and the list of parent nodes of each key as the corresponding values.

Return type:

tuple[ndarray, list, dict]
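To make the time series sem_dict format concrete, the sketch below evaluates such a dictionary with plain NumPy. It is a simplified stand-in rather than the library's generator: the function simulate_time_series_sem is hypothetical, and it assumes that every variable appears as a key, that all lags are strictly negative, and that standard normal noise is added to each variable.

```python
import numpy as np

def simulate_time_series_sem(sem_dict, T, seed=1):
    """Hypothetical evaluator for a lagged-only SEM in the documented format."""
    rng = np.random.default_rng(seed)
    var_names = list(sem_dict.keys())
    # Largest look-back needed, used as a warm-up period before the T samples.
    lags = [-parent[1] for parents in sem_dict.values() for parent, _, _ in parents]
    max_lag = max(lags, default=0)
    # Start every variable from its noise term, then add parent contributions.
    values = {name: rng.standard_normal(T + max_lag) for name in var_names}
    for t in range(max_lag, T + max_lag):
        for child in var_names:
            for (parent, lag), coef, func in sem_dict[child]:
                # Documented form: func(v(parent, t + lag) * coef), with lag < 0.
                values[child][t] += func(values[parent][t + lag] * coef)
    data = np.column_stack([values[name][max_lag:] for name in var_names])
    graph = {child: [parent for parent, _, _ in sem_dict[child]] for child in var_names}
    return data, var_names, graph

def linear(x):
    return x

sem_dict = {
    "a": [(("b", -2), 0.2, linear)],
    "b": [(("b", -1), 0.5, linear)],
}
data, var_names, graph = simulate_time_series_sem(sem_dict, T=100)
```

Here column a at time t receives 0.2 times the value of b at time t-2, matching the worked example in the sem_dict description.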

causalai.data.data_generator.GenerateRandomTabularSEM(var_names=['a', 'b', 'c', 'd', 'e', 'f'], max_num_parents=4, seed=0, fn: ~typing.Callable = <function <lambda>>, coef: float = 0.1)

Generate a random structural equation model (SEM) for tabular data using the following procedure: Randomly divide variables into non-overlapping groups of size between 3 and num_vars. Then randomly create edges between a preceding group and a following group such that max_num_parents is never exceeded.

Parameters:
  • var_names (list) -- Names of variables in the SEM in the form of a list of str.

  • max_num_parents (int) -- Maximum number of causal parents allowed in the randomly generated SEM.

  • seed (int) -- Random seed used for reproducibility.

  • fn (Callable) -- Function applied to a parent variable when generating child variable data. Default: Linear function for linear causal relation.

  • coef (float) -- coefficient of parent variables in the randomly generated SEM.

causalai.data.data_generator.GenerateRandomTimeseriesSEM(var_names=['a', 'b', 'c', 'd', 'e'], max_num_parents=4, max_lag=4, seed=0, fn: ~typing.Callable = <function <lambda>>, coef: float = 0.1)

Generate a random structural equation model (SEM) for time series data.

Parameters:
  • var_names (list) -- Names of variables in the SEM in the form of a list of str.

  • max_num_parents (int) -- Maximum number of causal parents allowed in the randomly generated SEM.

  • max_lag (int) -- Maximum time lag between parent and child variable allowed in the randomly generated SEM. Must be non-negative.

  • seed (int) -- Random seed used for reproducibility.

  • fn (Callable) -- Function applied to a parent variable when generating child variable data. Default: Linear function for linear causal relation.

  • coef (float) -- coefficient of parent variables in the randomly generated SEM.

causalai.data.data_generator.GenerateSparseTabularSEM(var_names=['a', 'b', 'c', 'd', 'e', 'f'], graph_density=0.1, seed=0, fn: ~typing.Callable = <function <lambda>>, coef: float = 0.1)

Generate a structural equation model (SEM) for tabular data using the following procedure: For N nodes, enumerate them from 0-N. For all i,j between 0-N, if i < j, the edge from vi to vj exists with probability graph_density, and if i >= j there cannot be an edge between them.

Parameters:
  • var_names (list) -- Names of variables in the SEM in the form of a list of str.

  • graph_density (float in the range (0,1]) -- Probability that an edge between node i and j exists.

  • seed (int) -- Random seed used for reproducibility.

  • fn (Callable) -- Function applied to a parent variable when generating child variable data. Default: Linear function for linear causal relation.

  • coef (float) -- coefficient of parent variables in the randomly generated SEM.
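The sampling procedure described above can be sketched as follows. The function sparse_tabular_sem is a hypothetical illustration that returns a dictionary in the tabular sem_dict format; the random number generator and the order in which edges are drawn are assumptions and will not reproduce the library's output for a given seed.

```python
import numpy as np

def sparse_tabular_sem(var_names, graph_density=0.1, seed=0, fn=lambda x: x, coef=0.1):
    """Hypothetical sketch: edge v_i -> v_j is drawn only for i < j."""
    rng = np.random.default_rng(seed)
    sem_dict = {name: [] for name in var_names}
    n = len(var_names)
    for i in range(n):
        for j in range(i + 1, n):
            # Each forward edge exists independently with probability graph_density.
            if rng.random() < graph_density:
                sem_dict[var_names[j]].append((var_names[i], coef, fn))
    return sem_dict

sem_dict = sparse_tabular_sem(["a", "b", "c", "d", "e", "f"], graph_density=0.5)
```

Since edges only run from a lower index to a higher one, the resulting graph is acyclic by construction.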

causalai.data.data_generator.GenerateSparseTimeSeriesSEM(var_names=['a', 'b', 'c', 'd', 'e'], graph_density=0.1, max_lag=4, seed=0, fn: ~typing.Callable = <function <lambda>>, coef: float = 0.1)

Generate a structural equation model (SEM) for time series data using the following procedure: For N nodes, enumerate them from 0-N. For each time lag (until max_lag), for all i,j between 0-N, if i < j, the edge from vi to vj exists with probability graph_density, and if i >= j there cannot be an edge between them.

Parameters:
  • var_names (list) -- Names of variables in the SEM in the form of a list of str.

  • graph_density (float in the range (0,1]) -- Probability that an edge between node i and j exists.

  • max_lag (int) -- Maximum time lag between parent and child variable allowed in the randomly generated SEM.

  • seed (int) -- Random seed used for reproducibility.

  • fn (Callable) -- Function applied to a parent variable when generating child variable data. Default: Linear function for linear causal relation.

  • coef (float) -- Coefficient of parent variables in the randomly generated SEM. Note: larger values may cause exploding values in the data array for some seeds.
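A time series version of the same procedure repeats the edge sampling once per lag. The sketch below is hypothetical and not the library's code: the function name sparse_time_series_sem is made up, and it assumes that lags run from 1 to max_lag (the description does not say whether lag 0 is included) and that parents are encoded as (name, -lag) tuples as in the sem_dict format of DataGenerator.

```python
import numpy as np

def sparse_time_series_sem(var_names, graph_density=0.1, max_lag=4, seed=0,
                           fn=lambda x: x, coef=0.1):
    """Hypothetical sketch: for each lag, edge v_i -> v_j is drawn only for i < j."""
    rng = np.random.default_rng(seed)
    sem_dict = {name: [] for name in var_names}
    n = len(var_names)
    for lag in range(1, max_lag + 1):  # assumption: lag 0 is excluded
        for i in range(n):
            for j in range(i + 1, n):
                if rng.random() < graph_density:
                    # Parent is v_i at time t - lag, stored with a negative lag.
                    sem_dict[var_names[j]].append(((var_names[i], -lag), coef, fn))
    return sem_dict

sem_dict = sparse_time_series_sem(["a", "b", "c", "d", "e"], graph_density=0.3, max_lag=2)
```

Keeping coef small, as the default of 0.1 does, limits how much each lagged parent feeds into its child, which is consistent with the note above about exploding values.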