PC Tabular module

causalai.models.tabular.pc

The Peter-Clark (PC) algorithm is one of the most general-purpose algorithms for causal discovery. It can be used for both tabular and time series data, of both continuous and discrete types. Briefly, the PC algorithm works in two steps: it first identifies the undirected causal graph (the skeleton), and then (partially) directs the edges. In the first step, we test for a causal connection between every pair of variables by checking whether there exists a condition set (a subset of variables excluding the two variables in question) conditioned on which the two variables are independent; if such a set exists, the edge between them is removed. In the second step, the edges are directed by identifying colliders. Note that the edge orientation strategy of the PC algorithm may leave some edges undirected, resulting in a partially directed graph.
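The two steps can be illustrated with a minimal, self-contained sketch. It is not the causalai implementation: pc_sketch and oracle are hypothetical names, and the conditional independence test is replaced by an oracle that already knows the true graph A -> C <- B. The sketch is only meant to show how edge pruning and collider orientation fit together.

```python
from itertools import combinations, permutations


def pc_sketch(variables, is_independent):
    """Toy PC: (1) prune a complete undirected graph, (2) orient colliders."""
    # Step 1: start fully connected, then remove an edge x - y as soon as some
    # condition set drawn from the current neighbours of x separates the pair.
    adjacent = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for size in range(len(variables) - 1):
        for x, y in permutations(variables, 2):
            if y not in adjacent[x]:
                continue
            for cond in combinations(sorted(adjacent[x] - {y}), size):
                if is_independent(x, y, set(cond)):
                    adjacent[x].discard(y)
                    adjacent[y].discard(x)
                    sepset[frozenset((x, y))] = set(cond)
                    break

    # Step 2: for every unshielded triple x - z - y (x and y not adjacent),
    # orient x -> z <- y when z is not in the set that separated x and y.
    directed = set()
    for z in variables:
        for x, y in combinations(sorted(adjacent[z]), 2):
            if y not in adjacent[x] and z not in sepset[frozenset((x, y))]:
                directed.update({(x, z), (y, z)})
    return adjacent, directed


def oracle(x, y, cond):
    """Hypothetical CI oracle for the true graph A -> C <- B."""
    return {x, y} == {"A", "B"} and "C" not in cond


adjacent, directed = pc_sketch(["A", "B", "C"], oracle)
print(sorted(directed))  # [('A', 'C'), ('B', 'C')]
```

In this toy case A and B are separated by the empty set, so the edge A - B is removed, and because C is not in that separating set the remaining edges are oriented into C. With real data the oracle would be replaced by a statistical test and a p-value threshold.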

The PC algorithm makes four core assumptions: (1) the Causal Markov condition, which implies that two variables that are d-separated in the causal graph are probabilistically independent; (2) faithfulness, i.e., no conditional independence holds in the data unless it is implied by the Causal Markov condition; (3) no hidden confounders; and (4) no cycles in the causal graph.
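To make the faithfulness assumption concrete, the following sketch builds a small linear example in which it fails. The graph, coefficients, and fixed "noise" patterns are all assumptions chosen for illustration: X causes Y both directly (coefficient a) and through Z (coefficients b and c), and a + b*c = 0, so the two paths cancel exactly. The noise patterns are mutually orthogonal so that the arithmetic is exact.

```python
def covariance(u, v):
    """Population-style covariance of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v)) / n


# Assumed linear model: Z = b*X + ez and Y = a*X + c*Z + ey.
a, b, c = -1.0, 2.0, 0.5           # chosen so that a + b*c == 0
x = [1.0, 1.0, -1.0, -1.0]
ez = [1.0, -1.0, 1.0, -1.0]        # orthogonal to x
ey = [1.0, -1.0, -1.0, 1.0]        # orthogonal to x and ez

z = [b * xi + e for xi, e in zip(x, ez)]
y = [a * xi + c * zi + e for xi, zi, e in zip(x, z, ey)]

cov_xz = covariance(x, z)          # X and Z are clearly dependent
cov_xy = covariance(x, y)          # direct and indirect effects cancel
print(cov_xz, cov_xy)
```

Here X is a direct cause of Y, yet the two are uncorrelated. A correlation-based independence test would therefore remove the X - Y edge, which is why PC relies on faithfulness to rule out such exact cancellations.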

class causalai.models.tabular.pc.PCSingle(data: ~causalai.data.tabular.TabularData, prior_knowledge: ~causalai.models.common.prior_knowledge.PriorKnowledge | None = None, CI_test: ~causalai.models.common.CI_tests.partial_correlation.PartialCorrelation | ~causalai.models.common.CI_tests.kci.KCI | ~causalai.models.common.CI_tests.discrete_ci_tests.DiscreteCI_tests = <causalai.models.common.CI_tests.partial_correlation.PartialCorrelation object>, use_multiprocessing: bool | None = False)

Peter-Clark (PC) algorithm for estimating the parents of a single variable.

__init__(data: ~causalai.data.tabular.TabularData, prior_knowledge: ~causalai.models.common.prior_knowledge.PriorKnowledge | None = None, CI_test: ~causalai.models.common.CI_tests.partial_correlation.PartialCorrelation | ~causalai.models.common.CI_tests.kci.KCI | ~causalai.models.common.CI_tests.discrete_ci_tests.DiscreteCI_tests = <causalai.models.common.CI_tests.partial_correlation.PartialCorrelation object>, use_multiprocessing: bool | None = False)

PC algorithm for estimating the parents of a single variable.

Parameters:

data (TabularData object) -- A TabularData object that contains attributes like data.data_arrays, which is a list of numpy arrays of shape (observations N, variables D).
prior_knowledge (PriorKnowledge object) -- Specify prior knowledge for the causal discovery process by either forbidding links that are known not to exist, or adding back links that are known to exist based on expert knowledge. See the PriorKnowledge class for more details.
CI_test (PartialCorrelation, KCI or DiscreteCI_tests object) -- This object performs the conditional independence tests (default: PartialCorrelation). See the respective class for more details.
use_multiprocessing (bool) -- If True, computations are performed using multi-processing, which makes the algorithm faster (default: False).
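As a rough illustration of what a partial-correlation CI test computes, the sketch below uses the textbook first-order partial correlation together with a Fisher z-test. It is not the library's PartialCorrelation class: the function names are hypothetical, at most one conditioning variable is supported, and the data are a fixed pattern in which X and Y share the common cause Z.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


def partial_corr_test(x, y, z=None):
    """Return (statistic, p-value) for x independent of y given optional z."""
    r, k = pearson(x, y), 0
    if z is not None:
        r_xz, r_yz = pearson(x, z), pearson(y, z)
        r = (r - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
        k = 1
    # Fisher z-transform: approximately standard normal under independence.
    # (math.atanh raises ValueError if r is exactly +1 or -1.)
    z_stat = math.atanh(r) * math.sqrt(len(x) - k - 3)
    pvalue = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided
    return r, pvalue


# Fixed toy data: Z is a common cause, X = Z + e1 and Y = Z + e2.
zs = [1.0, 1.0, -1.0, -1.0] * 50
e1 = [1.0, -1.0, 1.0, -1.0] * 50
e2 = [1.0, -1.0, -1.0, 1.0] * 50
xs = [p + q for p, q in zip(zs, e1)]
ys = [p + q for p, q in zip(zs, e2)]

r_marginal, p_marginal = partial_corr_test(xs, ys)
r_given_z, p_given_z = partial_corr_test(xs, ys, zs)
print(r_marginal, p_marginal < 0.05)   # dependent when Z is ignored
print(r_given_z, p_given_z > 0.05)     # independent once Z is conditioned on
```

With a threshold of 0.05, the marginal test rejects independence while the conditional test does not, which is the pattern PC uses to remove the spurious X - Y link.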

run(target_var: int | str, pvalue_thres: float = 0.05, max_condition_set_size: int | None = 4, full_cd: bool = False) → ResultInfoTabularSingle

Runs the PC algorithm for estimating the causal strength of all potential parents of a single variable.

Parameters:

target_var (int or str) -- Target variable index or name for which parents need to be estimated.
pvalue_thres (float) -- Significance level used for hypothesis testing (default: 0.05). Candidate parents with p-values above pvalue_thres are ignored, and the rest are returned as the causes of target_var.
max_condition_set_size (int) -- If not None, independence tests using condition sets of size {0, 1, ..., max_condition_set_size} are performed (which are cheaper) before using condition sets involving all the candidate parents (default: 4). For example, max_condition_set_size=0 implies that the greedy procedure will only consider condition sets of size 0, eliminating the causal link between target_var and a specific variable if the p-value between them turns out to be larger than pvalue_thres. Similarly, max_condition_set_size=1 will consider condition sets of size 0 and 1. The value of max_condition_set_size can be at most the total number of parents minus 1. If a larger value is specified, min(max_condition_set_size, len(all_parents)-1) is used instead. If None is given, then condition sets involving all the candidate parents are used. While each CI test in this case is more expensive than in the greedy case, the number of CI tests is limited to the number of candidate parents, which is fewer than in the greedy case.
full_cd (bool) -- This argument is meant for internal use only, to handle multiprocessing when set to True (default: False).
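The effect of max_condition_set_size can be sketched as follows. This is an illustration of the description above, not the library's internal code: condition_sets is a hypothetical helper, and the assumption is that condition sets for a given candidate parent are drawn from the other candidate parents.

```python
from itertools import combinations


def condition_sets(candidate_parents, current_var, max_condition_set_size):
    """List the condition sets used when testing the link from current_var."""
    others = [p for p in candidate_parents if p != current_var]
    if max_condition_set_size is None:
        # One (more expensive) test that conditions on all other candidates.
        return [tuple(others)]
    # Clip to the largest size that is actually possible.
    max_size = min(max_condition_set_size, len(candidate_parents) - 1)
    sets = []
    for size in range(max_size + 1):
        sets.extend(combinations(others, size))
    return sets


candidates = ["a", "b", "c"]
print(condition_sets(candidates, "a", 0))     # [()]
print(condition_sets(candidates, "a", 1))     # [(), ('b',), ('c',)]
print(condition_sets(candidates, "a", 10))    # clipped to size 2
print(condition_sets(candidates, "a", None))  # [('b', 'c')]
```

The greedy setting runs many cheap tests, while None runs a single test per candidate parent with the largest possible condition set.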

Returns:

Dictionary with three keys:

parents : List of estimated parents.
value_dict : Dictionary of form {var3_name:float, ...} containing the test statistic of a link.
pvalue_dict : Dictionary of form {var3_name:float, ...} containing the p-value corresponding to the above test statistic.

Return type:

dict

class causalai.models.tabular.pc.PC(data: ~causalai.data.tabular.TabularData, prior_knowledge: ~causalai.models.common.prior_knowledge.PriorKnowledge | None = None, CI_test: ~causalai.models.common.CI_tests.partial_correlation.PartialCorrelation | ~causalai.models.common.CI_tests.kci.KCI | ~causalai.models.common.CI_tests.discrete_ci_tests.DiscreteCI_tests = <causalai.models.common.CI_tests.partial_correlation.PartialCorrelation object>, use_multiprocessing: bool | None = False, **kargs)

Peter-Clark (PC) algorithm for estimating the parents of all variables.

__init__(data: ~causalai.data.tabular.TabularData, prior_knowledge: ~causalai.models.common.prior_knowledge.PriorKnowledge | None = None, CI_test: ~causalai.models.common.CI_tests.partial_correlation.PartialCorrelation | ~causalai.models.common.CI_tests.kci.KCI | ~causalai.models.common.CI_tests.discrete_ci_tests.DiscreteCI_tests = <causalai.models.common.CI_tests.partial_correlation.PartialCorrelation object>, use_multiprocessing: bool | None = False, **kargs)

PC algorithm for estimating the parents of all variables.

Parameters:

data (TabularData object) -- A TabularData object that contains attributes like data.data_arrays, which is a list of numpy arrays of shape (observations N, variables D).
prior_knowledge (PriorKnowledge object) -- Specify prior knowledge for the causal discovery process by either forbidding links that are known not to exist, or adding back links that are known to exist based on expert knowledge. See the PriorKnowledge class for more details.
CI_test (PartialCorrelation, KCI or DiscreteCI_tests object) -- This object performs the conditional independence tests (default: PartialCorrelation). See the respective class for more details.
use_multiprocessing (bool) -- If True, computations are performed using multi-processing, which makes the algorithm faster (default: False).

get_parents(pvalue_thres: float = 0.05, target_var: int | str | None = None) → Dict[int | str, Tuple[int | str]]

Assuming run() has already been called, get_parents returns a dictionary. The keys of this dictionary are the variable names, and the corresponding values are the lists of parent names that cause each variable under the given pvalue_thres.

Parameters:

pvalue_thres (float) -- Significance level used for hypothesis testing (default: 0.05).
target_var (str or int, optional) -- If specified (must be one of the data variable names), only the parents of this variable are returned, as a list. Otherwise a dictionary is returned where each key is a target variable name and the corresponding value is the list of its parents.

Returns:

Dictionary with D keys, where D is the number of variables. The value corresponding to each key is the list of parent names that cause that variable under the given pvalue_thres.

Return type:

dict
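The thresholding described for get_parents can be sketched on a plain dictionary. This is a hypothetical stand-in, not the library method: get_parents_sketch is an illustrative name, the p-values are made up, only the pvalue_dict key of each per-variable result is shown, and treating a p-value equal to the threshold as significant is an assumption.

```python
def get_parents_sketch(results, pvalue_thres=0.05, target_var=None):
    """Keep, for each variable, the candidates whose p-value passes the threshold."""
    parents = {
        var: [name for name, p in info["pvalue_dict"].items() if p <= pvalue_thres]
        for var, info in results.items()
    }
    return parents if target_var is None else parents[target_var]


# Made-up per-variable results, shaped like the dictionaries described above.
results = {
    "A": {"pvalue_dict": {"B": 0.40, "C": 0.01}},
    "B": {"pvalue_dict": {"A": 0.40, "C": 0.03}},
    "C": {"pvalue_dict": {"A": 0.01, "B": 0.03}},
}

print(get_parents_sketch(results))                     # {'A': ['C'], 'B': ['C'], 'C': ['A', 'B']}
print(get_parents_sketch(results, pvalue_thres=0.02))  # {'A': ['C'], 'B': [], 'C': ['A']}
print(get_parents_sketch(results, target_var="C"))     # ['A', 'B']
```

Lowering the threshold can only shrink each parent list, which is why the same stored p-values can be re-queried with different values of pvalue_thres without re-running the tests.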

run(pvalue_thres: float = 0.05, max_condition_set_size: int | None = None) → Tuple[Dict[int | str, ResultInfoTabularFull], List[Tuple[int | str, int | str]]]

Runs the PC algorithm for estimating the causal strength of all potential parents of all the variables.

Parameters:

pvalue_thres (float) -- Significance level used for hypothesis testing (default: 0.05). Candidate parents with p-values above pvalue_thres are ignored, and the rest are returned as the causes of the corresponding target variable.
max_condition_set_size (int) -- If not None, independence tests using condition sets of size {0, 1, ..., max_condition_set_size} are performed (which are cheaper) before using condition sets involving all the candidate parents (default: None). For example, max_condition_set_size=0 implies that the greedy procedure will only consider condition sets of size 0, eliminating the causal link between a target variable and a specific variable if the p-value between them turns out to be larger than pvalue_thres. Similarly, max_condition_set_size=1 will consider condition sets of size 0 and 1. The value of max_condition_set_size can be at most the total number of parents minus 1. If a larger value is specified, min(max_condition_set_size, len(all_parents)-1) is used instead. If None is given, then condition sets involving all the candidate parents are used. While each CI test in this case is more expensive than in the greedy case, the number of CI tests is limited to the number of candidate parents, which is fewer than in the greedy case.

Returns:

Dictionary with D keys, where D is the number of variables. The value corresponding to each key is the dictionary output of PCSingle.run.

Return type:

dict