Grow-Shrink Tabular module

causalai.models.tabular.grow_shrink

The Grow-Shrink algorithm can be used to discover the minimal Markov blanket (MB) of a target variable in tabular data. An MB is a minimal conditioning set that makes the target variable independent of all other variables; under the faithfulness assumption, which we make here, the MB is unique and consists of the parents, children, and co-parents (the other parents of the target's children) of the target variable. The MB can be used for feature selection.

The Grow-Shrink algorithm operates in two phases, called growth and shrink. The growth phase first adds to the MB estimate the variables that are unconditionally dependent on the target variable, then conditions on those variables and adds the variables that are conditionally dependent on the target. Assuming perfect conditional independence testing, this yields a superset of the actual MB. The shrink phase then removes from the MB estimate every variable that is independent of the target variable conditional on all other variables in the estimate. The algorithm does not partition the estimated MB into parents, children, and co-parents.
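The two phases described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it replaces the statistical CI test with a d-separation oracle on a small hypothetical DAG, which stands in for "perfect conditional independence testing". The function names (ancestral_set, d_separated, grow_shrink) and the example graph are assumptions made for this sketch.

```python
from itertools import combinations


def ancestral_set(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def d_separated(parents, x, y, cond):
    """Oracle CI test: True if x and y are d-separated given cond."""
    keep = ancestral_set(parents, {x, y} | set(cond))
    # Moralize the ancestral subgraph: connect each node to its parents
    # and connect every pair of parents that share a child.
    adj = {node: set() for node in keep}
    for child in keep:
        for parent in parents[child]:
            adj[child].add(parent)
            adj[parent].add(child)
        for a, b in combinations(parents[child], 2):
            adj[a].add(b)
            adj[b].add(a)
    # x and y are d-separated iff no path joins them once cond is removed.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for neighbor in adj[node] - set(cond) - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return True


def grow_shrink(variables, target, independent):
    """Two-phase Markov blanket search using a CI test independent(x, y, cond)."""
    candidates = [v for v in variables if v != target]
    # Growth, step 1: keep variables unconditionally dependent on the target.
    grown = [v for v in candidates if not independent(v, target, [])]
    # Growth, step 2: condition on those variables and add newly dependent ones.
    first_pass = list(grown)
    for v in candidates:
        if v not in first_pass and not independent(v, target, first_pass):
            grown.append(v)
    # Shrink: drop variables independent of the target given the rest of the estimate.
    return [v for v in grown
            if not independent(v, target, [u for u in grown if u != v])]


# Hypothetical DAG as {node: parents}: E -> A -> T -> C <- B, C -> D, F isolated.
parents = {"E": [], "A": ["E"], "T": ["A"], "B": [],
           "C": ["T", "B"], "D": ["C"], "F": []}


def oracle(x, y, cond):
    return d_separated(parents, x, y, cond)


blanket = grow_shrink(list(parents), "T", oracle)
print(blanket)  # ['A', 'C', 'B']
```

In this example the growth phase admits E and D (an ancestor and a descendant that are unconditionally dependent on T) and, in its second step, the co-parent B; the shrink phase then removes E and D, leaving the parent A, the child C, and the co-parent B.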

The assumptions we make for the Grow-Shrink algorithm are: 1. the Causal Markov condition, which implies that two variables that are d-separated in a causal graph are probabilistically independent, 2. faithfulness, i.e., the only conditional independences that hold are those implied by the Causal Markov condition, 3. no hidden confounders, and 4. no cycles in the causal graph.

class causalai.models.tabular.grow_shrink.GrowShrink(data: ~causalai.data.tabular.TabularData, prior_knowledge: ~causalai.models.common.prior_knowledge.PriorKnowledge | None = None, CI_test: ~causalai.models.common.CI_tests.partial_correlation.PartialCorrelation | ~causalai.models.common.CI_tests.kci.KCI | ~causalai.models.common.CI_tests.discrete_ci_tests.DiscreteCI_tests | ~causalai.models.common.CI_tests.ccit.CCITtest = <causalai.models.common.CI_tests.partial_correlation.PartialCorrelation object>, use_multiprocessing: bool | None = False, update_shrink: bool = False)

Grow-Shrink (GS) algorithm for estimating a minimal Markov blanket in tabular data. For details, see: "Bayesian Network Induction via Local Neighborhoods", Dimitris Margaritis and Sebastian Thrun, NeurIPS 1999.

__init__(data: ~causalai.data.tabular.TabularData, prior_knowledge: ~causalai.models.common.prior_knowledge.PriorKnowledge | None = None, CI_test: ~causalai.models.common.CI_tests.partial_correlation.PartialCorrelation | ~causalai.models.common.CI_tests.kci.KCI | ~causalai.models.common.CI_tests.discrete_ci_tests.DiscreteCI_tests | ~causalai.models.common.CI_tests.ccit.CCITtest = <causalai.models.common.CI_tests.partial_correlation.PartialCorrelation object>, use_multiprocessing: bool | None = False, update_shrink: bool = False)

Grow-Shrink (GS) algorithm for estimating a minimal Markov blanket.

Parameters:
  • data (TabularData object) -- It contains data.values, a numpy array of shape (observations N, variables D).

  • prior_knowledge (PriorKnowledge object) -- Specify prior knowledge to the causal discovery process by either forbidding links/co-parents that are known to not exist, or adding back links/co-parents that do exist based on expert knowledge. See the PriorKnowledge class for more details.

  • CI_test (PartialCorrelation, KCI, DiscreteCI_tests, or CCITtest object) -- This object performs the conditional independence tests (default: PartialCorrelation). See the respective class for more details.

  • use_multiprocessing (bool) -- If True, computations are performed using multiprocessing, which can make the algorithm faster (default: False).

  • update_shrink (bool) -- Whether to update the Markov blanket estimate during the shrink phase (default: False). update_shrink=True reduces the size of the conditioning sets tested (which usually improves the quality of the CI test), but makes the algorithm susceptible to cumulative error: once a variable from the minimal Markov blanket is mistakenly removed because of a CI test error, later tests no longer condition on it. Note: this option disables multiprocessing in the shrink phase.
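The effect of update_shrink on conditioning-set size can be illustrated with a small standalone sketch. This is one plausible reading of the parameter description above, not the library's code: the shrink function, the toy oracle, and the variable names are all hypothetical.

```python
def shrink(grown, target, independent, update_shrink=False):
    """Shrink phase; returns the pruned blanket and the conditioning-set sizes used."""
    blanket = list(grown)
    cond_sizes = []
    for var in grown:
        # With update_shrink, condition on the current (already pruned) estimate;
        # otherwise always condition on the full output of the growth phase.
        base = blanket if update_shrink else grown
        cond = [v for v in base if v != var]
        cond_sizes.append(len(cond))
        if independent(var, target, cond):
            blanket.remove(var)
    return blanket, cond_sizes


# Hypothetical setup: the true Markov blanket of "T" is {A, B, C}, and the
# growth phase also admitted two extra variables, E and D.
TRUE_MB = {"A", "B", "C"}
grown = ["E", "A", "C", "D", "B"]


def toy_oracle(var, target, cond):
    # Error-free stand-in for a CI test. It is valid here because every
    # conditioning set used below contains the rest of the true blanket.
    return var not in TRUE_MB


fixed_mb, fixed_sizes = shrink(grown, "T", toy_oracle, update_shrink=False)
updated_mb, updated_sizes = shrink(grown, "T", toy_oracle, update_shrink=True)
print(fixed_sizes)    # [4, 4, 4, 4, 4]
print(updated_sizes)  # [4, 3, 3, 3, 2]
```

With an error-free test both modes return the same blanket. In the non-updating mode every test depends only on the growth-phase output, so the tests are independent of one another, whereas in the updating mode each test depends on the outcome of the previous ones. That sequential dependence is a likely reason the option is incompatible with multiprocessing in the shrink phase.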

run(target_var: int | str, pvalue_thres: float = 0.05) -> ResultInfoTabularMB

Runs the GS algorithm for estimating the Markov blanket.

Parameters:
  • target_var (int or str) -- Target variable whose Markov blanket is to be estimated.

  • pvalue_thres (float) -- Significance level used for hypothesis testing (default: 0.05). Candidate variables with p-values above pvalue_thres are ignored, and the rest are returned as the Markov blanket of target_var.

Returns:

A dictionary with three keys:

  • markov_blanket : List of estimated Markov blanket variables.

  • value_dict : Dictionary of form {var3_name:float, ...} containing the test statistic for each variable.

  • pvalue_dict : Dictionary of form {var3_name:float, ...} containing the p-value corresponding to the above test statistic.

Return type:

dict
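As a sketch of how the returned dictionary might be consumed, for example when ranking selected features, the helper below orders the blanket variables by p-value. The helper name rank_markov_blanket is hypothetical, and the result dictionary holds made-up placeholder numbers, not output from an actual run; only the three key names come from the description above.

```python
def rank_markov_blanket(result):
    """Return (name, statistic, p-value) tuples for the blanket, smallest p-value first."""
    rows = [
        (name, result["value_dict"].get(name), result["pvalue_dict"].get(name))
        for name in result["markov_blanket"]
    ]
    # Variables without a recorded p-value are placed last.
    return sorted(rows, key=lambda row: float("inf") if row[2] is None else row[2])


# Placeholder result with made-up numbers, shaped like the dictionary described above.
result = {
    "markov_blanket": ["A", "C", "B"],
    "value_dict": {"A": 0.41, "C": 0.63, "B": -0.28},
    "pvalue_dict": {"A": 0.002, "C": 0.0001, "B": 0.01},
}

ranked = rank_markov_blanket(result)
for name, statistic, pvalue in ranked:
    print(f"{name}: statistic={statistic}, p-value={pvalue}")
```

The lookups use .get because this page does not state whether value_dict and pvalue_dict contain an entry for every variable in markov_blanket.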