Prior Knowledge

When performing causal discovery, a user may have domain knowledge that allows them to specify one or more of the following:

  1. A directional link between certain pairs of nodes is forbidden.

  2. A directional link exists between certain pairs of nodes.

  3. Certain nodes are root variables, i.e., they have no parents (no incoming causal links).

  4. Certain nodes are leaf variables, i.e., they have no children (no outgoing causal links).

  5. Two nodes are co-parents (used for Markov Blanket discovery only).

To allow such user specifications, we support the PriorKnowledge class which can be initialized with the relevant prior knowledge about the graph. If a PriorKnowledge instance is created, it can be passed to the causal discovery algorithm being used, where it will enforce these conditions. Note that specifying the PriorKnowledge object is optional and not needed if the user has no prior knowledge about the variables.

We support this functionality because prior knowledge can improve the accuracy of the discovered causal graph, which may otherwise contain spurious or missing links for reasons such as insufficient data or data that violate the assumptions of the causal model.

Note: For example usage of prior knowledge in causal discovery algorithms, see the Tabular PC tutorial and the Grow-Shrink tutorial.

We begin by importing the PriorKnowledge class.

[1]:
from causalai.models.common.prior_knowledge import PriorKnowledge

Tabular Data

We now show an example of how to specify prior knowledge. For this example, consider a tabular dataset with 4 variables named A, B, C, and D. Suppose we want to specify that the links C->A, C->B, and D->C are forbidden (read as: C causes A, C causes B, and D causes C are forbidden). In the dictionary, each key is a child node and its value is the list of parents that are forbidden for that child. This can be done as follows.

[2]:

forbidden_links = {'A': ['C'], 'B': ['C'], 'C': ['D']}
prior_knowledge = PriorKnowledge(forbidden_links=forbidden_links)

Suppose that we additionally wanted to specify that the link A->B exists. This can be done as follows:

[3]:

forbidden_links = {'A': ['C'], 'B': ['C'], 'C': ['D']}
existing_links = {'B': ['A']}
prior_knowledge = PriorKnowledge(forbidden_links=forbidden_links, existing_links=existing_links)

Notice that:

  1. forbidden_links and existing_links are specified as dictionaries.

  2. If an argument (e.g., existing_links) is not specified, it is assumed to be empty. This holds for all four of these arguments of PriorKnowledge: root_variables, leaf_variables, forbidden_links, and existing_links.
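Because these dictionaries are keyed by the child node, with each value listing that child's parents, it is easy to flip a direction by accident when translating arrows such as C->A. As a minimal sketch, the helper below converts arrow strings into this dictionary format. The name links_from_arrows is hypothetical and is not part of causalai.

```python
def links_from_arrows(arrows):
    """Convert strings such as 'C->A' (parent->child) into {child: [parents]}."""
    links = {}
    for arrow in arrows:
        # The left side of the arrow is the parent, the right side is the child.
        parent, child = (name.strip() for name in arrow.split("->"))
        parents = links.setdefault(child, [])
        if parent not in parents:  # skip duplicate arrows
            parents.append(parent)
    return links


forbidden_links = links_from_arrows(["C->A", "C->B", "D->C"])
existing_links = links_from_arrows(["A->B"])
print(forbidden_links)
print(existing_links)
```

The resulting dictionaries are the same ones written by hand in the cells above and can be passed to PriorKnowledge unchanged.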

Below we show how to specify root_variables and leaf_variables. Note that they are specified as lists.

For this example, suppose we want to specify that D is a leaf variable and A is a root variable.

[4]:

root_variables = ['A']
leaf_variables = ['D']
prior_knowledge = PriorKnowledge(root_variables=root_variables, leaf_variables=leaf_variables)
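The four arguments can contradict one another, for example when a link is listed as existing but its child is declared a root variable. This tutorial does not describe how PriorKnowledge treats such input, so the following is only a sketch of a check one might run beforehand. The function find_conflicts is a hypothetical helper, not a causalai API.

```python
def find_conflicts(forbidden_links=None, existing_links=None,
                   root_variables=None, leaf_variables=None):
    """Return human-readable descriptions of contradictory specifications."""
    forbidden_links = forbidden_links or {}
    existing_links = existing_links or {}
    roots = set(root_variables or [])
    leaves = set(leaf_variables or [])

    conflicts = []
    # Both dictionaries map a child to the list of its parents.
    for child, parents in existing_links.items():
        for parent in parents:
            if parent in forbidden_links.get(child, []):
                conflicts.append(f"{parent}->{child} is both existing and forbidden")
            if child in roots:
                conflicts.append(f"{parent}->{child} exists, but {child} is a root variable")
            if parent in leaves:
                conflicts.append(f"{parent}->{child} exists, but {parent} is a leaf variable")
    return conflicts


# The specification used in this tutorial is consistent, so nothing is reported.
print(find_conflicts(forbidden_links={'A': ['C'], 'B': ['C'], 'C': ['D']},
                     existing_links={'B': ['A']},
                     root_variables=['A'], leaf_variables=['D']))
```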

PriorKnowledge also allows the specification of existing and forbidden co-parents, which are used only for Markov Blanket discovery. Furthermore, PriorKnowledge attempts to deduce additional information about co-parents from the other information provided (unless the user explicitly sets fix_co_parents to False) as follows:

  1. If existing_links implies some co-parent relationships, those will be added to existing_co_parents.

  2. If leaf_variables forbids some co-parent relationships, those will be added to forbidden_co_parents for every variable for which this fix is requested by passing it in var_names.

Because co-parenting is a symmetric relationship, information implied by this symmetry is also added. This happens regardless of the value of fix_co_parents.

Note that the expansion is not guaranteed to include all implications.

[5]:
forbidden_links = {'A': ['C'], 'B': ['C'], 'C': ['D']}
existing_links = {'B': ['A','E']}
root_variables = ['A']
leaf_variables = ['D']
existing_co_parents = {'B': ['E']}
forbidden_co_parents = {'B': ['C']}
var_names = ['E','F']
# var_names: This is used only to expand forbidden_co_parents using the leaf_variables information in prior knowledge.
# var_names: we recommend adding any variable for which you’d like to compute a markov blanket.

prior_knowledge = PriorKnowledge(root_variables=root_variables, leaf_variables=leaf_variables,
                                 existing_links=existing_links, forbidden_links=forbidden_links,
                                 existing_co_parents=existing_co_parents, forbidden_co_parents=forbidden_co_parents,
                                 var_names=var_names)

print(f"existing_co_parents: {prior_knowledge.existing_co_parents}")
print(f"forbidden_co_parents: {prior_knowledge.forbidden_co_parents}")
existing_co_parents: {'B': ['E'], 'A': ['E'], 'E': ['B', 'A']}
forbidden_co_parents: {'B': ['C'], 'E': ['D'], 'F': ['D'], 'C': ['B'], 'D': ['E', 'F']}

Since the links A -> B and E -> B exist, PriorKnowledge deduces that A and E are co-parents. Since D is a leaf variable, it cannot be a co-parent, and so it can be added as a forbidden co-parent for any variable we want (which we specified in var_names).
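The two deduction rules and the symmetry step can be sketched in plain Python. This is only an illustration of the behaviour described above under simple assumptions, not the causalai implementation. The function expand_co_parents is a hypothetical name, and its fix_co_parents flag merely mirrors the argument described in the text.

```python
from itertools import combinations


def expand_co_parents(existing_links, leaf_variables, var_names,
                      existing_co_parents=None, forbidden_co_parents=None,
                      fix_co_parents=True):
    """Illustrative expansion of co-parent knowledge; returns two dicts."""
    existing = {k: set(v) for k, v in (existing_co_parents or {}).items()}
    forbidden = {k: set(v) for k, v in (forbidden_co_parents or {}).items()}

    if fix_co_parents:
        # Rule 1: any two known parents of the same child are co-parents.
        for parents in existing_links.values():
            for p, q in combinations(parents, 2):
                existing.setdefault(p, set()).add(q)
        # Rule 2: a leaf has no children, so it cannot be a co-parent of
        # any variable listed in var_names.
        for var in var_names:
            for leaf in leaf_variables:
                if leaf != var:
                    forbidden.setdefault(var, set()).add(leaf)

    # Symmetry is applied regardless of fix_co_parents.
    for table in (existing, forbidden):
        for a, partners in list(table.items()):
            for b in list(partners):
                table.setdefault(b, set()).add(a)

    def to_lists(table):
        return {k: sorted(v) for k, v in table.items()}

    return to_lists(existing), to_lists(forbidden)


existing, forbidden = expand_co_parents(
    existing_links={'B': ['A', 'E']}, leaf_variables=['D'], var_names=['E', 'F'],
    existing_co_parents={'B': ['E']}, forbidden_co_parents={'B': ['C']})
print(f"existing_co_parents: {existing}")
print(f"forbidden_co_parents: {forbidden}")
```

For the inputs of the example above, this sketch yields the same co-parent relationships as the printed PriorKnowledge output, up to the ordering of the list entries.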

Time Series Data

For time series data, PriorKnowledge can be specified in the same format as shown above for tabular data. This means that the PriorKnowledge for time series is time-index agnostic: constraints are stated between variable names only, without reference to specific time lags.

PriorKnowledge: Useful methods

Finally, we describe the method isValid(parent, child) of the PriorKnowledge class, which is used internally by our causal discovery algorithms but may also be useful to users.

This method takes the names or indices of two nodes as input (the parent first, then the child) and returns whether the causal link parent -> child is allowed by the PriorKnowledge instance. If PriorKnowledge does not specify anything about this causal link, or if PriorKnowledge was instantiated without any arguments at all, the output will always be True.

Let’s use all the conditions specified in the above examples in the example below:

[6]:
forbidden_links = {'A': ['C'], 'B': ['C'], 'C': ['D']} # C cannot be a parent of A and B, and D cannot be a parent of C
existing_links = {'B': ['A']} # A is a parent of B
root_variables = ['A']
leaf_variables = ['D']
prior_knowledge = PriorKnowledge(forbidden_links=forbidden_links,
                                 existing_links=existing_links,
                                 root_variables=root_variables,
                                 leaf_variables=leaf_variables)

[7]:
print(f"Is the link C->A allowed? {prior_knowledge.isValid('C', 'A')}") # specified as forbidden above
print(f"Is the link C->B allowed? {prior_knowledge.isValid('C', 'B')}") # specified as forbidden above
print(f"Is the link D->C allowed? {prior_knowledge.isValid('D', 'C')}") # specified as forbidden above

print(f"\nIs the link A->B allowed? {prior_knowledge.isValid('A', 'B')}") # specified as existing above

print(f"\nIs the link B->A allowed? {prior_knowledge.isValid('B', 'A')}") # A specified as root, thus cannot be a child
print(f"Is the link D->B allowed? {prior_knowledge.isValid('D', 'B')}") # D specified as leaf, thus cannot be a parent


# nothing specified, hence allowed. Note index of B=1, and index of C=2. Just to show that we can use variable indices
print(f"\nIs the link B->C allowed? {prior_knowledge.isValid(1, 2)}")
Is the link C->A allowed? False
Is the link C->B allowed? False
Is the link D->C allowed? False

Is the link A->B allowed? True

Is the link B->A allowed? False
Is the link D->B allowed? False

Is the link B->C allowed? True
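The results above are consistent with three simple rules: a link is rejected if it is listed as forbidden, if its child is a root variable, or if its parent is a leaf variable. The sketch below restates those rules in plain Python to make the logic explicit. The function is_link_allowed is a hypothetical stand-in that handles variable names only; it is not the implementation of isValid.

```python
def is_link_allowed(parent, child, forbidden_links=None,
                    root_variables=None, leaf_variables=None):
    """Return True unless the link parent->child is ruled out."""
    forbidden_links = forbidden_links or {}
    if parent in forbidden_links.get(child, []):
        return False  # explicitly forbidden
    if child in (root_variables or []):
        return False  # a root variable has no parents
    if parent in (leaf_variables or []):
        return False  # a leaf variable has no children
    return True  # nothing specified about this link


knowledge = dict(forbidden_links={'A': ['C'], 'B': ['C'], 'C': ['D']},
                 root_variables=['A'], leaf_variables=['D'])

for parent, child in [('C', 'A'), ('C', 'B'), ('D', 'C'), ('A', 'B'),
                      ('B', 'A'), ('D', 'B'), ('B', 'C')]:
    allowed = is_link_allowed(parent, child, **knowledge)
    print(f"Is the link {parent}->{child} allowed? {allowed}")
```

With no prior knowledge supplied, every call to this sketch returns True, matching the behaviour described for an instance created without arguments.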