1. Conditional independence in general
Conditional independence of two random variables A and B given C
holds just in case
(1) p(A, B | C) = p(A | C) * p(B | C)
[cf. R&N's (13.13) on p. 481]. Equivalently, we could have said that
A and B are conditionally independent given C just in case B doesn't
tell us anything about A if we already know C:
(2) p(A | C) = p(A | B, C)
It's easy to see that (1) and (2) are equivalent by observing that in
general
(3) p(A, B | C) = p(A, B, C) / p(C) = p(A | B, C) * p(B, C) / p(C)
                = p(A | B, C) * p(B | C).
Then it's trivial to derive (1) from (2) and vice versa.
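The equivalence can also be checked numerically. Here is a small sketch (all numbers made up) that builds a joint distribution over three binary variables so that it factors as p(C) * p(A | C) * p(B | C), and then confirms that both (1) and (2) hold for every assignment:

```python
import itertools

# Made-up distributions; any values in [0, 1] would do.
p_c = {0: 0.3, 1: 0.7}          # p(C = c)
p_a_given_c = {0: 0.2, 1: 0.6}  # p(A = 1 | C = c)
p_b_given_c = {0: 0.9, 1: 0.4}  # p(B = 1 | C = c)

def bern(p, v):
    """Probability of outcome v under a Bernoulli(p) variable."""
    return p if v == 1 else 1 - p

def joint(a, b, c):
    return p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)

for a, b, c in itertools.product((0, 1), repeat=3):
    pc = sum(joint(aa, bb, c) for aa in (0, 1) for bb in (0, 1))
    # (1): p(A, B | C) = p(A | C) * p(B | C)
    p_ab_c = joint(a, b, c) / pc
    p_a_c = sum(joint(a, bb, c) for bb in (0, 1)) / pc
    p_b_c = sum(joint(aa, b, c) for aa in (0, 1)) / pc
    assert abs(p_ab_c - p_a_c * p_b_c) < 1e-12
    # (2): p(A | C) = p(A | B, C)
    p_a_bc = joint(a, b, c) / sum(joint(aa, b, c) for aa in (0, 1))
    assert abs(p_a_c - p_a_bc) < 1e-12
print("(1) and (2) hold for every assignment")
```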
2. Conditional independence in Bayesian Networks
It's crucial to keep in mind that the discussion of conditional
independence in Bayesian Networks is always about nodes/variables that
are *necessarily* independent, given the structure of the underlying
DAG. One might also call this "topological independence", since one
only takes into account properties of this DAG. Necessary/topological
independence holds regardless of probability assignments.
Even if two nodes are not necessarily/topologically (conditionally)
independent, they might still turn out to be independent if one takes
into account specific probability assignments. For example, consider
a simple DAG with vertices A and B and one directed edge A->B. Then,
by the definition of Bayesian Networks, A and B are not topologically
independent. However, they might still be independent, if we happen
to assign probabilities in such a way that
p(B | A) = p(B)
This is the case whenever we have a conditional probability table for
p(B | A) of the following form:
A p(B)
t x
f x
for any real number x in [0,1]. It just means that A tells us nothing
about B, i.e., whether B happens is independent of whether A happens.
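A quick numeric check of this claim (p(A) and x are chosen arbitrarily): with both rows of the CPT equal to x, the joint p(A, B) factors into p(A) * p(B) exactly.

```python
# Two identical CPT rows mean the A->B edge carries no information.
x = 0.3                              # any value in [0, 1]
p_a = 0.6                            # p(A = t), chosen arbitrarily
p_b_given_a = {True: x, False: x}    # CPT with two identical rows

def bern(p, v):
    return p if v else 1 - p

for a in (True, False):
    for b in (True, False):
        joint = bern(p_a, a) * bern(p_b_given_a[a], b)
        # marginal p(B = b), summing A out
        p_b = sum(bern(p_a, aa) * bern(p_b_given_a[aa], b)
                  for aa in (True, False))
        assert abs(joint - bern(p_a, a) * p_b) < 1e-12
print("A and B are independent")
```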
Textbooks typically deal with the first notion of topological
independence, but it's not always clear what some exercises ask
about. Simply asking "are A and B independent" in the above example
is ambiguous, since topologically they are not *necessarily*
independent, though for some probability tables they actually are.
R&N's first definition should probably go like this:
(4) A node is necessarily conditionally independent of its
non-descendants given its parents.
What this means is simply that given a node X and its parents P_1
through P_n (there may not be any parents), and given another node Y
not reachable from X, we have
p(X, Y | P_1,...,P_n) = p(X | P_1,...,P_n) * p(Y | P_1,...,P_n).
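The topological side of (4) -- "Y not reachable from X" -- is plain graph reachability. A sketch with a made-up DAG (node names P, X, Z, W, Y are illustrative only):

```python
# Y qualifies as a non-descendant of X iff Y is not reachable from X
# along directed edges. Example DAG: P -> X -> Z, W -> Y.
def descendants(dag, start):
    """Return all nodes reachable from `start` via directed edges."""
    seen = set()
    stack = [start]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

dag = {"P": ["X"], "X": ["Z"], "W": ["Y"]}

assert "Y" not in descendants(dag, "X")  # Y is a non-descendant of X
assert "Z" in descendants(dag, "X")      # but Z descends from X
print(sorted(descendants(dag, "X")))
```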
This makes sense intuitively, and can also be derived formally from
the BN definition of the factorization of the joint distribution of X,
Y, P_1,...,P_n, and all their ancestors. In a nutshell, let C stand
for the common ancestors of X and Y, let B stand for those parents
of X that descend from C and A for those that do not descend from C,
and, analogously, let D be the parents of Y that descend from C and E
those that don't. This situation can be depicted as follows:
      C
     / \
A   B   D   E
 \ /     \ /
  X       Y
(This involves serious hand-waving, since we've already marginalized
over all ancestors of A, C and E and over intermediate nodes between B
and C and between D and C.) By the definition of Bayesian Networks,
the overall joint distribution factors as follows:
p(X, A, B, C, D, E, Y)
= p(X | A, B) * p(A) * p(B | C) * p(C) * p(D | C) * p(E) * p(Y | D, E)
(More hand-waving regarding the fact that p(B | C) etc. makes sense.)
Note that this product from the second factor onwards is also the
joint distribution in this Bayesian Network over all nodes except X.
Then divide by p(A, B):
p(X, Y, C, D, E | A, B) = p(X | A, B) * p(Y, C, D, E | A, B)
Then remove C, D, E, which we don't care about, by marginalization, so
that
p(X, Y | A, B) = p(X | A, B) * p(Y | A, B)
as required for showing conditional independence, since A,B together
are the parents of X.
3. The specific case of R&N's Figure 14.2
Abbreviate the nodes as B, E, A, J, and M.
The simplest illustration of necessary/topological independence
involves B and E. By (4), B is independent of E, since E does not
descend from B, and unconditionally so, since B has no parents. This
is a trivial result if we only consider the DAG over {B, E}, whose
edge set is empty. By the semantics of BNs, the overall joint
distribution p(B, E) is the product of all conditional distributions,
in this case p(B) * p(E).
However, B and E are not independent given A. To see this, consider
the graph restricted to the nodes {B, E, A}. Now the joint
distribution is
p(B, E, A) = p(B) * p(E) * p(A | B, E).
We can obtain p(B, E | A) as follows:
p(B, E | A)
= p(B, E, A) / p(A)
= p(B, E, A) / sum_b,e p(B=b, E=e, A)
= p(B, E, A) / [ p(B=0, E=0, A) + p(B=0, E=1, A)
+ p(B=1, E=0, A) + p(B=1, E=1, A) ].
This gives us the following table:
B E A p(B, E | A)
0 0 0 0.998517709903
0 0 1 0.39619510404 (uh-oh, trigger-happy alarm!)
0 1 0 0.00142215878008
0 1 1 0.230253667678
1 0 0 6.00310646925e-05
1 0 1 0.372796193991
1 1 0 1.00252279046e-07
1 1 1 0.000755034290478
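The table above can be reproduced with a few lines of code, assuming the standard CPT numbers of R&N's Figure 14.2 (p(B=1) = 0.001, p(E=1) = 0.002, and the usual alarm CPT):

```python
import itertools

# Standard CPTs of the burglary network (R&N Figure 14.2).
p_b = 0.001                 # p(B = 1)
p_e = 0.002                 # p(E = 1)
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint_bea(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

for b, e, a in itertools.product((0, 1), repeat=3):
    p_a = sum(joint_bea(bb, ee, a) for bb in (0, 1) for ee in (0, 1))
    print(b, e, a, joint_bea(b, e, a) / p_a)  # p(B=b, E=e | A=a)
```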
Next we compute p(B | A) as
p(B | A)
= p(B, A) / p(A)
= sum_e p(A | B, E=e) * p(B) * p(E=e)
/ sum_b,e p(A | B=b, E=e) * p(B=b) * p(E=e),
which gives us this table:
B A p(B | A)
0 0 0.999939868683
0 1 0.626448771718
1 0 6.01313169715e-05
1 1 0.373551228282
Proceed analogously for p(E | A):
E A p(E | A)
0 0 0.998577740968
0 1 0.768991298031
1 0 0.00142225903236
1 1 0.231008701969
Now it's easy to see that p(B, E | A) != p(B | A) * p(E | A).
For example:
p(B=0, E=1 | A=1) ~= 0.23,
but
p(B=0 | A=1) ~= 0.626
p(E=1 | A=1) ~= 0.231,
whose product is clearly not equal to 0.23. This is just one of eight
possible outcomes; for some of the other outcomes, the values we need
to compare are approximately equal, but the large discrepancies in a
few cases are enough to demonstrate that conditional independence does
not hold, not even approximately:
B E A p(B | A) * p(E | A) p(B, E | A)
0 0 0 0.998517695173 0.998517709903
0 0 1 0.481733654114 0.39619510404
0 1 0 0.00142217351006 0.00142215878008
0 1 1 0.144715117605 0.230253667678
1 0 0 6.00457946629e-05 6.00310646925e-05
1 0 1 0.287257643918 0.372796193991
1 1 0 8.55223086907e-08 1.00252279046e-07
1 1 1 0.0862935843643 0.000755034290478
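The two columns of the comparison can be scripted as well, again assuming the Figure 14.2 CPTs:

```python
import itertools

# Compare p(B | A) * p(E | A) against p(B, E | A) for all assignments.
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

def p_a(a):
    return sum(joint(b, e, a) for b in (0, 1) for e in (0, 1))

for b, e, a in itertools.product((0, 1), repeat=3):
    pb_a = sum(joint(b, ee, a) for ee in (0, 1)) / p_a(a)
    pe_a = sum(joint(bb, e, a) for bb in (0, 1)) / p_a(a)
    pbe_a = joint(b, e, a) / p_a(a)
    print(b, e, a, pb_a * pe_a, pbe_a)
```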
In other words, this is an example of a situation which initially
might seem counter-intuitive: we said above that B and E are
marginally independent; but as soon as we add information about A,
that independence is gone. On further reflection this makes perfect
sense, though: my assessment of whether there was an earthquake or
not, if all I know is that the alarm went off, should change quite a
bit when I find out whether or not a burglary took place.
Specifically:
p(E=1 | A=1) ~= 23.1%
p(E=1 | A=1, B=0) ~= 36.8%
p(E=1 | A=1, B=1) ~= 0.2%
In the last case, the burglary "explains" the alarm going off, so that
the conditional probability of an earthquake happening is almost the
same as the unconditional probability p(E) of an earthquake happening.
In the second case, if the alarm went off and no burglary took place,
I'm more inclined to say that it was due to an earthquake than if I
had no knowledge one way or another about any burglary.
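The three "explaining away" numbers above can be recomputed directly, again assuming the standard Figure 14.2 CPTs:

```python
# p(E=1 | A=1), p(E=1 | A=1, B=0), and p(E=1 | A=1, B=1).
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

p_e1_a1 = (sum(joint(b, 1, 1) for b in (0, 1))
           / sum(joint(b, e, 1) for b in (0, 1) for e in (0, 1)))
p_e1_a1_b0 = joint(0, 1, 1) / (joint(0, 0, 1) + joint(0, 1, 1))
p_e1_a1_b1 = joint(1, 1, 1) / (joint(1, 0, 1) + joint(1, 1, 1))

print(round(p_e1_a1, 3))     # -> 0.231
print(round(p_e1_a1_b0, 3))  # -> 0.368
print(round(p_e1_a1_b1, 3))  # -> 0.002
```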
Suppose I replace the alarm sensor with the following improved device:
B E p(A=1 | B, E)
0 0 0.001
0 1 0.001
1 0 0.95
1 1 0.95
With this new conditional probability table for the A node in place,
it then happens to be the case that B and E are conditionally
independent given A. However, this independence does not hold
necessarily for the given DAG. Rather, it is due to the fact that the
new
alarm is not sensitive to earthquakes at all:
p(A | B, E=0) = p(A | B, E=1).
We would get the same net effect (though this time for all probability
tables) by removing the E->A arc from the DAG.
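This independence can be verified exhaustively, assuming the Figure 14.2 priors together with the modified alarm CPT:

```python
import itertools

# With p(A | B, E=0) = p(A | B, E=1), B and E become conditionally
# independent given A (priors as in Figure 14.2).
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.95, (0, 1): 0.001, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

def p_a(a):
    return sum(joint(b, e, a) for b in (0, 1) for e in (0, 1))

for b, e, a in itertools.product((0, 1), repeat=3):
    pbe_a = joint(b, e, a) / p_a(a)
    pb_a = sum(joint(b, ee, a) for ee in (0, 1)) / p_a(a)
    pe_a = sum(joint(bb, e, a) for bb in (0, 1)) / p_a(a)
    assert abs(pbe_a - pb_a * pe_a) < 1e-12
print("B and E are conditionally independent given A")
```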
Finally, observe that J is necessarily conditionally independent of
all other nodes, given A. For example:
p(J | A, M)
= p(J, A, M) / p(A, M)
## BN semantics
= p(A) * p(J | A) * p(M | A) / p(A, M)
## chain rule
= p(A) * p(J | A) * p(M | A) / [ p(M | A) * p(A)]
= p(J | A)
Also:
p(J | A, E)
= p(J, A, E) / p(A, E)
= [sum_b p(B=b) * p(E) * p(A | B=b, E) * p(J | A)] / p(A, E)
= p(J | A) * [sum_b p(B=b) * p(E) * p(A | B=b, E)] / p(A, E)
= p(J | A) * p(A, E) / p(A, E)
= p(J | A)
The case p(J | A, B) = p(J | A) is exactly parallel, and p(J | A, A) =
p(J | A) holds trivially.
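A brute-force numeric check of the derivation for p(J | A, M), using the full set of Figure 14.2 CPTs (including p(J | A) and p(M | A)):

```python
import itertools

# Full burglary network CPTs (R&N Figure 14.2).
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
p_j_given_a = {1: 0.90, 0: 0.05}
p_m_given_a = {1: 0.70, 0: 0.01}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a, j, m):
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)
            * bern(p_j_given_a[a], j) * bern(p_m_given_a[a], m))

def marg(**fixed):
    """Marginal probability with the given variables fixed."""
    free = [v for v in ("b", "e", "a", "j", "m") if v not in fixed]
    total = 0.0
    for vals in itertools.product((0, 1), repeat=len(free)):
        assign = dict(fixed, **dict(zip(free, vals)))
        total += joint(assign["b"], assign["e"], assign["a"],
                       assign["j"], assign["m"])
    return total

for a, j, m in itertools.product((0, 1), repeat=3):
    lhs = marg(a=a, j=j, m=m) / marg(a=a, m=m)  # p(J=j | A=a, M=m)
    rhs = marg(a=a, j=j) / marg(a=a)            # p(J=j | A=a)
    assert abs(lhs - rhs) < 1e-12
print("p(J | A, M) = p(J | A) for all values")
```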
As another example, consider the so-called "Naive Bayes" classifier:
this is an instance of a Bayesian Network whose underlying DAG has
vertices {C, F_1, F_2, ...} and directed edges {(C, F_1), (C, F_2),
...}. Per the discussion above, F_i and F_j (i != j) are
conditionally independent given C. The overall joint distribution is
p(C, F_1, F_2, ...) = p(C) * prod_i p(F_i | C).
For fixed values of F_1=f_1, ... this joint probability is
proportional to the conditional probability p(C | F_1=f_1, ...), which
makes this model usable as a probabilistic classifier.
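A minimal sketch of such a classifier, with made-up binary features and CPT numbers (all values illustrative):

```python
# Naive Bayes over binary features: score(c, f) = p(C=c) * prod_i
# p(F_i = f_i | C=c); normalizing the scores yields p(C | F_1, F_2, ...).
p_c = {0: 0.5, 1: 0.5}       # prior p(C)
p_f_given_c = [              # p(F_i = 1 | C = c), one dict per feature
    {0: 0.1, 1: 0.8},
    {0: 0.4, 1: 0.6},
    {0: 0.7, 1: 0.2},
]

def bern(p, v):
    return p if v else 1 - p

def score(c, features):
    s = p_c[c]
    for cpt, f in zip(p_f_given_c, features):
        s *= bern(cpt[c], f)
    return s

def classify(features):
    scores = {c: score(c, features) for c in p_c}
    z = sum(scores.values())                 # normalize
    return {c: s / z for c, s in scores.items()}

print(classify([1, 1, 0]))
```

Normalization is what turns "proportional to p(C | F_1=f_1, ...)" into an actual posterior; for classification alone, comparing the unnormalized scores would suffice.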