1. Conditional independence in general
Conditional independence of two random variables A and B given C
holds just in case
(1) p(A, B | C) = p(A | C) * p(B | C)
[cf. R&N's (13.13) on p. 481]. Equivalently, we could have said that
A and B are conditionally independent given C just in case B doesn't
tell us anything about A if we already know C:
(2) p(A | C) = p(A | B, C)
It's easy to see that (1) and (2) are equivalent by observing that in
general
(3) p(A, B | C) = p(A, B, C) / p(C) = p(A | B, C) * p(B, C) / p(C)
                = p(A | B, C) * p(B | C).
Then it's trivial to derive (1) from (2) and vice versa.
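The equivalence can also be checked numerically. Here is a small sketch (all numbers made up) that builds a joint distribution over three binary variables so that it factors as p(C) * p(A | C) * p(B | C), and then confirms that both (1) and (2) hold for every assignment:

```python
import itertools

# Made-up distributions; any values in [0, 1] would do.
p_c = {0: 0.3, 1: 0.7}          # p(C = c)
p_a_given_c = {0: 0.2, 1: 0.6}  # p(A = 1 | C = c)
p_b_given_c = {0: 0.9, 1: 0.4}  # p(B = 1 | C = c)

def bern(p, v):
    """Probability of outcome v under a Bernoulli(p) variable."""
    return p if v == 1 else 1 - p

def joint(a, b, c):
    return p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)

for a, b, c in itertools.product((0, 1), repeat=3):
    pc = sum(joint(aa, bb, c) for aa in (0, 1) for bb in (0, 1))
    # (1): p(A, B | C) = p(A | C) * p(B | C)
    p_ab_c = joint(a, b, c) / pc
    p_a_c = sum(joint(a, bb, c) for bb in (0, 1)) / pc
    p_b_c = sum(joint(aa, b, c) for aa in (0, 1)) / pc
    assert abs(p_ab_c - p_a_c * p_b_c) < 1e-12
    # (2): p(A | C) = p(A | B, C)
    p_a_bc = joint(a, b, c) / sum(joint(aa, b, c) for aa in (0, 1))
    assert abs(p_a_c - p_a_bc) < 1e-12
print("(1) and (2) hold for every assignment")
```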
2. Conditional independence in Bayesian Networks
It's crucial to keep in mind that the discussion of conditional
independence in Bayesian Networks is always about nodes/variables that
are *necessarily* independent, given the structure of the underlying
DAG. One might also call this "topological independence", since one
only takes into account properties of this DAG. Necessary/topological
independence holds regardless of probability assignments.
Even if two nodes are not necessarily/topologically (conditionally)
independent, they might still turn out to be independent if one takes
into account specific probability assignments. For example, consider
a simple DAG with vertices A and B and one directed edge A->B. Then,
by the definition of Bayesian Networks, A and B are not topologically
independent. However, they might still be independent, if we happen
to assign probabilities in such a way that
p(B | A) = p(B)
This is the case whenever we have a conditional probability table for
p(B | A) of the following form:
A p(B)
t x
f x
for any real number x in [0,1]. It just means that A tells us nothing
about B, i.e., whether B happens is independent of whether A happens.
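A quick numeric check of this claim (p(A) and x are chosen arbitrarily): with both rows of the CPT equal to x, the joint p(A, B) factors into p(A) * p(B) exactly.

```python
# Two identical CPT rows mean the A->B edge carries no information.
x = 0.3                              # any value in [0, 1]
p_a = 0.6                            # p(A = t), chosen arbitrarily
p_b_given_a = {True: x, False: x}    # CPT with two identical rows

def bern(p, v):
    return p if v else 1 - p

for a in (True, False):
    for b in (True, False):
        joint = bern(p_a, a) * bern(p_b_given_a[a], b)
        # marginal p(B = b), summing A out
        p_b = sum(bern(p_a, aa) * bern(p_b_given_a[aa], b)
                  for aa in (True, False))
        assert abs(joint - bern(p_a, a) * p_b) < 1e-12
print("A and B are independent")
```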
Textbooks typically deal with the first notion of topological
independence, but it's not always clear what some exercises ask
about. Simply asking "are A and B independent" in the above example
is ambiguous, since topologically they are not *necessarily*
independent, though for some probability tables they actually are.
R&N's first definition should probably go like this:
(4) A node is necessarily conditionally independent of its
non-descendants given its parents.
What this means is simply that given a node X and its parents P_1
through P_n (there may not be any parents), and given another node Y
not reachable from X, we have
p(X, Y | P_1,...,P_n) = p(X | P_1,...,P_n) * p(Y | P_1,...,P_n).
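The topological side of (4) -- "Y not reachable from X" -- is plain graph reachability. A sketch with a made-up DAG (node names P, X, Z, W, Y are illustrative only):

```python
# Y qualifies as a non-descendant of X iff Y is not reachable from X
# along directed edges. Example DAG: P -> X -> Z, W -> Y.
def descendants(dag, start):
    """Return all nodes reachable from `start` via directed edges."""
    seen = set()
    stack = [start]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

dag = {"P": ["X"], "X": ["Z"], "W": ["Y"]}

assert "Y" not in descendants(dag, "X")  # Y is a non-descendant of X
assert "Z" in descendants(dag, "X")      # but Z descends from X
print(sorted(descendants(dag, "X")))
```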
This makes sense intuitively, and can also be derived formally from
the BN definition of the factorization of the joint distribution of X,
Y, P_1,...,P_n, and all their ancestors. In a nutshell, let C stand
for the common ancestors of X and Y, let B stand for those parents
of X that descend from C and A for those that do not descend from C,
and, analogously, let D be the parents of Y that descend from C and E
those that don't. This situation can be depicted as follows:
      C
     / \
A   B   D   E
 \ /     \ /
  X       Y
(This involves serious hand-waving, since we've already marginalized
over all ancestors of A, C and E and over intermediate nodes between B
and C and between D and C.) By the definition of Bayesian Networks,
the overall joint distribution factors as follows:
p(X, A, B, C, D, E, Y)
= p(X | A, B) * p(A) * p(B | C) * p(C) * p(D | C) * p(E) * p(Y | D, E)
(More hand-waving regarding the fact that p(B | C) etc. makes sense.)
Note that this product from the second factor onwards is also the
joint distribution in this Bayesian Network over all nodes except X.
Then divide by p(A, B):
p(X, Y, C, D, E | A, B) = p(X | A, B) * p(Y, C, D, E | A, B)
Then remove C, D, E, which we don't care about, by marginalization, so
that
p(X, Y | A, B) = p(X | A, B) * p(Y | A, B)
as required for showing conditional independence, since A,B together
are the parents of X.
3. The specific case of R&N's Figure 14.2
Abbreviate the nodes as B, E, A, J, and M.
The simplest illustration of necessary/topological independence
involves B and E. By (4), B is independent of E, since E does not
descend from B, and unconditionally so, since B has no parents. This
is a trivial result if we only consider the DAG over {B, E}, whose
edge set is empty. By the semantics of BNs, the overall joint
distribution p(B, E) is the product of all conditional distributions,
in this case p(B) * p(E).
However, B and E are not independent given A. To see this, consider
the graph restricted to the nodes {B, E, A}. Now the joint
distribution is
p(B, E, A) = p(B) * p(E) * p(A | B, E).
We can obtain p(B, E | A) as follows:
p(B, E | A)
= p(B, E, A) / p(A)
= p(B, E, A) / sum_b,e p(B=b, E=e, A)
= p(B, E, A) / [ p(B=0, E=0, A) + p(B=0, E=1, A)
+ p(B=1, E=0, A) + p(B=1, E=1, A) ].
This gives us the following table:
B E A p(B, E | A)
0 0 0 0.998517709903
0 0 1 0.39619510404 (uh-oh, trigger-happy alarm!)
0 1 0 0.00142215878008
0 1 1 0.230253667678
1 0 0 6.00310646925e-05
1 0 1 0.372796193991
1 1 0 1.00252279046e-07
1 1 1 0.000755034290478
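The table above can be reproduced with a few lines of code, assuming the standard CPT numbers of R&N's Figure 14.2 (p(B=1) = 0.001, p(E=1) = 0.002, and the usual alarm CPT):

```python
import itertools

# Standard CPTs of the burglary network (R&N Figure 14.2).
p_b = 0.001                 # p(B = 1)
p_e = 0.002                 # p(E = 1)
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint_bea(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

for b, e, a in itertools.product((0, 1), repeat=3):
    p_a = sum(joint_bea(bb, ee, a) for bb in (0, 1) for ee in (0, 1))
    print(b, e, a, joint_bea(b, e, a) / p_a)  # p(B=b, E=e | A=a)
```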
Next we compute p(B | A) as
p(B | A)
= p(B, A) / p(A)
= sum_e p(A | B, E=e) * p(B) * p(E=e)
/ sum_b,e p(A | B=b, E=e) * p(B=b) * p(E=e),
which gives us this table:
B A p(B | A)
0 0 0.999939868683
0 1 0.626448771718
1 0 6.01313169715e-05
1 1 0.373551228282
Proceed analogously for p(E | A):
E A p(E | A)
0 0 0.998577740968
0 1 0.768991298031
1 0 0.00142225903236
1 1 0.231008701969
Now it's easy to see that p(B, E | A) != p(B | A) * p(E | A).
For example:
p(B=0, E=1 | A=1) ~= 0.23,
but
p(B=0 | A=1) ~= 0.626
p(E=1 | A=1) ~= 0.231,
whose product is clearly not equal to 0.23. This is just one of eight
possible outcomes; for some of the other outcomes, the values we need
to compare are approximately equal, but the large discrepancies in a
few cases are enough to demonstrate that conditional independence does
not hold, not even approximately:
B E A p(B | A) * p(E | A) p(B, E | A)
0 0 0 0.998517695173 0.998517709903
0 0 1 0.481733654114 0.39619510404
0 1 0 0.00142217351006 0.00142215878008
0 1 1 0.144715117605 0.230253667678
1 0 0 6.00457946629e-05 6.00310646925e-05
1 0 1 0.287257643918 0.372796193991
1 1 0 8.55223086907e-08 1.00252279046e-07
1 1 1 0.0862935843643 0.000755034290478
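The two columns of the comparison can be scripted as well, again assuming the Figure 14.2 CPTs:

```python
import itertools

# Compare p(B | A) * p(E | A) against p(B, E | A) for all assignments.
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

def p_a(a):
    return sum(joint(b, e, a) for b in (0, 1) for e in (0, 1))

for b, e, a in itertools.product((0, 1), repeat=3):
    pb_a = sum(joint(b, ee, a) for ee in (0, 1)) / p_a(a)
    pe_a = sum(joint(bb, e, a) for bb in (0, 1)) / p_a(a)
    pbe_a = joint(b, e, a) / p_a(a)
    print(b, e, a, pb_a * pe_a, pbe_a)
```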
In other words, this is an example of a situation which initially
might seem counter-intuitive: we said above that B and E are
marginally independent; but as soon as we add information about A,
that independence is gone. On further reflection this makes perfect
sense, though: my assessment of whether there was an earthquake or
not, if all I know is that the alarm went off, should change quite a
bit when I find out whether or not a burglary took place.
Specifically:
p(E=1 | A=1) ~= 23.1%
p(E=1 | A=1, B=0) ~= 36.8%
p(E=1 | A=1, B=1) ~= 0.2%
In the last case, the burglary "explains" the alarm going off, so that
the conditional probability of an earthquake happening is almost the
same as the unconditional probability p(E) of an earthquake happening.
In the second case, if the alarm went off and no burglary took place,
I'm more inclined to say that it was due to an earthquake than if I
had no knowledge one way or another about any burglary.
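The three "explaining away" numbers above can be recomputed directly, again assuming the standard Figure 14.2 CPTs:

```python
# p(E=1 | A=1), p(E=1 | A=1, B=0), and p(E=1 | A=1, B=1).
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

p_e1_a1 = (sum(joint(b, 1, 1) for b in (0, 1))
           / sum(joint(b, e, 1) for b in (0, 1) for e in (0, 1)))
p_e1_a1_b0 = joint(0, 1, 1) / (joint(0, 0, 1) + joint(0, 1, 1))
p_e1_a1_b1 = joint(1, 1, 1) / (joint(1, 0, 1) + joint(1, 1, 1))

print(round(p_e1_a1, 3))     # -> 0.231
print(round(p_e1_a1_b0, 3))  # -> 0.368
print(round(p_e1_a1_b1, 3))  # -> 0.002
```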
Suppose I replace the alarm sensor with the following improved device:
B E p(A=1 | B, E)
0 0 0.001
0 1 0.001
1 0 0.95
1 1 0.95
With this new conditional probability table for the A node in place,
it then happens to be the case that B and E are conditionally
independent given A. However, this independence does not hold
necessarily for the given DAG. Rather, it is due to the fact that the
new
alarm is not sensitive to earthquakes at all:
p(A | B, E=0) = p(A | B, E=1).
We would get the same net effect (though this time for all probability
tables) by removing the E->A arc from the DAG.
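This independence can be verified exhaustively, assuming the Figure 14.2 priors together with the modified alarm CPT:

```python
import itertools

# With p(A | B, E=0) = p(A | B, E=1), B and E become conditionally
# independent given A (priors as in Figure 14.2).
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.95, (0, 1): 0.001, (0, 0): 0.001}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a):
    return bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)

def p_a(a):
    return sum(joint(b, e, a) for b in (0, 1) for e in (0, 1))

for b, e, a in itertools.product((0, 1), repeat=3):
    pbe_a = joint(b, e, a) / p_a(a)
    pb_a = sum(joint(b, ee, a) for ee in (0, 1)) / p_a(a)
    pe_a = sum(joint(bb, e, a) for bb in (0, 1)) / p_a(a)
    assert abs(pbe_a - pb_a * pe_a) < 1e-12
print("B and E are conditionally independent given A")
```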
Finally, observe that J is necessarily conditionally independent of
all other nodes, given A. For example:
p(J | A, M)
= p(J, A, M) / p(A, M)
## BN semantics
= p(A) * p(J | A) * p(M | A) / p(A, M)
## chain rule
= p(A) * p(J | A) * p(M | A) / [ p(M | A) * p(A)]
= p(J | A)
Also:
p(J | A, E)
= p(J, A, E) / p(A, E)
= [sum_b p(B=b) * p(E) * p(A | B=b, E) * p(J | A)] / p(A, E)
= p(J | A) * [sum_b p(B=b) * p(E) * p(A | B=b, E)] / p(A, E)
= p(J | A) * p(A, E) / p(A, E)
= p(J | A)
The case p(J | A, B) = p(J | A) is exactly parallel, and p(J | A, A) =
p(J | A) holds trivially.
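A brute-force numeric check of the derivation for p(J | A, M), using the full set of Figure 14.2 CPTs (including p(J | A) and p(M | A)):

```python
import itertools

# Full burglary network CPTs (R&N Figure 14.2).
p_b, p_e = 0.001, 0.002
p_a_given_be = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
p_j_given_a = {1: 0.90, 0: 0.05}
p_m_given_a = {1: 0.70, 0: 0.01}

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a, j, m):
    return (bern(p_b, b) * bern(p_e, e) * bern(p_a_given_be[(b, e)], a)
            * bern(p_j_given_a[a], j) * bern(p_m_given_a[a], m))

def marg(**fixed):
    """Marginal probability with the given variables fixed."""
    free = [v for v in ("b", "e", "a", "j", "m") if v not in fixed]
    total = 0.0
    for vals in itertools.product((0, 1), repeat=len(free)):
        assign = dict(fixed, **dict(zip(free, vals)))
        total += joint(assign["b"], assign["e"], assign["a"],
                       assign["j"], assign["m"])
    return total

for a, j, m in itertools.product((0, 1), repeat=3):
    lhs = marg(a=a, j=j, m=m) / marg(a=a, m=m)  # p(J=j | A=a, M=m)
    rhs = marg(a=a, j=j) / marg(a=a)            # p(J=j | A=a)
    assert abs(lhs - rhs) < 1e-12
print("p(J | A, M) = p(J | A) for all values")
```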
As another example, consider the so-called "Naive Bayes" classifier:
this is an instance of a Bayesian Network whose underlying DAG has
vertices {C, F_1, F_2, ...} and directed edges {(C, F_1), (C, F_2),
...}. Per the discussion above, F_i and F_j (i != j) are
conditionally independent given C. The overall joint distribution is
p(C, F_1, F_2, ...) = p(C) * prod_i p(F_i | C).
For fixed values of F_1=f_1, ... this joint probability is
proportional to the conditional probability p(C | F_1=f_1, ...), which
makes this model usable as a probabilistic classifier.
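A minimal sketch of such a classifier, with made-up binary features and CPT numbers (all values illustrative):

```python
# Naive Bayes over binary features: score(c, f) = p(C=c) * prod_i
# p(F_i = f_i | C=c); normalizing the scores yields p(C | F_1, F_2, ...).
p_c = {0: 0.5, 1: 0.5}       # prior p(C)
p_f_given_c = [              # p(F_i = 1 | C = c), one dict per feature
    {0: 0.1, 1: 0.8},
    {0: 0.4, 1: 0.6},
    {0: 0.7, 1: 0.2},
]

def bern(p, v):
    return p if v else 1 - p

def score(c, features):
    s = p_c[c]
    for cpt, f in zip(p_f_given_c, features):
        s *= bern(cpt[c], f)
    return s

def classify(features):
    scores = {c: score(c, features) for c in p_c}
    z = sum(scores.values())                 # normalize
    return {c: s / z for c, s in scores.items()}

print(classify([1, 1, 0]))
```

Normalization is what turns "proportional to p(C | F_1=f_1, ...)" into an actual posterior; for classification alone, comparing the unnormalized scores would suffice.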