COMS W3261 CS Theory
Lecture 22: Introduction to PAC Learning

The promise of modern Artificial Intelligence (AI) is that computers can learn novel concepts. How do we formalize the process of learning? Doing so will help us answer fundamental questions like: what can be learned, how many examples does learning require, and how computationally hard is learning?

1. Towards Formalizing the Process of Learning

The dictionary defines learning as "the acquisition of knowledge or skills through experience or study". Thus, to learn a concept, a learning agent should be able to generalize from the examples it has seen. How can we make this precise? We can use the Probably Approximately Correct (PAC) Learning model proposed by Valiant in 1984. There is some "true concept" c ∈ C that the learning agent is trying to find (learn). Given m sample observations x1,...,xm which are drawn independently from a distribution D, together with their true classifications c(x1),...,c(xm), the agent's goal is to find a hypothesis h ∈ C such that

Pr[h(x) = c(x)] ≥ 1 − ε

for a future observation x drawn from D. Furthermore, the learning agent wants to succeed at this goal with probability at least 1 − δ over the choice of the xi's. In other words, with high probability we want to output a hypothesis that is approximately correct (hence "Probably Approximately Correct").
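As a concrete illustration of this setup, the sketch below takes D to be the uniform distribution on [0, 1] and C to be the class of threshold functions c_t(x) = 1 iff x ≥ t. The particular class, the learner's rule, and names like make_concept and t_hat are assumptions for illustration, not part of the model itself.

```python
import random

random.seed(0)

def make_concept(t):
    # Threshold concept: accept exactly the points x >= t.
    return lambda x: 1 if x >= t else 0

true_c = make_concept(0.3)          # the unknown "true concept" c

# Draw m labeled samples independently from D (uniform on [0, 1]).
m = 50
samples = [random.random() for _ in range(m)]
labels = [true_c(x) for x in samples]

# One consistent learner: set the threshold to the smallest positive example.
t_hat = min((x for x, y in zip(samples, labels) if y == 1), default=1.0)
h = make_concept(t_hat)

# Estimate Pr_{x ~ D}[h(x) = c(x)] on fresh data drawn from the same D.
test_points = [random.random() for _ in range(10_000)]
accuracy = sum(h(x) == true_c(x) for x in test_points) / len(test_points)
print(f"estimated agreement: {accuracy:.3f}")
```

With only 50 samples the learned threshold already sits very close to the true one, so the estimated agreement is near 1.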

2. PAC Learning Finite Concept Classes

The first question we can ask concerns sample complexity: how many samples do we need to see in order to learn a concept effectively? It is not hard to show that

m = O( (1/ε) · log(|C|/δ) )

samples drawn independently from D are sufficient: any hypothesis h ∈ C that agrees with all of these m samples (i.e., such that h(xi) = c(xi) for all i) will satisfy Pr[h(x) = c(x)] ≥ 1 − ε with probability at least 1 − δ over the choice of the m samples.

Proof (by a union bound): Let h ∈ C be any "bad" hypothesis, that is, one such that Pr[h(x) = c(x)] < 1 − ε. If we independently pick m points from the sample distribution D, then h is correct on all of these points with probability at most (1 − ε)^m. So the probability that there exists a bad hypothesis in C that nevertheless agrees with all our sample data is at most |C| (1 − ε)^m (the total number of hypotheses, good or bad, times the maximum probability of each bad hypothesis agreeing with the sample data).

Note that the above situation is a bad event (we picked a bad hypothesis because it agreed with all m samples). We want to limit the chance of this happening to, say, no more than δ. So we want |C| (1 − ε)^m ≤ δ, which holds whenever

m ≥ (1/ε) · ln(|C|/δ) = O( (1/ε) · log(|C|/δ) ).
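The bound is easy to evaluate numerically. A minimal sketch (the function name pac_sample_bound is my own, and the constant is taken to be 1 per the derivation above):

```python
import math

def pac_sample_bound(num_concepts, eps, delta):
    # m >= (1/eps) * ln(|C| / delta) samples suffice for a finite class.
    return math.ceil((1 / eps) * math.log(num_concepts / delta))

# e.g. |C| = 2**20 concepts, 5% error tolerance, 1% failure probability
print(pac_sample_bound(2**20, 0.05, 0.01))   # → 370
```

Note the logarithmic dependence on |C|: squaring the size of the concept class only doubles that term of the bound.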

Note that the true concept c is in C. So the learning agent can use the following algorithm to learn the concept:

  1. Find any hypothesis h ∈ C that agrees with all the sample data (i.e., such that h(xi) = c(xi) for all i = 1,...,m).
  2. Output h.
Such an h will always exist, and by the theorem above it will probably be a good hypothesis. All we need is to see enough sample observations!
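A minimal sketch of this two-step learner for a small finite class. The choice of class, the class of integer intervals [a, b] over {0,...,9}, and names like interval_concept are illustrative assumptions:

```python
from itertools import product

def interval_concept(a, b):
    # Concept that accepts exactly the points in the interval [a, b].
    return lambda x: 1 if a <= x <= b else 0

# The finite class C: all integer intervals [a, b] with 0 <= a <= b <= 9.
C = [(a, b) for a, b in product(range(10), repeat=2) if a <= b]

def learn(samples, labels):
    # Step 1: find any hypothesis in C consistent with all the data.
    for a, b in C:
        h = interval_concept(a, b)
        if all(h(x) == y for x, y in zip(samples, labels)):
            return (a, b)   # Step 2: output it.
    return None  # unreachable when the true concept lies in C

samples = [1, 3, 5, 7, 8]
labels  = [0, 1, 1, 1, 0]
print(learn(samples, labels))   # → (2, 7)
```

Any consistent interval would do; this brute-force search simply returns the first one it finds, which by the theorem is probably approximately correct once m is large enough.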

3. PAC Learning Infinite Concept Classes

What happens when |C| is infinite? The sample complexity bound derived in the previous section becomes useless (it states that we should observe a number of samples proportional to log |C| to find a good hypothesis).

We need to capture the true complexity of a concept class (not just the number of elements in it). The key idea that fully characterizes the complexity of a concept class is called VC-dimension (after two of its inventors, Vapnik and Chervonenkis). We say a set of points x1,...,xm is shattered by a concept class C if for all 2^m possible settings of c(x1),...,c(xm) to 0 or 1 (reject or accept), there is some concept c ∈ C that agrees with those values. Then the VC-dimension of C, denoted Dim(C), is the size of the largest set of points shattered by C. (If we can find arbitrarily large finite-size sets of points that can be shattered, then Dim(C) = ∞.)
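The definition of shattering can be checked directly for small point sets. The sketch below (with an assumed finite grid of threshold concepts standing in for the full class) shows that the threshold class shatters every one-point set but no two-point set, so its VC-dimension is 1:

```python
def shattered(points, concepts):
    # A point set is shattered if the concepts realize all 2^m labelings.
    realized = {tuple(c(x) for x in points) for c in concepts}
    return len(realized) == 2 ** len(points)

# Threshold concepts h_t(x) = 1 iff x >= t, for thresholds on a grid.
thresholds = [t / 10 for t in range(11)]
C = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

print(shattered([0.5], C))        # → True: both labelings of one point occur
print(shattered([0.2, 0.8], C))   # → False: the labeling (1, 0) is impossible
```

No threshold can accept 0.2 while rejecting 0.8, which is exactly why two points cannot be shattered.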

One can show that a concept class C is PAC-learnable if and only if its VC-dimension is finite, and that the number of samples

m = O( (Dim(C)/ε) · log(1/(εδ)) )

is sufficient. The learning algorithm can again simply output any hypothesis h ∈ C that is consistent with the m observations.

4. Computational Complexity of Learning

So far we have only discussed the sample complexity of learning (i.e., how many samples suffice to learn a concept). But how hard is it, as a computational problem, to find a hypothesis that is consistent with the m observations? The learning problem is certainly in NP, since given a hypothesis it is easy to check whether it is consistent with the given m samples.
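The verification step that places the problem in NP is just one pass over the data; a minimal sketch (the hypothesis here is an arbitrary illustrative example):

```python
def is_consistent(h, samples, labels):
    # Check h against all m labeled samples: O(m) hypothesis evaluations.
    return all(h(x) == y for x, y in zip(samples, labels))

h = lambda x: 1 if x >= 3 else 0   # an illustrative candidate hypothesis
print(is_consistent(h, [1, 4, 6], [0, 1, 1]))   # → True
print(is_consistent(h, [1, 4, 6], [1, 1, 1]))   # → False
```

So verifying a proposed hypothesis is easy; the hard part, in general, is finding one.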

If the learner's goal is to output a hypothesis in some fixed format (such as a DNF expression), it is possible to show that in some cases finding a hypothesis that is consistent with the observations is NP-complete. But if the learner's output hypothesis can be in any general format (i.e., any polynomial-time algorithm that makes predictions on data), we do not know whether finding such a hypothesis is NP-complete.

References: Scott Aaronson's lecture notes on Introduction to PAC learning.