Pr[h(x) = c(x)] ≥ 1 − ε
for a future observation x. Furthermore, the learning agent wants to succeed at this goal with probability at least 1 − δ over the choice of the x_{i}'s. In other words, with high probability we want to output a hypothesis that's approximately correct (hence "Probably Approximately Correct").
m = O( (1/ε) · log(|C|/δ) )
samples drawn independently from D are sufficient. Any hypothesis h ∈ C we can find that agrees with all of these m samples (i.e., such that h(x_{i}) = c(x_{i}) for all i) will satisfy Pr[h(x) = c(x)] ≥ 1 − ε with probability at least 1 − δ over the choice of the m samples.

Proof: (by contrapositive) Let h ∈ C be any "bad" hypothesis: that is, one such that Pr[h(x) = c(x)] < 1 − ε. If we independently pick m points from the sample distribution D, then h will be correct on all of these points with probability at most (1 − ε)^{m}. So the probability that there exists a bad hypothesis in C that nevertheless agrees with all our sample data is at most |C| (1 − ε)^{m} (the total number of hypotheses, good or bad, times the maximum probability that any single bad hypothesis agrees with the sample data).
Note that the situation above is the bad event: we picked a bad hypothesis because it happened to agree with all m samples. We want to limit the chance of this happening to, say, at most δ. So we want |C| (1 − ε)^{m} ≤ δ, or equivalently
m ≥ (1/ε) · log(|C|/δ)

(here we use that (1 − ε)^{m} ≤ e^{−εm}, so this many samples already force |C| (1 − ε)^{m} ≤ δ).
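To get a feel for the bound, here is a small sketch (the function name and the numbers are my own illustrations, not from the text) that computes the smallest m with |C| · e^{−εm} ≤ δ, which suffices since (1 − ε)^{m} ≤ e^{−εm}:

```python
import math

def pac_sample_bound(num_concepts: int, eps: float, delta: float) -> int:
    """Smallest m with |C| * exp(-eps * m) <= delta; this also guarantees
    |C| * (1 - eps)**m <= delta, since (1 - eps)**m <= exp(-eps * m)."""
    return math.ceil(math.log(num_concepts / delta) / eps)

# For example, with |C| = 1024 concepts, eps = 0.1, delta = 0.05:
m = pac_sample_bound(1024, 0.1, 0.05)   # m = 100
```

Note how mild the dependence on |C| is: squaring the size of the concept class only adds a constant factor of 2 inside the logarithm's contribution.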
Note that the true concept c is in C. So the learning agent can use the following algorithm to learn the concept: draw m samples x_{1},...,x_{m} independently from D, observe the labels c(x_{1}),...,c(x_{m}), and output any hypothesis h ∈ C that agrees with all of them.
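As a concrete toy sketch of that algorithm (the threshold concept class, the uniform distribution, and all names here are illustrative assumptions, not from the text):

```python
import random

# Toy setup: points are the integers 0..9, and the concept class consists
# of threshold functions c_t(x) = 1 iff x >= t.  (Illustrative only.)
DOMAIN = range(10)
CONCEPT_CLASS = [lambda x, t=t: int(x >= t) for t in range(11)]

def pac_learn(target, concepts, m, rng):
    """Draw m labeled samples from D (here: uniform), then output any
    hypothesis in the class that agrees with all of them."""
    samples = [(x, target(x)) for x in (rng.choice(DOMAIN) for _ in range(m))]
    for h in concepts:
        if all(h(x) == y for x, y in samples):
            return h, samples   # h is consistent with every sample
    raise AssertionError("unreachable: the true concept is in the class")

rng = random.Random(0)
target = CONCEPT_CLASS[4]                       # the unknown concept c
h, samples = pac_learn(target, CONCEPT_CLASS, m=30, rng=rng)
# h agrees with c on every sample, and by the theorem is probably
# approximately correct on fresh draws from D.
```

The search over all of C is brute force here; the theorem only cares that *some* consistent hypothesis is output, not how it is found.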
We need to capture the true complexity of a concept class (not just the number of elements in it). The key idea that fully characterizes the complexity of a concept class is called VC-dimension (after two of its inventors, Vapnik and Chervonenkis). We say the set of points x_{1},...,x_{m} is shattered by a concept class C if for all 2^{m} possible settings of c(x_{1}),...,c(x_{m}) to 0 or 1 (reject or accept), there is some concept c ∈ C that agrees with those values. Then the VC-dimension of C, denoted Dim(C), is the size of the largest set of points shattered by C. (If we can find arbitrarily large finite-size sets of points that can be shattered, then Dim(C) = ∞).
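The definition of shattering is easy to check by brute force for small classes. In this sketch (the threshold class is my own illustrative example), the class shatters some single point but no pair, so its VC-dimension is 1:

```python
def shatters(concepts, points):
    """True iff every 0/1 labeling of the points is realized by some
    concept in the class -- the definition of shattering above."""
    realized = {tuple(c(x) for x in points) for c in concepts}
    return len(realized) == 2 ** len(points)

# Threshold functions on {0,...,5}: c_t(x) = 1 iff x >= t.
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]

shatters(thresholds, [2])      # True: both labelings of one point occur
shatters(thresholds, [1, 3])   # False: no c_t labels 1 -> 1 and 3 -> 0
# So Dim of the threshold class is 1.
```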
One can show that a concept class C is PAC-learnable if and only if its VC-dimension is finite, and that the number of samples
m = O( (Dim(C)/ε) · log(1/(εδ)) )
suffices. The learning algorithm can again simply output any hypothesis h ∈ C that is consistent with the m observations.
If the learner's goal is to output a hypothesis in some fixed format (such as a DNF expression), it is possible to show that in some cases finding a hypothesis that is consistent with the observations is NP-complete. But if the learner's output hypothesis can be in any general format (i.e., any polynomial-time algorithm that predicts the labels of new data), we don't know whether finding such a hypothesis is NP-complete.