- L.G. Valiant. A theory of the learnable.
*Communications of the ACM*, 27(11):1134–1142, 1984- Original PAC learning paper

- A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth. Occam’s razor.
*Information Processing Letters*, 24:377-380, 1987.- Finite hypothesis classes / bit complexity

- V.N. Vapnik, A.Ya. Červonenkis. On uniform convergence of the frequencies of events to their probabilities.
*Teoriya Veroyatnostei i ee Primeneniya*, 16(2):264-279, 1971.- Original VC paper

- A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension.
*Journal of the ACM*, 36(4):929-965, 1989.- Another paper about VC dimension

- P.L. Bartlett, S. Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.
*Journal of Machine Learning Research*, 3:463-482, 2002.- Relates Rademacher complexity, Gaussian complexity, and uniform convergence

- R. Meir, T. Zhang. Generalization error bounds for Bayesian mixture algorithms.
*Journal of Machine Learning Research*, 4:839-860, 2003.- Rademacher-based analysis of study Bayesian mixture algorithms; Lipschitz contraction property for Rademacher averages without absolute value; convex duality-based bounds on Rademacher complexity.

- Y. Freund. Boosting a weak learning algorithm by majority.
*Information and Computation*, 121(2):256-285, 1995.- Introduces Boost-by-Majority

- Y. Freund, R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
*Journal of Computer and System Sciences*, 55(1):119-139, 1997.- Introduces AdaBoost

- R.E. Schapire, Y. Freund, P. Bartlett and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods.
*The Annals of Statistics*, 26(5):1651-1686, 1998.- Margin bound for weighted majority classifiers

- A.R. Klivans, R.A. Servedio. Boosting and hard-core set construction.
*Machine Learning*, 51(3):217-238, 2003.- Connection to hardness amplification

- R.E. Schapire. Drifting games.
*Machine Learning*, 43(3):265-291, 2001.- Introduces drifting games, with analysis of Boost-by-Majority

- P. Long, R. Servedio. Random classification noise defeats all convex potential boosters.
*Machine Learning*, 78(3):287-304, 2010.- Failure of boosting with convex losses when labels are noisy

- T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization.
*The Annals of Statistics*, 32:56-85, 2004- Develops theory of classification calibration with surrogate losses

- P.L. Bartlett, M.I. Jordan, J.D. McAuliffe. Convexity, classification, and risk bounds.
*Journal of the American Statistical Association*, 101(473):138-156, 2006.- Develops theory of classification calibration with surrogate losses

- P. Auer, M. Herbster, M.K. Warmuth. Exponentially Many Local Minima for Single Neurons.
*Advances in Neural Information Processing Systems*8:315–322, 1995.- Introduces the “matching loss” for learning with link functions, and establishes non-convexity of empirical squared loss risk with monotone link functions.

- Stone-Weierstrass theorem on Wikipedia.
- A powerful generalization of the classical Weierstrass approximation theorem.

- K. Hornik, M. Stinchcombe, H. White. Multilayer feedforward networks are universal approximators.
*Neural networks*, 2(5):359-366, 1989.- Uses Stone-Weierstrass to establish approximation capabilities of two-layer neural networks.

- A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.
*IEEE Transactions on Information Theory*, 39:930-944.- Connection between two-layer neural networks and Fourier transforms.

- N. Aronszajn. Theory of reproducing kernels.
*Transactions of the American Mathematical Society*, 68(3):337-404, 1950.- Correspondence between positive definite kernels and reproducing kernel Hilbert spaces.

- Y. Freund, R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
*Journal of Computer and System Sciences*, 55(1):119-139, 1997.- Introduces Hedge

- Y. Freund, R.E. Schapire. Adaptive game playing using multiplicative weights.
*Games and Economic Behavior*, 29:79-103, 1999.- Analysis of Hedge using relative entropy, and proof of von Neumann’s minimax theorem using Hedge.

- P. Auer, N. Cesa-Bianchi, Y. Freund, R.E. Schapire. The nonstochastic multiarmed bandit problem.
*SIAM Journal on Computing*, 32(1):48–77, 2002.- Introduces Exp3 algorithm (among many others).

- P. Auer, N. Cesa-Bianchi, P. Fischer. Finite-time analysis of the multiarmed bandit problem.
*Machine Learning*, 47:235-256, 2002.- Provides a non-asymptotic analysis of the Upper Confidence Bound algorithm.

- D.A. Freedman. On tail probabilities for martingales.
*Annals of Probability*, 3(1):100-118, 1975.- Tail probability inequalities for martingale difference sequences; ubiquitous in papers about multi-armed bandits and other sequential decision-making problems.

- S. Dasgupta, L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussian.
*Journal of Machine Learning Research*, 8:203-226, 2007.- Analyzes EM algorithm (a distance-based algorithm) for learning mixtures of Gaussians.

- S. Vempala, G. Wang. A spectral algorithm for learning mixture models.
*Journal of Computer and System Sciences*, 68(4):841-860, 2004.- Applies PCA to learning mixtures of Gaussians.

- F. McSherry. Spectral partitioning of random graphs.
*Proceedings 42nd IEEE Symposium on Foundations of Computer Science*, 2001.- Spectral clustering approaches to finding planted partitions in random graphs (related to stochastic block models).

- N. Alon, M. Krivelevich, B. Sudakov. Finding a large hidden clique in a random graph.
*Random Structures & Algorithms*, 13(3-4):457-466, 1998.- Spectral algorithm for finding cliques of size \(\Omega(\sqrt{n})\) planted in \(G(n,1/2)\) random graphs.