11.3.2 Low-density separation
Another major class of methods attempts to place boundaries in regions where there are few data points (labeled or unlabeled). One of the most commonly used algorithms is the transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas support vector machines for supervised learning seek a decision boundary with maximal margin over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard hinge loss $(1 - y f(x))_+$ for labeled data, a loss function $(1 - |f(x)|)_+$ is introduced over the unlabeled data by letting $y = \operatorname{sign} f(x)$. TSVM then selects $f^*(x) = h^*(x) + b$ from a reproducing kernel Hilbert space $\mathcal{H}$ by minimizing the regularized empirical risk:
$$f^* = \underset{f}{\operatorname{argmin}} \left( \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda_1 \|h\|_{\mathcal{H}}^2 + \lambda_2 \sum_{i=l+1}^{l+u} (1 - |f(x_i)|)_+ \right)$$
An exact solution is intractable due to the non-convex term $(1 - |f(x)|)_+$, so research has focused on finding useful approximations.[8]
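To make the objective concrete, the following minimal sketch evaluates it for a linear function f(x) = w·x + b. The linear form is an assumption made for illustration (under it the RKHS norm of h reduces to the squared Euclidean norm of w), and the names `tsvm_objective`, `lam1`, and `lam2` are hypothetical. The sketch only evaluates the risk for a given f; it does not attempt the non-convex minimization.

```python
def hinge(z):
    """Positive part (z)_+ = max(0, z)."""
    return max(0.0, z)


def tsvm_objective(w, b, labeled, unlabeled, lam1=1.0, lam2=1.0):
    """Evaluate the TSVM regularized risk for a linear f(x) = w.x + b.

    Assumption: a linear kernel, so ||h||_H^2 is the squared Euclidean
    norm of w. `labeled` is a list of (x, y) pairs with y in {-1, +1};
    `unlabeled` is a list of feature vectors.
    """
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    labeled_loss = sum(hinge(1.0 - y * f(x)) for x, y in labeled)
    # Unlabeled points are penalized for lying inside the margin,
    # whichever side of the boundary they fall on.
    unlabeled_loss = sum(hinge(1.0 - abs(f(x))) for x in unlabeled)
    reg = sum(wi * wi for wi in w)
    return labeled_loss + lam1 * reg + lam2 * unlabeled_loss


labeled = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
unlabeled = [[0.1, 0.0], [3.0, 1.0]]
value = tsvm_objective([1.0, 0.0], 0.0, labeled, unlabeled)
```

In this toy example both labeled points lie outside the margin, so the only loss comes from the unlabeled point at 0.1, which sits close to the boundary.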
Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).
11.3.3 Graph-based methods
Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its $k$ nearest neighbors or to examples within some distance $\epsilon$. The weight $W_{ij}$ of an edge between $x_i$ and $x_j$ is then set to $e^{-\|x_i - x_j\|^2 / \epsilon}$.
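The k-nearest-neighbor construction can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `knn_graph` is hypothetical, `eps` plays the role of ϵ in the weight formula, and making the graph symmetric (keeping an edge if either endpoint selects the other) is an assumption made here.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def knn_graph(points, k, eps):
    """Symmetric k-nearest-neighbor graph with Gaussian edge weights.

    W[i][j] = exp(-||x_i - x_j||^2 / eps) if j is among the k nearest
    neighbors of i (or i is among those of j), and 0 otherwise.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the other points, sorted by distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        for j in others[:k]:
            w = math.exp(-sq_dist(points[i], points[j]) / eps)
            W[i][j] = w
            W[j][i] = w  # symmetrize: keep the edge in both directions
    return W


points = [[0.0], [1.0], [5.0], [6.0]]
W = knn_graph(points, k=1, eps=2.0)
```

With `k=1` the four points split into two connected pairs, and no edge crosses the gap between 1.0 and 5.0.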
Within the framework of manifold regularization,[9][10] the graph serves as a proxy for the manifold. A term is added to the standard Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes

$$\underset{f \in \mathcal{H}}{\operatorname{argmin}} \left( \frac{1}{l} \sum_{i=1}^{l} V(f(x_i), y_i) + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f(x)\|^2 \, dp(x) \right)$$ [8]
where $\mathcal{H}$ is a reproducing kernel Hilbert space and $\mathcal{M}$ is the manifold on which the data lie. The regularization parameters $\lambda_A$ and $\lambda_I$ control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian $L = D - W$ where $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$ and $\mathbf{f}$ the vector $[f(x_1) \dots f(x_{l+u})]$, we have
$$\mathbf{f}^T L \mathbf{f} = \frac{1}{2} \sum_{i,j=1}^{l+u} W_{ij} (f_i - f_j)^2 \approx \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f(x)\|^2 \, dp(x)$$

The Laplacian can also be used to extend the supervised learning algorithms regularized least squares and support vector machines (SVM) to their semi-supervised versions, Laplacian regularized least squares and Laplacian SVM.
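The relationship between the Laplacian quadratic form and the pairwise differences can be checked numerically on a small graph. The sketch below uses hypothetical helper names and a hand-picked symmetric weight matrix. For symmetric W, the quadratic form equals one half of the sum of W_ij (f_i − f_j)² taken over all ordered pairs (i, j), which is the same as the sum over unordered pairs.

```python
def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal degree matrix."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def quadratic_form(L, f):
    """Compute f^T L f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))


def pairwise_penalty(W, f):
    """Sum of W_ij (f_i - f_j)^2 over all ordered pairs (i, j)."""
    n = len(f)
    return sum(W[i][j] * (f[i] - f[j]) ** 2
               for i in range(n) for j in range(n))


# Three nodes in a chain: a strong edge 0-1 and a weaker edge 1-2.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
f = [1.0, 1.0, -1.0]
L = laplacian(W)
smoothness = quadratic_form(L, f)
```

Here f agrees across the strong edge and disagrees across the weak one, so only the weak edge contributes to the penalty. A function that changed sign across the strong edge instead would be penalized more heavily.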
11.3.4 Heuristic approaches
Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labeled and unlabeled examples $x_1, \dots, x_{l+u}$ may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labeled examples.

Self-training is a wrapper method for semi-supervised learning. First a supervised learning algorithm is used to select a classifier based on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for another supervised learning problem. Generally only the labels the classifier is most confident of are added at each step.
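The self-training loop can be sketched with any base classifier that reports a confidence. The example below assumes a nearest-centroid classifier and uses the gap between the distances to the two closest class centroids as a stand-in confidence score. Both choices, and all function names, are illustrative assumptions rather than part of the method itself. At least two classes must be present in the labeled data.

```python
def centroids(examples):
    """Per-class mean of the feature vectors in a list of (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, xd in enumerate(x):
            acc[d] += xd
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(cents, x):
    """Return (label, confidence); confidence is the gap between the
    distances to the two closest centroids (larger means more certain)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, y)
        for y, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def self_train(labeled, unlabeled, per_round=1, rounds=10):
    """Repeatedly fit on the labeled set, then move the most confidently
    classified unlabeled examples into it with their predicted labels."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        cents = centroids(labeled)
        # Rank the pool by confidence and keep only the most certain ones.
        scored = sorted(pool, key=lambda x: predict(cents, x)[1],
                        reverse=True)
        chosen, pool = scored[:per_round], scored[per_round:]
        labeled += [(x, predict(cents, x)[0]) for x in chosen]
    return centroids(labeled), labeled


labeled = [([0.0], 0), ([10.0], 1)]
unlabeled = [[1.0], [2.0], [8.0], [9.0], [4.0]]
model, final = self_train(labeled, unlabeled)
```

Points near the labeled examples are absorbed first, and the ambiguous point at 4.0 is labeled last, once the centroids have shifted toward the unlabeled data.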
Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labeled examples for one another.
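A minimal sketch of co-training with two classifiers follows. It assumes two disjoint feature views given as index lists, a nearest-centroid base classifier, and a distance-gap confidence score. These are illustrative choices, all names are hypothetical, and published co-training variants differ in how many examples are exchanged per round and in whether the training sets are shared.

```python
def centroid_fit(examples, view):
    """Per-class mean over the feature indices in `view`."""
    groups = {}
    for x, y in examples:
        groups.setdefault(y, []).append([x[d] for d in view])
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}


def centroid_predict(cents, x, view):
    """Return (label, confidence) using only the features in `view`."""
    xv = [x[d] for d in view]
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xv, c)) ** 0.5, y)
        for y, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def co_train(labeled, unlabeled, view_a, view_b, rounds=10):
    # Each classifier has its own training set, seeded with the labeled data.
    train = {"a": list(labeled), "b": list(labeled)}
    views = {"a": view_a, "b": view_b}
    pool = list(unlabeled)
    for _ in range(rounds):
        for src, dst in (("a", "b"), ("b", "a")):
            if not pool:
                return train
            cents = centroid_fit(train[src], views[src])
            # The source classifier labels its most confident pool example
            # and hands it to the other classifier's training set.
            best = max(
                pool,
                key=lambda x: centroid_predict(cents, x, views[src])[1])
            label = centroid_predict(cents, best, views[src])[0]
            pool.remove(best)
            train[dst].append((best, label))
    return train


labeled = [([0.0, 0.0], 0), ([10.0, 10.0], 1)]
unlabeled = [[1.0, 5.0], [5.0, 9.0]]
train = co_train(labeled, unlabeled, view_a=[0], view_b=[1])
```

In this toy run the second unlabeled point is ambiguous in the first view (its first feature is equidistant from both centroids) but clear in the second view, so the second classifier supplies its label to the first.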
11.4 Semi-supervised learning in human cognition
Human responses to formal semi-supervised learning problems have yielded varying conclusions about the degree of influence of the unlabeled data (for a summary see [11]). More natural learning problems may also be viewed as instances of semi-supervised learning. Much of human concept learning involves a small amount of direct instruction (e.g. parental labeling of objects during childhood) combined with large amounts of unlabeled experience (e.g. observation of objects without naming or counting them, or at least without feedback).
Human infants are sensitive to the structure of unlabeled natural categories such as images of dogs and cats or male and female faces.[12] More recent work has shown that infants and children take into account not only the unlabeled examples available, but the sampling process from which labeled examples arise.[13][14]
11.5 See also
• PU learning
11.6 References
[1] Chapelle, Olivier; Schölkopf, Bernhard; Zien, Alexander (2006). Semi-supervised learning. Cambridge, Mass.: MIT Press. ISBN 978-0-262-03358-9.
[2] Stevens, K.N. (2000), Acoustic Phonetics, MIT Press, ISBN 0-262-69250-3, 978-0-262-69250-2
[3] Scudder, H.J. Probability of Error of Some Adaptive Pattern-Recognition Machines. IEEE Transactions on Information Theory, 11:363–371 (1965). Cited in Chapelle et al. 2006, page 3.
[4] Vapnik, V. and Chervonenkis, A. Theory of Pattern Recognition [in Russian]. Nauka, Moscow (1974). Cited in Chapelle et al. 2006, page 3.
[5] Ratsaby, J. and Venkatesh, S. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 412–417 (1995). Cited in Chapelle et al. 2006, page 4.
[6] Zhu, Xiaojin. Semi-supervised learning literature survey. Computer Sciences, University of Wisconsin-Madison (2008).
[7] Cozman, F. and Cohen, I. Risks of semi-supervised learning: how unlabeled data can degrade performance of generative classifiers. In: Chapelle et al. (2006).
[8] Zhu, Xiaojin. Semi-Supervised Learning. University of Wisconsin-Madison.
[9] M. Belkin, P. Niyogi. Semi-supervised Learning on Riemannian Manifolds. Machine Learning, 56, Special Issue on Clustering, 209–239, 2004.
[10] M. Belkin, P. Niyogi, V. Sindhwani. On Manifold Regularization. AISTATS 2005.
[11] Zhu, Xiaojin; Goldberg, Andrew B. (2009). Introduction to semi-supervised learning. Morgan & Claypool. ISBN 9781598295481.
[12] Younger, B. A. and Fearing, D. D. (1999), Parsing Items into Separate Categories: Developmental Change in Infant Categorization. Child Development, 70: 291–303.
[13] Xu, F. and Tenenbaum, J. B. (2007), Sensitivity to sampling in Bayesian word learning. Developmental Science, 10: 288–297.
[14] Gweon, H., Tenenbaum J.B., and Schulz L.E. (2010), Infants consider both the sample and the sampling process in inductive generalization. Proc Natl Acad Sci U S A., 107(20):9066–71.
11.7 External links
• A freely available MATLAB implementation of the graph-based semi-supervised algorithms Laplacian support vector machines and Laplacian regularized least squares.
Chapter 12
Grammar induction
Grammar induction, also known as grammatical inference or syntactic pattern recognition, refers to the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or alternatively as a finite state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.
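As a deliberately simple illustration of building an automaton from observations, the sketch below constructs a prefix tree acceptor: a deterministic finite automaton that accepts exactly the strings it was given. This construction is chosen here only as an example and the function names are hypothetical. It performs no generalization beyond the observed strings, which a practical inference method would need to do.

```python
def build_prefix_tree(samples):
    """Build a prefix tree acceptor from a collection of strings.

    States are integers (0 is the start state). `delta` maps
    (state, symbol) -> state and `accepting` is the set of final states.
    """
    delta, accepting, next_state = {}, set(), 1
    for s in samples:
        state = 0
        for ch in s:
            if (state, ch) not in delta:
                # Unseen prefix: allocate a fresh state for it.
                delta[(state, ch)] = next_state
                next_state += 1
            state = delta[(state, ch)]
        accepting.add(state)
    return delta, accepting


def accepts(delta, accepting, s):
    """Run the automaton on s and report whether it ends in a final state."""
    state = 0
    for ch in s:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in accepting


delta, accepting = build_prefix_tree(["ab", "abb", "b"])
```

The resulting automaton accepts "ab", "abb", and "b" and rejects everything else, including the prefix "a" and the unseen string "abbb".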
There is now a rich literature on learning different types of grammar and automata, under various learning models and using various methodologies.