3. Functional study of proteins that interact with elements of RaV

3.1 Selection of proteins that interact with the ends of the genome of Ra

This section provides an overview of semi-supervised learning and assumptions specific to this learning task. The definitions and terminology used throughout the section follow along the lines of the edited volume ‘Semi-supervised learning’ by Chapelle et al. (2006).

2.2.1 Problem Setting

Supervised learning is a learning task in which the goal is to find a mapping from an instance space X to a space of labels Y based on a training sample z = {(x_1, y_1), ..., (x_n, y_n)} of n examples sampled independently from a Borel probability measure ρ defined on Z = X × Y. The task can be evaluated on test examples sampled independently from ρ and unavailable to the learning algorithm during training. Semi-supervised learning is a class of supervised learning tasks in which the algorithm that extracts a functional dependence from training data has, in addition to a training sample z ∈ Z^n, a set of unlabeled instances at its disposal. More formally, in semi-supervised learning the training data consists of a training sample z ∈ Z^n and a set X_0 = {x_{n+1}, ..., x_{n+n_0}} of n_0 unlabeled instances that are sampled independently from the marginal probability measure ρ_X defined on X.

A related learning task, sometimes confused with semi-supervised learning, is that of transductive learning. In contrast to semi-supervised learning, the goal in such tasks is to predict correct labels on the unlabeled instances X_0 = {x_{n+1}, ..., x_{n+n_0}}. Typically, semi-supervised learning algorithms are evaluated in the transductive setting, with test samples consisting only of unlabeled instances available during training (i.e., a subset of X_0) and their labels.
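The transductive evaluation protocol described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method from the source: the toy data, the names `labeled`, `unlabeled`, `hidden_labels`, `predict_1nn` and `transductive_accuracy`, and the choice of a 1-nearest-neighbor predictor are all hypothetical. The predictor is a purely supervised placeholder that ignores the unlabeled instances during training; the point is only to show which data the learner sees and which data is used for scoring.

```python
from math import dist

# Labeled training sample z = {(x_i, y_i)} (hypothetical toy data).
labeled = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
# Unlabeled instances that are available to the learner during training.
unlabeled = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (1.1, 0.9)]
# Labels of the unlabeled instances: hidden from the learner, used only for scoring.
hidden_labels = [0, 1, 0, 1]

def predict_1nn(x, sample):
    """Return the label of the labeled example closest to x."""
    _, label = min(sample, key=lambda pair: dist(x, pair[0]))
    return label

def transductive_accuracy(sample, instances, true_labels):
    """Fraction of the training-time unlabeled instances that are labeled correctly."""
    predictions = [predict_1nn(x, sample) for x in instances]
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(instances)

accuracy = transductive_accuracy(labeled, unlabeled, hidden_labels)
```

A semi-supervised learner would replace `predict_1nn` with a predictor that also takes `unlabeled` as input; the scoring function would stay the same.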

2.2.2 When Can Unlabeled Data Aid in Learning?

In general, unlabeled instances do not necessarily aid in semi-supervised learning tasks. Moreover, there are cases in which additional unlabeled instances negatively affect the predictive performance of learning algorithms (e.g., see Chapter 4 in Chapelle et al., 2006). For unlabeled data to be useful in a semi-supervised learning task, it needs to contain information relevant to the inference of a target concept. More formally (Chapelle et al., 2006), the knowledge about ρ_X extracted from unlabeled instances has to carry information that is useful in the inference of ρ(y | x). Thus, for semi-supervised learning to work, certain assumptions on the data distribution must hold. In their edited volume on semi-supervised learning, Chapelle et al. (2006) formulate three standard assumptions of semi-supervised learning: i) the smoothness assumption, ii) the cluster assumption, and iii) the manifold assumption. At least one of these assumptions on the data distribution needs to be satisfied for unlabeled data to aid in learning. In the remainder of the section, similar to Chapelle et al. (2006), we cover each of these three assumptions by focusing on the problem of classification.

Smoothness assumption: If two instances x_1 and x_2 from a high-density region of ρ_X are close, then so should be the corresponding outputs y_1 and y_2.

This is an adaptation of the standard smoothness assumption for supervised learning, where it is assumed that if two instances are close in the instance space then so should be the corresponding outputs y_1 and y_2. Such assumptions are required to be able to generalize from training data to unseen instances. In contrast to the smoothness assumption for supervised learning, the smoothness assumption for semi-supervised learning depends on the marginal distribution of instances. This is precisely the source of additional information that allows learning algorithms to improve their predictive performance by taking into account the unlabeled instances in addition to the labeled training examples.

Cluster assumption: If instances are in the same cluster, then they are likely to be of the same class.

A cluster is often defined as a set of instances that can be connected by short curves which traverse only high-density regions of an instance space (Chapelle et al., 2006). Thus, for classification problems the cluster assumption is equivalent to the semi-supervised smoothness assumption. The motivation for the cluster assumption comes from datasets in which each class tends to form a cluster and in those cases unlabeled data can aid in determining boundaries of clusters (i.e., curves encompassing sets of instances) which correspond to decision boundaries separating the classes. As the boundary of a cluster cannot pass through a high-density region of the instance space, the assumption implies that the boundary lies in a low-density region. Here it is also important to note that the assumption does not state that clusters are compact structures consisting only of instances of the same class, but that frequently instances from the same class are observed close together in a high-density region of the instance space.
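The notion of a cluster as a set of instances connected through high-density regions can be made concrete with a small sketch. The code below is an illustrative assumption and not an algorithm from Chapelle et al. (2006): it treats two points as connected when they lie within a hypothetical `radius` of each other and spreads each known label along chains of such connections. The toy data, the function names `propagate_labels` and `nearest_seed_label`, and the radius value are invented for the example.

```python
from collections import deque
from math import dist

def propagate_labels(points, seed_labels, radius):
    """Spread each seed label to all points reachable via hops of length <= radius."""
    labels = dict(seed_labels)   # index of point -> class label
    queue = deque(seed_labels)   # indices whose neighbors still need a visit
    while queue:
        i = queue.popleft()
        for j, q in enumerate(points):
            if j not in labels and dist(points[i], q) <= radius:
                labels[j] = labels[i]
                queue.append(j)
    return labels

def nearest_seed_label(x, points, seed_labels):
    """Purely supervised baseline: label of the closest labeled point."""
    nearest = min(seed_labels, key=lambda i: dist(x, points[i]))
    return seed_labels[nearest]

# Two elongated clusters (hypothetical data): neighbors are 0.5 apart within
# a cluster, while the clusters are separated by an empty gap of width 1.5.
cluster_a = [(0.5 * k, 0.0) for k in range(9)]
cluster_b = [(0.5 * k, 1.5) for k in range(9)]
points = cluster_a + cluster_b
seeds = {0: 0, 17: 1}  # one labeled point per cluster, at opposite ends

labels = propagate_labels(points, seeds, radius=0.6)
baseline = nearest_seed_label(points[8], points, seeds)
```

In this toy setting the point at index 8 lies at the far end of the first cluster. The supervised baseline assigns it to class 1 because the labeled point of the second cluster is closer in Euclidean distance, whereas propagation through the unlabeled points assigns it to class 0, so the decision boundary falls in the low-density gap between the clusters. Points that cannot be reached from any labeled point stay unlabeled in this sketch.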

Manifold assumption: The instances lie (roughly) on a low-dimensional manifold.

A manifold is a topological space that is locally Euclidean, i.e., around every point there is a neighborhood that is topologically the same as the open unit ball in a Euclidean space (Rowland, 2017). To illustrate, consider the Earth, which is roughly spherical in shape but looks flat rather than round in a small neighborhood. Such small neighborhoods can be accurately represented by planes (e.g., geographical maps), unlike the Earth itself. In general, any object that is nearly flat on small scales is a manifold. The manifold assumption for semi-supervised learning can be seen as a link between the smoothness assumptions for supervised and semi-supervised learning. In particular, a manifold can be seen as an approximation to a high-density region of the instance space, and in this case the semi-supervised smoothness assumption is identical to the supervised smoothness assumption restricted to the data on the manifold. When the manifold assumption is satisfied, additional unlabeled instances can aid in approximating the manifold boundaries and allow embedding of data from a possibly high-dimensional input space into the low-dimensional space of the manifold. In this way, learning algorithms can overcome problems faced in high-dimensional spaces, where exponentially many samples are needed for consistent estimation because volume grows exponentially with the dimension of the problem.
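The exponential growth of volume with dimension can be illustrated with a back-of-the-envelope count. The sketch below rests on an assumption that is not part of the source: the instance space is taken to be a unit cube split into a regular grid, and the number of grid cells is used as a crude proxy for the number of samples needed to see every region at a fixed resolution. The function name `grid_cells` and the dimensions 100 and 2 are hypothetical.

```python
def grid_cells(cells_per_axis, dimension):
    """Number of cells in a grid that splits each axis of the unit cube into equal parts."""
    return cells_per_axis ** dimension

ambient_dim = 100   # hypothetical dimension of the input space
manifold_dim = 2    # hypothetical dimension of the manifold

# Cells needed at a resolution of 10 cells per axis.
cells_ambient = grid_cells(10, ambient_dim)
cells_manifold = grid_cells(10, manifold_dim)
```

Under these assumptions, covering the ambient space takes 10^100 cells, while covering a two-dimensional manifold at the same resolution takes 100, which is the kind of saving that an embedding into the low-dimensional space of the manifold is meant to deliver.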

In document Identificación y estudio de las interacciones virus-célula del vesivirus de conejo (RaV) (pages 103-107)