Similar to the setup introduced in section 2.2.2, we still aim to generalize to a target distribution for which we have few or no labels. In this case, however, we assume there are multiple source domains, each associated with a distinct distribution. Figure 3.1 shows a Venn diagram that illustrates the setup for learning from multiple source domains. Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the feature space and $\mathcal{Y}$ the output space. Let $D_T$ denote a joint distribution on $\mathcal{X} \times \mathcal{Y}$; it is the target distribution we are ultimately interested in. We assume there are $N$ distinct source domains. Each source domain $S_j$, $j = 1, \dots, N$, is associated with an unknown underlying distribution $D_{S_j}$. In the example of figure 3.1, two source domains are available.

Figure 3.1: Venn diagram on learning from multiple source domains. [Diagram labels: $S(D_T)$, $S(D_{S_1})$, $S(D_{S_2})$, with $D_{S_1} \neq D_{S_2} \neq D_T$.]

Ben-David et al. [10] extend their theory to the case of multiple source domains, in which the labeled data need not come from a single source domain but may come from several domains with different distributions. A similar bound can be obtained, but the assumption must be strengthened: they assume there exists a hypothesis $h^*$ that has low error on both the $\alpha$-weighted combination of the sources and the target domain. The resulting bound depends on the divergence between the target distribution and the mixture of sources.
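To make the quantities in this assumption concrete, the following is a minimal sketch of how the $\alpha$-weighted error and the source mixture can be written. The symbols $\epsilon_{S_j}$ and $\epsilon_T$ (expected error of a hypothesis on source $j$ and on the target), $\epsilon_\alpha$, and $D_\alpha$ are notation assumed here for illustration; the exact form and constants of the bound are given in [10] and are not reproduced.

```latex
\epsilon_\alpha(h) = \sum_{j=1}^{N} \alpha_j \, \epsilon_{S_j}(h),
\qquad
D_\alpha = \sum_{j=1}^{N} \alpha_j \, D_{S_j},
\qquad
\alpha_j \ge 0, \quad \sum_{j=1}^{N} \alpha_j = 1 .
```

Under this notation, the strengthened assumption says that some $h^* \in \mathcal{H}$ makes $\epsilon_T(h^*) + \epsilon_\alpha(h^*)$ small, and the divergence term in the bound is measured between $D_T$ and the mixture $D_\alpha$.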

Notation. Each source domain $S_j$, $j = 1, \dots, N$, is associated with a distinct underlying distribution $D_{S_j}$ on $\mathcal{X} \times \mathcal{Y}$, from which we can sample i.i.d. a set of labeled data $L_{S_j} = \{(x_1, y_1), \dots, (x_{n_{S_j}}, y_{n_{S_j}})\}$ as well as a much larger set of unlabeled data $U_{S_j} = \{x_{n_{S_j}+1}, \dots, x_{m_{S_j}}\}$. Note that the labeled set might be empty for some source domains. From the target distribution $D_T$, we can sample a small (or empty) set of labeled examples $L_T = \{(x_1, y_1), \dots, (x_{n_T}, y_{n_T})\}$ and another set of unlabeled data $U_T = \{x_{n_T+1}, \dots, x_{m_T}\}$ with $m_T \gg n_T$.

3.1.3 Background and Related Work

We assume that our data originate from $N$ source domains $S_1, \dots, S_N$ and one target domain $T$. From each source domain, we sample unlabeled data $U_{S_j} = \{x_1, \dots, x_{m_{S_j}}\} \subset \mathbb{R}^d$, $j = 1, \dots, N$. We follow the setup in [69] and assume that part of the data from one source domain, w.l.o.g. $S_1$, comes with labels $L_{S_1} = \{y_1, \dots, y_{n_{S_1}}\}$. From the target domain, in contrast, we can only sample data without labels, $U_T = \{x_1, \dots, x_{m_T}\} \subset \mathbb{R}^d$. We do not assume that these domains use identical features; we pad all input vectors with zeros so that all domains have equal dimensionality $d$. Our goal is to learn a classifier $h \in \mathcal{H}$ that accurately predicts the labels of the data from the target domain $T$.
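The zero-padding step can be illustrated with a small sketch. The function name `pad_to_dim` and the toy arrays are hypothetical, and the sketch simply appends zero columns on the right; how shared features are aligned across domains is not specified in the text and is assumed to have been handled beforehand.

```python
import numpy as np

def pad_to_dim(X, d):
    """Right-pad each row of X with zeros so that it has d features."""
    X = np.asarray(X, dtype=float)
    n, d_in = X.shape
    if d_in > d:
        raise ValueError("target dimensionality is smaller than the input")
    return np.hstack([X, np.zeros((n, d - d_in))])

# Hypothetical toy data: two domains with different feature counts.
U_S1 = np.array([[1.0, 0.0, 2.0],
                 [0.0, 1.0, 1.0]])          # 3 features
U_T = np.array([[1.0, 1.0, 0.0, 3.0, 0.0]])  # 5 features

# Common dimensionality d is the largest feature count over all domains.
d = max(U_S1.shape[1], U_T.shape[1])
U_S1_padded = pad_to_dim(U_S1, d)
U_T_padded = pad_to_dim(U_T, d)
```

After padding, inputs from every domain live in the same space and can be passed to a single feature learner or classifier.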

Structural Correspondence Learning

The first class of methods closely related to ours consists of the domain adaptation algorithms discussed in section 2.2.3, in particular the Structural Correspondence Learning (SCL) method of Blitzer et al. [21]. SCL also learns a joint source/target representation for domain adaptation in an unsupervised fashion. The difference is that SCL requires the identification of pivot features, which appear frequently in both domains and behave similarly, to put domain-specific words in correspondence. The low-rank representation learned in SCL essentially encodes the covariance between non-pivot features and the pivot features. As we detail later, single-layer mSDA also learns correlations, but between all the features. In other words, mSDA no longer uses pivot features, which can be hard to identify, as pointed out by the authors [21]. We learn more robust and powerful features by introducing corruption and by stacking multiple layers. At the same time, our new framework keeps the computational time comparable to that of SCL.

Stacked Denoising Autoencoder

The second class of closely related methods is autoencoders, which motivate our algorithm. Various forms of autoencoders have been developed in the deep learning literature [131, 6, 91, 97, 159, 126]. In its simplest form, an autoencoder has two components: an encoder $h(\cdot)$ maps an input $x \in \mathbb{R}^d$ to some hidden representation $h(x) \in \mathbb{R}^{d_h}$, and a decoder $g(\cdot)$ maps this hidden representation back to a reconstructed version of $x$, such that $g(h(x)) \approx x$. The parameters of the autoencoder are learned to minimize the reconstruction error, measured by some loss $\ell(x, g(h(x)))$. Choices for the loss include the squared error or, when the feature values are in $[0, 1]$, the Kullback-Leibler divergence.
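As a minimal sketch of this encoder/decoder structure, the snippet below evaluates $g(h(x))$ and the squared reconstruction error for a single input. The sigmoid nonlinearities, the random (untrained) weights, and the dimensions are assumptions made for illustration only; the text does not prescribe a particular parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: input dimensionality d, hidden dimensionality d_h.
d, d_h = 6, 3

# Randomly initialized parameters; training would adjust these.
W_enc = rng.normal(scale=0.1, size=(d_h, d))
b_enc = np.zeros(d_h)
W_dec = rng.normal(scale=0.1, size=(d, d_h))
b_dec = np.zeros(d)

def h(x):
    """Encoder: maps x in R^d to a hidden representation in R^{d_h}."""
    return sigmoid(W_enc @ x + b_enc)

def g(z):
    """Decoder: maps a hidden representation back to R^d."""
    return sigmoid(W_dec @ z + b_dec)

def squared_error(x, x_hat):
    """Reconstruction loss l(x, g(h(x))) as a squared error."""
    return float(np.sum((x - x_hat) ** 2))

x = rng.random(d)  # toy input with feature values in [0, 1)
loss = squared_error(x, g(h(x)))
```

Training would minimize `loss` over many inputs with respect to the four parameter arrays.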

Denoising Autoencoders (DAs) incorporate a slight modification to this setup and corrupt the inputs before mapping them into the hidden representation. They are trained to reconstruct (or denoise) the original input $x$ from its corrupted version $\tilde{x}$ by minimizing $\ell(x, g(h(\tilde{x})))$. Typical choices of corruption include additive isotropic Gaussian noise and binary masking noise. As in [159], we use the latter and set a fraction of the features of each input to zero. This is a natural choice for bag-of-words representations of text, where typical class-specific words can be missing due to the writing style of the author or differences between train and test domains.
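Binary masking noise can be sketched in a few lines. The version below zeroes each feature independently with probability `p`, which is one common way to "set a fraction of the features to zero"; the function name `mask_corrupt` and the toy matrix are hypothetical.

```python
import numpy as np

def mask_corrupt(X, p, rng):
    """Return a copy of X with each entry set to zero with probability p."""
    keep = rng.random(X.shape) >= p  # True where the feature survives
    return X * keep

rng = np.random.default_rng(0)

# Toy input matrix: 3 examples, 4 features, all entries non-zero.
X = np.arange(1, 13, dtype=float).reshape(3, 4)

# Corrupted inputs; the DA is trained to recover X from X_tilde.
X_tilde = mask_corrupt(X, 0.5, rng)
```

Every entry of `X_tilde` is either the original value or zero, so the reconstruction target `X` stays uncorrupted while the encoder only sees the masked version.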

The stacked denoising autoencoder (SDA) of [159] stacks several DAs together to create higher-level representations, by feeding the hidden representation of the $t$-th DA as input into the $(t+1)$-th DA. The training is performed greedily, layer by layer.
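The greedy layer-by-layer procedure can be sketched as follows. This is an illustrative toy, not the implementation of [159]: the sigmoid encoder, linear decoder, squared loss, full-batch gradient descent, and all hyper-parameter values and function names (`train_da`, `train_sda`, `encode`) are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_da(X, d_h, p=0.3, lr=0.1, epochs=50, seed=0):
    """Train one denoising autoencoder; return its encoder parameters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_h))
    b1 = np.zeros(d_h)
    W2 = rng.normal(scale=0.1, size=(d_h, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        X_tilde = X * (rng.random(X.shape) >= p)  # fresh masking noise
        H = sigmoid(X_tilde @ W1 + b1)            # encode corrupted input
        X_hat = H @ W2 + b2                       # linear decoder
        # Gradients of the mean squared reconstruction error w.r.t. clean X.
        dX_hat = (X_hat - X) / n
        dW2 = H.T @ dX_hat
        db2 = dX_hat.sum(axis=0)
        dA = (dX_hat @ W2.T) * H * (1.0 - H)
        dW1 = X_tilde.T @ dA
        db1 = dA.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, b1

def train_sda(X, layer_sizes, **kwargs):
    """Greedily train a stack of DAs, one layer at a time."""
    layers, inp = [], X
    for d_h in layer_sizes:
        W, b = train_da(inp, d_h, **kwargs)
        layers.append((W, b))
        inp = sigmoid(inp @ W + b)  # hidden output of layer t feeds layer t+1
    return layers

def encode(X, layers):
    """Map inputs through all trained encoders."""
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X

X = np.random.default_rng(1).random((20, 8))  # toy data
layers = train_sda(X, [5, 3])
Z = encode(X, layers)
```

Each layer is trained only after the previous one is fixed, so no gradients flow across layers during this stage.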

Feature Generation. Many researchers have seen autoencoders as a powerful tool for automatic discovery and extraction of nonlinear features. For example, Lee et al. [97] demonstrate that the hidden representations computed by either all or a subset of the layers of a convolutional neural network (CNN) make excellent features for classification with SVMs. The pre-processing with a CNN improves generalization by increasing robustness against noise and label-invariant transformations.

Glorot et al. [69] successfully apply SDAs to extract features for domain adaptation in document sentiment analysis. The authors train an SDA to reconstruct the input vectors (ignoring the labels) on the union of the source and target data. A classifier (e.g. a linear SVM) trained on the resulting feature representation $h(x)$ transfers significantly better from source to target than one trained on $x$ directly. Similar to CNNs, SDAs also combine correlated input dimensions, as they reconstruct removed feature values from uncorrupted features. It is shown in [69] that SDAs are able to disentangle hidden factors that explain the variations in the input data, and to automatically group features in accordance with their relatedness to these factors. This helps transfer across domains, as these generic concepts are invariant to domain-specific vocabularies.

As an intuitive example, imagine that we classify product reviews according to their sentiment. The source data consist of book reviews, the target data of kitchen appliance reviews. A classifier trained on the original source data never encounters the bigram “energy efficient” during training and therefore assigns zero weight to it. In the learned SDA representation, the bigram “energy efficient” would tend to reconstruct, and be reconstructed by, co-occurring features, typically of similar sentiment (e.g. “good” or “love”). Hence, the source-trained classifier can assign weights even to features that never occur in its original domain representation, because they are “re-constructed” by the SDA.

Although SDAs generate excellent features for domain adaptation, they have several drawbacks: 1) Training with (stochastic) gradient descent is slow and hard to parallelize (although a dense-matrix GPU implementation exists [12], as does an implementation based on reconstruction sampling [53] for sparse inputs); 2) There are several hyper-parameters (learning rate, number of epochs, noise ratio, mini-batch size and network structure) that need to be set by cross validation, which is particularly expensive as each individual run can take several hours; 3) The optimization is inherently non-convex and dependent on its initialization.