
In this section, we provide a theoretical justification for our proposed UDA algorithm and explain why it can be effective. We employ existing theoretical results on the suitability of optimal transport for domain adaptation [104] within our framework and demonstrate why our algorithm can train models that generalize well on the target domain. In short, we demonstrate that our algorithm reduces an upper bound on the true risk on the target domain.

First, note that the hypothesis class within our framework is the set of all deep network models $f_\theta(\cdot)$ that are parameterized by $\theta$. For any given model in this hypothesis class, we denote the observed risk on the source domain by $e_S$. Analogously, $e_T$ denotes the observed risk on the target domain in the UDA setting. Also, let $\hat{\mu}_S = \frac{1}{N}\sum_{n=1}^{N}\delta(x_n^s)$ denote the empirical source distribution, obtained from the drawn training samples. We can define the empirical target distribution $\hat{\mu}_T = \frac{1}{M}\sum_{m=1}^{M}\delta(x_m^t)$ similarly. Moreover, let $f_{\theta^*}$ denote the ideal model that minimizes the combined source and target risks $e_C(\theta^*)$, i.e., $\theta^* = \arg\min_\theta e_C(\theta) = \arg\min_\theta \{e_S + e_T\}$. If the hypothesis class is complex enough and enough labeled target domain data is available, the joint model can be learned such that it generalizes well on both domains. However, in the UDA setting, computing the joint model in a supervised manner is not feasible. To justify that our algorithm is a feasible alternative, we rely on the following theorem [30], which provides an upper bound on the target domain risk given the source domain risk.

Theorem 4.5.1. Under the assumptions described above for UDA, for any $d' > d$ and $\zeta < 2$, there exists a constant number $N_0$ depending on $d'$ such that for any $\xi > 0$ and $\min(N, M) \ge \max(\xi^{-(d'+2)}, 1)$, with probability at least $1 - \xi$ for all $f_\theta$, the following holds:

$$e_T \le e_S + W(\hat{\mu}_T, \hat{\mu}_S) + e_C(\theta^*) + \sqrt{2\log(1/\xi)/\zeta}\,\left(\sqrt{1/N} + \sqrt{1/M}\right). \tag{4.10}$$
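To get a feel for the sample-size condition in Theorem 4.5.1, the following minimal sketch evaluates $\max(\xi^{-(d'+2)}, 1)$ for a few values. The function name `sample_size_threshold` is hypothetical, and the constant $N_0$ mentioned in the theorem is not modeled because its value is not specified here.

```python
def sample_size_threshold(xi: float, d_prime: float) -> float:
    """Evaluate max(xi^-(d'+2), 1), the right-hand side of the
    sample-size condition min(N, M) >= max(xi^-(d'+2), 1)."""
    if xi <= 0:
        raise ValueError("xi must be positive")
    return max(xi ** -(d_prime + 2), 1.0)


if __name__ == "__main__":
    # For a fixed confidence parameter xi < 1, the threshold grows
    # exponentially with d'.
    for d_prime in (1, 2, 4):
        print(d_prime, sample_size_threshold(0.1, d_prime))
```

The sketch only illustrates that, for $\xi < 1$, the required number of samples grows quickly with $d'$, so the bound is most informative when the embedding dimension is low.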

Theorem 4.5.1 was originally proven for the binary classification setting with the 0-1 binary loss function $\mathcal{L}(\cdot)$ (thresholded binary softmax). For simplicity, we also limit our analysis to this setting, but note that these restrictions can be loosened to cover the broader multi-class classification case. At first glance, the theorem might seem to imply that, for a given model with good performance on the source domain, minimizing the Wasserstein distance between the source and the target distributions in an embedding space improves the generalization error on the target domain, because doing so makes the inequality in Eq. (4.10) tighter on the target risk. Performance on the target domain would then be similar to performance on the source domain. However, it is crucial to note that the Wasserstein distance cannot be minimized independently of the source risk. In other words, we need simultaneous joint optimization on both domains to make the inequality tight. More importantly, there is no guarantee that, after doing so, an optimal joint model $f_{\theta^*}$ with small joint error exists once the data points are mapped into the learned embedding space. This is important because the third term on the right-hand side of Eq. (4.10) becomes small only if such a joint model exists. For example, in a binary classification scenario, if the opposite classes are matched in the embedding space, then a good joint model does not exist. It is therefore essential that the feature extractor network is learned such that the third term in Eq. (4.10) becomes small. Note that we cannot even approximate $e_C(\theta^*)$ in the UDA framework, as there is no labeled data in the target domain.

Hence, this theorem explains why minimizing the Wasserstein distance is helpful but not sufficient: to account for all terms in Theorem 4.5.1, we should minimize the source empirical risk simultaneously and minimize the Wasserstein distance such that corresponding classes align in the embedding space. Note that although we minimize SWD in our framework and our theoretical results are originally derived for the Wasserstein distance, it has been theoretically demonstrated that SWD is a good approximation of the Wasserstein distance [131]:

$$SW_2(p_X, p_Y) \le W_2(p_X, p_Y) \le \alpha\, SW_2^{\beta}(p_X, p_Y), \tag{4.11}$$

where $\alpha$ is a constant and $\beta = (2(d+1))^{-1}$ (see [134] for more details).
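The left-hand inequality in Eq. (4.11) can be illustrated with a small Monte Carlo estimate of the sliced Wasserstein distance. The sketch below is a minimal illustration, not the implementation used in our framework. The function name `sliced_wasserstein_2`, the number of projections, and the restriction to equal-size samples are all assumptions made for this example.

```python
import numpy as np


def sliced_wasserstein_2(x, y, n_projections=128, seed=0):
    """Monte Carlo estimate of SW_2 between two equal-size point clouds.

    x, y: arrays of shape (n, d). Equal n keeps the 1-D optimal transport
    step a simple sort-and-match (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal-size samples of equal dimension")
    rng = np.random.default_rng(seed)
    # Random directions, normalized to lie on the unit sphere.
    theta = rng.normal(size=(n_projections, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both clouds onto every direction and sort each 1-D slice.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    # In 1-D, W_2^2 between empirical measures of equal size is the mean
    # squared gap between sorted samples.
    w2_squared = np.mean((x_proj - y_proj) ** 2, axis=0)
    return float(np.sqrt(np.mean(w2_squared)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    source = rng.normal(size=(200, 2))
    shift = np.array([3.0, 4.0])
    # For a pure translation, W_2 equals the shift norm (5 here),
    # so the estimate should not exceed 5.
    print(sliced_wasserstein_2(source, source + shift))
```

For a pure translation, the Wasserstein-2 distance equals the norm of the shift, so the printed value stays below that norm, in line with $SW_2 \le W_2$.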

Building upon these existing results, we propose the following theorem to justify our UDA algorithm.

Theorem 4.5.2. Consider that we use the pseudo-labeled target dataset $\mathcal{D}_{PL} = \{x_i^t, \hat{y}_i^t\}_{i=1}^{M_{PL}}$, which we [...] inequality holds:

$$e_T \le e_S + W(\hat{\mu}_S, \hat{\mu}_{PL}) + e_{C'}(\theta^*) + (1 - \tau) + \sqrt{2\log(1/\xi)/\zeta}\,\left(\sqrt{1/N} + \sqrt{1/M_{PL}}\right), \tag{4.12}$$

where $e_{C'}(\theta^*)$ denotes the expected risk of the optimal joint model $f_{\theta^*}$ on both the source domain and the confident pseudo-labeled target data points.

Proof: Since all the pseudo-labeled data points are selected according to the threshold $\tau$, if we select a pseudo-labeled data point randomly, then the probability that its pseudo-label is false is equal to $1 - \tau$. We can define the difference between the error based on the true label and the error based on the pseudo-label for a particular data point as follows:

$$|\mathcal{L}(f_\theta(x_i^t), y_i^t) - \mathcal{L}(f_\theta(x_i^t), \hat{y}_i^t)| = \begin{cases} 0, & \text{if } y_i^t = \hat{y}_i^t, \\ 1, & \text{otherwise.} \end{cases} \tag{4.13}$$

We can compute the expectation of the above error as:

$$|e_{PL} - e_T| \le \mathbb{E}\big(|\mathcal{L}(f_\theta(x_i^t), y_i^t) - \mathcal{L}(f_\theta(x_i^t), \hat{y}_i^t)|\big) \le (1 - \tau). \tag{4.14}$$

Using Eq. (4.14), we can deduce:

$$e_S + e_T = e_S + e_T + e_{PL} - e_{PL} \le e_S + e_{PL} + |e_T - e_{PL}| \le e_S + e_{PL} + (1 - \tau). \tag{4.15}$$

Note that since Eq. (4.15) is valid for all $\theta$, if we consider the joint optimal parameter $\theta^*$ in Eq. (4.15), we deduce:

$$e_C(\theta^*) \le e_{C'}(\theta^*) + (1 - \tau). \tag{4.16}$$

By considering Theorem 4.5.1, where the pseudo-labeled data points are the given target dataset, and then applying Eq. (4.16) to Eq. (4.10), Theorem 4.5.2 follows.
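The key step of the proof, Eq. (4.14), can be checked numerically on synthetic binary labels. The sketch below is a toy simulation under stated assumptions: each pseudo-label is flipped independently with probability $1 - \tau$, the classifier predictions are arbitrary random guesses, and the function name `pseudo_label_gap` is hypothetical.

```python
import numpy as np


def pseudo_label_gap(tau=0.9, m_pl=2000, seed=0):
    """Simulate 0-1 risks under true labels and under pseudo-labels.

    Returns (e_T, e_PL, mismatch), where mismatch is the empirical
    fraction of pseudo-labels that differ from the true labels.
    """
    rng = np.random.default_rng(seed)
    y_true = rng.integers(0, 2, size=m_pl)
    # Assumption: each pseudo-label is wrong independently w.p. 1 - tau.
    flip = rng.random(m_pl) > tau
    y_pseudo = np.where(flip, 1 - y_true, y_true)
    # Arbitrary predictions stand in for f_theta; the inequality below
    # holds for any classifier.
    pred = rng.integers(0, 2, size=m_pl)
    e_t = float(np.mean(pred != y_true))     # risk w.r.t. true labels
    e_pl = float(np.mean(pred != y_pseudo))  # risk w.r.t. pseudo-labels
    mismatch = float(np.mean(y_true != y_pseudo))
    return e_t, e_pl, mismatch


if __name__ == "__main__":
    e_t, e_pl, mismatch = pseudo_label_gap()
    # |e_PL - e_T| never exceeds the fraction of wrong pseudo-labels,
    # which is close to 1 - tau under the flipping assumption.
    print(abs(e_pl - e_t), mismatch)
```

The two risks can differ only on data points whose pseudo-label is wrong, which is why the gap is bounded by the mislabeling rate.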

Theorem 4.5.2 explains why our algorithm can learn models that generalize well on the target domain. At any given iteration, we minimize the upper bound on the target error given in Eq. (4.12). We minimize the source risk $e_S$ through the supervised loss. We minimize the Wasserstein distance by minimizing the SWD loss. The term $e_{C'}(\theta^*)$ is small because the pseudo-labeled data points are, by definition, selected such that their true labels can be predicted with high probability. Hence, the optimal model with parameter $\theta^*$ can perform well on both the source domain and the pseudo-labeled data points. The term $1 - \tau$ is also small because we only select confident data points. We also note that as the number of confident pseudo-labeled data points, $M_{PL}$, increases progressively, the constant term on the right-hand side of Eq. (4.12) (the square-root term) decreases, making the generalization bound tighter. Hence, our algorithm minimizes all the terms in Eq. (4.12), which reduces the upper bound on the true risk on the target domain.
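The effect of a growing $M_{PL}$ on the bound can be made concrete by evaluating the right-hand side of Eq. (4.12) directly. The sketch below is only illustrative: the function names are hypothetical, and the numeric values for the risks, the Wasserstein distance, $\xi$, and $\zeta$ are placeholders rather than measured quantities.

```python
import math


def concentration_term(n, m_pl, xi, zeta):
    """sqrt(2 log(1/xi) / zeta) * (sqrt(1/N) + sqrt(1/M_PL))."""
    if not 0.0 < xi < 1.0:
        raise ValueError("xi is assumed to lie in (0, 1)")
    scale = math.sqrt(2.0 * math.log(1.0 / xi) / zeta)
    return scale * (math.sqrt(1.0 / n) + math.sqrt(1.0 / m_pl))


def target_risk_bound(e_s, w_dist, e_joint, tau, n, m_pl, xi=0.05, zeta=1.0):
    """Right-hand side of Eq. (4.12) for given (placeholder) quantities."""
    return e_s + w_dist + e_joint + (1.0 - tau) + concentration_term(n, m_pl, xi, zeta)


if __name__ == "__main__":
    # Placeholder values; only the trend in M_PL matters here.
    for m_pl in (100, 1000, 10000):
        print(m_pl, target_risk_bound(0.05, 0.10, 0.02, 0.95, n=1000, m_pl=m_pl))
```

With all other quantities held fixed, the printed bound shrinks as `m_pl` grows, which mirrors the argument that selecting more confident pseudo-labeled points tightens the bound.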
