CHAPTER IV. RESULTS AND DISCUSSION
POPULATION AND SAMPLE
We provide a reference to the notation used throughout this chapter in Table 5.2. The goal of the bootstrapping framework is to augment the target domain labeled data $D_l^t$ with a subset of instances from the source domain labeled data $D_l^s$ in order to improve overall classification accuracy on the target domain unlabeled data $D_u^t$. For this purpose, we first build a classifier using $D_l^t$ and apply it to $D_l^s$ to select a subset of informative instances. If the label of a source domain instance is correctly predicted by the classifier, the instance is regarded as redundant, i.e., its knowledge is already contained in the target domain instances. If the predicted label is incorrect, we consider the source domain instance a candidate for addition, because it may contain knowledge that is lacking in the target domain labeled data. A scoring function (presented in Section 5.3.2) is used to determine the informativeness of the
Table 5.2: Table of notation

Symbol                        Description
$c$                           Classifier
$c_0$                         Initial classifier trained on $D_l^t$
$\hat{c}$                     Final adaptive classifier
$\delta$                      Threshold for selecting informative instances
$\gamma(\cdot,\cdot)$         Scoring function for the consistency between the content of an instance and a label
$k$                           Number of informative instances to be selected per iteration
$\phi(\cdot,\cdot)$           Scoring function for informativeness
$\lambda_c(\cdot,\cdot)$      Scoring function for the consistency factor
$\lambda_d(\cdot)$            Scoring function for the diversity factor
$\lambda_s(\cdot,\cdot)$      Scoring function for the similarity factor
$\pi_c(\cdot,\cdot)$          Scoring function for the content similarity between two instances
$\pi_l(\cdot,\cdot)$          Scoring function for the label similarity between two instances
$\pi_u(\cdot)$                Scoring function for the uncertainty factor
$D_l^s$                       Source domain labeled data
$D_l^t$                       Target domain labeled data
$D_u^t$                       Target domain unlabeled data
$D^t = D_l^t \cup D_u^t$      Overall target domain data
$T$                           Training data for classifier $c$
$T_{correct}^t$               Set of instances from $D_l^t$ that can be correctly classified by $c_0$
$T_{wrong}^t$                 Set of instances from $T_{correct}^t$ that are misclassified by $c$
$T^s$                         Remaining source domain labeled data after selecting informative instances in each iteration
$T_{wrong}^s$                 Set of instances from $T^s$ that are misclassified by $c$
$T_{info}^s$                  Set of informative instances selected from $T^s$
$X$                           The observable feature space
$Y$                           The label space
candidate, and decide whether to select it. Adding the selected informative source instances to $D_l^t$ yields a new classifier. Ideally, one would expect this new classifier to correctly classify more target domain instances. However, it may misclassify target domain labeled instances that were correctly classified initially, if a few false informative instances containing inconsistent knowledge were selected. When such misclassification happens, we resort to a “counterbalancing” process to recover. This is achieved by adding these misclassified target domain labeled instances, with their correct labels, to the training data again to improve the classification accuracy. In other
Algorithm 1: The bootstrapping framework
Input: $D_l^s$, $D_l^t$, $D_u^t$, $k$, $\delta$
Output: Adaptive classifier $\hat{c} : X \to Y$
 1  Train an initial classifier $c_0$ with $D_l^t$;
 2  $T_{correct}^t \leftarrow$ Set of instances from $D_l^t$ that can be correctly classified by $c_0$;
 3  Initialize $T \leftarrow D_l^t$, $T_{info}^s \leftarrow \emptyset$, $T_{wrong}^t \leftarrow \emptyset$, $T^s \leftarrow D_l^s$;
 4  repeat
 5      $T \leftarrow T \cup T_{info}^s \cup T_{wrong}^t$;
 6      Train a classifier $c$ with $T$;
 7      $T_{wrong}^s \leftarrow$ Set of instances from $T^s$ that are misclassified by $c$;
 8      $T_{info}^s \leftarrow$ Top $k$ instances with informativeness $\phi(\cdot,\cdot)$ greater than $\delta$ from $T_{wrong}^s$;
 9      $T^s \leftarrow T^s - T_{info}^s$;
10      $T_{wrong}^t \leftarrow$ Set of instances from $T_{correct}^t$ that are misclassified by $c$;
11  until $|T_{info}^s| < k$;
12  return $c$
words, those misclassified instances are given extra weight in the training data.
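The counterbalancing idea can be illustrated with a small sketch. It assumes a toy one-dimensional nearest-centroid classifier and made-up data points; the helper names `train` and `predict` are hypothetical and are not the classifier used in this chapter. The sketch only shows how re-adding a misclassified target instance acts as extra weight.

```python
from collections import defaultdict

def train(data):
    """Toy 1-D nearest-centroid classifier (an assumption for illustration):
    returns {label: mean feature value} for a list of (x, label) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(c, x):
    """Return the label whose centroid is closest to x."""
    return min(c, key=lambda y: abs(x - c[y]))

# Made-up target domain labeled data; the initial classifier gets all of it right.
target = [(0.0, "neg"), (2.0, "neg"), (4.0, "pos"), (6.0, "pos")]
c0 = train(target)
initially_correct = all(predict(c0, x) == y for x, y in target)

# A "false informative" source instance: c0 misclassifies it, so it is a
# candidate, but its knowledge is inconsistent with the target domain.
false_informative = [(-1.5, "pos")]
c_after_add = train(target + false_informative)
broken = predict(c_after_add, 2.0)      # the target instance (2.0, "neg") is now misclassified

# Counterbalancing: add the misclassified target instance again. The training
# data is a list (a multiset), so the duplicate counts twice in the centroid.
c_recovered = train(target + false_informative + [(2.0, "neg")])
recovered = predict(c_recovered, 2.0)
```

With these numbers, the duplicated instance pulls the "neg" centroid towards 2.0, which is enough to restore the original prediction. Note that this only works if the training data is treated as a multiset; a strict set union would silently drop the duplicate.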
Algorithm 1 illustrates the bootstrapping framework. Specifically, the algorithm takes as input $D_l^s$, $D_l^t$, $D_u^t$, a natural number $k$ indicating the number of source instances to be added per iteration, and a real number $\delta$ indicating the informativeness threshold for selecting source instances. The output is an adaptive classifier $\hat{c}$.
We start by training an initial classifier $c_0$ using $D_l^t$ (line 1). We initialize $T_{correct}^t$ with the instances from $D_l^t$ that are correctly classified by $c_0$ (line 2). We initialize the overall training data $T$ to $D_l^t$, the newly selected informative source domain instances $T_{info}^s$ to $\emptyset$, the counterbalancing target domain instances $T_{wrong}^t$ to $\emptyset$, and the source domain candidate instances $T^s$ to $D_l^s$ (line 3).
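For reference, the quantities set up in lines 1 to 3 can be restated in set-builder form. This restatement assumes that each labeled instance is a pair $(x, y)$ with $x \in X$ and $y \in Y$, which the notation suggests but the text does not state explicitly.

```latex
\begin{align*}
T_{correct}^t &= \{(x, y) \in D_l^t : c_0(x) = y\}, \\
T &\leftarrow D_l^t, \qquad
T_{info}^s \leftarrow \emptyset, \qquad
T_{wrong}^t \leftarrow \emptyset, \qquad
T^s \leftarrow D_l^s .
\end{align*}
```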
In every iteration, we first add the newly selected informative instances $T_{info}^s$ and the counterbalancing target domain instances $T_{wrong}^t$ to the overall training data $T$ (line 5), which is then used to train a new classifier $c$ (line 6). We set $T_{wrong}^s$ to the instances in $T^s$ whose labels differ from those predicted by classifier $c$ (line 7). As discussed earlier, these instances have the potential to augment the target domain training data by complementing it with knowledge that it lacks. We then set $T_{info}^s$ to the top $k$ instances from $T_{wrong}^s$ whose informativeness exceeds the threshold $\delta$, ranked by a scoring function that will be explained in Section 5.3.2 (line 8). We remove the newly selected informative source instances $T_{info}^s$ from the source domain instances $T^s$ (line 9). If a few false informative instances that contain inconsistent knowledge were selected and added to the training data, classifier $c$ may misclassify instances in $T_{correct}^t$ that were initially correctly classified by $c_0$. To counterbalance this effect, we set $T_{wrong}^t$ to the instances in $T_{correct}^t$ that are misclassified by classifier $c$ (line 10). The instances in $T_{wrong}^t$ will be added to the training data again (i.e., given extra weight) in the next iteration. As we iteratively select informative instances out of $T^s$, fewer and fewer informative instances remain in $T^s$. The whole process stops when we cannot select a sufficient number (the predefined number $k$) of instances in an iteration (line 11). The classifier $c$ trained during the last iteration is returned as the adaptive classifier.
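The loop described above can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the implementation used in this work: the classifier is a toy one-dimensional nearest-centroid model, the data is made up, and `phi` is a hypothetical stand-in for the informativeness score of Section 5.3.2. The training data $T$ is kept as a list so that re-added instances carry extra weight. The comments map each step to the line numbers of Algorithm 1.

```python
from collections import defaultdict

def train(data):
    """Toy 1-D nearest-centroid classifier (assumption): {label: mean x}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(c, x):
    """Return the label whose centroid is closest to x."""
    return min(c, key=lambda y: abs(x - c[y]))

def phi(c, instance):
    """HYPOTHETICAL informativeness score, NOT the one from Section 5.3.2:
    higher when the instance lies close to the centroid of its own label.
    Assumes the instance's label was seen during training."""
    x, y = instance
    return 1.0 / (1.0 + abs(x - c[y]))

def bootstrap(source_labeled, target_labeled, k, delta):
    if k < 1:
        raise ValueError("k must be a positive integer")
    c0 = train(target_labeled)                                            # line 1
    t_correct = [(x, y) for x, y in target_labeled if predict(c0, x) == y]  # line 2
    T = list(target_labeled)                                              # line 3
    t_info, t_wrong, Ts = [], [], list(source_labeled)
    while True:                                                           # line 4
        T = T + t_info + t_wrong          # line 5: list concatenation keeps duplicates
        c = train(T)                                                      # line 6
        wrong_idx = [i for i, (x, y) in enumerate(Ts)
                     if predict(c, x) != y]                               # line 7
        above = [(phi(c, Ts[i]), i) for i in wrong_idx]
        above = [(s, i) for s, i in above if s > delta]
        above.sort(key=lambda p: p[0], reverse=True)
        chosen = {i for _, i in above[:k]}                                # line 8
        t_info = [Ts[i] for i in sorted(chosen)]
        Ts = [inst for i, inst in enumerate(Ts) if i not in chosen]       # line 9
        t_wrong = [(x, y) for x, y in t_correct if predict(c, x) != y]    # line 10
        if len(t_info) < k:                                               # line 11
            return c                                                      # line 12

# Made-up data for illustration only.
target = [(0.0, "neg"), (2.0, "neg"), (4.0, "pos"), (6.0, "pos")]
source = [(3.4, "neg"), (2.9, "pos"), (-1.5, "pos"), (5.0, "pos"), (1.0, "neg")]
adaptive = bootstrap(source, target, k=1, delta=0.3)
```

On this toy data, the first iteration selects the source instance `(2.9, "pos")`, and the second iteration finds no remaining candidate above the threshold, so the loop stops. Because each non-terminating iteration removes exactly $k \geq 1$ instances from $T^s$, the loop always terminates for a finite source set.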