3.4. Cross-Domain Text Classification through Iterative Refining of Target Categories Representations

Basically, the method is based on nearest centroid classification, where explicit representations, or profiles, are built for each category in the same form as the bags of words used for documents, so that each new document is assigned to the category whose profile is most similar to its bag of words. These profiles are often simply the mean point (centroid) of the bags of known documents belonging to the respective categories. Cosine similarity (Eq. 3.1) is usually employed as the measure to compare the vector-like representations of documents and categories.
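The assignment rule described above can be sketched as follows. This is only an illustration, assuming sparse vectors are stored as Python dictionaries mapping terms to non-negative weights; the function names and the toy profiles are hypothetical and not taken from the original method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for empty vectors
    return dot / (norm_u * norm_v)

def nearest_profile(doc, profiles):
    """Return the category whose profile is most similar to the document."""
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))

# Hypothetical category profiles (already averaged from known documents).
profiles = {
    "sport": {"match": 1.0, "team": 1.5, "goal": 0.5},
    "tech": {"code": 1.5, "bug": 0.5, "team": 0.5},
}
predicted = nearest_profile({"goal": 1.0, "team": 1.0}, profiles)
print(predicted)
```

The toy document shares more weight with the "sport" profile than with the "tech" one, so it is assigned to the former.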
In standard text categorization, these profiles can be created from documents of the training set, which potentially leads to optimal representations of categories. In the cross-domain case, profiles can be created from documents of the source domain, whose labels are known, but these are presumably not optimal for classifying documents of the target domain, so some form of adaptation is needed to use them.
The proposed idea is to use profiles extracted from the source domain as a starting point, expecting that at least some documents of the target domain will be significantly similar to them. Considering these documents as correctly classified, updated profiles for categories can be computed by averaging them, on the assumption that they constitute better representations of the categories in the target domain. By iteratively repeating this step, somewhat as in the k-means clustering algorithm, category profiles can be further improved, as they are progressively extracted from more documents. After a number of iterations, the final profiles are used as a reference to classify documents.
Document Pre-Processing and Term Weighting
As a first step, as typically happens in text categorization, a pre-processing phase is run where all documents in D are reduced to bags of words.
For each document, single words are extracted, then those shorter than three letters or found in a list of stopwords (articles, prepositions, etc.) are removed. Among all distinct words found within documents, as in many other cross-domain methods, only those appearing in at least three documents are used as features or terms: the global set of selected features is denoted with $\mathcal{W}$. Each document $d$ is then represented by a vector $\mathbf{w}_d \in \mathbb{R}^{|\mathcal{W}|}$, containing for each term $t$ a weight $w_{t,d}$ of its relevance within the document.
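The pre-processing steps above can be sketched as follows. The stopword list, the regular expression used to split words and all function names are assumptions made for illustration; the sketch only produces raw term counts, leaving the weighting scheme aside.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "are"}

def tokenize(text):
    """Lowercase words of at least three letters that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= 3 and w not in STOPWORDS]

def select_features(token_lists, min_docs=3):
    """Keep the terms that appear in at least `min_docs` distinct documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t for t, n in doc_freq.items() if n >= min_docs}

def bags_of_words(texts, min_docs=3):
    """Return the feature set and one bag of raw term counts per document."""
    token_lists = [tokenize(t) for t in texts]
    features = select_features(token_lists, min_docs)
    bags = [Counter(t for t in tokens if t in features) for tokens in token_lists]
    return features, bags

texts = [
    "The team won the match",
    "A team of engineers fixed the bug",
    "The match was won by the home team",
    "Engineers and the team are testing",
]
features, bags = bags_of_words(texts)
print(features, bags)
```

In this toy collection only "team" appears in at least three documents, so it is the only feature retained.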
As described in Section 3.2.4, each weight $w_{t,d}$ is generally obtained as the product of a local factor, denoting the relevance of $t$ in $d$ itself, and a global factor, equal for all documents, indicating how important $t$ is across them. These schemes can be combined in different ways: Sect. 3.4.3 will initially present various composite schemes before establishing one of them to be used.
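As an illustration of the local-times-global structure, the following sketch uses raw term frequency as the local factor and inverse document frequency as the global one. This tf-idf combination is an assumption chosen only because it is widely known; it is not necessarily the scheme selected in Sect. 3.4.3.

```python
import math
from collections import Counter

def tf_idf(bags):
    """Weight each term as (raw term frequency) x (inverse document frequency).

    tf-idf is only one possible composite scheme, assumed here for illustration.
    """
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    # Global factor: identical for all documents.
    idf = {t: math.log(n_docs / n) for t, n in doc_freq.items()}
    # Local factor (term count) multiplied by the global factor.
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]

bags = [Counter({"team": 2, "match": 1}), Counter({"team": 1, "bug": 3})]
weighted = tf_idf(bags)
print(weighted)
```

A term found in every document, such as "team" here, receives a global factor of zero and therefore no weight.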
Cross-Domain Learning Algorithm
After processing documents, a profile $\mathbf{w}_c^0$ for each category $c \in \mathcal{C}$ is computed as the centroid of source documents labeled with $c$, whose set is denoted with $R_c^0$.

$$R_c^0 = \{d \in \mathcal{D}_S : C_S(d) = c\} \tag{3.24}$$

$$\mathbf{w}_c^0 = \frac{1}{|R_c^0|} \sum_{d \in R_c^0} \mathbf{w}_d \tag{3.25}$$
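Equations (3.24) and (3.25) can be sketched as a single pass over the labeled source documents. The dictionary-based data layout and the names used here are assumptions for illustration.

```python
def initial_profiles(source_vectors, source_labels):
    """Group source documents by label (3.24) and average their vectors (3.25).

    source_vectors: {doc_id: {term: weight}}, source_labels: {doc_id: category}.
    """
    sums, counts = {}, {}
    for doc_id, vec in source_vectors.items():
        c = source_labels[doc_id]
        counts[c] = counts.get(c, 0) + 1
        acc = sums.setdefault(c, {})
        for t, w in vec.items():
            acc[t] = acc.get(t, 0.0) + w
    return {c: {t: w / counts[c] for t, w in acc.items()} for c, acc in sums.items()}

# Hypothetical source domain with three labeled documents.
source_vectors = {
    "d1": {"match": 2.0, "team": 1.0},
    "d2": {"team": 2.0, "goal": 1.0},
    "d3": {"code": 2.0},
}
source_labels = {"d1": "sport", "d2": "sport", "d3": "tech"}
profiles0 = initial_profiles(source_vectors, source_labels)
print(profiles0)
```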
The “0” index denotes that these are initial profiles, which constitute the starting point for the iterative phase explained in the following. The index $i$ counts iterations and starts from 0.
First, the similarity score $s^i(d, c)$ between each target document $d \in \mathcal{D}_T$ and each category $c \in \mathcal{C}$ is computed. In the base method, it is simply the cosine similarity between the bag of words for $d$ and the current profile for $c$.

$$s^i(d, c) = \cos(\mathbf{w}_d, \mathbf{w}_c^i) \tag{3.26}$$

This similarity is considered as the absolute likelihood of $c$ being related to $d$, i.e. the probability that it is the correct category for $d$. In order to be confident in the assignment of a category $c$ to a document $d$, the relevant score $s^i(d, c)$ must be significantly higher than those for the same document and other categories. To evaluate whether this is the case, relative scores are computed by normalizing those of each document across all categories: in practice, this makes them a probability distribution among categories.

$$p^i(d, c) = \frac{s^i(d, c)}{\sum_{\gamma \in \mathcal{C}} s^i(d, \gamma)} \tag{3.27}$$

The value of $p^i(d, c)$ indicates the estimated probability of $c$ being the correct category for $d$, considering similarities between $d$ and all categories. For each document, the most likely category is obviously that with the highest score: we denote with $A_c^i \subseteq \mathcal{D}_T$ the set of target documents for which $c$ is the predicted category. However, the score could indicate more or less certainty about the prediction. Setting a threshold $\rho$, we can define for
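each category $c \in \mathcal{C}$ a set $R_c^{i+1} \subseteq A_c^i$ of documents for which the assignment of the $c$ label is “sure enough”.

$$A_c^i = \{d \in \mathcal{D}_T : c = \operatorname*{argmax}_{\gamma \in \mathcal{C}} p^i(d, \gamma)\} \tag{3.28}$$

$$R_c^{i+1} = \{d \in A_c^i : p^i(d, c) > \rho\} \tag{3.29}$$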
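The normalization of (3.27) and the selection of (3.28) and (3.29) can be sketched on a few hand-picked similarity scores. The scores, the threshold value and the handling of a document with all-zero similarities are assumptions for illustration.

```python
def relative_scores(similarities):
    """Eq. (3.27): normalize one document's similarity scores over all categories."""
    total = sum(similarities.values())
    if total == 0.0:
        # Assumption: a document unrelated to every profile supports no category.
        return {c: 0.0 for c in similarities}
    return {c: s / total for c, s in similarities.items()}

def select_representatives(scores, rho):
    """Eq. (3.28)-(3.29): scores maps doc id -> {category: s^i(d, c)}."""
    representatives = {}
    for doc_id, sims in scores.items():
        probs = relative_scores(sims)
        best = max(probs, key=probs.get)       # predicted category, Eq. (3.28)
        if probs[best] > rho:                  # confidence threshold, Eq. (3.29)
            representatives.setdefault(best, []).append(doc_id)
    return representatives

# Hypothetical cosine similarities for three target documents.
scores = {
    "t1": {"sport": 0.60, "tech": 0.20},  # relative scores about 0.75 / 0.25
    "t2": {"sport": 0.30, "tech": 0.30},  # relative scores 0.50 / 0.50
    "t3": {"sport": 0.10, "tech": 0.40},  # relative scores about 0.20 / 0.80
}
reps = select_representatives(scores, rho=0.7)
print(reps)
```

Document "t2" is equally similar to both profiles, so it does not pass the threshold and is left out of both representative sets.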
$R_c^{i+1}$ is a set of representative documents for the category $c$ in the target domain. A new profile for the category is built by averaging these documents.

$$\mathbf{w}_c^{i+1} = \frac{1}{|R_c^{i+1}|} \sum_{d \in R_c^{i+1}} \mathbf{w}_d \tag{3.30}$$
At this point, conditions for the termination of the iterative phase are checked. A maximum number $N_I$ of iterations is set to ensure that the algorithm does not run for an excessively long time. However, after a limited number of iterations, category profiles usually stop changing from one iteration to the next. If, at a certain iteration, category profiles are identical to those of the previous one, the same representative documents as before will be selected and the same profiles will be computed again, so in this case the iterative phase can be safely terminated. This leads to the following termination condition.
$$\forall c \in \mathcal{C} : \mathbf{w}_c^{i+1} = \mathbf{w}_c^i \tag{3.31}$$
If this condition does not hold and the number of finished iterations $(i + 1)$ is below $N_I$, all steps from (3.26) up to here are repeated with the iteration counter $i$ incremented by 1. Otherwise, the iterative phase terminates with an iteration count $n_I = i + 1$. When this happens, the final predicted category for each target document $d$ is computed as the one whose latest computed profile is most similar to the bag of words for $d$.
$$\hat{C}_T(d) = \operatorname*{argmax}_{c \in \mathcal{C}} \cos(\mathbf{w}_d, \mathbf{w}_c^{n_I}) \tag{3.32}$$
Besides the documents of the target domain known and used in the iterative phase, this formula can be applied to any previously unseen document of the same domain, comparing its bag of words to the final category profiles.
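The whole iterative phase, from (3.26) to (3.32), can be put together in a compact sketch. It assumes sparse vectors stored as dictionaries with non-negative weights; the function names, the toy data and the threshold are hypothetical. The text does not say what happens when a category receives no representative documents, so keeping its previous profile is an assumption of this sketch.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def refine_profiles(profiles, target_docs, rho, max_iterations):
    """Iterate Eq. (3.26)-(3.31); return the final profiles and the iteration count."""
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        representatives = {c: [] for c in profiles}
        for doc in target_docs:
            sims = {c: cosine(doc, p) for c, p in profiles.items()}  # (3.26)
            total = sum(sims.values())
            if total == 0.0:
                continue  # document unrelated to every current profile
            best = max(sims, key=sims.get)                           # (3.28)
            if sims[best] / total > rho:                             # (3.27), (3.29)
                representatives[best].append(doc)
        # Assumption: a category with no representatives keeps its old profile.
        updated = {c: centroid(docs) if docs else profiles[c]
                   for c, docs in representatives.items()}           # (3.30)
        converged = updated == profiles                              # (3.31)
        profiles = updated
        if converged:
            break
    return profiles, iterations

def classify(doc, profiles):
    """Eq. (3.32): category whose final profile is most similar to the document."""
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))

# Hypothetical source profiles and target documents using partly different terms.
source_profiles = {"sport": {"match": 1.0, "team": 1.0},
                   "tech": {"code": 1.0, "bug": 1.0}}
target_docs = [{"match": 1.0, "league": 1.0}, {"league": 1.0, "season": 1.0},
               {"code": 1.0, "app": 1.0}, {"app": 1.0, "release": 1.0}]
final_profiles, n_iter = refine_profiles(source_profiles, target_docs,
                                         rho=0.6, max_iterations=10)
label = classify({"release": 1.0}, final_profiles)
print(n_iter, label)
```

In this toy run the term "release" never occurs in the source profiles, yet the refined "tech" profile picks it up through target documents selected in earlier iterations, which is the adaptation effect the method relies on.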
Computational Complexity
The process performs many operations on vectors of length $|\mathcal{W}|$: using suitable data structures, both storage space and computation time can be bounded linearly w.r.t. the mean number of non-zero elements, which will be denoted with $l_D$ and $l_C$ for the bags of words of documents and categories, respectively. By definition, we have $l_D \le |\mathcal{W}|$ and $l_C \le |\mathcal{W}|$; in our experiments we also generally observed $l_D < l_C < |\mathcal{W}|$.
Initial profiles for categories are built in $O(|\mathcal{D}_S| \cdot l_D)$ time, as all values of all bags of words for documents must be summed up. Cosine similarity between vectors with $l_D$ and $l_C$ non-zero elements respectively can be computed in $O(l_D + l_C)$ time, which can be written as $O(l_C)$ given that $l_D < l_C$.
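One data structure that gives the $O(l_D + l_C)$ bound is a list of (index, weight) pairs sorted by term index, traversed with two pointers. This representation is an assumption for illustration, as the text does not specify which structure is used; the sketch shows the dot product, which is the dominant part of the cosine computation.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors given as (index, weight) lists sorted by index.

    Each step advances at least one pointer, so the loop runs at most
    len(u) + len(v) times.
    """
    i = j = 0
    total = 0.0
    while i < len(u) and j < len(v):
        iu, wu = u[i]
        iv, wv = v[j]
        if iu == iv:
            total += wu * wv
            i += 1
            j += 1
        elif iu < iv:
            i += 1
        else:
            j += 1
    return total

doc = [(2, 1.0), (7, 3.0)]                                    # l_D = 2 non-zero entries
profile = [(0, 0.5), (2, 2.0), (5, 1.0), (7, 1.0), (9, 4.0)]  # l_C = 5 non-zero entries
result = sparse_dot(doc, profile)  # 1.0 * 2.0 + 3.0 * 1.0
print(result)
```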
In each iteration of the refining phase, the method computes cosine similarity for $N_T = |\mathcal{D}_T| \cdot |\mathcal{C}|$ document-category pairs and normalizes them to obtain probability distributions in $O(N_T \cdot l_C)$ time; then, to build new bags of words for categories, up to $|\mathcal{D}_T|$ document bags must be summed up, which is done in $O(|\mathcal{D}_T| \cdot l_D)$ time. The sum of these two steps, again considering $l_D < l_C$, is $O(|\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, which must be multiplied by the final number $n_I$ of iterations.
Summing up, the overall complexity of the method is $O(|\mathcal{D}_S| \cdot l_D + n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, which can be simplified to $O(n_I \cdot |\mathcal{D}| \cdot |\mathcal{C}| \cdot l_C)$, with $l_C \le |\mathcal{W}|$. The complexity is therefore linear in the number $|\mathcal{D}|$ of documents, the number $|\mathcal{C}|$ of top categories (usually very small), the mean number $l_C$ of non-zero terms per category profile (having $|\mathcal{W}|$ as an upper bound) and the number $n_I$ of iterations in the final phase, which in our experiments is almost always no more than 20. This complexity is comparable to that of many other cross-domain classification methods.
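The simplification step can be spelled out as follows, assuming that $\mathcal{D}$ is the union of the disjoint source and target sets, so that $|\mathcal{D}| = |\mathcal{D}_S| + |\mathcal{D}_T|$, and using $l_D < l_C$, $n_I \ge 1$ and $|\mathcal{C}| \ge 1$:

```latex
\begin{align*}
|\mathcal{D}_S| \cdot l_D + n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C
  &\le |\mathcal{D}_S| \cdot l_C + n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C \\
  &\le n_I \cdot |\mathcal{C}| \cdot l_C \cdot \left(|\mathcal{D}_S| + |\mathcal{D}_T|\right) \\
  &= n_I \cdot |\mathcal{D}| \cdot |\mathcal{C}| \cdot l_C
\end{align*}
```

The first term, the cost of building the initial profiles, is thus absorbed into the cost of the iterative phase.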