
3.4. Cross-Domain Text Classification through Iterative Refining of Target Categories Representations

Basically, the method is based on nearest centroid classification: explicit representations, or profiles, are built for each category in the same bag-of-words form used for documents, so that each new document is assigned to the category whose profile is most similar to its bag of words. These profiles are often simply the mean point (centroid) of the bags of words of known documents belonging to the respective categories. Cosine similarity (Eq. 3.1) is usually employed to compare the vector-like representations of documents and categories.
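As a minimal sketch of the comparison step described above, the following assumes that bags of words and profiles are stored as sparse `{term: weight}` dictionaries; the function names and the toy profiles are hypothetical and only serve as illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector for the dot product
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def nearest_profile(doc, profiles):
    """Return the category whose profile is most similar to the document."""
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))

# Hypothetical category profiles (centroids) and a new document.
profiles = {
    "sports": {"match": 1.0, "team": 1.5, "goal": 0.5},
    "finance": {"stock": 1.0, "market": 1.5, "bank": 0.5},
}
predicted = nearest_profile({"team": 1.0, "market": 0.2}, profiles)
print(predicted)  # sports
```

Sparse dictionaries keep the cost of each comparison tied to the number of non-zero terms rather than to the size of the whole vocabulary.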

In standard text categorization, these profiles can be created from the documents of the training set, which potentially leads to optimal representations of categories. In the cross-domain case, profiles can be created from documents of the source domain, whose labels are known, but these are presumably not optimal for classifying documents of the target domain, so some sort of adaptation is needed to use them.

The proposed idea is to use profiles extracted from the source domain as a starting point, expecting that at least some documents of the target domain will be significantly similar to them. Considering these documents as correctly classified, updated profiles for categories can be computed by averaging them, on the assumption that they constitute better representations of the categories in the target domain. By iteratively repeating this step, somewhat similarly to what happens in the k-means clustering algorithm, category profiles can be further improved, as they are progressively extracted from more documents. After a number of iterations, the final profiles are used as a reference to classify documents.

Document Pre-Processing and Term Weighting

As a first step, as typically happens in text categorization, a pre-processing phase is run in which all documents in $\mathcal{D}$ are reduced to bags of words.

For each document, single words are extracted, then those shorter than three letters or found in a list of stopwords (articles, prepositions, etc.) are removed. Among all distinct words found within documents, as in many other cross-domain methods, only those appearing in at least three documents are used as features or terms: the global set of selected features is denoted with $\mathcal{W}$. Each document $d$ is then represented with a vector $\mathbf{w}_d \in \mathbb{R}^{|\mathcal{W}|}$, containing for each term $t$ a weight $w_{t,d}$ of its relevance within the document.
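The pre-processing steps above can be sketched as follows. The stopword list, the ASCII-only tokenizer and the use of raw term counts as placeholder weights are assumptions made for illustration; the source specifies only the length filter, the stopword filter and the threshold of three documents.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "with", "that", "this"}  # hypothetical tiny list
MIN_WORD_LENGTH = 3  # words shorter than three letters are removed
MIN_DOC_FREQ = 3     # a term must appear in at least three documents

def tokenize(text):
    """Extract lowercase words, dropping short words and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())  # assumption: ASCII letters only
    return [w for w in words if len(w) >= MIN_WORD_LENGTH and w not in STOPWORDS]

def select_features(texts):
    """Return the set W of terms appearing in at least MIN_DOC_FREQ documents."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))  # count each term once per document
    return {t for t, n in doc_freq.items() if n >= MIN_DOC_FREQ}

def bag_of_words(text, features):
    """Sparse bag of words restricted to the selected features (raw counts)."""
    return dict(Counter(w for w in tokenize(text) if w in features))

texts = [
    "The market is up and the bank is open",
    "A bank in the market",
    "Market news for the team",
    "The team won the match",
]
features = select_features(texts)
bags = [bag_of_words(t, features) for t in texts]
print(features)  # {'market'}
```

In this toy corpus only "market" occurs in three documents, so it is the only selected feature and the last document ends up with an empty bag.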

As described in Section 3.2.4, each weight $w_{t,d}$ is generally obtained as the product of a local factor, denoting the relevance of $t$ in $d$ itself, and a global factor, equal for all documents, indicating how important $t$ is across them. These factors can be combined in different ways: Section 3.4.3 will initially present various composite schemes before establishing which one of them is to be used.
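To illustrate the local-times-global structure of a weight, here is a sketch using plain tf-idf (raw term frequency as the local factor, logarithmic inverse document frequency as the global one). This particular scheme is an assumption chosen as a familiar example, not necessarily the one adopted by the method; the function name is hypothetical.

```python
import math

def tf_idf_weights(bags):
    """Weight each term count by a global idf factor shared by all documents."""
    n_docs = len(bags)
    doc_freq = {}
    for bag in bags:
        for t in bag:
            doc_freq[t] = doc_freq.get(t, 0) + 1
    # Global factor: the same for every document, depends only on the term.
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    # Local factor: the raw count of the term in the document.
    return [{t: count * idf[t] for t, count in bag.items()} for bag in bags]

bags = [{"market": 2, "bank": 1}, {"market": 1, "team": 3}, {"team": 1}]
weighted = tf_idf_weights(bags)
```

Here "bank" appears in one document out of three, so its global factor is log(3), while "market" and "team" each appear in two and receive the smaller factor log(3/2).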

Cross-Domain Learning Algorithm

After processing documents, a profile $\mathbf{w}_c^0$ for each category $c \in \mathcal{C}$ is computed as the centroid of the source documents labeled with $c$, whose set is denoted with $R_c^0$.

$$R_c^0 = \{ d \in \mathcal{D}_S : C_S(d) = c \} \tag{3.24}$$

$$\mathbf{w}_c^0 = \frac{1}{|R_c^0|} \sum_{d \in R_c^0} \mathbf{w}_d \tag{3.25}$$
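A minimal sketch of Eqs. 3.24 and 3.25, assuming sparse `{term: weight}` bags of words and a parallel list of source labels; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def initial_profiles(source_docs, source_labels):
    """Return {c: w_c^0}, the centroid of the source documents labeled c."""
    members = defaultdict(list)  # R_c^0: source documents grouped by label
    for doc, label in zip(source_docs, source_labels):
        members[label].append(doc)
    profiles = {}
    for label, docs in members.items():
        total = defaultdict(float)
        for vec in docs:
            for term, weight in vec.items():
                total[term] += weight
        # Divide the summed weights by |R_c^0| to obtain the mean point.
        profiles[label] = {t: w / len(docs) for t, w in total.items()}
    return profiles

source_docs = [
    {"match": 2.0, "team": 1.0},
    {"team": 3.0},
    {"stock": 1.0, "market": 1.0},
]
source_labels = ["sports", "sports", "finance"]
profiles = initial_profiles(source_docs, source_labels)
print(profiles["sports"])  # {'match': 1.0, 'team': 2.0}
```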

The “0” index denotes that these are initial profiles, which constitute the starting point for the iterative phase explained in the following. The index $i$ used below is the iteration counter, which starts from 0.

First, the similarity score $s^i(d, c)$ between each target document $d \in \mathcal{D}_T$ and each category $c \in \mathcal{C}$ is computed. In the base method, it is simply the cosine similarity between the bag of words for $d$ and the current profile for $c$.

$$s^i(d, c) = \cos(\mathbf{w}_d, \mathbf{w}_c^i) \tag{3.26}$$

This similarity is considered as the absolute likelihood of $c$ being related to $d$, i.e. the probability that it is the correct category for $d$. In order to be confident in the assignment of a category $c$ to a document $d$, the relevant score $s^i(d, c)$ must be significantly higher than those for the same document and other categories. To evaluate whether this is the case, relative scores are computed by normalizing those of each document over all categories: in practice, this makes them a probability distribution among categories.

$$p^i(d, c) = \frac{s^i(d, c)}{\sum_{\gamma \in \mathcal{C}} s^i(d, \gamma)} \tag{3.27}$$

The value of $p^i(d, c)$ indicates the estimated probability of $c$ being the correct category for $d$, considering the similarities between $d$ and all categories. For each document, the most likely category is obviously the one with the highest score: we denote with $A_c^i \subseteq \mathcal{D}_T$ the set of target documents for which $c$ is the predicted category. However, the score could indicate more or less certainty about the prediction. Setting a threshold $\rho$, we can define for each category $c \in \mathcal{C}$ a set $R_c^{i+1} \subseteq A_c^i$ of documents for which the assignment of the $c$ label is “sure enough”.

$$A_c^i = \{ d \in \mathcal{D}_T : c = \operatorname*{argmax}_{\gamma \in \mathcal{C}} p^i(d, \gamma) \} \tag{3.28}$$

$$R_c^{i+1} = \{ d \in A_c^i : p^i(d, c) > \rho \} \tag{3.29}$$
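The selection of representative documents in Eqs. 3.27 to 3.29 can be sketched as below, assuming the similarity scores have already been computed and are non-negative. The function name, the toy scores and the choice to skip documents whose scores are all zero are assumptions not stated in the source.

```python
def select_representatives(scores, rho):
    """Map each category c to the list of documents in R_c^{i+1}.

    scores: {doc_id: {category: s^i(d, c)}} with non-negative similarities.
    """
    representatives = {}
    for doc_id, by_category in scores.items():
        for c in by_category:
            representatives.setdefault(c, [])
        total = sum(by_category.values())
        if total == 0.0:
            continue  # assumption: no overlap with any profile, so skip
        # Relative scores p^i(d, c): normalize over all categories (Eq. 3.27).
        relative = {c: s / total for c, s in by_category.items()}
        best = max(relative, key=relative.get)  # d belongs to A_best^i (Eq. 3.28)
        if relative[best] > rho:                # "sure enough" (Eq. 3.29)
            representatives[best].append(doc_id)
    return representatives

scores = {
    "d1": {"sports": 0.9, "finance": 0.1},
    "d2": {"sports": 0.5, "finance": 0.4},
    "d3": {"sports": 0.1, "finance": 0.7},
}
reps = select_representatives(scores, rho=0.6)
print(reps)  # {'sports': ['d1'], 'finance': ['d3']}
```

Document d2 is predicted as "sports" but its relative score is about 0.56, below the threshold, so it is not used to rebuild any profile.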

$R_c^{i+1}$ is a set of representative documents for the category $c$ in the target domain. A new profile for the category is built by averaging these documents.

$$\mathbf{w}_c^{i+1} = \frac{1}{|R_c^{i+1}|} \sum_{d \in R_c^{i+1}} \mathbf{w}_d \tag{3.30}$$

At this point, the conditions for terminating the iterative phase are checked. A maximum number $N_I$ of iterations is set to ensure that the algorithm does not run for an excessively long time. However, after a limited number of iterations, category profiles usually cease to change from one iteration to the next. If at a certain iteration the category profiles are identical to those of the previous one, the same representative documents will be selected and the same profiles will keep being computed, so in this case the iterative phase can be safely terminated. This leads to the following termination condition.

$$\forall c \in \mathcal{C} : \mathbf{w}_c^{i+1} = \mathbf{w}_c^i \tag{3.31}$$

If this condition does not hold and the number of finished iterations ($i + 1$) is below $N_I$, all steps from (3.26) up to here are repeated with the iteration counter $i$ incremented by 1. Otherwise, the iterative phase terminates with an iteration count $n_I = i + 1$. When this happens, the final predicted category for each target document $d$ is computed as the one whose latest computed profile is most similar to the bag of words for $d$.

$$C_T(d) = \operatorname*{argmax}_{c \in \mathcal{C}} \cos(\mathbf{w}_d, \mathbf{w}_c^{n_I}) \tag{3.32}$$

Besides the documents of the target domain known and used in the iterative phase, this formula can be applied to any previously unseen document of the same domain, by comparing its bag of words to the final category profiles.
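Putting the steps from (3.26) to (3.32) together, the iterative phase might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: vectors are sparse `{term: weight}` dictionaries, all names and toy data are hypothetical, and a category that receives no representative documents in an iteration keeps its previous profile, a case the text above does not cover.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norms = math.sqrt(sum(w * w for w in u.values())) * \
            math.sqrt(sum(w * w for w in v.values()))
    return dot / norms if norms else 0.0

def centroid(vectors):
    """Mean point of a non-empty list of sparse vectors (Eq. 3.30)."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def refine_profiles(profiles, target_docs, rho, max_iterations):
    """Iteratively rebuild category profiles from confident target documents."""
    iterations = 0
    while iterations < max_iterations:
        representatives = {c: [] for c in profiles}
        for doc in target_docs:
            scores = {c: cosine(doc, p) for c, p in profiles.items()}  # Eq. 3.26
            total = sum(scores.values())
            if total == 0.0:
                continue  # assumption: no overlap with any profile, so skip
            best = max(scores, key=scores.get)                         # Eq. 3.28
            if scores[best] / total > rho:                             # Eqs. 3.27, 3.29
                representatives[best].append(doc)
        # Assumption: a category with no representatives keeps its old profile.
        new_profiles = {c: centroid(docs) if docs else profiles[c]
                        for c, docs in representatives.items()}
        iterations += 1
        converged = new_profiles == profiles                           # Eq. 3.31
        profiles = new_profiles
        if converged:
            break
    return profiles, iterations

def classify(doc, profiles):
    """Final prediction: category with the most similar profile (Eq. 3.32)."""
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))

# Hypothetical source-domain profiles and unlabeled target-domain documents.
source_profiles = {"sports": {"team": 1.0, "match": 1.0},
                   "finance": {"market": 1.0, "stock": 1.0}}
target_docs = [{"team": 1.0, "league": 1.0}, {"market": 1.0, "fund": 1.0},
               {"fund": 1.0, "bond": 1.0}, {"league": 1.0, "coach": 1.0}]
final_profiles, n_iterations = refine_profiles(source_profiles, target_docs,
                                               rho=0.6, max_iterations=20)
print(n_iterations)                             # 3
print(classify({"bond": 1.0}, final_profiles))  # finance
```

In this toy run, "bond" never occurs in the source profiles; it enters the "finance" profile only in the second iteration, once the document containing "fund" and "bond" becomes similar to the refined profile. The third iteration selects the same documents, so the profiles stop changing and the loop ends.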

Computational Complexity

The process performs many operations on vectors of length $|\mathcal{W}|$: using suitable data structures, both storage space and computation time can be bounded linearly w.r.t. the mean number of non-zero elements, which will be denoted with $l_D$ and $l_C$ for the bags of words of documents and categories, respectively. By definition, we have $l_D \le |\mathcal{W}|$ and $l_C \le |\mathcal{W}|$; in our experiments we also generally observed $l_D < l_C < |\mathcal{W}|$.

Initial profiles for categories are built in $O(|\mathcal{D}_S| \cdot l_D)$ time, as all values of all bags of words for documents must be summed up. Cosine similarity between vectors with $l_D$ and $l_C$ non-zero elements, respectively, can be computed in $O(l_D + l_C)$ time, which can be written as $O(l_C)$ given that $l_D < l_C$.

In each iteration of the refining phase, the method computes cosine similarity for $N_T = |\mathcal{D}_T| \cdot |\mathcal{C}|$ document-category pairs and normalizes them to obtain probability distributions in $O(N_T \cdot l_C)$ time; then, to build new bags of words for categories, up to $|\mathcal{D}_T|$ document bags must be summed up, which is done in $O(|\mathcal{D}_T| \cdot l_D)$ time. The sum of these two steps, again considering $l_D < l_C$, is $O(|\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, which must be multiplied by the final number $n_I$ of iterations.

Summing up, the overall complexity of the method is $O(|\mathcal{D}_S| \cdot l_D + n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C)$, which can be simplified to $O(n_I \cdot |\mathcal{D}| \cdot |\mathcal{C}| \cdot l_C)$, with $l_C \le |\mathcal{W}|$. The complexity is therefore linear in the number $|\mathcal{D}|$ of documents, the number $|\mathcal{C}|$ of top categories (usually very small), the mean number $l_C$ of terms per category (having $|\mathcal{W}|$ as an upper bound) and the number $n_I$ of iterations in the final phase, which in our experiments is almost always within 20. This complexity is comparable to that of many other cross-domain classification methods.
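To give a feel for the relative size of the two terms, the following worked example plugs in purely hypothetical values (not taken from the experiments): $|\mathcal{D}_S| = |\mathcal{D}_T| = 5000$, $|\mathcal{C}| = 4$, $l_D = 100$, $l_C = 2000$ and $n_I = 10$.

```latex
\begin{align*}
|\mathcal{D}_S| \cdot l_D
  &= 5000 \cdot 100 = 5 \times 10^{5} \\
n_I \cdot |\mathcal{D}_T| \cdot |\mathcal{C}| \cdot l_C
  &= 10 \cdot 5000 \cdot 4 \cdot 2000 = 4 \times 10^{8}
\end{align*}
```

Under these assumed values the iterative phase accounts for roughly 800 times more elementary operations than the construction of the initial profiles, which is why the first term can be dropped in the simplified bound.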
