
The purpose of the semantic matching algorithm (SMA), denoted by $\Psi : (W \to \mathbb{R}^+_0) \times (W \to \mathbb{R}^+_0) \to \mathbb{R}^n$, is to extract a vector $p_{d,c} \in \mathbb{R}^n$ from the analysis of the semantic relationships between the words characterizing a document $d$ and a category $c$, which are taken from their respective structured representations $\omega_d$ and $\omega_c$.

The working of the SMA is governed by two positive integer parameters $n_{WpD}$ and $n_{WpC}$, which indicate the maximum number of words to be considered for the representation of each document and of each category, respectively. These limits are set mainly to reduce the number of queries to the $\Theta$ function for semantic relationships. For documents and categories represented by fewer than $n_{WpD}$ or $n_{WpC}$ words, all of them are considered. The numbers of words considered by the SMA for $d$ and $c$ are denoted in the following by $l_d$ and $l_c$, respectively.

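Chapter 5. Domain-Independent Semantic Relatedness Model

$$l_d = \min\left(n_{WpD},\ \left|\{w \in W : \omega_d(w) > 0\}\right|\right), \qquad l_c = \min\left(n_{WpC},\ \left|\{w \in W : \omega_c(w) > 0\}\right|\right)$$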

As only some of the words representing the document and the category are considered (when there are more than $n_{WpD}$ or $n_{WpC}$ of them), it is desirable to pick the most important ones. To this end, the words of each bag are sorted by descending weight¹, and throughout the SMA only the first $l_d$ words for the document and the first $l_c$ for the category are considered, denoted respectively by $t^d_1, t^d_2, \ldots, t^d_{l_d}$ and $t^c_1, t^c_2, \ldots, t^c_{l_c}$. For $x \in \{d, c\}$:

$$\omega_x(t^x_1) \geq \omega_x(t^x_2) \geq \ldots \geq \omega_x(t^x_{l_x}) \geq \omega_x(\tau) \quad \forall \tau \in W \setminus \{t^x_1, t^x_2, \ldots, t^x_{l_x}\}$$
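
The selection of the top words can be sketched as follows. This is a minimal illustration, assuming a bag of words is stored as a plain dictionary from word to weight; the function name `top_words` and the alphabetical tie-break are assumptions made here, since the text leaves tie-breaking unspecified.

```python
def top_words(weights, limit):
    """Return up to `limit` words with positive weight, heaviest first.

    `weights` plays the role of omega_d or omega_c; `limit` plays the
    role of n_WpD or n_WpC. The returned list has length l_d (or l_c).
    """
    positive = [(word, x) for word, x in weights.items() if x > 0]
    # Assumption: ties are broken alphabetically to make the order deterministic.
    positive.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in positive[:limit]]


omega_d = {"engine": 0.9, "car": 0.7, "road": 0.4, "the": 0.0}
print(top_words(omega_d, 2))   # the two heaviest words
print(top_words(omega_d, 10))  # fewer than 10 positive words: all are kept
```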

At this point, every possible pair $(t^d_i, t^c_j)$ of one of the considered words for $d$ and one for $c$ could be compared. However, to further reduce the number of comparisons with a limited loss of information, a subset $T_{d,c}$ of the most relevant pairs can be considered. Following some experiments, the criterion below was found to reduce the number of comparisons with a negligible loss of final classification accuracy.

$$T_{d,c} = \left\{ (t^d_i, t^c_j) : \left(\frac{i-1}{l_d}\right)^2 + \left(\frac{j-1}{l_c}\right)^2 < 1 \right\}$$

In practice, picturing the possible pairs arranged in a grid, this corresponds to discarding those that lie outside an ellipse centered at $(0, 0)$ and passing through $(0, l_d)$ and $(l_c, 0)$; the pairs retained are about $\pi/4$ of the total. In this way, the least relevant terms of each document (within the limit set by $n_{WpD}$) are still paired with the most relevant ones of each category and vice versa, but pairs in which both words have little importance are discarded. This is shown graphically in Figure 5.3.
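
The effect of the criterion can be checked numerically. The sketch below enumerates the retained index pairs for given $l_d$ and $l_c$; the function name `select_pairs` is hypothetical, and the grid sizes are arbitrary values chosen for illustration.

```python
import math


def select_pairs(l_d, l_c):
    """1-based index pairs (i, j) with ((i-1)/l_d)^2 + ((j-1)/l_c)^2 < 1."""
    return [
        (i, j)
        for i in range(1, l_d + 1)
        for j in range(1, l_c + 1)
        if ((i - 1) / l_d) ** 2 + ((j - 1) / l_c) ** 2 < 1
    ]


# Small grid: only the pair of the two least relevant words is dropped.
small = select_pairs(4, 4)
print(len(small), "of", 4 * 4, "pairs kept")

# Larger grid: the retained fraction approaches pi/4.
fraction = len(select_pairs(100, 100)) / (100 * 100)
print(round(fraction, 3), "vs pi/4 =", round(math.pi / 4, 3))
```

On small grids the retained fraction is somewhat above $\pi/4$, because cells on the boundary of the ellipse are counted whole.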

For each selected pair $(w_d, w_c) \in T_{d,c}$, the vector $\Theta(w_d, w_c)$ weighting the semantic relationships between the two words is computed. The vectors for all the considered pairs are then weighted by the product of the weights of the two words involved before being summed up. The sum is then normalized by dividing it by the lengths of the vectors formed by the weights of the considered words of the document and of the category.

¹ No specific indication is given on how to break ties between words with equal weights: with the most commonly used weighting schemes (especially those based on a combination of factors, such as tf.idf and the like), ties very rarely have any influence on the results.

5.3. General working scheme for text categorization

[Figure 5.3 grid: rows are the top terms of document $d$, $t^d_1, \ldots, t^d_{l_d}$; columns are the top weighted terms of category $c$, $t^c_1, t^c_2, \ldots, t^c_{l_c-1}, t^c_{l_c}$.]

Figure 5.3 – Schema of the selection of relevant pairs of words between the representations of a document $d$ and a category $c$: pairs corresponding to white cells are considered in the computations, while those with a gray cell are discarded.

$$r_{d,c} = \frac{1}{\sqrt{\sum_{i=1}^{l_d} \omega_d(t^d_i)^2} \cdot \sqrt{\sum_{j=1}^{l_c} \omega_c(t^c_j)^2}} \cdot \sum_{(t^d_i, t^c_j) \in T_{d,c}} \omega_d(t^d_i) \cdot \omega_c(t^c_j) \cdot \Theta(t^d_i, t^c_j)$$
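
The whole computation of $r_{d,c}$ can be sketched as below. This is an illustration under stated assumptions, not the reference implementation: the normalization is taken to be the Euclidean length of each weight vector, `relatedness_vector` and `toy_theta` are hypothetical names, and `toy_theta` is a two-relationship stand-in for the real $\Theta$ function (identity and a hand-made synonym table).

```python
import math


def relatedness_vector(doc_terms, cat_terms, theta, n_rel):
    """Weighted, normalized sum of theta vectors over the selected pairs.

    doc_terms / cat_terms: lists of (word, weight) sorted by descending
    weight, already truncated to l_d and l_c entries.
    theta(w_d, w_c): returns a sequence of n_rel relationship weights.
    """
    l_d, l_c = len(doc_terms), len(cat_terms)
    if l_d == 0 or l_c == 0:
        return [0.0] * n_rel  # assumption: empty representations give a zero vector
    total = [0.0] * n_rel
    for i, (w_d, x_d) in enumerate(doc_terms):       # i is 0-based, i.e. i-1 in the text
        for j, (w_c, x_c) in enumerate(cat_terms):
            if (i / l_d) ** 2 + (j / l_c) ** 2 >= 1:
                continue                              # pair lies outside the ellipse
            for k, value in enumerate(theta(w_d, w_c)):
                total[k] += x_d * x_c * value
    norm_d = math.sqrt(sum(x * x for _, x in doc_terms))
    norm_c = math.sqrt(sum(x * x for _, x in cat_terms))
    return [v / (norm_d * norm_c) for v in total]


# Hypothetical stand-in for Theta: [same word, listed as synonyms].
SYNONYMS = {frozenset({"car", "automobile"})}


def toy_theta(a, b):
    return [1.0 if a == b else 0.0,
            1.0 if frozenset({a, b}) in SYNONYMS else 0.0]


doc = [("car", 0.8), ("engine", 0.6)]
cat = [("engine", 0.8), ("automobile", 0.6)]
r = relatedness_vector(doc, cat, toy_theta, 2)
print(r)  # one score per relationship type
```

In this toy case both weight vectors have unit length, so each score is just the product of the weights of the matching pair: "engine" with "engine" for the first relationship, "car" with "automobile" for the second.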

The formula is loosely based on cosine similarity (§2.2.1). Notably, if the number of considered words were not limited by $n_{WpD}$ and $n_{WpC}$ and the search for semantic relationships simply resulted in $\Theta(w_a, w_b) = (1)$ if $w_a = w_b$ and $(0)$ otherwise, the formula would return a vector with a single value, which would correspond to the cosine similarity of the bags of words.
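
This reduction can be verified in a few steps. The derivation below assumes, in addition to the conditions just stated, that every pair of words is retained in $T_{d,c}$ and that the normalization terms are the Euclidean lengths of the two weight vectors; the bracket $[\cdot]$ is an indicator (1 if the condition holds, 0 otherwise), a notation introduced here for brevity.

```latex
\begin{aligned}
r_{d,c}
  &= \frac{\sum_{w_a \in W} \sum_{w_b \in W} \omega_d(w_a)\,\omega_c(w_b)\,[w_a = w_b]}
          {\sqrt{\sum_{w \in W} \omega_d(w)^2}\;\sqrt{\sum_{w \in W} \omega_c(w)^2}} \\
  &= \frac{\sum_{w \in W} \omega_d(w)\,\omega_c(w)}
          {\lVert \omega_d \rVert\,\lVert \omega_c \rVert}
   = \cos(\omega_d, \omega_c)
\end{aligned}
```

Words with zero weight contribute nothing to either sum, so extending the sums to the whole of $W$ does not change the result.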

This vector $r_{d,c}$ summarizes which semantic relationships occur between the relevant words of $d$ and $c$ and how strongly they are present, also taking into account how relevant each word is. The final output of the SMA consists of this vector along with the sum of its values as an additional feature, which can be helpful for certain learning algorithms. The total number of features is then $|R| + 1$.
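
Assembling the final feature vector is a one-line step, sketched here with a hypothetical helper name `sma_features` and made-up scores for three relationship types.

```python
def sma_features(r):
    """Append the sum of the relationship scores as one extra feature."""
    return list(r) + [sum(r)]


# Made-up scores for |R| = 3 relationship types.
features = sma_features([0.5, 0.25, 0.0])
print(len(features))  # |R| + 1 features
```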

This is the vector returned by the SMA in response to two representations $\omega_d$ and $\omega_c$: in the training phase described above, it is labeled with the actual relationship between $d$ and $c$ and put in the training set for the semantic model, while in the classification phase, as discussed below, it is passed to the existing model to predict such relationship.