

In CS-VS, we combine the vector space similarity measure and the citation semantic similarity measure when calculating the similarity between documents. Because of the special nature of citation semantics, there is no suitable way to find the "mean" of the citation semantics of a set of documents. Therefore, instead of K-Means, the most popular clustering algorithm, we use K-Medoids (Definition 4.1.4) for document clustering.
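As a rough illustration of why K-Medoids fits here, the sketch below clusters items using only a precomputed pairwise similarity matrix, so no "mean" is ever needed. It is a generic alternating K-Medoids written for this explanation, not necessarily the variant given in Definition 4.1.4; the names `assign` and `k_medoids` and the parameters `max_iter` and `seed` are assumptions.

```python
import random

def assign(sim, medoids):
    """Assign every item to its most similar medoid (a medoid stays in its own cluster)."""
    clusters = {m: [] for m in medoids}
    for i in range(len(sim)):
        best = i if i in clusters else max(medoids, key=lambda m: sim[i][m])
        clusters[best].append(i)
    return clusters

def k_medoids(sim, k, max_iter=100, seed=0):
    """Alternating K-Medoids on a similarity matrix sim[i][j] (higher means more similar).

    Returns a dict mapping each medoid index to the list of its member indices.
    Assumption: this is a simple heuristic and may stop in a local optimum.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(sim)), k)
    for _ in range(max_iter):
        clusters = assign(sim, medoids)
        # The new medoid of a cluster is the member with the largest
        # total similarity to all members of that cluster.
        new_medoids = [
            max(members, key=lambda c: sum(sim[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return assign(sim, medoids)
```

The only input is the matrix of combined document similarities, which is why the rest of the section concentrates on how that similarity is computed.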

document cluster. Hence, our major issue here is how to calculate the combined similarity between every two documents. The remainder of this section is dedicated to the similarity measure used in the CS-VS approach.

Similarity Measure. In CS-VS, we utilize the citation semantics in document clustering by combining the similarity $S_{sm}(d_1, d_2)$ between semantics and the similarity $S_{vs}(d_1, d_2)$ between vectors in VSM. We also consider the similarities between document titles (if both documents have titles), keywords (if any), and the co-citation information. The similarity between two documents can therefore be computed either as the harmonic mean of $S_{sm}(d_1, d_2)$ and $S_{vs}(d_1, d_2)$ (equation 4.2) or as their simple addition (equation 4.3).

$$S_h(d_1, d_2) = \frac{2\, W_1 S_{vs}(d_1, d_2)\, W_2 S_{sm}(d_1, d_2)}{W_1 S_{vs}(d_1, d_2) + W_2 S_{sm}(d_1, d_2)} \tag{4.2}$$

$$S_s(d_1, d_2) = W_1 S_{vs}(d_1, d_2) + W_2 S_{sm}(d_1, d_2) \tag{4.3}$$

where $S_{vs}(d_1, d_2)$ is the similarity between the corresponding vectors of these two documents in VSM, and $S_{sm}(d_1, d_2)$ is the similarity between the semantics of these two documents, including citation semantics, titles, keywords, and co-citations. They in turn can be obtained through the following formulas.

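$$S_{vs}(d_1, d_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert} \tag{4.4}$$

$$S_{sm}(d_1, d_2) = W_3 S_t(d_1, d_2) + W_4 S_{cise}(d_1, d_2) + W_5 \frac{2 N_{co}}{N_{r1} + N_{r2}} + W_6 S_k(d_1, d_2) \tag{4.5}$$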

where $S_t(d_1, d_2)$ is the similarity between the titles of these two documents, which can be computed using equation 3.2, $S_{cise}(d_1, d_2)$ is the similarity between the citation semantics of these two documents, the third part $\frac{2 N_{co}}{N_{r1} + N_{r2}}$ is used to quantify the co-citations between these two documents, $N_{co}$ is the number of common references the two documents cite, $N_{r1}$ and $N_{r2}$ are the total numbers of references of $d_1$ and $d_2$, respectively, and the last part $S_k(d_1, d_2)$ is the similarity between the keywords provided by these two documents, which can also be calculated with equation 3.2.

$$S_{cise}(d_1, d_2) = \frac{1}{2} \sum_{i=1}^{M} S_{L_i} \left( \frac{1}{N_{c1}} + \frac{1}{N_{c2}} \right) \tag{4.6}$$

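$$S_{L_i} = \max\left( \frac{2 N_{cl_{i1}}}{N_{tl_{i1}}} m_{R_{i1}},\ \frac{2 N_{cl_{i2}}}{N_{tl_{i2}}} m_{R_{i2}},\ \ldots,\ \frac{2 N_{cl_{iN}}}{N_{tl_{iN}}} m_{R_{iN}} \right) \tag{4.7}$$

$$m_{R_{ij}} = \frac{\min(R_{r_i}, R_{r_j})}{\max(R_{r_i}, R_{r_j})} \tag{4.8}$$

$$R_{r_k} = \frac{N_{cr_k}}{N_r} \tag{4.9}$$

$$M = \min(N_{c1}, N_{c2}) \tag{4.10}$$

$$N = \max(N_{c1}, N_{c2}) \tag{4.11}$$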

where $N_{c1}$ and $N_{c2}$ are the numbers of reference clusters of documents $d_1$ and $d_2$, respectively. $R_{r_k}$ in equation 4.9 is the ratio of the number of references in cluster $k$ to the total number of references of a document; $R_{r_i}$ and $R_{r_j}$ are calculated using this equation. $m_{R_{ij}}$ is the meta ratio of $R_{r_i}$ and $R_{r_j}$, which is used to adjust the similarity of two reference clusters; its maximum value is 1. The reason for using the meta ratio instead of the simple ratio is that the sizes of two similar reference clusters might vary greatly, yet their sizes relative to the total number of references of the documents they belong to may not differ much. $N_{cl_{ij}}$, $j = 1, \ldots, N$, is the number of common terms shared by the labels of cluster $i$ (in document $d_1$) and cluster $j$ (in document $d_2$), and $N_{tl_{ij}}$, $j = 1, \ldots, N$, is the total number of terms in the labels of cluster $i$ (in document $d_1$) and cluster $j$ (in document $d_2$).
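The effect of the meta ratio can be seen in a small sketch of equations 4.8 and 4.9. The function names and the two cluster sizes below are hypothetical, chosen only to show a case where the raw sizes differ by a factor of ten while the relative sizes are identical; "simple ratio" is read here as the ratio of the raw cluster sizes, which is an assumption.

```python
def reference_ratio(n_cluster_refs, n_total_refs):
    """Eq. 4.9: share of a document's references that fall in one cluster."""
    return n_cluster_refs / n_total_refs

def meta_ratio(r_i, r_j):
    """Eq. 4.8: ratio of the smaller reference ratio to the larger one (at most 1)."""
    return min(r_i, r_j) / max(r_i, r_j)

# Hypothetical clusters: 5 of 10 references in one document,
# 50 of 100 references in the other.
simple = min(5, 50) / max(5, 50)  # 0.1: raw sizes look very different
meta = meta_ratio(reference_ratio(5, 10), reference_ratio(50, 100))  # 1.0
print(simple, meta)
```

Both clusters cover half of their document's references, so the meta ratio leaves the label similarity unscaled, whereas a ratio of raw sizes would shrink it to a tenth.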

To calculate $S_{cise}(d_1, d_2)$, we first find the document with the smaller number of reference clusters, say $d_1$; that is, $M = N_{c1}$ and $N = N_{c2}$, according to equations 4.10 and 4.11. Then, for each reference cluster in $d_1$, we compare its label (which may contain multiple terms) with the label of each cluster in document $d_2$ to find the most similar cluster. The maximum similarity is calculated using equation 4.7. If only one term were allowed for each label, the label overlap could only be either 0 or 1. However, we use multiple terms (such as five or ten) to label each cluster, which provides richer semantics. After obtaining the maximum similarities for all the reference clusters in document $d_1$, we can compute the similarity between the citation semantics of documents $d_1$ and $d_2$ using equation 4.6.
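The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a cluster is represented as a hypothetical `(label_terms, n_refs)` pair, the total number of references of a document is taken to be the sum of its cluster sizes, the total number of label terms is taken as the sum of the two label lengths, and every document is assumed to have at least one non-empty cluster.

```python
def cluster_similarity(c1, c2, n_r1, n_r2):
    """One argument of the max in Eq. 4.7: label overlap scaled by the meta ratio."""
    (terms1, refs1), (terms2, refs2) = c1, c2
    n_common = len(set(terms1) & set(terms2))  # common label terms
    n_total = len(terms1) + len(terms2)        # total label terms
    r1, r2 = refs1 / n_r1, refs2 / n_r2        # Eq. 4.9
    meta = min(r1, r2) / max(r1, r2)           # Eq. 4.8
    return 2 * n_common / n_total * meta

def citation_semantic_similarity(clusters1, clusters2):
    """Eq. 4.6: clusters1 and clusters2 are lists of (label_terms, n_refs) pairs."""
    n_r1 = sum(refs for _, refs in clusters1)
    n_r2 = sum(refs for _, refs in clusters2)
    # Iterate over the document with fewer clusters (Eq. 4.10 and 4.11).
    if len(clusters1) <= len(clusters2):
        fewer, more, n_rf, n_rm = clusters1, clusters2, n_r1, n_r2
    else:
        fewer, more, n_rf, n_rm = clusters2, clusters1, n_r2, n_r1
    # For each cluster, keep the similarity to its best match (Eq. 4.7).
    best = [
        max(cluster_similarity(c, other, n_rf, n_rm) for other in more)
        for c in fewer
    ]
    return 0.5 * sum(best) * (1 / len(clusters1) + 1 / len(clusters2))

# A cluster labeled t5, t7, t3 with 10 of 24 references compared with a cluster
# labeled t1, t2, t3, t8 with 6 of 22 references gives roughly 0.187.
pair = cluster_similarity((["t5", "t7", "t3"], 10), (["t1", "t2", "t3", "t8"], 6), 24, 22)
```

Two documents with identical clusters obtain a value of 1 under this sketch, since every best match has full label overlap and a meta ratio of 1.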

Let us use the example shown in Figure 4 to further explain how to calculate the semantic similarity. In this example, the total number of references of document $d_1$ is 22, and that of $d_2$ is 24. The number of reference clusters is 4 for $d_1$ and 3 for $d_2$. Thus, we take each cluster label in $d_2$ and find the most similar one in $d_1$. For example, the first cluster label ($CL_{21}$) in $d_2$ contains "t5", "t7", and "t3", and the cluster contains 10 references. The first cluster label ($CL_{11}$) in $d_1$ contains "t1", "t2", "t3", and "t8", and the cluster contains 6 references. So the similarity between these two reference clusters is

$$S(CL_{21}, CL_{11}) = \frac{2}{7} \times \frac{6/22}{10/24} \approx 0.187,$$

which is shown in Figure 4.

Similarly, we can calculate the similarities between $CL_{21}$ and the other three clusters of $d_1$. They are 0.392, 0.181, and 0.0, respectively. In other words, $CL_{21}$ is most similar to the second reference cluster of document $d_1$, with a similarity of 0.392. Likewise, we find that the second reference cluster of $d_2$ is most similar to the first reference cluster of $d_1$, with a similarity of 0.515, and the third reference cluster of $d_2$ is most similar to the third one of $d_1$, with a similarity of 0.227. Therefore, the similarity between the citation semantics of $d_1$ and $d_2$ is

$$\frac{1}{2}(0.392 + 0.515 + 0.227)\left(\frac{1}{3} + \frac{1}{4}\right) \approx 0.331.$$

The similarity between the two titles is easily found to be 0.375. The similarity considering co-citation is $\frac{2}{22 + 24} \approx 0.043$. Using equation 4.5, and supposing $W_3 = W_4 = W_5 = 1$ and $W_6 = 0$ (no keywords), we get the semantic similarity between documents $d_1$ and $d_2$ as $0.375 + 0.331 + 0.043 = 0.749$.
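The arithmetic of this worked example, and the final combination step of equations 4.2 to 4.4, can be checked with a short sketch. The function names are assumptions made for illustration, the two term vectors are hypothetical (the example does not give any), and returning 0 when a denominator is zero is a choice of this sketch rather than something the text specifies.

```python
import math

def cosine_similarity(v1, v2):
    """Eq. 4.4: cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def semantic_similarity(s_t, s_cise, n_co, n_r1, n_r2, s_k=0.0,
                        w3=1.0, w4=1.0, w5=1.0, w6=1.0):
    """Eq. 4.5: weighted sum of title, citation-semantic, co-citation, keyword parts."""
    return w3 * s_t + w4 * s_cise + w5 * 2 * n_co / (n_r1 + n_r2) + w6 * s_k

def harmonic_similarity(s_vs, s_sm, w1=1.0, w2=1.0):
    """Eq. 4.2: harmonic mean of the weighted similarities."""
    a, b = w1 * s_vs, w2 * s_sm
    return 2 * a * b / (a + b) if a + b else 0.0

def additive_similarity(s_vs, s_sm, w1=1.0, w2=1.0):
    """Eq. 4.3: simple weighted addition."""
    return w1 * s_vs + w2 * s_sm

# Values stated in the worked example.
s_cise = 0.5 * (0.392 + 0.515 + 0.227) * (1 / 3 + 1 / 4)   # about 0.331
s_sm = semantic_similarity(s_t=0.375, s_cise=s_cise,
                           n_co=1, n_r1=22, n_r2=24, w6=0.0)  # about 0.749

# Hypothetical VSM vectors, used only to show the combination step.
s_vs = cosine_similarity([1.0, 2.0, 0.0], [2.0, 1.0, 1.0])
print(round(s_cise, 3), round(s_sm, 3))
print(harmonic_similarity(s_vs, s_sm), additive_similarity(s_vs, s_sm))
```

With unit weights the additive form can exceed 1, while the harmonic form stays between the two inputs and is pulled toward the smaller one, so the choice between equations 4.2 and 4.3 affects how strongly a low score on one measure penalizes a pair.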

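Document d1
Title: t1 t2 t3 t4 t5
Reference cluster labels (number of references):
  t1; t2; t3; t8 (6)
  t2; t5 (9)
  t4; t6; t7 (5)
  t9; t1; t10 (2)

Document d2
Title: t2 t4 t6 t7 t3 t8
Reference cluster labels (number of references):
  t5; t7; t3 (10)
  t2; t1; t4; t8 (6)
  t9; t11; t7 (8)

Similarities shown in the figure:
  First cluster of d2 against the four clusters of d1: 0.187, 0.392, 0.181, 0.0
  Second cluster of d2 against the first cluster of d1: 0.515
  Third cluster of d2 against the third cluster of d1: 0.227
  Titles: 0.375
  Co-citation (Co-citation = 1): 0.043

Figure 4: An Example of the Semantic Similarity of Two Documents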
