
At level 1 feature selection, we choose the top 10 terms to form the feature vector of each reference cluster. Subjective evaluation (manually checking the significance of the labels) and objective evaluation (comparing the accuracy of the resulting clusters for different numbers of top terms) in the experiments for our paper [99] indicated that this number is a good cutoff with respect to the citation semantics.

At level 3 feature selection, the length of the feature vector of each cluster is determined by the length of the feature vector of each document belonging to it and the total number of documents in that cluster. So, the only issue left is how to determine the length of the feature vector of each document, which is done at level 2 feature selection. A single document could be one in an existing cluster, or the new document to be added to a cluster. In dealing with the length of the feature vector of a single document, we must be aware of these two situations. This is because we use the feature vectors of single documents to form the feature vector of the cluster they belong to, whereas we use the feature vector of the new document to compare with the feature vectors of existing clusters to decide where to put it. The objective criterion in both situations is which feature-vector length for a document leads to the best quality of document clustering.

When forming feature vectors of different clusters, we want each feature vector to be different from all others. We want the distance between every two cluster feature vectors to be as large as possible. Suppose a matrix M is formed with these feature vectors in its columns; we want at least the following criterion to be satisfied.

Rank(M) = k (5.14)

where k is the number of existing clusters. That is, no cluster feature vector would be a linear combination of others. However, our situation here is different from latent semantic analysis [48], where SVD (singular value decomposition [50]) is used to reduce the rank of the term-document matrix, in order to reveal the hidden similarity among documents and hence to improve the recall in information retrieval. We do not want to reduce the rank of the matrix M. Instead, we want to keep its rank. We have the following theorem about the rank of this matrix.

Theorem 5.6.1. If the number of unique terms in each cluster is bigger than the number of clusters, the lengths of the feature vectors that can satisfy equation 5.14 are not unique.

Proof. The proof is by contradiction. Let k be the number of clusters. First, since each cluster has more than k unique terms, we can pick for each cluster a term that is different from the terms picked for the other clusters, and use it alone as that cluster's feature vector. These feature vectors certainly satisfy equation 5.14. If the theorem were false, we could not find another length that satisfies equation 5.14. However, if we add one term that is different from all the existing terms to one of these feature vectors, the resulting feature vectors still satisfy equation 5.14. This contradiction completes the proof.
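The construction in the proof can be checked numerically. The sketch below is only an illustration under stated assumptions: the cluster terms are hypothetical, the helper names `matrix_rank` and `build_matrix` are not from the source, and the feature vectors are stored as rows rather than columns, which does not change the rank.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def build_matrix(cluster_vectors):
    """One row per cluster, one column per term of the combined vocabulary."""
    vocab = sorted({t for vec in cluster_vectors for t in vec})
    return [[vec.get(t, 0) for t in vocab] for vec in cluster_vectors]

# Hypothetical clusters, each represented by a single distinct term (length 1).
short = [{"social": 1}, {"database": 1}, {"network": 1}]
# The same clusters after adding one previously unused term to the first vector.
longer = [{"social": 1, "graph": 1}, {"database": 1}, {"network": 1}]

k = len(short)
print(matrix_rank(build_matrix(short)) == k)
print(matrix_rank(build_matrix(longer)) == k)
```

Both sets of vectors reach rank k with different lengths, which is the non-uniqueness the theorem states.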

Since the number of unique terms in the cluster feature vectors exceeds the number of clusters in most cases, many different lengths can satisfy equation 5.14. The lengths of cluster feature vectors are usually not a problem regarding this equation. [...] noise and to shorten runtime. In the meantime, we do not want to lose useful information. For example, if we have three clusters and the three feature vectors are “social:1”, “database:1”, and “network:1”, then the length of each feature vector is one. Even though the rank of M will be 3, we may lose useful information, which in turn may result in low clustering accuracy. Suppose the feature vector of a new document has two words, “social:0.5, network:0.5”. With fuzzy clustering, the new document will go to clusters 1 and 3. Otherwise, it may only go to cluster 3. But if two words are used for the feature vectors of these three clusters, they may be “social:1, network:1”, “database:1, web:0.6”, and “network:1, wireless:0.8”; then the new document should certainly belong to cluster 1. (Notice that the weights of the words in this example will be normalized before comparison.) Therefore, we need to find the cutoff point for the length of the feature vectors of the existing documents. The guiding principles for these cutoffs are that equation 5.14 must be satisfied (which is easy to achieve) and that, at the same time, the accuracy of the resulting clustering should be maximized.
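The three-cluster example can be worked through in a few lines. This is a sketch under an assumption: the text does not name the similarity measure, so cosine similarity over normalized weights is used here purely for illustration, and the function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a term->weight map to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Cosine similarity of two term->weight maps (assumed measure, not from the source)."""
    a, b = normalize(a), normalize(b)
    return sum(w * b.get(t, 0.0) for t, w in a.items())

new_doc = {"social": 0.5, "network": 0.5}

# Cluster feature vectors of length one and of length two, as in the example.
one_term = [{"social": 1}, {"database": 1}, {"network": 1}]
two_terms = [
    {"social": 1, "network": 1},
    {"database": 1, "web": 0.6},
    {"network": 1, "wireless": 0.8},
]

for label, clusters in (("one term", one_term), ("two terms", two_terms)):
    sims = [round(cosine(new_doc, c), 3) for c in clusters]
    print(label, sims)
```

With one term per cluster, clusters 1 and 3 tie, so a hard assignment is arbitrary. With two terms per cluster, cluster 1 is the clear best match under this measure.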

While the length of the feature vector of an existing document has to be set heuristically with the requirement of equation 5.14 met, the length of the feature vector of a new document could be found automatically by searching for the following ratio R within a range of lengths [Ll, Lr].

R = \max \{ R_j : j = L_l, \ldots, L_r \}    (5.15)

R_j = \max \left\{ \frac{S_i}{\sum_{m=1}^{k} S_m} : i = 1, \ldots, k \right\}    (5.16)

This means we would use the length of the feature vector of a new document that makes it most similar to one of the cluster feature vectors. In the case of fuzzy clustering, the numerator of 5.16 would be the top x of the similarities for a given length j. Even though the program needs to search a range of lengths, the time used is negligible given that the number k of cluster feature vectors is usually small. Furthermore, we designed two algorithms to speed up this search process: Exponential Increment Search and Linear Increment Search. Instead of checking each length in the range [Ll, Lr], we only sample some of them to find the right length in less time. Our experimental results show that the differences between these two sampling search algorithms and the brute-force search (checking each length within the range [Ll, Lr]) are negligible (as shown in the chapter on experimental results), and that the Exponential Increment Search requires the least amount of time. It is shown below.

Rmax = 0;
Increment = 1;
For (j = Ll; j ≤ Lr; j = j + Increment) {
    Compute Rj using 5.16 with the current length;
    If (Rj > Rmax) {
        Rmax = Rj;
        Increment = 1;
        Record the cluster that makes this Rj;
    }
    Increment = Increment*2;
}

For the Linear Increment Search algorithm, we only need to replace “Increment = Increment*2” with “Increment = Increment+1”.
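A runnable sketch of both sampling searches follows. It rests on several assumptions that are not in the source: S_i is taken to be the cosine similarity between the new document, truncated to its j highest-weighted terms, and cluster i; the names `top_terms`, `ratio`, and `increment_search` and the sample data are hypothetical; and the increment update is applied after the if block on every iteration, as the listing reads, although an else branch is also plausible.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term->weight maps (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_terms(vec, length):
    """Keep the `length` highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(vec.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:length])

def ratio(doc_vec, cluster_vecs, length):
    """R_j of equation 5.16 and the index of the cluster that attains it."""
    truncated = top_terms(doc_vec, length)
    sims = [cosine(truncated, c) for c in cluster_vecs]
    total = sum(sims)
    if total == 0:
        return 0.0, None
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best] / total, best

def increment_search(doc_vec, cluster_vecs, lo, hi, step_update):
    """Sample lengths in [lo, hi]; return (R, best cluster index, best length)."""
    r_max, best_cluster, best_length = 0.0, None, None
    increment = 1
    j = lo
    while j <= hi:
        r_j, cluster = ratio(doc_vec, cluster_vecs, j)
        if r_j > r_max:
            r_max, best_cluster, best_length = r_j, cluster, j
            increment = 1
        increment = step_update(increment)  # applied every iteration, as listed
        j += increment
    return r_max, best_cluster, best_length

def exponential(inc):
    return inc * 2

def linear(inc):
    return inc + 1

def brute_force(inc):
    return 1  # always step by one, so every length is checked

# Hypothetical data for illustration only.
clusters = [
    {"social": 1, "network": 1},
    {"database": 1, "web": 0.6},
    {"network": 1, "wireless": 0.8},
]
doc = {"social": 0.9, "network": 0.8, "graph": 0.4, "web": 0.2, "wireless": 0.1}

for name, update in (("exponential", exponential), ("linear", linear), ("brute force", brute_force)):
    print(name, increment_search(doc, clusters, 2, 5, update))
```

Passing the increment rule as a function keeps the three variants on one code path, so any difference in results comes only from which lengths are sampled.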

Note that for each new document, we actually form two feature vectors. First, we form a feature vector to compare with the feature vectors of the existing clusters to decide where to put the new document. Second, we form another one to update the feature vector(s) of the cluster(s) to which this new document is added. They could be the same or different, depending on the length set for the existing documents and the length obtained for the new document. However, we could also use the same feature vector of the new document to update the cluster feature vector(s). Our experimental results showed that the difference between the clusterings obtained with the fixed length of the existing documents and with the length found for the new document was negligible (Table 35 in Subsection 7.3.7).