Applying the metrics processor results in every publication obtaining a rank score relating to each algorithm. Although it is possible to analyse all 200,000+ publications, this does not help track sets of publications through their life cycle. The aim is not only to find a metric which still performs well when compared to Citation Count, but also to evaluate whether any metric can additionally be used to provide an early indication of subsequent impact.

In Figure 6.1, it can be observed that the majority of the 200,000 publications in Citebase will have very few citations, meaning earlier and subsequent impact are likely to be very similar. In order to properly evaluate the family of applied metrics, publications are required which gain subsequent impact. To alleviate this problem it was decided that a set of 100 publications, all of the same age, should be tracked over a three-year life cycle. To ensure that this 100 does not consist of the 60,000 which only ever receive a single citation, the top 100 by Citation Count at the end of a three-year life cycle were selected. Even after careful selection, there is still a chance of selecting 100 publications which never get ranked highly by CoRank; however, when considering Citation Count as the target metric (due to its adoption), the other metrics should be capable of reflecting this standing. If these 100 are ranked highly earlier in the publication life cycle, then CoRank can be said to be an early indication metric.

To perform this comparison, the selected set of 100 publications, ordered by Citation Count, becomes the target set of publications which all algorithms are looking to match at some point in their life cycle. With the target set of publications selected, these must then be located in all of the snapshots for each algorithm, their rank positions recorded and finally correlated against the target list.
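The selection and lookup steps described above can be sketched as follows. This is a minimal illustration rather than the Co-Ordinator's actual implementation: the function names, the use of plain dictionaries and lists, the alphabetical tie-break and the decision to skip publications missing from a snapshot are all assumptions made for the example.

```python
from typing import Dict, List


def select_target_set(citation_counts: Dict[str, int], n: int = 100) -> List[str]:
    """Return the n publication ids with the highest Citation Count, best first."""
    # Ties are broken by id so that the result is deterministic (an assumption).
    ordered = sorted(citation_counts, key=lambda pid: (-citation_counts[pid], pid))
    return ordered[:n]


def locate_in_snapshot(target: List[str], snapshot_ranking: List[str]) -> Dict[str, int]:
    """Map each target publication to its 1-based rank in one snapshot's ranked list."""
    position = {pid: rank for rank, pid in enumerate(snapshot_ranking, start=1)}
    # Publications absent from the snapshot are skipped (an assumption).
    return {pid: position[pid] for pid in target if pid in position}
```

Calling `locate_in_snapshot` once per snapshot and per algorithm would yield the recorded rank positions that the later tests consume.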

Again it was decided to make the system used for locating and processing results as generic as possible and to add this as a series of Web services on top of the Co-Ordinator. In all, three primary services were added on top of the Co-Ordinator, listed as follows:

• Correlation Calculator - month by month - Provides early indication metric data pertaining to the order of the sample set of publications.

• Rank Comparator - month by month - Tracks the overall positions of the sample set of publications in the whole dataset.

• Publication Ages (Summariser) - Provides a breakdown of the age of publications contained in many snapshots.

Chapter 6 Applying CoRank 113

Figure 6.6 shows an overview of the system and the data required for each of the three tests. This diagram shows the Citebase snapshots along the top, with each algorithm represented by a star. Thus each document within the table represents the set of results generated by the metrics processor. Figure 6.6 additionally shows the two basic sets of data required for each of the tests and which test requires which data set.

[Figure placeholder: grid of metric-ranked publication lists, with the monthly snapshots (Feb 2004 to Nov 2004) along the top and the metrics down the side; the labels A and B mark the two data sets.]

Figure 6.6: Overview of the Co-Ordinator’s Result Processor

The following section introduces each of the three tests in more detail. Each test requires a number of different combinations of base and processed data, in order to not only evaluate each metric against each other, but also to do this in a temporal manner.

The first test is the Correlation Calculator, perhaps the most complex. This test requires both datasets and compares the rank of the target set of publications (shown as A in Figure 6.6) to each snapshot (one of the set B in Figure 6.6) generated by each algorithm. To pick these datasets a number of variables can be defined, including which metric dictates where the target (A) is, and which metric to take the snapshot results from (B). The full list of parameters which can be provided to the Correlation Calculator is as follows:

• Target Metric - The metric used to obtain the target publications (A).

• Target Snapshot - The date of the snapshot which is regarded as the target (A).

• First Snapshot - The date of the earliest snapshot in the set (B), usually three years prior to the target snapshot.

• Trial Metric - The metric being trialled and about which all the results should be selected (B).

• No. of Publications - The number of target publications (if different from 100) (A and B).

Each set of publications selected as part of B consists of only the 100 selected during the sampling of A. Each set of 100 is returned in rank order where the rank position in the total dataset has been discarded; thus they are now ranked between 1 and 100. Both sets (A and B) are then processed by the correlation calculator resulting in a correlation coefficient being returned, representing the similarity between one of the sample sets in B and the target set A. Once a correlation coefficient has been calculated for all 36 snapshots in the set B, this can then be graphed over time to show how the correlation changes.
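The re-ranking and correlation steps can be sketched as below. The text does not name the correlation coefficient used, so Spearman's rank correlation is assumed here purely for illustration, and the function names are hypothetical. The closed-form expression used is valid only when there are no tied ranks, so the sketch assumes that overall rank positions are distinct.

```python
from typing import Dict, List


def rerank(overall_ranks: Dict[str, int]) -> Dict[str, int]:
    """Discard overall positions and re-rank the sample as 1..n in the same order."""
    ordered = sorted(overall_ranks, key=lambda pid: overall_ranks[pid])
    return {pid: i for i, pid in enumerate(ordered, start=1)}


def spearman(target: Dict[str, int], trial: Dict[str, int]) -> float:
    """Spearman's rho for two tie-free 1..n rankings of the same publications."""
    if set(target) != set(trial):
        raise ValueError("both rankings must cover the same publications")
    n = len(target)
    if n < 2:
        raise ValueError("at least two publications are required")
    d_squared = sum((target[p] - trial[p]) ** 2 for p in target)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def correlation_series(target_overall: Dict[str, int],
                       snapshots: List[Dict[str, int]]) -> List[float]:
    """One coefficient per snapshot in set B, in snapshot order, against target A."""
    a = rerank(target_overall)
    return [spearman(a, rerank(b)) for b in snapshots]
```

The list returned by `correlation_series` corresponds to the values that would be graphed over time, one per snapshot.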

The second test involves the Rank Comparator and the same sets of information as the Correlation Calculator, except that this time both the A and B datasets contain the actual rank position of the set of n publications amongst all of the other publications. This is then used to assess the relative standing of the publications within the entire dataset.

It is important to check that, as well as being ranked in a similar order to that given by the target algorithm, the publications also sit at a similar point in the overall standing when all publications are considered. The output from this test will be an average rank value for the position of the set of publications and a deviation value to indicate how widely they are spread around this average. This average rank can then be compared to the target rank average (from A).
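A minimal sketch of the Rank Comparator output is given below. The text specifies only "an average rank" and "a value for the deviation", so the use of the arithmetic mean and the population standard deviation is an assumption, as are the function names.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def rank_summary(overall_ranks: Dict[str, int]) -> Tuple[float, float]:
    """Return (mean rank, standard deviation) of the sample's overall positions."""
    ranks = list(overall_ranks.values())
    # Population standard deviation is assumed as the measure of spread.
    return mean(ranks), pstdev(ranks)


def mean_rank_gap(target: Dict[str, int], trial: Dict[str, int]) -> float:
    """Difference between the trial (B) and target (A) mean ranks."""
    return rank_summary(trial)[0] - rank_summary(target)[0]
```

A positive gap would indicate that the trial metric places the sample lower in the overall standing, on average, than the target metric does.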

Finally, the third test examines the age of the top ranked papers in each snapshot. This will provide a good indication of the behaviour of each algorithm on a real dataset, as well as help explain the expected variations from our target dataset. This final test only requires the data pertaining to the algorithm being trialled. There is no target set of publications; rather, the top n%, as defined by the input to the test, is required in order to calculate the average age of this n%. Thus this test can take the following inputs:

• Trial Metric - The metric being trialled and about which all the results should be selected.

• No. of Publications - The number of publications to be considered (as an alternative to percentage).

• % of Publications - The percentage of overall publications to be selected.

In the case of the results presented in this work, this test will be examining only the top 5% of all publications in rank order (as per the trial metric), not the previously selected 100. The age of the top 5% of publications will be recorded and each publication will be grouped into one of four age brackets: less than a year old, one to two years old, two to three years old and older than three years. The results will be presented in the form of a percentage breakdown of publications which are present in each category.
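The age breakdown can be sketched as follows. The names, the 365.25-day year, the rounding up of the top n% cut-off and the treatment of a publication aged exactly three years as belonging to the oldest bracket are assumptions made for this example, not details taken from the system itself.

```python
from collections import Counter
from datetime import date
from math import ceil
from typing import Dict, List

BRACKETS = ["<1 year", "1-2 years", "2-3 years", ">3 years"]


def bracket(age_years: float) -> str:
    """Assign an age in years to one of the four brackets."""
    if age_years < 1:
        return BRACKETS[0]
    if age_years < 2:
        return BRACKETS[1]
    if age_years < 3:
        return BRACKETS[2]
    return BRACKETS[3]


def age_breakdown(ranking: List[str], published: Dict[str, date],
                  snapshot: date, top_percent: float = 5.0) -> Dict[str, float]:
    """Percentage of the top n% of a ranked list that falls in each age bracket."""
    if not ranking:
        raise ValueError("ranking must not be empty")
    # Round the cut-off up so that at least one publication is always considered.
    k = max(1, ceil(len(ranking) * top_percent / 100))
    counts = Counter(
        bracket((snapshot - published[pid]).days / 365.25) for pid in ranking[:k]
    )
    return {b: 100 * counts[b] / k for b in BRACKETS}
```

Running `age_breakdown` once per snapshot for a given trial metric would give the series of percentage breakdowns described above.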

Previous research has found that a publication’s impact can only be judged accurately after around three years (Moed 2005). Logically this would imply that the majority of high impact publications by Citation Count would be three years of age or older. Although it is too early to tell, it would be expected that the distribution in age of publications would look similar to that shown in Figure 6.7. In the top 5% it is anticipated that there should be fewer publications younger than one year than publications aged between one and two years, with the same applying to the subsequent age brackets. Even a metric revealing a good number of more recent publications should still maintain the high rank of older, highly prestigious material, fitting the same age pattern as shown in Figure 6.7.

[Figure placeholder: No. of Publications plotted against Age (Years), with age ticks at 1, 2 and 3.]

Figure 6.7: Expected distribution of publications by age (top 5%)

Table 6.2 gives a summary of the various tests outlined in this section which are designed to reflect those which were performed on the theoretical network.

Test Name          Description

Rank Reveal        Correlation between rank order of publications in current snapshot to target snapshot.

Mean Rank          Examine the mean rank and distribution of the publications in the snapshot.

Publication Age    What is the average age of a publication in the top 5% of each snapshot?

Table 6.2: Summary of Test Strategy
