FINANCIAL AND IMPLEMENTATION FRAMEWORK
4. The Commission shall consult the Frontex Agency on the draft national programmes, in particular on the activities financed under the support
Citation Count is the simplest of all the metrics, calculated by adding together all the citations towards each publication and recording this as the total. The publication with the highest citation count is thus the most highly ranked. Figure 5.6 shows the citation count of each publication and each node has been scaled to reflect its citation count. Therefore the larger the node, the higher its rank. Node 17 is the most cited, followed by node 16 and then nodes 9, 10, 12 and 14 sharing three citations each.
This section refers back repeatedly to the result obtained by applying Citation Count to this network, so that result, shown in Figure 5.6, acts as a control. In this figure the nodes have been scaled to show which are the most (biggest) and least (smallest) cited. By leaving the nodes scaled by Citation Count in all subsequent results, the top five or so nodes under each algorithm can quickly be compared with the Citation Count (accepted metric) result, enabling a visual comparison between the metrics. While the size of each node remains constant, the number contained inside each node represents the position of that node according to the algorithm being trialled.
Citation Count is a basic measure of popularity, with the result that many of the nodes in the artificial network are ranked the same, i.e. they have the same number of citations. This is not so significant in a publication network, because impact is a boundary measurement (e.g. a publication is typically regarded as having very few, a good number or a large number of citations) rather than a comparative one, but it will have a greater effect when comparing ranking algorithms. The starting point for looking at the theory of how different ranking algorithms affect the network is to examine a number of existing algorithms, including PageRank, and analyse how these compare to Citation Count. This in turn allows a direct comparison of Web based and traditional publication metrics.
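The calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the edge list is a hypothetical toy network, not the artificial network of Figure 5.6, and the helper names `citation_count` and `rank_with_ties` are introduced here purely for the example.

```python
from collections import Counter

# Hypothetical toy network (NOT the artificial network of this chapter).
# Each pair is (citing publication, cited publication).
citations = [
    ("A", "C"), ("B", "C"), ("D", "C"),
    ("A", "D"), ("B", "D"),
    ("C", "E"), ("D", "E"),
    ("E", "F"),
]

def citation_count(edges):
    """Return {publication: number of citations received}."""
    nodes = {n for edge in edges for n in edge}
    received = Counter(cited for _, cited in edges)
    return {n: received.get(n, 0) for n in nodes}

def rank_with_ties(scores):
    """Dense ranking: publications with equal scores share a position."""
    distinct = sorted(set(scores.values()), reverse=True)
    position = {score: i + 1 for i, score in enumerate(distinct)}
    return {n: position[s] for n, s in scores.items()}

counts = citation_count(citations)
ranks = rank_with_ties(counts)

for node in sorted(counts, key=counts.get, reverse=True):
    print(node, counts[node], ranks[node])
```

In this toy network D and E both receive two citations and therefore share a rank position, which is the kind of tie that the more elaborate algorithms in the following sections are expected to separate.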
92 Chapter 5 Co-Citation Metrics
Figure 5.6: Citation Count applied to artificial network — Nodes scaled and internally labelled with their Citation Count
5.5.2 Hubs and Authorities
This is the first of the Web based metrics introduced in Section 3.6.1. Hubs and Authorities (HITS) (Kleinberg 1999) consists of two interacting recursive algorithms, one which calculates Hub scores and one which calculates Authority scores. A good Hub is something which points to good Authorities, so its score is calculated from the Authority scores, and a good Authority is pointed to by good Hubs, so its score in turn depends on the Hub scores.
Using HITS to locate individual high ranking articles can be achieved by only considering the results of the Authority score calculation. Figure 5.7 shows the Authority result of five iterations of the HITS algorithm over the artificial network.
With each node initially receiving a score of one, the algorithm is required to be iterative. The number of iterations depends mainly on the initial values used for each node (here 1/|V|, where |V| represents the number of nodes) and a bit of experimentation. Brin et al. (1998) found the number of iterations required for their PageRank algorithm (a metric similar in requirement to HITS) to converge to be linear in log n. At the time their experimentation over a link graph of 322 million links converged in roughly 52 iterations, with half this number of links requiring 45 iterations. With only 32 links in the test network it was found, by experimentation, that five iterations were sufficient for the results of HITS, PageRank and CoRank to converge and stabilise.
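The mutual dependence of Hub and Authority scores, and the use of a fixed number of iterations, can be sketched as follows. This is an illustrative sketch rather than the implementation used for Figure 5.7: the graph is a hypothetical toy network, the function names are invented for the example, and scaling the scores to unit length after every iteration is an assumption made here. With that scaling, a uniform starting value of 1 or of 1/|V| leads to the same result.

```python
import math

# Hypothetical toy network (not the artificial network of this chapter).
# Each key cites the nodes in its list.
cites = {
    "H1": ["A1", "A2"],
    "H2": ["A1", "A2"],
    "H3": ["A1"],
    "X": ["A3"],
    "Y": ["A3"],
    "A1": [], "A2": [], "A3": [],
}

def normalise(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}

def hits(graph, iterations=5):
    """Return (hub, authority) score dictionaries after a fixed number of iterations."""
    hub = {n: 1.0 for n in graph}
    auth = {n: 1.0 for n in graph}
    for _ in range(iterations):
        # Authority score: sum of the Hub scores of the citing nodes.
        new_auth = {n: 0.0 for n in graph}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of the (new) Authority scores of the cited nodes.
        new_hub = {n: sum(new_auth[t] for t in graph[n]) for n in graph}
        auth, hub = normalise(new_auth), normalise(new_hub)
    return hub, auth

hub, auth = hits(cites)
by_authority = sorted(auth, key=auth.get, reverse=True)
print(by_authority[:3])
```

In this toy network A2 and A3 each receive two citations, yet A2 ends with the higher Authority score because its citations come from H1 and H2, which are stronger Hubs than X and Y.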
Figure 5.7: Nodes ranked by Authority Score — Internal numbers represent rank position while size remains an indicator of citation count standing
In Figure 5.7 the nodes remain the size dictated by their Citation Count, thus the bigger the node the more citations it receives. The authority rankings are represented by the numbers contained inside the nodes (unlike in Figure 5.6), thus node 16 (the node numbers are outside the nodes) is ranked as the most authoritative. By Citation Count the most cited nodes were 17, 16, 9, 10, 12 and 14, all of which received three or more citations. By Authority score the order changes to 16, 9, 10, 17, 14. Only node 12 is missing from this top five: even though node 12 receives three citations, these have not been deemed to be from strong Hubs.
5.5.3 PageRank
PageRank (Brin et al. 1998), explained in Section 3.6.2, works on the basis that not every link in the network carries an equal weight. The introduction of this weighted system, where a rank is dependent on the rank of the linking item, is seen as an ideal mechanism through which false positives can be handled whilst maintaining a high position for important articles.
Although in a peer reviewed environment false positives are rare, as publication mechanisms become more open and the amount of available material grows, the suitability of, and need for, considering such factors may become apparent. On the Web, if the rank of a Web page was calculated by simply adding up all the links towards it, then it would be relatively easy to publish new pages which provide links purely for the purpose of increasing rank.
In the context of this study, comparing the performance of PageRank against citation count will show how the two algorithms are related and help to place CoRank, which is based upon PageRank.
Figure 5.8: Nodes ranked by PageRank — Internal numbers represent rank position while size remains an indicator of Citation Count standing
Figure 5.8 shows the results of applying PageRank to the artificial network for a series of five iterations (the minimum number for the rank positions to stabilise on this network). From this, it can be seen that the results are very similar to those obtained by Citation Count, with the top six including all the nodes which have a Citation Count of three or more. With PageRank being a more complex algorithm than Citation Count, a finer rank order of papers is obtained, separating those that have the same Citation Count. Figure 5.8 also demonstrates how PageRank works. Nodes 10 and 12 are both cited by three nodes of equal total weight (here defined as total citation count); the difference between the two is node 3. Node 3 cites node 12 and is also cited by node 10. Due to the high PageRank of node 10, the rank of node 12 is increased via the direct citation from node 3, demonstrating the iterative nature and dependencies of PageRank.
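The way rank is passed along a chain of citations can be illustrated with a small iterative sketch. Several things here are assumptions made for the example and are not taken from this chapter: the graph is a hypothetical toy network, the damping factor is set to 0.85, and the rank held by nodes without outgoing links is spread evenly over all nodes so that the scores keep summing to one.

```python
# Hypothetical toy network (not the artificial network of this chapter).
# Each key cites the nodes in its list.
graph = {
    "S1": ["T"], "S2": ["T"], "S3": ["T"],  # three citations make T highly ranked
    "S4": ["P"],                            # P is cited once, by an uncited node
    "T": ["M"],                             # T cites M ...
    "M": ["Q"],                             # ... and M cites Q
    "P": [], "Q": [],
}

def pagerank(graph, d=0.85, iterations=5):
    """Iterative PageRank; d and the handling of uncited mass are assumptions."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared by every node.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - d) / n + d * dangling / n
        new_rank = {v: base for v in nodes}
        # Each node passes its rank, split equally, to the nodes it cites.
        for src, targets in graph.items():
            for dst in targets:
                new_rank[dst] += d * rank[src] / len(targets)
        rank = new_rank
    return rank

pr = pagerank(graph)
for node in sorted(pr, key=pr.get, reverse=True):
    print(node, round(pr[node], 4))
```

P and Q each receive exactly one citation, so Citation Count cannot separate them. Q is ranked above P here because its single citation comes from M, which in turn inherits rank from the highly cited T.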
Node 10 is also newer than node 12 (from Figure 5.2), thus it would be interesting to see if PageRank reveals newer publications, one of the aims of finding a “better” algorithm. This is very unlikely, however, as each publication's PageRank is based directly upon the citing publications' PageRank recursively, thus PageRank should take longer to establish than Citation Count.
To obtain the result shown in Figure 5.8, a number of iterations were required. From the work of Brin et al. (1998), who looked at ways to optimise PageRank to be computable in reasonable time, it was discovered that computation time scales with log n. This meant that even for an exponential number of citations, rank order can still be computed in reasonable time. In their implementation Brin and Page also discuss the removal of dangling links; these are links to nodes which do not have any outgoing links (an example can be seen between node 5 and 14 in the artificial network, where node 14 has no outgoing links). Upon examination of how initial removal of these links would affect the results, no discernible difference was found. This is perhaps why there is no mention of this concept in the subsequent publication on PageRank (Brin & Page 1998), possibly also due to the negligible difference in compute time between finding and removing them versus leaving them in.
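As an illustration of what removing dangling links involves, the following sketch drops every link that points to a node with no outgoing links. It is a single-pass illustration under assumptions: the graph and the function name `remove_dangling_links` are hypothetical, and removing links in this way can leave further nodes without outgoing links, which this sketch does not follow up.

```python
# Hypothetical toy network (not the artificial network of this chapter).
# Each key cites the nodes in its list; "c" has no outgoing links.
toy = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": [],
}

def remove_dangling_links(graph):
    """Return a copy of graph without links to nodes that have no outgoing links."""
    dangling = {v for v, targets in graph.items() if not targets}
    return {v: [t for t in targets if t not in dangling]
            for v, targets in graph.items()}

pruned = remove_dangling_links(toy)
print(pruned)
```

The original graph is left untouched, so results computed with and without the dangling links can be compared side by side.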
When applying PageRank to the number of resources on the Web, Brin and Page added a damping factor d, which aims to model a “random surfer” who will not continue to follow more than about 15% of links (when d = 0.85). When PageRank is applied to links on the Web, a starting place is chosen and then only 15% of links are followed before starting again somewhere else in the network. Completing a full iteration for every page on the Web is seen as both intractable and unnecessary for the results to be accurate; if a website is popular enough, then the likelihood is that it will be linked to by enough people to be picked up in the 15% of followed links. Brin et al. (1998) realised that the damping factor could be used to favour some websites over others; however, this mechanism would only be useful to users who wished to create their own custom search preferences.
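For reference, a commonly used normalised form of PageRank with a damping factor is given below. The notation is assumed here rather than taken from this chapter: N is the number of nodes, In(p) is the set of nodes linking to p, and Out(q) is the set of nodes that q links to.

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \in \mathrm{In}(p)} \frac{PR(q)}{|\mathrm{Out}(q)|}
```

In this form d weights the rank inherited through incoming links, while the remaining 1 - d is shared equally among all nodes to represent the random jump.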
15% may also be a good indicator of the number of citations a reader may choose to follow between scholarly communications, such as that outlined during the Reader Pathway (RP) study looked at earlier (Section 4.2). Bollen et al. (2005) examined how readers of publications follow from one publication to another via the reference list. Although Bollen did not calculate the damping factor, it would not be a great surprise to find that a reader followed around 15% of citations in each publication. In the case of the artificial network this damping factor will have no effect, as PageRank is being applied in its eigenvector form, where a number of full iterations over every node will be completed.
Since the artificial network only details 18 publications, there is no need to selectively follow links and a full iteration can be easily undertaken. PageRank scores for all of the nodes can be computed without requiring the random jumping between them, therefore whatever the damping factor is set to, it will have no effect on the rank order result.