4.3. COMPARATIVE ANALYSIS OF THE LEVELS OF DEVELOPMENT
4.3.1. Comparative analysis of the level of development of the capacity to
For network clustering, the first step is to determine how many links (degree $d_u$) each distributed system $u$ should have. Once the degree is determined, the system will interact with a large number of other systems (from a random pool) and select only $d_u$ systems as neighbors based on a connectivity probability function guided by the clustering exponent $\alpha$.
In the main experiments on the ClueWeb09B collection (details in Section 5.1), we collect information about each web site/system's incoming hyperlinks and normalize the in-degrees to obtain their $d_u$ values. We will control the range of the degree distribution, $[d_{\min}, d_{\max}]$, used for the normalization and study its impact on search performance. Given the number of incoming hyperlinks $d'_u$ of system $u$, the normalized degree will be computed by:
$$d_u = d_{\min} + \frac{(d_{\max} - d_{\min}) \cdot (d'_u - d'_{\min})}{d'_{\max} - d'_{\min}} \qquad (4.5)$$
where $d'_{\max}$ is the maximum degree value in the hyperlink in-degree distribution and $d'_{\min}$ the minimum value in the same distribution. Once degree $d_u$ is determined
from the degree distribution, a number of random systems/agents will be added to its neighborhood such that the total number of neighbors $\hat{d}_u \gg d_u$, e.g., $\hat{d}_u = 1{,}000$ given $d_u = 30$. Then, the current agent ($u$) queries each of the $\hat{d}_u$ neighbors ($v$) to determine their topical distance $r_{uv}$. Finally, the following connection probability function is used:
$$p_{uv} \propto r_{uv}^{-\alpha} \qquad (4.6)$$
where $\alpha$ is the clustering exponent and $r_{uv}$ the pairwise topical (search) distance. The finalized neighborhood size will equal the expected number of neighbors, i.e., $d_u$. With a positive $\alpha$ value, the larger the topical distance, the less likely two systems/agents are to connect. As illustrated in Figure 3.4, large $\alpha$ values lead to highly clustered networks, while small values produce random networks with many topically remote connections, or weak ties.
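The two steps above (degree normalization in Eq. 4.5 and probabilistic neighbor selection in Eq. 4.6) can be illustrated with a minimal Python sketch. The function names, the fallback for a degenerate in-degree range, and the choice of sequential weighted sampling without replacement are assumptions made for illustration; the text only specifies the proportionality $p_{uv} \propto r_{uv}^{-\alpha}$ and a final neighborhood of size $d_u$.

```python
import random


def normalize_degree(raw_indegree, raw_min, raw_max, d_min, d_max):
    """Map a raw in-degree d'_u onto [d_min, d_max] following Eq. 4.5."""
    if raw_max == raw_min:
        # Assumed fallback (not in the text): all sites share one in-degree.
        return d_min
    scale = (raw_indegree - raw_min) / (raw_max - raw_min)
    return d_min + (d_max - d_min) * scale


def select_neighbors(distances, d_u, alpha, rng=None):
    """Keep d_u candidates, each drawn with weight r_uv ** -alpha (Eq. 4.6).

    distances maps a candidate id v to its topical distance r_uv (> 0).
    Sampling without replacement is an assumption of this sketch.
    """
    rng = rng or random.Random()
    pool = dict(distances)
    chosen = []
    while pool and len(chosen) < d_u:
        ids = list(pool)
        weights = [pool[v] ** -alpha for v in ids]
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # a candidate can be selected only once
    return chosen
```

For example, `normalize_degree(50, 0, 100, 30, 60)` returns `45.0`; since Eq. 4.5 can yield non-integer values, a caller would presumably round the result before passing it as `d_u`. With `alpha = 0` all weights equal 1, so selection is uniformly random, which corresponds to the random-network end of the range described above.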
Chapter 5
Experimental Design

5.1 Data Collection
We plan to use the ClueWeb09 Category B collection, created by the Language Technologies Institute at Carnegie Mellon University, for IR experiments. The ClueWeb09 collection contains roughly 1 billion web pages (25 TB uncompressed) and 8 billion outlinks (71 GB uncompressed) crawled during January and February 2009. Category B is a smaller subset containing the first crawl of 50 million English pages (1 TB uncompressed) from 3 million sites, with 454 million outlinks (3 GB uncompressed). The ClueWeb09 dataset, though still in its first year, has been adopted by several TREC tracks, including the Web track and the Million Query track. Additional details about the ClueWeb09 collection can be found at http://boston.lti.cs.cmu.edu/Data/clueweb09/.
A hyperlink graph is provided for the entire collection and for the Category B subset. Anchor text, however, is not provided as part of the link graph. In the Category B subset, there are 428,136,613 nodes and 454,075,604 edges (hyperlinks). Nodes include the first crawl of 50 million pages and additional pages that were linked to. Only 18,607,029 nodes are sources (starting pages) of edges, with an average of 24 outlinks per source node, whereas 409,529,584 nodes have no outgoing links captured in the subset.
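The node and edge counts quoted above are mutually consistent, which a short arithmetic check makes explicit. The variable names below are illustrative only; the numbers are taken directly from the text.

```python
# Figures quoted in the text for the Category B hyperlink graph.
total_nodes = 428_136_613
total_edges = 454_075_604
source_nodes = 18_607_029  # nodes with at least one captured outlink

# Nodes without captured outgoing links are the remainder.
sink_nodes = total_nodes - source_nodes

# Average number of outlinks per source node.
mean_outlinks_per_source = total_edges / source_nodes

print(sink_nodes)                       # 409529584
print(round(mean_outlinks_per_source))  # 24
```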
Analysis of the Category B hyperlink graph produces Figure 5.1, which shows (a) the in-degree frequency distribution and (b) the out-degree distribution (on log/log coordinates). The in-degree distribution has two linear parts on the log/log coordinates, with a cutoff at $k \approx 50$.
[Figure 5.1: ClueWeb09 Category B Web Graph: Degree Distribution. (a) In-degree distribution; (b) Out-degree distribution. X axis: in-/out-degree (k); Y axis: degree frequency f(k); log/log coordinates.]
Based on 50,221,776 pages extracted from 2,777,321 unique domains (treated as sites) in the Category B subset, we have also analyzed the distribution of the number of pages per web site. The mean number of pages per site is 18. The distribution is shown on log/log coordinates in Figure 5.2 (a). Figure 5.2 (b) shows the cumulative distribution, in which the Y dimension denotes the frequency of web sites with a size $\geq s$, where $s$ is represented on X.
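As a hedged illustration of how the two curves in Figure 5.2 relate, the sketch below derives a size frequency $f(s)$ and a cumulative frequency $f(\geq s)$ from a list of per-site page counts. The function name and the toy input are hypothetical; this is not the analysis code used to produce the figures.

```python
from collections import Counter


def size_distributions(site_sizes):
    """Return (freq, cumulative) for a list of per-site page counts.

    freq[s]       = number of sites with exactly s pages
    cumulative[s] = number of sites with at least s pages
    """
    freq = Counter(site_sizes)
    cumulative = {}
    running = 0
    # Walk sizes from largest to smallest so the running total counts
    # every site whose size is >= the current s.
    for s in sorted(freq, reverse=True):
        running += freq[s]
        cumulative[s] = running
    return dict(freq), cumulative
```

For the toy input `[1, 1, 2, 5, 5, 5]`, the function returns the frequencies `{1: 2, 2: 1, 5: 3}` and the cumulative counts `{5: 3, 2: 4, 1: 6}`; the cumulative value at the smallest size always equals the total number of sites.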
Figure 5.3 (a) shows the page size (text length) frequency distribution on log/log coordinates. There are several visible peaks on the graph; that is, many web pages have a content length of roughly 12 KB, 17 KB, or 65 KB. The mean size is 1,109 KB while the median is 622 KB. Figure 5.3 (b) shows the cumulative form, in which the Y dimension denotes the frequency of pages with a size $\geq l$, where $l$ is represented on X.
[Figure 5.2: ClueWeb09 Category B Data: # pages per site distribution. (a) Site size (# pages) distribution; (b) Cumulative size distribution. X axis: web site size (# pages) (s); Y axis: size frequency f(s) in (a), cumulative size frequency f(>=s) in (b); log/log coordinates.]

We also analyzed the distribution of web pages across major top level domains such as .com and .edu. Figure 5.4 shows the major top level domains with the largest numbers of web pages. Note that Y is log-transformed.
Another dataset from TREC, namely the Genomics track 2004 benchmark collection, is being considered in this research for additional experiments. The data collection is a ten-year subset of Medline from 1994 to 2003, with 4,591,008 citations containing titles, abstracts, authors, etc. (Hersh et al., 2004). The number of articles in each year is shown in Figure 5.5 (a). There are 808,771 unique scholars and 17,443,160 author-article pairs. On average, each scholar (co-)authored five to six articles while each article has roughly three to four authors. Figure 5.5 (b) shows the frequency distribution of scholarly productivity (the number of articles each scholar published) in the TREC Genomics collection. Probably due to name ambiguity, several authors appear to have published more than one thousand papers (bottom-right of Figure 5.5 (b)).