We now present an application of our rule-based LP. In particular, we are interested in labeling online bibliographies with geographic information in order to analyze the social behavior reflected in the data. Geographic information about researchers over time and across countries is of great interest to many kinds of organizations and public authorities. Therefore, our goal is to create geo-annotated versions of large-scale bibliographies. For a bibliography such as DBLP [137], the goal is to tag each of its more than five million author-paper-pairs with an affiliation and its geographic information. This would allow us to analyze the movement of computer science researchers worldwide. We are motivated by the observation that many collective human activities have been shown to exhibit universal patterns. However, the possibility of strong regularities underlying computer science researcher migration has barely
been explored at a global scale. Fortunately, in the Internet era the Web stores large amounts of frequently updated data on researchers. We demonstrate in this section and in Chapter 6 that this information can be used to extract the migration behavior of researchers and to learn models of the underlying process. Bibliographic sites on the Web, such as DBLP, are publicly accessible and contain millions of publication records. Papers are written virtually everywhere in the scientific world, and the affiliations of authors tracked over time can serve as a proxy for migration. Unfortunately, many if not most of the prominent bibliographic sites do not provide affiliation information, and no dataset available on the Web immediately allows such an analysis. Instead, we first need to build a migration dataset before we can conduct a large-scale investigation of migration. To this end, we harvested data from different sources freely accessible on the Web and merged them into bibliographic databases.

However, not all of the necessary affiliation information is available (it may in fact be impractical to gather), and what is available is uncertain. Therefore, we have to rely on a Machine Learning algorithm to fill in the blank spots. This is precisely where our relational LP comes into play. It is well suited to datasets based on bibliographies because these databases are inherently relational. For example, the co-authorship relation allows us to construct a social network of scientists, and the labels naturally correspond to the affiliation of an author-paper-pair.

Compiling the data in itself represents a significant advance in the field of quantitative analysis of research and migration patterns. Official and commercial records are often access-restricted, incompatible between countries, and, above all, not registered across researchers. Instead, we present a general machinery for propagating geographical seed locations retrieved from the Web across online bibliographies. Next, we describe how we gathered the data to construct a semi-labeled graph on which we then run LP.

5.2.1 Harvesting Data

The Web provides several freely accessible bibliographies with millions of papers and authors. However, most of them do not contain affiliations or geo-information, although this information is crucial for an extensive study of researchers' migration behavior. Our goal is to label every author-paper-pair in a bibliography with the affiliation of that author and its geographic location. Although such labels can be retrieved manually or semi-automatically, a full labeling of large databases such as DBLP is not practical with these methods. In addition, an effective automated machinery for this task also makes it much easier to update the database continuously as new papers arrive.

To this end, we assume an initial bibliography consisting of papers and their authors. We start by adding affiliation information to the authors in our bibliographic database. To obtain the affiliations, we could look at every paper and extract the information from its title page. Clearly, this is not feasible for a bibliography with millions of papers. Fortunately, other information sources on the Web contain such information. One of these is the ACM Digital Library. Unfortunately, ACM DL does not allow a full download of the data. Consequently, we retrieved the affiliation information of only a few author-paper-pairs, randomly selected from ACM DL, which we then matched with our initial bibliography. This gave us initial seed affiliations per author for different papers. In order to fill in the missing information, we want to resort to LP. To do so, we have to be a little more careful. First, the names of the affiliations in ACM DL are not in canonical form, which results in a very large
set of affiliation candidates. Second, although we now have partial affiliation information, we still lack the exact geo-information of the organizations needed to identify cities, countries, and continents. Many of the affiliation names contain a reference to the city or country, but these pieces of information are not trivial to extract from the raw strings. Additionally, we want latitude and longitude values to enable further analysis and visualization.

Example 5.1. Latitude and longitude data allow us to calculate the exact distances between collaborators. One could hypothesize that the Internet age has removed barriers and enabled long-distance collaborations. Having this precise geo-information in our database allows us to investigate such a hypothesis empirically.
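To make Example 5.1 concrete, the following minimal sketch computes the great-circle distance between two latitude/longitude pairs with the haversine formula. It is an illustration only: the function name and the Earth radius constant are assumptions, and the formula treats the Earth as a sphere, so the result is an approximation of the true distance.

```python
import math

# Mean Earth radius in kilometres (an assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Applied to the geo-tags of two co-authors of the same paper, such a function yields the collaboration distance that the hypothesis above refers to.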

This geo-location issue can be resolved by using Google's Geocoding API (see footnote 3). Querying the API resulted in geo-tags for most of the affiliation strings. The remaining gap arises primarily because the Google API does not find geo-locations for all of the retrieved affiliation strings, essentially because the strings contain information unrelated to the geo-location, such as departments and e-mail addresses. In any case, as our empirical results will show, this yielded enough information to propagate the seed affiliations, and in turn the geo-locations, across the initial network of authors and papers.
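The geocoding step could be sketched as follows. This is not the harvesting code used here: the cleaning heuristic and all function names are hypothetical, the endpoint and the response fields (`status`, `results`, `address_components`, `geometry.location`) follow the public Geocoding API documentation and should be checked against its current version, and the HTTP request itself is deliberately left out.

```python
import re
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"
EMAIL_PATTERN = re.compile(r"\S+@\S+")


def clean_affiliation(raw):
    """Drop e-mail addresses and collapse whitespace (a hypothetical heuristic)."""
    without_email = EMAIL_PATTERN.sub(" ", raw)
    return " ".join(without_email.split()).strip(" ,;")


def build_geocode_url(affiliation, api_key):
    """Compose the request URL; sending the request is left to the caller."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": affiliation, "key": api_key})


def parse_geocode_response(payload):
    """Return (city, lat, lng) for the first result, or None if nothing was found."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    best = payload["results"][0]
    location = best["geometry"]["location"]
    city = None
    for component in best.get("address_components", []):
        if "locality" in component.get("types", []):
            city = component["long_name"]
            break
    return city, location["lat"], location["lng"]
```

Strings for which `parse_geocode_response` returns `None` correspond to the remaining gap described above and stay without a geo-tag.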

5.2.2 Inferring Missing Data

Before we infer the missing labels of author-paper-pairs, we revise the affiliation data obtained so far. To further increase the quality of our harvested affiliations, we hypothesize that there are actually not that many relevant organizations in computer science and that the harvested names need to be de-duplicated. This hypothesis is supported by services such as MS Academic Search, which currently lists only 13,276 organizations, compared to the more than 150k names we obtained from crawling seed affiliations from ACM DL. Since geo-locations are now attached to most of the affiliation strings, we can use this information for a simple entity resolution that helps to resolve this issue. More precisely, we clustered together affiliations for which the retrieved city coincides. Admittedly, this approach does not distinguish between multiple affiliations in the same city, such as MIT and Harvard, which are both in Cambridge, MA, USA. However, the approach is simple yet effective and, as our empirical results in Chapter 6 show, the resolution is sufficient to establish strong regularities in the timing events.
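The city-level entity resolution described above can be illustrated with a few lines of code. The sketch below assumes that each affiliation string has already been mapped to a (city, region, country) tuple, or to `None` when geocoding failed; the function name and the input format are hypothetical.

```python
from collections import defaultdict


def cluster_by_city(geocoded):
    """Group affiliation strings whose geocoded city coincides.

    `geocoded` maps a raw affiliation string to a (city, region, country)
    tuple, or to None when no geo-location was found.
    """
    clusters = defaultdict(list)
    for name, place in geocoded.items():
        if place is not None:  # strings without a geo-tag stay unclustered
            clusters[place].append(name)
    return dict(clusters)


# Illustrative input: two spellings of one organization plus a second
# organization in the same city end up in a single cluster.
example = {
    "MIT": ("Cambridge", "MA", "USA"),
    "Massachusetts Institute of Technology": ("Cambridge", "MA", "USA"),
    "Harvard University": ("Cambridge", "MA", "USA"),
    "unparseable affiliation string": None,
}
clusters = cluster_by_city(example)
```

The example reproduces the limitation noted above: MIT and Harvard fall into the same cluster because the key is the city rather than the organization.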

Based on these known geo-locations, we fill in the missing ones by using LP as described above. LP propagates the known cities to the unlabeled author-paper-pairs based on the similarity between the nodes. Correspondingly, the set of constants, C, consists of the author-paper-pairs, and the LP graph has one node for each author-paper-pair that we want to label with a city.

As we have mentioned above, a similarity function that is too dense will make the algorithm impractical, especially in large-scale bibliographies with millions of authors and papers. Hence, we resort to logical rules to formulate the similarity. These rules are based on relations such as co-authorship between the authors associated with the nodes. Specifically, in order to define the edges, we considered the following functions over the set of nodes that return facts about the nodes:

• author(i): returns the author of an author-paper node

• paper(i): returns the paper of an author-paper node

• year(i): returns the year of publication of an author-paper node.

P   A     Y      Aff   Aff*
1   1     2000   g     g
2   2     2000   b     b
3   2     2001   r     r
4   1,2   2002   ?,?   r,r
5   1     2002   ?     r
6   2     2003   r     r
7   1     2004   r     r
8   2     2004   g     g

(a) Example database

[Graph not reproduced: the author-paper nodes of author A1 (papers 1, 4, 5, 7) and author A2 (papers 2, 3, 4, 6, 8) are arranged along a time axis. An R2 edge connects papers 4 and 5 of A1, R3 edges connect consecutive papers of A2, and an R1 edge connects the two nodes of paper 4.]

(b) The graph for our rule-based LP

Figure 5.1: LP for geo-tagging bibliographies. Missing geo-tags from the example database (a) are estimated by propagating the known cities/geo-locations across the network of authors and papers (b).

Footnote 3: developers.google.com/maps/documentation/geocoding

Based on these functions, we can now define the following logic-based rules, each of which adds a rule-specific weight $w_a$ to every matching edge $e_{i,j}$. Initially, we set all edge weights $W_{ij}$ to zero. The first rule R1,

$$W_{ij} = W_{ij} + w_1 \quad \text{if } \mathrm{paper}(i) = \mathrm{paper}(j),$$

adds a weight between two nodes if they belong to two authors who co-author the paper associated with nodes $i$ and $j$. The second rule R2,

$$W_{ij} = W_{ij} + w_2 \quad \text{if } \mathrm{author}(i) = \mathrm{author}(j) \wedge \mathrm{year}(i) = \mathrm{year}(j),$$

adds a weight whenever two nodes correspond to different publications by the same author in the same year. Finally, the third rule R3,

$$W_{ij} = W_{ij} + w_3 \quad \text{if } \mathrm{author}(i) = \mathrm{author}(j) \wedge \mathrm{year}(i) = \mathrm{year}(j) + 1,$$

fires when the nodes belong to two publications of the same author written in consecutive years.
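The three rules can be turned into a small weight-construction routine. The sketch below is illustrative rather than the implementation used for the experiments: the `Node` type and `build_weights` are hypothetical names, self-loops are excluded by assumption, and R3 is applied in both directions (a year difference of one) so that the resulting graph is undirected. The default weights are the values w1 = 1, w2 = 3, and w3 = 2 reported in the experiments.

```python
from collections import namedtuple

# One node per author-paper-pair, exposing author(i), paper(i) and year(i).
Node = namedtuple("Node", ["author", "paper", "year"])


def build_weights(nodes, w1=1.0, w2=3.0, w3=2.0):
    """Return a sparse weight dict {(i, j): weight} over node indices, i != j."""
    weights = {}
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i == j:
                continue  # assumption: no self-loops
            w = 0.0
            if a.paper == b.paper:  # R1: co-authors of the same paper
                w += w1
            if a.author == b.author and a.year == b.year:  # R2: same author, same year
                w += w2
            if a.author == b.author and abs(a.year - b.year) == 1:  # R3, applied symmetrically
                w += w3
            if w > 0:
                weights[(i, j)] = w
    return weights


# Author-paper-pairs of the example database in Figure 5.1a.
EXAMPLE_NODES = [
    Node(1, 1, 2000), Node(2, 2, 2000), Node(2, 3, 2001),
    Node(1, 4, 2002), Node(2, 4, 2002), Node(1, 5, 2002),
    Node(2, 6, 2003), Node(1, 7, 2004), Node(2, 8, 2004),
]
W = build_weights(EXAMPLE_NODES)
```

The all-pairs loop is quadratic in the number of nodes and only suitable for a toy example. For a large bibliography one would instead group nodes by paper and by author and compare nodes only within each group, which keeps the matrix construction sparse.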

Example 5.2. The construction of the LP matrix and its corresponding graph is depicted in Figure 5.1b for the example publication database in Figure 5.1a. The example database is missing the affiliation information for papers 4 and 5, which is denoted by "?" in the "Aff" column. The graph for propagating the information is constructed as follows. There is a node in the graph for each pair of an author from column "A" and the corresponding paper in column "P". Two nodes are connected if they belong to the same author and to the same or consecutive years, or if they belong to two co-authors of the same paper. The colors of the nodes indicate known cities, and white nodes indicate unknown locations.

We then run LP on the constructed graph to obtain a distribution over the possible cities for every unlabeled node. Running LP on the graph of our running example in Figure 5.1b labels the unknown nodes. As the last column "Aff*" in Figure 5.1a shows, the previously missing labels for papers 4 and 5 have been inferred, and the graph is now completely labeled. This is a very simple case because the unlabeled nodes are connected only to red neighbors. Therefore, the color of the unlabeled nodes is fully determined: only red can be propagated to them. Usually, the situation is less clear. For example, if paper 6 of A2 were green, the label scores of the nodes would have a probabilistic interpretation, and the unlabeled nodes could be labeled either green or red. In this situation, with paper 3 of A2 labeled red and paper 6 of A2 labeled green, there would be a tie between the two colors because paper 4 of author A2 is connected to both labeled nodes in the same way. We now present experiments with the rule-based LP on publicly available online bibliographies.
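The running example can be reproduced with a few lines of code. Since the LP update itself is defined earlier and not repeated in this section, the sketch below assumes a standard clamped label propagation: each unlabeled node repeatedly takes the weighted average of its neighbors' label scores (row-normalized weights), while seed nodes keep their labels. The function name, the node numbering, and the uniform initialization are assumptions made for illustration.

```python
def label_propagation(edges, seeds, n_nodes, classes, iterations=200):
    """Iterate Y <- T Y with row-normalized weights, clamping the seed nodes."""
    neighbors = {i: [] for i in range(n_nodes)}
    for (i, j), w in edges.items():  # undirected edges, each pair listed once
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    uniform = [1.0 / len(classes)] * len(classes)
    scores = [
        [1.0 if c == seeds[i] else 0.0 for c in classes] if i in seeds else list(uniform)
        for i in range(n_nodes)
    ]
    for _ in range(iterations):
        updated = []
        for i in range(n_nodes):
            total = sum(w for _j, w in neighbors[i])
            if i in seeds or total == 0:
                updated.append(scores[i])  # clamp seeds, keep isolated nodes
                continue
            updated.append([
                sum(w * scores[j][k] for j, w in neighbors[i]) / total
                for k in range(len(classes))
            ])
        scores = updated
    return scores


# Nodes: 0=(A1,P1) 1=(A2,P2) 2=(A2,P3) 3=(A1,P4) 4=(A2,P4)
#        5=(A1,P5) 6=(A2,P6) 7=(A1,P7) 8=(A2,P8)
CLASSES = ["r", "g", "b"]
EXAMPLE_EDGES = {(3, 4): 1.0, (3, 5): 3.0, (1, 2): 2.0, (2, 4): 2.0, (4, 6): 2.0, (6, 8): 2.0}
EXAMPLE_SEEDS = {0: "g", 1: "b", 2: "r", 6: "r", 7: "r", 8: "g"}

scores = label_propagation(EXAMPLE_EDGES, EXAMPLE_SEEDS, 9, CLASSES)
predicted = {i: CLASSES[max(range(3), key=lambda k: scores[i][k])] for i in (3, 4, 5)}

# Variant discussed in the text: paper 6 of A2 is green instead of red.
tie_seeds = dict(EXAMPLE_SEEDS)
tie_seeds[6] = "g"
tie_scores = label_propagation(EXAMPLE_EDGES, tie_seeds, 9, CLASSES)
```

Under these assumptions all three unlabeled nodes receive the label red, and in the variant with a green paper 6 the red and green scores of the node for paper 4 of A2 coincide, which is the tie described above.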

5.2.3 Experiments

With LP based on logical rules at hand, let us now turn to filling in missing geo-tags in bibliographic databases. There are different choices for the starting point of the data harvesting process. Ultimately, we are interested in a bibliography covering all scientific disciplines. To begin with, however, we focus on computer science. For a qualitative evaluation, we are interested in a dataset with as much ground truth as possible, in order to answer the following question:

Q5.1 Is the relational LP capable of producing meaningful geo-tags with high accuracy for our bibliography based on few seed locations?

In order to answer Q5.1, we require a manually curated dataset for which a relatively large number of affiliations is known in advance. As we will see, the AAN bibliography, which is described in detail below, serves as a good starting point. For all our experiments, we used the following weights for the rules described above: w1 = 1, w2 = 3, and w3 = 2. These weights were found by a grid search on a small subset of the data. Furthermore, all experiments were run on a Linux machine with 64GB RAM and 20 cores.

GeoAAN

The ACL Anthology Network (AAN) [184] is a comparatively small, but manually curated, dataset that contains affiliations for many authors and papers from the natural language processing community. In total, the dump in use from August 2013 contains 19,410 publications written by 15,397 authors over a time span of five decades. Although the dataset is manually curated, we cannot directly use the affiliations in the provided form. Many of the affiliation strings represent multiple affiliations for one author. Since we are interested in the geographical location of a researcher, we have to reduce these strings to a single organization. We split the affiliation strings and assume that the first affiliation mentioned is the most likely residential location of a researcher. After reducing the available affiliations to the city level, the resulting number of author-paper-pairs is 49,530, and 33,061 of these nodes are labeled with one of 802 cities. The LP graph G has a total of 145,594 edges, resulting in a very sparse matrix W and, correspondingly, T.

By removing an increasing number of labels from the graph, we construct test sets of different sizes, which we use for the evaluation. We start by removing 10% of the labels, obtaining a graph with 55% of the nodes labeled. We then gradually add 10% of the nodes to the test set until only 6% of the nodes are labeled. We apply this dataset construction ten times to allow for multiple re-runs of the experiment. Table 5.1 shows the average accuracy of the predicted labels for each test set when running LP for 200 iterations. With access to 36% of the labels or more, we achieve an accuracy ≥ 0.80. As expected, reducing the number of labeled nodes slowly decreases the performance. With only 6% labeled nodes, we still achieve an accuracy of 0.58, which is a high performance on a multiclass labeling problem with roughly 800 classes; uniformly random labels would achieve an accuracy of about 1/800 = 0.00125.

Labeled Nodes   Accuracy
6%              0.58
12%             0.67
18%             0.72
24%             0.75
30%             0.78
36%             0.80
42%             0.81
48%             0.82
55%             0.83

Table 5.1: Label accuracy for the AAN dataset with a varying number of initially labeled nodes. One should note that accuracy is a very challenging performance measure for a multiclass labeling problem with around 800 classes.

We will refer to the AAN dataset with augmented geo-tags as GeoAAN. These results clearly answer Q5.1 in favor of our proposed relational LP approach. However, to infer global patterns of migration across various fields of computer science, neither the scope nor the size of GeoAAN is satisfactory. Instead, we want to use a bibliography with millions of papers, such as DBLP. As we will describe next, however, one cannot simply run LP on a dataset of that size. Therefore, we will describe a new LP approach that exploits symmetries in the LP graph and splits the label matrix into chunks in order to obtain label scores for over five million author-paper-pairs.
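The evaluation protocol used for GeoAAN can be outlined in code. The following sketch is an assumption about how such a hold-out evaluation might be organized, not the experimental code: the function names are hypothetical, and the label propagation step that produces the predictions is left out.

```python
import random


def hide_labels(labels, n_keep, rng):
    """Split a {node: city} dict into n_keep training seeds and a held-out test set."""
    kept = set(rng.sample(sorted(labels), n_keep))
    train = {node: city for node, city in labels.items() if node in kept}
    test = {node: city for node, city in labels.items() if node not in kept}
    return train, test


def accuracy(predicted, test):
    """Fraction of held-out nodes whose predicted city equals the hidden one."""
    if not test:
        return 0.0
    hits = sum(1 for node, city in test.items() if predicted.get(node) == city)
    return hits / len(test)


def chance_accuracy(n_classes):
    """Expected accuracy of uniformly random labels over n_classes classes."""
    return 1.0 / n_classes


# Toy data standing in for the labeled author-paper-pairs.
toy_labels = {i: "city%d" % (i % 3) for i in range(10)}
train, test = hide_labels(toy_labels, 4, random.Random(0))
```

Repeating the split with different random seeds and averaging `accuracy` over the runs mirrors the ten repetitions described above, and `chance_accuracy(800)` gives the random baseline of 0.00125 quoted in the text.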