The previously described algorithms make non-trivial assumptions about how place references should be combined to discover the geographic scope of a document. To assess the gains introduced by these assumptions, three simple baselines were implemented. The scope assignment results from a commercial application, namely Yahoo! GeoPlanet, were also considered. The baselines are as follows:
• Assigning the scope according to the most frequently occurring place reference - The number of times a place is referenced in a document reflects the importance of that place to the document’s subject. We therefore experimented with a simple scope assignment method that chooses the most frequently occurring place reference as the scope. In case of ties, the place reference corresponding to the largest area is chosen.
• Assigning the scope according to the bounding box that covers all the place references - The different place references made in the document should all contribute to the document’s scope. We therefore experimented with a simple scope assignment method that computes the bounding box that covers all the place references made in the document.
• Assigning the scope according to the bounding box that covers all the place references that are not outliers - This is a refinement of the previous strategy, in the sense that not all place references should contribute to the scope, but only the place references that are somewhat interrelated. The idea is to filter out the errors made while recognizing and disambiguating place references, as well as the place references that are only tangential to the content of the document. We first compute the average centroid point for all the place references made in the document, as well as the average distance between the place references and this centroid. Then, we filter out those place references whose centroid lies at a distance from the average centroid that is greater than twice the standard deviation of all distance values. Finally, we assign a scope corresponding to the bounding box that covers all the remaining place references. If no place references remain after the filtering step, then the place reference closest to the centroid point is chosen as the scope. This technique was first proposed by Smith and Crane [2001] for toponym disambiguation.
• Yahoo! PlaceMaker Administrative Scope - We also tested the results produced by the recently released geographic scope resolver from Yahoo!. This service is freely available
and works by receiving a document and returning a bounding box corresponding to the administrative area that best describes its geographic content. Despite the fact that the details regarding the algorithm used in this system remain private, a comparison with this baseline is important to assess the current state-of-the-art in commercial applications.
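The three simple baselines can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the input format (a dictionary with `id`, `centroid` and `bbox` keys) and all function names are hypothetical, distances are computed on the plane in decimal degrees, and boxes that cross the antimeridian are not handled.

```python
from collections import Counter
from math import hypot
from statistics import mean, pstdev

# Hypothetical input format: each place reference is a dict with an
# identifier, a centroid (lat, lon) and a bounding box
# (south, west, north, east), all in decimal degrees.


def most_frequent_scope(refs):
    """Baseline 1: most frequent place; ties go to the largest box."""
    counts = Counter(r["id"] for r in refs)
    by_id = {r["id"]: r for r in refs}

    def box_size(r):
        south, west, north, east = r["bbox"]
        return (north - south) * (east - west)  # square degrees, rough proxy

    return max(by_id.values(), key=lambda r: (counts[r["id"]], box_size(r)))


def covering_bbox(refs):
    """Baseline 2: smallest box covering the boxes of all references."""
    return (min(r["bbox"][0] for r in refs),
            min(r["bbox"][1] for r in refs),
            max(r["bbox"][2] for r in refs),
            max(r["bbox"][3] for r in refs))


def covering_bbox_without_outliers(refs):
    """Baseline 3: drop references far from the average centroid, then cover."""
    c_lat = mean(r["centroid"][0] for r in refs)
    c_lon = mean(r["centroid"][1] for r in refs)
    dist = [hypot(r["centroid"][0] - c_lat, r["centroid"][1] - c_lon)
            for r in refs]
    threshold = 2 * pstdev(dist)  # twice the standard deviation of distances
    kept = [r for r, d in zip(refs, dist) if d <= threshold]
    if not kept:
        # Nothing survived the filter: fall back to the closest reference.
        kept = [min(zip(refs, dist), key=lambda pair: pair[1])[0]]
    return covering_bbox(kept)
```

For a document that mentions Lisbon three times, Porto once and Tokyo once, the first baseline selects Lisbon, the second returns a box stretching from Portugal to Japan, and the third discards Tokyo and returns a box around Lisbon and Porto.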
3.3 Summary
This chapter discussed the importance of computing geographic scopes for documents containing multiple place references, both in the context of this thesis and in the more general context of Geographic Information Retrieval applications. The scope assignment task involves two sub-tasks, namely (i) recognition and disambiguation of place references over text, (ii) computation of the actual scope assignment algorithm. Figure 3.2 illustrates the discussed operations in the context of this thesis.
Figure 3.2: Overview of the geographic scope resolver for the geotargeting prototype of this thesis.
The chapter also described previously proposed algorithms for geographic scope resolution, namely the hierarchy-based method proposed for the Web-a-Where system, the spatial overlap-based method proposed in the GIPSY project, the PageRank-based method proposed in the GREASE project, and four baseline methods.
All methods were implemented for the purpose of making a cross-method comparison, using Yahoo! GeoPlanet as the gazetteer, and Yahoo! Placemaker for extracting and disambiguating place references in documents. Chapter 6 of this thesis presents the comparison results. The cross-method comparison of scope resolution algorithms described in this chapter, as well as the results in Chapter 6, were published separately [Anastácio et al., 2009b].
Chapter 4
Locational Relevance Classification
The previous chapter presented several methods for the task of assigning geographic scopes to documents, based on the places referenced in the text. Nonetheless, having a geographic scope does not imply that a potential reader of the document is interested in that region, or any other for that matter. Therefore, in the context of this thesis, there is the problem of trying to estimate the locational relevance of a document, i.e., how relevant is the geography of a given document to the content it describes. For instance, it is reasonable to assume that the Web page of a research group from Lisbon might be relevant to many people not interested in Lisbon, i.e., it has a low locational relevance. On the other hand, the Web page of a local restaurant in Lisbon will probably be of relevance only to people living in Lisbon, or planning to visit the city, i.e., it has a high locational relevance.
To the best of my knowledge, no previous work has addressed the problem of determining the locational relevance of a document. Thus, this thesis proposes an approach for this problem which considers that, for a given document, its locational relevance is the confidence score which results from classifying it as either having a narrow geographic scope, i.e., local class, or having a broad geographic scope, i.e., global class. This approach has three fundamental assumptions: (i) pages with narrow geographic scopes, like Lisbon, are more interesting from a geographical standpoint than pages with broad scopes such as Europe or United States; (ii) textual content is important to determine if a page with a narrow scope truly describes a local theme; (iii) the standard document classification approach, based on a bag-of-words representation, can be further complemented with specific features, better suited to reflect the geographic characteristics.
In the past, Gravano et al. [2003] proposed a technique for classifying search queries as either local or global, using features based on the presence of geographic references in the results produced by a search engine (e.g., total number of references and number of unique references). There are many similarities between the work by Gravano et al. and the proposal of this chapter. However, this work is concerned with classifying documents instead of user queries.
4.1 Document Classification
Document classification is the task of assigning documents to topic classes, on the basis of whether or not they share some features. This is one of the main problems studied in fields such as text mining, information retrieval, or machine learning, with many approaches described in the literature. Sebastiani [2002] published a survey where he thoroughly discusses the main issues and approaches behind binary text classification. Sebastiani concluded that for text classification, which is characterized by having a high dimensionality in terms of the feature vectors, the approach by Joachims [1998] based on support vector machines (SVMs) achieves the best performance.
An SVM classifier conceptually converts the original measurements of the features in the data to points in a higher-dimensional space that facilitates the separation between two classes. While the transformation between the original and the high-dimensional space may be complex, it need not be carried out explicitly. Instead, it is sufficient to calculate a kernel function that only involves dot products between the data points, transformed from the original feature space. Commonly used functions include the linear and Gaussian (RBF) kernels, with the latter being recommended for text classification problems [Joachims, 1998]. Joachims [1998] also reports that most text categorization problems are linearly separable. Nonetheless, Gaussian kernels are able to handle cases where the relation between features is non-linear, an important characteristic since not all of the proposed features for computing the locational relevance are based on textual terms.
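To make the role of dot products concrete, the two kernels can be sketched as follows. This is a minimal illustration with hypothetical function names; it relies on the standard definition of the Gaussian kernel, K(x, y) = exp(-γ ||x - y||²), and on the fact that the squared distance can itself be written using dot products only.

```python
from math import exp


def linear_kernel(x, y):
    """Plain dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    # ||x - y||^2 = <x, x> - 2 <x, y> + <y, y>, i.e., dot products only.
    sq_dist = linear_kernel(x, x) - 2 * linear_kernel(x, y) + linear_kernel(y, y)
    return exp(-gamma * sq_dist)
```

Identical vectors yield a kernel value of 1, and the value decays towards 0 as the vectors move apart, at a rate controlled by γ.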
When dealing with Gaussian-based SVMs one must be aware that there are two parameters, C and γ, which influence how well the classification model separates the classes in the training set. Ideally, these parameters would be tuned through an optimization strategy that includes cross-validation as a way to prevent overfitting. A commonly used approach following this principle is grid-search, where pairs of (C, γ) are evaluated and the one with the best cross-validation accuracy is picked. However, it should be noted that grid-search, as well as any other technique for this task, is extremely time-consuming [Hsu et al., 2009].
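The grid-search procedure can be sketched as below. The sketch is deliberately independent of any SVM library: `cv_accuracy` is a hypothetical, caller-supplied function that trains a classifier for one (C, γ) pair and returns its cross-validation accuracy. The exponentially growing ranges in the usage note follow the sequences commonly suggested in the LIBSVM guide, and are an assumption here, since the text does not state which grid was used.

```python
from itertools import product


def exponential_grid(start_exp, stop_exp, step=2):
    """Values 2**start_exp, 2**(start_exp + step), ..., up to 2**stop_exp."""
    return [2.0 ** e for e in range(start_exp, stop_exp + 1, step)]


def grid_search(cv_accuracy, c_values, gamma_values):
    """Return (best_accuracy, best_C, best_gamma) over all (C, gamma) pairs.

    cv_accuracy(C, gamma) is a hypothetical callback that runs k-fold
    cross-validation for one parameter pair and returns the accuracy.
    """
    best = None
    for c, gamma in product(c_values, gamma_values):
        accuracy = cv_accuracy(c, gamma)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, gamma)
    return best
```

With `exponential_grid(-5, 15)` for C and `exponential_grid(-15, 3)` for γ, the search already covers 11 × 10 = 110 pairs, each requiring a full cross-validation run, which illustrates why the procedure is so time-consuming.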
Particularly important to this application is the output value returned by the SVM classifier for each instance. As previously discussed, this value represents the proposed locational relevance, measuring how local a Web page is by analysing how confident the class assignment was. However, regular SVM classifiers produce uncalibrated predictions for their class assignments, making it necessary to convert them to actual probabilities. Platt [1999] showed that sigmoid calibration is an effective method for converting the output of max-margin classifiers (e.g., SVMs) into probabilities, while Lin et al. [2007] further refined his method. The experiments reported in this thesis use the SVM implementation provided by LIBSVM [Chang and Lin, 2001], which already includes the output calibration method described by Lin et al. [2007].
Besides configuring the classifier, it is also advisable to normalize its input data when different features have very different ranges (e.g., [0, 1] and [0, 100,000]). The reason is that features with large ranges might overshadow the contributions of the ones with smaller ranges. The min-max normalization method preserves the relationships among the original values and is one of the approaches most commonly used by the research community [Han and Kamber, 2006]. Equation 4.1 shows the min-max normalization function, where v refers to the original value and v' to the normalized value, and min and max are the smallest and largest values observed for the feature.
\[ v' = \frac{v - \min}{\max - \min} \tag{4.1} \]
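Both operations discussed above are simple enough to be sketched directly. The function names are hypothetical, and two assumptions should be noted: a constant feature (where max equals min) is mapped to 0, a case that Equation 4.1 leaves undefined, and the sigmoid is shown in its plain form, 1 / (1 + exp(A·f + B)), where A and B are parameters that would be fitted on held-out data. The numerically more robust fitting procedure of Lin et al. [2007] is not reproduced here.

```python
from math import exp


def min_max_normalize(values):
    """Rescale a list of feature values to the [0, 1] range (Equation 4.1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def platt_probability(decision_value, a, b):
    """Sigmoid mapping from an SVM decision value f to P(y = 1 | f).

    a and b are the fitted sigmoid parameters (hypothetical values here).
    """
    return 1.0 / (1.0 + exp(a * decision_value + b))
```

For example, the values 0, 50 and 100 are normalized to 0, 0.5 and 1, and with a negative value for `a` a larger decision value maps to a probability closer to 1.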
4.2 Features
The feature vectors used in the proposed classification scheme result from an analysis of the full text of the documents and the document URLs, as well as from geographic information mined using Geographic Information Retrieval techniques. The considered features can be grouped in three classes, namely (i) textual features, (ii) URL features, and (iii) locative features.
4.2.1 Textual Features
This set of features tries to capture the thematic aspects, encoded in the document’s terminology, that can influence the decision of assigning a document to either the global or local class. The idea is that if a word frequently occurs in documents containing narrow geographic scopes, there is an increased probability that documents containing that word belong to the local class, and vice versa. For instance, intuitively, documents about restaurants or pharmacies are more likely to be local than documents about programming languages or music downloads.
Since texts cannot be directly interpreted by a classifier, there is a need for converting them into a proper representation. This thesis adopts the usual approach, representing documents as vectors of term weights, where each weight is given by the traditional TF-IDF formula presented in Equation 4.2, originally proposed by Salton and Buckley [1988]. In the formula, tf_{i,j} is the number of times that term i appears in document j, while N is the total number of documents in the collection, and n_i is the number of documents where the term i appears. The experiments reported in this thesis use the Weka¹ software package, described by Frank and Witten [2005], in order to generate the vectors of term weights for the documents.
\[ w_{i,j} = tf_{i,j} \times \log \frac{N}{n_i} \tag{4.2} \]
As previously mentioned, text classifiers face the high dimensionality problem, i.e., each new term in the collection represents a new dimension to consider. One technique used for this problem is term extraction, which assumes that a document is only formed by its important words. In this thesis, experiments were performed with the Yahoo! Term Extraction² Web service, a state-of-the-art industrial tool for key term extraction. Its implementation is available via an open Web service, which takes a text document as input and returns a list of significant words or phrases extracted from that document. Another technique for dimensionality reduction is term selection, which operates under the assumption that documents are better represented by the terms that are distributed most differently across the considered categories. Several authors have reported that both Information Gain and Chi-square statistics provide good metrics for determining the most discriminative terms [Sebastiani, 2002; Forman, 2003]. Weka has the capability of executing term selection using several metrics, including the two previously mentioned. In some particular experiments, I adopted the Information Gain, as presented in Equation 4.3.
\[ IG(t_i, cat_k) = \sum_{c \in \{cat_k, \overline{cat_k}\}} \sum_{t \in \{t_i, \overline{t_i}\}} P(t, c) \times \log \frac{P(t, c)}{P(t) \times P(c)} \tag{4.3} \]
In order to thoroughly evaluate the importance of textual features for the locational relevance problem, several experiments were performed. The full set of considered textual features is shown below. All of them include lowercasing, stemming, and stop-word removal using Weka:
• All terms occurring in the document text, weighted according to the TF-IDF scheme;
• Terms selected by the Yahoo! Term Extraction Web service as the most important words in the document, weighted according to the TF-IDF scheme;
• Terms selected using an Information Gain analysis, weighted with the TF-IDF scheme.
Chapter 6 details the experimental methodology and discusses the obtained results.
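As an illustration of the TF-IDF weighting and the Information Gain metric used for these textual features, the following sketch computes both for a small tokenized collection. It is not the Weka implementation used in the experiments: the function names are hypothetical, the natural logarithm is assumed (the formulas do not fix a base), probabilities are estimated as simple document frequencies, and terms with a zero joint probability are taken to contribute nothing to the sum.

```python
from collections import Counter
from math import log


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [{term: tf * log(n_docs / doc_freq[term])
             for term, tf in Counter(doc).items()}
            for doc in docs]


def information_gain(term, docs, labels, category):
    """Information Gain of a term for one category, from document frequencies."""
    n = len(docs)
    total = 0.0
    for has_term in (True, False):        # t ranges over {term, not term}
        for in_cat in (True, False):      # c ranges over {cat, not cat}
            joint = sum(1 for d, l in zip(docs, labels)
                        if (term in d) == has_term
                        and (l == category) == in_cat) / n
            p_t = sum(1 for d in docs if (term in d) == has_term) / n
            p_c = sum(1 for l in labels if (l == category) == in_cat) / n
            if joint > 0:                 # convention: 0 * log(0) = 0
                total += joint * log(joint / (p_t * p_c))
    return total
```

In a toy collection where "restaurant" appears in every local document and in no global one, its Information Gain for the local class is maximal, whereas a term that appears in a single local document scores lower.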
¹ http://www.cs.waikato.ac.nz/~ml/weka
² http://developer.yahoo.com/search/content
4.2.2 URL Features
When classifying Web documents, another source of information that can be used is their Uniform Resource Locator (URL). Previous research has shown that classifiers built from features based solely on URLs can achieve surprisingly good results on tasks such as language identification [Baykan et al., 2008] or topic attribution [Baykan et al., 2009]. Regarding topic attribution, the best approach considered an SVM classifier and character n-grams with a length varying between 4 and 8 characters as the features.
Intuitively, URLs can provide information useful to discriminate between local and global pages, such as top level domains or words like local or regional. For instance, a document whose URL has a top level domain .co.uk is more likely to be local than a document with a top level domain such as .com.
Taking inspiration from the experiments reported by Baykan et al. [2009], the following feature was considered:
• Character n-grams, with n varying between 4 and 8, extracted from the lower-cased document URLs and weighted according to the TF-IDF scheme.
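The extraction of this feature can be sketched as follows. The function name is hypothetical, and the sketch assumes that the n-grams are taken with a sliding window over the whole lower-cased URL string, punctuation included; the text does not specify whether the URL is tokenized first. The resulting counts would then be weighted with the TF-IDF scheme, which is omitted here.

```python
from collections import Counter


def url_char_ngrams(url, n_min=4, n_max=8):
    """Count all character n-grams of the lower-cased URL, n_min <= n <= n_max."""
    text = url.lower()
    grams = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the URL string.
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams
```

For a URL such as `http://Example.co.uk/Local`, the extracted n-grams include `.co.uk` and `local`, which are exactly the kind of hints mentioned above for discriminating local from global pages.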
4.2.3 Locative Features
The considered locative features can be further divided into simple and high level. Simple locative features essentially correspond to counts for locations recognized in the documents, whereas high level locative features correspond to values directly related to the geographic scopes assigned to the documents.
Simple Locative Features
As previously mentioned, Gravano et al. [2003] successfully addressed the problem of classifying search engine queries as either local or global. As features, the authors used information regarding the frequency of different types of place references, present in the results produced by a search engine. The proposed simple locative features are inspired by this approach. Through the use of the geographic text mining services provided by Yahoo!, twenty-two frequency-based values are extracted from a given document’s textual content. The Yahoo! Placemaker text mining service recognizes and disambiguates place references in text, while the Yahoo! GeoPlanet gazetteer service provides additional information about the recognized references.
Having locations referenced in the document can indicate a tendency towards a higher locality, particularly if these locations are all related to a single relatively narrow region. On the other
hand, having no locations whatsoever, or having many locations from different parts of the world, can indicate that the document has a global scope. The proposed features combine the location counts in various ways, aggregating the places according to containment relationships in order to group together the information for places in the same administrative divisions. The frequency counts of place references at different levels of detail (i.e., continent, country, state, county, city), as well as the aggregated totals for all different unambiguous place references, are also considered. The complete set of features is as follows:
• Total number of recognized locations;
• Total number of unique locations;
• Number of unique locations, grouped by city, county, state, country and continent;
• Number of locations, grouped by city, county, state, country and continent;
• Number of unique locations, grouped by city, county, state, country and continent, considering the aggregated sub-locations that are hierarchically below (i.e., the number of counties includes the number of cities referenced in the text, the number of states includes the number of counties plus cities, and so on);
• Total number of locations, grouped by city, county, state, country and continent, considering the aggregated sub-locations that are hierarchically below.
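The twenty-two values listed above can be sketched as a single function. The input format (one `(place_id, level)` pair per recognized mention) and the feature names are hypothetical, and the aggregation is interpreted literally as a running sum over the finer levels, as suggested by the parenthetical example in the list; the actual implementation may aggregate differently.

```python
from collections import Counter

# Administrative levels, ordered from finest to coarsest.
LEVELS = ["city", "county", "state", "country", "continent"]


def simple_locative_features(refs):
    """refs: list of (place_id, level) pairs, one per recognized mention."""
    total = Counter(level for _, level in refs)
    unique = Counter(level for _, level in set(refs))
    features = {"total": len(refs), "unique": len(set(refs))}
    agg_total = agg_unique = 0
    for level in LEVELS:
        # Running sums: each level also counts everything below it.
        agg_total += total[level]
        agg_unique += unique[level]
        features[f"{level}_total"] = total[level]
        features[f"{level}_unique"] = unique[level]
        features[f"{level}_total_aggregated"] = agg_total
        features[f"{level}_unique_aggregated"] = agg_unique
    return features
```

A document mentioning Lisbon twice, Porto once and Portugal once yields 4 locations in total and 3 unique ones, with a city-level count of 3 and an aggregated country-level count of 4.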
High Level Locative Features
The idea behind high level locative features is that documents having broad geographic scopes are more likely to correspond to global pages, and vice versa. Following this assumption, features were based on the area values for the geographic regions assigned by several algorithms as the scope for a given page, as well as on the score they produce for that same scope (see Chapter 3 for more details on geographic scopes and implementation details for the various algorithms). Experiments with different scope resolution methods were performed, since they follow different philosophies. For instance, the GIPSY method chooses the most specific and frequent region, while the Web-a-Where method chooses the region containing the most frequent place references. As for the inclusion of the scope scores, the idea is to attenuate wrong assignments by considering their confidence values. The full set of features is as follows:
• Area for the geospatial region corresponding to the geographic scope of the document, computed with the Web-a-Where method;
• Area for the geospatial region corresponding to the geographic scope of the document, computed with the GIPSY method;
• Area for the geospatial region that covers all the place references extracted from the document;
• Area for the geospatial region corresponding to the most frequent place reference of the document;
• Confidence score assigned by Web-a-Where to the scope computed for the document;
• Confidence score assigned by GIPSY to the scope computed for the document.
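When a scope is available only as a bounding box, its area can be approximated as in the sketch below. This is an assumption made for illustration: the text does not state how the area values were obtained, and they may equally have been taken from the gazetteer. The function name is hypothetical, the Earth is treated as a sphere, and boxes crossing the antimeridian are not handled.

```python
from math import radians, sin

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def bbox_area_km2(south, west, north, east):
    """Approximate area of a latitude/longitude box on a sphere, in km^2."""
    # Area of a spherical rectangle: R^2 * |sin(north) - sin(south)| * |dlon|.
    dlon = radians(east - west)
    band = abs(sin(radians(north)) - sin(radians(south)))
    return EARTH_RADIUS_KM ** 2 * band * abs(dlon)
```

Under this approximation, a box around a single city produces an area several orders of magnitude smaller than a box around a continent, which is the contrast these features are meant to capture.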
4.3 Summary
This chapter presented an approach for determining the locational relevance of a Web page, i.e., how relevant is the geography of a given document. An initial version of the classification scheme described in this chapter was published in a paper by Anastácio et al. [2009a]. I believe this to be an important step in any geographically-aware advertising system, since it provides the capability for deciding if a given Web page should display local advertisements