In this chapter, we have reviewed three families of machine learning methods for classifying and searching documents: supervised text classification, learning to rank, and unsupervised topic modeling. For classifying documents in SEEU, we prefer supervised text classification because of the drawbacks of unsupervised topic modeling. More specifically, we use hierarchical text classification methods for hierarchical topic classification. For ranking documents in SEEU, we will use the pairwise learning to rank approach to rank documents in each category of the topic hierarchy.

Chapter 3

Effective Hierarchical Webpage Classification

In this chapter, we study the hierarchical webpage classification problem. A major challenge in SEEU is to automatically classify a massive number of webpages into a topic hierarchy, so an effective hierarchical classification system is very important for SEEU. To deal with the challenges of learning from large-scale webpage datasets, we first propose an efficient webpage feature extraction tool based on MapReduce. Second, we develop a parallel hierarchical SVM classifier for effective webpage classification. With extensive experiments on the well-known ODP (Open Directory Project) dataset, we empirically demonstrate that our hierarchical classification system is very effective and significantly outperforms the traditional flat classification approach.

The rest of this chapter is organized as follows. In Section 3.1, we describe the webpage feature extraction tool for hierarchical webpage classification. In Section 3.2, we discuss the algorithm to learn hierarchical classifiers. Section 3.3 reports the experimental results on the ODP (Open Directory Project) dataset. The last section contains a summary of this chapter.

The implementation of the hierarchical classification system in Section 3.2 was in collaboration with Da Kuang and Dr. Charles Ling. We jointly published this work in the Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011) [67].

3.1 Webpage Feature Extraction

When we apply text classification algorithms to real-world webpage datasets, the first step is to extract good text features from the webpages. In this section, we describe the webpage features used for learning hierarchical classification models. To deal with the challenges of extracting text features from a large-scale webpage dataset (e.g., one million webpages in the ODP dataset), we develop a distributed feature extraction tool based on the popular Hadoop MapReduce platform.¹

¹ The website of the Hadoop project is http://hadoop.apache.org/.

3.1.1 Webpage Features

To extract text features from webpages, it is important to consider the tag structure of the HTML source code. We use a webpage (see Figure 3.1) collected from Amazon to describe the features extracted from the HTML source. As we can see, HTML source code usually consists of a head part and a body part. First, inside the head part, we extract the title as well as two meta texts, i.e., description and keywords. These three text features are very important because they describe the theme (such as shopping in Amazon) and the content (e.g., Books, Music and Games) of the webpage. Second, we treat the body part as plain text by removing all the HTML tags. In this thesis, we do not consider the primary HTML tags such as heading (<h1> to <h6>), paragraph (<p>) and section (<div>), because in most websites these tags are oriented toward visualization rather than semantics [91]. However, we cannot simply discard anchor tags (<a>), because the text inside an anchor tag is usually very relevant to the pointed webpage [14]. Therefore, we also use the anchor text as a feature of the pointed webpage. In addition, we extract text from the URL of the webpage itself (not from the links inside the anchor tags), because the URL also contains useful information.² Thus in total, we use six text features for webpage classification. They are tabulated in Table 3.1.

Figure 3.1: A simplified HTML source code from Amazon home page.

Table 3.1: The six text features of a webpage. They are URL, title, description, keywords, body and the anchor text.

ID   Description
1    URL
2    title
3    description in meta tag
4    keywords in meta tag
5    body
6    anchor text from inbound hyperlinks

² For example, the link http://www.amazon.com/books-used-books-textbooks/... points to the book
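To make the in-page features of Table 3.1 concrete, the following is a minimal single-machine sketch in Python using only the standard library. It is an illustration under stated assumptions, not the tool described in this thesis: the class `PageFeatureParser`, the function `extract_in_page_features`, the URL tokenization rule and the sample page are all hypothetical. It returns features 1 to 5 for a page, plus the outbound (hyperlink, anchor text) pairs, which contribute to feature 6 of the pointed pages rather than of the page itself.

```python
import re
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collects in-page text features and outbound (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.features = {"title": "", "description": "", "keywords": "", "body": ""}
        self.outlinks = []        # anchor text belongs to the *pointed* page
        self._open = set()        # tracked tags that are currently open
        self._href = None
        self._anchor_words = []

    def _append(self, key, text):
        self.features[key] = (self.features[key] + " " + text).strip()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.features[name] = (attrs.get("content") or "").strip()
        elif tag == "a":
            if attrs.get("href"):
                self._href, self._anchor_words = attrs["href"], []
        elif tag in ("title", "body", "script", "style"):
            self._open.add(tag)

    def handle_endtag(self, tag):
        if tag == "a":
            if self._href is not None:
                self.outlinks.append((self._href, " ".join(self._anchor_words)))
                self._href = None
        else:
            self._open.discard(tag)

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self._open & {"script", "style"}:
            return                       # skip whitespace and non-visible text
        if self._href is not None:
            self._anchor_words.append(text)
        if "title" in self._open:
            self._append("title", text)
        elif "body" in self._open:
            self._append("body", text)   # all other tags are simply dropped


def extract_in_page_features(url, html):
    """Return (features 1-5 of this page, outbound anchor pairs)."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    features = dict(parser.features)
    # Assumed tokenization: lowercase alphanumeric runs of the page's own URL.
    features["url"] = " ".join(re.findall(r"[a-z0-9]+", url.lower()))
    return features, parser.outlinks


# Invented sample page, used only to exercise the sketch.
SAMPLE_HTML = """<html><head><title>Example Store: Online Shopping</title>
<meta name="description" content="Online shopping from a great selection.">
<meta name="keywords" content="Books, Music, Games">
</head><body><div><p>Welcome to the store.</p>
<a href="http://www.example.com/books">Used Books</a></div>
<script>var x = 1;</script></body></html>"""

features, outlinks = extract_in_page_features("http://www.example.com/store", SAMPLE_HTML)
```

Keeping the anchor pairs separate from the page's own features reflects the point made above: anchor text describes the pointed webpage, so it has to be regrouped by target URL in a later step.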

3.1.2 MapReduce-based Feature Extraction

A webpage dataset is usually very large, so implementing feature extraction efficiently is non-trivial. Simple sequential scanning over the entire webgraph could be too slow for a large-scale dataset. In this thesis, we build the feature extraction tool on the popular Hadoop MapReduce platform.

We briefly describe the MapReduce programming model. MapReduce is a parallel programming model for processing large-scale datasets. In MapReduce development, the complex low-level system programming, such as data communication, load balancing and fault tolerance, is taken over by the MapReduce platform. Developers only need to focus on implementing high-level algorithms using the simple map and reduce functions [32]. In a typical MapReduce program, the main computation procedure can be implemented as a series of data manipulations on key-value pairs. Specifically, the main program splits the problem into many small subproblems. For each subproblem, the MapReduce platform launches a map function that processes the subproblem and outputs intermediate results as a list of key-value pairs. When all the map functions are finished, the MapReduce platform reorganizes all the intermediate key-value pairs into value lists, one for each distinct key. For each value list, a reduce function is launched to process it and output a single key-value pair as the final result.
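The map, regroup and reduce steps described above can be imitated in a few lines of single-process Python. This is only a toy model of the data flow, assuming everything fits in memory; it is not Hadoop and has no parallelism, load balancing or fault tolerance. The names `run_mapreduce`, `count_map` and `count_reduce` are hypothetical, and token counting is used purely as a familiar example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, in-memory imitation of the MapReduce data flow."""
    # Map phase: each record (subproblem) yields intermediate (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Regrouping phase: collect the values of identical keys into one list per key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: each value list is reduced to a single (key, value) pair.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


def count_map(text):
    # Emit (token, 1) for every whitespace-separated token in one record.
    for token in text.lower().split():
        yield token, 1


def count_reduce(token, counts):
    # Merge the value list of one key into a single output pair.
    return token, sum(counts)


token_counts = run_mapreduce(["books music games", "used books"], count_map, count_reduce)
```

On a real platform the map calls run on different machines and the regrouping step moves data across the network, but the contract of the two user-supplied functions is the same as in this sketch.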

We implement the feature extraction tool based on the MapReduce programming model. Figure 3.2 shows an example of MapReduce pseudo-code for extracting anchor text. It contains a map function and a reduce function. The entire webpage dataset is split into individual webpages by URL. For each webpage, the map function extracts and emits pairs of (“hyperlink”, “anchor text”) from the HTML source code. After all the map functions are finished, the reduce function combines the list of anchor texts for a URL to form the final anchor feature for a webpage as (“hyperlink”, “merged anchor text”). The code to extract the other text features is simpler: each needs only a map function, which just extracts the in-page text.
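As a rough illustration of the anchor text job described above, here is a single-process Python sketch. It is not the pseudo-code of Figure 3.2 and not Hadoop code: the names `AnchorParser`, `map_anchor_text`, `reduce_anchor_text` and `extract_anchor_features` are hypothetical, resolving relative links with `urljoin` and merging anchor texts with a space are assumptions, and the in-memory dictionary stands in for the platform's regrouping of intermediate pairs.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects (href, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._words = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._words)))
            self._href = None


def map_anchor_text(page_url, html):
    """Map: emit (hyperlink, anchor text) for every anchor in one webpage."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    for href, text in parser.links:
        if text:
            # Assumption: relative links are resolved against the source page.
            yield urljoin(page_url, href), text


def reduce_anchor_text(target_url, anchor_texts):
    """Reduce: merge all anchor texts pointing at one URL into one feature."""
    return target_url, " ".join(anchor_texts)


def extract_anchor_features(pages):
    """Local stand-in for the job: pages maps each page URL to its HTML."""
    grouped = defaultdict(list)
    for page_url, html in pages.items():
        for target, text in map_anchor_text(page_url, html):
            grouped[target].append(text)
    return dict(reduce_anchor_text(t, texts) for t, texts in grouped.items())
```

The in-page features need no reduce step because each of them depends on a single webpage, whereas the anchor feature of a page is scattered over every page that links to it.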