The nature of the web is such that information of almost any type can exist, and this information can be connected in a multitude of ways; the web is therefore a complex, interconnected structure of great diversity. However, intuitive notions are used to add meaning to this structure, and the notion of a website is one of these. This common term describes a certain type of relationship that information can have on the web. Unfortunately, due to the nature of the web, the information contained within the boundary of a website is not explicitly described. This hinders the definition of the term, as there is no description of what should be included in or excluded from a particular website. Hence the website boundary detection (WBD) problem.
In this work, the WBD problem is defined as follows:
The problem of identifying all web pages/resources/media that are part of a single website
It is argued that, in a broader context, the WBD problem can be approached in two particular ways: (i) philosophically or (ii) practically. In the philosophical approach to the WBD problem the focus is on the meaning of the term website, its relationships and collaborative interactions. This has already been discussed extensively earlier in this chapter. In contrast to the philosophical debate, a practical approach to the WBD problem is taken in this thesis. Although the WBD problem appears in the information retrieval and machine learning literature [41, 98, 104], there has been no clear outline of the problems associated with the implementation of practical WBD solutions. Any practical WBD solution must consider:
2. How to practically detect the boundaries of such a website.
The first problem has already been addressed earlier in this chapter (see section 3.2). The second problem is addressed in the following two sub-sections, 3.3.1 and 3.3.2. Sub-section 3.3.1 outlines issues related to the practical implementation of a WBD solution. Sub-section 3.3.2 outlines an approach to the WBD problem with respect to the KDD process.
3.3.1 WBD Mining Issues
In this section the issues related to general web mining, as discussed in chapter 2.4.1, will be considered in terms of the WBD problem.
Volume: The amount of data on the web is large and increasing every second. The use of subsets or small parts of the whole web structure is an essential factor in the success of reducing the potential complexity of the WBD problem. In each case any assumptions or limitations that have been used to scope the WBD problem, in the context of the work described in this thesis, are highlighted.
Diversity: The web contains data in many formats, which makes it extremely diverse. To avoid the costly, resource-hungry techniques and processes needed to deal with this diverse data, this research has been limited to WWW pages expressed using only HTML/XHTML/XML. These formats were selected because they are the most commonly used. This work does not attempt to integrate other proprietary formats, or to deal with the parsing or feature extraction associated with video or multimedia data.
Semi-Structured data: The retrieval of information from a semi-structured data source can prove to be extremely challenging. The research described in this thesis calls upon some standard HTML parsing libraries to help extract relevant attributes from the acquired web data. The strategies and steps involved in feature extraction are given in more detail in chapter 5.
Authority: This issue does not directly concern the work in this thesis. The only assumptions made are that: (1) the data gathered is provided by the host as required, and is not adversely affected directly by spam content, and (2) that the content to be analysed is not malicious in any way.
Noise: An inevitable aspect of KDD is that some amount of noise data will be acquired and must be processed accordingly. This is especially true when dealing with data from the web. This aspect is directly addressed in each of the techniques described in this thesis. A successful WBD solution will distinguish noise data from the target data sufficiently well, which in turn will help to achieve a highly accurate WBD.
Dynamic and Distributed: The dynamic and distributed nature of the web can make it a problematic data source for many time-critical applications. To avoid the issues raised by the ever-changing location and content of data on the web, this research uses snapshots of web content. It is common to limit a web crawl by domain or sub-domain to some degree; however, such a restriction would stifle the gathering of content that is distributed over various domains but is actually part of the same website. For this reason, the web crawls used to create the RDG data sets (section 4.3.2) are unrestricted: when data is gathered, the crawl is not limited by URL, domain or sub-domain.
Virtual Society: The interactions of users of the web are considered very important in the research presented in this thesis with respect to the proposed website definition. The key aspect of the research in this thesis is to exploit the user's intention, which is encoded into related web documents of the same website. This aspect is reflected in the proposed definition that was given in section 3.2.
Persistence: The ever-changing nature of the web is addressed as described above under "Dynamic and Distributed". A snapshot of the web is used to achieve the desired website boundary. The assumption is made that this data remains fixed for the duration of the evaluation of the approaches presented in this thesis. To use "live" web data would cause issues if content changed during the WBD resolution process. These issues are well documented and are an active area of research [121, 108, 180, 91, 177].
Deep web: In this research only the surface web is considered when creating the data snapshots. The term deep web is used to describe content that is not directly accessible using a normal web browsing pattern. Access to this content may require forms to be completed, or other interactions that are far beyond the scope of a standard web crawler and, consequently, of this research.
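The effect of the crawl restriction discussed under "Dynamic and Distributed" can be illustrated with a minimal Python sketch. The link graph, the URLs and the `crawl` function are hypothetical and serve only as an illustration; they are not the crawler used in this work. The sketch walks a fixed snapshot breadth-first up to a given depth, and shows how a host-restricted crawl can miss a page that is served from another domain but may still belong to the same website.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical snapshot: each URL maps to the URLs it links to.
SNAPSHOT = {
    "http://dept.example.org/": [
        "http://dept.example.org/staff",
        "http://media.example.net/dept/video",
    ],
    "http://dept.example.org/staff": ["http://other.example.com/"],
    "http://media.example.net/dept/video": [],
    "http://other.example.com/": [],
}


def crawl(seed, max_depth, same_host_only=False):
    """Breadth-first walk of SNAPSHOT from seed, at most max_depth link hops."""
    seed_host = urlparse(seed).netloc
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in SNAPSHOT.get(url, []):
            if target in seen:
                continue
            if same_host_only and urlparse(target).netloc != seed_host:
                continue  # a host-restricted crawl drops off-host pages
            seen.add(target)
            queue.append((target, depth + 1))
    return seen
```

In this toy graph the restricted crawl never reaches the page on `media.example.net`, while the unrestricted crawl gathers it together with a noise page from `other.example.com`. Separating such noise from target pages is left to the later stages of the process.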
3.3.2 WBD Solutions Derived Using the KDD Process Model
The WBD problem shares many similarities with the general Knowledge Discovery in Databases (KDD) problem, which is important with respect to the research discussed in this thesis. In chapter 2.3 the KDD problem was presented in terms of a six-step process. The WBD problem can also be described using this six-step process, as shown in Figure 3.1. This sub-section discusses the WBD problem with respect to the process presented in Figure 3.1.
It is important to note that, as in the general KDD process model, the proposed WBD process model can also be an iterative process. This means the steps can be repeated. This is an important point with respect to the static and dynamic approaches to WBD considered in the later chapters of this thesis (static in chapter 5 and dynamic in chapter 7).
Figure 3.1: The WBD problem as modelled using the KDD process based on the Process-Centred view model by [77]
The task of identifying the complete boundary of a target website commences with some seed page(s) from the website of interest. Typically this is the home page or entry page. This step is related to understanding the selection criteria for a WBD problem given such a seed page. The WBD solution generation process then continues as follows:
1. Selection: A proposed WBD approach commences by obtaining a collection ("snapshot") of web pages by crawling a portion of the web from the given seed page. An ideal WBD approach would gather a snapshot containing only pages from the target website, thus producing a website boundary solution. In practice, however, the snapshot comprises both "target" pages and unavoidable "noise" pages.
In this thesis, two main approaches are used, static and dynamic. In the static context, a snapshot needs to be gathered before any website boundaries can be detected, while in the dynamic context, this is done as the crawl proceeds. The web crawl aims to do two things: (1) gather all target content from the website, and (2) gather as little noise content as possible. The crawl must be wide and deep enough so as to cover at least the target website pages. There are two main techniques that are considered in this thesis:
1. Semi-Automated: This is a web crawling process guided by some initial input from a user. The process is conducted using some limit to control the web crawling. This process is used in chapters 5 and 6.
2. Automated: A fully automated crawl produces a snapshot of the web based on some heuristic approach. The crawling process is revised as the crawl progresses. This process is used in chapter 7.
The web crawling adopted in this work uses both the automated and semi-automated approaches. The static work in chapter 5 uses a semi-automated web crawl, which requires a user depth input before the data acquisition begins. The dynamic work in chapter 7 uses an automated random web crawling technique to acquire the data from the web.
2. Pre-processing: During the pre-processing stage the aim is to make sure the web page content is valid; in other words, that the content comprises a well-formed document structure suitable for parsing and information extraction. A well-formed document has no missing tags, and includes elements that support feature extraction. If pages are not well formed (which is often the case with data from the web), the web pages are "cleaned" and output in a suitable format for feature extraction. This is done using a library that can handle ill-formed web content and output corresponding valid content. Examples of such libraries include HTML parser¹ and HTML Tidy².
3. Transformation: Using the pre-processed web content, which has been validated in the previous step, information extraction can take place. The features that are considered in this work are described in section 5.2, and are modelled using the vector space model (see section 3.4).
4. Data Mining: The data mining step is the core element of the entire proposed process. The clustering algorithms that are considered in this work were detailed in chapter 2.6.1. The clustering algorithms are used to produce clusters of related pages. The pages in the same cluster as the seed page are deemed to represent the target website. The remaining clusters are then identified as noise. The target cluster reflects the relationship that is encoded in the content of the website, which in turn is a reflection of the definition of what a website is as derived in this thesis.
5. Interpretation/Evaluation: The website boundary patterns that are discovered in the previous step are interpreted as a solution to the WBD problem. This may be evaluated against labels representing the ground truth of the specific WBD problem. Details on how solutions may be evaluated, in an experimental setting, are given in section 3.4.3.
6. Knowledge: The discovered website boundaries can be used as new knowledge about a particular WBD problem. This new knowledge can be used as a basis for the applications presented in section 1.1.
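Steps 2 and 3 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the standard-library `html.parser` module stands in for the HTML parser and HTML Tidy libraries named above, and the chosen features (word counts and hyperlink targets) are hypothetical, since the actual features are those described in section 5.2. The class `FeatureExtractor` and the functions `page_vector` and `cosine` are illustrative names.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Collects words and link targets; tolerates unclosed tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Lower-case alphanumeric tokens; a deliberately crude tokeniser.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def page_vector(html):
    """Return (term-frequency vector, list of link targets) for one page."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return Counter(parser.words), parser.links


def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

For example, `page_vector("<p>Staff list<a href='/staff'>staff")` still yields a vector and a link list even though no tag is closed, which is the property required of the pre-processing step. Each page then becomes a point in the vector space, and `cosine` gives one possible similarity between two pages.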
1HTML parser: http://htmlparser.sourceforge.net/ 2
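Steps 4 and 5 above can be sketched in the same spirit. The grouping below is a simple single-link threshold clustering, used only as a stand-in for the clustering algorithms of chapter 2.6.1; the Jaccard similarity over feature sets, the threshold, the page names and the precision/recall/F1 measures are likewise assumptions made for illustration, and are not necessarily those of section 3.4.3.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(features, similarity, threshold):
    """Single-link grouping: two pages are joined if similarity >= threshold."""
    parent = {page: page for page in features}

    def find(page):
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path halving
            page = parent[page]
        return page

    pages = list(features)
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            if similarity(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for page in pages:
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())


def website_boundary(clusters, seed):
    """The cluster containing the seed page is taken as the target website."""
    return next(cluster for cluster in clusters if seed in cluster)


def precision_recall_f1(predicted, truth):
    """Compare a predicted boundary with ground-truth labels."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)


# Hypothetical pages described by small feature sets.
PAGES = {
    "home": {"dept", "staff", "research"},
    "staff": {"dept", "staff", "people"},
    "news": {"dept", "research", "news"},
    "shop": {"buy", "cart", "sale"},
    "ads": {"buy", "sale", "offer"},
}
```

With a threshold of 0.4, `cluster_pages(PAGES, jaccard, 0.4)` groups `home`, `staff` and `news` together, and `website_boundary` returns that group for the seed `home`; the remaining pages are treated as noise. Comparing the boundary with a labelled set then gives a simple measure of how well the target website was recovered.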
At this point it is important to highlight the fact that the entire process can be executed non-sequentially. This is referred to as an alternative selection method. Recall from the explanation of the KDD process that the process could be iterative, and did not have to be executed sequentially. Steps 1 (Selection), 2 (Pre-processing), 3 (Transformation) and 4 (Data Mining) can all be performed either sequentially (as in the case of the proposed static solution to the WBD problem discussed in chapter 5) or iteratively (as in the case of the proposed dynamic solution to the WBD problem discussed in chapter 7).
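The difference between sequential and iterative execution can be made concrete with a short Python sketch. The functions `run_static` and `run_dynamic` and their parameters are hypothetical; the steps are passed in as plain callables, so the sketch shows only the control flow and says nothing about how the approaches in chapters 5 and 7 actually implement each step.

```python
def run_static(seed, select, preprocess, transform, mine):
    """Sequential execution: the whole snapshot is gathered before mining."""
    snapshot = select(seed)
    cleaned = [preprocess(page) for page in snapshot]
    features = [transform(page) for page in cleaned]
    return mine(features)


def run_dynamic(seed, select_next, preprocess, transform, mine, max_rounds):
    """Iterative execution: selection and mining alternate as the crawl proceeds."""
    features = []
    result = None
    for _ in range(max_rounds):
        # Selection may take the current mining result into account.
        batch = select_next(seed, result)
        if not batch:
            break  # nothing left to gather
        features.extend(transform(preprocess(page)) for page in batch)
        result = mine(features)
    return result
```

In the static form, each step runs once over the complete snapshot. In the dynamic form, every round gathers a further batch of pages and re-runs the mining step, so an intermediate result is available while the crawl is still in progress and can guide what is selected next.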