This chapter presented two graph partitioning approaches for generating solutions to the WBD problem. The first was a hierarchical graph partitioning method based on Newman's approach. This method partitions an input graph into clusters according to a modularity value, producing clusters of high interconnectivity. The second approach was based on the max-flow min-cut theorem and finds the minimum cut within a web graph when the graph is represented as a flow network. This method segments the graph into two dense clusters while minimising the edges cut between the clusters.
For each approach two sets of experiments were conducted. In the first set, both methods were applied in an intuitive manner. The Newman method was simply applied to complete data set snapshots, while the minimum cut method used the given seed web page $w_s$ as the source of flow and iterated the sink over the remaining vertices. The results of these experiments indicated that the methods, applied in this way, were unable to produce good quality WBD solutions.
In the second set of experiments, variations of both approaches were considered. The Newman method was applied to snapshots of the data sets of varying size, of the form that may be produced using different prior web crawl strategies. The minimum cut method was iterated over all combinations of source $s$ and sink $t$. The results of these further experiments, for both the minimum cut and Newman approaches, showed an improvement in the WBD solutions over those produced by the first set of experiments. These results thus indicated that both approaches have potential in terms of producing high quality WBD solutions. The results also demonstrated that the connectivity between the vertices of a single website is in fact encoded in the underlying web graph structure, and that this structural characteristic has the potential to be exploited using the methods presented in this chapter. Finally, it should be noted that the WBD approaches presented in this chapter exploited only the hyperlink structure when producing WBD solutions. The previous chapter, Chapter 5, exploited only content structural elements with respect to the WBD problem. The following chapter, Chapter 7, concentrates on both the content and the hyperlink structure of web pages when producing WBD solutions.
Chapter 7
Dynamic Techniques
This chapter presents the investigation of the WBD problem in the dynamic context. In the dynamic context the web data is not fully available prior to the start of analysis (as previously explained in section 3.4.2). The approaches presented in this chapter used various graph traversal techniques to gather portions of data, which were then clustered incrementally in order to produce a WBD solution using only partial data.
The approaches in chapters 5 and 6 described solutions to the WBD problem in a static context. The static approach operated using a three-phase process: (1) collect data, (2) pre-process data and (3) produce a WBD solution. In the static context these phases are performed in sequence; once a phase has been completed it is not repeated. The dynamic approaches described in this chapter use the same three phases, but apply them repeatedly. The repetition allows a portion of web data to be gathered and pre-processed, and a WBD solution to be produced from this portion. In the dynamic approach the web page data is gathered by traversing the web graph using the hyperlink structure. The web pages are then pre-processed and a feature representation is created for each page. The pages are clustered incrementally as they are traversed, and a website boundary is then identified based on the clusters produced. A main focus of this chapter is an investigation of the “power” of the random walk based method of graph traversal with respect to WBD performance, particularly in settings where the amount of data is large and not immediately available.
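The repeated three-phase process can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration rather than the method evaluated in this chapter: the toy web `TOY_WEB`, the bag-of-words features, the cosine similarity threshold, and the simple leader-style incremental clustering are all hypothetical stand-ins. The traversal is a plain random walk over out-links, starting from the seed page.

```python
import random
from collections import Counter

# Hypothetical toy web standing in for live page fetching:
# page id -> (page text, outgoing hyperlinks).
TOY_WEB = {
    "a/home": ("dept news staff", ["a/staff", "a/news", "b/home"]),
    "a/staff": ("dept staff list", ["a/home", "a/news"]),
    "a/news": ("dept news events", ["a/home", "a/staff"]),
    "b/home": ("shop cart offers", ["b/cart", "a/home"]),
    "b/cart": ("shop cart checkout", ["b/home"]),
}


def fetch(page):
    """Phase 1 (collect): return the text and out-links of a single page."""
    return TOY_WEB[page]


def featurise(text):
    """Phase 2 (pre-process): build a bag-of-words feature representation."""
    return Counter(text.split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = sum(c * c for c in u.values()) ** 0.5
    norm_v = sum(c * c for c in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign(page, vec, clusters, threshold):
    """Phase 3 (cluster): add the page to its most similar cluster, or start a new one."""
    best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
    if best is not None and cosine(vec, best["centroid"]) >= threshold:
        best["pages"].add(page)
        best["centroid"].update(vec)  # centroid kept as a running sum of member vectors
    else:
        clusters.append({"centroid": Counter(vec), "pages": {page}})


def dynamic_wbd(seed, steps, threshold=0.3, rng=None):
    """Random-walk from the seed for `steps` >= 1 steps, clustering pages as they are met."""
    rng = rng or random.Random(0)
    seen = set()
    clusters = []
    current = seed
    for _ in range(steps):
        text, links = fetch(current)          # gather a portion of data
        if current not in seen:
            seen.add(current)
            assign(current, featurise(text), clusters, threshold)
        # Follow a random out-link; restart at the seed on a dead end.
        current = rng.choice(links) if links else seed
    # The boundary is taken to be the cluster that contains the seed page.
    return next(c["pages"] for c in clusters if seed in c["pages"])
```

Calling `dynamic_wbd("a/home", steps=200)` returns the set of visited pages grouped with the seed. In this toy data the two sites share no terms, so pages under `b/` never join the seed's cluster; real pages would not separate so cleanly.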
The evaluation of the dynamic approaches presented in this chapter was performed using the three categories of data sets previously introduced in chapter 4: (1) Binomial Random Graphs (BRG), (2) Artificial Data Graphs (ADG), and (3) Real Data Graphs (RDG). Recall that the BRGs are a very simplistic model of the web graph with respect to the WBD problem, and were used to test preliminary dynamic approaches to the WBD problem. The ADGs modelled the web using a more sophisticated method based on the preferential attachment model. The evaluation using ADGs allowed the dynamic approaches to produce WBD solutions in a controlled environment. The final data category used for evaluation of the dynamic approaches described in this chapter consisted of RDGs, which were used previously as reported in Chapters 5 and 6.
The remainder of this chapter is organised as follows. In section 7.1 a formal description of the WBD problem in the dynamic context is given. Details of the dynamic approach with respect to the WBD problem are presented in section 7.2. Two main methodologies are used in the dynamic approach, a graph traversal method and an incremental clustering method, which are presented in sections 7.3 and 7.4 respectively. Section 7.5 details methods of graph representation which can be used to weight or add edges to a graph. Section 7.6 presents some issues and characteristics of the dynamic approach. Section 7.7 presents an evaluation of the dynamic approaches with respect to WBD performance, followed by an evaluation summary in section 7.8. Finally, section 7.9 presents a conclusion to the work undertaken in this chapter.
7.1 Formal Description
In this section a formal description of the WBD problem in the dynamic context is presented. Recall the formal description of the general WBD problem presented in section 3.4. Given a collection of web pages $W$, comprising $n$ individual pages $w$, such that $W = \{w_1, w_2, \ldots, w_n\}$, where the seed page is $w_s$, the website boundary ($\omega$) is said to be the bounded subset of pages in $W$ that form the website given by $w_s$. Each of the individual web pages can be described using a numerical vector of length $m$, $V = \{v_1, v_2, \ldots, v_m\}$. This set of pages can be modelled using a graph $G = (W, E)$, where $W$ is a collection of web pages (as noted above), and the set $E$ keeps track of all directed (hyper) links between pairs of elements of $W$. The key characteristic of the dynamic context is that the graph $G = (W, E)$, and hence the sets $W$ and $E$, are not known a priori. Thus at commencement of the dynamic approach to identify a solution to the WBD problem all that is known is the seed page $w_s$ and its associated directed edges. The following section, 7.2, presents the dynamic approach to the WBD problem with respect to the formal description presented in this section.
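The state of knowledge described above, where only $w_s$ and its directed edges are known at the start and further parts of $W$ and $E$ are revealed as pages are visited, can be pictured with a small data structure. This is a sketch under stated assumptions: the class name `PartialWebGraph`, the `out_links` callable and the `expand`/`frontier` methods are hypothetical names introduced here, and the targets of known edges are assumed to be known as page identifiers only (their content and out-links remain unknown until they are expanded).

```python
class PartialWebGraph:
    """The portion of G = (W, E) that has been revealed so far."""

    def __init__(self, seed, out_links):
        # `out_links` is a hypothetical callable: page id -> list of linked page ids.
        # It stands in for whatever mechanism retrieves a page's hyperlinks.
        self._out_links = out_links
        self.seed = seed
        self.pages = {seed}    # known subset of W
        self.edges = set()     # known subset of E, as directed (source, target) pairs
        self.expanded = set()  # pages whose out-links have been retrieved
        # At commencement only the seed and its directed edges are known.
        self.expand(seed)

    def expand(self, page):
        """Reveal the directed edges leaving `page`; return newly discovered pages."""
        if page in self.expanded:
            return []
        discovered = []
        for target in self._out_links(page):
            self.edges.add((page, target))
            if target not in self.pages:
                self.pages.add(target)
                discovered.append(target)
        self.expanded.add(page)
        return discovered

    def frontier(self):
        """Pages known only as link targets, whose own edges are still unknown."""
        return self.pages - self.expanded
```

A traversal strategy then amounts to a rule for choosing which frontier page to expand next, and the known sets `pages` and `edges` grow towards $W$ and $E$ without ever being required in full.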