The aim of this research was to investigate the automation of web extraction for training course information. This is a very challenging task, considering the complexity of capturing information from a large, inconsistent and ever-changing source such as the Web. This complexity, coupled with the limited timeframe available for the EngD project, means that there are certain limitations associated with this research, in addition to the many contributions and benefits to the sponsoring company and the research community, as discussed below.
6.2.1 CONTRIBUTION TO EXISTING THEORY AND PRACTICE
This research has proven successful not only towards helping the sponsoring company, but also towards providing innovative and constructive ideas to the research community. Some of these ideas have been published in respectable, international conferences and journals (see Appendix F to J).
From a technical point of view, contributions of this research include:
CRAWLER Filtering: Unlike standard crawlers, the CRAWLER in this research is responsible not only for locating and retrieving the web addresses of various web pages (and everything else that this task entails), but also for excluding a large number of irrelevant web pages from the process. This was achieved through the use of the NB approach, as explained in section 4.3.1. This additional functionality allows the CRAWLER to filter out a large number of irrelevant web pages before the CLASSIFIER is even underway. Section 5.4.1 lists the benefits of this approach.
CLASSIFIER Training: Many researchers have expressed their confidence in the simplicity and merit of Naïve Bayes in the classification of attribute data. The CLASSIFIER in this research upholds this confidence by exceeding expectations in the classification of web pages. A contribution from this part of the system, however, comes from the fact that the CLASSIFIER outperforms rival techniques despite being trained with fewer web pages than any other classification system encountered during the research (Xhemali et al., 2009). The normal approach for classification systems is to train the classifier with 70-80% of the data and test it on the remaining 20-30%. In this research, the reverse is the case: the CLASSIFIER is trained with around 20% of the data and tested on the remaining 80%. The fact that the CLASSIFIER succeeds even with such minimal training shows the great potential of this approach.
Genotype-Phenotype Mapping: This research has presented a novel approach to mapping genotypes to phenotypes during the genetic evolution of REs, in which carefully written XML rules are fed to the system as a separate file. Because the rules exist as a separate file, the file can be replaced to work on a different domain without disrupting the rest of the GP system (explained in section 4.4.2.4). This was demonstrated when the genotype-phenotype mapping approach was adapted, with minimal effort, to manage the evolution of Software Statements and Structures and Complete Software Programs (Xhemali et al., 2010-a; Xhemali et al., 2010-c). Furthermore, the use of XML brings improved readability, compatibility with many programming languages, portability and extensibility (XML is not restricted to a limited set of keywords defined by proprietary vendors, which aids the process of creating rules of different levels of complexity).
Initial Population Generation: In this research, a novel combination of two known approaches was used: randomly generating the individuals of the initial population, and seeding the initial population with known solutions. In the combined approach, the first gene of each genotype is seeded with existing solutions in that category, whereas the remaining genes of each genotype are generated randomly. For example, if the system needs to evolve REs to extract course titles, then the first genes of the top ten RE genotypes that were successfully evolved to extract titles are used as the first genes of the initial population. This approach allows some initial knowledge to be injected into the evolutionary process, which means faster execution with little compromise of the search space.
Learning after each Genetic Generation: This is another novel approach used in this research, specifically during the evolution of REs for the extraction of training course titles. As discussed in chapter 4, the NB approach was used to aid the fitness scoring of each evolved title RE. This in itself is not a novel idea; however, the way it is applied in this research makes it innovative and successful. Specifically, information obtained by the NB part of the system is passed from generation to generation, independently of the genetic evolution process, which means that each generation has additional knowledge to help it evolve good offspring faster. Chapter 5 presented the experiments carried out to demonstrate the success of this approach.
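Two of the ideas above, the XML-driven genotype-phenotype mapping and the partially seeded initial population, can be sketched together as follows. This is a minimal illustration under stated assumptions, not the implementation described in chapter 4: the rule file layout, the integer gene encoding and the names `RULES_XML`, `load_rules`, `to_phenotype` and `initial_population` are all hypothetical.

```python
import random
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file: each <rule> maps an integer gene to an RE fragment.
# Swapping this text for another rule set changes the domain without
# touching the functions below.
RULES_XML = r"""
<rules>
  <rule gene="0" fragment="[A-Z][a-z]+"/>
  <rule gene="1" fragment="\s"/>
  <rule gene="2" fragment="\d+"/>
  <rule gene="3" fragment="(Course|Training)"/>
</rules>
"""


def load_rules(xml_text):
    """Parse the rule file into a dict of gene id -> RE fragment."""
    root = ET.fromstring(xml_text.strip())
    return {int(r.get("gene")): r.get("fragment") for r in root.iter("rule")}


def to_phenotype(genotype, rules):
    """Map a genotype (list of gene ids) to an RE string (the phenotype)."""
    return "".join(rules[gene] for gene in genotype)


def initial_population(size, length, seed_first_genes, rules, rng):
    """Seed the first gene of each genotype from known solutions and
    generate the remaining genes at random."""
    gene_ids = sorted(rules)
    population = []
    for i in range(size):
        first = seed_first_genes[i % len(seed_first_genes)]
        rest = [rng.choice(gene_ids) for _ in range(length - 1)]
        population.append([first] + rest)
    return population


if __name__ == "__main__":
    rules = load_rules(RULES_XML)
    # Assumed first genes of previously successful title REs.
    seeds = [3, 0]
    for genotype in initial_population(4, 3, seeds, rules, random.Random()):
        pattern = to_phenotype(genotype, rules)
        re.compile(pattern)  # every phenotype here is a valid RE
        print(genotype, "->", pattern)
```

Keeping the rules in a separate XML document is what makes the mapping replaceable: only `RULES_XML` would change for a new domain, while the seeding logic restricts just the first gene and leaves the rest of the search space open.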
From a more practical point of view, this research offers the following contributions:
Domain: To the best of our knowledge, no other work has concentrated on providing automated WIE solutions for the Training Courses Domain. Furthermore, as mentioned in section 2.4.6, very few researchers have investigated evolving REs for the purpose of WIE, so this research addresses a largely unexplored area.
Sponsoring Company: Unlike conventional PhD research, this research was designed to meet the needs of an existing company, by managing a genuine commercial problem in a real business environment. All the achievements in this research (summarised in chapter 5), have contributed towards improving the business processes at the sponsoring company and having an impact on the wider industry. The following subsections focus on the impacts on the sponsor and the wider industry.
6.2.2 IMPLICATIONS/IMPACT ON THE SPONSOR
ATM's main objective for sponsoring this research was to reduce the time wasted at the company, searching for and manually dealing with online training course information. This EngD project has made a positive contribution to the sponsoring company, as it has provided not only a solution for finding and storing the relevant websites that may contain training course information, but also a solution that permeates each website and discovers specific information related to the different courses advertised.
It was never the intention of this research to replace the work and expertise of the advisors at ATM. Instead, the final system was intended to act as a helpful guide to the advisors, leading them in the right direction towards finding the course information required. Section 1.4 showed the hectic and inefficient ways of dealing with online courses at ATM prior to this research, where advisors had to actively search the Web each time training course options were needed. The system developed in this research ensures that the advisors' first point of contact is now an always-up-to-date database. The database provides the initial information related to a course, such as the course title, price, date and location, which is enough for the advisors to determine whether or not that course could be suitable for a specific client. The database also provides the exact web location that contains more information on each course. The system makes a positive contribution to ATM even when it fails to extract any information from a web page, or only manages to extract partial information from it. This is because the system has still alerted the advisors to the existence of such web pages and the strong possibility of there being training course information in these pages. This corresponds to two thirds (≈66%) of the work that advisors would have had to accomplish themselves prior to this research. The additional training providers identified by the system are also useful for ensuring that the best deals, in terms of course quality and price, are made by ATM's advisors, which in turn can translate to potential financial benefits for the company.
Another advantage of this research for ATM is that the developed system requires minimal user involvement. In fact, the only involvement required is during the training stages of the CLASSIFIER and the GP part of the system that deals with the extraction of course titles. This is because in both cases, the Naïve Bayes approach is used to aid the decision making process and this approach requires training for it to work efficiently.
The training to be provided requires no knowledge of any kind from the user. The user is simply required to select a number of web pages, from a list of pages discovered by the CRAWLER, and use their human judgement to separate these into pages that do or do not contain training course information. A pre-prepared script can then be executed to prepare the system and the database for the training process, using the above information.
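The labelling step described above can be illustrated with a small Naïve Bayes sketch that is trained on roughly 20% of the labelled pages and tested on the remaining 80%. This is only an illustration under stated assumptions: the page texts are invented toy data, the word-count features and Laplace smoothing are choices made for this example, and the names `NaiveBayes`, `split_pages` and `accuracy` are hypothetical rather than taken from the system.

```python
import math
from collections import Counter

# Toy labelled pages (invented): True = contains training course information.
# The first two entries, one of each class, end up in the training split.
LABELLED_PAGES = [
    ("training course dates price book", True),
    ("company news blog contact careers", False),
    ("course price and dates", True),
    ("latest news and careers", False),
    ("book a training course today", True),
    ("contact the company", False),
    ("training dates for managers", True),
    ("our blog and news", False),
    ("price list for each course", True),
    ("careers at our company", False),
]


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training pages
        self.vocab = set()

    def train(self, pages):
        for text, label in pages:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def split_pages(pages, train_fraction):
    """Use the first train_fraction of pages for training, the rest for testing."""
    n_train = max(1, round(len(pages) * train_fraction))
    return pages[:n_train], pages[n_train:]


def accuracy(classifier, pages):
    correct = sum(classifier.classify(text) == label for text, label in pages)
    return correct / len(pages)


if __name__ == "__main__":
    train_set, test_set = split_pages(LABELLED_PAGES, 0.2)
    clf = NaiveBayes()
    clf.train(train_set)
    print("test accuracy:", accuracy(clf, test_set))
```

The only input the user supplies in this sketch is the True/False judgement on each selected page, which mirrors the claim that training relies on human judgement alone.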
The system has also been designed to allow for growth at ATM. If the areas of interest to the business were expanded, only a few changes would be required:
New seed URLs for the CRAWLER
Re-training of the CLASSIFIER with web pages from the new domain
Additional XML rules to manage the new areas of interest
Potential changes to the Fitness Test, depending on the extent of dissimilarities between the different domains.
6.2.3 IMPLICATIONS/IMPACT ON THE WIDER INDUSTRY
Organisations are cautious when adopting new systems. This can be attributed to the significant change required by some off-the-shelf and most bespoke systems in the organisations' infrastructure and existing technologies. Considerable efforts have been made in this research to integrate the system developed with existing systems and work practices at ATM, without disrupting any of the processes already in place. This would also be the case for any other organisation; in fact the only part of the system that would need to be amended, for it to integrate with existing systems and work practices in other organisations, would be the server details where the database is stored, so the system can connect to it.
The wider industry would benefit from the implementation of the system in this research in ways similar to those of the industrial sponsor. Other training brokerages could use this system immediately without any changes. Other organisations that are interested in extracting information from the Web, whatever this information may be, would be able to use the system after a few changes, as listed in section 6.2.2.
An advantage of the system in this research is that it is created from three separate components: CRAWLER, CLASSIFIER and EXTRACTOR, which can be treated as stand-alone applications if necessary, and as such could be helpful to an even larger number of organisations. For example, the CRAWLER can be used by anyone as long as the system is seeded with a few initial web pages from the domain of interest; similarly, the CLASSIFIER can be used by any industry as long as it is trained with a set of positive and negative examples from the domain of interest; the GP part of the system can be used for a variety of domains. Xhemali et al. (2010-a; 2010-c) proved that this is possible even for domains that are radically different from that of REs, such as Software Statements and Structures.
An initial market research study conducted by ATM, with assistance from Atos Origin Ltd., has indicated that there is great demand for systems like the one developed in this research. This is because people and organisations will always require information, and thus there will always be a need for tools that help them locate relevant sources of knowledge efficiently and effectively. The ability to take this further, and extract specific pieces of information into a database, makes this project even more appealing to the wider industry.