
As mentioned earlier, the web has grown in scope as well as in content in recent years. Search engines on the web have also evolved, but there are still a few challenges that need to be addressed.

Search engine evolution

Search engines have gone through significant changes in recent years. In his paper, Broder identifies three stages in the evolution of web search engines [39]:

• First generation search engines. These were the most advanced search engines around 1995-1997 and primarily used text and formatting information in HTML documents to drive searches. It is interesting to note that this approach is very close to the classical IR model presented earlier in the chapter, and that these engines were designed to support mostly informational queries.

• Second generation search engines. The next generation of web search engines emerged around 1998-1999. The focus shifted to incorporating the inherently structural aspects of the web into the search strategy (such as link analysis, anchor text and click-throughs). This allowed search engines to support navigational queries as well as informational queries.

Most search engines today use content as well as structural information to direct searches.

• Third generation search engines. These are the next generation of search engines and are slowly starting to emerge on the web. They try to combine information from multiple sources in an attempt to discover (or guess) the information need behind a specific query. Third generation engines use techniques such as semantic analysis, context discovery and dynamic database selection to expand the previously fixed corpus of documents considered by a query. The aim is to support informational, navigational and transactional queries.
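The shift from the first to the second generation can be illustrated with a minimal sketch. It assumes a toy in-memory corpus and a deliberately simple scoring rule (term counts plus a weighted in-link count); the names `PAGES`, `content_score`, `inlink_counts`, `rank` and the `link_weight` parameter are hypothetical and are not taken from Broder's paper or from any real engine.

```python
from collections import Counter

# Hypothetical toy corpus: page id -> (page text, outgoing links).
PAGES = {
    "a": ("web search engines index web pages", ["b"]),
    "b": ("search engines rank pages using links", ["c"]),
    "c": ("cooking recipes and kitchen tips", ["b"]),
}


def content_score(query, text):
    """First generation style: count occurrences of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def inlink_counts(pages):
    """Structural signal: number of pages in the corpus that link to each page."""
    counts = {page_id: 0 for page_id in pages}
    for _text, links in pages.values():
        for target in links:
            if target in counts:
                counts[target] += 1
    return counts


def rank(query, pages, link_weight=0.5):
    """Second generation style: content score plus a weighted in-link count.

    The additive combination and the default weight are assumptions made
    for illustration only.
    """
    inlinks = inlink_counts(pages)
    scores = {}
    for page_id, (text, _links) in pages.items():
        content = content_score(query, text)
        if content > 0:  # only pages that match the query are candidates
            scores[page_id] = content + link_weight * inlinks[page_id]
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank("search engines", PAGES))
```

With `link_weight=0` the ranking depends on text alone, which corresponds to the first generation behaviour described above; a positive weight lets the link structure break ties between pages whose text matches equally well.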

Although the evolution described above is similar for all search engines on the web, the approaches and techniques they use to search the web are quite different. These approaches are briefly summarized below.

Search engine approaches

A myriad of search engines currently exist on the web, each with a different approach and strategy for searching the web. In their paper, Ellman and Tait define four popular approaches that web search engines follow [43]:

• Robot based engines. As one of the more popular approaches, robot engines try to exhaustively reference all documents on the web and create a publicly accessible index from them. One of the major problems with this approach is the rapid growth of the web: it is questionable whether engines using this approach will be able to keep up with it.

Recent studies like the one by Lawrence and Giles seem to support this [36]. Examples of search engines with a robot approach are: WEBCRAWLER, InfoSeek and LYCOS [43].

• Assisted IR engines. These engines rely on human assistance to effectively index a website.

Websites are manually indexed by a human indexer before they can be searched. Assisted IR engines are typically superb technical search engines but lack the coverage of automated approaches. This makes them neither a popular nor an effective means of searching the web.

Examples of assisted IR engines are: INQUERY, WAIS, SWISH and HARVEST [43].

• Assisted object-oriented databases (OODBs). This is a particularly interesting approach in which content providers indicate their most important pages and provide keywords to classify them. This is intended to ensure the completeness and accuracy of the index the engine maintains. The most famous search engine that utilizes this approach is yet another hierarchical object-oriented database (YAHOO). An obvious drawback is that the approach can only succeed with the full cooperation of the content authors, who must provide truthful and accurate descriptions of their site(s).

• Web information retrieval agents. This approach utilizes intelligent agents for the automation and assistance of the web searching task. This approach to searching the web will be explored later in this dissertation (see chapter 6).
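To make the robot based approach from the list above more concrete, the following minimal sketch crawls a small in-memory "web" breadth-first and builds an inverted index from the pages it reaches. Everything in it is an assumption for illustration: the `WEB` dictionary stands in for real HTTP fetching, the example.org URLs are placeholders, and `crawl`, `build_index`, `search` and `max_pages` are hypothetical names rather than the design of any engine named in this section.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.org/": (
        "welcome to the example site",
        ["http://example.org/a", "http://example.org/b"],
    ),
    "http://example.org/a": (
        "search engines crawl the web",
        ["http://example.org/"],
    ),
    "http://example.org/b": (
        "robots build a searchable index",
        ["http://example.org/a", "http://example.org/missing"],
    ),
}


def crawl(seed, web, max_pages=100):
    """Breadth-first traversal from a seed URL; returns visited URL -> text."""
    visited = {}
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or url not in web:
            continue  # skip pages already seen and dead links
        text, links = web[url]
        visited[url] = text
        queue.extend(links)
    return visited


def build_index(documents):
    """Inverted index: term -> set of URLs whose text contains the term."""
    index = {}
    for url, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(url)
    return index


def search(index, query):
    """Return the URLs that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


if __name__ == "__main__":
    docs = crawl("http://example.org/", WEB)
    print(search(build_index(docs), "search engines"))
```

The `max_pages` budget hints at the growth problem noted for robot based engines: once the web is larger than the budget a crawler can afford, the index necessarily covers only part of it.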

Search engine performance

A previously mentioned study by Lawrence and Giles tested eleven popular full text search engines. They concluded that not only do these search engines index sites unequally, but also that no current search engine indexes more than 16% of the publicly indexable web [36]. They speculate that there might be several reasons for this:

• A threshold point beyond which it is not economical for search engines to improve their coverage or timeliness in returning results.

• Limited scalability of the engine’s indexing and retrieval technology.

• Physical constraints, such as network bandwidth and the hardware needed to process large indexes.

• Finding a good hub rather than an authority is deemed acceptable. In other words, it is deemed acceptable to find a page that links to the original instead of the original itself.

They also showed that it could take a search engine months to index a new page. This effect seems to be caused by search engines favouring well-connected sites when listing results. Sites with many incoming links become more visible in search engine listings, while new sites with few incoming links struggle to be listed [36].
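The distinction between hubs and authorities, and the disadvantage of poorly linked new pages, can be sketched with Kleinberg's HITS algorithm, one well-known formalization of these two notions. The study discussed above is not stated to use this algorithm; it is introduced here purely as an illustration, and the link graph `GRAPH`, the function name `hits` and the iteration count are hypothetical.

```python
import math

# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "hub": ["orig1", "orig2", "orig3"],
    "orig1": [],
    "orig2": [],
    "orig3": ["orig1"],
    "new": ["orig1"],  # a new page that nothing links to yet
}


def hits(graph, iterations=20):
    """Iteratively compute (hub, authority) scores for every page in the graph."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to a page.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages a page links to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


if __name__ == "__main__":
    hub_scores, auth_scores = hits(GRAPH)
    print(max(hub_scores, key=hub_scores.get))
    print(max(auth_scores, key=auth_scores.get))
```

In this toy graph the page that links to many originals receives the highest hub score, the most linked-to original receives the highest authority score, and the page with no incoming links keeps an authority score of zero, which mirrors the visibility problem described for new sites.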

The results obtained by the above study might suggest that there are some issues with current web searching techniques. In the next subsection, a brief summary of some of these challenges facing web search engines is given.
