University of the West Indies, Barbados

Charles Greenidge

University of the West Indies, Barbados

Copyright © 2008, IGI Global. Distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Traditionally, a great deal of research has been devoted to data extraction on the Web (Crescenzi, Mecca, & Merialdo, 2001; Embley, Tao, & Liddle, 2005; Laender, Ribeiro-Neto, da Silva, & Teixeira, 2002; Hammer, Garcia-Molina, Cho, Aranha, & Crespo, 1997; Huck, Frankhauser, Aberer, & Neuhold, 1998; Ribeiro-Neto, Laender, & Soares da Silva, 1999; Wang & Lochovsky, 2002, 2003) from areas where data is easily indexed and extracted by a search engine, the so-called surface Web. There are, however, other sites, larger and potentially more vital, whose information cannot be readily indexed by standard search engines. These sites, which have been designed to require some level of direct human participation (for example, to issue queries rather than simply follow hyperlinks), cannot be handled using the simple link traversal techniques used by many Web crawlers (Cho & Garcia-Molina, 2000; Cho, Garcia-Molina, & Page, 1998; Edwards, McCurley, & Tomlin, 2001; Rappaport, 2000). This area of the Web, which has been operationally off-limits for crawlers using standard indexing procedures, is termed the deep Web (Bergman, 2001; Zillman, 2005). Much work remains to be done, as deep Web sites have only recently begun to be explored for potential uses.

BACKGROUND

The deep Web comprises pages that are not normally included in the results returned by conventional search engines. These deep Web pages are easily accessible to people with domain-specific information through some user interface, and may include dynamic Web pages returned in response to a query.

The problem arises when, owing to design limitations, the common spidering programs that search engines use to harvest pages from the surface Web are unable to formulate and send the user query, which hampers the search engines’ efforts to access the information. These search engine design barriers make the information appear to be “deep,” “hidden,” or “invisible,” hence the terms “deep Web,” “hidden Web,” or, less frequently, “invisible Web.”

Advances in search engine technology have changed the outer boundaries of the deep Web, which once included non-text document formats such as the popular Word (.doc), PostScript, and .pdf formats. Nevertheless, it is clear that the amount of information stored in dynamically accessed online databases is still increasing, making future deep Web querying an attractive prospect.

Determining whether a retrieved page belongs to the deep Web or surface Web is, as shown in Figure 1, a difficult problem, as dynamically generated pages can sometimes point to static pages in the deep Web or other dynamic pages, which may not be visible from the surface Web. It has been estimated (Bergman, 2001) that the deep Web is more than 500 times the size of the surface Web.

In this article, we propose a method that uses deep Web sites to automate the discovery and extraction of numeric data from HTML-encoded tables. In our research, we focus on numeric tables that arise in the banking domain and show how step-by-step analysis can be performed to disambiguate labels, headings, and other information in a table. The cells containing labels will (in this case) differ significantly from the central data content of a table because their values are predominantly non-numeric.
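The contrast between label cells and data cells can be illustrated with a small sketch. It is not the procedure used in the study; it only assumes that a cell counts as data when at least half of its alphanumeric characters are digits. The function names and the 0.5 threshold are hypothetical choices made for illustration.

```python
def digit_fraction(text):
    """Return the share of alphanumeric characters in `text` that are digits."""
    chars = [ch for ch in text if ch.isalnum()]
    if not chars:
        return 0.0
    return sum(ch.isdigit() for ch in chars) / len(chars)


def mark_cells(grid, threshold=0.5):
    """Tag every cell of a table (a list of rows of strings) as 'label' or 'data'.

    The threshold is an assumed value, not one reported in the article.
    """
    return [
        ["data" if digit_fraction(cell) >= threshold else "label" for cell in row]
        for row in grid
    ]


if __name__ == "__main__":
    rates = [["Term", "Rate"], ["1 year", "4.25%"]]
    for row in mark_cells(rates):
        print(row)
```

A cell such as "1 year" contains a digit but is still treated as a label under this rule, because most of its characters are alphabetic.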

Our method takes into account the HTML <table> tag in particular and parses this structure to derive data cells. Owing to the flexible nature and usage of HTML tables, not every region of data encoded using the <table> tag is a genuine table; it may be purely a document-formatting construct. Using a combination of heuristics and standard statistical techniques, we differentiate between genuine and non-genuine table characteristics.
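As a rough illustration of parsing the <table> structure and separating genuine tables from layout constructs, the sketch below uses Python's standard html.parser module. The class and function names are hypothetical, and the heuristic (at least two rows, at least two columns, and a minimum share of cells containing a digit) is an assumption made for the example, not the set of heuristics or statistical tests used in the study.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by table and by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows of strings
        self._open = []    # tables currently open (innermost last)
        self._cell = None  # text fragments of the cell being read, or None

    def _flush(self):
        # Store the cell being read, if any, in the current row.
        if self._cell is not None and self._open and self._open[-1]:
            text = " ".join("".join(self._cell).split())
            self._open[-1][-1].append(text)
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Simplification: an outer cell holding a nested table is dropped.
            self._cell = None
            self._open.append([])
        elif not self._open:
            return
        elif tag == "tr":
            self._flush()
            self._open[-1].append([])
        elif tag in ("td", "th"):
            self._flush()  # tolerate a missing </td>
            if not self._open[-1]:
                self._open[-1].append([])  # tolerate a missing <tr>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        if tag in ("td", "th", "tr"):
            self._flush()
        elif tag == "table":
            self._flush()
            self.tables.append(self._open.pop())


def parse_tables(html):
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def looks_like_data_table(rows, min_rows=2, min_cols=2, min_numeric=0.3):
    """Assumed heuristic: enough rows and columns, and enough cells with digits."""
    widest = max((len(row) for row in rows), default=0)
    if len(rows) < min_rows or widest < min_cols:
        return False
    cells = [cell for row in rows for cell in row]
    numeric = sum(1 for cell in cells if any(ch.isdigit() for ch in cell))
    return numeric / len(cells) >= min_numeric
```

A single-cell table that only wraps a logo image is rejected by this rule, while a two-by-two table of terms and rates is accepted.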

The issue of general table recognition is a complex one, and several researchers (Chen, Tsai, & Tsai, 2000; Hu, Kashi, Lopresti, Wilfong, & Nagy, 2000; Hu, Kashi, Lopresti, & Wilfong, 2001) have developed a number of approaches in recent years. The identification of tables in HTML has also been studied, and various methods have been applied with some success. Determining the presence of labels encoded in a table is also a very important activity, as these may suggest the presence of data attributes, which can further be used to categorize and bring structure to the otherwise unstructured or semi-structured data found in Web pages.

One of the aims of our research is to identify methods that allow table data encoded using the HTML <table> tag to be structured automatically by identifying potential labels in the structure of the table itself.

We focus on both the benefits and limitations of the <table> tag and its associated <td> and <tr> tags. Unlike previous research approaches, our approach intends to take into account the presence of sub-cells, row spanning, and column spanning. We also look at a broad spectrum of other tags that may be found in a typical HTML table document (such as <B>, <P>, and <BR>) and use these to augment our research investigations.
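One common way to account for row and column spanning is to expand every spanned cell into a rectangular grid, so that each grid position holds the text of the cell covering it. The sketch below shows that normalization under stated assumptions: cells arrive as hypothetical (text, rowspan, colspan) tuples, spans are clipped at the last row, and uncovered positions are filled with empty strings. It is an illustration, not the authors' implementation.

```python
def expand_spans(rows):
    """Expand rowspan/colspan cells into a rectangular grid of strings.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) tuple.
    """
    covered = {}  # (row index, column index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a cell spanning from above.
            while (r, c) in covered:
                c += 1
            # Clip the row span so it never runs past the last row.
            for dr in range(min(rowspan, len(rows) - r)):
                for dc in range(colspan):
                    covered[(r + dr, c + dc)] = text
            c += colspan
    if not covered:
        return []
    n_cols = max(col for _, col in covered) + 1
    return [
        [covered.get((r, c), "") for c in range(n_cols)]
        for r in range(len(rows))
    ]


if __name__ == "__main__":
    header_and_data = [
        [("Term", 2, 1), ("Rate", 1, 2)],
        [("Fixed", 1, 1), ("Variable", 1, 1)],
        [("1 year", 1, 1), ("4.25", 1, 1), ("3.90", 1, 1)],
    ]
    for line in expand_spans(header_and_data):
        print(line)
```

Repeating the spanned text keeps every data cell aligned with all of the headings that apply to it, which simplifies later label analysis.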

Previous research (Chen et al., 2000; Embley et al., 2005; Hu et al., 2000, 2001) has also focused on the physical structure of tables. However, we focus on both physical characteristics (row and column information) and the formatting structure, tag structure, and content. In particular we seek to make use of the broad spectrum of HTML tags available to perform our analysis.

MAIN THRUST

It is important to note that today many large data producers such as the U.S. Census Bureau, Securities and Exchange Commission, and Patent and Trademark Office, along with many new classes of Internet-based companies, choose the deep Web as their preferred medium for commerce and information transfer.

In the deep Web, databases contain information stored in tables created by such programs as Access, Oracle, SQL Server, and DB2. A significant amount of valuable information on the Web is generated from these databases. This therefore provides the motivation for the title of our article in which we focus on extracting data in numeric HTML tables.

[Figure 1. Diagram labels: Surface Web; Deep Web; User Query; Static HTML Pages; Restricted Access Sites; Dynamically Generated Pages; Search Engine Accessible; Non-HTML Document Formats; Database Content; Public Access Sites; Web 1.0]

Data Extraction from Deep Web Sites

In our approach, we gather HTML files from predominantly numeric deep Web sites through the use of deep Web indexes such as profusion.com and brightplanet.com. We then run a <table> tag cell data extraction and parsing engine to lift cell data from HTML files. For each cell we record key features such as alphabetic character content, digit content, and various other internal cell HTML tag content parameters.
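The per-cell features named above can be sketched as follows. The feature names, the dictionary layout, and the choice to count each opening tag by name are assumptions made for illustration; the article does not list the exact parameters recorded by the extraction engine.

```python
from collections import Counter
from html.parser import HTMLParser


class _CellScanner(HTMLParser):
    """Gather the text and the opening tags found inside one table cell."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1  # tag names arrive lowercased

    def handle_data(self, data):
        self.text.append(data)


def cell_features(cell_html):
    """Return simple character and tag counts for the inner HTML of a cell."""
    scanner = _CellScanner()
    scanner.feed(cell_html)
    scanner.close()
    text = "".join(scanner.text)
    return {
        "alpha": sum(ch.isalpha() for ch in text),  # alphabetic characters
        "digit": sum(ch.isdigit() for ch in text),  # digit characters
        "length": len(text),                        # all text characters
        "tags": dict(scanner.tags),                 # e.g. {"b": 1, "br": 1}
    }


if __name__ == "__main__":
    print(cell_features("<B>Rate</B><BR>4.25%"))
```

Feature records of this kind can then be compared across the cells of a table, for example to contrast heading cells, which tend to carry formatting tags and letters, with data cells, which are mostly digits.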

Methodology/Experiment

We used the interfaces of profusion.com and brightplanet.com to manually issue queries on the search key “bank rates.” This initial search yielded a number of links to pages on the deep Web, including several Web sites visible on the surface Web.

These initial links were then further exploited by programmatically extracting links from the results pages to produce an expanded collection of pages, with care taken to exclude pages that were not HTML-based.

The process of link extraction was repeated on the expanded collection of pages to yield a yet more diverse collection of URLs, all loosely related to our initial query term.
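The two rounds of link extraction described above can be sketched as a small expansion loop. This is an illustration under assumptions: `fetch` is a hypothetical callable that returns the HTML of a URL (for example, a wrapper around an HTTP client), only http and https links are kept, and fragments are dropped so that the same page is not collected twice. The example URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links found in `html`, without duplicates."""
    parser = _LinkParser()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.hrefs:
        url = urldefrag(urljoin(base_url, href)).url
        if url.startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links


def expand(seed_urls, fetch, rounds=2):
    """Repeat link extraction `rounds` times, starting from the seed pages."""
    collection = list(seed_urls)
    frontier = list(seed_urls)
    for _ in range(rounds):
        found = []
        for url in frontier:
            for link in extract_links(fetch(url), url):
                if link not in collection and link not in found:
                    found.append(link)
        collection.extend(found)
        frontier = found  # the next round only visits newly found pages
    return collection


if __name__ == "__main__":
    pages = {
        "http://results.example/q": '<a href="http://bank-a.example/rates.html">A</a>',
        "http://bank-a.example/rates.html": '<a href="mortgage.html">Mortgages</a>',
    }
    print(expand(["http://results.example/q"], lambda u: pages.get(u, "")))
```

Passing the fetch function as a parameter keeps the sketch free of network code, so a filter for non-HTML pages could be applied inside `fetch` without changing the loop.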

From this last diverse collection of URLs, we downloaded the corresponding Web pages, again excluding pages that were not HTML-based. We noted that this collection contained the names of many international banking Web sites. Efforts were made to restrict the number of URLs used from any individual Web site; for example, royalbank.com/rates/mortgage.html and royalbank.com/ were recognized as having the same basic Web site name.
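Recognizing that two URLs share the same basic Web site name, and capping how many URLs are kept per site, can be sketched with the standard urllib.parse module. The function names, the stripping of a leading "www.", and the default cap of one URL per site are assumptions for illustration; the article does not state the limit that was applied.

```python
from urllib.parse import urlsplit


def site_name(url):
    """Return the host part of a URL, lowercased and without a leading 'www.'."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit read a scheme-less URL's host
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def limit_per_site(urls, max_per_site=1):
    """Keep at most `max_per_site` URLs for each site, preserving input order."""
    counts = {}
    kept = []
    for url in urls:
        site = site_name(url)
        if counts.get(site, 0) < max_per_site:
            kept.append(url)
            counts[site] = counts.get(site, 0) + 1
    return kept


if __name__ == "__main__":
    urls = [
        "royalbank.com/rates/mortgage.html",
        "royalbank.com/",
        "http://www.example.org/rates",
    ]
    print(limit_per_site(urls))
```

Comparing host names is a simplification: it treats subdomains as separate sites, which may or may not match how site names were grouped in the study.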

The diverse collection also initially contained a number of links to search engines and directories, as well as links to news sites, popular downloading sites, and advertising sites, but these were filtered out.

The final set of links used for our experiments contained in excess of 380 URLs, from which we retrieved Web pages for our analysis.