The deep Web remains an intriguing area that needs further investigation, especially in domains where numeric data is found. This initial research has raised a number of important questions, such as the relative prevalence of numeric tables in HTML documents in general, the use of JavaScript and other HTML-enabled technologies in table construction, and the use of HTML content in cells to identify potential labels.
We are also interested in encoding some key features identified by our data extraction efforts in an XML-based format so that we can standardize findings across domains. The development of this XML-described metadata may aid future researchers.
In our attempts at identifying tables, we also intend to take into account the presence of sub-cells, row spanning, and column spanning.
CONCLUSION
In this article, we have introduced a method that uses deep Web sites to automate the discovery and extraction of numeric data from HTML-encoded tables. Our method extends previous approaches in a number of respects. In particular, our method focuses on physical characteristics, formatting structure, tag structure, and content.
Our research has shown that character-type content predominates in a minority of cells across our sample Web documents. Even more surprisingly, numeric tables proved to be extremely rare, even though the domain (“banking”) should have been biased in favour of numeric content.
Figure 5. Number of digits in different tables (chart title: Table Digit Totals Distribution; x-axis: Table IDs; y-axis: Number of Digits in a Table)
Data Extraction from Deep Web Sites
The existence of over 2200 tables with CDRs of 0 shows that approximately 50% of the tables did not register any character content. This is significant and may be due to the fact that the <TABLE> tag is often used as a document formatting construct. On examination, some unusable documents were found to be in an XML format such as the popular RSS 1.0, some contained HTML links redirecting the browser to an alternative Web document, and others contained some type of scripting code such as JavaScript.
WEB SITES OF INTEREST
http://www.perl.com/pub/a/2003/09/17/perlcookbook.html
http://perldoc.perl.org/index.html
www.completeplanet.com
www.deepWebresearch.info
www.tpj.com
http://en.wikipedia.org/wiki/Deep_Web
KEY TERMS
Cell: A region within an HTML-encoded table, delimited by an HTML <TD> tag. Cells may contain a rich variety of HTML tags and markup in addition to raw data in the form of text.
Character Content: In our context, this refers to the presence of alphabetic characters (A-Za-z) within a cell.
Character-to-Digit Ratio (CDR): This is a narrowly defined ratio obtained by dividing the number of characters by the number of digits. In the case where there are no digits the CDR is set to the number of characters, and in the case where there are no characters the CDR is set to zero. It gives a sense of the character content versus digit content of a cell.
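The CDR definition above can be expressed in a few lines of code. The following is a minimal Python sketch, not the implementation used in this study; the function name `character_to_digit_ratio` is hypothetical, and it assumes that "characters" means the alphabetic characters A-Za-z and "digits" means 0-9, as stated in the Character Content and Digit Content entries.

```python
import re


def character_to_digit_ratio(cell_text: str) -> float:
    """Return the CDR of a cell's text (hypothetical helper, for illustration)."""
    characters = len(re.findall(r"[A-Za-z]", cell_text))  # alphabetic characters
    digits = len(re.findall(r"[0-9]", cell_text))         # digit characters
    if characters == 0:
        return 0.0                # no characters: CDR is set to zero
    if digits == 0:
        return float(characters)  # no digits: CDR is the character count
    return characters / digits


# "Rate 4.25" has 4 letters and 3 digits, so the ratio is 4/3.
print(character_to_digit_ratio("Rate 4.25"))
```

Under this reading, a purely numeric cell such as "1,234.50" and an empty cell both yield a CDR of 0, which matches the observation that tables used only for layout register no character content.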
Deep Web: A largely untapped region of cyberspace in which Web data is indirectly accessible through the use of query-type, human-readable interfaces. Typically, the user must enter log-on information or select options before being granted access to the information from the Web site. The need for human interaction restricts the ability of search engines and Web bots to index these sites. The terms invisible Web and hidden Web are also loosely used to describe these regions of cyberspace.
Digit Content: In our context, this refers to the presence of digit characters (0-9) within a cell.
Dynamic Web Page: A Web page that is created on-the-fly from a back-end database when a user interactively issues a query on a Web site. Sometimes the presence of a question mark “?” in the body of a URL indicates that dynamic content will be sent instead of a static HTML page.
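The question-mark heuristic can be sketched as follows. This is an illustrative Python example only: the function name `looks_dynamic` and the example.com URLs are hypothetical, and the sketch assumes that a non-empty query string (the part after "?") is what signals dynamic content. As the definition notes, this is a hint rather than a reliable test.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Heuristic: a non-empty query string suggests dynamically generated content."""
    return bool(urlsplit(url).query)


# Hypothetical URLs, for illustration only.
print(looks_dynamic("http://www.example.com/rates.cgi?currency=USD"))  # query present
print(looks_dynamic("http://www.example.com/index.html"))              # no query
```

Using `urlsplit` rather than a plain substring search means a "?" that appears only inside a fragment (after "#") is not counted as a query.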
General Table Recognition: A complex field of study that seeks to identify tables within documents, typically by a pixel-by-pixel analysis of an image file. The presence of borders and other repeating regions of distinction may indicate the presence of a table.
HTML-Encoded Tables: Sections of HTML code that are delimited by the HTML <TABLE> tag. The data within these sections are not always tables in the logical sense of the word.
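To make the idea concrete, the sketch below collects the text of every <TD> cell, grouped by its enclosing <TABLE> section. It uses Python's standard `html.parser` module as a stand-in and is not the Perl tooling referred to in this article; the class name `TableCellCollector` and the sample markup are hypothetical. Nested tables are not reconstructed: each <TABLE> tag simply starts a new flat list of cells.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of each <TD> cell, grouped by <TABLE> section."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one list of cell strings per <TABLE>
        self._in_cell = False  # True while inside an open <TD>

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.tables.append([])
            self._in_cell = False
        elif tag == "td" and self.tables:
            self.tables[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1] += data


# Hypothetical sample: one data table and one table used only for layout.
sample = """
<TABLE BORDER="0">
  <TR><TD>Savings</TD><TD>2.5</TD></TR>
  <TR><TD>Loans</TD><TD>7.25</TD></TR>
</TABLE>
<table><tr><td><b>Menu</b></td></tr></table>
"""
collector = TableCellCollector()
collector.feed(sample)
collector.close()
print(collector.tables)
```

The second section in the sample is an HTML-encoded table but not a table in the logical sense, which is the distinction this definition draws.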
HTML Tag: HTML consists of elements, which control how HTML-encoded data is displayed. Tags start with a “<” and end with a “>”. For example, <HTML>, <P>, and <A> are three distinct tags in HTML. Tags may also contain attributes, which modify the default behaviour of the tag. For example, the tag <TABLE BORDER="0"> contains the border attribute for this table. The lettering inside the angle brackets is not case sensitive.
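The case-insensitivity of tag and attribute names can be observed with a small parser. The following Python sketch uses the standard `html.parser` module, which reports tag names and attribute names in lower case regardless of how they were written; the class name `TagLogger` is hypothetical and the example is illustrative only.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag with its attributes as (tag, {name: value})."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


logger = TagLogger()
# The same tag written in upper case and in lower case.
logger.feed('<TABLE BORDER="0"><table border="0">')
logger.close()
print(logger.seen)
```

Both spellings are reported identically, so code that searches for tables only needs to compare against the lower-case name.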
Perl Module: This is a special-purpose, pre-built section of Perl code that is freely available from CPAN.org or other standard Perl-coding Web sites. Modules act as code libraries and allow extended functionality to be added simply and easily to Perl programs. An example is the HTML::TableContentParser module.