
Formats change over time, and some fall into disuse. “Live” sites gradually implement various software upgrades, change hardware platforms, and perhaps even adopt new protocols. Consider gopher, ftp, and telnet, which have mostly been replaced by http/https, scp, and ssh. HTML 1.0 has evolved to SHTML and XHTML, and a number of early HTML tags have been deprecated. The net result is that a faithful bit-level copy of an old resource (w) might not be usable at all on the new system (W_∆). For a resource that continues to live during the changes, w becomes w_∆ by manual intervention, by automated updates, and perhaps through repeated interventions of both types. A preserved resource would need to be similarly adapted to the updated environment in order to be viable. The adaptation could happen by emulation of the older system, by translation to a newer format, or by some other method, ideally one that is automated.

For preservation, the metadata customarily available from an HTTP request-response event is insufficient. If web crawling and browsing occur through HTTP, how can more metadata be obtained? In part, archivists actively coordinate with the website owner to manually store additional information about the website and its resources, or post-process the item using various utilities. For example, Dublin Core metadata may be derived through a series of conversations and form-filling between the archivist and the site owner.

In the technical metadata arena, a variety of utilities exist to aid the archivist in metadata production once the website has been crawled. JHOVE and ExifTool are two well-known examples of

While some metadata utilities depend on supplemental manual input from an archivist, others are fully automated and capable of being used by the originating server as well as by the archiving client. Regardless of the approach, the key is to maintain enough information (metadata) about the resource to enable its future understanding. The insufficient metadata accompanying an HTTP response is behind the Representation Problem.

3 SEARCH ENGINES & REPRESENTATION

Before Google revolutionized web searching with its PageRank algorithm, finding resources on the web was difficult, and many authorities believed it could only be solved by somehow incorporating metadata into websites [121, 134, 148]. Google’s approach was to weight links on web pages to produce a hierarchy of results, circumventing the supposed metadata dilemma. One aspect of metadata remains a factor for search engines, regardless of the indexing strategy used: trust in content representation. Consider Figure 43 on the next page, which shows the HTML content (A) and the browser-view of the content (B). The representation of content on this page differs depending on whether it is crawled or browsed. The crawler “sees” the text content (Britney Spears) repeated numerous times. The browser doesn’t display that content; only the image is shown. Such pages are considered a kind of “spam” because their content cannot be trusted by the crawler to accurately reflect content the user will see. The issue of trust is important [80]. If this page was in the top-10 links for a user’s “Britney Spears” query, the user would be very unhappy with the results since it has nothing to do with the request. Although there have been many improvements, search and rank algorithms have not yet eliminated the ability of such “spam” pages to populate search results [103, 90]. On the other hand, sometimes the intent of text is to communicate a picture, as in Figure 44 on page 88. How can this representation be distinguished from the spam-like content of Figure 43 on the next page? How does the content of ASCII art relate to the image drawn? Is it spam or is it informational or is it nothing but pixel-rendering? In OAIS terms, the knowledge base is as important as the other two components (the data object itself and the representation information) in order to produce a valid information object.
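The gap between what the crawler “sees” and what the browser displays can be illustrated with a small sketch. It assumes, purely for illustration, that the hidden text is concealed with an inline CSS style (display:none or visibility:hidden); the page behind Figure 43 may use a different technique. The class name TextSplitter and the sample page are hypothetical, and the sketch ignores external stylesheets and scripts.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextSplitter(HTMLParser):
    """Collect every word a crawler could index and the subset a browser shows."""

    def __init__(self):
        super().__init__()
        self.all_words = []
        self.visible_words = []
        self._hidden_stack = []  # one bool per open element: is it hidden?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Assumption: hiding is done with an inline style attribute only.
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hidden = "display:none" in style or "visibility:hidden" in style
        inherited = bool(self._hidden_stack) and self._hidden_stack[-1]
        self._hidden_stack.append(hidden or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._hidden_stack:
            self._hidden_stack.pop()

    def handle_data(self, data):
        words = data.split()
        self.all_words.extend(words)
        if not (self._hidden_stack and self._hidden_stack[-1]):
            self.visible_words.extend(words)


# Hypothetical page: an image plus a hidden block of repeated keywords.
page = ('<html><body><img src="photo.jpg">'
        '<div style="display: none">' + "Britney Spears " * 3 + '</div>'
        '<p>Welcome</p></body></html>')

parser = TextSplitter()
parser.feed(page)
parser.close()

# Fraction of indexed words that the user never sees.
hidden_ratio = 1 - len(parser.visible_words) / len(parser.all_words)
print(parser.visible_words, round(hidden_ratio, 2))
```

A high ratio of hidden to indexed words is only a hint that the two representations diverge; it cannot by itself separate spam from legitimate uses of hidden markup.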

Search engines also alter content representation when they transform the site resource in the cached copy they keep. Consider Figure 45, where the original PDF resource (A) has been cached and modified (B). Yahoo’s cached copy has only the essential text and none of the imagery. Whether or not information has been lost by the transformation depends on the resource and on the intent of the original document. If the client’s search includes an expectation of an image (perhaps as the “recognition” factor for the client), this cached copy is less likely to be useful. Representation issues impact search engines as well as preservationists.

(A) Crawler-Viewed Content (B) Browser-Viewed Content

FIG. 43: Representing content. The dominant content varies by the type of access, that is, the emphasis may not be the same to the crawler as it is to the user with a browser. The HTML in (A), which repeats “Britney Spears” a few hundred times, produces the page in (B), but that is not a photo of Britney Spears. Every occurrence of “Britney Spears” is seen by the crawler but not displayed by the browser, so the user may never realize that they are there and will not understand why the page is in the Britney Spears query result set.

(A) Coffee Shop Zombies¹ (B) Turmoil²

FIG. 44: ASCII Art was popular during the days of Usenet. In some cases the text had both viewable art and meaningful content. In other cases, the text merely served to turn monitor pixels on and off, effectively drawing the image on the screen, if the screen is a monochrome 800x600 pixel device. Future representation of this will depend on having sufficient information about its content and expression.
¹ http://www.penguinpetes.com/images/BBS_art/thumbs/Coffeeshop_Zombies.jpg
² http://www.penguinpetes.com/images/BBS_art/ASCII/Turmoil.jpg

4 WEB SERVERS, BROWSERS, & REPRESENTATION

4.1 MIME

Once mostly plain ASCII text or Hypertext (HTML), many World Wide Web sites now contain application-specific files (Flash, Video, multimedia), non-hypertext documents (Adobe PDF, Word files, XML files) and enhanced hypertext content (XHTML, CSS). Successful access to this variety of resources is accomplished in part thanks to MIME typing, which identifies a resource as belonging to one of 8 major types, each of which has a variety of subtypes. Servers and browsers are individually configured to recognize various MIME types as defined by IANA. Apache, for example, has an extensive list of default MIME types that are installed with the server, including many that are seldom used (as in the example in Figure 15 on page 33).

The MIME specification (Multipurpose Internet Mail Extensions) and MIME types are one method for encoding binary data in an ASCII format so that files can be transferred using simple text-based protocols like HTTP and SMTP [37]. The MIME specification has enjoyed a nearly universal implementation, but it differentiates file content types on only a very simple level, and one which is insufficient for archiving purposes. RFC 2046 defined 5 basic content types [38] and two composite types. The 7 categories are listed in Table 21, with example files given for each type. There are some unexpected category assignments mixed in with the usual suspects. Most of us probably would guess correctly that the content type assignment for voice messages is multipart media, but it is a bit surprising to find that encrypted resources such as message digests are also assigned to this category.

(A) The original PDF¹ (B) Yahoo’s transformed cached copy²

FIG. 45: Search engines sometimes transform resources that will be stored in cache. In the process, images and other information may be lost.
¹ http://www.erpanet.org/guidance/docs/ERPANETPolicyTool.pdf
² http://cache.search.yahoo-ht2.akadns.net/search/cache?ei=UTF-8&p=digital+preservation&y=Search&fr=yfp-t-501&u=www.erpanet.org/guidance/docs/ERPANETPolicyTool.pdf&w=digital+preservation&d=Q1PzJpzfQw6A&icp=1&.intl=us

In most cases, both the server and client rely on the file extension for type identification, and problems can arise if the typing and content are mismatched. For example, the file http://beatitude.cs.odu.edu:9999/falsePdf.pdf is a UTF-8 encoded resource which has been renamed with the “dot-pdf” extension. Both the server and the client misidentify this file. Browsers attempting to access this file can generate an error if the file is not examined more closely. But if falsePdf.pdf is downloaded and examined with a more capable tool like the Unix file command, the “real” file format is recognized as “UTF-8 Unicode English Text”. The automatic MIME typing process was misled by the “pdf” extension.
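The mismatch described above can be sketched by comparing two guesses for the same resource: one derived from the file name and one derived from the leading bytes of the content. This is a minimal illustration and not a substitute for the file command; the signature list is a small, assumed subset, and the function names and sample bytes are hypothetical.

```python
import mimetypes

# A few leading byte signatures ("magic numbers"); an illustrative subset only.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def type_from_extension(name):
    """Guess the MIME type the way a simply configured server would: by name."""
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or "application/octet-stream"


def type_from_content(data):
    """Guess the MIME type from the bytes themselves."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    try:
        data.decode("utf-8")  # decodes cleanly, so treat it as text
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def check(name, data):
    """Return (type by name, type by content, whether the two agree)."""
    by_name = type_from_extension(name)
    by_content = type_from_content(data)
    return by_name, by_content, by_name == by_content


# Stand-in for the renamed text file; the bytes here are made up.
sample = "Just some UTF-8 text, not a PDF.\n".encode("utf-8")
print(check("falsePdf.pdf", sample))
```

When the two guesses disagree, as they do for the sample, the extension-based type is the one to distrust.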

(2)  application   pdf, octet-stream, zip, msword
(3)  audio         basic, wave
(4)  image         jpeg, tiff, gif
(5)  video         mpeg, quicktime

Composite Types

(6)  multipart     header-set, digest, mixed
(7)  message       external-body, news, partial

TABLE 21: The MIME Content Type Categories

In some cases, not enough information is given to access the resource once it is received. For example, a Content-Type of application/octet-stream could be an Open Office document, an Excel spreadsheet, or some other file format not recognized by the server. Another frequent scenario is where the server understands the type, but the client does not, as the previous example of Figure 15 on page 33 illustrated. The web server has correctly identified the MIME type, but the browser has no representation method. VRML files, popular in the 1990s, are just one of many formats that have fallen into disuse. Travelling back in time, it might be possible to get more useful metadata on the file: the best time to get information about a VRML file was about 10 years ago. Certainly, the minimal metadata generated by crawling the site for this resource is unlikely to prove sufficient for historians in the year 2100. Despite “knowing” what the file is, representing it is a problem for the browser.
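To make concrete how little a Content-Type value carries, the following sketch splits a header value into its type, subtype, and charset parameter using Python's standard email package, and flags application/octet-stream as opaque. The function name, the OPAQUE_TYPES set, and the choice of which types count as opaque are assumptions made for illustration.

```python
from email.message import Message

# Types that reveal nothing about the real format (an assumed, minimal set).
OPAQUE_TYPES = {"application/octet-stream"}


def describe_content_type(header_value):
    """Split a Content-Type header value into the few facts it contains."""
    msg = Message()
    msg["Content-Type"] = header_value
    return {
        "type": msg.get_content_maintype(),
        "subtype": msg.get_content_subtype(),
        "charset": msg.get_param("charset"),  # None when not supplied
        "opaque": msg.get_content_type() in OPAQUE_TYPES,
    }


html_info = describe_content_type("text/html; charset=UTF-8")
blob_info = describe_content_type("application/octet-stream")
print(html_info)
print(blob_info)
```

Even in the best case the header yields a type, a subtype, and perhaps a character set. Nothing in it says which application or format version is needed to render the resource.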

4.2 HTTP

The MIME Content-Type entity header sent over HTTP by the server provides only bare-bones information about the resource. Version 1.1 of the HTTP protocol has 47 defined headers, which are grouped into 4 general categories: (1) Entity, (2) General, (3) Request, and (4) Response. Table 22 lists the headers by category. Few of these are routinely used by web servers, and even fewer provide insight into the resource. The Request and Response categories together contain more than 50% of all HTTP headers. This distribution of fields makes it plain that most of HTTP exists to facilitate the transfer of data rather than the interpretation of data.

TABLE 22: HTTP Headers, grouped by category. Those that were intended to provide resource metadata fall into the Entity category, but useful data can be found in the other categories as well.
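The grouping can be sketched in code. The header names below are those enumerated in RFC 2616 (sections 4.5, 5.3, 6.2, and 7.1), which is assumed here to be the HTTP/1.1 specification the text refers to. The categorize function and the sample response are hypothetical.

```python
# HTTP/1.1 header fields by category, as enumerated in RFC 2616.
HEADER_CATEGORIES = {
    "Entity": [
        "Allow", "Content-Encoding", "Content-Language", "Content-Length",
        "Content-Location", "Content-MD5", "Content-Range", "Content-Type",
        "Expires", "Last-Modified",
    ],
    "General": [
        "Cache-Control", "Connection", "Date", "Pragma", "Trailer",
        "Transfer-Encoding", "Upgrade", "Via", "Warning",
    ],
    "Request": [
        "Accept", "Accept-Charset", "Accept-Encoding", "Accept-Language",
        "Authorization", "Expect", "From", "Host", "If-Match",
        "If-Modified-Since", "If-None-Match", "If-Range",
        "If-Unmodified-Since", "Max-Forwards", "Proxy-Authorization",
        "Range", "Referer", "TE", "User-Agent",
    ],
    "Response": [
        "Accept-Ranges", "Age", "ETag", "Location", "Proxy-Authenticate",
        "Retry-After", "Server", "Vary", "WWW-Authenticate",
    ],
}

# Header names are case-insensitive, so look them up in lower case.
_LOOKUP = {name.lower(): category
           for category, names in HEADER_CATEGORIES.items()
           for name in names}


def categorize(headers):
    """Group a mapping of header name -> value by RFC 2616 category."""
    grouped = {}
    for name, value in headers.items():
        category = _LOOKUP.get(name.lower(), "Other")
        grouped.setdefault(category, {})[name] = value
    return grouped


total = sum(len(names) for names in HEADER_CATEGORIES.values())
transfer_share = (len(HEADER_CATEGORIES["Request"])
                  + len(HEADER_CATEGORIES["Response"])) / total

# Hypothetical response headers, for illustration only.
sample_response = {
    "Date": "Mon, 01 Jan 2007 00:00:00 GMT",
    "Server": "Apache",
    "Content-Type": "text/html",
    "Content-Length": "1024",
}
print(total, round(transfer_share, 2))
print(categorize(sample_response))
```

Under this enumeration the four lists hold 47 names, and the Request and Response lists account for 28 of them.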