Storing and searching the explosively growing amount of data is one of the most important problems of the digital library age. A single site may also hold a large collection of data, such as a library database. It has been pointed out that compression is a key for next-generation text retrieval systems [ZMN00]. A good compression method may facilitate efficient retrieval on compressed files. The amount of storage used and the efficiency of indexing and searching are major design considerations for an information retrieval system. The volume of data can be reduced by using compression techniques; however, search and retrieval then become much more complicated. Usually there is a tradeoff between compression performance and retrieval efficiency. Therefore, most of the existing search or retrieval schemes on compressed text use compression methods that have a relatively lower compression ratio but simpler indexing and searching schemes, such as those using run-length coding, Huffman, word-based Huffman, or BWT [BW94a]. When the database is compressed, it is obviously inefficient to decode the whole collection and then locate the required portions in the uncompressed text.

Given a keyword query, there are in general two categories of methods for finding a matching pattern. One is compressed pattern matching, discussed in Chapter 5, which searches for a pattern directly on the compressed file. Both exact and approximate matching can be performed, and the method requires little or no offline preprocessing. The other is the popular information retrieval approach, which requires preprocessing to build an index with the keyword and document frequency information []. The query is processed and the search is performed on the index files. The documents are then ranked according to some standard so that precision and recall are as good as possible. Relevance feedback may also help to refine the query and give more accurate results.

For large collections of text, it is difficult to access a piece of a compressed file, both for compressed pattern matching and for a conventional text retrieval system with index files. One option is to break the whole collection into smaller documents [MSW93a]. However, the compression ratio will be poor for small files: the longer the sequence to be compressed, the better the estimate of the source entropy. Furthermore, what needs to be retrieved may vary with the purpose. For example, only the small portion of the collection that is relevant to the query may need to be delivered to the user; a single record or a paragraph, instead of a whole document, might be enough. It is unnecessary to decompress the whole database and then locate the portion to be retrieved. A single-level document partitioning system may therefore not be the best answer. We propose to add tags to the document. Different tags indicate different granularities, and decoding is performed within the bounds set by the tags.

The major concerns for a compression method intended for retrieval, ranked roughly by importance, are: a) random access and fast (partial) decompression; b) fast and space-efficient indexing; c) good compression ratio. Compression time is not a major concern, since the retrieval system usually performs offline preprocessing to build the index files for the whole corpus. Besides the searching algorithm, fast random access to the compressed data is critical to the response time for a user query in a text retrieval system.

A typical text retrieval system is constructed as follows. First, the keywords are collected from the text database offline and an inverted index file is built. Each entry points to all the documents that contain the keyword. A popular document ranking scheme is based on the keyword frequency tf and the inverse document frequency idf. When a query is given, the search engine matches the words in the query against the inverted index file. The document ranking information is then computed according to a certain logic and/or frequency rule to obtain search results that point to the target documents. Finally, only the selected documents are displayed to the user.

To find a good compression scheme that meets the given criteria, we first evaluate the performance of the text compression algorithms currently in use. Besides the categorization given in Section 1.2.1, compression schemes can also be categorized as entropy coders and model-based coders. Huffman coding and arithmetic coding are typical entropy coders. The LZ family, including LZ77, LZ78, LZW, and their variants [ZL77, ZL78, Wel84], contains the most popular compression algorithms because of their speed and good compression ratio. Canonical Huffman uses a language model in which English words are treated as the symbols of an alphabet that contains all the words in the text. It has sound compression performance, but it is language dependent. Dynamic Markov Coding (DMC) uses a Markov model to predict the next bit from the history information, and it has a good compression ratio. Prediction by Partial Matching (PPM) is currently the algorithm that achieves the best compression ratio; however, it has a higher computational complexity. The Burrows-Wheeler Transform (BWT), or block-sorting algorithm, has a compression ratio close to that of PPM, and it is slightly slower than the LZ algorithms.

From the compressed searching point of view, some algorithms support direct searching and partial decoding. For example, in Huffman coding, given a query keyword, we can obtain its codes from the Huffman tree and then search for those codes directly in the compressed file using a pattern matching algorithm such as Boyer-Moore or Knuth-Morris-Pratt. Some further checking may be needed, because a match in the bit stream does not necessarily begin on a codeword boundary. In the canonical Huffman model, a similar method can be used to search for a word and decode partially [WMB99]. For the other compression algorithms, we have to decode from the beginning of the compressed text, and random access is difficult.

In this thesis, we propose a modified LZW algorithm that supports random access and partial decoding. The query evaluation process of the original text retrieval system does not need to change. The data structure of the index file also stays the same; only its content changes, so that it indexes the compressed file in place of the raw text file. A new tag system is incorporated into the indexing system to provide different levels of detail for the text output. In our algorithm, we can decode any part of the text given the index of the dictionary entry, and either continue decoding until a certain tag is found or decode a given number of symbols.
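As an illustration of the typical retrieval pipeline described earlier in this section (inverted index, then tf and idf ranking), the following is a minimal sketch. It assumes whitespace tokenization and one common tf-idf weighting, tf multiplied by log(N/df); the text does not fix a particular formula, and the function names `build_inverted_index` and `rank` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for word, tf in Counter(text.lower().split()).items():
            index[word][doc_id] = tf
    return index

def rank(query, index, n_docs):
    """Order documents by the sum of tf * idf over the query keywords."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue  # keyword does not occur in the collection
        idf = math.log(n_docs / len(postings))  # assumed idf variant
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = [
    "compression of text databases",
    "text retrieval with an inverted index",
    "huffman compression and arithmetic compression",
]
index = build_inverted_index(docs)
ranking = rank("compression", index, len(docs))
```

Only the documents named in `ranking` would then need to be fetched, which is where random access to the compressed collection matters.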
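The direct search on Huffman-coded text mentioned in this section can be sketched as follows. This is an illustration under simplifying assumptions, not the scheme of any cited system: the code is character based, the compressed stream is held as a Python string of '0' and '1' characters, `str.find` stands in for Boyer-Moore or Knuth-Morris-Pratt, and the boundary check walks the stream from its start, which a real system would avoid. All function names are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return {symbol: bit string} for a character-based Huffman code."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    next_id = len(heap)  # unique tie-breaker so dicts are never compared
    if len(heap) == 1:   # a single distinct symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (f0 + f1, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[ch] for ch in text)

def codeword_offsets(bits, code):
    """Map each bit offset where a codeword starts to its symbol index."""
    codewords = set(code.values())
    offsets, start, index = {}, 0, 0
    for end in range(1, len(bits) + 1):
        if bits[start:end] in codewords:  # prefix-free: first hit is the codeword
            offsets[start] = index
            start, index = end, index + 1
    return offsets

def search_compressed(bits, pattern, code):
    """Return symbol positions of pattern, matching on the bit stream only."""
    pat_bits = encode(pattern, code)
    offsets = codeword_offsets(bits, code)
    hits, pos = [], bits.find(pat_bits)
    while pos != -1:
        if pos in offsets:  # the further check: reject unaligned matches
            hits.append(offsets[pos])
        pos = bits.find(pat_bits, pos + 1)
    return hits

text = "abracadabra abracadabra"
code = huffman_code(text)
bits = encode(text, code)
hits = search_compressed(bits, "cad", code)
```

The alignment test is what the "further checking" amounts to in this sketch: a bit-level match that starts inside a codeword is a false positive and is discarded.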
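The details of the modified LZW algorithm are not given in this section, so the following is only a sketch of one way random access and tag-bounded decoding could work, under a loudly stated assumption: the phrase dictionary is built in an offline pass and then frozen, so that every code can be decoded on its own. The entry layout, the `</p>` tag, and all function names are hypothetical.

```python
def build_dictionary(text, max_entries=4096):
    """LZW-style pass that grows a phrase dictionary, then freezes it.

    Each entry is (parent_code, char); parent_code is -1 for single characters.
    """
    entries = [(-1, ch) for ch in sorted(set(text))]
    lookup = {(-1, ch): i for i, (_, ch) in enumerate(entries)}
    cur = -1
    for ch in text:
        if (cur, ch) in lookup:
            cur = lookup[(cur, ch)]          # extend the current phrase
        else:
            if len(entries) < max_entries:   # add the new phrase if room remains
                lookup[(cur, ch)] = len(entries)
                entries.append((cur, ch))
            cur = lookup[(-1, ch)]           # restart from the single character
    return entries, lookup

def encode(text, lookup):
    """Greedy longest-match parse of text against the frozen dictionary."""
    codes, cur = [], -1
    for ch in text:
        if (cur, ch) in lookup:
            cur = lookup[(cur, ch)]
        else:
            codes.append(cur)
            cur = lookup[(-1, ch)]
    if cur != -1:
        codes.append(cur)
    return codes

def decode_entry(code, entries):
    """Recover one phrase by walking parent pointers back to a root."""
    chars = []
    while code != -1:
        code, ch = entries[code]
        chars.append(ch)
    return "".join(reversed(chars))

def decode_from(codes, start, entries, stop_tag=None, max_symbols=None):
    """Decode from code index `start` until stop_tag appears or max_symbols are output."""
    out = ""
    for code in codes[start:]:
        out += decode_entry(code, entries)
        if stop_tag is not None and stop_tag in out:
            return out[: out.index(stop_tag)]
        if max_symbols is not None and len(out) >= max_symbols:
            return out[:max_symbols]
    return out

text = "<p>to be or not to be</p><p>that is the question</p>"
entries, lookup = build_dictionary(text)
codes = encode(text, lookup)
first_paragraph = decode_from(codes, 0, entries, stop_tag="</p>")
```

In this sketch an index entry would store a position in `codes` rather than a byte offset in the raw text. It does not handle a record that begins in the middle of a phrase, which a complete design would have to address.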