Bag-of-words approach
An approach frequently used in information retrieval and text learning for document representation is the so-called “bag-of-words” approach. Many web content mining systems use this approach as their document representation scheme [91, 92], which suggests that it can be regarded as the current standard technique for document representation in content mining systems.
In a typical bag-of-words approach, the set of all words $W = \{w_1, \ldots, w_n\}$ present in an HTML document is extracted from the entire document. For the purposes of the results analysis agent, information contained in META tags is considered part of the document text. It is important to note that the ordering of these words, and any structure present in the text of the document itself, is not usually considered. Each document $d_i$ in a set of documents $D = \{d_1, \ldots, d_k\}$ is then represented by a feature vector $\vec{d}_i$ in which each dimension is mapped to a word $w_j \in W$ in the document space $D$. In other words, the words from all documents are combined into a common document vector, and each new document is then represented in terms of this common vector. Usually a boolean value is used to indicate whether a word is present in a document or not.
Other information present in the document, such as word positions or word frequencies, could also be used as features.
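The boolean representation described above can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function names (`tokenize`, `build_vocabulary`, `boolean_vector`), the regular-expression tokenizer and the two sample documents are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Combine the words of all documents into one sorted common vocabulary."""
    vocab = set()
    for doc in documents:
        vocab.update(tokenize(doc))
    return sorted(vocab)

def boolean_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical sample documents; word order inside each document is ignored.
docs = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(docs)
vectors = [boolean_vector(d, vocabulary) for d in docs]
```

Every document is expressed over the same vocabulary, so all vectors have the same length and can be compared dimension by dimension.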
The results analysis agent could use a bag-of-words approach to document representation and represent each document with a frequency-word vector. Each word that is present in the set of documents is mapped to a feature. An individual document $d_i$ is then represented as a vector of features $\vec{d}_i$ in which each feature holds a word frequency (i.e. the number of times the specific word appears in the document). This scheme is illustrated in figure 10.2 below.
[Figure 10.2 placeholder: an example document titled “An introduction to advanced dungeons & dragons”, with text beginning “Advanced dungeons & dragons or AD&D is a game system intended for multiple players and facilitates role-and-campaign playing in a fantasy world. The system ...”, mapped to a frequency-word vector.]
Figure 10.2: Document representation with a frequency-word vector [91].
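A frequency-word vector of the kind described in this section could be computed along the following lines. The sketch assumes a simple regular-expression tokenizer and two made-up documents; the names `tokenize` and `frequency_vector` are illustrative and do not come from the cited systems.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(document))
    # A Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocabulary]

# Hypothetical sample documents.
docs = ["dragons guard dungeons", "advanced dungeons and more dungeons"]
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})
freq_vectors = [frequency_vector(d, vocabulary) for d in docs]
```

Compared with a boolean vector, each dimension now records how many times the word occurs, so repeated words carry more weight.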
Feature reduction
One of the major issues with word-based document features is that the resulting document vectors have very high dimensionality. This can seriously increase the computational effort required to process large numbers of vectors in a relatively short space of time.
A number of different approaches have been devised for reducing the dimensionality of feature vectors. One of the most common is to remove from the feature vector any words that occur in a stop list. The stop list typically contains words that are common in the language in which most retrieved results will be written. For example, if most results are expected to be in English, a stop list containing the most common words in the language (e.g. “is”, “the”, “are”, etc.) could be constructed and these words then removed from the feature vector. A related simple measure for reducing the dimensionality of feature vectors is to remove words that occur infrequently inside a document (i.e. word frequency < predefined minimum frequency).
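The two reduction measures above, stop-list removal and a minimum-frequency cut-off, might be combined as in the sketch below. The stop list contents, the threshold value and the name `reduce_features` are assumptions chosen for illustration; a real stop list would be far larger.

```python
from collections import Counter

# Tiny illustrative stop list and threshold (both hypothetical).
STOP_WORDS = {"is", "the", "are", "a", "of"}
MIN_FREQUENCY = 2

def reduce_features(tokens, stop_words=STOP_WORDS, min_frequency=MIN_FREQUENCY):
    """Drop stop words, then drop words occurring fewer than min_frequency times."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return {word: n for word, n in counts.items() if n >= min_frequency}

tokens = "the agent is the agent of the search and the search is fast".split()
reduced = reduce_features(tokens)
```

In this example the stop words are removed first, and the remaining single-occurrence words ("and", "fast") fall below the threshold, leaving only two features.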
Another technique for feature reduction is stemming, or conflation. This technique attempts to replace morphologically related words with their common stem. Stemming algorithms are usually language specific, and use a combination of morphological analysis and dictionary lookups in order to stem word groups [93]. As an example, the English word “work” could replace “works”, “working” and “workable” in a specific feature vector.
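The effect of conflation on a frequency-word vector can be illustrated with a deliberately naive suffix stripper. This is a toy sketch only: the suffix list and the names `naive_stem` and `conflate` are hypothetical, and the code performs none of the morphological analysis or dictionary lookups that real stemming algorithms rely on.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("able", "ing", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

def conflate(frequencies):
    """Merge the frequencies of all words that share the same stem."""
    merged = {}
    for word, count in frequencies.items():
        stem = naive_stem(word)
        merged[stem] = merged.get(stem, 0) + count
    return merged

frequencies = {"work": 2, "works": 1, "working": 3, "workable": 1, "agent": 4}
conflated = conflate(frequencies)
```

Five features collapse into two here. The same rules would mis-stem many English words (for instance, "glass" would lose its final "s"), which is why practical stemmers are considerably more careful.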
Other, more advanced techniques such as latent semantic indexing (LSI) with singular value decomposition (SVD) have also been applied to the dimensionality reduction problem. Although a complete discussion of LSI is beyond the scope of this dissertation, the basic idea is that LSI maps synonymous and similar words to similar vectors, thereby reducing vector dimensionality. The interested reader is referred to the work by Chakrabarti for a more detailed discussion of LSI [93].
For the sake of simplicity and computational efficiency, the agent could use a combination of a simple frequency measure like the one discussed and a stop list containing common words.
Some authors have noted that surprisingly good results can be obtained in this manner [91].
Additional custom features
Using only word-based features, as described previously, can be seen as a somewhat “flat” representation, especially for web documents. As noted by many authors, documents on the WWW can have other attributes that are as important as the words themselves, or even more so, in the context of a user query. The inherent structure present in an HTML document could also contain features that could be used in classification.
Systems like Inquirus 2 extract additional page-specific attributes in parallel with the actual words contained in the document [83]. Some of these features are heuristic measures indicating whether a page is a homepage, a research paper or a general page. Other heuristic features, such as the number of words on the page, the number of images, the number of sections and the number of unique links on the page, are also considered.
The results analysis agent could follow a similar strategy by including heuristic features in the feature vector for a specific page. These features would then not map to a specific word, but to a specific attribute present on the page. This could potentially enrich the frequency-word vector representation with web page specific attributes and provide additional features for use in the generalization phase.
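A few of the page-level attributes mentioned above (word count, image count and unique link count) could be extracted with Python's standard `html.parser` module, as sketched below. The class name `PageFeatureParser`, the function `page_features`, the feature names and the sample page are assumptions for illustration; this does not reproduce how Inquirus 2 computes its features.

```python
from html.parser import HTMLParser

class PageFeatureParser(HTMLParser):
    """Collect simple heuristic attributes while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.word_count = 0
        self.image_count = 0
        self.links = set()  # a set keeps only unique link targets

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.image_count += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)

    def handle_data(self, data):
        # Counts whitespace-separated tokens in text nodes
        # (script and style content is not filtered out in this sketch).
        self.word_count += len(data.split())

def page_features(html):
    """Return heuristic page attributes that could extend a frequency-word vector."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "word_count": parser.word_count,
        "image_count": parser.image_count,
        "unique_link_count": len(parser.links),
    }

# Hypothetical sample page.
sample = (
    '<h1>Game guide</h1>'
    '<p>Read the <a href="/rules">rules</a> and the <a href="/faq">FAQ</a> '
    'or the <a href="/rules">rules</a> again</p>'
    '<img src="map.png"><img src="logo.png">'
)
features = page_features(sample)
```

The resulting values could be appended to the word-based dimensions of the feature vector, so that each extra dimension maps to a page attribute instead of a word.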