2.5
Extraction of latent semantic information
Ideally, features extracted from documents should uniquely represent high-level concepts and topics in order to give an accurate representation of their contents. In practice, words and other features are generally far from this ideal, because many words are frequently used in each topic and each word may be used in more than one topic. From the point of view of the language, a word can have multiple meanings (polysemy) and the same concept can be expressed by several words (synonymy). Additionally, some words may have different meanings but still be somehow related.
These properties sometimes make it difficult to correctly spot similarities and correlations within data. For example, two documents may discuss the same topic using different words, so that they appear to be unrelated to each other. However, the relevant words of both documents may frequently occur together in other related documents, whether few or many: this information can serve to infer that all those words are semantically related, and thus that the two example documents are potentially related even though they differ significantly in the words they contain.
Solutions exist based on the use of external knowledge bases, which are described in the following section. However, another possible approach is to analyze the available documents to recognize recurring dependencies between words, which are usually indicative of relatedness between them. These techniques to extract latent semantic information from documents are based on statistics and probability and are used across different text mining and general information retrieval applications.
2.5.1
Latent semantic analysis
Latent semantic analysis (LSA) [30], also known as latent semantic indexing (LSI), is a general technique to analyze relationships between documents and terms in a collection, extract high-level concepts and transform the representation of documents according to the identified relationships.
In short, LSA maps the documents of a collection and the terms therein into a latent feature space, whose dimensions ideally correspond to high-level concepts or components. Each document is therefore represented as a weighted mix of such components, while each term may similarly be related to several concepts to different degrees. This scheme is very similar to principal component analysis, which is used to map a vector space with possible correlations between dimensions to another space without such correlations.
Given a collection of n documents and m distinct terms extracted from them, applying LSA requires building an m × n term-document matrix X, where each cell $x_{i,j}$ contains the weight of term $t_i$ in document $d_j$. The columns of X correspond in practice to the bags of words of the documents, with terms weighted according to some scheme: those presented above in Section 2.4.3 can be used, although different schemes based on entropy are often effective in this case.
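As an illustration of the construction just described, the following minimal sketch builds a term-document matrix from a toy collection. The documents, the `term_document_matrix` helper and the use of raw counts as weights are assumptions made for the example; any of the weighting schemes mentioned above could replace the counts.

```python
from collections import Counter

# Toy collection (hypothetical); each document is already tokenized.
docs = [
    ["car", "engine", "road"],
    ["automobile", "engine", "wheel"],
    ["car", "automobile", "road", "road"],
]

# A sorted vocabulary gives every term t_i a fixed row index i.
terms = sorted({t for doc in docs for t in doc})

def term_document_matrix(docs, terms):
    """Return the m x n matrix X as a list of rows.

    X[i][j] is the raw count of term i in document j (an assumed,
    deliberately simple weighting scheme).
    """
    counts = [Counter(doc) for doc in docs]
    return [[c[t] for c in counts] for t in terms]

X = term_document_matrix(docs, terms)

for term, row in zip(terms, X):
    print(f"{term:<12}{row}")
```

Each column of `X` is the bag of words of one document, and each row collects the weights of one term across the collection.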
Within this matrix, the dot product (or the cosine similarity) can be computed between two rows (terms) or two columns (documents) to estimate their correlation. A whole correlation matrix for terms or for documents can be obtained by computing $XX^T$ or $X^TX$, respectively.
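The two correlation matrices can be sketched in a few lines of plain Python. The matrix below is hypothetical toy data (five terms, three documents), and the helper names `dot` and `cosine` are chosen only for this example.

```python
import math

def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Toy 5 x 3 term-document matrix: rows are terms, columns are documents.
X = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 2],
    [0, 1, 0],
]

# Columns of X, i.e. the document vectors.
columns = [list(col) for col in zip(*X)]

# X X^T: entry (a, b) is the dot product of term rows a and b (m x m).
term_corr = [[dot(r1, r2) for r2 in X] for r1 in X]

# X^T X: entry (a, b) is the dot product of document columns a and b (n x n).
doc_corr = [[dot(c1, c2) for c2 in columns] for c1 in columns]
```

Both matrices are symmetric; using `cosine` in place of `dot` gives the length-normalized variant.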
Singular value decomposition (SVD) is then applied to the term-document matrix: it is a mathematical technique that decomposes the original matrix X into the product of three matrices.
$$X = U \Sigma V^T$$
Of the resulting matrices, U and V have orthonormal columns and are sized m × r and n × r respectively, where r is the rank of X, while Σ is an r × r diagonal matrix containing the singular values of X (the square roots of the non-zero eigenvalues of $X^TX$). The rationale is that each of the r singular values corresponds to one of the aforementioned high-level components traced in the collection of documents and denotes how relevant it is throughout the collection.
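The decomposition can be checked numerically. The sketch below assumes NumPy and a hypothetical toy matrix; `numpy.linalg.svd` with `full_matrices=False` returns factors with min(m, n) components, which coincides with r only when X has full rank, as it does here.

```python
import numpy as np

# Toy 5 x 3 term-document matrix (hypothetical counts);
# rows are terms, columns are documents.
X = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 2],
    [0, 1, 0],
], dtype=float)

# Thin SVD: U is m x p, s holds the p diagonal values of Sigma,
# Vt is V^T (p x n), with p = min(m, n).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)

# Multiplying the three factors gives back X up to floating-point error.
X_rebuilt = U @ Sigma @ Vt

print("diagonal of Sigma:", s)
print("max reconstruction error:", np.abs(X - X_rebuilt).max())
```

The values in `s` are returned in decreasing order, and their squares are the eigenvalues of $X^TX$.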
Singular values are sorted along the diagonal of Σ in decreasing order, so that the first ones are related to the most relevant components. This makes it easy to keep only the k ≤ r most important components, simply by removing the corresponding rows and columns from the three matrices (the last r − k columns of U and V, and the last r − k rows and columns of Σ). This reduction can potentially remove noise from the data, constituted for example by terms or groups of terms that appear in few documents and are poorly related to the other ones.
Once such a value k is set, an approximated version of the original term-document matrix X can be built by multiplying the three reduced matrices: the resulting matrix X′ has its rank reduced from r to k. X′ is structurally identical to X (its rows and columns represent the same terms and documents as X), but term weights are corrected so that noise is removed and evident correlations between terms (or between documents) are accounted for. For example, if two terms $t_a$ and $t_b$ frequently occur together in documents, a document containing only $t_a$ of the two will generally have a weight for $t_b$ higher than zero (and vice versa).
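The effect described in this example can be reproduced on a small scale. The following sketch assumes NumPy; the matrix is hypothetical toy data in which two terms co-occur in two documents and a third document contains only the first of them, and `truncated_reconstruction` is a helper name introduced here.

```python
import numpy as np

def truncated_reconstruction(X, k):
    """Rank-k approximation of X keeping the k largest SVD components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rows: t_a, t_b, t_c, t_d. Columns: five documents (toy data).
# t_a and t_b co-occur in documents 0 and 1; document 2 has only t_a.
X = np.array([
    [1, 1, 1, 0, 0],   # t_a
    [1, 1, 0, 0, 0],   # t_b
    [0, 0, 0, 1, 1],   # t_c
    [0, 0, 0, 1, 1],   # t_d
], dtype=float)

X_approx = truncated_reconstruction(X, k=2)

# Weight of t_b in document 2: zero in X, positive after the reduction.
weight_tb_in_doc2 = X_approx[1, 2]
print("original:", X[1, 2], "corrected:", weight_tb_in_doc2)
```

With k = 2 the two groups of co-occurring terms each keep one component, and the document that contains only the first term acquires a positive weight for the second.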
From the reconstructed matrix X′, or directly from the truncated matrices used to compute it, similarity between terms and between documents can be computed according to the corrected weights, and will generally differ from the corresponding similarity computed from the original matrix. In the common case where the documents most related to a query must be found, the usual approach represents the query like a document to be compared with the known ones; the query should first be mapped into the latent space so that it undergoes the same correction of values, a procedure known as fold-in. In the latent space, related documents can be found that do not contain the exact words of the query but only closely related ones.
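A minimal sketch of fold-in follows, assuming NumPy and the commonly used projection $\hat{q} = \Sigma_k^{-1} U_k^T q$, which the text above does not spell out. The toy matrix, the query and the helper names `fold_in` and `cosine` are illustrative assumptions; documents are compared through the rows of the truncated V, which is one of several possible conventions.

```python
import numpy as np

def fold_in(q, U_k, s_k):
    """Project a term-space vector q (length m) into the k-dim latent space."""
    return (U_k.T @ q) / s_k

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0 else 0.0

# Rows: t_a, t_b, t_c, t_d. Columns: five documents (toy data).
X = np.array([
    [1, 1, 1, 0, 0],   # t_a
    [1, 1, 0, 0, 0],   # t_b
    [0, 0, 0, 1, 1],   # t_c
    [0, 0, 0, 1, 1],   # t_d
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vectors = Vt[:k, :].T   # row j is document j in the latent space

# The query contains only t_b; document 2 contains only t_a.
query = np.array([0, 1, 0, 0], dtype=float)
q_latent = fold_in(query, U_k, s_k)

raw_similarity = cosine(query, X[:, 2])                # no shared words
latent_similarity = cosine(q_latent, doc_vectors[2])   # related in latent space
print(raw_similarity, latent_similarity)
```

Folding in a column of X with the same function returns the corresponding row of `doc_vectors`, so queries and known documents end up in the same space.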
2.5.2
Probabilistic models
The LSA technique described above is based on singular value decomposition, which implicitly assumes a normal distribution of weights in the term-document matrix: this modeling is not fully accurate, although particular weighting schemes can make it work better. For this reason, improved techniques have been proposed that are based on different probability models, in particular on multinomial models, which better represent the occurrences of words in documents.
A first extension of the basic LSA technique is probabilistic latent semantic analysis (PLSA) [49], which considers a probabilistic model based on a hidden class variable z ∈ Z, whose values correspond to the components (dimensions of the latent space) of LSA. In practice, each word and each document under analysis is considered to have affinities to these latent classes, which ideally represent topics, each with specific recurring words. From this, the occurrence of a word w in a document d is seen as a mixture of these classes; in another parameterization, a document is seen as a mixture of classes, which are in turn seen as mixtures of words.
$$P(d, w) = \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z) = P(d) \sum_{z \in Z} P(z \mid d)\, P(w \mid z)$$
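The equivalence of the two parameterizations can be verified numerically. In the sketch below all probability tables are made-up values for two latent classes, three documents and three words; P(d) and P(z|d) are derived from P(z) and P(d|z) through Bayes' rule.

```python
# Hypothetical PLSA parameters: 2 latent classes, 3 documents, 3 words.
P_z = [0.6, 0.4]
P_d_given_z = [[0.5, 0.3, 0.2],   # P(d | z=0)
               [0.1, 0.2, 0.7]]   # P(d | z=1)
P_w_given_z = [[0.7, 0.2, 0.1],   # P(w | z=0)
               [0.1, 0.3, 0.6]]   # P(w | z=1)

Z, D, W = range(2), range(3), range(3)

# First form: P(d, w) = sum_z P(z) P(d|z) P(w|z)
joint_symmetric = [
    [sum(P_z[z] * P_d_given_z[z][d] * P_w_given_z[z][w] for z in Z) for w in W]
    for d in D
]

# Marginal P(d) and posterior P(z|d), obtained with Bayes' rule.
P_d = [sum(P_z[z] * P_d_given_z[z][d] for z in Z) for d in D]
P_z_given_d = [[P_z[z] * P_d_given_z[z][d] / P_d[d] for z in Z] for d in D]

# Second form: P(d, w) = P(d) sum_z P(z|d) P(w|z)
joint_asymmetric = [
    [P_d[d] * sum(P_z_given_d[d][z] * P_w_given_z[z][w] for z in Z) for w in W]
    for d in D
]

max_gap = max(
    abs(joint_symmetric[d][w] - joint_asymmetric[d][w]) for d in D for w in W
)
print("largest difference between the two forms:", max_gap)
```

The two tables agree up to floating-point error, and the entries of each sum to one, as expected of a joint distribution over document-word pairs.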