
Author’s role: Support

In the last few decades, technological advances have made it easy to instantly access vast numbers of natural language documents, provided that you can determine exactly which document you wish to access. The field of Information Retrieval focuses on indexing, organizing, and summarizing natural language corpora so that desired information can be located quickly. For example, the Google search engine is an information retrieval tool for documents on the internet. Topic modeling is a branch of information retrieval that seeks to organize documents into groups with similar semantic topics.

The Latent Dirichlet Allocation (LDA) model is a generative probabilistic model for topic modeling [5, 22]. LDA models topics as probability distributions over words.

For a language with $W$ words, each topic $\vec{\beta}_i$ is a $W$-element vector sampled from a symmetric Dirichlet distribution

$$\vec{\beta}_i \sim \mathrm{Dirichlet}(\gamma).$$

Each document $\vec{D}_j$ is represented as a bag of words. Associated with each document is a probability distribution over topics, denoted $\vec{\theta}_j$. For an LDA model with $N$ topics, each $\vec{\theta}_j$ is an $N$-element vector sampled from a symmetric Dirichlet distribution

$$\vec{\theta}_j \sim \mathrm{Dirichlet}(\alpha).$$

Figure 6-17: The Blaise SDK for the Latent Dirichlet Allocation model. The left half of the model defines the topics; the right half of the model defines the corpus, including the assignment of each word of each document to a topic. To perform inference most efficiently in this model, the θ and β weights would be integrated out via Dirichlet-multinomial conjugacy, leaving only the corpus with its word assignments in the state space.

Each word $\vec{D}_j[k]$ is generated by first sampling a topic $\vec{z}_j[k]$ using the document's distribution over topics

$$\vec{z}_j[k] \sim \mathrm{Multinomial}(\vec{\theta}_j)$$

then sampling a word from that topic's distribution over words

$$\vec{D}_j[k] \sim \mathrm{Multinomial}(\vec{\beta}_{\vec{z}_j[k]}).$$
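The generative process described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Blaise implementation; the function names `sample_dirichlet` and `sample_lda_corpus`, the integer encoding of words, and the fixed document lengths are all assumptions made for the example.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_lda_corpus(num_topics, vocab_size, doc_lengths, alpha, gamma, seed=0):
    """Sample topics, document-topic weights, words, and topic assignments."""
    rng = random.Random(seed)
    # beta[i] is topic i's distribution over the W vocabulary words.
    beta = [sample_dirichlet(rng, gamma, vocab_size) for _ in range(num_topics)]
    theta, docs, assignments = [], [], []
    for length in doc_lengths:
        # theta_j is document j's distribution over the N topics.
        theta_j = sample_dirichlet(rng, alpha, num_topics)
        # z_j[k] ~ Multinomial(theta_j): one topic per word position.
        z_j = rng.choices(range(num_topics), weights=theta_j, k=length)
        # D_j[k] ~ Multinomial(beta[z_j[k]]): one word per assigned topic.
        d_j = [rng.choices(range(vocab_size), weights=beta[t], k=1)[0] for t in z_j]
        theta.append(theta_j)
        docs.append(d_j)
        assignments.append(z_j)
    return beta, theta, docs, assignments
```

For instance, `sample_lda_corpus(3, 10, [5, 8], alpha=0.5, gamma=0.5)` returns three topics over a ten-word vocabulary and two documents of lengths 5 and 8, with words encoded as integers in `range(10)`.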

Intuitively, each document in an LDA model has an effective distribution over words produced by a linear combination of the $N$ topic-word distributions $\vec{\beta}_i$, weighted by the document-topic distribution $\vec{\theta}_j$. Thus the topic-word distribution vectors form a linear algebraic basis for the intuitive document-word distribution. Document likelihoods are maximized when the document-word distributions most closely match the observed word frequencies in the documents. The intuitive goal of inference is therefore to determine an appropriate basis set $\vec{\beta}_i$ from which to construct the document-word distributions. These basis vectors are probability distributions, so they can contain only nonnegative values, and one topic cannot cancel probability mass contributed by another. It follows that the best vectors for the basis set will be those that put probability mass on words that are typically used in the same document, that is, words that are about the same topic of discussion.
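Written out, the effective document-word distribution is $p(w \mid \vec{\theta}_j, \vec{\beta}) = \sum_{i=1}^{N} \vec{\theta}_j[i]\, \vec{\beta}_i[w]$. The short sketch below computes this mixture for a toy case; the helper name `document_word_distribution` and the two-topic, three-word numbers are invented purely for illustration.

```python
def document_word_distribution(theta_j, beta):
    """Mix the topic-word distributions beta using document weights theta_j."""
    vocab_size = len(beta[0])
    return [
        sum(theta_j[i] * beta[i][w] for i in range(len(beta)))
        for w in range(vocab_size)
    ]

# Toy example: topic 0 splits its mass over words 0 and 1,
# while topic 1 puts all of its mass on word 2.
beta = [[0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0]]
theta_j = [0.75, 0.25]
mixture = document_word_distribution(theta_j, beta)
# mixture is [0.375, 0.375, 0.25]
```

Because both the weights and the topic entries are nonnegative and each sums to one, the result is itself a probability distribution over words.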

Beau Cronin (MIT Brain and Cognitive Sciences, Navia Systems, Inc.) designed and implemented LDA in Blaise (see figure 6-17 for an implementation sketch). As a brief demonstration, the LDA implementation was used to perform topic analysis on the introductions of 32 Wikipedia [1] articles. Example input is seen in figure 6-18.

The topics extracted with this model can be seen in figure 6-19.
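The caption of figure 6-17 notes that inference is most efficient when the θ and β weights are integrated out via Dirichlet-multinomial conjugacy, leaving only the word-topic assignments in the state space. One common way to realize that idea is a collapsed Gibbs sampler. The sketch below is a generic standard-library version under that assumption; it is not the Blaise implementation, and the function name, the integer word encoding, and the count-table layout are hypothetical choices for illustration.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha, gamma, sweeps=100, seed=0):
    """Resample each word's topic with theta and beta integrated out."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]             # words in doc j assigned to topic i
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # copies of word w in topic i
    topic_total = [0] * num_topics                           # words assigned to topic i
    z = []
    # Random initial assignments, with counts kept in sync.
    for j, doc in enumerate(docs):
        z_j = [rng.randrange(num_topics) for _ in doc]
        z.append(z_j)
        for w, t in zip(doc, z_j):
            doc_topic[j][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(sweeps):
        for j, doc in enumerate(docs):
            for k, w in enumerate(doc):
                # Remove the current assignment from the counts.
                t = z[j][k]
                doc_topic[j][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[j][i] + alpha)
                    * (topic_word[i][w] + gamma)
                    / (topic_total[i] + vocab_size * gamma)
                    for i in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Record the new assignment.
                z[j][k] = t
                doc_topic[j][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word
```

After sampling, estimates of each document's topic weights and each topic's word weights can be read off the count tables by adding the corresponding prior parameter and normalizing each row.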

Ontology:

Ontology is a study of conceptions of reality and the nature of being.

In philosophy, ontology is the study of being or existence and forms the basic subject matter of metaphysics. It seeks to describe or posit the basic categories and relationships of being or existence to define entities and types of entities within its framework.

Some philosophers, notably of the Platonic school, contend that all nouns refer to entities. Other philosophers contend that some nouns do not name entities but provide a kind of shorthand way of referring to a collection (of either objects or events). In this latter view, mind, instead of referring to an entity, refers to a collection of mental events experienced by a person; society refers to a collection of persons with some shared interactions, and geometry refers to a collection of a specific kind of intellectual activity.

As a philosophical subject, ontology chiefly deals with the precise utilization of words as descriptors of entities or realities. Any ontology must give an account of which words refer to entities, which do not, why, and what categories result. When one applies this process to nouns such as electrons, energy, contract, happiness, time, truth, causality, and God, ontology becomes fundamental to many branches of philosophy.

Reality:

Reality, in everyday usage, means “the state of things as they actually exist.” The term reality, in its widest sense, includes everything that is, whether or not it is observable or comprehensible. Reality in this sense may include both being and nothingness, whereas existence is often restricted to being (compare with nature).

In the strict sense of philosophy, there are levels or gradation to the nature and conception of reality. These levels include, from the most subjective to the most rigorous: phenomenological reality, truth, fact, and axiom.

Figure 6-18: The introductions to 32 Wikipedia [1] articles were used as documents to demonstrate Latent Dirichlet Allocation topic analysis. Shown here are 2 of the 32 documents. Stop words (such as “the,” “an,” “to,” and “or”) were removed and rudimentary word stemming was performed before analysis. Wikipedia articles are copyrighted by Wikipedia contributors and licensed under the GNU Free Documentation License; these excerpts are believed to be covered by fair use.
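The preprocessing mentioned in the caption of figure 6-18 (stop-word removal followed by rudimentary stemming) might look roughly like the sketch below. The source does not specify the stop-word list or the stemming rules that were used, so the `STOP_WORDS` set, the suffix list, and the function names here are assumptions chosen only to illustrate the idea.

```python
import re

# Hypothetical stop-word list; the caption names only "the", "an", "to", and "or".
STOP_WORDS = {"the", "an", "to", "or", "a", "of", "and", "is", "in"}

def crude_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, and stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Ontology is the study of being or existence.")
# tokens is ["ontology", "study", "being", "existence"]
```

The resulting token lists can then be mapped to integer word indices to form the bag-of-words documents that LDA operates on.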

Figure 6-19: This figure depicts the topics LDA infers when applied to the dataset described in figure 6-18, assuming 4 topics. Nodes are documents, labeled by their Wikipedia title, colored by their predominant topic, and projected from the 4-dimensional topic simplex to a 2-dimensional space for visualization. The most common words in each topic are also displayed, with size proportional to frequency in the topic.