In this section, we give a more formal discussion of context modeling in text data by formally defining the important concepts.
3.3.1 Definitions of Basic Concepts
Let us introduce some notation and definitions that will be referred to frequently in the rest of this thesis. We start with the definitions of a few core concepts of contextual text mining.
We first introduce D as a notation for the collection of text data. Since the content of text data is commonly formatted with natural language, we first define the language units in text.
Definition 3.1 (Language Unit): We use w to denote a basic language unit in text. Such a language unit could be a word, a concept, a phrase, an n-gram, an entity, or any other unit that carries semantic meaning. We refer to the smallest language units that carry semantic meaning as basic units. The most common basic units in text are words (or terms in the context of information retrieval), so we use w to denote a word or a textual term in the rest of this thesis unless otherwise specified. We define a vocabulary, V, as the set of all w in D.
Please note that in some scenarios there are symbolic tokens that are not natural language but still carry semantic meaning, such as the label of a product, the symbol of a gene in the biology literature, or the URL of a web page in query logs. In these cases, we relax the definition of language units so that it also covers those symbolic tokens. We then define a document in the text collection.
Definition 3.2 (Document): A text document, d, in a text collection D is a sequence of words $w_1 w_2 \ldots w_{|d|}$, where $w_i$ is a word from a fixed vocabulary V. Following a common simplification in most work in text mining [61, 10], we can represent a document as a bag of words, i.e., $d = \{w_1, w_2, \ldots, w_{|d|}\}$. We use c(w, d) to denote the number of occurrences of word w in d.
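To make Definitions 3.1 and 3.2 concrete, the following minimal sketch builds bag-of-words documents, the vocabulary V, and the count function c(w, d). The two-document collection and the whitespace tokenizer are illustrative assumptions, not part of the definitions.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return Counter(text.lower().split())

# Hypothetical toy collection D, keyed by document id.
collection = {
    "d1": "text mining finds patterns in text",
    "d2": "context shapes the meaning of text",
}

# Each document d is represented as a bag of words.
bags = {doc_id: bag_of_words(text) for doc_id, text in collection.items()}

# The vocabulary V is the set of all words w observed in D.
vocabulary = set().union(*bags.values())

def c(w, d):
    """Number of occurrences of word w in document d (0 if w is absent)."""
    return bags[d][w]

print(c("text", "d1"), len(vocabulary))
```

A Counter is used because it returns zero for unseen words, which matches the convention that c(w, d) = 0 when w does not occur in d.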
Definition 3.3 (Context Feature): A context feature of text, or a context variable, X, is an attribute of the text whose values can define a partition of the text collection D. Such an attribute can be intrinsic to the content, such as the appearance of a term or a topic, or extrinsic, such as time, location, or authorship. Please note that the value of X can be either discrete, such as $X_{Author}$ = "Jim Gray", or continuous, such as $X_{Year} < 2006$. We use uppercase X to denote the variable of a context feature, and lowercase x to denote the value of the feature.
Definition 3.4 (Context): A context of text, c, is a meaningful condition of text data that reflects the situation in which the text is produced. Formally, a context c corresponds to a condition defined by the values of a set of context features, i.e., $\{X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N\}$. We use C to denote the set of all meaningful contexts. Following this definition, each context defines a subspace of the text collection D, which consists of all language units that satisfy this condition.
We further define this set of language units, $D_c$, as the domain of a context c. We require that $D_c \neq \emptyset$. We say that $c_1$ and $c_2$ overlap with each other if $D_{c_1} \cap D_{c_2} \neq \emptyset$. From the definition of context, we can also see that the smallest context covers a single language unit w, and the largest context covers D itself. A document d also corresponds to a special type of context. In a particular problem of contextual text mining, we only select the meaningful contexts that we are interested in. It is also worth noting that the various values of a particular context feature can define a set of contexts (e.g., the time context).
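As an illustration of Definition 3.4, a context can be sketched as a predicate over the context features attached to each language unit, with its domain obtained by filtering the collection. The Unit record, the sample data, and the helper names below are hypothetical and serve only to illustrate the definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    """A language unit together with its extrinsic context features."""
    word: str
    author: str
    year: int

# Hypothetical collection D of annotated language units.
D = [
    Unit("database", "Jim Gray", 2004),
    Unit("transaction", "Jim Gray", 2006),
    Unit("retrieval", "Another Author", 2004),
]

# Each context is a condition on the values of context features.
def by_gray(u):
    return u.author == "Jim Gray"

def before_2006(u):
    return u.year < 2006

def from_2006(u):
    return u.year >= 2006

def domain(context, units=D):
    """The domain of a context: all units that satisfy its condition."""
    return [u for u in units if context(u)]

def overlap(c1, c2, units=D):
    """Two contexts overlap if their domains share at least one unit."""
    return any(c1(u) and c2(u) for u in units)

print([u.word for u in domain(by_gray)])
print(overlap(by_gray, before_2006), overlap(before_2006, from_2006))
```

Here the author context and the year context overlap, whereas the two year contexts partition the collection and therefore do not.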
Please note that the aforementioned definitions are general enough to capture different approaches to contextual text mining. When a generative view of text mining is taken, we also need a few key definitions related to generative models of text.
Let us first introduce a generative model M for D, which is a probability distribution over the observations of language units in D. M corresponds to the random process by which the language units in D are generated. When we assume that the basic language units w are generated independently, M can be characterized by the distribution $\{p(w \mid M)\}_{w \in V}$. The likelihood of generating D with the probabilistic model M is written as $p(D \mid M)$. M is usually used as a representation of D and is also called a language model. When M models context information as well as content, we call it a contextual language model.
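Under the independence assumption, the likelihood factorizes over the observed language units, so log p(D|M) is a sum of log p(w|M) terms. The sketch below estimates M by relative frequency (maximum likelihood), which is one common choice and an assumption here; the text above does not prescribe an estimator. The token list is a hypothetical example.

```python
import math
from collections import Counter

def fit_unigram(tokens):
    """Estimate {p(w|M)} by relative frequency (an assumed estimator)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def log_likelihood(tokens, model):
    """log p(D|M) when units are generated independently.

    No smoothing is applied, so a word missing from the model raises KeyError.
    """
    return sum(math.log(model[w]) for w in tokens)

tokens = "the cat saw the dog".split()  # hypothetical text data
M = fit_unigram(tokens)
print(M["the"], log_likelihood(tokens, M))
```

Working in log space avoids numerical underflow when the collection is large.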
Similarly, we introduce the following definitions related to contextual language models:
Definition 3.5 (Context Model): For a context c, a context model $M_c$ is a probabilistic model that explains the generative process of language units in $D_c$. Similarly, $M_c$ is usually used as a representation of the context c, and can be characterized by $\{p(w \mid M_c)\}_{w \in V}$.
Definition 3.6 (Topic/Theme): A topic or a theme, T, is a semantically coherent subject of a discourse in a text collection. In this thesis, we formally define a topic as a latent context in D. We use T to denote the context feature for topics. If a language unit satisfies the condition T = t, we say that it belongs to the latent context of topic t. The context model that represents a topic is called a topic model, denoted as $\theta_t$, which is characterized by a probability distribution over words $\{p(w \mid \theta_t)\}_{w \in V}$. Clearly, we have $\sum_{w \in V} p(w \mid \theta_t) = 1$. We assume that there are altogether k topics in D.
Definition 3.7 (Contextual Pattern): A contextual pattern in text is defined as a pattern that can be derived from conditional distributions involving language units and various contexts. In particular, the conditional distributions include the context model $p(w \mid c)$, the distribution of context given a language unit, $p(c \mid w)$, and the distribution of one type of context given another type of context, $p(c' \mid c)$.
We refer to the interesting conditional distributions as basic contextual patterns, and patterns derived from postprocessing the basic contextual patterns as refined contextual patterns.
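As a rough illustration of basic contextual patterns, the two conditional distributions p(w|c) and p(c|w) can be estimated from co-occurrence counts of words and contexts. The sketch below uses relative frequencies on hypothetical (word, year) observations with non-overlapping time contexts; both the data and the estimator are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical observations: each word occurrence tagged with a time context.
observations = [
    ("storm", "2005"), ("storm", "2005"), ("relief", "2005"),
    ("storm", "2006"), ("election", "2006"), ("election", "2006"),
]

pair_counts = Counter(observations)
word_totals = Counter(w for w, _ in observations)
context_totals = Counter(ctx for _, ctx in observations)

def p_word_given_context(w, ctx):
    """Relative-frequency estimate of the context model p(w|c)."""
    return pair_counts[(w, ctx)] / context_totals[ctx]

def p_context_given_word(ctx, w):
    """Relative-frequency estimate of p(c|w)."""
    return pair_counts[(w, ctx)] / word_totals[w]

print(p_word_given_context("storm", "2005"))
print(p_context_given_word("2006", "election"))
```

A refined contextual pattern could then be obtained by postprocessing these tables, for example by tracking p(w|c) for one word across consecutive years.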
We can see that this definition of contextual pattern covers a wide range of mining products of contextual text mining tasks as instantiations. For example, topic modeling [59, 10] aims at discovering conditional distributions of words given topics as a representation of the topics in text; temporal text mining attempts to extract topic life cycles [113], which can be represented by, or extracted by refining, the conditional distributions of topics given time context and of time given topics; author-topic analysis aims at discovering conditional distributions of topics given a particular author and conditional distributions of words given different authors and topics [162, 114]. We will discuss specific instantiations of contextual patterns in Chapter 4 and in the concrete contextual text mining applications later in this thesis.
When contexts do not overlap, it is clear that all the words in the domain of c are generated based on $M_c$. However, when w is in the domain of multiple contexts, it may be generated based on any of those contexts. We define c to be the active context for w if w is generated using $M_c$. Clearly, the context distribution $p(c \mid w)$ defines how likely c is to be the active context of w. We define all the language units that are actually generated by $M_c$ as the active domain of c.
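One way to reason about the active context is through Bayes' rule: if each candidate context c has a prior weight p(c) and a context model p(w|M_c), then p(c|w) is proportional to p(c) p(w|M_c). The sketch below follows this reading; the priors, the two context models, and the function name are hypothetical, and the thesis does not commit to this particular computation here.

```python
def active_context_posterior(w, context_models, priors):
    """Return p(c|w) for every context c, computed with Bayes' rule.

    context_models maps a context name to {word: p(w|M_c)};
    priors maps a context name to an assumed prior weight p(c).
    """
    joint = {c: priors[c] * model.get(w, 0.0)
             for c, model in context_models.items()}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError(f"no context model can generate {w!r}")
    return {c: value / total for c, value in joint.items()}

# Hypothetical overlapping contexts that can both generate "game".
context_models = {
    "sports": {"game": 0.5, "team": 0.5},
    "tech": {"game": 0.1, "code": 0.9},
}
priors = {"sports": 0.5, "tech": 0.5}

posterior = active_context_posterior("game", context_models, priors)
print(posterior)
```

In this toy setting "sports" is the more likely active context for "game", while "code" could only have been generated by the "tech" context.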
We can also define other useful concepts for contextual text mining. We leave other definitions to specific chapters in the rest of this thesis.
3.3.2 A Taxonomy of Context
The definition of context is quite broad. We can see that the definition unifies many different notions of context, including both the linguistic context and the situational context, as long as it corresponds to an evaluable condition and defines a subspace of the text. Following the definitions, we can give a more in-depth discussion about different ways to categorize contexts.
Explicit Context and Implicit Context
We can divide contexts into explicit contexts and implicit contexts. Recall that every context corresponds to a condition with particular values of context variables. Every language unit that satisfies this condition belongs to the domain of this context. In most cases, whether a language unit satisfies the condition is deterministic. For example, whether a document is produced at some time, published at some location, or written by some author is deterministic. The context corresponding to such a condition is called an explicit context, such as time, location, and authorship. Some conditions, however, are not deterministic for some language units; whether such a unit satisfies the condition is somewhat implicit. The context corresponding to such a condition is called an implicit context. As a result, the domain of an explicit context is well defined, whereas the domain of an implicit context is vague and must be inferred from the text data. We will introduce the modeling of explicit context in Chapter 5 and the modeling of implicit context in Chapter 6.
Context of Various Granularity
We can also distinguish contexts according to the granularity at which a context applies. Some contexts are coarse, with domains that cover multiple documents, e.g., time and authorship. The largest (but trivial) context is the whole collection. Other contexts are finer, with domains that cover only a sentence or even a few words, e.g., the words next to the word "mining." Based on the size of the domain, we can categorize a context as a document-level context, a sentence-level context, a local adjacency context, etc.
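The local adjacency context mentioned above, i.e., the words next to the word "mining," can be sketched as a sliding window around each occurrence of a target word. The function name, the window size, and the sample sentence are illustrative assumptions.

```python
def adjacency_context(tokens, target, window=1):
    """Collect the words within `window` positions of each occurrence of target."""
    neighbors = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            neighbors.extend(left + right)
    return neighbors

# Hypothetical sentence used only for illustration.
tokens = "contextual text mining finds patterns by mining text".split()
print(adjacency_context(tokens, "mining"))
```

Widening the window moves this context from a local adjacency context toward a sentence-level one.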
Complex Context
Contexts are not always independent. Time contexts follow the structure of a linear chain; every location has its adjacent locations; different people form a social network structure. These dependent contexts form a complex system by themselves. Such a complex structure of contexts introduces important criteria for context modeling. In general, we use complex context to denote the structure of contexts, which we will discuss in detail in Chapter 7.
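A simple way to picture such structure is as a graph whose nodes are contexts and whose edges connect dependent contexts. The sketch below builds the linear chain for time contexts as an undirected adjacency map; the representation and the year labels are assumptions for illustration, not the modeling approach of Chapter 7.

```python
def linear_chain(contexts):
    """Link each context to its immediate predecessor and successor."""
    neighbors = {c: set() for c in contexts}
    for earlier, later in zip(contexts, contexts[1:]):
        neighbors[earlier].add(later)
        neighbors[later].add(earlier)
    return neighbors

# Hypothetical time contexts, already in chronological order.
years = ["2004", "2005", "2006"]
chain = linear_chain(years)
print(chain["2005"])
```

The same adjacency-map representation would also accommodate location adjacency or a social network of authors by supplying different edges.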
3.3.3 Tasks of Contextual Text Mining
Based on the definitions and discussion above, we can introduce the general tasks of contextual text mining, which include:
1. constructing a reasonable contextual language model M for the text data;
2. discovering contextual patterns from text; and
3. further analysis based on the context models and contextual patterns.
When topics are involved in the contexts, we can use "contextual topic patterns" to replace "contextual patterns" in tasks 2 and 3.
Please note that the third task is defined rather broadly and covers many kinds of analysis based on the output of the first two tasks. Once we have the representation of contexts (e.g., the context model), we can summarize a context; we can compute the similarity of different contexts; we can group similar contexts; we can also categorize, score, and rank contexts; and we can compare the meanings of language units, topics, and other patterns across different contexts.
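As one possible instance of the third task, contexts can be compared and ranked through their context models. The sketch below uses cosine similarity between word distributions, which is an assumed choice of measure (the text does not fix one), applied to three hypothetical time contexts.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in set(p) | set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

def rank_by_similarity(target, models):
    """Rank all other contexts by similarity of their models to the target's."""
    others = [c for c in models if c != target]
    return sorted(others, key=lambda c: cosine(models[target], models[c]),
                  reverse=True)

# Hypothetical context models {p(w|M_c)} for three time contexts.
models = {
    "2005": {"storm": 0.6, "relief": 0.4},
    "2006": {"storm": 0.5, "election": 0.5},
    "2007": {"election": 1.0},
}
print(rank_by_similarity("2005", models))
```

Other measures, such as a divergence between the two distributions, could be substituted for cosine similarity without changing the ranking logic.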
These basic tasks unify many text mining problems that involve context information. Topic modeling [59, 10] tries to construct probabilistic topic models for text and to discover word distributions for topics; temporal text mining tries to extract topic life cycles and evolutionary topic patterns [113]; spatiotemporal topic analysis attempts to model blog articles with temporal and geographic information and to discover the diffusion patterns of topics over time and location [111]; opinion summarization aims at modeling the mixture of topics and sentiments and comparing topics under different sentiment contexts [110]; personalized search aims at modeling the user's search history and predicting the likelihood that a user clicks a URL when she issues a query [108]. We will introduce a general instantiation of contextual text mining, contextual topic analysis, in Chapter 4.