As explained previously, in the literature [145], topic models are defined as hierarchical Bayesian models of discrete data, where each topic is a set of words that together represent a high-level semantic concept. LDA [34] was introduced in line with this definition, and we use it in both parts of this thesis. In the following we explain LDA in detail:

Latent Dirichlet Allocation

LDA is a generative probabilistic topic model which discovers topics present in a given text corpus. LDA represents each topic as a probability distribution over words in the documents. The generative process of LDA is as follows:

1. For each topic β_i, for i ∈ {1, . . . , K}:

2. For each document d:

   (a) Draw topic proportions θ_d ∼ Dir(α).

   (b) For each word:

       Draw z_{d,n} ∼ Mult(θ_d),

       Draw w_{d,n} ∼ Mult(β_{z_{d,n}}).
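The generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-topic word distributions β are passed in as given (the listing does not specify how they are drawn), and the function names, the toy vocabulary, and the parameter values are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(beta, alpha, doc_lengths, vocab, seed=0):
    """Generate documents with the LDA generative process.

    beta:  K rows, each a probability distribution over `vocab` (assumed given).
    alpha: Dirichlet parameter of length K.
    """
    rng = random.Random(seed)
    topics = list(range(len(beta)))
    corpus = []
    for n_words in doc_lengths:
        # (a) Draw topic proportions theta_d ~ Dir(alpha).
        theta_d = sample_dirichlet(alpha, rng)
        doc = []
        # (b) For each word position in the document:
        for _ in range(n_words):
            z = rng.choices(topics, weights=theta_d)[0]   # z_{d,n} ~ Mult(theta_d)
            w = rng.choices(vocab, weights=beta[z])[0]    # w_{d,n} ~ Mult(beta_z)
            doc.append(w)
        corpus.append(doc)
    return corpus


# Hypothetical toy setting: two topics over a four-word vocabulary.
vocab = ["meeting", "budget", "goal", "match"]
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
corpus = generate_corpus(beta, alpha=[0.5, 0.5], doc_lengths=[5, 8], vocab=vocab)
```

Each generated document mixes words from both topics in proportions set by its own θ_d, which is the property that the models in later chapters rely on.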

The graphical model of LDA is shown in Figure 2.3.

LDA is the basis for a number of other topic models, including ours presented in Chapter 4. LDA is a Bayesian network that generates each document in a corpus from a mixture of topics. For each document, a multinomial distribution θ over topics is randomly sampled from a Dirichlet distribution with parameter α (which influences the shape of the distribution). Then, to generate each word, a topic z is chosen from this topic distribution, and a word w is generated by randomly sampling from the per-topic multinomial distribution β.

In Chapter 3, similarly to [146], we use LDA topics as summaries, because the use of LDA topics as representations of documents is theoretically motivated and endorsed by previous related work on temporal compression of recordings of conversations [138].

Figure 2.3. Graphical model of LDA (nodes α, θ, Z, W, and β; plates N and M).

In Chapters 7, 8, and 9 we use LDA topics to predict their continuation in future time slices in a JITIR setting. However, our proposed methods in these chapters are generic enough that any vector representation of documents could be used and predicted by the methods. This advantage of our models is very important because there are many human-centered studies such as [136] which produce various textual summaries of conversations other than LDA topics. The only requirement for using our models is to have a vector containing a set of words where each word is associated with a corresponding probability score showing the strength of its presence in a given document.
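To make the input requirement concrete, the sketch below builds such a word vector from a tokenized document. It assumes relative frequency as the probability score, which is only one possible choice; the function name and the sample tokens are hypothetical.

```python
from collections import Counter


def to_word_vector(tokens):
    """Map each distinct word to a probability score; scores sum to 1 per document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: relative frequency stands in for "strength of presence".
    return {word: count / total for word, count in counts.items()}


summary = to_word_vector(["meeting", "budget", "meeting", "deadline"])
# summary == {"meeting": 0.5, "budget": 0.25, "deadline": 0.25}
```

An LDA topic fits the same shape, since it is already a set of words with associated probabilities.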

Other Word Representation Methods

Recently, there has been interesting work on mapping semantically related words to nearby positions in the vector space in an unsupervised way. Some example approaches are the well-known word2vec model [94] and GloVe [103], in addition to other probabilistic word embedding methods such as [140], which uses a Gaussian distribution to model each word. In this chapter, we build on the same concept by using a Gaussian Mixture Model (GMM) to model each word in each of its contexts. We define a context of a target word as a word co-occurring in the same vicinity. For example, the word 'book' can mean making a reservation or a bound collection of pages, depending on the context.

Temporal Word Representation Methods

Temporal topic models are capable of tracking the evolution of topics over time and of modeling the probability of words over time. The Dynamic Topic Model (DTM) [33] is the state of the art in this domain. It is based on the LDA model and requires a sequential corpus of documents as input. It uses a linear Kalman filter [73] to compute the evolution of each topic over time. The authors showed that, on a sequential dataset, DTM outperformed LDA in terms of log likelihood. We elaborate on the details of this model in the background of Chapter 4.
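The idea of a topic that evolves across time slices can be illustrated with a short simulation. The sketch assumes that the underlying state space model is a Gaussian random walk on a topic's unnormalized word weights, mapped to word probabilities with a softmax; it shows only this forward evolution, not the Kalman-filter inference. The function names and parameter values are hypothetical.

```python
import math
import random


def softmax(weights):
    """Map unnormalized weights to a probability distribution over words."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic(beta_0, n_slices, sigma, seed=0):
    """Simulate one topic over `n_slices` time slices.

    Assumption: beta_t = beta_{t-1} + Gaussian noise with std `sigma`.
    Returns the word distribution of the topic in each slice.
    """
    rng = random.Random(seed)
    beta_t = list(beta_0)
    trajectory = [softmax(beta_t)]
    for _ in range(n_slices - 1):
        beta_t = [b + rng.gauss(0.0, sigma) for b in beta_t]
        trajectory.append(softmax(beta_t))
    return trajectory


# A topic over a three-word vocabulary that drifts across four time slices.
trajectory = evolve_topic([0.0, 0.0, 0.0], n_slices=4, sigma=0.3)
```

Note that the topic has a word distribution in every slice, which matches the assumption that all topics are present in all time slices.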

We use DTM as a baseline model in Chapters 4 and 8. In Chapter 4, we compare this model against our temporal topic model capable of tracking intermittent topics over time. In Chapter 8, we compare DTM with our proposed model capable of predicting topics that continue in a future time-slice.

We note that a temporal topic model such as the DTM is not a suitable option to be included in our benchmark in Chapter 7 for predicting continuing conversation topics, because: (1) it assumes that all topics are present over all time slices of a given dataset, which does not hold in the case of conversation logs; and (2) it is not capable of tracking textual representations of conversations other than topics.

Another notable temporal approach is the continuous-time dynamic topic model [145]. The model relaxes the assumption made by DTM that all documents within a time slice are exchangeable. For this purpose, it replaces the state space model used by DTM with Brownian motion [82]. The model is able to capture continuous topics by taking into account the timestamps of documents within a collection at different levels of granularity. Unlike this work, in Chapter 4 we use a Markovian state space model [55] for tracking intermittent topics over time. This means that the evolution of a topic is a discrete process, because a Markovian state space model is based on the Markov assumption, which states that the data at each time step depends only on the previous time step.
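The contrast between the two kinds of evolution can be sketched as follows. The example assumes that, under Brownian motion, the variance of a topic's drift grows in proportion to the time elapsed between two documents, whereas a discrete state space model takes one fixed-size step per time slice. The function name and the variance rate are hypothetical.

```python
import math
import random


def brownian_step(beta, elapsed, variance_rate, rng):
    """Advance topic weights across a time gap of `elapsed` units.

    Assumption: the drift is Gaussian with variance variance_rate * elapsed,
    so longer gaps between timestamps allow larger changes.
    """
    std = math.sqrt(variance_rate * elapsed)
    return [b + rng.gauss(0.0, std) for b in beta]


rng = random.Random(0)
beta = [0.0, 1.0, -1.0]
after_short_gap = brownian_step(beta, elapsed=0.5, variance_rate=0.1, rng=rng)
after_long_gap = brownian_step(beta, elapsed=30.0, variance_rate=0.1, rng=rng)
```

Setting `elapsed` to the same constant for every step recovers a discrete, slice-by-slice evolution in which each state depends only on the previous one.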

Another topic model that tracks the evolution of topics is the Topics Over Time model [147]. This model uses the timestamps of documents as observations for the latent topics, and each topic is associated with a continuous distribution over timestamps. Thus, for each generated document, the mixture distribution over topics is influenced by both the word co-occurrences and the document's timestamp. Although this model accounts for time jointly with word co-occurrence patterns, it does not discretize time and does not make Markov assumptions over state transitions in time. However, the topics are constant and the timestamps can be used to explore them.
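A minimal sketch of this idea is given below. It assumes that each topic's continuous distribution over timestamps is a Beta distribution on timestamps normalized to [0, 1]; the text above only requires some continuous distribution. The function name, the Beta parameters, and the toy vocabulary are hypothetical.

```python
import random


def generate_timestamped_tokens(theta, beta, psi, vocab, n_words, seed=0):
    """Generate (word, timestamp) pairs for one document.

    theta: the document's distribution over topics.
    beta:  per-topic word distributions (constant over time).
    psi:   per-topic (a, b) parameters of a Beta distribution over
           normalized timestamps (an assumption of this sketch).
    """
    rng = random.Random(seed)
    topics = list(range(len(theta)))
    tokens = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]
        word = rng.choices(vocab, weights=beta[z])[0]
        a, b = psi[z]
        timestamp = rng.betavariate(a, b)  # value in [0, 1]
        tokens.append((word, timestamp))
    return tokens


vocab = ["election", "vote", "league", "goal"]
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
psi = [(2.0, 8.0),   # topic 0 concentrated early in the collection
       (8.0, 2.0)]   # topic 1 concentrated late in the collection
tokens = generate_timestamped_tokens([0.5, 0.5], beta, psi, vocab, n_words=6)
```

The word distributions in `beta` never change; only the timestamp distribution tells us when a topic is active.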

Evaluation of Topic Models

Blei et al. [33] and many other previous works in the domain of topic modeling use likelihood on held-out data as a standard measure of evaluation. Similarly to the common practice, in Chapter 4 we also use likelihood to evaluate
