UNIDAD 5. LA DEMOCRACIA Y SUS CARACTERÍSTICAS
5.4 LAS DEMOCRACIAS DEL SIGLO XVIII
There are several major topic modeling approaches to consider, and given that there is no consensus on the best way to evaluate a topic model, it is useful to assess topic modeling results using several metrics. Here we describe two modeling approaches and metrics that can be used to evaluate them.
3.2.1
Topic Modeling Approaches
In this subsection, we provide details of the two main topic modeling approaches that we consider in our experiments: Latent Dirichlet Allocation and the Meaning Extraction Method.
3.2.1.1 Latent Dirichlet Allocation
The most popular and widely used topic modeling approach is Latent Dirichlet Alloca- tion (LDA) [14]. LDA is a generative model that treats each document as a probabilistic mixture of topics, which themselves are distributions over words in the vocabulary. The generative story of LDA, outlined in Figure 3.1, is as follows: for each of the M docu- ments, a distribution over all topics, θ, is chosen for the current document. θ is sampled from a k-dimensional Dirichlet distribution parameterized by α, which is a hyperparameter of the LDA model. Then, for each of the N words in the document, a single topic, z, is sampled from θ. Lastly, the word itself, w is sampled from another Dirichlet distribution φz, which is parameterized by β. Given that θ and φ are multinomial distributions them-
selves, it is fitting to sample them from a Dirichlet distribution as it is the conjugate prior of the multinomial.
When fitting an LDA model, the goal is to find values for the unobserved variables that result in a high likelihood of the observed corpus, D, with the corpus probability defined
Figure 3.1: Graphical model representation of LDA. Image originally presented in [14].
by the model as:
p(D|α, β) = Y d∈D Z p(θd|α) YNd n=1 X zdn p(zdn|θd)p(wdn|zdn, β) dθd
While exact computation of the posterior distribution of the hidden variables (i.e., θ and z in Figure 3.1) is intractable, it can be approximated using methods such as variational inference (an algorithm has been outlined for this in [14]) or Collapsed Gibb’s sampling [110]. In the Gibb’s sampling case, for example, θ and φ are either fixed or updated in an alternating fashion. That is, first θ is fixed in order to sample topics for each word in each document. Then, these topic assignments are used to update the topic-word probabilities in φ based on the number of times that each word was assigned to each topic. Next, φ is fixed and used to sample topics for the words in any document, and the new values of z can then be used to update θ for each document. This process is repeated until convergence, i.e., the values of these variables do not change or only change by a very small amount.
A huge number of extensions have been proposed to LDA in order to tackle specific problems, showing that LDA is a strong base model from which successful variants can be constructed. For example, labeled LDA modifies the traditional LDA model by incorporat- ing document tags into the model so that the learned topics correspond to defined tags that are already providing some loose structure for the corpus [112]. Z-label LDA, on the other hand, allows a practitioner to initialize word-topic relationships based on external knowl- edge or hypotheses [6]. Tweet-LDA is a specific version of LDA tailored for short texts in which the assumption that a document is composed of many topics no longer necessarily holds [152]. Cross collection topic models are able to learn topic distributions of multiple collections of documents that might have different properties of interest, allowing for the shifting of topics across different subsets of the corpus in order to observe how the topics change [101]. These examples are only the beginning of a long list of other ways to take the base LDA model and adapt it to solve myriad problems.
Despite it’s ubiquity, LDA does have some drawbacks. Aside from general criticisms of topic modeling approaches, there are cases in which LDA itself may not be ideal. The as- sumptions that every topic contains every word with some probability and every document contains every topic with some probability can make LDA models difficult to interpret, and the continuous nature of the results can lead to ad-hoc decision making regarding the number of top words to associate with a particular topic. Often, social science researchers are interested in a more discrete set of terms, and clear separation between the groups is very important.
3.2.1.2 The Meaning Extraction Method
The Meaning Extraction Method (MEM) [30] is an alternative topic modeling approach that has been proposed in the context of psychology research. Essentially, the MEM treats the presence of a word in a document as a binary indicator variable, and then a PCA-based factor analysis is performed in order to find the primary factors that explain the presence of the words.
The MEM takes a corpus as input, represented as a |V | × M word-document matrix, D. Then, a Principal Components Analysis [103] is run on D and the top k components are retained and scaled to produce factor loadings. These loadings are then adjusted using the varimax rotation,[62] which seeks to maximize the following :
V ARIM AX =X(lj,q2 − l−2j,q)2
Where lj,q is the value of the loading for the jth word on the qth factor. Essentially, to get
a high value from the equation, words should be loaded more heavily on a small number of factors, and each factor should contain a small number of words. The rotated loadings are used as the word-topic scores (i.e, φ), and the softmax function can be applied to each column in cases where it is required that topics appear as distributions over words. While the MEM doesn’t generate the document-topic probability matrix θ inherently, we can simulate it using the training data that the model has been fit to. For each document, we compute a score for each topic based on φ as follows:
θ = (DφT)
and then normalizing so that the rows sum to equal 1. The Meaning Extraction Method has previously been used for numerous projects in the social sciences and digital humanaties. For example, the MEM was used to analyze social media data in order to study potential
Russian interference in the 2016 United States elections [22], to summarize feedback given to students in communication courses [75], and to study patterns of content exchange in emails between psychotherapists and their patients [148].
3.2.1.3 Relationship between LDA and the MEM
While LDA and the MEM differ algorithmically, there is also a set of assumptions that is tied to each approach about they ways in which data are preprocessed, represented, and fed into the models. These additional steps have almost become indistinguishable from the models themselves, but that does not necessarily need to be the case. While some of the typical processing associated with each method may have been heuristically or empirically determined to be useful and exists for good reason, we seek to decouple these additional steps from the models themselves in order to make a fair comparison. That is, to truly evaluate the differences between the MEM and LDA, we should control all other factors by performing the same preprocessing steps for each model. Throughout section3.3 of this paper, we note the parameter choices that are often used as defaults for each model.
Focusing on the topic modeling methods themselves, we can actually see that they are strongly connected. In fact, LDA can be viewed as a case of multinomal probabilistic PCA [25]. Probabilistic PCA is a reformulation of the traditional PCA method using a latent variable model. In probabilistic PCA, an observed variable x, is defined in terms of a transformation of a Gaussian latent variable z (representing the principal component space) and additive noise:
x = Wz + µ +
which has parameters that can be estimated using maximum likelihood estimation or ex- pectation maximization. In the typical Probabilistic PCA case, comes from a Gaussian distribution. In the multinomial case, we apply the same formulation as above, but to count data, z is a multinomial variable and our noise comes from a Dirichlet distribution. The formulation used in LDA is a special case of this multinomial, probabilistic variation of PCA. Therefore, by using a PCA-based approach like the MEM, we are removing the probabilistic nature of our solution, which makes computationally efficient solutions less feasible, especially with large, high-dimensional data sets. Further, the classical MEM for- mulation uses binary variables to represent the input data, but by changing this to count data, we are already moving one step closer to LDA– the major difference at that point being the probabilistic formulation. However, in the case of the MEM we are also using a post-decomposition rotation to try to enforce separation between the sets of words that appear in each topic. Seeing how these two methods are only a few steps removed from
one another, we would like to empirically explore the effects of each of these small, yet important, differences.