We use a generalized framework in which both LDA and the MEM (and potentially other methods) can be run in a variety of settings. Each topic model can be applied to a number of different corpora, which can be preprocessed in a variety of ways. There are also several parameter choices to be made during the modeling process itself and, finally, a handful of evaluation metrics that can be computed for each model. For the purposes of evaluation, each corpus has a set of documents that belong to known categories, or classes. This way, we can measure the extent to which the topic models are able to recover the underlying class labels without any supervision.
Formally, each corpus is an unordered set of $M$ documents, $D$, where each document, $d_m \in D$, is a sequence of $N_m$ words from the vocabulary $V$, i.e., $d_m = (w_1, \ldots, w_{N_m})$.
Each document has exactly one class label $c(d)$ from the set of labels $C$. Each topic model is fit to a specific corpus, $D$, given a set of parameters, $\lambda$; that is, $TM = \pi(D, \lambda)$ for a topic modeling method $\pi$. The number of topics, $k$, is specified beforehand (i.e., $k \in \lambda$). Each topic model must contain two matrices: a document-topic matrix, $\theta$, and a topic-word matrix, $\phi$. $\theta$ is an $M \times k$ matrix that gives a likelihood score to each topic for each document, and $\phi$ is a $k \times |V|$ matrix that gives a likelihood score to each word for each topic. We would like to find topic models that maximize one or more of several evaluation metrics, depending on the goals of the researcher.
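As an illustration of the shape requirements on the two matrices, the following is a minimal sketch of a container that checks them. The class name `TopicModel`, its fields, and the helper `top_topic` are hypothetical names introduced here for illustration only; they are not taken from our implementation.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class TopicModel:
    """Hypothetical container for the two matrices a fitted topic model exposes."""

    theta: Matrix  # document-topic scores, shape M x k
    phi: Matrix    # topic-word scores, shape k x |V|

    def __post_init__(self) -> None:
        k = len(self.phi)
        # Every document row of theta needs one score per topic.
        if any(len(row) != k for row in self.theta):
            raise ValueError("every row of theta must have k entries")
        # Every topic row of phi needs one score per vocabulary word.
        vocab_size = len(self.phi[0]) if self.phi else 0
        if any(len(row) != vocab_size for row in self.phi):
            raise ValueError("every row of phi must have |V| entries")

    @property
    def num_topics(self) -> int:
        return len(self.phi)

    def top_topic(self, m: int) -> int:
        """Index of the highest-scoring topic for document m (first one on ties)."""
        scores = self.theta[m]
        return max(range(len(scores)), key=scores.__getitem__)


# Toy example with M = 3 documents, k = 2 topics, |V| = 3 words.
tm = TopicModel(
    theta=[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    phi=[[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]],
)
```

A check of this kind makes a mismatch between $k$, $M$, and $|V|$ fail early, before any evaluation metric is computed.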
3.3.1.1 Preprocessing
Before topic modeling even begins, the corpus text is preprocessed as follows: in all cases, the documents in the corpus are tokenized, punctuation is stripped, common conversions from British to American English are applied, common misspellings are corrected, and stopwords³ and words containing fewer than three characters are removed.⁴ Then, lemmatization is applied if requested, the Vocabulary Selection procedure is applied to produce $V$, and an $M \times |V|$ term-document matrix, $D$, is initialized and subsequently populated using the chosen Corpus Data Representation. During this phase, the following parameters (summarized in Table 3.1) are considered:
Parameter Name                      Symbol                Possible Values
Vocabulary Selection Method         $f_V(D, \lambda_V)$   Doc. Frequency, Class Doc. Frequency, Word Rank, Fixed List
Document Frequency Minimum          $df_{min}$            3%, 5%
Document Frequency Maximum          $df_{max}$            95%, 100%
Class Document Frequency Minimum    $df^C_{min}$          3%, 5%
Class Document Frequency Maximum    $df^C_{max}$          95%, 100%
Word Frequency Percentile Minimum   $PR_{min}$            90%, 95%
Word Frequency Percentile Maximum   $PR_{max}$            98%, 100%
Lemmatization                       $L$                   True, False
Training Data Amount                $T$                   20%, 40%, 60%, 80%, 100%, 3000 instances
Corpus Data Representation          $dtype_{corpus}$      count, binary
Table 3.1: Corpus preprocessing parameters, shorthand symbols, and values used in experiments.
Vocabulary Selection Method
We define the vocabulary selection method, $f_V(D, \lambda_V)$, as a function that takes a corpus as input and returns a set of words $V$ that should be used for that corpus given vocabulary parameters $\lambda_V$. The vocabulary parameters vary depending on the selection method being used.
The vocabulary used for a topic model is important for several reasons. First, including words that are common across the entire corpus will often lead to one or more uninformative topics that contain high concentrations of these ubiquitous words. Even after removing stopwords, other high-frequency words may remain, either because they were missed by the stopword dictionary or because of the nature of the corpus. For example, it may be better to remove words like “chapter” in a corpus of novels or “today” in a news corpus. On the other hand, rare words add unnecessary complexity to the model, and if a word appears only a few times in the entire corpus, there will be no good way for a topic model to learn reliable information about the types of words that it co-occurs with. This is common with proper nouns or jargon.
³ We use the Python NLTK (nltk.org) stopword list.
⁴ We acknowledge that each of these initial steps could be omitted or modified according to an additional tuning parameter. However, preliminary results showed that these steps have either a small or a consistently positive impact on overall performance, and we do not treat them as tunable parameters in our experiments at this time in order to reduce the already large space of possible parameter combinations.
In order to address these potential concerns, we propose four approaches. The Document Frequency filter selects words based on their document frequencies, defined for a word $w$ in a corpus $D$ as:
$$df(w, D) = \frac{|\{d \in D : w \in d\}|}{|D|}$$
The filter parameters, $\lambda_V^{DF}$, are $df_{min}$ and $df_{max}$, and the filter function is:
$$f_V^{DF}(D, \lambda_V^{DF}) = \{w \in d : d \in D \wedge df_{min} < df(w, D) < df_{max}\}$$
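The Document Frequency filter can be sketched in a few lines. This is a minimal illustration that assumes each document is already a list of preprocessed tokens; the function names `document_frequencies` and `df_filter` and the toy corpus are hypothetical and are not taken from our implementation.

```python
from collections import Counter
from typing import Dict, List, Set


def document_frequencies(corpus: List[List[str]]) -> Dict[str, float]:
    """Map each word to the fraction of documents that contain it."""
    doc_counts: Counter = Counter()
    for doc in corpus:
        doc_counts.update(set(doc))  # count each word at most once per document
    return {w: n / len(corpus) for w, n in doc_counts.items()}


def df_filter(corpus: List[List[str]], df_min: float, df_max: float) -> Set[str]:
    """Keep words whose document frequency lies strictly between the two bounds."""
    freqs = document_frequencies(corpus)
    return {w for w, df in freqs.items() if df_min < df < df_max}


# Toy corpus (illustrative only): "cat" appears everywhere, "dog" in half the documents.
toy_corpus = [["cat", "sat", "mat"], ["cat", "dog"], ["cat", "dog", "ran"], ["cat", "bird"]]
print(df_filter(toy_corpus, df_min=0.3, df_max=0.95))  # {'dog'}
```

In the toy corpus, the ubiquitous word is removed by the upper bound and the words that occur in a single document are removed by the lower bound.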
The Class Document Frequency filter works similarly, but document frequencies are computed at the class level, i.e.,
$$df_C(w, D) = \frac{1}{|C|} \sum_{c' \in C} \frac{|\{d \in D : w \in d \wedge c' = c(d)\}|}{|\{d \in D : c' = c(d)\}|}$$
and given parameters $\lambda_V^{CDF} = (df^C_{min}, df^C_{max})$, the filter function is:
$$f_V^{CDF}(D, \lambda_V^{CDF}) = \{w \in d : d \in D \wedge df^C_{min} < df_C(w, D) < df^C_{max}\}$$
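The class-level computation averages the per-class document frequencies, so a small class carries the same weight as a large one. The sketch below illustrates this under the assumption that $C$ is the set of labels observed in the corpus; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def class_document_frequencies(corpus: List[List[str]], labels: List[str]) -> Dict[str, float]:
    """Average, over classes, of the fraction of each class's documents containing a word."""
    class_sizes = Counter(labels)
    hits = defaultdict(Counter)  # hits[c][w] = number of class-c documents containing w
    for doc, label in zip(corpus, labels):
        hits[label].update(set(doc))
    vocab = {w for doc in corpus for w in doc}
    n_classes = len(class_sizes)
    return {
        w: sum(hits[c][w] / class_sizes[c] for c in class_sizes) / n_classes
        for w in vocab
    }


def cdf_filter(corpus: List[List[str]], labels: List[str], cdf_min: float, cdf_max: float) -> Set[str]:
    """Keep words whose class document frequency lies strictly between the bounds."""
    freqs = class_document_frequencies(corpus, labels)
    return {w for w, f in freqs.items() if cdf_min < f < cdf_max}


# Toy data (illustrative only): three "sports" documents and one "politics" document.
toy_docs = [["goal", "team"], ["goal", "match"], ["goal", "match"], ["vote", "team"]]
toy_labels = ["sports", "sports", "sports", "politics"]
cdf = class_document_frequencies(toy_docs, toy_labels)
print(cdf["goal"])  # 0.5
```

In this example "goal" occurs in 75% of all documents, but its class document frequency is 0.5, because it appears in every sports document and in no politics document.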
The Word Rank filter does not consider which documents words appear in, only their overall corpus frequency. A list of all words, $F = S_{freq}(D)$, is created by sorting all words in the corpus in ascending order by frequency. We then define $PR(w, F)$ as the percentile rank of word $w$, i.e., the percentage of words that appear before $w$ in the list $F$. Then, given $\lambda_V^{WR} = (PR_{min}, PR_{max})$, the filter is:
$$f_V^{WR}(D, \lambda_V^{WR}) = \{w \in S_{freq}(D) : PR_{min} < PR(w, S_{freq}(D)) < PR_{max}\}$$
It is worth noting that since word frequencies generally follow Zipf’s Law [111], the total token count of the words in the bottom 90% of the list is relatively low compared to that of the top 10%. Therefore, we can retain a large proportion of the overall tokens in a corpus, even when setting $PR_{min}$ to a value like 0.90.
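The following sketch illustrates the Word Rank filter and the token-retention effect on a small, roughly Zipf-shaped toy vocabulary. It assumes that percentile ranks are expressed as fractions in [0, 1) and that ties in frequency are broken alphabetically; neither detail is specified above, and the function names and counts are hypothetical.

```python
from collections import Counter
from typing import List, Set


def word_rank_filter(corpus: List[List[str]], pr_min: float, pr_max: float) -> Set[str]:
    """Keep words whose percentile rank (by ascending frequency) is strictly within bounds."""
    freq = Counter(w for doc in corpus for w in doc)
    # Ascending by frequency; alphabetical tie-breaking is an assumption of this sketch.
    ranked = sorted(freq, key=lambda w: (freq[w], w))
    n = len(ranked)
    return {w for i, w in enumerate(ranked) if pr_min < i / n < pr_max}


def token_coverage(corpus: List[List[str]], vocab: Set[str]) -> float:
    """Fraction of all tokens in the corpus that belong to the selected vocabulary."""
    total = sum(len(doc) for doc in corpus)
    kept = sum(1 for doc in corpus for w in doc if w in vocab)
    return kept / total


# Ten word types with rapidly decaying counts (illustrative only).
toy_counts = [100, 50, 33, 25, 20, 16, 14, 12, 11, 10]
toy_corpus = [[f"w{i + 1}"] * c for i, c in enumerate(toy_counts)]

kept_vocab = word_rank_filter(toy_corpus, pr_min=0.65, pr_max=1.0)
coverage = token_coverage(toy_corpus, kept_vocab)  # 183 of 291 tokens, about 0.63
```

Here only the three most frequent of the ten word types are kept, yet they account for roughly 63% of all tokens.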
Lastly, the Fixed List filter takes a predefined set of words, $V_0$, as input and uses them as the vocabulary. In this work, we experiment with using the set of roughly 8,000 most common English Wikipedia words that was used as a predefined topic modeling vocabulary in foundational examples of LDA [14]. The only parameter in $\lambda_V^{FL}$ is the word list $V_0$ itself, and the filter is simply:
$$f_V^{FL}(D, \lambda_V^{FL}) = \{w \in d : d \in D \wedge w \in V_0\}$$
Lemmatization
The choice of whether or not to perform some sort of lemmatization, stemming, or other hashing of words can have an impact on the overall size of the vocabulary. When performing lemmatization, the topic model will ignore morphological information that might convey information about tense or number. When the goal is to focus on content, this may be an added benefit on top of the reduced complexity of fitting a model to the smaller vocabulary remaining after the lemmatization process. On the other hand, some potentially useful information could be removed, and so we experiment with both performing and abstaining from lemmatization.⁵ It has previously been shown that choices about stemming can have a significant impact on topic modeling results, including the interpretability and stability of topics [124]. As the choice of stemming method has been explored in depth in prior work, we only consider the option of whether or not to perform any type of lemmatization/stemming at all, and not the differences in outcomes when using any particular approach.
Training Data Amount
In order to determine the effect of having access to more training documents, we also vary the amount of data used to fit the model. The rest of the data is treated as test data, which is used during evaluation. We experiment with using a relative proportion of the full dataset as training data, and we also consider treating a fixed number of instances as training data so that we can make more direct comparisons between datasets of different sizes.
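Both ways of specifying the training data amount can be handled by one helper, sketched below. The selection of training documents by a seeded random shuffle, the rounding of fractional sizes, and the capping of a fixed count at the corpus size are all assumptions of this sketch; the function name `split_corpus` is hypothetical.

```python
import random
from typing import List, Tuple, Union


def split_corpus(n_docs: int, amount: Union[float, int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a corpus of n_docs documents.

    A float `amount` is read as a proportion of the corpus (e.g. 0.2 for 20%);
    an int `amount` is read as a fixed number of training instances (e.g. 3000).
    """
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)  # seeded so that the split is reproducible
    if isinstance(amount, float):
        n_train = round(amount * n_docs)
    else:
        n_train = min(amount, n_docs)  # cap at the corpus size (assumption)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_corpus(n_docs=10, amount=0.2)    # proportion of the corpus
fixed_train, fixed_test = split_corpus(n_docs=5000, amount=3000)  # fixed number of instances
```

With a fixed count, two corpora of different sizes are fit on the same number of documents, which is what makes the cross-dataset comparison more direct.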
Corpus Data Representation
Each topic modeling method requires a matrix representing the relationship between documents in the corpus and the words in those documents. We explore two ways to represent the data: either as count variables (the number of times a word appears in the document), as are used in LDA, or as binary indicator variables (1 if the word appears in the document any number of times, and 0 otherwise), as are used in the MEM. By using the same data type for both methods, we can make a fairer comparison between them, and by evaluating the methods when fed different data representations than those that are normally provided, we can determine how beneficial it might be to use each representation in general.
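The two representations differ only in what is stored in each cell of the matrix, as the following sketch shows. It assumes tokenized documents and an ordered vocabulary list; the function name `build_matrix` and the toy data are hypothetical.

```python
from collections import Counter
from typing import List


def build_matrix(corpus: List[List[str]], vocab: List[str], dtype: str = "count") -> List[List[int]]:
    """Build an M x |V| matrix with one row per document and one column per vocabulary word."""
    if dtype not in ("count", "binary"):
        raise ValueError("dtype must be 'count' or 'binary'")
    column = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for doc in corpus:
        row = [0] * len(vocab)
        for word, n in Counter(doc).items():
            if word in column:  # words outside the vocabulary are ignored
                row[column[word]] = n if dtype == "count" else 1
        matrix.append(row)
    return matrix


toy_docs = [["tax", "tax", "vote"], ["goal"]]
toy_vocab = ["goal", "tax", "vote"]
count_matrix = build_matrix(toy_docs, toy_vocab, dtype="count")    # [[0, 2, 1], [1, 0, 0]]
binary_matrix = build_matrix(toy_docs, toy_vocab, dtype="binary")  # [[0, 1, 1], [1, 0, 0]]
```

The repeated word "tax" is the only entry that differs between the two matrices.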
Parameter Name      Symbol   Possible Values
Number of Topics    $k$      $0.5|C|$, $|C|$, $1.5|C|$, $2|C|$, $5|C|$
Method              $M$      MEM, LDA
Rotation            $rot$    varimax, none
Table 3.2: Topic modeling parameters, shorthand symbols, and values used in experiments.
3.3.1.2 Models
After the preprocessing has been completed, we are ready to begin learning topics from the term-document matrix that represents the preprocessed corpus. Based on that input, we fit the topic models as described in Section 3.2 using our own custom implementation. At this point, we consider the following topic modeling parameters, which are outlined in Table 3.2.
Number of Topics
The number of topics, k, is one of the most important parameters when fitting topic models. As there is no consensus on the optimal number of topics, it is generally recommended that practitioners test several values of k in order to determine which number of topics leads to the model best suited to their needs. In our experiments, we consider values of k proportional to the number of classes in the dataset being used in order to investigate the relationship between the space of underlying classes and the set of topics learned by the chosen modeling method.
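For a given dataset, the candidate values of k follow directly from the number of classes. The sketch below assumes that fractional values (which arise when the number of classes is odd) are rounded up and that k is at least 1; this rounding rule and the function name are hypothetical and not specified above.

```python
import math
from typing import List, Sequence


def candidate_topic_counts(
    n_classes: int, multipliers: Sequence[float] = (0.5, 1.0, 1.5, 2.0, 5.0)
) -> List[int]:
    """Candidate numbers of topics as multiples of the number of classes."""
    # Rounding up is an assumption of this sketch.
    return [max(1, math.ceil(m * n_classes)) for m in multipliers]


print(candidate_topic_counts(20))  # [10, 20, 30, 40, 100]
print(candidate_topic_counts(5))   # [3, 5, 8, 10, 25]
```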
Topic Modeling Method
We consider both LDA and the MEM as topic modeling methods. We use our own implementation of batched LDA [57] with a batch size of 100, and set both α and β to 0.1. For the MEM, we use a factor loading membership threshold of 0.2.
Rotation
For the MEM only, we test the effect of omitting the varimax rotation. This will help us determine the degree to which this rotation, which is typically done by default, actually helps to produce meaningful and accurate themes.
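To make concrete what the rotation does, the following is a generic textbook sketch of raw varimax (without Kaiser row normalization) based on Kaiser's pairwise planar rotations. It is not taken from our implementation of the MEM; the function names and the toy loading matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def varimax_criterion(loadings: Matrix) -> float:
    """Sum over factors of the variance of the squared loadings."""
    n = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squares = [row[j] ** 2 for row in loadings]
        total += sum(s * s for s in squares) / n - (sum(squares) / n) ** 2
    return total


def varimax(loadings: Matrix, max_sweeps: int = 100, tol: float = 1e-10) -> Matrix:
    """Rotate an n x k loading matrix with orthogonal planar rotations."""
    rotated = [list(row) for row in loadings]
    n, k = len(rotated), len(rotated[0])
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for p in range(k - 1):
            for q in range(p + 1, k):
                u = [row[p] ** 2 - row[q] ** 2 for row in rotated]
                v = [2.0 * row[p] * row[q] for row in rotated]
                a, b = sum(u), sum(v)
                c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
                d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
                # Angle that maximizes the criterion in the plane of factors p and q.
                angle = 0.25 * math.atan2(d - 2.0 * a * b / n, c - (a * a - b * b) / n)
                cos_a, sin_a = math.cos(angle), math.sin(angle)
                for row in rotated:
                    x, y = row[p], row[q]
                    row[p] = x * cos_a + y * sin_a
                    row[q] = -x * sin_a + y * cos_a
                largest_angle = max(largest_angle, abs(angle))
        if largest_angle < tol:  # a full sweep made no meaningful rotation
            break
    return rotated


# A matrix with simple structure, blurred by a rotation of 0.3 radians (illustrative only).
simple = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
cos_t, sin_t = math.cos(0.3), math.sin(0.3)
mixed = [[x * cos_t - y * sin_t, x * sin_t + y * cos_t] for x, y in simple]
recovered = varimax(mixed)
```

On this toy input the rotation recovers the simple structure, in which each row loads on exactly one factor. Because every step is an orthogonal rotation, the sum of squared loadings in each row is unchanged; only the way it is distributed across factors changes.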