In this section I describe some of the more technical pitfalls and difficulties encountered during the development of the analyses presented here.
In the first attempted analysis, the tokeniser treated underscores in hash tags and user mentions as separate tokens, resulting in a large number of underscore tokens. To further complicate the situation, when counting the number of words (as opposed to non-word tokens) attributed to a topic, the underscore (“_”) was counted as a word (as opposed to punctuation) due to the default characterisation of word characters in Perl regular expressions (which includes underscores). The number of words attributed to a topic was used in calculating salience probabilities, so this had a small effect on some salience scores. 1067 underscore characters were assigned to topic 8, which increased the number of words (as opposed to non-words) for topic 8 from 5124 to 6191, an increase of roughly 20%. Topics 1 and 6 also contained notable numbers of underscores, though these made up only 4% and 2% of their words respectively. Topic 8’s regression coefficient remained dominated by the regression model intercept, however, changing only from 3.13 to 3.10, and thus the resulting salience probability changed only from 0.958 to 0.957. The effect on topics 1 and 6 was an order of magnitude smaller. For topics with low estimated personal probability, however, the word frequencies dominate the regression calculation, and a substantial shift could occur. For example, the salience probability of topic 16 shifts from 0.48 to 0.32 under a 20% increase in its word count, such as that caused by counting underscores as words.
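The word-character pitfall described above can be reproduced in a few lines. The following is a minimal Python sketch, not the Perl code used in the analysis; Python's `\w`, like Perl's, matches the underscore. The function name `count_words`, the two patterns and the sample tokens are illustrative assumptions.

```python
import re

# Default notion of a "word" token: \w matches letters, digits AND "_".
WORD_DEFAULT = re.compile(r"^\w+$")
# Stricter notion: word characters with the underscore explicitly excluded.
WORD_STRICT = re.compile(r"^[^\W_]+$")


def count_words(tokens, pattern):
    """Count the tokens that consist entirely of characters matched by `pattern`."""
    return sum(1 for token in tokens if pattern.match(token))


# Hypothetical token stream in which underscores were split off as separate tokens.
tokens = ["@", "my", "_", "user", "_", "name", ":", "thanks"]

default_count = count_words(tokens, WORD_DEFAULT)  # underscores counted as words
strict_count = count_words(tokens, WORD_STRICT)    # underscores treated as non-words

print(default_count, strict_count)
```

With the default pattern the two stray underscore tokens inflate the word count; the stricter pattern counts only the four genuine words.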
Two of the topics with high estimated personal salience have non-trivial probability assigned to colons (“:”) and underscores (“_”). In Twitter, one can “reply” to a tweet, in which case the produced tweet typically has the form
@my_user_name:“replied tweet text” my additional text.
If a reply is not recognised and removed by the pre-processing filter (for example, a tweet made before Twitter provided metadata indicating replies, in which the user edited the reply text), and a user name containing an underscore is not recognised by the mention filter, the resulting corpus document will contain at least one underscore and one colon. Many such tweets would result in frequent co-occurrence of these characters, and would tend to cause them to co-occur in one or more topics as well. Such topics would include words often used in these replies that may not have a strong relationship elsewhere in the corpus. This illustrates one way in which a topic may represent non-semantic structural features and easily be misinterpreted as representing general semantic relations. For this reason, it is generally a good idea to investigate sample documents with strong representation in a topic before drawing too many conclusions.
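One way to carry out such a check is to list the documents that weight most heavily on the topic in question. The sketch below is illustrative only: it assumes the fitted model is available as a list of per-document topic proportions (the hypothetical `doc_topic`), and the function name `top_documents` is not part of any library.

```python
def top_documents(doc_topic, topic, n=5):
    """Return (document index, weight) pairs for the n documents
    with the largest weight on the given topic, largest first."""
    weights = enumerate(row[topic] for row in doc_topic)
    ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical topic proportions for three documents over two topics.
doc_topic = [
    [0.1, 0.9],
    [0.7, 0.3],
    [0.4, 0.6],
]

strongest = top_documents(doc_topic, topic=1, n=2)
print(strongest)
```

Reading the returned documents in full makes structural artefacts, such as unfiltered reply prefixes, easy to spot.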
Some words with non-trivial probability in one or more topics may not have made it into the 100 words chosen by Termite to represent the model. For example, in topic 13, the words “see” and “more” have substantial representation in the interactive visualisation for which non-words were excluded (found at the URL mentioned above), but these words do not appear in Figure 5.3.
5.8 Discussion and Future Work
In Section 5.2 I state that a favourable posterior predictive check for a topic indicates that words are as evenly distributed as can be expected, and claim that topic probabilities can therefore provide a reasonable estimate of word frequencies among words assigned to that topic. A more rigorous statistical assessment of that claim, with quantification of its uncertainties, would be in order.
The observation that regularisation can break the independence of topic and document word allocations (as tested by the posterior predictive checks presented here) likely extends to other supervised topic models. In Section 2.4.4 I introduce several other methods for providing prior information about topic structure [Jagarlamudi et al. 2012; Hall et al. 2008; Ramage et al. 2009; Ramage et al. 2011]. If these models also break the independence of topic and document word distributions, the general utility of such supervised models must be brought into question: do the models reflect “true” structure in the corpus, or merely the prior provided? Measuring this effect and developing unbiased tests tailored to the intended application of such models would be needed to establish model credibility.
Conversely, if those approaches to topic supervision are found not to adversely affect, or only minimally to affect, the independence of topic and document word distributions, they would be good candidates for improving the presented approach to measuring psychological (or other word-frequency-correlated) features.
The analysis in Sections 5.7.2 and 5.7.3 is intended as an indication of the type of analyses that can be done to interpret the results of the presented approach. A deeper analysis, including expert review of tweets strong in presumed salient topics and of associations between topics found to be good candidates for salience and particular users, groups, hash tags, etc., would be of interest to both the social media and eating disorder research communities.
5.9 Conclusions
This chapter develops and demonstrates a methodology for combining topic models with word-frequency-based psychometric tools, providing useful contextualisation and a measure of the features those tools detect. Results such as these can help to provide insights into the psychological processes active within a group, as well as some measure of their activity.
Though the psychological study used to provide a psychometric proxy was small and arguably distant from the context of people tweeting in the Twitter eating disorder and thinspiration community, this study serves as a useful illustration, paving the way for future studies combining more traditional psychological questionnaires, elicited text responses and online social media data.
Topic regularisation as a means for model supervision was found not to improve the method due to its adverse effect on the independence of topic and document word distributions (as measured by posterior predictive checks).
The next chapter introduces a method for combining a topic model and an overlapping network community model drawn from the same data set, associating individual documents with communities and estimating topic mixtures for each community.
Chapter 6
Community Topic Usage
This chapter presents a Bayesian model to identify community topic usage in data combining documents and a network of their authors. The content of this Chapter has been published in the proceedings of the workshop “Topic Models — Post Processing and Applications” at CIKM 2015 [Wood 2015c], presented here with minor additions.
Members of social groups share some purpose, beliefs or other common human features, and one would expect those features to appear as common language markers. One premise of this thesis is that such common language use can be detected via topic models.
In the presented model, overlapping communities are identified using standard network community detection algorithms, and document topics are identified using standard topic models. The model then associates those topics with communities, balancing community topic coherence with author community affiliation.
This chapter is organised as follows: In Section 6.1 I present an overview of the background, motivation and literature relevant to the model. In Section 6.2 I describe and develop the model, including the conjugate prior to the Dirichlet distribution. In Section 6.3 I present an algorithm based on Gibbs sampling for estimating the posterior. In Section 6.4 I describe the data set and the contributing topic and community detection models used as an example in this study. In Section 6.5 I develop two metrics for assessing model quality. In Section 6.6 I present results showing that the model succeeds in its aims. In Section 6.7 I discuss the results and their implications, and indicate some research questions that may be of interest for future work. In Section 6.8 I summarise the contribution.