This section presents a qualitative analysis of topics with a high estimated probability of personal over gender salience from one of the 20-topic models, and highlights some caveats on their interpretation. Though I draw some weak conclusions about the individual and social psychology of authors in the community, I leave stronger conclusions to further study.
I first examine the logistic model for differentiating gender from personal identity salience. Appendix B includes a table of the eight most significant logistic coefficients of the model. The coefficients of the remaining LIWC classes are very small and do not contribute significantly to the model. One could interpret personal identity salience as a kind of soft default in the model: the model has many LIWC classes with negative coefficients that indicate gender salience (suggesting the model is able to identify gender-salient text) but few positive coefficients and a positive intercept value (suggesting the model identifies personal salience largely by the absence of words indicative of gender salience). We will see that this default characteristic played a role in the apparent personal salience of some topics. The two positive coefficients do, however, allow for positive identification of personal salience, as we will later see with topic 4.
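The soft default behaviour can be illustrated with a minimal sketch of a logistic model of this form. The intercept and coefficient values below are hypothetical and chosen only for illustration (the fitted values are those in Appendix B), and the function name `personal_salience_probability` is my own. The LIWC class names are those discussed in this section.

```python
import math

# Hypothetical intercept and coefficients, for illustration only; the fitted
# values are reported in Appendix B and are not reproduced here.
INTERCEPT = 0.8
COEFFICIENTS = {
    "ingestion": 1.5,   # positive: evidence for personal salience
    "insight": 0.6,     # positive: evidence for personal salience
    "negation": -1.2,   # negative: evidence for gender salience
    "inclusion": -0.9,  # negative: evidence for gender salience
    "exclusion": -1.0,  # negative: evidence for gender salience
}


def personal_salience_probability(class_frequencies):
    """Logistic estimate of P(personal salience) from LIWC class frequencies."""
    score = INTERCEPT + sum(
        COEFFICIENTS.get(name, 0.0) * frequency
        for name, frequency in class_frequencies.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


# With no LIWC words at all, only the positive intercept contributes, so the
# estimate already leans towards personal salience.
no_evidence = personal_salience_probability({})
# Words from negatively weighted classes pull the estimate towards gender salience.
gendered = personal_salience_probability({"negation": 1.0, "exclusion": 1.0})
# Words from a positively weighted class push it further towards personal salience.
eating = personal_salience_probability({"ingestion": 1.0})
```

Under these assumed values, text containing no LIWC words at all is still assigned a probability above one half. This is the sense in which personal salience acts as a default.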
Figure 5.3 presents a visualisation of that model created with the Termite topic model visualisation tool [Chuang et al. 2012]. Termite uses a balance between the significance of words in each topic and their ability to distinguish between topics to choose a set of words that can provide an overview of the semantics captured by the model. Words are ordered in an attempt to present groups of words from each topic as well as to present common phrases where possible. The interactive version of Figure 5.3, as well as a visualisation of the model with non-words removed and an example 50-topic model, can be found at http://cs.anu.edu.au/~Ian.Wood/termite/20-run1-LIWCvocab/public_html/. Summaries of the words with greatest contribution to the logistic model of salience probability for each topic are also presented in Appendix B.
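To make the idea of balancing word significance against discriminating power concrete, the following is a minimal sketch of a saliency score of the kind described by Chuang et al. It weights a word's overall probability by how far the distribution of topics given that word diverges from the overall topic distribution. I have not checked that the Termite tool computes exactly this; the function name and the toy inputs are assumptions of mine.

```python
import math


def term_saliency(topic_word, topic_weights):
    """Return one saliency score per word.

    topic_word[t][w] is P(word w | topic t); topic_weights[t] is P(topic t).
    saliency(w) = P(w) * sum_t P(t | w) * log(P(t | w) / P(t)).
    """
    n_words = len(topic_word[0])
    saliency = []
    for w in range(n_words):
        # Joint probability P(t, w) for each topic, and the marginal P(w).
        joint = [p_t * row[w] for p_t, row in zip(topic_weights, topic_word)]
        p_w = sum(joint)
        if p_w == 0:
            saliency.append(0.0)
            continue
        distinctiveness = 0.0
        for p_t, p_tw in zip(topic_weights, joint):
            p_t_given_w = p_tw / p_w
            if p_t_given_w > 0:  # treat 0 * log(0) as 0
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_t)
        saliency.append(p_w * distinctiveness)
    return saliency


# Toy model: word 0 is shared equally by both topics, while words 1 and 2
# each occur in one topic only.
scores = term_saliency(
    topic_word=[[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]],
    topic_weights=[0.5, 0.5],
)
```

In the toy model the shared word receives a score of zero despite being the most frequent, while the two topic-specific words receive equal positive scores.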
On inspection of the visualisation, it can be clearly seen that topics 1 and 13 are dominated by collections of hash tags. Less obvious in Figure 5.3, though easily identified in the interactive version, is that topics 6 and 8 are also dominated by hash tags.
Figure 5.3: 20-topic model with probable personal salient topics highlighted. Interactive version at http://cs.anu.edu.au/~Ian.Wood/termite/

Topic 13 in particular has very little weight in normal words (tokens that are not hash tags, smileys, punctuation etc.), with nearly all the probability in hash tags, the vertical bar (“|”) and the word “PIC”. In Appendix B we can see that this topic has only one word (with only two occurrences) considered in the logistic model for salience probability, and it is the absence of words that gives this topic a high estimated probability of personal salience, since only the intercept value of the logistic model plays a significant role. Given the absence of sentence punctuation such as full stops and commas, and the extreme lack of diversity among actual words, this topic most probably does not contain sentences or other discourse. The estimated probability must therefore be considered suspect: the context is quite different to that of the identity salience study, there are very few word types to draw conclusions from, and the logistic model arguably ignores any semantics that the topic may indicate, which perforce must be encoded in non-words such as hash tags.
In the case of topics 1 and 6, this argument appears less strong. Though about 90% of the tokens assigned to these topics are non-words, they nonetheless contain a diversity of words in the remaining 10% and non-trivial probability in sentence punctuation such as full stops. Topic 8 does not contain notable proportions of punctuation, though a diversity of other words is present. It would be well to investigate tweets strong in topic 8 and assess to what extent they could be said to fit the context of discussion on the topic of diet.
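The proportion of non-words in a topic can be estimated with a sketch along the following lines. The definition of a normal word used here (alphabetic characters with optional internal apostrophes) is an assumption made for illustration and not necessarily the tokenisation used in this study; the token counts are invented.

```python
import re

# Assumed definition of a "normal word": letters only, with optional internal
# apostrophes (e.g. "I'm"). Hash tags, punctuation, smileys and the vertical
# bar all fail this pattern. Placeholder tokens such as "PIC" would pass it
# and would need separate handling.
WORD_PATTERN = re.compile(r"^[A-Za-z]+(?:['’][A-Za-z]+)*$")


def is_normal_word(token):
    """True if the token looks like an ordinary word under the assumed pattern."""
    return bool(WORD_PATTERN.match(token))


def non_word_share(topic_token_counts):
    """Fraction of a topic's token assignments that are not normal words."""
    total = sum(topic_token_counts.values())
    if total == 0:
        return 0.0
    non_words = sum(
        count
        for token, count in topic_token_counts.items()
        if not is_normal_word(token)
    )
    return non_words / total


# Invented counts of tokens assigned to a single topic.
example_counts = {"#proana": 60, "#diet": 25, "|": 5, "eat": 6, "I'm": 3, ".": 1}
share = non_word_share(example_counts)
```

With these invented counts, 91 of the 100 token assignments are non-words, comparable to the proportion reported for topics 1 and 6.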
Of perhaps greater interest is topic 4. The tags present in this topic are also informative (in order of significance within the topic): “#edproblems”, “#thinspiration”, “#proana”, and to a lesser extent “#diet”, “#thighgap”, “#fitspo”³ and “#ednos”⁴. The association between these tags indicates particular themes seen as important or relevant to people using this topic in their tweets. These themes resonate with several of the themes identified in the expert assessment reported in Section 4.5 and Appendix C. #edproblems and #ednos specifically refer to personal difficulties relating to eating disorders. #thinspiration and #fitspo (short for “fit-inspiration”) are motivational. #proana could have several interpretations, such as identifying with a community or motivation, but clearly signals some affinity with eating disorders and anorexia. #thighgap⁵ is a symbol of the “thin ideal”. #diet reflects a concern over controlled eating, most probably with an eye to weight loss, and occurs with similar prominence to the words “eat” and “eating”. The presence of #diet within the more significant tokens suggests that the context of this topic is similar to the writing task used in the gender salience study. It is also interesting that this topic contains many function words and simple punctuation (“.”, “,”, “!”, “?”) and several verbs. This might suggest that tweets strong in this topic contain short full sentences; however, examination of a sample of such tweets reveals that they are primarily just hash tags, and that the sentence-like tokens are more likely a (low-frequency) theme that has been merged into this topic.

³ “fitspo” is short for “fitness inspiration”.
⁴ “ednos” refers to “Eating Disorder Not Otherwise Specified”, a term used for clinical classification [WHO 2015].
⁵ An open space between a person’s thighs when their feet are together; there are many
Pronouns in topic 4 are mostly in the first person singular (“I”, “I’m”, “my”, ...), with a small presence of impersonal pronouns (“it”, “it’s”) and a distinct lack of any other pronouns. It is interesting to note that personal pronouns do not have predictive power in the logistic model of personal vs. gender salience from [Dann 2011]. In past studies, high frequencies of first person singular pronouns have been associated with honesty, depression, low status, personal and emotional communications, and informal language [Tausczik and Pennebaker 2010b]. However, one must be careful to match the context of those studies with the current context. For example, though status may play a role in the Twitter pro-ana community, it seems unlikely that it would reveal itself in pronoun usage in a model of such low granularity. On the other hand, personal and emotional communications seem plausible in this context. To identify the actual role of topic 4 pronouns, expert analysis of a sample of tweets strong in topic 4 would be required.
In regard to the high estimated personal salience probability for topic 4, almost all the positive contributions are from words that relate to eating (LIWC variable “ingestion”), with a small contribution from words relating to insight. The only LIWC variable with a negative contribution that is not represented is “influence” (it has a trivial contribution). Impersonal pronouns and causality words (because, effect, hence, ...) make small contributions, while the others, namely inclusion words (and, with, include, ...), exclusion words (but, without, exclude, ...) and negation words (no, not, never, ...), all contribute substantially (see Appendix B). The presence of most of the contributing LIWC classes, as well as the hash tag #diet, supports the idea that the context of this topic may be similar to that of the gender salience study, and that the model is indeed detecting personal identity salience.