CHAPTER II: MARKET STUDY
2.2. MICROENVIRONMENT
2.3.4. Analysis of the results
Spectral clustering (§ 6.3.2) produces groups of accounts that exhibit similar traits. Table 6.1 lists traits that are similar among accounts within the same cluster, e.g. aggressive tweeting patterns. However, this provides little insight into what different types of bots tweet about. In particular, I am interested in understanding the context of each bot in terms of its purpose and topics of interest.
8 Drudge (better known as Drudge Report) is a news aggregator service that allows the user
Figure 6.4: Distribution of top 20 activity sources per cluster: percentages are calculated per source per cluster (i.e. normalised for different sources in each cluster).
Next, I explore the topics discussed within each cluster. I hypothesise that certain clusters may have a proclivity towards certain prominent topics. I emphasise, however, that the clusters are derived from the traits listed in Table 6.1, i.e. topical similarity was not taken into consideration. Hence, I now explore popular topics discussed within and across clusters.
I start by filtering stop-words and frequently occurring words, such as URL protocol names, to clean the text. I then apply topic modelling to extract the most popular topics per bot account, using Latent Dirichlet Allocation (LDA). LDA is an unsupervised generative probabilistic model that discovers latent structure in a set of documents by treating each document as a mixture of latent topics. Tweets are first broken down into word vectors, and each topic is then modelled as a distribution over words, learned from word co-occurrences. Exact details regarding LDA can be found in [9]. I use the LDA implementation in scikit-learn [67] to generate topic models for the eight clusters.
Figures 6.5 and 6.6 present the topic word clouds for each cluster. For the purposes of comparison, Figure 6.7 shows the most popular topics and words tweeted by the 11,379 human Twitter users. To give greater context, I perform a manual review exercise to allocate topic labels to these clusters. I label each of the eight clusters with any combination of Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), and Television (T). These topic labels are suggestive and indicative rather than decisive.
It can be seen that different clusters have a different “skew” towards certain topics. For instance, accounts in Clusters 3–7 (dominos, HPbasketball, RedeGlobo, BBCWorld, MoneyAffairs, BreakingNews, CollingwoodFC, ESPNFC, WDRBNews) have certain very dominant topics of discussion, e.g. Basketball, The Economist, Football, etc., whereas accounts in Clusters 0–2 (AJArabic, bbcworldfeed, CNNEE, CNNsWorld, NFL, pitchpivot, photo_cj, reddit_top, swissifg, talkvn, teachersdesign, trafficjamnet, whats_live, youkoudan, yalgaarmateen) have a far more even distribution of topics. This is predominantly driven by the size of these clusters: whereas Cluster 0 has over 3K accounts, Cluster 7 has just 8 accounts. Despite this, there are clear topics shared across each group, particularly related to politics, e.g. US politics. This suggests that each cluster is not dedicated to individual topics but, rather, that its behavioural traits are shared across accounts tweeting on a number of issues.
To explore the similarity between the topics, I also compute the topical affinity scores for each cluster against every other cluster. Affinity scores are computed by calculating close matches between pairs of clusters (e.g. 0 and 1, 0 and 2, and so on) using Python’s difflib⁹ library. Small differences can be observed between the same pair taken in opposite orders (e.g. 0 and 1 versus 1 and 0) because the first item of the pair is taken as the base against which the second item is compared. Reversing the order of comparison changes the base cluster and therefore produces a different result.
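One plausible way to obtain such an order-dependent score with difflib is sketched below. The exact scoring function is not specified here, so the definition used in the sketch (the fraction of the base cluster's topic words that have a close match in the other cluster), the function name `affinity`, the cutoff of 0.8, and the word lists are all assumptions for illustration.

```python
from difflib import get_close_matches


def affinity(base_words, other_words, cutoff=0.8):
    """Fraction of base-cluster words with a close match among the other cluster's words.

    The score is normalised by the size of the base list, so swapping the
    arguments can change the result.
    """
    if not base_words:
        return 0.0
    matched = sum(
        1
        for word in base_words
        if get_close_matches(word, other_words, n=1, cutoff=cutoff)
    )
    return matched / len(base_words)


# Hypothetical topic words for two clusters.
cluster_a = ["trump", "election", "news", "football"]
cluster_b = ["trump", "elections", "vote"]

if __name__ == "__main__":
    print("a against b:", affinity(cluster_a, cluster_b))
    print("b against a:", affinity(cluster_b, cluster_a))
```

Because the count of matched words is divided by the length of the base list, the two orderings of a pair generally yield different values, which is consistent with the asymmetry described above.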
Table 6.4 shows the produced clusters and their affinity scores, where boldface marks the highest topical affinity between two clusters, as well as the topic labels per cluster. This shows that there is heavy overlap between the topics discussed in different clusters. For the purposes of comparison, I also show the affinity scores between the entire human population (11,379 accounts in total) and the eight bot clusters. The bot clusters are strikingly similar to the human population in terms of the popular topics in tweets. The reason for this is that most of the bots reproduce content that has been posted by humans (either on Twitter or elsewhere, e.g. via external URLs). Additionally, this suggests that although there are two very distinct entity populations on Twitter, they largely share the same topics. This strongly indicates that bots are trying to appeal to humans, because human action (in the form of a like, retweet, follow, external redirection, influence, bias, manipulation, support, publicity, etc.) is the end goal for most of these entities, as noted in Chapter 4.
Figure 6.5: Word clouds of extracted bot clusters with their statistical labels (Table 6.2) and topic labels: Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), Television (T). Panels: (a) 0 - Young producers - DNP; (b) 1 - Young assistants - ANPST; (c) 2 - Assistants - ADO; (d) 3 - Popular content producers - DS; (e) 4 - Popular content redirectors - INP; (f) 5 - Stellar active engagers - INP.
Figure 6.6: Word clouds of extracted bot clusters with their statistical labels (Table 6.2) and topic labels: Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), Television (T). Panels: (a) 6 - Stellar passive engagers - ADIT; (b) 7 - Social chameleons - INPS.