these novel features with text features from past research, and demonstrated that those novel features are more suitable for CLVD. In addition, we explored the contributions of bots (automated algorithms) – often ignored or dismissed in research – to detecting vandalism on Wikipedia. This chapter showed our contribution of novel text features suitable for CLVD and how bots (bot editors) and users (human editors) compare in the vandalism detection task across languages.
• Chapter 7 developed a novel context-aware CLVD technique to address types of sneaky vandalism that involve changing the meaning of text. The technique used word labels and sequential patterns of occurring words to identify words used for vandalism, which also allowed immediate identification of vandal words that constitute evidence of malicious intentions. We compared this technique with the text feature technique of the previous chapter, showing differences in the vandalism each technique detects. This chapter provided a new research direction for vandalism detection research of context-aware techniques to tackle increasingly difficult types of vandalism.
• Chapter 8 showed the extendibility of CLVD techniques to other domains by applying the text features from Chapter 6 to detecting malicious content in spam emails. Many of the novel features proposed in Chapter 6 and specific to this chapter showed that text in spam emails is a strong predictor of whether the attachments (and to a lesser extent, URLs) of emails are malicious. These text features and classification algorithms significantly reduce the need to scan emails using comparatively more complex data sources. This chapter demonstrated that the CLVD techniques in this thesis can be applied to other application domains, and provided a new direction of research to find some of the most damaging types of spam emails from a cybercrime perspective [Hofmeyr et al., 2013].
9.2 Future Work
In this section, we summarise the future research directions of this thesis based on the summary sections of Chapters 4 to 8.
• Additional languages.¹ We aim to investigate additional languages that are among the largest language editions on Wikipedia². In particular, the language editions of Vietnamese, Mandarin Chinese, and Japanese have become some of the largest Asian languages represented on Wikipedia in terms of the number of articles. The European languages used in this thesis share similar text structures, which reduced the complexity of developing language independent features. The combination of Asian and European language families creates additional complex challenges because of the different representations of words, sentences, and grammar that may cause difficulties in developing language independent features.
¹ https://en.wikipedia.org/wiki/List_of_language_families
² https://meta.wikimedia.org/wiki/List_of_Wikipedias
• Cross-language summarisation measures. The summarisation measures presented in Chapter 4 allowed visualisation and determination of the knowledge coverage (similarity) of articles across languages and activity (stability) of articles within languages. We intend to further refine these measures and develop new measures (such as incorporating semantic knowledge from DBpedia) to allow us to characterise the changes on Wikipedia in different languages.
• Derived metadata features. The metadata features presented in Chapter 5 were limited to existing features, which allowed fast combinations of feature data between two data sets. We look to derive additional features based on monthly and yearly access patterns, and other more complex derived metadata features based on time, location, and reputation from West et al. [2010a].
• Derived and additional text features. Our proposed novel text features from Chapter 6 showed improved classification performance compared to text features used in related work. We intend to propose additional complex derived text features based on textual analysis and natural language processing to compare with the vast number of features used by West and Lee [2011].
• Combined metadata and text features. The CLVD research presented in Chapters 5 and 6 addressed the use of classifiers and contributions of bots and users, respectively. We look to combine these research aims in future work to comprehensively evaluate combinations of four distinguishing aspects of CLVD: classifiers, feature sets, user types, and languages.
• Contributions of different user types. Our experiments in Chapter 6 compared and contrasted the contributions of bots (bot editors) and users (human editors) because the counter-vandalism activities of bots are often not seen in research papers. Similarly, the differences between contributions of anonymous and registered users are often not distinguished in counter-vandalism research. These three user types have different contributions to detecting vandalism using metadata on the English Wikipedia as shown by West et al. [2010a]. We would like to extend our CLVD research to distinguish the three user types (bots, registered users, and anonymous users) and their contributions in future research.
• Evaluation of other classifiers. The Random Forest (RF) classifier was chosen for later chapters after experiments in Chapter 5 to avoid excessive results involving combinations of different classifiers, feature sets, and languages. We intend to revisit the experiments in Chapters 6 and 7 with different classifiers to explore whether there are more suitable classifiers that can meet the parallelism requirements and high classification scores of the RF classifier.
• Additional semantic tags. The tag set for the context-aware CLVD technique of Chapter 7 was limited to POS tags, but allowed us to demonstrate the feasibility and scalability of this novel technique for Wikipedia. We look to add new tag sets from different domains such as word semantics³, WordNet⁴, and Wikipedia ontologies⁵. More complex dependencies between these tag sets can also be modelled through feature functions to determine patterns of sneaky vandal words on Wikipedia.
• Modelling additional dependencies. The context-aware CLVD technique of Chapter 7 was limited to patterns of tags for sentences because of the linear-chain conditional random fields (CRF) classifier. The general CRF classifier [Sutton and McCallum, 2010] allows modelling additional dependencies between articles, which allows us to explore the spread of vandalism to adjacent internally linked articles, or articles linked across languages.
• Specific features for detecting malicious attachments and URLs. In Chapter 8, we used text features from Chapter 6 and additional features for attachments and URLs. These additional features were relatively simple compared to the text features, which may not have allowed the classifier to distinguish the malicious content. We look to include additional features based on lexical analysis of names of attachments and URLs [Ma et al., 2009b; Le et al., 2011; Khami et al., 2014], and avoid using external resources where possible.
• Spam campaign analysis. In Section 8.8 of Chapter 8 on detecting malicious spam emails, one issue we did not address is spam campaigns in our email data sets, because the email sources (their originating servers) were not disclosed and the email addresses within each of our data sets were anonymised. However, on review of the research for this thesis, our initial findings suggest that spam campaigns may be identifiable from approximate matching of text in different emails: although spam campaigns use templates [Stone-Gross et al., 2011], the variations introduced to avoid detection may not be significant enough to defeat text analysis for approximate matching [Kreibich et al., 2008].
• Other collaborative environments. Malicious activities such as vandalism also occur in other non-wiki based collaborative software systems. These systems face unique challenges such as preventing design flaws in experiments or tasks offered through Amazon's Mechanical Turk⁶, and preventing false map details or graffiti in OpenStreetMap⁷ [Neis et al., 2012]. We look to extend our research to these other collaborative systems to show the extendibility of our approaches in addition to the malicious spam email detection research of Chapter 8.
³ https://www.freebase.com/
⁴ http://wordnet.princeton.edu/
⁵ http://dbpedia.org/About
⁶ https://www.mturk.com/mturk/welcome
⁷ http://www.openstreetmap.org
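As an illustration of the approximate text matching mentioned under spam campaign analysis above, the following is a minimal sketch only, and not a technique evaluated in this thesis. It assumes that emails are available as plain text, represents each email by its set of word shingles, and groups emails whose Jaccard similarity to a group's first email reaches a threshold. The function names, the shingle length of 3, and the threshold of 0.5 are hypothetical choices made for this example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def group_campaigns(emails, threshold=0.5, k=3):
    """Greedily group emails by similarity to each group's first email.

    Returns a list of groups, each a list of indices into `emails`.
    The threshold is an illustrative assumption, not a tuned value.
    """
    groups = []  # pairs of (shingles of the first email, member indices)
    for index, text in enumerate(emails):
        current = shingles(text, k)
        for representative, members in groups:
            if jaccard(representative, current) >= threshold:
                members.append(index)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append((current, [index]))
    return [members for _, members in groups]
```

Under these assumptions, two emails generated from the same template that differ only in, for example, an invoice number share most of their shingles and fall into the same group, while unrelated emails start new groups. A pairwise comparison of this kind is quadratic in the worst case; larger data sets would likely need an indexing scheme such as locality-sensitive hashing, which is outside the scope of this sketch.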