Frantzeskou et al. [2004] extended the four-heading taxonomy of source code authorship analysis by Gray et al. [1997] (authorship identification, authorship characterisation, author intent determination, and author discrimination) with a fifth category, plagiarism detection, to highlight the common ground between authorship attribution and plagiarism detection. Plagiarism detection is concerned with finding matching segments in work samples, whereas authorship attribution is concerned with identifying the author of work samples using a trained model. The common ground between these areas is the use of metrics. Metric-based implementations have been used to analyse both stylistic traits in authorship attribution and content similarity for plagiarism detection. However, comparison of authorship attribution strategies to structure-based plagiarism detection systems such as JPlag [Prechelt et al., 2002] is not appropriate, as these systems match contiguous work sample fragments against one another. These approaches are not helpful in identifying authorship unless there are also non-trivial chunks of matching content.
There are many tools to detect plagiarism in both natural language and source code. In particular, source code approaches have been divided in the literature into metric-based and structure-based approaches [Verco and Wise, 1996]. We now review natural language plagiarism detection, metric-based source code plagiarism detection, and structure-based source code plagiarism detection.
Natural Language Plagiarism Detection
Natural language plagiarism detection tools have been used on work such as essays and reports for academic and corporate text-based domains. Hoad and Zobel [2002] described two methods for detecting plagiarised text documents: ranking and fingerprinting. Ranking involves presenting the user with a list of candidate answers sorted by similarity to a query; this approach is commonly used by search engines to retrieve multiple answers to a query where there is not necessarily a single correct answer. Ranking requires fast lookup of candidate documents using keywords through the use of an inverted index. A similarity measure is employed to give a score for candidate documents [Witten et al., 1999].
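The ranking approach described above can be illustrated with a minimal sketch. The function names, the whitespace tokenisation, and the simple TF-IDF weighting are all assumptions made for illustration; they are not the similarity measures of Hoad and Zobel [2002] or Witten et al. [1999].

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a {doc_id: term frequency} postings dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, num_docs):
    """Return (doc_id, score) pairs sorted from most to least similar.

    The weighting is an illustrative TF-IDF variant (an assumption).
    Only documents sharing at least one query term are touched, which
    is the point of looking terms up in an inverted index.
    """
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Sort by descending score, breaking ties by document identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "a": "the quick brown fox",
    "b": "the lazy dog",
    "c": "quick brown dogs and a quick fox",
}
index = build_inverted_index(docs)
ranking = rank("quick fox", index, len(docs))
```

Document "b" shares no term with the query and so never enters the candidate list, while "c" is ranked above "a" because it repeats a query term.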
4.1. RELATED AREAS
Fingerprinting involves the computation of compact descriptions of documents, typically numeric representations generated using a hash function. Selective fingerprinting can be used [Heintze, 1996] to make the representation even more compact and hence more scalable. Selective fingerprinting can be implemented with either fixed length fingerprints, where each fingerprint size is independent of document length, or variable length fingerprints, where the fingerprints are relative to the size of the full fingerprint. However, the selection of the document components that make up the fingerprint is non-trivial. Fingerprint hashes can be chosen using random selection, which produces poor results [Heintze, 1996]. A better selection strategy is one that returns similar fingerprints for similar documents by picking a fixed number of hashes with the lowest values [Heintze, 1996].
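The lowest-value selection strategy can be sketched as follows. This is a minimal illustration under stated assumptions: word trigrams as the hashed components, SHA-256 truncated to 32 bits as the hash function, and Jaccard overlap as the comparison are all choices made here, not details taken from Heintze [1996].

```python
import hashlib


def chunk_hashes(text, n=3):
    """Hash every run of n consecutive words to a 32-bit integer."""
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - n + 1):
        chunk = " ".join(words[i:i + n])
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        hashes.add(int.from_bytes(digest[:4], "big"))
    return hashes


def select_fingerprint(text, size=4, n=3):
    """Keep only the `size` smallest hashes as a fixed-length fingerprint.

    Because the choice depends only on hash values, two documents that
    share most of their chunks tend to keep the same hashes.
    """
    return set(sorted(chunk_hashes(text, n))[:size])


def resemblance(fp_a, fp_b):
    """Jaccard overlap of two fingerprints (an illustrative measure)."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Random selection, by contrast, could keep disjoint hash subsets for two identical documents, which is consistent with the poor results reported for it.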
Concerning specific systems, Turnitin [iParadigms, 2007b] and iThenticate [iParadigms, 2007a] are well-known text plagiarism detection tools. These systems compare submitted samples against repositories of documents obtained from the Web and other sources. However, details of the inner workings of these systems are difficult to obtain due to commercialisation.
The EVE (Essay Verification Engine) plagiarism detection system [Stevens and Jamieson, 2002] works similarly to Turnitin, as it compares documents to online sources. However, the comparison is done using current versions of Web documents on the fly, instead of downloading the content into a database. This means that although the content is fresh, the process is less efficient. Niezgoda and Way [2006] alleviated this problem by identifying document passages with high average word lengths in suspect documents, which were submitted as queries to the Google Web API for similarity analysis to online documents. This approach is advantageous as only the most relevant candidate documents are returned. However, there is an upper limit on the number of queries that can be processed from a Google Web API account per day, so scalability is limited.
Concerning scalability, the repetition of a single sentence may be enough to indicate wrongdoing or at least inappropriate or missing citation or quoting in plagiarism investigations. However, scalability is a challenge in authorship attribution, as reasonable amounts of content are needed to establish an authorial style. The exact amount is uncertain, as previous work has simply indicated that increasing the amount of training data increases accuracy in general [Zhao and Zobel, 2005].
Scalability is also relevant when checking for plagiarism outside of a collection. An example is comparing a collection of student essays to content that could have been copied online. There is no authorship attribution software that is as scalable as existing out-of-collection plagiarism detection services, such as Turnitin [iParadigms, 2007b], which suggests that the scalability of authorship attribution algorithms could be improved.
Plagiarism detection systems also need to be sensitive to proactive attempts to disguise incidents. Mozgovoy [2007] used semantic analysis to gain an understanding of document parts of speech, and then substituted them with tokens indicating the presence of nouns, verbs, places, people, and other placeholders, to detect simple substitutions. However, we suggest that reducing documents to parts of speech and other more generalised formats is unlikely to be helpful in authorship attribution, as individual word choices are important. For example, Kacmarcik and Gamon [2006] explained how the simple choice of “while” versus “whilst” represented good evidence in determining the authorship of the Federalist papers [Mosteller and Wallace, 1963]. Reducing these words to a part of speech (for instance, a conjunction token) would result in the loss of this stylistic trait.
Metric-Based Source Code Plagiarism Detection
Metric-based source code plagiarism detection systems use quantitative software measurements to identify potentially plagiarised samples. The two most recent contributions in this area are the works by Jones [2001] and Engels et al. [2007], which we now review.
Jones [2001] described an unnamed system based on a vector comprising a hybrid of three physical metrics (line, word, and character counts) and three Halstead metrics [Halstead, 1972] (token occurrences, unique tokens, and Halstead volume) to characterise code. The Euclidean distance measure was used on normalised vectors of these measurements to score program closeness.
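A six-element metric vector of this kind can be sketched as follows. The regular-expression tokeniser, the use of all tokens (rather than separate operators and operands) for the Halstead counts, and the unit-length normalisation are assumptions made for illustration; Jones [2001] may have defined each of these differently.

```python
import math
import re

# Crude tokeniser (an assumption): identifiers, integers, or any other
# single non-whitespace character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def metric_vector(source):
    """Return [lines, words, characters, tokens, unique tokens, volume]."""
    tokens = TOKEN.findall(source)
    total = len(tokens)
    unique = len(set(tokens))
    # Halstead volume: N * log2(n), with N total and n unique tokens.
    volume = total * math.log2(unique) if unique > 0 else 0.0
    return [
        float(len(source.splitlines())),
        float(len(source.split())),
        float(len(source)),
        float(total),
        float(unique),
        volume,
    ]


def normalise(vector):
    """Scale a vector to unit length (one possible normalisation)."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector] if length else list(vector)


def distance(source_a, source_b):
    """Euclidean distance between normalised metric vectors."""
    a = normalise(metric_vector(source_a))
    b = normalise(metric_vector(source_b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A distance of zero indicates identical metric profiles, and larger values indicate programs that are less close under these six measurements.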
The Plague Doctor presented by Engels et al. [2007] is the only metric-based plagiarism detection system we have seen that has employed machine learning techniques. The key idea is to combine the textual analysis techniques of widely accepted structure-based plagiarism detection tools, such as MOSS (Measure of Software Similarity) [Schleimer et al., 2003] and JPlag [Prechelt et al., 2002], with “cues that instructors themselves use when visually scanning two assignments for signs of plagiarism”, such as use of comments and white space that MOSS and JPlag ignore [Engels et al., 2007]. They proposed twelve software metrics that were used to train a neural network classifier for making plagiarism decisions. The first metric is the output score of MOSS, which is the only representation of a structure-based plagiarism detection system. The remaining metrics concerned differences between duplicate source lines, misspelled words in comments, submission lengths, constants, string literals, looping constructs, white space characters, and a random number as a sanity checking mechanism.
The works by Jones [2001] and Engels et al. [2007] are the only recent source code plagiarism detection systems that use software metrics. The emphasis has since largely moved towards structure-based systems. The earliest metric-based systems were amongst the first that implemented electronic plagiarism detection, and some measured archaic features that are less relevant to modern-day programming languages. However, the software metrics in this older literature are still of interest, as these have driven almost all source code authorship attribution research to date.
The other metric-based systems have employed between four and twenty-four metrics. The earliest work is that of Ottenstein [1976], which used the n1, n2, N1, and N2 Halstead metrics [Halstead, 1972] (defined in Section 2.4.1, p. 36) to indicate possible plagiarised Fortran programs when the four values were the same. Robinson and Soffa [1980] calculated the number of blocks, number of statements per block, control structures used, and data types used in data structures, to eliminate program pairs from contention that did not have measurement differences within heuristic ranges. Grier [1981] built on the work of Ottenstein [1976] for Pascal programs with the introduction of three new metrics: lines of code, variables used, and number of control statements. Dissimilarity of the programs was then computed using the sum of the differences in the seven measurements. Whale [1986] used a three-pass procedure to detect plagiarism in Pascal and Prolog programs; first, the complexity of each source code block was computed with a complexity measure based on the individual statements in the block; second, candidate samples were shortlisted with a nearest-neighbour measure; and third, a variation of the longest common subsequence algorithm [Hirschberg, 1975] was computed (a structure-based plagiarism detection component), to identify a final set of programs for inspection. Finally, Faidhi and Robinson [1987] evaluated twenty-four counting metrics and intrinsic metrics (such as those dealing with flow control and modularisation), and found that the latter category contributed more towards plagiarism detection.
All of the above approaches are relevant to authorship attribution due to the use of software metrics. Even features that are easy to modify by plagiarists such as comments are of interest in authorship attribution. These features are conversely a limitation in existing plagiarism detection work such as that by Faidhi and Robinson [1987], as these are easy to modify to hide plagiarism. Moreover, authorship attribution features would benefit from being insensitive to program length, so that measurements are not biased towards short or long programs. This is unlike the work by Donaldson et al. [1981], which presented a metric-based plagiarism detection system for Fortran code with summation metrics that are sensitive to program length.
Structure-Based Source Code Plagiarism Detection
Mozgovoy [2007] reviewed three kinds of structure-based plagiarism detection systems: fingerprinting systems, string matching systems, and parse tree systems, which we now review in turn.
The fingerprinting approach has been demonstrated in the MOSS software [Schleimer et al., 2003], which hashes n-gram representations of source code to single integers. The complete set of hashed values is the initial fingerprint. The pool of hashed values is then reduced in size by applying a fingerprint selection algorithm called winnowing. In this algorithm, a sliding window of size w is positioned over each value in the sequence of hash values in turn, and the smallest value is selected provided that it was not present in one of the previous windows. The rightmost value is selected in the event of duplicates. The final fingerprint is therefore highly compressible, since the smallest hash values have been selected.
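The winnowing step described above can be sketched as follows. This is a minimal illustration rather than the MOSS implementation: the character n-grams, the CRC-32 hash, and the function names are assumptions, and the sketch records each selected hash together with its position so that repeated selections of the same occurrence can be skipped.

```python
import zlib


def ngram_hashes(text, n=5):
    """Hash every character n-gram to a single integer (CRC-32 here)."""
    return [zlib.crc32(text[i:i + n].encode("utf-8"))
            for i in range(len(text) - n + 1)]


def winnow(hashes, w=4):
    """Select (hash, position) pairs from a sequence of hash values.

    For each window of w consecutive hashes the smallest value is
    chosen, taking the rightmost occurrence when the minimum is
    duplicated. Selected positions never move backwards from one
    window to the next, so comparing against the last selected
    position is enough to avoid recording the same occurrence twice.
    """
    selected = []
    last_position = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum in the window.
        offset = w - 1 - window[::-1].index(smallest)
        position = start + offset
        if position != last_position:
            selected.append((hashes[position], position))
            last_position = position
    return selected


# A short hand-made hash sequence, used only to illustrate the selection.
example = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
fingerprint = winnow(example, w=4)
```

Here seventeen hash values are reduced to five selected values, and every window of four consecutive hashes still contributes at least one value to the fingerprint.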
Prechelt et al. [2002] described the JPlag structure-based plagiarism detection system with string matching. First, the source code samples are parsed and converted into token streams. Second, the token streams are compared in an exhaustive pairwise fashion using the greedy string tiling algorithm, as used in the YAP (Yet Another Plague) plagiarism detection system [Wise, 1996]. Collections of maximally overlapping token streams above a threshold length are stored and given a similarity score. Program pairs with similarity scores above a threshold percentage are made available to the user, with links to pairs of interest for manual side-by-side inspection, as shown in Figure 4.1.
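Greedy string tiling over two token streams can be sketched as follows. This is a simplified, unoptimised version written for illustration: the token streams are supplied directly rather than produced by a parser, the function names are hypothetical, and the similarity formula (twice the tiled length over the combined stream length) is an assumption of this sketch rather than a statement of how JPlag or YAP compute their scores.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping tiles (start_a, start_b, length).

    Each round finds the longest common runs of unmarked tokens, marks
    them as tiles, and repeats until no run of at least min_match
    tokens remains.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match
        matches = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k]
                       and not marked_b[j + k]):
                    k += 1
                if k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
                elif k == max_match:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches that overlap a tile marked earlier this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a, b, min_match=3):
    """Share of both token streams that is covered by tiles."""
    if not a and not b:
        return 0.0
    covered = sum(k for _, _, k in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b))
```

Because tiles may appear in any order, a program whose blocks have simply been rearranged still receives a high score.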
Parse tree comparison refers to making use of the hierarchy of programs for similarity calculation. This has been demonstrated by Gitchell and Tran [1999] in the sim plagiarism detection tool that tries to best align each program module between two programs and compare their similarity using global alignment [Needleman and Wunsch, 1970]. Belkhouche et al. [2004] essentially implemented parse trees by transforming the source code into tree-like structure charts. They then identified highly coupled regions and compared the structural similarities of these regions with those of other programs.
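Global alignment of two token sequences can be sketched with the standard dynamic programming recurrence. The scoring values (match, mismatch, and gap) and the function name are assumptions chosen for illustration; they are not the parameters used by sim.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per token.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two tokens
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

Unlike local alignment, every token of both sequences must be accounted for, so unmatched leading or trailing code lowers the score.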
Mozgovoy [2007] compared the speed and reliability of fingerprinting, greedy string tiling, and tree-matching methods, and expressed the performance of these categories as a trade-off between speed and reliability. Speed was expressed in terms of the runtime complexity of the software analysed in each category. Fingerprinting was suggested to be the fastest, followed by string matching and then tree matching. The observation was that speed comes at the cost of reliability, and hence the order of these categories for reliability is reversed.
In other work, Burrows et al. [2006] described a scalable code similarity approach that uses the Zettair search engine [Search Engine Group, 2009] to index n-grams of tokens extracted from the parsed program source code. At search time, the index is queried using the n-gram representations of each program to identify candidate results. Candidate results above a similarity threshold are then filtered using the local alignment approximate string matching technique [Smith and Waterman, 1981]. Burrows et al. [2006] showed that their work is competitive and highly scalable compared to MOSS and JPlag.
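The two-stage idea of n-gram candidate selection followed by local alignment can be sketched as follows. This is only an illustration: a set-overlap measure stands in for querying a search engine index, and the function names, n-gram length, threshold, and scoring values are all assumptions rather than the settings of Burrows et al. [2006].

```python
def token_ngrams(tokens, n=3):
    """Return all runs of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(query, candidate, n=3):
    """Fraction of the query's distinct n-grams found in the candidate."""
    query_grams = set(token_ngrams(query, n))
    candidate_grams = set(token_ngrams(candidate, n))
    if not query_grams:
        return 0.0
    return len(query_grams & candidate_grams) / len(query_grams)


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, keeping one table row at a time."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,
                       previous[j - 1] + pair,
                       previous[j] + gap,
                       current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def filter_candidates(query, candidates, n=3, threshold=0.5):
    """Score only candidates whose n-gram overlap reaches the threshold."""
    results = []
    for name, tokens in candidates.items():
        if ngram_overlap(query, tokens, n) >= threshold:
            results.append((name, local_alignment_score(query, tokens)))
    return sorted(results, key=lambda item: (-item[1], item[0]))
```

The cheap n-gram stage limits how many program pairs reach the quadratic-time alignment stage, which is where the scalability of such a design comes from.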
Generally speaking, structure-based plagiarism detection systems such as the ones reviewed above are unsuited to verifying authorship, as they aim to identify functionally equivalent code, rather than stylistically consistent code. However, since n-grams are typically used to represent small patterns of adjacent features, their use can indicate a general preference for features that regularly occur near one another. Therefore, the use of n-grams is the construct we take from the above structure-based plagiarism detection literature for our information retrieval approach to source code authorship attribution in Chapter 5. Moreover, since the work by Burrows et al. [2006] has demonstrated successful use of information retrieval together with plagiarism detection, it raises the question of how well information retrieval can be used with source code authorship attribution.

Figure 4.1: The JPlag plagiarism detection system interface showing the similarity scores of various program pairs with student numbers marked as “0000000”. Permission to use a JPlag screenshot was provided by JPlag creator Guido Malpohl on 11 May 2010.