• No se han encontrado resultados

2.3. Uso del ADN como identificador biométrico

2.3.6. Bases de datos

To evaluate the performance of semantic features for file-level defect prediction tasks, we compare the semantic features with two different traditional features. Our first baseline of traditional features consists of 20 traditional features. Table 3.6 shows the details of the 20 features and their descriptions. These features and data have been widely used in previous work to build effective defect prediction models [84, 109, 169, 170, 194, 326].

We choose the widely used PROMISE data so that we can directly compare our ap- proach with previous studies. For a fair comparison, we also perform the noise removal approach described in Section 3.3.2 on the PROMISE data.

The traditional features from PROMISE do not contain AST nodes, which were used as the input by our DBN models. For a fair comparison, our second baselineof traditional features is the AST nodes that were given to our DBN models, i.e., the AST nodes in all files after handling the noise (Section 3.3.2). Each instance is represented as a vector of term frequencies of the AST nodes.

Table 3.6: Benchmark metrics used for file-level defect prediction.

Metric Description

WMC the number of methods used in a given class [46]

DIT the maximum distance from a given class to the root of an inheritance tree [46] NOC the number of children of a given class in an inheritance tree [46]

CBO the number of classes that are coupled to a given class [46]

RFC the number of distinct methods invoked by code in a given class [46]

LCOM the number of method pairs in a class that do not share access to any class attributes [46]

LCOM3 another type of lcom metric proposed by Henderson-Sellers [57] NPM the number of public methods in a given class [22]

LOC the number of lines of code in a given class [22]

DAM the ratio of the number of private/protected attributes to the total number of attributes in a given class [22]

MOA the number of attributes in a given class which are of user-defined types [22] MFA the number of methods inherited by a given class divided by the total number of

methods that can be accessed by the member methods of the given class [22] CAM summation of number of different types of method parameters in every method

divided by a multiplication of number of different method parameter types in whole class and number of methods [22]

IC the number of parent classes that a given class is coupled to [111]

CBM the total number of new or overwritten methods that all inherited methods in a given class are coupled to [111]

AMC the average size of methods in a given class [111]

CA afferent coupling, which measures the number of classes that depends upon a given class [161]

CE efferent coupling, which measures the number of classes that a given class depends upon [161]

Max CC the maximum McCabe’s cyclomatic complexity (CC) score [163] of methods in a given class

Avg CC the arithmetic mean of the McCabe’s clomatic complexity (CC) scores [163] of methods in a given class

Baselines for Evaluating Change-level Defect Prediction

Our baseline features for change-level defect prediction include three types of change features, i.e.,bag-of-words features,characteristic features, andmeta features, which have been used in previous studies [104, 256].

• Bag-of-words features: The bag-of-words feature set is a vector representing the count of occurrences of each word in the text of changes. We employ the snowBall

stemmer to group words of the same root, then we use Weka [74] to obtain the bag-of-words features from both the commit messages and the source code changes.

• Characteristic features: Inspired by the Deckard tool [103], we use characteristic vectors as features. Characteristic vectors represent the syntactic structure by count- ing the numbers of each node type in the Abstract Syntax Tree (AST). Bag-of-words and characteristic vectors have different abstraction levels. Although bag-of-words can capture keywords, such as if and while, it cannot capture abstract syntactic structures, such as the number of statements. Suppose that we are usingifandelse

node types for characteristic vectors, the characteristic vector of the code before the changes shown in Figure 3.5 is (1, 1). After obtaining the characteristic vectors for the file before the change and the file after the change, we subtract the two charac- teristic vectors to obtain the difference. For each change, we use Deckard [103] to automatically generate two characteristic vectors: one for the source code file before the change and one for the source code file after the change. We use the difference between the two characteristic vectors and the characteristic vector of the file after the change as two sets of features.

• Meta features: In addition to characteristic and bag-of-words vectors, we also use a set of metadata features, which includes the basic information of changes, e.g., commit time, filename, developers, etc. It also contains code change metrics, e.g., the added line count per change, the deleted line count per change, etc.