Our benchmark can be used to measure the recall of clone detection tools and estimate their precision.

Recall and precision are shown in (8.1), where $B_{tc}$ is the set of all true clone pairs in the benchmark, $B_{fc}$ is

the set of all false clone pairs in the benchmark, and $D$ is the set of candidate clone pairs reported by the

detector. Also interesting is measuring a tool’s recall for subsets of $B_{tc}$: for example, all clone pairs of a

particular functionality, all clone pairs of a particular type, all Type-3 clone pairs within a particular range

of syntactical similarity, and so on. Precision is estimated as the fraction of the known clone pairs (true and

false) found by the detector that are true clones. This estimate ignores detected clones that are unknown to the

benchmark. However, the primary purpose of our benchmark is to measure recall, which has been an open

problem in the community for the last decade. While the estimate of precision provides some insight, it does

not replace a true measurement of precision: the manual validation of the output of a tool. Without benchmarks like ours, however, tool developers cannot measure recall because they do not know which

clone pairs exist in a system or repository.

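\begin{equation}
\text{recall} = \frac{|D \cap B_{tc}|}{|B_{tc}|}
\qquad
\text{precision} = \frac{|D \cap B_{tc}|}{|D \cap (B_{tc} \cup B_{fc})|}
\tag{8.1}
\end{equation}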
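As a minimal sketch of (8.1), assuming clone pairs are represented as order-independent pairs of snippet identifiers, both measures reduce to set operations. The function names, the `pair` helper, and the toy identifiers below are illustrative assumptions and are not part of the benchmark's own tooling.

```python
def pair(a, b):
    """Order-independent key for a clone pair (hypothetical representation)."""
    return frozenset((a, b))


def recall(detected, true_pairs):
    """Recall as in (8.1): |D ∩ Btc| / |Btc|."""
    if not true_pairs:
        raise ValueError("the benchmark contains no true clone pairs")
    return len(detected & true_pairs) / len(true_pairs)


def estimated_precision(detected, true_pairs, false_pairs):
    """Precision estimate as in (8.1): |D ∩ Btc| / |D ∩ (Btc ∪ Bfc)|.

    Detected pairs that the benchmark does not know are ignored.
    Returns None when the detector reported no known pair at all.
    """
    known_detected = detected & (true_pairs | false_pairs)
    if not known_detected:
        return None
    return len(detected & true_pairs) / len(known_detected)


# Toy data: four true pairs, two false pairs, five reported candidates.
B_tc = {pair(1, 2), pair(3, 4), pair(5, 6), pair(7, 8)}
B_fc = {pair(1, 9), pair(2, 9)}
D = {pair(2, 1), pair(3, 4), pair(5, 6), pair(1, 9), pair(10, 11)}

print(recall(D, B_tc))                     # 3 of 4 true pairs found
print(estimated_precision(D, B_tc, B_fc))  # 3 true among 4 known detected
```

The candidate `pair(10, 11)` is unknown to the toy benchmark, so it affects neither measure; this is why the second value is only an estimate of precision.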

Our benchmark is outside the scalability constraints of classical clone detection tools, which are not

designed for large-scale inputs. While these tools cannot be executed for IJaDataset in its entirety, they can be

executed for subsets of the benchmark. The subsets would need to be small enough such that the tool could

be executed for the relevant source files without scalability issues. The subsets could be randomly chosen,

could be all the true and false clone pairs found for a functionality, or could even be the intra-project clone

pairs found in one of the 25,000 original subject systems crawled for IJaDataset. High confidence could be

achieved by evaluating the tool for a large number of subsets.

[Figure: number of clone pairs (logarithmic axis, 1,000 to 10,000,000) per clone similarity range, e.g., [40-50); series: Line, Token, Average.]
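The subset-based evaluation described above can be sketched as follows, assuming clone pairs are sortable tuples of file identifiers and that `run_detector` is a hypothetical callable wrapping the tool under test (it takes a set of files and returns the set of pairs it reports). Random sampling is only one of the subset choices mentioned in the text.

```python
import random
from statistics import mean


def sample_subsets(pairs, subset_size, n_subsets, seed=0):
    """Draw n_subsets random subsets of the benchmark's clone pairs."""
    rng = random.Random(seed)
    pool = sorted(pairs)  # fixed order keeps the sampling reproducible
    return [set(rng.sample(pool, subset_size)) for _ in range(n_subsets)]


def mean_recall(true_pairs, run_detector, subset_size, n_subsets, seed=0):
    """Average the per-subset recall of a detector over many small subsets."""
    recalls = []
    for subset in sample_subsets(true_pairs, subset_size, n_subsets, seed):
        # Only the files referenced by this subset are given to the tool,
        # so each execution stays within its scalability limits.
        files = {f for p in subset for f in p}
        detected = run_detector(files)
        recalls.append(len(detected & subset) / len(subset))
    return mean(recalls)


# Toy benchmark and two stand-in detectors (assumptions for illustration).
toy_true_pairs = {(f"a{i}.java", f"b{i}.java") for i in range(20)}
perfect = lambda files: set(toy_true_pairs)
blind = lambda files: set()

print(mean_recall(toy_true_pairs, perfect, subset_size=5, n_subsets=10))
print(mean_recall(toy_true_pairs, blind, subset_size=5, n_subsets=10))
```

Increasing `n_subsets` corresponds to evaluating the tool for a larger number of subsets, which is what raises confidence in the averaged result.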

An advantage of using our large-scale benchmark to evaluate these classical tools is clone variety. In

addition to their inherent weaknesses [9, 128], classical benchmarks only consider 1-10 subject systems, which

provides a limited variety of clones, especially Type-1 and Type-2. In our experience with the subject systems of Bellon’s benchmark [13], the clone pairs from a single subject system are often dominated by

a few large clone classes, and therefore have very little variety. In contrast, our benchmark considers 43

functionalities across 25,000 subject systems with a total of 8.9 million clone pairs. Also, our benchmark was

built independently of clone detection tools.

Since our benchmark consists of clones of particular functionalities, it is very useful for evaluating semantic

clone detectors (e.g., [38]). To our knowledge, there is also no significant benchmark for semantic clone

detectors. While semantic clone detectors may not scale to large inputs, they could be executed for

subsets of the benchmark. Good subsets would be the individual functionalities, or a random selection of

true and false clone pairs from each of the functionalities.
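Using the individual functionalities as subsets amounts to computing recall separately for each one. A minimal sketch, assuming each true clone pair is labelled with a hypothetical functionality identifier:

```python
from collections import defaultdict


def recall_by_functionality(detected, labelled_true_pairs):
    """Per-functionality recall.

    labelled_true_pairs maps each true clone pair to the identifier of the
    functionality it implements (an assumed, illustrative data layout).
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for clone_pair, functionality in labelled_true_pairs.items():
        total[functionality] += 1
        if clone_pair in detected:
            found[functionality] += 1
    return {f: found[f] / total[f] for f in total}


# Toy labels: two functionalities with two true pairs each.
labelled = {
    ("a.java", "b.java"): "copy_file",
    ("c.java", "d.java"): "copy_file",
    ("e.java", "f.java"): "bubble_sort",
    ("g.java", "h.java"): "bubble_sort",
}
reported = {("a.java", "b.java"), ("c.java", "d.java"), ("e.java", "f.java")}

print(recall_by_functionality(reported, labelled))
```

A breakdown like this shows whether a detector's misses are concentrated in particular functionalities rather than spread evenly.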

While our focus was on measuring recall and precision, the benchmark can also be used as a common

target for measuring clone detection execution time and scalability. Big data clone detection and search tools

can be compared by their execution time for IJaDataset. Classical tools can be compared by the benchmark

subset size they can handle, and their execution time for common subsets. Additionally, some large-scale

clone detection tools [141] use common large-scale analysis frameworks such as Hadoop [36]. Our benchmark

can be used to evaluate the execution performance (time and scalability) of these frameworks when used in

a clone detection context.

8.5.1 Example Tool Evaluation: D-NiCad

While a tool evaluation experiment is out of the scope of this work, we provide a small demonstration of

an example use of our benchmark. We used our benchmark to evaluate D-NiCad, a distributed version of NiCad that scales to large datasets. It uses the distributed and deterministic scalability heuristic introduced

by D-CCFinder [84]. This heuristic executes NiCad for subsets of IJaDataset within its scalability limits.

Across a large number of executions, NiCad is exposed to every file pair (and thus every clone) in the dataset.

The executions are distributed over a number of computers. For this case study, we executed D-NiCad for a

subset of IJaDataset that includes the files containing the sample snippets and tagged snippets of the first

ten functionalities in Table 8.1. D-NiCad was configured to detect function clones of 6 lines or more,

using a 70% similarity threshold and full Type-1/2 source normalization.
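The property that every file pair is seen by some execution can be illustrated with a simple partitioning scheme. This is a sketch of one way to obtain that property, not necessarily the scheme used by D-CCFinder or D-NiCad: the files are split into partitions of at most half the tool's limit, and the tool is run once per pair of partitions. The function names and the `max_files_per_run` parameter are assumptions.

```python
from itertools import combinations


def partition(files, part_size):
    """Split the files deterministically into chunks of at most part_size."""
    ordered = sorted(files)
    return [ordered[i:i + part_size] for i in range(0, len(ordered), part_size)]


def plan_executions(files, max_files_per_run):
    """List the file sets to run the detector on.

    Each run holds at most max_files_per_run files, and every pair of files
    appears together in at least one run.
    """
    if max_files_per_run < 2:
        raise ValueError("a run must hold at least two files")
    parts = partition(files, max_files_per_run // 2)
    if len(parts) == 1:
        return [parts[0]]
    # Files in different partitions meet in the run for that partition pair;
    # files in the same partition meet in every run that includes it.
    return [a + b for a, b in combinations(parts, 2)]


files = [f"f{i}.java" for i in range(10)]
runs = plan_executions(files, max_files_per_run=4)
print(len(runs), "runs of at most", max(len(r) for r in runs), "files")
```

Because the runs are independent of one another, they can be distributed over several machines, and the reported clone pairs can be merged by set union afterwards.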

D-NiCad’s recall results are as follows: Type-1 (99.7%), Type-2 (99.6%), Strongly Type-3 (93.0%), Moderately

Type-3 (0.5%), and Weakly Type-3+4 (0%). Since NiCad is a line-based tool, we separated these

Type-3 true clone pairs using the line-based metric. Our benchmark estimates D-NiCad’s precision as 99%. Our earlier studies [128] have shown NiCad to have very high intra-project recall (99-100%) for the first

Our benchmark reveals that there are many true clone pairs D-NiCad misses because their similarity is below

NiCad’s recommended similarity threshold. This detection information can be used to improve NiCad’s

detection performance for large-scale inter-project clone detection. D-NiCad has strong recall and precision

for Type-1, Type-2, and Strongly Type-3 clones. Ideally, future development will lower its recommended similarity threshold into the Moderately Type-3 clone range while maintaining its superb precision [110].

These results demonstrate the need for our large-scale benchmark. Classical intra-project benchmarks

did not reveal these gaps in NiCad’s detection [128], perhaps because these clones have properties that are

specific to inter-project cloning, or perhaps because there are edge-case gaps in NiCad’s detection abilities that are

not revealed by the limited number and variety of intra-project clones in a handful of subject systems [13].

A standard, detector-agnostic, large-scale benchmark is needed to properly evaluate clone detection

techniques.