Our benchmark can be used to measure the recall of clone detection tools and estimate their precision.
Recall and precision are shown in (8.1), where $B_{tc}$ is the set of all true clone pairs in the benchmark, $B_{fc}$ is
the set of all false clone pairs in the benchmark, and $D$ is the set of candidate clone pairs reported by the
detector. Also interesting is measuring a tool’s recall for subsets of $B_{tc}$: for example, all clone pairs of a
particular functionality, all clone pairs of a particular type, all Type-3 clone pairs within a particular range
of syntactical similarity, and so on. Precision is estimated as the fraction of the known clone pairs (true and
false) found by the detector that are true clones; detected clones that are unknown to the
benchmark are ignored. However, the primary purpose of our benchmark is to measure recall, which has been an open
problem in the community for the last decade. While the estimate of precision provides some insight, it does
not replace a true measurement of precision: the manual validation of the output of a tool. Recall, in contrast, cannot be measured by tool developers without benchmarks like ours, because they do not know which
clone pairs exist in a system or repository.
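\begin{equation}
\text{recall} = \frac{|D \cap B_{tc}|}{|B_{tc}|}, \qquad
\text{precision} = \frac{|D \cap B_{tc}|}{|D \cap (B_{tc} \cup B_{fc})|}
\tag{8.1}
\end{equation}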
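As a minimal sketch of (8.1), the two measures reduce to set operations. The representation below is an assumption made for illustration only: each clone pair is an unordered pair of hypothetical fragment identifiers, and the function names are not part of the benchmark.

```python
from typing import FrozenSet, Set

# Assumption: a clone pair is an unordered pair of fragment identifiers.
Pair = FrozenSet[str]


def pair(a: str, b: str) -> Pair:
    """Build an order-independent clone pair."""
    return frozenset((a, b))


def recall(detected: Set[Pair], true_pairs: Set[Pair]) -> float:
    """|D ∩ B_tc| / |B_tc|, as in (8.1)."""
    if not true_pairs:
        raise ValueError("recall is undefined without true clone pairs")
    return len(detected & true_pairs) / len(true_pairs)


def estimated_precision(detected: Set[Pair],
                        true_pairs: Set[Pair],
                        false_pairs: Set[Pair]) -> float:
    """|D ∩ B_tc| / |D ∩ (B_tc ∪ B_fc)|, as in (8.1).

    Detected pairs unknown to the benchmark are ignored.
    """
    known_detected = detected & (true_pairs | false_pairs)
    if not known_detected:
        raise ValueError("no detected pair is known to the benchmark")
    return len(detected & true_pairs) / len(known_detected)


# Toy data (hypothetical identifiers).
B_tc = {pair("a", "b"), pair("a", "c"), pair("b", "c"), pair("d", "e")}
B_fc = {pair("a", "d"), pair("b", "e")}
D = {pair("b", "a"), pair("a", "c"), pair("a", "d"), pair("x", "y")}

print(recall(D, B_tc))                     # 2 of 4 true pairs found: 0.5
print(estimated_precision(D, B_tc, B_fc))  # 2 of 3 known detected pairs are true
```

The pair ("x", "y") is reported by the toy detector but is unknown to the benchmark, so it affects neither value. This is why the precision figure is only an estimate.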
Our benchmark is outside the scalability constraints of classical clone detection tools, which are not
designed for large-scale detection. While these tools cannot be executed for IJaDataset in its entirety, they can be
executed for subsets of the benchmark. The subsets would need to be small enough such that the tool could
be executed for the relevant source files without scalability issues. The subsets could be randomly chosen,
could be all the true and false clone pairs found for a functionality, or could even be the intra-project clone
pairs found in one of the 25,000 original subject systems crawled for IJaDataset. High confidence could be
achieved by evaluating the tool for a large number of subsets.
[Figure: number of clone pairs (logarithmic axis, 1,000 to 10,000,000) per clone similarity range, e.g., [40-50); series: Line, Token, Average.]
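The subset-based evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: clone pairs are simplified to pairs of file paths, and `run_detector` is a hypothetical callable that runs a tool on a set of files and returns the pairs it reports.

```python
import random
from typing import Callable, List, Set, Tuple

# Assumption: a clone pair is (file_a, file_b) with file_a < file_b.
Pair = Tuple[str, str]


def recall_over_subsets(true_pairs: Set[Pair],
                        false_pairs: Set[Pair],
                        run_detector: Callable[[Set[str]], Set[Pair]],
                        subset_size: int,
                        trials: int,
                        seed: int = 0) -> List[float]:
    """Measure recall on randomly chosen subsets of the known pairs."""
    rng = random.Random(seed)
    known = sorted(true_pairs | false_pairs)
    recalls = []
    for _ in range(trials):
        sample = set(rng.sample(known, min(subset_size, len(known))))
        sample_true = sample & true_pairs
        if not sample_true:
            continue  # recall is undefined for a subset with no true pairs
        # Only the files referenced by the sampled pairs are given to the tool.
        files = {path for p in sample for path in p}
        detected = run_detector(files)
        recalls.append(len(detected & sample_true) / len(sample_true))
    return recalls
```

Averaging the returned values over many trials gives the per-tool figure. The subset size would be chosen to stay within the scalability limits of the tool under evaluation.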
An advantage of using our large-scale benchmark to evaluate these classical tools is clone variety. In
addition to their inherent weaknesses [9, 128], classical benchmarks only consider 1-10 subject systems, which
provides a limited variety of clones, especially Type-1 and Type-2. In our experience with the subject systems of Bellon’s benchmark [13], the clone pairs from a single subject system are often dominated by
a few large clone classes, and therefore have very little variety. In contrast, our benchmark considers 43
functionalities across 25,000 subject systems with a total of 8.9 million clone pairs. Also, our benchmark was
built independently of clone detection tools.
Since our benchmark consists of clones of particular functionalities, it is very useful for evaluating semantic
clone detectors (e.g., [38]). To our knowledge, there is also no significant benchmark for semantic clone
detectors. While semantic clone detectors may not scale to large datasets, they could be executed for
subsets of the benchmark. Good subsets would be the individual functionalities, or a random selection of
true and false clone pairs from each of the functionalities.
While our focus was on measuring recall and precision, the benchmark can also be used as a common
target for measuring clone detection execution time and scalability. Big data clone detection and search tools
can be compared by their execution time for IJaDataset. Classical tools can be compared by the benchmark
subset size they can handle, and their execution time for common subsets. Additionally, some large-scale
clone detection tools [141] use common large-scale analysis frameworks such as Hadoop [36]. Our benchmark
can be used to evaluate the execution performance (time and scalability) of these frameworks when used in
a clone detection context.
8.5.1 Example Tool Evaluation: D-NiCad
While a tool evaluation experiment is out of the scope of this work, we provide a small demonstration of
an example use of our benchmark. We used our benchmark to evaluate D-NiCad, a distributed version of NiCad that scales to large datasets. It uses the distributed and deterministic scalability heuristic introduced
by D-CCFinder [84]. This heuristic executes NiCad for subsets of IJaDataset within its scalability limits.
Across a large number of executions, NiCad is exposed to every file pair (and thus every clone) in the dataset.
The executions are distributed over a number of computers. For this case study, we executed D-NiCad for a
subset of IJaDataset that includes the files containing the sample snippets and tagged snippets of the first
ten functionalities in Table 8.1. D-NiCad was configured to detect function clones of 6 lines or more,
with a 70% similarity threshold and full Type-1/2 source normalization.
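The idea behind the scalability heuristic, that every file pair is seen together in at least one bounded-size execution, can be sketched as below. This illustrates only the general partition-pair scheme; the chunking strategy, function names, and chunk size are assumptions, and the actual details of D-CCFinder and D-NiCad are not given here.

```python
from itertools import combinations
from typing import Iterable, List


def partition(files: Iterable[str], chunk_size: int) -> List[List[str]]:
    """Split the files into chunks deterministically (sorted order)."""
    ordered = sorted(files)
    return [ordered[i:i + chunk_size]
            for i in range(0, len(ordered), chunk_size)]


def schedule(files: Iterable[str], chunk_size: int) -> List[List[str]]:
    """Return one file list per detector execution.

    Each execution holds at most 2 * chunk_size files, and every pair of
    files appears together in at least one execution.
    """
    chunks = partition(files, chunk_size)
    if len(chunks) <= 1:
        return chunks  # everything fits in a single execution
    return [chunks[i] + chunks[j]
            for i, j in combinations(range(len(chunks)), 2)]


jobs = schedule([f"f{i}.java" for i in range(10)], chunk_size=3)
print(len(jobs))  # 4 chunks give 6 chunk pairs, so 6 executions
```

The executions are independent of each other, so they can be distributed over several machines. The number of executions grows quadratically with the number of chunks, which is the cost of keeping each execution within the tool's limits.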
D-NiCad’s recall results are as follows: Type-1 (99.7%), Type-2 (99.6%), Strongly Type-3 (93.0%), Moderately
Type-3 (0.5%), and Weakly Type-3+4 (0%). Since NiCad is a line-based tool, we separated these
Type-3 true clone pairs using the line-based metric. Our benchmark estimates D-NiCad’s precision as 99%. Our earlier studies [128] have shown NiCad to have very high intra-project recall (99-100%) for the first
Our benchmark reveals that there are many true clone pairs D-NiCad misses because their similarity is below
NiCad’s recommended similarity threshold. This detection information can be used to improve NiCad’s
detection performance for large-scale inter-project clone detection. D-NiCad has strong recall and precision
for Type-1, Type-2, and Strongly Type-3 clones. Ideally, future development will lower its recommended similarity threshold into the Moderately Type-3 clone range while maintaining its superb precision [110].
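To see where recall drops relative to a similarity threshold, recall can be computed per similarity range. The sketch below assumes each true clone pair carries a similarity percentage (0 to 100) and uses ranges of the form [lo, lo + width); the data layout and names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def recall_by_similarity(similarity: Dict[Pair, float],
                         detected: Set[Pair],
                         width: int = 10) -> Dict[Tuple[int, int], float]:
    """Recall of the true clone pairs in each range [lo, lo + width).

    A similarity of exactly 100 is counted in the top range.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for p, sim in similarity.items():
        lo = min(int(sim // width) * width, 100 - width)
        totals[lo] += 1
        if p in detected:
            hits[lo] += 1
    return {(lo, lo + width): hits[lo] / totals[lo] for lo in sorted(totals)}


# Toy data: true clone pairs with their similarity in percent.
true_sim = {("a", "b"): 95, ("a", "c"): 72, ("b", "c"): 75, ("d", "e"): 55}
found = {("a", "b"), ("a", "c")}
print(recall_by_similarity(true_sim, found))
# {(50, 60): 0.0, (70, 80): 0.5, (90, 100): 1.0}
```

A sharp fall in the ranges just below a tool's configured threshold is the pattern described above: true clones exist there, but the tool is not configured to report them.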
These results demonstrate the need for our large-scale benchmark. Classical intra-project benchmarks
did not reveal these gaps in NiCad’s detection [128]. This may be because these clones have properties that are
specific to inter-project cloning, or because there are edge-case gaps in NiCad’s detection abilities that are
not revealed by the limited number and variety of intra-project clones in a handful of subject systems [13].
A standard, detector-agnostic, large-scale benchmark is needed to properly evaluate clone detection
techniques.