Informe de huella de carbono (Subsecretaría del medio ambiente)

3. MARCO REFERENCIAL

3.5. Proyectos afines

3.5.2. Informe de huella de carbono (Subsecretaría del medio ambiente)

Setup

Simian was chosen for this experiment as it was possible to obtain its gold standard for IJaDataset. This

allowed us to compare our framework’s performance with Simian against its native performance. For evalu-

ation with the shuffling framework, a subset size of 50,000 files was chosen (58 subsets per round). Simian’s fast execution allowed us to execute 30 rounds of the shuffling framework. Simian reports clones as clone

classes, which was exploited when analyzing the results. We converted the clone classes into clone pairs to

measure total recall.

Subset generation and round detection took approximately 8-12 and 4-10 hours per round, respectively, on Westgrid hardware. Based on our later experiments, we estimate that subset generation and construction

would take approximately 1.25 hours, and detection approximately 1.25 hours per round, on our local machine

with a solid state drive. The bottleneck on the Westgrid systems was the IO performance. Specifically, copying

many small files from the dataset into the subsets was much slower on a traditional hard disk drive.

Table 11.2: Summary of the IJaDataset Clone Detection Experiments

Tool Hardware Subset Size (# files) # Sets # Rounds

Deckard 24GB 10K 289 10

NiCad 12GB 10K 289 20

Analysis

Since Simian’s gold standard is extremely large (300 billion clone pairs) total recall was estimated using the

heuristic, which is shown in Figure 11.5. After 30 rounds of the framework, 70% of the cloned fragments in

Simian’s gold standard were detected. According to the heuristic study, Section 11.3.4, total recall should

be less than the heuristic, but with faster growth. Specifically for Simian, the study showed that recovered

recall quickly approached heuristic recall when total recall was within 70-90%. From Simian’s heuristic recall

trend for IJaDataset, we estimate it would reach 80% in approximately 10-20 additional rounds. We therefore conclude that Simian has achieved an acceptable recall for cloned fragments within 30 rounds. The heuristic

recall suggests that within 10-20 additional rounds a transitive clone recovery technique could achieve an

equivalent total recall of clone pairs.

While the heuristic is a worthy approximation of total recall, it is still desirable to directly measure total

recall, which necessitated a reduction in Simian’s output. Investigation into the characteristics of Simian’s

gold standard found that 99.99% of Simian’s clones came from clone classes greater than 100 fragments in

size. Manual investigation into these clone classes revealed that Simian suffered from what we term the

“sliding effect”. It reported some extremely large clone classes containing the same fragment(s) repeated

numerous times with small offsets in line numbers. These clone classes generate an extreme number of self and overlapping clones and represent a significant threat to Simian’s precision. We therefore reduced Simian’s

output size by trimming clone classes over a certain maximum size. Its gold standard was likewise trimmed.

The remaining clone classes were converted to clone pairs to measure total recall. This post-processing of

the framework’s output for Simian was done solely to aid the evaluation of total recall on our hardware in a

reasonable time-frame, and is not an expected post-processing step for users of the framework.

Figure 11.6 shows our framework’s total recall with Simian for various maximum clone class sizes up

to 100 fragments (limitation of our hardware). The legend of this figure specifies the maximum class size

considered with the gold standard’s size in parenthesis. Total recall was higher and increased faster for lower

maximum clone class size. This suggests that the framework works best for specialized clone detection (i.e., focusing on detecting interesting/unique clones rather than all clones). This is due to larger classes requiring

more rounds (on average) to be completely found as they contain more clone pairs that need to be shuffled

together.

For the smaller class sizes a respectable total recall was achievable within 30 rounds (2: 52%, 5: 44%,

10: 40%). This total recall may be acceptable in cases where only a sample of the clones is required. For

example, when building an inter-project clone corpus using many tools, 50% of a tools’ native recall is likely

sufficient for the corpus to benefit from the tool’s unique detection characteristics. Consulting multiple tools

may make up for individual tools’ diminished recall.

While this total recall is low, in each case it is increases nearly linearly, with very little decay in slope. Additional rounds could bring these to an acceptable level. As can be seen, a 7-10% increase in total recall

In our preliminary studies, we found that the framework performed the worst with Simian. Therefore we

expect the other tools to achieve a higher total recall than Simian.

Figure 11.7 shows heuristic recall for the same trimmed output. As can be seen, the shuffling framework

is finding the cloned fragments very fast, with 52-62% heuristic recall after only 30 rounds. Heuristic recall

increases faster for larger maximum clone class size, meaning that the fragments in large clone classes are more easily found. This is expected as fragments in large clone classes have a higher chance of being shuffled

into a partition with another fragment from the clone class. This suggests that a transitive recovery method

may work especially well for the clones of large clone classes. This is particularly beneficial as it was for the

clone pairs in larger classes that the framework had a slower increase in total recall (Figure 11.6).

In document REVISIÓN ENERGÉTICA Y PLANTEAMIENTO DE MEJORAS EN EFICIENCIA ENERGÉTICA PARA LA ALCALDÍA DE BUCARAMANGA (página 32-0)