
The final fuzzy matching procedure is applied to all results remaining after the branch-based detection phase. These form a relatively small set of functions that have already been assessed as having a high degree of similarity to the provided target function. Inspired by [122], each candidate function is compared to the target function and ranked by its similarity to it. The results of this phase are ordered pairs of the highest-ranked matches.

4.4.3.1 Longest Path Generation

The function matching approach can be decomposed into three steps, namely (i) longest path generation, (ii) path exploration, and (iii) neighborhood exploration. For a given target function CFG, all of its loops are first unrolled, and the Depth First Search (DFS) algorithm is then used to identify its longest path. A path is chosen as the basis of comparison because it represents the functionality created by the ordered execution of all the basic blocks it contains, so two equivalent or highly similar execution paths are a good basis for the comparison process. The longer the path, the greater the chance of acquiring matching pairs of basic blocks. Choosing another default path, such as the shortest, would give less accurate results, because the shortest paths are often used for error checks that bypass most of the function's execution [184].
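
The longest-path step can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes loops have already been unrolled, so that the CFG is acyclic, and it represents the CFG as a plain dictionary from block identifiers to successor lists. The names `longest_path`, `cfg` and the block labels are hypothetical.

```python
def longest_path(cfg, entry):
    """Return the longest path (as a list of block ids) starting at `entry`.

    Assumption: `cfg` maps each basic block id to a list of successor ids
    and is acyclic, i.e. loops were unrolled beforehand.
    """
    memo = {}  # block id -> longest path starting at that block

    def dfs(node):
        if node in memo:
            return memo[node]
        best = []
        for succ in cfg.get(node, []):
            candidate = dfs(succ)
            if len(candidate) > len(best):
                best = candidate
        memo[node] = [node] + best
        return memo[node]

    return dfs(entry)


# Toy CFG: the short branch models an error check that skips the body.
cfg = {
    "entry": ["check", "error"],
    "check": ["body"],
    "body": ["cleanup"],
    "cleanup": ["exit"],
    "error": ["exit"],
    "exit": [],
}
path = longest_path(cfg, "entry")
print(path)
```

The memoized DFS visits each edge once. For very large CFGs an iterative traversal may be preferable to avoid Python's recursion limit.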

4.4.3.2 Path Exploration

The longest target function path is paired with its best match among all the paths the reference function contains. This can be achieved by building on the work of [122, 185], which combines the Breadth First Search (BFS) algorithm, to identify all paths, with the LCS algorithm [24], to identify the best one. Within the LCS algorithm, basic blocks are compared using a similarity function that operates on the instruction contents of each block. The path from the reference function that contains the blocks most similar to those of the longest target path is therefore considered the best match. This result is determined by backtracking through the memoization table that the LCS algorithm produces. All resulting basic block matches are inserted into a priority queue.
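
As an illustration of the LCS step with backtracking, the sketch below aligns two paths of basic blocks. Several details are assumptions made for the example and are not taken from the source: blocks are tuples of mnemonics, the block similarity is a plain set-based Jaccard index (`jaccard`), two blocks count as a match when their similarity reaches a hypothetical `threshold`, and the priority queue is a `heapq` heap keyed on negated similarity.

```python
import heapq


def jaccard(block_a, block_b):
    """Stand-in block similarity: Jaccard index over mnemonic sets."""
    a, b = set(block_a), set(block_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def lcs_best_matches(target_path, ref_path, similarity=jaccard, threshold=0.5):
    """Similarity-weighted LCS of two block paths.

    Returns (score, queue), where queue is a heap of
    (-similarity, target_index, reference_index) tuples.
    """
    n, m = len(target_path), len(ref_path)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]  # memoization table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = similarity(target_path[i - 1], ref_path[j - 1])
            best = max(table[i - 1][j], table[i][j - 1])
            if s >= threshold:
                best = max(best, table[i - 1][j - 1] + s)
            table[i][j] = best

    # Backtrack through the table to recover the matched block pairs.
    queue = []
    i, j = n, m
    while i > 0 and j > 0:
        s = similarity(target_path[i - 1], ref_path[j - 1])
        if s >= threshold and table[i][j] == table[i - 1][j - 1] + s:
            heapq.heappush(queue, (-s, i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return table[n][m], queue


target = [("push", "mov"), ("cmp", "jne"), ("add", "mov"), ("ret",)]
reference = [("push", "mov"), ("xor",), ("cmp", "jne"),
             ("add", "mov", "mov"), ("ret",)]
score, queue = lcs_best_matches(target, reference)
print(score, sorted((i, j) for _, i, j in queue))
```

Popping from `queue` with `heapq.heappop` yields the most similar block pair first.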

Inspired by [186], the algorithm takes as input the longest identified target function path along with the entire reference function CFG. A dynamically sized memoization table, which can accommodate the unpredictable size of CFGs, stores all comparison results. A vector stores the highest identified similarity scores, and a queue keeps track of all nodes that remain to be explored. The algorithm is initialized with the root node of the provided reference function. Then, for each node of interest, the LCS is computed with respect to the longest path. Finally, if the resulting score is superior to all others, it is preserved and all successor nodes are added to the queue for further investigation.
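
One possible reading of this exploration loop is sketched below. It is a simplified interpretation rather than the algorithm of [186]: block comparison is reduced to a hypothetical boolean predicate `blocks_match`, each reference node extends the LCS row inherited from the node it was reached from, and a node's successors are enqueued only when its score improves on the best score recorded for that node so far. The dictionary `best_score` plays the role of the score vector and `best_row` that of the dynamically sized memoization table.

```python
from collections import deque


def explore_reference(target_path, ref_cfg, ref_root, blocks_match):
    """BFS over the reference CFG, extending one LCS row per visited node.

    Returns a dict mapping each reached reference node to the best LCS
    score of any explored root-to-node path against `target_path`.
    """
    n = len(target_path)
    best_score = {}  # reference node -> highest LCS score seen so far
    best_row = {}    # reference node -> LCS row that achieved that score
    queue = deque([(ref_root, [0] * (n + 1))])
    while queue:
        node, parent_row = queue.popleft()
        row = [0] * (n + 1)
        for j in range(1, n + 1):
            if blocks_match(target_path[j - 1], node):
                row[j] = parent_row[j - 1] + 1
            else:
                row[j] = max(parent_row[j], row[j - 1])
        # Keep the row and explore further only if the score improved.
        if row[n] > best_score.get(node, -1):
            best_score[node] = row[n]
            best_row[node] = row
            for succ in ref_cfg.get(node, []):
                queue.append((succ, row))
    return best_score


# Toy reference CFG; `labels` stands in for the contents of each block.
labels = {0: "A", 1: "X", 2: "B", 3: "C", 4: "D"}
ref_cfg = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
target_path = ["A", "B", "C", "D"]
best = explore_reference(target_path, ref_cfg, 0,
                         lambda t, r: t == labels[r])
print(best)
```

Because scores are integers bounded by the target path length and a node is only re-expanded on a strict improvement, the loop terminates even if the reference CFG contains cycles. The pruning rule is a heuristic.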

4.4.3.3 Neighborhood Exploration

The final step in this procedure applies the Hungarian algorithm to the highest matching pair of paths. This is necessary in order to perform a neighborhood exploration relative to the initial basic block matchings. By leveraging the in and out degrees of the highest matching basic block pairs in the identified paths, it is possible to locate similar neighbors that they possess. This increases the overall accuracy of the initial pairing and produces the final output of the fuzzy comparison process, which is composed of the identified set of target and reference function basic block pairs along with a similarity score for each pair. A final similarity score then has to be computed from this information. The formula is defined in Equation 7, where $f_T$ is a target function with $n_T$ basic blocks, $f_r$ is a reference function with $n_r$ basic blocks, $k$ is the number of matched basic blocks between the target and reference function, and $WJ(S, T)$ returns the similarity score between the two matched basic blocks.

$$\mathrm{similarity}(f_T, f_r) = \frac{2 \times \sum_{i=1}^{k} WJ(S, T)}{n_T + n_r} \tag{7}$$
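
As a small worked example of Equation 7, the sketch below computes the function-level score from a list of per-pair block scores. The function name, its parameters and the sample values are hypothetical. Returning 0.0 when both functions are empty is an assumption made only to avoid a division by zero.

```python
def function_similarity(pair_scores, n_target, n_reference):
    """Equation 7: 2 * (sum of matched-pair scores) / (n_T + n_r)."""
    total_blocks = n_target + n_reference
    if total_blocks == 0:
        return 0.0  # assumption: two empty functions score 0
    return 2.0 * sum(pair_scores) / total_blocks


# Three matched block pairs between a 4-block target and a 5-block reference.
example = function_similarity([1.0, 0.8, 0.6], n_target=4, n_reference=5)
print(round(example, 4))

# Two identical 3-block functions with perfect block matches score 1.0.
identical = function_similarity([1.0, 1.0, 1.0], 3, 3)
```

The factor of 2 normalizes the score so that two functions whose blocks all match perfectly reach 1.0, while unmatched blocks on either side lower the result.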

4.4.3.4 Basic Block Similarity Comparison

The basic block similarity computation proposed by [122] applies the LCS algorithm to the instruction contents of two given basic blocks. Although effective, this approach suffers from two main limitations. The first is instruction reordering, which reduces match accuracy because the LCS algorithm is sensitive to instruction order. The second is instruction substitution: instructions that are syntactically different but semantically equivalent also lower similarity scores. Due to these limitations, we chose a different approach that can mitigate them. The Weighted Jaccard (WJ) similarity [187] has been selected as the means of comparing the instructions of two basic blocks. Weights, in the form of instruction frequencies, are necessary in order to prevent possible false positives. By the nature of the formula, instruction reordering does not affect the final similarity score. Furthermore, instruction substitutions can be handled by grouping them as discussed in Subsection 4.4.2.4. Another advantage is that this comparison method can be implemented to run in linear time. The WJ formula used is defined in Equation 8, where S and T are sets of basic block mnemonic frequencies and n and m are the number of elements in each block.

$$WJ(S, T) = \frac{\sum_{k=1}^{N} \min(S_k \cap T_k)}{\sum_{k=1}^{N} \max(S_k \cup T_k)}, \quad N = \max\{m, n\} \tag{8}$$
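
A minimal sketch of the WJ computation follows. It reads Equation 8 in the usual weighted Jaccard sense, taking for each mnemonic the smaller of its two frequencies in the numerator and the larger in the denominator. That reading is an assumption, as are the function name, the use of `collections.Counter`, iterating over the union of mnemonics instead of an index up to N, and the convention that two empty blocks score 1.0.

```python
from collections import Counter


def weighted_jaccard(block_s, block_t):
    """Weighted Jaccard similarity over mnemonic frequencies."""
    s, t = Counter(block_s), Counter(block_t)
    mnemonics = s.keys() | t.keys()
    # Counter returns 0 for mnemonics missing from one of the blocks.
    numerator = sum(min(s[k], t[k]) for k in mnemonics)
    denominator = sum(max(s[k], t[k]) for k in mnemonics)
    return numerator / denominator if denominator else 1.0


a = ["mov", "mov", "add", "cmp", "jne"]
b = ["add", "mov", "cmp", "mov", "jne"]  # same instructions, reordered
c = ["mov", "add", "cmp", "je"]          # one mov fewer, different jump

print(weighted_jaccard(a, b))  # reordering leaves the score unchanged
print(weighted_jaccard(a, c))
```

Both counters are built in a single pass over each block and the sums run once over the distinct mnemonics, which is consistent with the linear running time mentioned above. Grouping substitutable instructions (Subsection 4.4.2.4) would amount to mapping each mnemonic to its group label before counting.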
