An excursion in evaluation of binary analysis tools

As outlined in a previous blog post, there are multiple tools for binary and source analysis. Each is crafted with slightly different motivations and considerations by its developers, so evaluating them against each other is neither trivial nor well-defined.

Still, if we want a meaningful comparative evaluation, we need to identify a feature supported by all the tools under consideration. Generating a callgraph is such a baseline feature, and the idea is simple: the tool analyzes the code (be it binary or source) and identifies the functions, and whenever a function foo can call bar, it adds an edge foo -> bar to the callgraph.

The test program I used simply defines a bunch of functions which call each other. The main is empty, but that suffices: all that matters is the callgraph, i.e. how these functions call each other. To give that some structure, I generated hypercubes with them: for an N-hypercube, for i from 0 to 2^N - 1, name function i as "f" followed by the N-bit binary representation of i, so for N = 2 we have f00, f01, f10, f11. Now treat each binary representation as coordinates in an N-dimensional space; each function then corresponds to a vertex of a hypercube, which for 2 dimensions is a regular square and for 3 dimensions an ordinary cube. How are the vertices connected? Two vertices are connected if their Hamming distance is 1 (i.e. they differ in only one coordinate), and each edge is realized as a function call.
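As an illustration (this is a hand-written sketch, not the actual output of the generator linked at the end of the post), a 2-cube program could look like this, with each call going from the smaller label to the larger one:

/* Hypothetical sketch of a generated 2-cube test program: fi calls fj
 * when the two labels differ in exactly one bit. */
void f01(void);
void f10(void);
void f11(void);

void f00(void) { f01(); f10(); }   /* 00 -> 01, 00 -> 10 */
void f01(void) { f11(); }          /* 01 -> 11 */
void f10(void) { f11(); }          /* 10 -> 11 */
void f11(void) { }                 /* no neighbour with a larger label */

int main(void) { return 0; }       /* empty main: only the callgraph matters */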


FIG1. 10-dimensional hypercube graph, rendered using GraphStream



Hypercubes admit nice symmetries and are aesthetically pleasing, but there is a downside: for our test we want to cover programs of different sizes, and we would like to dial the size up or down at will. With hypercubes we have no such gauge. Note that "size" here means the number of edges, and an N-cube has N * 2^(N-1) of them, so we can only get programs of sizes 1, 4, 12, 32, 80, ... Not only can we not make programs of arbitrary size, the possible sizes are not even uniformly spaced, and the gap between consecutive sizes diverges as N increases. The title image is a glimpse of one of the callgraphs generated by bap. It is very easy with bap:

$ bap executable -dcfg > output.dot

Due to the symmetry of the hypercubes, it is very easy to test the resulting callgraphs: if the number of edges is correct, and every edge connects functions at Hamming distance 1, the callgraph is correct. But to have a more tunable measure for performance, I generated programs using another model: an (n, p) program has n functions, and each function calls any other function with probability p, so the expected number of edges is roughly n^2 * p. Now we have two parameters with which to tune the size.
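A minimal sketch of the hypercube check, assuming the edge endpoints have already been parsed from names like f0110 into integer labels (this is not the repository's code, and it assumes each edge is reported once):

#include <stddef.h>

/* Hamming distance 1 <=> a XOR b has exactly one bit set (is a power of two). */
static int is_valid_edge(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    return x != 0 && (x & (x - 1)) == 0;
}

/* Check a list of edges (pairs of integer labels) against an N-cube, N >= 1. */
static int check_callgraph(int n, const unsigned (*edges)[2], size_t n_edges)
{
    size_t expected = (size_t)n << (n - 1);   /* N * 2^(N-1) edges */
    if (n_edges != expected)
        return 0;
    for (size_t i = 0; i < n_edges; i++)
        if (!is_valid_edge(edges[i][0], edges[i][1]))
            return 0;
    return 1;
}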

One measure of performance would then be the time it takes a tool to produce the callgraph. But time is noisy: we can average over a number of trials to get a more stable figure, but it still differs from system to system and depends on the state of the machine at runtime. Valgrind (via its callgrind tool) instead lets you count executed machine instructions: the instruction count does not depend on the runtime state, and much less on the system, since for the same architecture the same compiler generates the same machine code.
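For example, the bap invocation from above can be wrapped in callgrind and the per-run instruction counts read from its report (a sketch; the actual output file name ends with the process id):

$ valgrind --tool=callgrind bap executable -dcfg > output.dot
$ callgrind_annotate callgrind.out.<pid>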


FIG2. Radare2 takes a linear number of instructions (and hence roughly linear time) to find the callgraph



FIG3. Radare2 vs BAP (yellow), x: number of edges | y: instruction count


For Radare2, as shown in FIG2 (a plot of the number of instructions against the number of edges), the instruction count (which is itself roughly linear in time) grows linearly with the number of edges (each edge is a function call inside the body of another function), so it seems that the underlying implementation of Radare2 is O(N), where N is the number of calls/edges.

When comparing Radare2 with bap (FIG3), we see that even though both are linear, radare2 is much faster, and for bap there is a threshold after which the slope jumps. Also, rather peculiarly, if you feed bap fairly large code (more than about 100 functions) you get false positives: bap detects functions that do not exist. We should only be getting edges like f12 -> f18, but we suddenly begin to get bogus ones too, like sub_4064ab -> f308.

I tried various ansatzes to find the reason behind this (radare2 does not show this artifact), but it seems totally arbitrary: the functions that get messed up are not even at the beginning or end, i.e. out of 200 functions, f121 is messed up while f200 is fine. I also discussed it with Ivan Gotovchits, the main developer of bap, and he said it is just some false positive and might go away if I turned off byteweight, which did not fix it.

The code for the C code generator is available at: https://github.com/S4Lab/Benchmark-Utils

Post by: Iman Hosseini