An excursion in evaluation of binary analysis tools

As outlined in a previous blog post, there are multiple tools for binary and source analysis. They are crafted with slightly different motivations and considerations by their respective developers and thus it is not trivial or well-defined to generally evaluate them against each other.

If we really want a comparative evaluation though, to have a meaningful one we need to identify a feature supported by all the tools under consideration, and generating callgraph is a baseline feature all tools support, and the idea is simple: the tool would analyze the code (be it binary or source code) and identify the functions and if a function foo can call bar, make an edge foo->bar in the callgraph.

The test program which I used to test, was just defining a bunch of functions which call each other. The main was empty, but it suffices as all it matters is the callgraph which is about how these defined functions call each other. And to then make that structured I generated hybercubes with them: for an N-hypercube, for 0 to 2^N - 1 name the functions as: name i = “f”+[binary representation of i] so for i = 2 we have f00, f01, f10, f11. Now treat each binary representation as coordinates in an N-dimensional space, each function would correspond to a vertex of a hypercube, for 2d its a regular square, and for 3d a (normal)cube. How are the vertices connected? Each two points are connected if their Hamming distance is 1 (i.e. they differ in only one coordinate).

FIG1. 10 dimensional graph, rendered using GraphStream

Hypercubes admit nice symmetries and are aesthetically pleasing but there is a downside: for our test we want to cover programs of different sizes, and we would fancy a way to cover different sizes as we want. With hypercubes we do not have a gauge, notice that ‘size’ here, would be the number of edges, and for a N-cube it is N^3, so we can get only programs of sizes 1, 8, 27, 64, … and so not only we are not making programs of arbitrary size but also the sizes aren’t even uniform, and the distance between possible program sizes diverge as N increases. The title image is a glimpse of one of the callgraphs generated by bap. It is very easy with bap:

$bap executable -dcfg >

Due to the symmetry of the hypercubes, it is very efficient to test the resulting callgraphs: if we have the correct number of edges, and each edge is between functions at Hamming distance 1, the callgraph is correct. But to have a better tunable measure for performance I generated programs using another model: a (n,p) program is a program with n functions, and for each function there is a probability p that it calls any other function => the expected number of edges is NN\P. Now we have two parameters to tune the size.

One measure of performance then would be the time it takes for a tool to make the callgraph. But time, is variable. We can take a number of trials and average over them to get a more stable measure, but still from system to system it would differ and also it depends at the state of the system at runtime. Valgrind is a tool which lets you profile machine instructions: the number of instructions does not depend on runtime, and much less on the system as for the same architecture, the same compiler would generate the same machine code.

FIG2. Radare2 takes linear instruction (/time) to find callgraph

FIG3. Radare2 vs BAP (Yellow), x: edge size | y: instuction count

For Radare2, as shown in FIG2, which is a plot of [y: number of edges - x: number of instructions] the number instructions (which roughly is itself, linearly related to time) increases linearly with the number of edges (each edge, is a function call, inside the body of another function) so it seems that the underlying implementation of Radare2 is O(N) -where N is the number of calls/edges-.

When comparing Radare2 with bap (FIG3), we see that even though they are both linear, radare2 is way faster; and for bap there is a threshold after which there is a bump in the slope. Also, rather peculiarly, if you feed bap very large (larger than 100 function) code you get false positives: bap detects functions which are not existent. We should be getting edges like: “ f12 -> f18 “ but we suddenly begin to get bogus ones too like: “ sub_4064ab” -> f308 “.

I tried various ansatzes to find out what is the reason behind this, (radare2 does not show this artifact) but it seems totally arbitrary: even the functions which are messed up, are not at the end/beginning i.e. out of 200 functions, f121 is messed up, f200 is fine. I discussed it with Ivan Gotovchits, the main developer of bap as well, and he said it’s just some false positive somehow, and maybe it’ll be fixed if I turn off byteweight, which didn’t fix it.

The code for the C code generator is available at:

Post by: Iman Hosseini

Coffee in S4Lab

Photos by Parmida Vahdatnia

It is now roughly 8 months that I am part of the S4Lab and during this time I have learned about various qualities here that makes S4 great, and recently there has been a new addition which specifically I am very fond of: We’ve got a full stack coffee framework.

Grinding the beans in-house means that you get the aroma of coffee welcoming you to the lab and a fresh cup of coffee is the perfect beginning for a day of work. And if you are thinking decaf…

Honoré de Balzac, famous novelist and playwright, on coffee: “This coffee falls into your stomach, and straightway there is a general commotion. Ideas begin to move like the battalions of the Grand Army of the battlefield, and the battle takes place. Things remembered arrive at full gallop, ensuing to the wind. The light cavalry of comparisons deliver a magnificent deploying charge, the artillery of logic hurry up with their train and ammunition, the shafts of which start up like sharpshooters. Similes arise, the paper is covered with ink; for the struggle commences and is concluded with torrents of black water, just as a battle with powder.” Only for us, instead of similes it is data structures arising, and the instead of paper we have emacs. [Editorial Note: vim would have been/is a much better selection, than emacs]

Post by: Iman Hosseini

Automated Binary Analysis (a short intro)

Automated program analysis is the practice of analyzing computer programs using [other] programs, as opposed to code audits by a developer. The objective can be to find security vulnerabilities, to optimize, or to reverse engineer an obfuscated program and gain knowledge about its flow. The are benefits to this approach. It allows scalability to cater to the huge number of programs at large which would not be possible to audit “manually”. Code auditing is a skill that is not gained easily, so a developer good for the job doesn’t come cheap. Also, online and live monitoring of code might be a necessity.
Automated program analysis can be implemented at two levels: static analysis is concerned with analysis of code at compile time, without actually executing the program. And dynamic analysis is done with executing the program and considering the runtime as well. Also, the input program can be source code of the program, or the compiled binary. Each approach has its benefits and disadvantages. Inferring the callgraph of a program, tracking api-calls, and recovering control flow graph of the program can be among the tasks of analysis platforms.
A tool used for runtime analysis of programs is symbolic execution, which runs through the program assuming symbolic (rather than concrete) values for the variables to gain further knowledge about the flow of a program.

A concrete example

Trail of Bits sketches a way to detect heartbleed. It is based on a characterization of the vulnerability via calls to ntoh and hton which can taint a variable, then calling memcopy and passing a tainted value to it, without bound-checking.
To this end, they have used Clang Analyzer. As part of LLVM, there is a tool to do static analysis of programs: scan-bulid. To perform a custom analysis, everything goes into a C++ class and registers as a plugin. Here’s how the code looks:

void NetworkTaintChecker::checkPostCall(const CallEvent &Call,
CheckerContext &C) const {
const IdentifierInfo *ID = Call.getCalleeIdentifier();
if(ID == NULL) {

if(ID->getName() == "ntohl" || ID->getName() == "ntohs") {
ProgramStateRef State = C.getState();
SymbolRef Sym = Call.getReturnValue().getAsSymbol();

if(Sym) {
ProgramStateRef newState = State->addTaint(Sym);

Similarly, to check for function calls, we could have used BAP). In its original and preferred OCaml it looks like:

module CG = Graphs.Callgraph
module CFG = Graphs.Tid

This gives access to callgraph and cfg. Now imagine we wanted a function that will take the call graph cg, the target function, and the subroutine term sub, and return a sequence of calls that has a destination function, that reaches target in the call graph. BAP provides for us nifty methods.


let callsites cg target sub =
Term.enum blk_t sub |>
Seq.concat_map ~f:(fun blk ->
Term.enum jmp_t blk |> Seq.filter_map ~f:(fun j ->
match Jmp.kind j with
| Goto _ | Ret _ | Int (_,_) -> None
| Call dst -> match dst with
| Direct tid when reaches cg tid target ->
Some (Term.tid blk)
| _ -> None))

In a similar manner, every custom analysis is developed in an OCaml (or other bindings) file and then registered as a plugin to BAP. (instructions here)

So many tools!

There are also other tools built with different motivations in mind, among them angr , ROSE, radare2.
Radare2 is more suited for ctf, ROSE used to be source analysis and thus supports source analysis as well as binary and angr was used to make mechanical phish which won 3rd place at Darpa’s Cyber Grand Challenge. ROSE is developed in C++, and is fast and great for research, but when you want to work on small programs (like a ctf challenge) you’d rather take other options, like bap which is based on OCaml and you can use baptop which is a REPL environment that you can interactively run commands and play around with the binary.

The most recent work published using angr, is (USENIX Security ‘18) HeapHopper: Bringing Bounded Model Checking to Heap Implementation Security which leverages the tool to implement another tool called HeapHopper which inspects heap implementations for vulnerablities. The method has been proven to be successful on a ptmalloc vulnerability, and uses angr as it’s symbolic execution engine checking compiled programs, after attempted exploit, for security violations.

Post by: Iman Hosseini

S4Lab Walkthrough

What we do at S4Lab

The short version of our work is that we like the kind of magic that concerns software. That means while security is not mathematically provable and the world is full of smart programmers with cute programs, we want to show the weakness in software and systems (how cute a program can be?) and find ways to secure them.

Software Security

Researchers have introduced lots of program analysis methods over the last four decades. Initially, the primary goal of these methods was compiler optimization. With the emergence of malware, new dynamic approaches have been introduced for the sake of finding the malicious behavior that only appears at runtime.

In recent years with rising the cost that software bugs may cause, researchers used the available automatic program analysis methods to test programs and answer the question “Does a program have any bugs?”. Altough none of the methods answer this question entirely, the goal of the automatic program testing is identifying as many real bugs as possible with minimal user augmentation and measure/show their risk.
Here at S4lab, we want to answer the following questions:

  • What kind of magic reveals software bugs as much as possible?
  • Can we decide which part of the programs are more vulnerable?
  • Can we improve the magic to prove vulnerabilities?
  • How can we measure the risk each bug has?
    Traditionally the program analysis methods are categorized as follows:


If you want to get familiar with these approaches, we suggest the following references:

System Security

Although using secure software is essential, it is always possible that interactions between the different layer of the computing stack remain vulnerable. Therefore developing tools and methods to detect potential flaws in communication channels, i.e., kernel/user, kernel/virtualization layer, kernel/hardware layer, etc. is important.
To get familiar with this area we suggest reviewing the following papers:

Find what is wrong

Find what is wrong? (artwork by @wackyposters)

You see, but you do not observe. The distinction is clear. [Sherlock Holmes]

While our primary focus is software and system security, any new security-related problems may interest us, especially finding new and innovative ways to measure security or to show that there are flaws in software and system that we use in our everyday life. You should be curious enough to sense that there is something wrong about them, indeed you should observe carefully.
To get familiar with these topics we suggest the following references:

What you should do to join S4Lab

If we knew what it was we were doing, it would not be called research, would it? [Albert Einstein]

You should understand that security related fields are not as straightforward as shown in Hollywood movies. In most cases, we do not know what the next step is, and we are not sure if we can find a way to prove that security is broken. But that doesn’t mean we should stop trying.
You should be patient, ambitious, and determined enough to work at S4Lab,
If you find S4Lab suitable for your work, please follow the steps below and you will see how deep the rabbit hole goes.

Let us know what interests you more

Summarize 2 of papers we have suggested reviewing above, choose the ones that interest you more. The writing should be your own words and try to avoid using the paper sentences.
You can add any related data, possible open problems that you may find, or only focus on the papers.
Each summary should not be longer than 2 pages.

Prove that you are a hacker (i.e. problem solver)

Most hackers are young because young people tend to be adaptable. As long as you remain adaptable, you can always be a good hacker. [Emmanuel Goldstein]

Being a hacker means you do not leave a problem unsolved because it is written in Haskell.
Before joining S4Lab, you need to prove that you are a hacker, to do so, please answer the following questions:

  • Many websites expose their “.git” files, please show how it could be dangerous.
  • Imagine that we have 2**48 text files. Explain how can we find which files are the same.
  • Write a hello-world C program and explain how we can dump its binary code with radare2.

Please remember that you do not need to write very much, and try to solve each task as automatically as you can.

Stay tuned

Upload the requested data, on your public GitHub account and send us its URL.

You should love Git!

Post by: Solmaz Salimi

Top NDSS'18 Papers

It was some time ago that I read a post somewhere (really can’t remember where) with a simple question. What are the top papers which one should read from conferences every year in order to keep up to date with the latest and most important developments. This was an interesting and yet challenging question. Nevertheless, we (i.e. our research team) decided to review papers from the top4 security conferences in our regular weekly meetings and through a process select the top 5 papers from each conference.

How did we go about it?
This was our first try on the issue, so not much of a scientific approach! We started studying papers as they were presented from the first session and on. In every meeting we selected the top papers from the sessions reviewed, and then argued if they should be kept in a top 10 list, and if the top 10 list was full, if any papers should be removed from the list to make room for the new paper selected. At the end of the process, we were left with 10 papers for the whole conference, from which we argued and selected the top 5.

What is considered “TOP”?
“Top” is a subjective matter, and to make matters worse security is a very wide field with very different topics which require very different expertise. Over the sessions, it became clear that there were biases. For example if a topic was new for everyone, they would argue that it should be included in the “top” list, where as there may have been previous research on the topic that the students where unaware of. There was also bias on topic which everyone was more familiar with. In such cases the argument was that a major problem was solved(i.e. although the scope of the work was very narrow). So again, top is a subjective matter. But I believe that as time goes on, and we keep doing this for a few more conferences, this bias would be less of an issue as everyone would have seen more papers to compare with.

Was the process useful?
Most definitely. Although we did not cover every detail in the papers we read, but we did study current day problems which is are being addressed by researchers and the approach taken to tackle the problem. In fact as papers were reviewed from different sessions, students were exposed to different problems and solutions; Where the subject was very different from what the are researching on. I find this to be extremely important. Students usually get too focused on a specific topic, and they lose sight as to the bigger picture. In other words, such exercise is great in providing “breadth” to the student, in addition to the “depth” obtained as part of research on a specific topic.

Here are the Top5 selected papers from NDSS’18:

1- Superset Disassembly: Statically Rewriting x86 Binaries Without Heuristics. Erick Bauman (University of Texas at Dallas), Zhiqiang Lin (University of Texas at Dallas), and Kevin Hamlen (University of Texas at Dallas).
2- Face Flashing: a Secure Liveness Detection Protocol based on Light Reflections. Di Tang (Chinese University of Hong Kong), Zhe Zhou (Fudan University), Yinqian Zhang (Ohio State University), and Kehuan Zhang (Chinese University of Hong Kong).
3- Cloud Strife: Mitigating the Security Risks of Domain-Validated Certificates. Kevin Borgolte (UC Santa Barbara), Tobias Fiebig (TU Delft), Shuang Hao (UT Dallas), Christopher Kruegel (UC Santa Barbara), and Giovanni Vigna (UC Santa Barbara).
4- Device Pairing at the Touch of an Electrode. Marc Roeschlin (University of Oxford), Ivan Martinovic (University of Oxford), and Kasper B. Rasmussen (University of Oxford).
5- Settling Payments Fast and Private: Efficient Decentralized Routing for Path-Based Transactions. Stefanie Roos (University of Waterloo), Pedro Moreno-Sanchez (Purdue University), Aniket Kate (Purdue University), and Ian Goldberg (University of Waterloo).

Last but not least, we are always happy if we could get more people in the room arguing on the papers. So if you like to participate in the process for the next conference review please drop me an email.

Post by: Mehdi Kharrazi