Different program representations and their applications

Graphical representation can help researchers in all research fields and can depict the program features better than the textual source code. In this article, we want to review different graphs that are extracted from source code and represent programs.
A graph consists of some nodes and some edges that connect nodes to each other, these edges show different relationships between nodes.
In general, we can split graphs into two groups [1], flow graph, and dependence graph. in the following, we will review different models of them and at the end, we can see which is the most useful one.

Abstract syntax tree

The first graph that can represent the program is the Abstract syntax tree or AST.
AST is a representation of the syntax of program source code but not in detail, more details are in Concrete syntax tree( Parse tree ). As shown in the example below, its nodes and edges show the structure of the code.
Although it doesn’t have all detailed information about the source code, it resolves ambiguity form code.

AST is the result of the syntax analysis phase of a compiler and is used in program analysis. In order to transforme programs to each other, AST is generated and used.
This is the example code to show the result of different graphs on it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include<stdio.h>
int main()
{
int num,copy_of_num,sum=0,rem;


printf("\nEnter a number:");
scanf("%d",&num);


copy_of_num = num;


while (num != 0)
{
rem = num % 10;
sum = sum + (rem*rem*rem);
num = num / 10;
}

if(copy_of_num == sum)
printf("\n%d is an Armstrong Number",copy_of_num);
else
printf("\n%d is not an Armstrong Number",copy_of_num);
return(0);
}

For generating AST, you can use Clang and it generates the AST with this command :

1
clang -Xclang -ast-dump -fsyntax-only test.cc

This file is the result of above command; download it and see it with this command, colorfully:

1
echo -ne $(cat output.txt | sed  's/$/\\n/' | sed 's/ /\\a /g')

This AST Graph is related to the above sample code:

AST


Call graph and Control flow graph

The Call Graph shows the relationship between subroutines and functions in a program. its nodes show different procedures, and edges show which function calls the other one. Based on the call graph we can find some lines of code that are never called [1] and analyze the general flow of data in the program. Call graphs can also be used to detect code injection attacks or anomalies of program execution. So the call graph gives an inter-procedural view of a program.

The control flow graph or CFG provides more details about each path, it is an intra-procedural view of a subroutine. Each node represents a basic block of code and each edge shows a control flow transfer. A basic block is a set of consecutive statements. Compared with AST, CFGs have less redundancy but from AST we can get the source code easier.

This is the CFG corresponding to our example code, which is generated using clang; for more detail about clang read this blog post:
CFG

Each basic block in CFG contains some statements, we can split them in each node and this will be the result for CFG above. This CFG can show dependencies better that basic block.

CFG


Program dependence graph

By aggregating the data dependence graph(DDG) and the control dependence graph(CDG), we will have a program dependence graph(PDG). Nodes are basic blocks and edges are control and data dependence edges. In CDG, edges represent direct control dependence. In the Data dependence graph, if a variable v which is declared in statement X, is used in statement Y and there is a path from X to Y in the program that doesn’t define v, there will be an edge [1] from X to Y.

For more detail about PDG and a method to build it, see this lecture.

Now, This is the PDG for the previous algorithm:

red edges are control dependence and green edges are data dependence.

PDG

Recent conferences and graph representation

Top conferences can show the usage of the above graph representation in different research areas. To this end, I review related articles in these recent conferences: USENIX 2020 & 2019, NDSS 2020 & 2019, S&P 2020 & 2019, ICSE 2020 & 2019 and CSS 2019. Here is a bar chart to show the distribution of graph usage in their articles that are based on C/C++ programs. As it shows, CFG is the most used graph because although the graph is not very large, it has useful information about code and can illustrate the code well.

plot

and these are their titles:

AST

  1. MVP: Detecting Vulnerabilities using Patch-Enhanced Vulnerability Signatures USENIX 2020
  2. Automating Patching of Vulnerable Open-Source Software Versions in Application Binaries NDSS 2019
  3. Impact Analysis of Cross-Project Bugs on Software Ecosystems ICSE 2020
  4. A Novel Neural Source Code Representation based on Abstract Syntax Tree ICSE 2019
  5. Superion: Grammar-Aware Greybox Fuzzing ICSE 2019

CFG

  1. AURORA: Statistical Crash Analysis for Automated Root Cause Explanation USENIX 2020
  2. FuzzGen: Automatic Fuzzer Generation USENIX 2020
  3. ParmeSan: Sanitizer-guided Greybox Fuzzing USENIX 2020
  4. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface USENIX 2020
  5. FuzzGuard: Filtering out Unreachable Inputs in Directed Grey-box Fuzzing through Deep Learning USENIX 2020
  6. MUZZ: Thread-aware Grey-box Fuzzing for Effective Bug Hunting in Multithreaded Programs USENIX 2020
  7. GREYONE: Data Flow Sensitive Fuzzing USENIX 2020
  8. GRIMOIRE: Synthesizing Structure while Fuzzing USENIX 2019
  9. DEEPBINDIFF: Learning Program-Wide Code Representations for Binary Diffing NDSS 2020
  10. Not All Coverage Measurements Are Equal: Fuzzing by Coverage Accounting for Input Prioritization NDSS 2020
  11. Send Hardest Problems My Way: Probabilistic Path Prioritization for Hybrid Fuzzing NDSS 2019
  12. Matryoshka: Fuzzing Deeply Nested Branches CSS 2019
  13. PANGOLIN: Incremental Hybrid Fuzzing with Polyhedral Path Abstraction S&P 2020
  14. SAVIOR: Towards Bug-Driven Hybrid Testing S&P 2020
  15. Full-Speed Fuzzing: Reducing Fuzzing Overhead through Coverage-Guided Tracing S&P 2019
  16. Iodine: Fast Dynamic Taint Tracking Using Rollback-free Optimistic Hybrid Analysis S&P 2019
  17. A Large-Scale Empirical Study on Vulnerability Distribution within Projects and the Lessons Learned ICSE 2020

PDG

  1. MVP: Detecting Vulnerabilities using Patch-Enhanced Vulnerability Signatures USENIX 2020
  2. FuzzGen: Automatic Fuzzer Generation USENIX 2020
  3. REDQUEEN: Fuzzing with Input-to-State Correspondence NDSS 2019
  4. Practical Fault Detection in Puppet Programs ICSE 2020

Call graph

  1. KOOBE: Towards Facilitating Exploit Generation of Kernel Out-Of-Bounds Write Vulnerabilities USENIX 2020
  2. Temporal System Call Specialization for Attack Surface Reduction USENIX 2020
  3. FuzzGuard: Filtering out Unreachable Inputs in Directed Grey-box Fuzzing through Deep Learning USENIX 2020
  4. MUZZ: Thread-aware Grey-box Fuzzing for Effective Bug Hunting in Multithreaded Programs USENIX 2020
  5. Precisely Characterizing Security Impact in a Flood of Patches via Symbolic Rule Comparison NDSS 2020
  6. Iodine: Fast Dynamic Taint Tracking Using Rollback-free Optimistic Hybrid Analysis S&P 2019
  7. A Large-Scale Empirical Study on Vulnerability Distribution within Projects and the Lessons Learned ICSE 2020
  8. Practical Fault Detection in Puppet Programs ICSE 2020

Code property graph

  1. LEOPARD: Identifying Vulnerable Code for Vulnerability Assessment through Program MetricsTechnical Track ICSE 2019

Use-flow graph,value-flow graph

  1. SMOKE: Scalable Path-Sensitive Memory Leak Detection for Millions of Lines of Code ICSE 2019

References

  1. Arora, V., Bhatia, R.K. and Singh, M., 2012. Evaluation of flow graph and dependence graphs for program representation. International Journal of Computer Applications, 56(14).

Post by: Maryam Ebrahimzadeh