The goal of this homework is to implement a static analyzer that detects potentially vulnerable data flows from source points to sink points (so called taint analysis) in C programs. Students will write the following functions in src/analyzer.ml based on the definitions in the lecture slides:
- the function
extractusingextract_*(e.g.,extract_cedge,extract_def, ..) that extract initial facts. - the functions
derive_*(e.g.,derive_kill,derive_path, ..) that derive inductive facts. - the function
solvethat computes the fixed point.
In short, replace all failwith "Not implemented" with your own implementation.
You assume that input programs are always syntactically valid.
Notice that students are not permitted to change directory structures and types of the functions.
All the functions you implement must have the same types as described in the module signatures.
However, students are allowed to change let to let rec if needed.
This homework assumes that students set up the OCaml environment following the same instructions in the previous homework. After running make under the root directory of this repository, you can run the analyzer with the provided C programs:
# run the analyzer
$ ./analyzer test/example1.cThe command line argument specifies the input file path. Once you complete the implementation, the following output will be printed:
$ ./analyzer test/example1.c
Potential Error @ example1.c:4 (example1.c:5)Each line reports a potentially vulnerable source and sink point at a certain source code location denoted by [file]:[source_line] ([file]:[sink_line]).
The analyzer converts a C source file into a control-flow graph using function of_cfile in src/CFG.ml.
In the graph representation of a program, each node has a unique ID (label) and is associated with
a command (module Command in src/CFG.ml).
Note that the IDs do not necessarily correspond to the line numbers in the C source code.
Here are the descriptions of each command (interchangably, node):
Assign (l, v, e): The commandlassigns an expressioneto variablev.Source (l, v): The commandlgets user input by calling thesourcefunction (declared in test/homework.h) and assign the value to variablev. Then, the variablevhas a tainted value.Sanitizer (l, v, e): The commandlsanitizes the expressioneby calling thesanitizerfunction and assigns the sanitized value to variablev. Then, the value ofvis not tainted anymore.Sink (l, e): The commandlcalls thesinkfunction with an expressioneas an argument. Ifecontains any tainted value, the analyzer reports this function call as a potentially vulnerable point.Branch l: The commandlis the beginning of the branching out control-flows. For simplicity, we ignore the branch condition.Skip l: The commandlis considered to be a no-op. All the other commands except for the above will be represented asSkipsuch asreturn.
Consider the following small C program with an if-then-else statement:
0: x = source();
1: if (cond) {
2: y = 1;
} else {
3: y = sanitize(x);
}
4: sink(y);
5: return 0;The corresponding control-flow graph is as follows:
┌───────┐
│ 0 │
└───┬───┘
┌───┴───┐
│ 1 │
└───┬───┘
┌──────┴──────┐
┌───┴───┐ ┌───┴───┐
│ 2 │ │ 3 │
└───┬───┘ └───┬───┘
└──────┬──────┘
┌───┴───┐
│ 4 │
└───┬───┘
┌───┴───┐
│ 5 │
└───────┘
The program will be translated to the following set of commands:
Source(0, x), Branch(1), Assign(2, y, 1), Sanitizer(3, y, x), Sink(4, y), and Skip(5).
After running the analyzer, the information will be shown in cfg.txt.
Given a program, students will extract basic facts using the extract function.
Function extract will extract the following basic facts:
-
Relation
CEdgedenotes control-flow edges in a program. Control-flow edges specify the execution order. That is, tupleCEdge(a, b)represents the fact thatbis executed right after commanda. In the example, the extracted control-flow edges will beCEdge(0, 1),CEdge(1, 2),CEdge(1, 3),CEdge(2, 4),CEdge(3, 4)andCEdge(4, 5). -
Relation
Sourcedenotes the source commands. The factSource(l)is directly derived from the commandSource (l, v). In the example,Source(0)will be extracted. -
Relation
Sanitizerdenotes the sanitizer commands. The factSanitizer(l)is directly derived from the commandSanitizer (l, v, e). In the example,Sanitizer(4)will be extracted. -
Relation
Sinkdenotes the sink commands. The factSink(l)is directly derived from the commandSink (l, e). In the example,Sink(6)will be extracted. -
Relation
Killdenotes the definitions potentially killed at each point. The factKill(l1, l2)represents that the definition atl2is killed by the definition atl1. In the example,Kill(2, 3)andKill(3, 2)will be extracted. -
Relations
DefandUsedenote definitions and uses of values, respectively. The factsDef(l, v)andUse(l, v')are derived from the commandAssign (l, v, e)wherev'is a variable ine. Given an expressione, one can get all variables ineusing functionget_varsin theCommandmodule. Here,Def(0, x),Def(2, y),Def(4, y),Use(4, x), andUse(6, y)will be extracted.
You will extract the facts by traversing control-flow graphs. The CFG module provides two functions:
- Function
fold_vertexfor traversing vertices of a graph - Function
fold_edgesfor traversing edges of a graph
The functions behave similarly to List.fold_left in the OCaml standard library.
For their usage, see the example implementations of num_of_assignments and num_of_obvious_bugs in src/CFG.ml.
Students will write rules to derive inductive facts.
Given a set of constraints, each function derive_* derives new facts following the definition in the lecture slides.
The ultimate goal of the analysis is to derive all inductive facts until reaching a fixed point.
The fixed point solving can be done by iteratevly applying the analysis rules (i.e., derive_*) to derive analysis facts
until there is no more changes.
Students will implement this solver in the solve function.
After running the analyzer, the analyzer will report potentially vulnerable points (i.e., Alarm).
Also, the final set of derived facts will be shown in result.txt.
Input programs in this homework are assumed to have only sub-features of the C language as follows:
- Programs only have a single function definition of
mainand do not have global variables. - You should handle assignments, branches, and loops.
- There is no function call except for the ones to
source,sanitizerandsinkthat are declared in this file. - All the other instructions are considered to be no-op.