A complete, teaching-oriented compiler implementation in Java that compiles C-style source code into assembly-like intermediate code. This project demonstrates all major stages of compilation: lexical analysis, parsing, semantic analysis, symbol table management, and code generation.
Disclaimer: This project is designed for educational purposes and learning compiler construction concepts.
- β Complete Compilation Pipeline: From source code to generated assembly instructions
- β Lexical Analysis: Tokenizes source code with support for keywords, operators, numbers, and identifiers
- β Recursive Descent Parser: Builds Abstract Syntax Tree (AST) from token stream
- β Symbol Table Management: Hierarchical scope-based symbol tables with offset tracking
- β
Type System: Support for
int,float,boolean, and array types - β Code Generation: Generates stack-based assembly-like intermediate code
- β
File Output: Saves compiled code to
.asmfiles in acompiled/directory - β Comprehensive Test Suite: 10 test files covering various language features
- Quick Start
- Compiler Architecture
- Language Features
- Project Structure
- Usage Examples
- Output Format
- Test Suite
- Development
- Java Development Kit (JDK) 11 or newer
From the project root directory:
# Create output directory
mkdir -p out
# Compile all Java source files
javac -d out src/**/*.javaNavigate to the out directory and run the compiler:
cd out
# Compile a test file
java javaCompiler.JavaCompiler ../src/test/09_mixed_operations.tstThis will:
- Parse the source code and build the AST
- Display the symbol table tree
- Generate and display intermediate code
- Save the compiled code to
compiled/09_mixed_operations.asm
This compiler implements a complete compilation pipeline with the following stages:
Location: src/lexer/
The lexer tokenizes source code into a stream of tokens. It recognizes:
- Keywords:
if,else,while,do,break,true,false - Data Types:
int,float,boolean - Operators:
+,-,*,/,<,>,<=,>=,==,!=,&&,||,! - Literals: Integer numbers, floating-point numbers, boolean values
- Identifiers: Variable names
- Delimiters:
{,},(,),[,],;,=
Key Classes:
Lexer.java- Main tokenizerToken.java- Base token classKeyword.java- Keyword tokensNum.java- Integer literalsReal.java- Float literalsTag.java- Token type constants
Location: src/parser/
Uses recursive descent parsing to build an Abstract Syntax Tree (AST). The parser implements the following grammar:
PROG β BLOCK
BLOCK β { DECLS STMTS }
DECLS β DECLS DECL | Ξ΅
DECL β TYPE id ;
TYPE β TYPE [num] | int | float | boolean
STMTS β STMTS STMT | Ξ΅
STMT β id = BOOL ; | if (BOOL) STMT | if (BOOL) STMT else STMT
| while (BOOL) STMT | do STMT while (BOOL) ;
| break ; | BLOCK
BOOL β BOOL || JOIN | JOIN
JOIN β JOIN && EQUALITY | EQUALITY
EQUALITY β EQUALITY == REL | EQUALITY != REL | REL
REL β EXPR < EXPR | EXPR <= EXPR | EXPR >= EXPR | EXPR > EXPR | EXPR
EXPR β EXPR + TERM | EXPR - TERM | TERM
TERM β TERM * UNARY | TERM / UNARY | UNARY
UNARY β !UNARY | -UNARY | FACTOR
FACTOR β (BOOL) | id | id[BOOL] | num | real | true | false
Key Features:
- Handles operator precedence correctly
- Supports nested scopes with proper symbol table management
- Error reporting with line numbers
Location: src/inter/
The AST is composed of two main categories:
Constant- Literal values (numbers, booleans)Id- Variable identifiersArith- Arithmetic operations (+, -, *, /)Rel- Relational operations (<, >, <=, >=, ==, !=, &&, ||)Not- Logical negationUnary- Unary minusAccess- Array element access
Set- Variable assignmentSetArrayElem- Array element assignmentIf- Conditional statementIfElse- Conditional with else branchWhile- While loopDoWhile- Do-while loopBreak- Break statementStmtSeq- Statement sequence
Location: src/symbol/
Manages variable declarations and type information with:
- Hierarchical Scopes: Nested block scopes with parent-child relationships
- Type Checking: Support for int, float, boolean, and array types
- Offset Tracking: Calculates memory offsets for variables
- Scope Resolution: Proper variable lookup through scope chain
Key Classes:
Env.java- Symbol table environment (one per scope)Type.java- Type representationsArray.java- Array type with dimensionsSymbolTablePrinter.java- Pretty-prints the symbol table tree
Location: src/codegen/
Generates stack-based assembly-like intermediate code with instructions:
Instructions:
PUSH <value>- Push value onto stackLOAD <var>- Load variable value onto stackSTORE <var>- Store top of stack to variableADD,SUB,MUL,DIV- Arithmetic operationsCMP <op>- Compare operationNOT- Logical negationJZ <label>- Jump if zeroJNZ <label>- Jump if not zeroJMP <label>- Unconditional jumpLABEL <name>- Label definitionLOADARR <array>- Load array elementSTOREARR <array>- Store to array element
Key Classes:
CodeGen.java- Code generator (visitor pattern)Instruction.java- Instruction representation
int i;
float x;
boolean done;
int arr[10];x = a + b * c;
y = (a - b) / 2.0;
z = -x;if (x > 5 && y < 10) { ... }
while (i < 100 || !done) { ... }
if (a == b) { ... }// If-else
if (condition) {
statement;
} else {
statement;
}
// While loop
while (condition) {
statement;
}
// Do-while loop
do {
statement;
} while (condition);
// Break
while (true) {
if (condition) break;
}{
int x;
x = 5;
{
float y;
y = 2.5;
{
boolean flag;
flag = true;
}
}
}C-Compiler/
βββ README.md
βββ src/
β βββ codegen/ # Code generation
β β βββ CodeGen.java # Main code generator
β β βββ Instruction.java # Instruction representation
β βββ inter/ # Intermediate representation (AST)
β β βββ Node.java # Base AST node
β β βββ expr/ # Expression nodes
β β β βββ Constant.java
β β β βββ Id.java
β β β βββ Access.java
β β β βββ Expr.java
β β β βββ Rel.java
β β β βββ arith/
β β β β βββ Arith.java
β β β β βββ Unary.java
β β β βββ logic/
β β β βββ Not.java
β β βββ stmt/ # Statement nodes
β β βββ Stmt.java
β β βββ Set.java
β β βββ SetArrayElem.java
β β βββ If.java
β β βββ IfElse.java
β β βββ While.java
β β βββ DoWhile.java
β β βββ Break.java
β β βββ StmtSeq.java
β βββ parser/ # Parser
β β βββ Parser.java # Recursive descent parser
β βββ lexer/ # Lexical analyzer
β β βββ Lexer.java # Tokenizer
β β βββ Token.java # Base token
β β βββ Keyword.java # Keywords
β β βββ Num.java # Integer literals
β β βββ Real.java # Float literals
β β βββ Tag.java # Token tags
β βββ symbol/ # Symbol table & types
β β βββ Env.java # Environment/scope
β β βββ Type.java # Type system
β β βββ Array.java # Array types
β β βββ SymbolTablePrinter.java
β βββ javaCompiler/ # Main driver
β β βββ JavaCompiler.java # Entry point
β βββ test/ # Test files
β βββ 01_empty_block.tst
β βββ 02_variable_declarations.tst
β βββ 03_simple_assignment.tst
β βββ 04_arithmetic_ops.tst
β βββ 05_boolean_expressions.tst
β βββ 06_if_else.tst
β βββ 07_while_loop.tst
β βββ 08_do_while_break.tst
β βββ 09_mixed_operations.tst
β βββ 10_complex_boolean.tst
βββ out/
βββ [compiled class files]
βββ compiled/ # Generated assembly files
βββ *.asm
Input (test.tst):
{
int x;
int y;
int result;
x = 10;
y = 20;
result = x + y * 2;
}Run:
java javaCompiler.JavaCompiler test.tstOutput:
=== Symbol Table Tree ===
Scope Level 0
Scope Level 1
+------------+----------+--------+
| Identifier | Type | Offset |
+------------+----------+--------+
| result | int | 8 |
| y | int | 4 |
| x | int | 0 |
+------------+----------+--------+
=== Generated Code ===
PUSH 10
STORE x
PUSH 20
STORE y
LOAD x
LOAD y
PUSH 2
MUL
ADD
STORE result
=== Output saved to: compiled/test.asm ===
Input:
{
int i;
i = 0;
while (i < 5) {
i = i + 1;
}
}Generated Code:
PUSH 0
STORE i
LABEL L0
LOAD i
PUSH 5
CMP <
JZ L1
LOAD i
PUSH 1
ADD
STORE i
JMP L0
LABEL L1
The project includes 10 comprehensive test files:
| Test File | Description |
|---|---|
01_empty_block.tst |
Empty block |
02_variable_declarations.tst |
Variable declarations with different types |
03_simple_assignment.tst |
Basic assignment statements |
04_arithmetic_ops.tst |
Arithmetic operations and precedence |
05_boolean_expressions.tst |
Boolean logic and comparisons |
06_if_else.tst |
Conditional statements |
07_while_loop.tst |
While loop structures |
08_do_while_break.tst |
Do-while loops with break |
09_mixed_operations.tst |
Complex nested structures |
10_complex_boolean.tst |
Advanced boolean expressions |
Linux/macOS:
cd out
for f in ../src/test/*.tst; do
echo "--- Running: $f ---"
java javaCompiler.JavaCompiler "$f"
echo ""
doneWindows (PowerShell):
cd out
Get-ChildItem ..\src\test\*.tst | ForEach-Object {
Write-Host "--- Running: $($_.Name) ---"
java javaCompiler.JavaCompiler $_.FullName
Write-Host ""
}# Clean previous build
rm -rf out
# Create output directory
mkdir -p out
# Compile
javac -d out src/**/*.java
# Navigate to output
cd outThe codebase is designed to be extensible:
- New Operators: Add to
Lexer.javaand create corresponding AST nodes ininter/expr/ - New Statements: Create new statement classes in
inter/stmt/and updateParser.java - New Data Types: Extend
Type.javaand update the parser'stype()method - Code Generation: Add new instruction types in
Instruction.javaand updateCodeGen.java
- Single Responsibility: Each class has a clear, focused purpose
- Visitor Pattern: Code generation uses visitor pattern for AST traversal
- Recursive Descent: Parser mirrors the grammar structure
- Hierarchical Scopes: Symbol table uses parent pointers for scope chain
This compiler is organized to demonstrate key compiler concepts:
- Lexical Analysis: See
Lexer.javafor tokenization techniques - Parsing: Study
Parser.javafor recursive descent implementation - AST Construction: Review
inter/for tree node design - Symbol Tables: Examine
Env.javafor scope management - Code Generation: Analyze
CodeGen.javafor instruction generation
Potential improvements and extensions:
- Optimization: Constant folding, dead code elimination
- Arrays: Full multi-dimensional array support
- Functions: Function declarations and calls
- Structs: User-defined types
- Error Recovery: Better error handling and recovery
- Target Code: Generate actual machine code or bytecode
- Virtual Machine: Implement an interpreter for the generated code
- Debugger: Add debugging support with breakpoints
This is an educational project. Contributions that enhance learning value are welcome!
- Keep changes focused and well-documented
- Add test cases for new features
- Maintain the teaching-oriented style
- Comment complex algorithms
- Follow existing code conventions
For questions, issues, or suggestions:
- Open an issue on the repository
- Create a pull request with clear descriptions
- Reach out for collaboration opportunities
This project is unlicensed and provided as-is for educational purposes.
Happy Compiling! π