Skip to content

C++ implementation of Lox programming language. It compiles the source code to bytecode, and executes it on a VM. It also implements, lists, maps and bytecode serialization

License

Notifications You must be signed in to change notification settings

ananthvk/cpp-lox

Repository files navigation

cpp-lox

My implementation of second part of Lox programming language from Crafting interpreters in C++. This implementation uses a bytecode VM to execute the programs.

How to run?

Install meson, ninja and a C++ 20 capable compiler

$ meson setup builddir
$ cd builddir
$ ninja -j8

If you are on linux, run

$ ./src/bin/cpplox

On windows, run

.\src\bin\cpplox.exe

After building, three binaries are produced

./builddir/src/bin/cpplox - Includes a Compiler, VM and a REPL

./builddir/src/bin/loxvm - Standalone VM, does not include compiler, used for running compiled lox programs

./builddir/src/bin/loxdump - For disassembling compiled lox programs

./builddir/src/bin/loxcrc - For updating the CRC checksum of a compiled lox program (operates on the file inplace)

To run tests

To run tests, you need to have Python 3

First install pytest, a test automation library used to test programs.

Create a virtual environment

$ python -m venv .venv

Install dependencies

$ pip install -r requirements.txt

Run tests

$ meson setup builddir -Denable-tests=true
$ cd builddir
$ ninja -j8
$ meson test -v

To run benchmarks

$ cd builddir
$ meson test --benchmark --interactive

Development build

Have Clang installed along with asan. You also need to follow the steps in the above section, pytest must be installed and available on the PATH

Then run the following command,

$ ./development.sh

Help

$ cpplox --help
Lox compiler and interpreter in C++
Usage:
  cpplox [OPTION...] positional parameters

      --print-tokens            Show all tokens generated by the lexer 
                                along with line position
      --compile-only            Compiles the source code to bytecode but 
                                does not execute it
      --dump-bytecode           Dumps the bytecode generated by the 
                                compiler before execution
      --trace-execution         Used for debugging: Prints the bytecode 
                                instruction before executing it
      --trace-eval-stack        Used for debugginig: Prints the internal 
                                evaluation stack before the instruction is 
                                executed
      --enable-step-mode        Used for debugging: Enables step mode and 
                                waits for input before an instruction is 
                                executed
      --eval-stack-size arg     Maximum evaluation stack size allowed 
                                (default: 1024)
      --gc-initial-collection-threshold arg
                                Number of bytes after which GC will collect 
                                garbage for the first time (default: 
                                1048576)
      --log-gc                  Logs events that occur during a GC 
                                collection
      --display-mem-stats       Display memory stats, such as number of 
                                objects allocated, bytes allocated, etc on 
                                program exit
      --stress-gc               Stresses the GC by triggering a collection 
                                everytime a new object is created
  -c, --command arg             Execute the given command and exit
  -g, --emit-debug-info         Include debug information in compiled 
                                bytecode (line number mappings). Can only 
                                be used with -o/--output flag
  -o, --output arg              Compiles the script, generates the bytecode 
                                and store it in the specified file. Does 
                                not execute the program
  -h, --help                    Prints this help message
  -v, --version                 Prints program version

Examples

Check out the examples in the examples/ directory.

For a complex example see, a math evaluation program using Shunting-Yard algorithm here

Notable Changes from the book & Usage

Serialization of bytecode to file and execution of it

Supports storing the compiled bytecode in a file, then loading and executing it. See bytecode file format to learn more about how the compiled program is stored on disk.

To generate a bytecode file, use the -o/--output flag, and optionally the -g/--emit-debug-info flag to emit the line-bytecode offset mapping.

Example:

$ cpp-lox program.lox -o compiled.loxc

or

$ cpp-lox program.lox -o compiled.loxc -g

If you also want to include bytecode line positions.

Execute it with,

$ cpp-lox compiled.loxc

Use src/bin/loxdump to disassemble a compiled lox program

$ ./src/bin/loxdump compiled.loxc

Lists (dynamic arrays)

Implement list data structure (Note: It's implemented as a dynamic array (vector) rather than a linked list (like CPython))

Example:

var x = [1, 2, 3]; // Create a list
println(x[0]); // Get element at index, prints 1
println(x); // Print the list
println(x[0] = 8); // Prints 8, sets index 0 to 8

Implementation: Two new opcodes LIST and LIST_APPEND have been added. LIST takes a single byte operand, the intial size of the list (n), and it reads & pops the top n values on stack and creates a new list object. If n >= 10, the compiler will emit a LIST list instruction to build the list with 10 values, and for the remaining values, it adds a LIST_APPEND after emitting the bytecode for the expression. This is done so that the stack does not overflow when a list is declared with a lot of elements.

Note: Allocating huge lists eg (list(10000000, 0)) may cause the program to pause because the GC has to trace through all those values even if there are no object references. In the future, try to optimize this

Another two instructions: LOAD_INDEX and STORE_INDEX to access and modify list members

For STORE_INDEX: [container, index, value] -> [ ] after execution For LOAD_INDEX: [container, index] -> [value]

Native functions to work with lists

  • len(list) int: Returns the length of the list, works the same with strings
  • cap(list) int: Returns the capacity of the backing array of the list
  • list(len, default, cap) list: Creates a new list with the specified length, capacity, and default value. All three arguments are optional. len is the initial length of the list, cap is the capacity of the list (after which a new reallocation will be triggered), and default is the default value when creating the list (default is nil).
  • append(list, element): Adds an element to the end of a list
  • delete(list, index): Removes an element at specified index from list
  • clear(list): Clears all values of the list
  • pop(list) value: Removes and returns the last element from a list

String indexing

Indexing works with strings too

The result of indexing a string results in another string (there is no character datatype)

Example:

var x = "hello";
println(x[1]); // prints e

x[2] = "x"; // Not allowed, throws error since strings are immutable

Maps (Hashmap)

This implementation also supports associative data structure, map. Internally it is implemented using hash maps (the same implementation is used for methods and fields)

Declaring a new map,

// Keys & values can be any expression
var x = {
    "name": "hello",
    3: "three",
    3: "three",
};

// Empty map
var x = {};

// Can also be declared as
var x = map();

// Printing type
println(type(x)) // map

Accessing & Modifying values in a map

var x  = {
    "three": 3,
    "four": 4
};
println(x["three"]); // Prints 3
x["three"] = -3; // Updates the value of key "three" to -3
x["two"] = 2; // If a key does not exist, it gets created
println(x["five"]); // If a key does not exist, throws runtime error

Native functions to work with maps

  • len(map) int: Returns the number of key-value pairs in the map
  • map() map: Creates a new map
  • append(map, key, value): Adds the key-value pair to the map (same as using the [] operator)
  • delete(map, key) bool: Deletes the key from the map, returns true if the key was removed, false if the key does not exist
  • keys(map) list: Returns a list of keys of the map
  • values(map) list: Returns a list of values of the map
  • has(map, key) bool: Returns true if the key exists in the map
  • clear(map): Clears all key-value pairs of the map
  • get(map, key, default): Returns the value associated with the key in the map, if the key does not exist, returns the set default value. default is optional, and if not specified, returns nil

Note: Currently there is no way to iterate over all elements of the map using a loop (no range for loop yet), so use the keys function to retrieve all the keys, then iterate over this list using a loop.

Note: Also objects can be used as map keys, but there is no way to specify custom hash or equality operators, so they will be checked for equality based on whether they are the same object. (same memory location)

Lists & Maps are unhashable, and cannot be used as map keys (similar behavior to CPython)

De-duplication of integers

The compiler de-duplicates integers (but not real numbers)

More constants

Supports 65k constants, and 65k global variables (accessed with 2 byte unsigned index)

const variables

const variables

In global scope,

If a constant of the same name has already been declared, do not allow
redeclaration with var Example:
>> const x = 32;
>> var x = 50; // Not allowed

>> const x = 80;
>> const x = 90; // Not Allowed

>> var x = 95;
>> var x = 100; // Allowed

One difference is that redeclaration with const is not allowed since it helps make the vm simpler, otherwise an extra flag/byte needs to be maintained to identify if it's a const declaration or a var declaration.

switch statement

Supports C-style switch statements with cases and default:

switch (value) {
    case 1:
        print "one";
        break;
    case 2:
        var foo = "hello";
        print foo;
        break;
    default:
        print "other";
}
  • Both the selector and the case value can be expressions
  • Each arm of the switch statement defines a local scope, so variables can be declared within it
  • There is no break statement, there is no fallthrough
  • Cases cannot appear after the default statement

Local scope redeclaration

In local scope redeclaration of any form is disallowed

List of native functions

  • sqrt(value number) double - Returns the square root of the number

  • exit(status_code int) - Exits the interpreter with the status code

  • input() string - Takes a line of input from stdin, does not include terminating \n

  • print(args...) - Prints all the arguments, separated by a space

  • println(args...) - Prints all the arguments, separated by a space, and adds a newline at the end

  • len(str string) int - Returns the length of the string / list

  • to_string(value) string - Converts the given value into a string

  • to_int(value) int - Converts the given value into an int

  • to_double(value) double - Converts the given value into a double

  • type(value) string - Returns the type of the value as string

  • rand() double - Returns a random double between [0, 1) (inclusive of 0, exclusive of 1)

  • randint(m int, n int) int - Returns a random integer between m & n (inclusive at both ends)

  • assert(value bool, message string) - Raises a runtime error & prints the message if value is false. Does nothing if value is true

  • sys__mem_get_bytes_allocated() int - Returns the total number of bytes allocated by the garbage collector

  • sys__mem_get_bytes_freed() int - Returns the total number of bytes freed by the garbage collector

  • sys__mem_get_next_gc() int - Returns the threshold in bytes at which the next garbage collection will trigger

  • sys__mem_get_objects_created() int - Returns the total number of objects created by the garbage collector

  • sys__mem_get_objects_freed() int - Returns the total number of objects freed by the garbage collector

  • sys__mem_get_live_objects() int - Returns the current number of live objects managed by the garbage collector

  • sys__mem_get_net_bytes() int - Returns the net bytes currently allocated (allocated - freed)

  • sys__mem_display_gc_stats() - Prints detailed garbage collection statistics to the console

  • hash(value) - Returns the hash value (int64_t of a value), throws error if the type is not hashable

  • is_hashable(value) bool - Returns true if the value is hashable, false otherwise (like lists, maps)

No print statements

print statement is renamed to echo. This was done to maintain compatability with tests without major refactoring. Do not use this statement unless needed, and instead use the print() and println() functions.

Note: The GC implementation in my project does not track all memory and it will use a bit of extra memory. This is because some parts of the application uses std::vector and std::string that are not managed by the garbage collector. In the book, all memory is allocated through reallocate() hence the GC has complete control over the memory. But in my implementation, all vectors & strings are wrapped in VM objects, so once they are freed, the associated vectors & strings get freed too.

Native functions for accessing properties

Native functions for manipulating and accessing properties (fields)

  • has_property(instance, string) bool - Returns true if the property exists on the instance (similar to hasattr of python)
  • get_property(instance, string) value - Returns the value of the property on the instance (similar to getattr of python)
  • set_property(instance, string, value) - Sets the property on the instance to the given value (similar to setattr of python)
  • del_property(instance, string) bool - Deletes a property on an instance, returns true if the property was deleted, false if the property did not not exist

Optional chaining operator ?.

This operator is similar to the one in JS and can be used to access properties of nested objects easily (when some intermediate object is nil / not an instance)

Example:

var result = x?.y?.z?.w;

If any of x, y, or z is nil, or the property does not exist, or the property is not an instance, the result is nil

If the base (x) is not defined, the expression results in an error

class Foo {}
var f = Foo();
f.x = print;
println(f.x?.prop); // nil (becasue x is a native function)

: instead of < for inheritance

For inheritance, instead of < operator, I have decided to use : in my language (similar to C++)

Example:

class Foo {
}
class Bar : Foo {
}

Additional OPCODES for optimization of constants

Three new instructions, ZERO, ONE, MINUS_ONE, that push 0, 1, and -1 respectively on to the stack. This was added as an optimization, so that the value need not be loaded from the constant table whenever these values are present.

Documents

Opcodes list

Bytecode file format

TODO

  • Fix division by zero error
  • Implement break statements
  • Implement slices
  • Implement clear() method for lists
  • Implement copy() method for lists & maps (shallow copy, copies only the top level values)
  • Range for loops (to iterate over maps/lists)
  • Support custom hash, comparision operators between objects
  • Fix opts.debug_trace_execution (L70 in vm)
  • flush native method (when using print)
  • implement a chr and ord function (to convert to and from ascii values)

About

C++ implementation of Lox programming language. It compiles the source code to bytecode, and executes it on a VM. It also implements, lists, maps and bytecode serialization

Topics

Resources

License

Stars

Watchers

Forks