[InferAtomsPass] Instruction scheduling #2
Open
robertzhidealx
wants to merge
18
commits into
CMUAbstract:master
from
robertzhidealx:optimizations
…tor code

Below are the key changes:
- Use LLVM's new pass manager, a major improvement from the legacy one.
- Fix a shortcoming of the inference algorithm to actually collect all uses of a fresh/consistent variable.
- Optimize the inference cleanup algorithm to remove all instructions associated with the arguments of fresh/consistent annotations.
- Thoroughly log debug messages throughout the components of the pass for a clearer view of the process.
- Rename files, structs, functions, variables, etc. to be more descriptive and consistent.
- General code style refactoring (e.g., use `auto` and structured bindings (destructuring) where possible).
- Added simple C tests to `benchmarks/ctests`.
Useful extensible shortcuts to running tests.
Step 1 of optimizing atomic regions for (smaller) size. In essence, it's now necessary to have a complete picture of which instructions are tainted (whereas before we really only needed to know the boundaries of a region). Test plan: `make eg3` for an example where the freshness atomic region size is reduced thanks to the optimization.
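To illustrate what "a complete picture of which instructions are tainted" can mean, here is a minimal sketch of forward taint propagation over a toy instruction list. `TaintNode` and `collectTainted` are hypothetical names, not the pass's real data structures, and the sketch only follows operand edges (it ignores flow through memory, which the real pass has to handle for `store`/`load` pairs).

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-in for an IR instruction: an id plus the ids of its operands.
// Hypothetical model for illustration only.
struct TaintNode {
  int id;
  std::vector<int> operands;
};

// Returns the ids of every instruction that transitively uses a seed value.
// The seeds themselves (e.g. the IO call) are part of the result.
std::set<int> collectTainted(const std::vector<TaintNode>& insts,
                             const std::set<int>& seeds) {
  std::set<int> tainted = seeds;
  bool changed = true;
  while (changed) {  // iterate to a fixed point so list order does not matter
    changed = false;
    for (const auto& inst : insts) {
      if (tainted.count(inst.id)) continue;
      for (int op : inst.operands) {
        if (tainted.count(op)) {
          tainted.insert(inst.id);
          changed = true;
          break;
        }
      }
    }
  }
  return tainted;
}
```

Everything outside the returned set is a candidate for being moved out of the region; everything inside it has to stay.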
6083f0e to
cde1b66
robertzhidealx
commented
Dec 19, 2023
errs() << "[regionsNeeded] Go over all block insts\n";
#endif
std::set<BasicBlock*> seenBlocks;
for (auto& [_, B] : blocks) {
Author
Instruction scheduling starts here.
The optimization is now much more robust against general source programs. Freshness annotations now work pretty well!

The main fix to the previous setup involves a mapping from old instructions to cloned ones. Since cloning an instruction (e.g., BinaryOperator) doesn't automatically clone its operands, this mapping is required to help replace the operands of cloned instructions with the clones of those operands. Cloning is the only approach to such replacements due to the LLVM IR being in SSA form.

Test plan:
Run examples01/02/03 to see the transformations. For example,
```sh
make eg3
```
Before optimization:
```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  call void @atomic_start() ; <--- START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  store i32 1, ptr %y, align 4
  %0 = load i32, ptr %y, align 4
  %add = add nsw i32 %0, 1
  store i32 %add, ptr %z, align 4
  %1 = load i32, ptr %z, align 4
  call void @log(i32 noundef %1)
  %2 = load i32, ptr %x, align 4
  call void @log(i32 noundef %2)
  call void @atomic_end() ; <--- END
  ret void
}
```
After optimization:
```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  call void @atomic_start() ; <--- START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  %0 = load i32, ptr %x, align 4
  call void @log(i32 noundef %0)
  call void @atomic_end() ; <--- END
  store i32 1, ptr %y, align 4
  %1 = load i32, ptr %y, align 4
  %2 = add nsw i32 %1, 1
  store i32 %2, ptr %z, align 4
  %3 = load i32, ptr %z, align 4
  call void @log(i32 noundef %3)
  ret void
}
```
You may also link, build, and run an executable via:
```sh
make run_eg3 && ../../benchmarks/ctests/example03.out
```
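The old-to-clone mapping described above can be sketched without LLVM as a two-pass routine: clone everything first, then patch operands through the map. `ToyInst` and `cloneAndRemap` are hypothetical stand-ins and not the code in this PR. LLVM ships `ValueToValueMapTy` and `RemapInstruction` for this kind of patching; whether the pass uses them or its own map is not stated here.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Toy stand-in for an IR instruction whose operands point at other instructions.
struct ToyInst {
  const char* opcode;
  std::vector<ToyInst*> operands;
};

// Clones `originals` in order and rewires each clone's operands so that they
// reference clones where one exists. Operands defined outside the cloned set
// (e.g. an alloca that stays put) are left pointing at the original.
std::vector<std::unique_ptr<ToyInst>> cloneAndRemap(
    const std::vector<ToyInst*>& originals) {
  std::map<const ToyInst*, ToyInst*> oldToNew;
  std::vector<std::unique_ptr<ToyInst>> clones;

  // Pass 1: shallow clones. Their operands still reference the originals.
  for (ToyInst* inst : originals) {
    clones.push_back(std::make_unique<ToyInst>(*inst));
    oldToNew[inst] = clones.back().get();
  }

  // Pass 2: replace every operand that has a clone with that clone.
  for (auto& clone : clones) {
    for (ToyInst*& op : clone->operands) {
      auto it = oldToNew.find(op);
      if (it != oldToNew.end()) op = it->second;
    }
  }
  return clones;
}
```

Doing the patching in a second pass means the order in which instructions were cloned does not matter; a clone can reference another clone that was created after it.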
f531af3 to
c772992
...by moving non-IO instructions out of regions.
…regions Mostly working, except optimizations done on a FreshConsistent region need to converge back into a single (nested) region.
…ation

When a variable has both freshness and consistency constraints, the overlap between the optimized inferred atomic regions is now properly handled by nesting them such that only the outermost bounds count. See benchmarks/ctests/example04.ll for an example.

Before:
```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  call void @atomic_start() ; <-- OUTER START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  call void @atomic_start() ; <-- INNER START
  %call1 = call i32 @input()
  call void @atomic_end() ; <-- INNER END
  store i32 %call1, ptr %y, align 4
  %0 = load i32, ptr %x, align 4
  call void @log(i32 noundef %0)
  %1 = load i32, ptr %y, align 4
  call void @log(i32 noundef %1)
  call void @atomic_end() ; <-- OUTER END
  ret void
}
```
After:
```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  call void @atomic_start() ; <-- OUTER START
  %call = call i32 @input()
  call void @atomic_start() ; <-- INNER START
  %call1 = call i32 @input()
  call void @atomic_end() ; <-- INNER END
  store i32 %call1, ptr %y, align 4
  %0 = load i32, ptr %y, align 4
  call void @log(i32 noundef %0)
  call void @atomic_end() ; <-- OUTER END
  store i32 %call, ptr %x, align 4
  %1 = load i32, ptr %x, align 4
  call void @log(i32 noundef %1)
  ret void
}
```
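One way to read "only the outermost bounds count" is as a depth counter over the `atomic_start`/`atomic_end` markers. The sketch below works on a flat list of callee names; `outermostRegions` is a hypothetical helper, and treating nesting as a simple counter is an assumption about the runtime's semantics, not something this PR spells out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Returns [start, end] index pairs of the outermost atomic regions in a
// straight-line sequence of calls. Nested start/end pairs only change the
// depth; a region is recorded when the depth returns to zero.
std::vector<std::pair<int, int>> outermostRegions(
    const std::vector<std::string>& calls) {
  std::vector<std::pair<int, int>> regions;
  int depth = 0;
  int start = -1;
  for (int i = 0; i < static_cast<int>(calls.size()); ++i) {
    if (calls[i] == "atomic_start") {
      if (depth == 0) start = i;  // remember only the outermost start
      ++depth;
    } else if (calls[i] == "atomic_end" && depth > 0) {
      --depth;
      if (depth == 0) regions.emplace_back(start, i);
    }
  }
  return regions;
}
```

Applied to the "After" IR above (reduced to its call and store/log steps), the inner pair is absorbed and a single region from the outer start to the outer end remains.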
…re optimizing loops

One objective as of now is to make optimizations even more robust by supporting more corner cases. For an example where the IO function is `input(int i)` (`benchmarks/ctests/example06.c`), optimizations shouldn't incorrectly delay the instructions related to the argument `i`, and should instead produce:
```llvm
define void @app() #0 {
entry:
  %i = alloca i32, align 4
  %x = alloca i32, align 4
  store i32 1, ptr %i, align 4 ; <--
  %0 = load i32, ptr %i, align 4 ; <--
  call void @atomic_start()
  %call = call i32 @input(i32 noundef %0) ; <-- DEPENDS ON THE ABOVE
  store i32 %call, ptr %x, align 4
  %1 = load i32, ptr %x, align 4
  call void @log(i32 noundef %1)
  call void @atomic_end()
  ret void
}
```
As for loop optimizations, unlike WARio (which targets checkpointing runtimes), loop unrolling (i.e., creating multiple smaller copies of the loop) doesn't help in atomic region inference, since these loops must still be in the same region. Thus, the "costliness" of the region won't be lessened.

There are optimizations to be done though. For instance, loops entirely untainted by inputs under constraint(s) can be delayed and moved out of atomic regions just like many other instructions can. The difficulty with this part lies in rewiring the complex branching/connections among the basic blocks that form these loops, making an optimizing analysis harder to devise. `benchmarks/ctests/example05` illustrates an instance where the optimization above applies. I will be working on this as a next step.
0907a4a to
6859d35
f5316b2 to
f56c010
Extract untainted instructions into their own loop that doesn't go into the atomic region. Test plan: `make eg5` and observe the difference between `benchmarks/ctests/example05.ll` (optimized) and `benchmarks/ctests/example05.orig.ll` (original), or `make eg7`.
f56c010 to
b8b0037
…and refactoring for concision

Now, the instructions "tainted" by an IO call will be included in the fresh set as well, so that they stay before the IO call, within their atomic region. This is a more fundamental solution than before, where these instructions were only special-cased during optimization.

The optimization now has a more modular structure where common instruction patching logic is extracted into a reusable procedure that can be run more than once (`Helpers::patchClonedBlock`). It comes into play after cloning a basic block, to rewire its instructions to properly reference each other.

Test plan:
`make`
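The set of instructions an IO call depends on can be found by walking operand edges backwards from the call. The sketch below is a toy model with hypothetical names (`backwardSlice`, integer ids); it assumes memory dependencies, such as a `load` reading what an earlier `store` wrote, are already present as explicit edges in the map.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Returns every instruction id that `root` transitively depends on,
// excluding `root` itself. `operandsOf` maps an id to the ids it reads.
std::set<int> backwardSlice(const std::map<int, std::vector<int>>& operandsOf,
                            int root) {
  std::set<int> seen;
  std::vector<int> worklist{root};
  while (!worklist.empty()) {
    int cur = worklist.back();
    worklist.pop_back();
    auto it = operandsOf.find(cur);
    if (it == operandsOf.end()) continue;  // no recorded operands
    for (int op : it->second) {
      // insert().second is true only the first time an id is seen
      if (seen.insert(op).second) worklist.push_back(op);
    }
  }
  return seen;
}
```

For a shape like the `input(int i)` case (an `alloca`, a `store` into it, a `load` from it, then the call), the slice of the call contains the `alloca`, the `store`, and the `load`, which are exactly the instructions that must not be delayed past the call.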
In the case of loop conditions that depend on fresh/consistent input
values, no instruction in the loop body can be extracted out from the
atomic region, as shown in the example below:
```rust
fn app() -> () {
    let x = input();
    for _ in 0..10 {
        let y = 1;
        log(y + 2);
        log(x);
    }
    Fresh(x);
}
```
Test plan:
`make eg8`
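The extraction rule stated above (a tainted loop condition pins the whole body) can be written down as a small partitioning step. This is a sketch under that stated rule only; `LoopSplit` and `splitLoopBody` are hypothetical names, and the real pass works on basic blocks rather than a flat list of ids.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Result of deciding, per loop-body instruction, whether it stays in the
// atomic region or moves into a separate loop outside of it.
struct LoopSplit {
  std::vector<int> keptInRegion;
  std::vector<int> extracted;
};

// If the loop condition is tainted, nothing may be extracted. Otherwise only
// the tainted body instructions stay inside the region.
LoopSplit splitLoopBody(const std::vector<int>& body,
                        const std::set<int>& tainted,
                        bool conditionTainted) {
  LoopSplit result;
  for (int inst : body) {
    if (conditionTainted || tainted.count(inst)) {
      result.keptInRegion.push_back(inst);
    } else {
      result.extracted.push_back(inst);
    }
  }
  return result;
}
```

Both output lists preserve the original instruction order, which keeps dependencies among the extracted instructions intact.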
Fix an issue with extracting IO functions from source code. Add several tests, including a few in Rust.
Only small changes are required for the optimization to work on Rust programs involving loops. See tests `example.rs`, `example11.rs`, and `example12.rs`.
f04cba1 to
883cb9c
Optimizing atomic regions for (smaller) size.
It's now necessary to have a complete picture of which instructions are tainted and which aren't (whereas before we really only needed to know the boundaries of a region).
Test plan:
`make eg3` for an example where the freshness atomic region size is reduced (quite substantially) thanks to the optimization.
Before optimization:
After optimization (example03.ll):
For an example of an optimization of FreshConsistent regions, check out example04.*, or run the example yourself via `make eg4`/`make run_eg4`.
Loops are also now optimized. Untainted loop instructions are extracted into their own loop, not to be wrapped in the atomic region, thereby reducing its size. Check out example05.* and example07.* for example transformations.
In this process, much care was taken to soundly clone and rewire the basic blocks constituting loops together, as you may observe from the example IRs and the code. However, it is an overall elegant approach, with clear cloning logic and special bookkeeping mostly for blocks that handle loop condition checking and loop variable updating.