@robertzhidealx robertzhidealx commented Dec 19, 2023

Optimizing atomic regions for (smaller) size.

It's now necessary to have a complete picture of which instructions are tainted and which aren't (whereas before we really only needed to know the boundaries of a region).
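To make the "complete picture" concrete, here is a minimal sketch of forward taint propagation over a toy instruction list. It is not the pass's actual code: `Inst`, `taintedInsts`, and `example03` are hypothetical names, the struct is a stand-in rather than an LLVM type, and memory is modeled very coarsely (a store of a tainted value taints the whole pointer, flow-insensitively).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for an IR instruction (hypothetical, not an LLVM class).
struct Inst {
    std::string def;                // value defined, "" if none
    std::vector<std::string> uses;  // values read (for a store: the stored value)
    std::string storeTo;            // pointer written by a store, "" otherwise
    bool isInput;                   // true for a call to the annotated input function
};

// Returns the indices of all tainted instructions, iterating to a fixpoint.
std::set<std::size_t> taintedInsts(const std::vector<Inst>& insts) {
    std::set<std::string> taintedVals;
    std::set<std::size_t> tainted;
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < insts.size(); ++i) {
            const Inst& in = insts[i];
            bool hit = in.isInput;
            for (const auto& u : in.uses)
                if (taintedVals.count(u) > 0) hit = true;
            if (!hit) continue;
            if (tainted.insert(i).second) changed = true;
            // Taint flows into the defined value and into any stored-to pointer.
            if (!in.def.empty() && taintedVals.insert(in.def).second) changed = true;
            if (!in.storeTo.empty() && taintedVals.insert(in.storeTo).second) changed = true;
        }
    }
    assert(tainted.size() <= insts.size());
    return tainted;
}

// The body of @app from the listing in this description, minus allocas and markers.
std::vector<Inst> example03() {
    return {
        {"%call", {}, "", true},         // 0: %call = call @input()
        {"", {"%call"}, "%x", false},    // 1: store %call, %x
        {"", {}, "%y", false},           // 2: store 1, %y
        {"%0", {"%y"}, "", false},       // 3: %0 = load %y
        {"%add", {"%0"}, "", false},     // 4: %add = add %0, 1
        {"", {"%add"}, "%z", false},     // 5: store %add, %z
        {"%1", {"%z"}, "", false},       // 6: %1 = load %z
        {"", {"%1"}, "", false},         // 7: call @log(%1)
        {"%2", {"%x"}, "", false},       // 8: %2 = load %x
        {"", {"%2"}, "", false},         // 9: call @log(%2)
    };
}
```

On this input, `taintedInsts(example03())` yields indices 0, 1, 8, and 9, which are the four instructions that stay inside the region in the optimized listing.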

Test plan: `make eg3` for an example where the size of the freshness atomic region is reduced (quite substantially) thanks to the optimization.

Before optimization:

define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  call void @atomic_start()         ; <--- START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  store i32 1, ptr %y, align 4
  %0 = load i32, ptr %y, align 4
  %add = add nsw i32 %0, 1
  store i32 %add, ptr %z, align 4
  %1 = load i32, ptr %z, align 4
  call void @log(i32 noundef %1)
  %2 = load i32, ptr %x, align 4
  call void @log(i32 noundef %2)
  call void @atomic_end()           ; <--- END
  ret void
}

After optimization (example03.ll):

define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  call void @atomic_start()         ; <--- START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  %0 = load i32, ptr %x, align 4
  call void @log(i32 noundef %0)
  call void @atomic_end()           ; <--- END
  store i32 1, ptr %y, align 4
  %1 = load i32, ptr %y, align 4
  %2 = add nsw i32 %1, 1
  store i32 %2, ptr %z, align 4
  %3 = load i32, ptr %z, align 4
  call void @log(i32 noundef %3)
  ret void
}
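The reordering above can be sketched as a stable partition: tainted instructions keep their relative order inside the region, and untainted ones keep theirs after `atomic_end`. This is only an illustration under the assumption that no tainted instruction depends on an untainted one; `schedule` and the string labels are hypothetical, not the pass's API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Rebuilds a straight-line body so that only tainted instructions sit
// between the region markers. Relative order within each group is preserved.
std::vector<std::string> schedule(const std::vector<std::string>& insts,
                                  const std::set<std::size_t>& tainted) {
    for (std::size_t idx : tainted) assert(idx < insts.size());

    std::vector<std::string> out;
    out.push_back("atomic_start");
    for (std::size_t i = 0; i < insts.size(); ++i)
        if (tainted.count(i) > 0) out.push_back(insts[i]);
    out.push_back("atomic_end");
    // Untainted instructions are delayed until after the region ends.
    for (std::size_t i = 0; i < insts.size(); ++i)
        if (tainted.count(i) == 0) out.push_back(insts[i]);
    return out;
}
```

Feeding it the ten body instructions of `@app` with tainted indices `{0, 1, 8, 9}` gives the same order as the optimized listing: the call to `@input`, the store to `%x`, the load of `%x`, and its `@log` call inside the region, with everything touching `%y` and `%z` afterwards.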

For an example of optimizing FreshConsistent regions, check out `example04.*`, or run the example yourself via `make eg4` or `make run_eg4`.

Loops are now optimized as well. Untainted loop instructions are extracted into their own loop, which is not wrapped in the atomic region, thereby reducing the region's size. Check out `example05.*` and `example07.*` for example transformations.

Much care was taken to soundly clone and rewire the basic blocks that make up a loop, as you may observe from the example IRs and the code. Even so, the approach is elegant overall: the cloning logic is clear, and special bookkeeping is needed mostly for the blocks that check the loop condition and update the loop variable.

…tor code

Below are the key changes:
- Use LLVM's new pass manager, a major improvement from the legacy one.
- Fix a shortcoming of the inference algorithm to actually collect all
uses of a fresh/consistent variable.
- Optimize the inference cleanup algorithm to remove all instructions
associated with the arguments of fresh/consistent annotations.
- Thoroughly log debug messages throughout the components of the pass
for a clearer view of the process.
- Rename files, structs, functions, variables, etc. to be more
descriptive and consistent.
- General code style refactoring (e.g., use `auto` and structured
bindings (destructuring) where possible).
- Added simple C tests to `benchmarks/ctests`.
Useful extensible shortcuts to running tests.
Step 1 of optimizing atomic regions for (smaller) size.

In essence, it's now necessary to have a complete picture of which
instructions are tainted (whereas before we really only needed to know
the boundaries of a region).

Test plan: `make eg3` for an example where the freshness atomic region
size is reduced thanks to the optimization.
errs() << "[regionsNeeded] Go over all block insts\n";
#endif
std::set<BasicBlock*> seenBlocks;
for (auto& [_, B] : blocks) {
Instruction scheduling starts here.

The optimization is now much more robust against general source
programs. Freshness annotations now work pretty well!

The main fix to the previous setup involves a mapping from old
instructions to cloned ones. Since cloning an instruction (e.g.,
BinaryOperator) doesn't automatically clone its operands, this mapping
is required to help replace the operands of cloned instructions with the
clones of those operands. Cloning is the only approach to such
replacements due to the LLVM IR being in SSA form.
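As a rough illustration of the old-to-new mapping, here is a self-contained sketch on a toy instruction type. `Inst`, `Block`, and `cloneBlock` are hypothetical names and not the pass's code. LLVM itself offers `ValueToValueMapTy` and `RemapInstruction` for this kind of rewiring, though whether the pass uses them is not stated here.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Toy instruction: an opcode plus pointers to the instructions it reads.
struct Inst {
    std::string op;
    std::vector<const Inst*> operands;
};

using Block = std::vector<std::unique_ptr<Inst>>;

// Clones every instruction, then rewires operands that point at an original
// instruction so that they point at its clone instead.
Block cloneBlock(const Block& original) {
    std::unordered_map<const Inst*, const Inst*> oldToNew;
    Block clones;

    // Pass 1: shallow copies. Operands still point at the originals.
    for (const auto& inst : original) {
        clones.push_back(std::make_unique<Inst>(*inst));
        oldToNew[inst.get()] = clones.back().get();
    }

    // Pass 2: replace each operand that has a clone. Operands defined
    // outside the block (e.g. an alloca) have no entry and stay as-is.
    for (auto& clone : clones) {
        for (auto& operand : clone->operands) {
            auto it = oldToNew.find(operand);
            if (it != oldToNew.end()) operand = it->second;
        }
    }

    assert(clones.size() == original.size());
    return clones;
}
```

Doing this in two passes means an operand can be remapped even if its definition appears later in the block, which matters once loops and their back edges are involved.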

Test plan:

Run examples 01/02/03 to see the transformations. For example,

```sh
make eg3
```

Before optimization:

```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  call void @atomic_start()         ; <--- START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  store i32 1, ptr %y, align 4
  %0 = load i32, ptr %y, align 4
  %add = add nsw i32 %0, 1
  store i32 %add, ptr %z, align 4
  %1 = load i32, ptr %z, align 4
  call void @log(i32 noundef %1)
  %2 = load i32, ptr %x, align 4
  call void @log(i32 noundef %2)
  call void @atomic_end()           ; <--- END
  ret void
}
```

After optimization:

```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  %z = alloca i32, align 4
  call void @atomic_start()         ; <--- START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  %0 = load i32, ptr %x, align 4
  call void @log(i32 noundef %0)
  call void @atomic_end()           ; <--- END
  store i32 1, ptr %y, align 4
  %1 = load i32, ptr %y, align 4
  %2 = add nsw i32 %1, 1
  store i32 %2, ptr %z, align 4
  %3 = load i32, ptr %z, align 4
  call void @log(i32 noundef %3)
  ret void
}
```

You may also link, build, and run an executable via:

```sh
make run_eg3 && ../../benchmarks/ctests/example03.out
```
...by moving non-IO instructions out of regions.
…regions

Mostly working, except optimizations done on a FreshConsistent region
need to converge back into a single (nested) region.
…ation

When a variable has both freshness and consistency
constraints, the overlap between the optimized
inferred atomic regions is now properly handled by
nesting them such that only the outermost bounds
count.

See benchmarks/ctests/example04.ll for an example.

Before:

```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  call void @atomic_start()         ; <-- OUTER START
  %call = call i32 @input()
  store i32 %call, ptr %x, align 4
  call void @atomic_start()         ; <-- INNER START
  %call1 = call i32 @input()
  call void @atomic_end()           ; <-- INNER END
  store i32 %call1, ptr %y, align 4
  %0 = load i32, ptr %x, align 4
  call void @log(i32 noundef %0)
  %1 = load i32, ptr %y, align 4
  call void @log(i32 noundef %1)
  call void @atomic_end()           ; <-- OUTER END
  ret void
}
```

After:

```llvm
define void @app() #0 {
entry:
  %x = alloca i32, align 4
  %y = alloca i32, align 4
  call void @atomic_start()         ; <-- OUTER START
  %call = call i32 @input()
  call void @atomic_start()         ; <-- INNER START
  %call1 = call i32 @input()
  call void @atomic_end()           ; <-- INNER END
  store i32 %call1, ptr %y, align 4
  %0 = load i32, ptr %y, align 4
  call void @log(i32 noundef %0)
  call void @atomic_end()           ; <-- OUTER END
  store i32 %call, ptr %x, align 4
  %1 = load i32, ptr %x, align 4
  call void @log(i32 noundef %1)
  ret void
}
```
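One way to read "only the outermost bounds count" is as a depth counter over the region markers. The sketch below computes the effective regions of a straight-line instruction list under that reading. `outermostRegions` is a hypothetical helper, and how the runtime actually treats nested `atomic_start`/`atomic_end` calls is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns (start index, end index) for each outermost atomic region.
// Nested start/end pairs only change the depth and produce no region.
std::vector<std::pair<std::size_t, std::size_t>>
outermostRegions(const std::vector<std::string>& insts) {
    std::vector<std::pair<std::size_t, std::size_t>> regions;
    std::size_t depth = 0;
    std::size_t start = 0;
    for (std::size_t i = 0; i < insts.size(); ++i) {
        if (insts[i] == "atomic_start") {
            if (depth == 0) start = i;  // entering an outermost region
            ++depth;
        } else if (insts[i] == "atomic_end") {
            assert(depth > 0 && "unmatched atomic_end");
            --depth;
            if (depth == 0) regions.emplace_back(start, i);
        }
    }
    assert(depth == 0 && "unmatched atomic_start");
    return regions;
}
```

Applied to the "After" listing, the inner pair is absorbed and the single effective region runs from the outer start to the outer end, so the delayed store to `%x` and its `@log` call fall outside it.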
…re optimizing loops

One objective as of now is to make optimizations
even more robust by supporting more corner cases.

For an example where the IO function is
`input(int i)` (`benchmarks/ctests/example06.c`),
optimizations shouldn't incorrectly delay the
instructions related to the argument `i`, and
should instead produce:

```llvm
define void @app() #0 {
entry:
  %i = alloca i32, align 4
  %x = alloca i32, align 4
  store i32 1, ptr %i, align 4            ; <--
  %0 = load i32, ptr %i, align 4          ; <--
  call void @atomic_start()
  %call = call i32 @input(i32 noundef %0) ; <-- DEPENDS ON THE ABOVE
  store i32 %call, ptr %x, align 4
  %1 = load i32, ptr %x, align 4
  call void @log(i32 noundef %1)
  call void @atomic_end()
  ret void
}
```
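A possible way to find the instructions that must not be delayed is a backward walk from the IO call that collects everything its arguments depend on. This is a sketch for straight-line code only, with a coarse memory model (a store to a pointer feeds every later load of it). `Inst` and `ioDependencies` are hypothetical names, not taken from the pass.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for an IR instruction (hypothetical, not an LLVM class).
struct Inst {
    std::string def;                // value defined, "" if none
    std::vector<std::string> uses;  // values read (for a store: the stored value)
    std::string storeTo;            // pointer written by a store, "" otherwise
};

// Returns the indices of instructions before `ioIndex` that the IO call
// transitively depends on. These must stay ahead of the call.
std::set<std::size_t> ioDependencies(const std::vector<Inst>& insts,
                                     std::size_t ioIndex) {
    assert(ioIndex < insts.size());
    std::set<std::string> needed(insts[ioIndex].uses.begin(),
                                 insts[ioIndex].uses.end());
    std::set<std::size_t> deps;
    // Walk backwards: definitions precede uses in straight-line code.
    for (std::size_t i = ioIndex; i-- > 0;) {
        const Inst& in = insts[i];
        bool provides = (!in.def.empty() && needed.count(in.def) > 0) ||
                        (!in.storeTo.empty() && needed.count(in.storeTo) > 0);
        if (!provides) continue;
        deps.insert(i);
        needed.insert(in.uses.begin(), in.uses.end());
    }
    return deps;
}
```

For the listing above, starting from the `@input` call, the walk picks up the load of `%i` and then the store to `%i`, which are exactly the two marked instructions.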

As for loop optimizations, unlike in WARio (which
targets checkpointing runtimes), loop unrolling
(i.e., creating multiple smaller copies of the
loop) doesn't help in atomic region inference,
since the resulting loops must still be in the same
region. Thus, the "costliness" of the region won't
be lessened.

There are optimizations to be done though. For
instance, loops entirely untainted by inputs
under constraint(s) can be delayed and moved out
of atomic regions just like many other
instructions can. The difficulty with this part
lies in rewiring the complex branching/connections
among the basic blocks that form these loops,
making an optimizing analysis harder to devise.

`benchmarks/ctests/example05` illustrates an
instance where the optimization above applies.
I will be working on this as a next step.
@robertzhidealx robertzhidealx force-pushed the optimizations branch 4 times, most recently from f5316b2 to f56c010 Compare March 10, 2024 23:23
Extract untainted instructions into their own loop that doesn't go into
the atomic region.

Test plan:

`make eg5` and observe the difference between
`benchmarks/ctests/example05.ll` (optimized) and
`benchmarks/ctests/example05.orig.ll` (original), or `make eg7`.
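At the source level, the idea resembles loop fission. The sketch below is not the contents of `example05`: the function names, the logged values, and the choice of which statement is tainted are all made up for illustration, and it assumes that reordering the untainted log calls after the region is acceptable, as in the straight-line examples.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Log = std::vector<std::string>;

// Original shape: one loop mixing untainted and tainted work, so the whole
// loop has to sit inside the region. `x` stands for the fresh input value.
Log originalLoop(int x, int n) {
    assert(n >= 0);
    Log out{"atomic_start"};
    for (int i = 0; i < n; ++i) {
        out.push_back("log " + std::to_string(i + 2));  // untainted
        out.push_back("log " + std::to_string(x));      // tainted by the input
    }
    out.push_back("atomic_end");
    return out;
}

// After splitting: the tainted statements keep their loop inside the region,
// and the untainted ones get their own loop after it.
Log splitLoops(int x, int n) {
    assert(n >= 0);
    Log out{"atomic_start"};
    for (int i = 0; i < n; ++i) out.push_back("log " + std::to_string(x));
    out.push_back("atomic_end");
    for (int i = 0; i < n; ++i) out.push_back("log " + std::to_string(i + 2));
    return out;
}

// Counts the entries that fall between the region markers.
std::size_t regionSize(const Log& log) {
    std::size_t count = 0;
    bool inside = false;
    for (const auto& entry : log) {
        if (entry == "atomic_start") inside = true;
        else if (entry == "atomic_end") inside = false;
        else if (inside) ++count;
    }
    return count;
}
```

With `n = 3`, the region shrinks from six logged operations to three, while the total amount of work stays the same.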
…and refactoring for concision

Now, the instructions "tainted" by an IO call will be included in the
fresh set as well, so that they remain ahead of the IO call, within
their atomic region. This is a more fundamental solution than
before, where exceptions were only made for these instructions during
optimization.

The optimization now has a more modular structure where common
instruction patching logic is extracted into a reusable procedure to be
run more than once (`Helpers::patchClonedBlock`). It comes into play
after cloning a basic block, to rewire its instructions to properly
reference each other.

Test plan:

`make`
In the case of loop conditions that depend on fresh/consistent input
values, no instruction in the loop body can be extracted out from the
atomic region, as shown in the example below:

```rust
fn app() -> () {
  let x = input();
  for _ in 0..10 {
    let y = 1;
    log(y + 2);
    log(x);
  }
  Fresh(x);
}
```

Test plan:

`make eg8`
Fix an issue with extracting IO functions from source code. Add several
tests, including a few in Rust.
Only small changes are required for the
optimization to work on Rust programs involving
loops.

See tests `example.rs`, `example11.rs`, and `example12.rs`.
@robertzhidealx robertzhidealx changed the title [WIP][InferAtomsPass] Instruction scheduling [InferAtomsPass] Instruction scheduling Mar 18, 2024