Exploring performance optimization and compiler logic using a simple task: counting the whitespace characters in a string.
`count_whitespaces_std`: Iterates over Unicode scalar values (`char`). Since UTF-8 is a variable-length encoding (1 to 4 bytes), the iterator must decode every sequence and reconstruct each code point, executing significant logic for every byte.
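A minimal sketch of what such a function might look like. I am assuming a space-only predicate here; the benchmark's actual code may use a broader definition of whitespace:

```rust
/// Counts spaces by iterating over decoded `char`s.
/// Every byte of the input passes through the UTF-8 decoding logic.
fn count_whitespaces_std(s: &str) -> usize {
    // Hypothetical predicate: ASCII space only.
    s.chars().filter(|&c| c == ' ').count()
}
```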
Takeaway: When performance is critical, avoid `chars()` and work with the raw bytes.

`count_whitespaces_bytes`: Iterates directly over the raw `u8` bytes of the slice, bypassing the costly UTF-8 decoding entirely.
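A corresponding sketch over raw bytes, under the same space-only assumption. Skipping the decoder is safe for an ASCII needle: in UTF-8, byte values below 0x80 never occur inside a multi-byte sequence:

```rust
/// Counts spaces over the raw bytes, skipping UTF-8 decoding entirely.
fn count_whitespaces_bytes(s: &str) -> usize {
    s.as_bytes().iter().filter(|&&b| b == b' ').count()
}
```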
Takeaway: Removing the decoding abstraction yields a substantial speedup.

`count_whitespaces_loop`: Replaces the iterators with an explicit `for` loop. The performance is identical to the bytes version.
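The explicit-loop equivalent might look like this:

```rust
/// Same logic as the bytes version, written as a plain `for` loop.
fn count_whitespaces_loop(s: &str) -> usize {
    let mut count = 0;
    for &b in s.as_bytes() {
        if b == b' ' {
            count += 1;
        }
    }
    count
}
```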
Takeaway: Rust has zero-cost abstractions. High-level iterators and low-level loops often compile down to the exact same machine code.

`count_whitespaces_chunk`: Hints at the data layout by iterating over fixed-size 32-byte arrays. This encourages loop unrolling, letting the CPU execute multiple instructions in parallel (instruction-level parallelism), and can even trigger auto-vectorization.

`count_whitespaces_avx2`: The final optimization, using explicit AVX2 SIMD intrinsics to process 256 bits (32 bytes) in parallel. Sketches of both variants follow below.

Takeaway: This achieves near-theoretical hardware throughput, without specialized algorithms or bit-twiddling tricks.
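A sketch of the chunked variant, assuming 32-byte blocks with a scalar tail (the real implementation may differ in details):

```rust
/// Iterates over fixed 32-byte blocks to hint the data layout to the
/// compiler, then handles the remaining tail bytes separately.
fn count_whitespaces_chunk(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut chunks = bytes.chunks_exact(32);
    let mut count = 0;
    for chunk in &mut chunks {
        // Fixed-size array view: this is the "hint" given to the optimizer.
        let block: &[u8; 32] = chunk.try_into().unwrap();
        count += block.iter().filter(|&&b| b == b' ').count();
    }
    count + chunks.remainder().iter().filter(|&&b| b == b' ').count()
}
```

And a sketch of the AVX2 version: compare each 32-byte block against the needle, collapse the 32 lane results into a bitmask, and popcount it. Again, this is my reconstruction under the space-only assumption, not necessarily the benchmarked code:

```rust
use std::arch::x86_64::*;

/// Counts spaces 32 bytes at a time with AVX2 intrinsics.
///
/// # Safety
/// The caller must ensure the CPU supports AVX2, e.g. via
/// `is_x86_feature_detected!("avx2")`.
#[target_feature(enable = "avx2")]
unsafe fn count_whitespaces_avx2(s: &str) -> usize {
    let bytes = s.as_bytes();
    // Broadcast the needle byte into all 32 lanes of a YMM register.
    let needle = _mm256_set1_epi8(b' ' as i8);
    let mut count = 0usize;

    let mut chunks = bytes.chunks_exact(32);
    for chunk in &mut chunks {
        // Single unaligned 256-bit load (vmovdqu).
        let block = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // 0xFF in every matching lane, 0x00 elsewhere (vpcmpeqb).
        let eq = _mm256_cmpeq_epi8(block, needle);
        // Collapse lanes to a 32-bit mask (vpmovmskb) and popcount it.
        count += _mm256_movemask_epi8(eq).count_ones() as usize;
    }
    // Scalar tail for the final < 32 bytes.
    count + chunks.remainder().iter().filter(|&&b| b == b' ').count()
}
```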
Benchmark with default compilation flags (short input):

| Method | Time |
|---|---|
| bench_avx2 | 3.53 ns/iter (+/- 0.15) |
| bench_bytes | 4.40 ns/iter (+/- 0.53) |
| bench_chunk | 4.99 ns/iter (+/- 0.35) |
| bench_loop | 4.18 ns/iter (+/- 0.16) |
| bench_std | 7.77 ns/iter (+/- 0.46) |
Benchmark with default compilation flags (long input):

| Method | Time |
|---|---|
| bench_avx2 | 60.28 ns/iter (+/- 6.61) |
| bench_bytes | 400.94 ns/iter (+/- 27.71) |
| bench_chunk | 71.25 ns/iter (+/- 3.13) |
| bench_loop | 404.32 ns/iter (+/- 20.94) |
| bench_std | 656.18 ns/iter (+/- 56.75) |
While trying to squeeze every last bit of performance out of Rust, I discovered the `target-cpu=native` flag. It tells the compiler to unlock every instruction set supported by my CPU (AMD R5), effectively enabling AVX2.
I expected everything to get faster, but that is not what happened. The `count_whitespaces_chunk` method collapsed, slowing down from 71 ns to 274 ns and dropping to second-to-last place.
So, what happened? To find out, I used `cargo asm`, reading the generated assembly myself with help from an AI to understand its structure.
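For reference, a typical invocation looks like this (the crate name `ws_count` is hypothetical, and exact flags vary between the `cargo-asm` and `cargo-show-asm` tools):

```sh
# Install the subcommand, then dump the assembly of one function.
cargo install cargo-show-asm
cargo asm ws_count::count_whitespaces_chunk
```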
The chunk method occupies a "dangerous middle ground" between high-level iterators and low-level intrinsics. By forcing the compiler to iterate over fixed `[u8; 32]` arrays without using explicit SIMD instructions, we confused the optimizer:

- The compiler attempted to use the 256-bit YMM registers enabled by AVX2, but failed to vectorize the memory access.
- Instead of issuing a single instruction to load 32 bytes from memory (`vmovdqu`), it generated dozens of separate instructions (`vpinsrb`) to build the vector byte by byte, destroying memory throughput.
- It failed to recognize the bitwise nature of the problem: it widened every 8-bit byte into a 64-bit integer (`vpmovzxbq`) to perform vector addition, inflating register pressure and data volume by 8x compared to the optimal `popcnt`-based approach.
Takeaway: Do not try to "half-optimize" by manually chunking memory. Either:

- write simple, idiomatic iterators (`count_whitespaces_bytes`) and trust the compiler's auto-vectorizer, or
- write explicit SIMD intrinsics (`count_whitespaces_avx2`) to take full control.
And most importantly: benchmark on the exact target hardware, with production compilation flags. Adding flags like `target-cpu=native` or `target-feature=+avx2` later can fundamentally alter the program's behavior!
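The standard way to pass this flag to Cargo is through the `RUSTFLAGS` environment variable, for example:

```sh
# Build and benchmark with every instruction set of the host CPU enabled.
RUSTFLAGS="-C target-cpu=native" cargo bench
```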
Benchmark with `target-cpu=native` (short input):

| Method | Time |
|---|---|
| bench_avx2 | 4.03 ns/iter (+/- 2.75) |
| bench_bytes | 3.49 ns/iter (+/- 0.15) |
| bench_chunk | 6.07 ns/iter (+/- 0.48) |
| bench_loop | 2.24 ns/iter (+/- 0.45) |
| bench_std | 6.98 ns/iter (+/- 0.96) |
Benchmark with `target-cpu=native` (long input):

| Method | Time |
|---|---|
| bench_avx2 | 18.77 ns/iter (+/- 1.75) |
| bench_bytes | 121.87 ns/iter (+/- 12.34) |
| bench_chunk | 274.38 ns/iter (+/- 35.35) |
| bench_loop | 120.38 ns/iter (+/- 12.31) |
| bench_std | 652.94 ns/iter (+/- 64.69) |
Optimizing a task as simple as counting spaces reveals how easily performance can be thrown out the window.
- The "abstraction trap": Moving from `chars()` to raw `bytes()` cuts execution time by roughly 80% simply by skipping unnecessary UTF-8 decoding.
- The "half-optimization trap": Manual chunking worked well with default settings but became 4x slower under AVX2. The compiler tried to use the advanced registers but did so inefficiently, proving that manual "hints" can backfire.
- Vectorization dominates: For large strings, SIMD is roughly 35x faster than the standard iterator and 6.5x faster than a simple byte loop.
Conclusion: In most cases, stick to raw byte iterators: they are fast, clean, and let the compiler auto-vectorize safely. If you need every last drop of speed, use explicit SIMD intrinsics to stay in total control.

And if your code will only ever run on one known CPU architecture, remember to enable its CPU features at compile time: as this benchmark shows, that alone can yield speedups of over 3x.