Exploring performance optimization and compiler logic using a simple task: counting the whitespace characters in a string.
`count_whitespaces_std`: Iterates over Unicode scalar values (`char`). Since UTF-8 is a variable-length encoding (1 to 4 bytes), the iterator must decode every sequence and reconstruct each code point, executing significant logic for every byte.
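A minimal sketch of what such a function might look like. I am assuming a space-only predicate here; the benchmark's actual code may use a broader definition of whitespace:

```rust
/// Counts spaces by iterating over decoded `char`s.
/// Every byte of the input passes through the UTF-8 decoding logic.
fn count_whitespaces_std(s: &str) -> usize {
    // Hypothetical predicate: ASCII space only.
    s.chars().filter(|&c| c == ' ').count()
}
```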
Takeaway: When performance is critical, avoid `chars()` and work with the raw bytes.

`count_whitespaces_bytes`: Iterates directly over the raw `u8` bytes of the slice, bypassing the costly UTF-8 decoding entirely.
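A corresponding sketch over raw bytes, under the same space-only assumption. Skipping the decoder is safe for an ASCII needle: in UTF-8, byte values below 0x80 never occur inside a multi-byte sequence:

```rust
/// Counts spaces over the raw bytes, skipping UTF-8 decoding entirely.
fn count_whitespaces_bytes(s: &str) -> usize {
    s.as_bytes().iter().filter(|&&b| b == b' ').count()
}
```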
Takeaway: Removing the decoding abstraction yields a substantial speedup.

`count_whitespaces_loop`: Replaces the iterators with an explicit `for` loop. The performance is identical to the bytes version.
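The explicit-loop equivalent might look like this:

```rust
/// Same logic as the bytes version, written as a plain `for` loop.
fn count_whitespaces_loop(s: &str) -> usize {
    let mut count = 0;
    for &b in s.as_bytes() {
        if b == b' ' {
            count += 1;
        }
    }
    count
}
```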
Takeaway: Rust has zero-cost abstractions. High-level iterators and low-level loops often compile down to the exact same machine code.

`count_whitespaces_chunk`: Hints at the data layout by iterating over fixed-size 32-byte arrays. This encourages loop unrolling, letting the CPU execute multiple instructions in parallel (instruction-level parallelism), and can even trigger auto-vectorization.

`count_whitespaces_avx2`: The final optimization, using explicit AVX2 SIMD intrinsics to process 256 bits (32 bytes) in parallel. Sketches of both variants follow below.

Takeaway: This achieves near-theoretical hardware throughput, without specialized algorithms or bit-twiddling tricks.
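A sketch of the chunked variant, assuming 32-byte blocks with a scalar tail (the real implementation may differ in details):

```rust
/// Iterates over fixed 32-byte blocks to hint the data layout to the
/// compiler, then handles the remaining tail bytes separately.
fn count_whitespaces_chunk(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut chunks = bytes.chunks_exact(32);
    let mut count = 0;
    for chunk in &mut chunks {
        // Fixed-size array view: this is the "hint" given to the optimizer.
        let block: &[u8; 32] = chunk.try_into().unwrap();
        count += block.iter().filter(|&&b| b == b' ').count();
    }
    count + chunks.remainder().iter().filter(|&&b| b == b' ').count()
}
```

And a sketch of the AVX2 version: compare each 32-byte block against the needle, collapse the 32 lane results into a bitmask, and popcount it. Again, this is my reconstruction under the space-only assumption, not necessarily the benchmarked code:

```rust
use std::arch::x86_64::*;

/// Counts spaces 32 bytes at a time with AVX2 intrinsics.
///
/// # Safety
/// The caller must ensure the CPU supports AVX2, e.g. via
/// `is_x86_feature_detected!("avx2")`.
#[target_feature(enable = "avx2")]
unsafe fn count_whitespaces_avx2(s: &str) -> usize {
    let bytes = s.as_bytes();
    // Broadcast the needle byte into all 32 lanes of a YMM register.
    let needle = _mm256_set1_epi8(b' ' as i8);
    let mut count = 0usize;

    let mut chunks = bytes.chunks_exact(32);
    for chunk in &mut chunks {
        // Single unaligned 256-bit load (vmovdqu).
        let block = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // 0xFF in every matching lane, 0x00 elsewhere (vpcmpeqb).
        let eq = _mm256_cmpeq_epi8(block, needle);
        // Collapse lanes to a 32-bit mask (vpmovmskb) and popcount it.
        count += _mm256_movemask_epi8(eq).count_ones() as usize;
    }
    // Scalar tail for the final < 32 bytes.
    count + chunks.remainder().iter().filter(|&&b| b == b' ').count()
}
```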
Benchmark with default compilation flags (short input):

| Method | Time |
|---|---|
| bench_avx2 | 3.53 ns/iter (+/- 0.15) |
| bench_bytes | 4.40 ns/iter (+/- 0.53) |
| bench_chunk | 4.99 ns/iter (+/- 0.35) |
| bench_loop | 4.18 ns/iter (+/- 0.16) |
| bench_std | 7.77 ns/iter (+/- 0.46) |
Benchmark with default compilation flags (long input):

| Method | Time |
|---|---|
| bench_avx2 | 60.28 ns/iter (+/- 6.61) |
| bench_bytes | 400.94 ns/iter (+/- 27.71) |
| bench_chunk | 71.25 ns/iter (+/- 3.13) |
| bench_loop | 404.32 ns/iter (+/- 20.94) |
| bench_std | 656.18 ns/iter (+/- 56.75) |
While trying to squeeze every last bit of performance out of Rust, I discovered the `target-cpu=native` flag. It tells the compiler to unlock every instruction set supported by my CPU (AMD R5), effectively enabling AVX2.
I expected everything to get faster, but that is not what happened. The `count_whitespaces_chunk` method collapsed, slowing down from 71 ns to 274 ns and dropping to second-to-last place.
So, what happened? To find out, I used `cargo asm`, reading the generated assembly myself with help from an AI to understand its structure.
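For reference, a typical invocation looks like this (the crate name `ws_count` is hypothetical, and exact flags vary between the `cargo-asm` and `cargo-show-asm` tools):

```sh
# Install the subcommand, then dump the assembly of one function.
cargo install cargo-show-asm
cargo asm ws_count::count_whitespaces_chunk
```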
The chunk method occupies a "dangerous middle ground" between high-level iterators and low-level intrinsics. By forcing the compiler to iterate over fixed `[u8; 32]` arrays without using explicit SIMD instructions, we confused the optimizer:

- The compiler attempted to use the 256-bit YMM registers enabled by AVX2, but failed to vectorize the memory access.
- Instead of issuing a single instruction to load 32 bytes from memory (`vmovdqu`), it generated dozens of separate instructions (`vpinsrb`) to build the vector byte by byte, destroying memory throughput.
- It failed to recognize the bitwise nature of the problem: it widened every 8-bit byte into a 64-bit integer (`vpmovzxbq`) to perform vector addition, inflating register pressure and data volume by 8x compared to the optimal `popcnt`-based approach.
Takeaway: Do not try to "half-optimize" by manually chunking memory. Either:

- write simple, idiomatic iterators (`count_whitespaces_bytes`) and trust the compiler's auto-vectorizer, or
- write explicit SIMD intrinsics (`count_whitespaces_avx2`) to take full control.
And most importantly: benchmark on the exact target hardware, with production compilation flags. Adding flags like `target-cpu=native` or `target-feature=+avx2` later can fundamentally alter the program's behavior!
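The standard way to pass this flag to Cargo is through the `RUSTFLAGS` environment variable, for example:

```sh
# Build and benchmark with every instruction set of the host CPU enabled.
RUSTFLAGS="-C target-cpu=native" cargo bench
```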
Benchmark with `target-cpu=native` (short input):

| Method | Time |
|---|---|
| bench_avx2 | 4.03 ns/iter (+/- 2.75) |
| bench_bytes | 3.49 ns/iter (+/- 0.15) |
| bench_chunk | 6.07 ns/iter (+/- 0.48) |
| bench_loop | 2.24 ns/iter (+/- 0.45) |
| bench_std | 6.98 ns/iter (+/- 0.96) |
Benchmark with `target-cpu=native` (long input):

| Method | Time |
|---|---|
| bench_avx2 | 18.77 ns/iter (+/- 1.75) |
| bench_bytes | 121.87 ns/iter (+/- 12.34) |
| bench_chunk | 274.38 ns/iter (+/- 35.35) |
| bench_loop | 120.38 ns/iter (+/- 12.31) |
| bench_std | 652.94 ns/iter (+/- 64.69) |
Optimizing a task as simple as counting spaces reveals how easily performance can be thrown out the window.
- The "abstraction trap": Moving from `chars()` to raw `bytes()` cuts execution time by roughly 80% simply by skipping unnecessary UTF-8 decoding.
- The "half-optimization trap": Manual chunking worked well with default settings but became 4x slower under AVX2. The compiler tried to use the advanced registers but did so inefficiently, proving that manual "hints" can backfire.
- Vectorization dominates: For large strings, SIMD is roughly 35x faster than the standard iterator and 6.5x faster than a simple byte loop.
Conclusion: In most cases, stick to raw byte iterators: they are fast, clean, and let the compiler auto-vectorize safely. If you need every last drop of speed, use explicit SIMD intrinsics to stay in total control.

And if your code will only ever run on one known CPU architecture, remember to enable its CPU features at compile time: as this benchmark shows, that alone can yield speedups of over 3x.