Count whitespace

Exploring performance optimization and compiler logic using a simple task: counting whitespaces in a string.

  • count_whitespaces_std: Iterates over Unicode (char) values. Since UTF-8 is a variable-length encoding (1 to 4 bytes), this iterator has to decode every byte sequence and reconstruct the code point, executing significant logic for every byte.
    Takeaway: When performance is critical, avoid chars() and work with the raw bytes.
  • count_whitespaces_bytes: Iterates directly over the raw u8 bytes of the slice. This bypasses the costly UTF-8 decoding entirely.
    Takeaway: Removing the decoding abstraction alone yields a substantial speedup (see the benchmarks below).
  • count_whitespaces_loop: Replaces the iterators with an explicit for loop. The performance is identical to the bytes version.
    Takeaway: Rust has zero-cost abstractions. High-level iterators and low-level loops often compile down to the exact same machine code.
  • count_whitespaces_chunk: Hints at the shape of the data by iterating over fixed-size 32-byte arrays. This forces loop unrolling, allowing the CPU to execute multiple instructions in parallel (instruction-level parallelism) or triggering auto-vectorization. (Rough sketches of the scalar variants follow this list.)
  • count_whitespaces_avx2: The final optimization, using explicit AVX2 SIMD intrinsics to process 256 bits (32 bytes) in parallel.
    Takeaway: This approaches the hardware's theoretical limits without any specialized algorithm or bit magic.
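
The repository's exact implementations are not reproduced here, but the sketches below show roughly what the four scalar variants might look like. For illustration they assume only the ASCII space byte (0x20) counts as whitespace; the real functions may match a broader whitespace set. The AVX2 variant is sketched further down, in the rabbit-hole section.

```rust
/// chars(): every byte goes through UTF-8 decoding before the comparison.
fn count_whitespaces_std(s: &str) -> usize {
    s.chars().filter(|&c| c == ' ').count()
}

/// bytes(): raw u8 iteration, no decoding, just a byte comparison.
fn count_whitespaces_bytes(s: &str) -> usize {
    s.bytes().filter(|&b| b == b' ').count()
}

/// Explicit loop over the byte slice; compiles to the same code as bytes().
fn count_whitespaces_loop(s: &str) -> usize {
    let mut count = 0;
    for &b in s.as_bytes() {
        if b == b' ' {
            count += 1;
        }
    }
    count
}

/// Fixed-size 32-byte blocks, hinting the shape of the data to the optimizer.
fn count_whitespaces_chunk(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut chunks = bytes.chunks_exact(32);
    let mut count = 0;
    for chunk in &mut chunks {
        // chunks_exact guarantees 32 bytes, so the conversion cannot fail.
        let block: &[u8; 32] = chunk.try_into().unwrap();
        count += block.iter().filter(|&&b| b == b' ').count();
    }
    // Count the tail that does not fill a full 32-byte block.
    count + chunks.remainder().iter().filter(|&&b| b == b' ').count()
}
```

All four have the same observable behavior; only the shape handed to the optimizer differs, which is exactly what the benchmarks below measure.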

Benches (default flags)

Small (12 bytes)

| Method | Speed |
|---|---|
| bench_avx2 | 3.53 ns/iter (+/- 0.15) |
| bench_bytes | 4.40 ns/iter (+/- 0.53) |
| bench_chunk | 4.99 ns/iter (+/- 0.35) |
| bench_loop | 4.18 ns/iter (+/- 0.16) |
| bench_std | 7.77 ns/iter (+/- 0.46) |

Large (1200 bytes)

| Method | Speed |
|---|---|
| bench_avx2 | 60.28 ns/iter (+/- 6.61) |
| bench_bytes | 400.94 ns/iter (+/- 27.71) |
| bench_chunk | 71.25 ns/iter (+/- 3.13) |
| bench_loop | 404.32 ns/iter (+/- 20.94) |
| bench_std | 656.18 ns/iter (+/- 56.75) |
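
The ns/iter numbers come from cargo bench with the nightly test::Bencher harness (judging by the output format). As a hedged sketch of how one of these benchmarks might be written, where the input construction and the helper name large_input are assumptions rather than the repository's actual setup:

```rust
#![feature(test)]
extern crate test;

#[cfg(test)]
mod benches {
    use super::*; // assumes the count_whitespaces_* functions live in the parent module
    use test::{black_box, Bencher};

    /// 1200-byte input built by repeating a 12-byte pattern
    /// (an assumption; the repository's real inputs may differ).
    fn large_input() -> String {
        "hello world ".repeat(100)
    }

    #[bench]
    fn bench_bytes(b: &mut Bencher) {
        let s = large_input();
        // black_box keeps the optimizer from folding the whole computation away.
        b.iter(|| count_whitespaces_bytes(black_box(s.as_str())));
    }
}
```

The other bench_* functions would differ only in which counting function they call.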

Down the AVX2 rabbit hole

While trying to squeeze every bit of performance out of Rust, I discovered the target-cpu=native flag. This tells the compiler to unlock every instruction set supported by my CPU (AMD R5), effectively enabling AVX2.

I expected everything to get faster, but this is not what happened. The count_whitespaces_chunk method collapsed, slowing down from 71ns to 274ns and dropping to second-to-last place.

What happened

So, what happened? To understand it, I used cargo asm and went through the generated assembly structure myself, with some help from an AI.

The chunk method occupies a "dangerous middle ground" between high-level iterators and low-level intrinsics. By forcing the compiler to iterate over fixed [u8; 32] arrays without using explicit SIMD instructions, we confused the optimizer:

  • The compiler attempted to utilize the 256-bit YMM registers enabled by AVX2, but failed to vectorize the memory access.

  • Instead of issuing a single instruction to load 32 bytes from memory (vmovdqu), the compiler generated dozens of separate instructions (vpinsrb) to manually build the vector byte-by-byte, destroying memory throughput.

  • The compiler failed to recognize the bitwise nature of the problem. It expanded every 8-bit byte into a 64-bit integer (vpmovzxbq) to perform vector addition, increasing register pressure and data volume by 8x compared to the optimal popcnt instruction.

Takeaway: Do not try to "half-optimize" by manually chunking memory:

  • Write simple, idiomatic iterators (count_whitespaces_bytes) and trust the compiler's auto-vectorizer.

  • Write explicit SIMD intrinsics (count_whitespaces_avx2) to take full control (see the sketch below).

And most importantly: Benchmark exactly on the target hardware with production compilation flags. Adding flags like target-cpu=native or target-feature=+avx2 later can fundamentally alter the program's behavior!
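
To make the contrast with the compiler's byte-by-byte vpinsrb code concrete, here is a minimal sketch of the explicit AVX2 pattern. It is not the repository's exact count_whitespaces_avx2 and again assumes only the ASCII space byte is counted: a single vmovdqu load pulls in 32 bytes, vpcmpeqb compares them all against 0x20 at once, vpmovmskb collapses the comparison into a 32-bit mask, and popcnt counts the set bits.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_whitespaces_avx2(s: &str) -> usize {
    use std::arch::x86_64::*;

    let bytes = s.as_bytes();
    let space = _mm256_set1_epi8(b' ' as i8);
    let mut count = 0usize;

    let mut chunks = bytes.chunks_exact(32);
    for chunk in &mut chunks {
        // One unaligned 256-bit load (vmovdqu) instead of 32 single-byte inserts.
        let data = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // Byte-wise equality: 0xFF where the byte is a space, 0x00 elsewhere.
        let eq = _mm256_cmpeq_epi8(data, space);
        // Collapse the 32 comparison bytes into a 32-bit mask and popcount it.
        count += _mm256_movemask_epi8(eq).count_ones() as usize;
    }
    // Scalar fallback for the tail shorter than 32 bytes.
    count + chunks.remainder().iter().filter(|&&b| b == b' ').count()
}
```

A caller would normally guard this behind is_x86_feature_detected!("avx2") and fall back to the byte iterator otherwise; compiling with target-feature=+avx2 or target-cpu=native simply removes the need for that runtime check on matching hardware.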

Benches (target-cpu=native)

Small (12 bytes)

| Method | Speed |
|---|---|
| bench_avx2 | 4.03 ns/iter (+/- 2.75) |
| bench_bytes | 3.49 ns/iter (+/- 0.15) |
| bench_chunk | 6.07 ns/iter (+/- 0.48) |
| bench_loop | 2.24 ns/iter (+/- 0.45) |
| bench_std | 6.98 ns/iter (+/- 0.96) |

Large (1200 bytes)

| Method | Speed |
|---|---|
| bench_avx2 | 18.77 ns/iter (+/- 1.75) |
| bench_bytes | 121.87 ns/iter (+/- 12.34) |
| bench_chunk | 274.38 ns/iter (+/- 35.35) |
| bench_loop | 120.38 ns/iter (+/- 12.31) |
| bench_std | 652.94 ns/iter (+/- 64.69) |

Summary

Optimizing a task as simple as counting spaces reveals how much performance can be thrown out the window.

  • "Abstraction Trap": Moving from chars() to raw bytes() cuts execution time by 80% just by skipping unnecessary UTF-8 decoding.

  • "Half-Optimization Trap": Manual chunking worked well with basic settings but became 4x slower under AVX2. The compiler tried to use advanced registers but did so inefficiently, proving that manual "hints" can often backfire.

  • Vectorization dominates: For large strings, SIMD is roughly 35x faster than the standard iterator and 6.5x faster than a simple byte loop.

Conclusion: For most cases, stick to raw byte iterators: they are fast, clean, and let the compiler auto-vectorize safely. If you need every last drop of speed, use explicit SIMD intrinsics to stay in total control.

Remember to enable CPU features when your code only needs to run on a single CPU architecture; as these benchmarks show, that alone can bring a further 3-4x improvement.
