As noted in the source:
// Depending on the size of the matrices involved, and the depth of
// pipelining on the arch we're running on (since conditional branching
// destroys pipelining gains), it may be faster to just XOR the entire
// matrices (with final masks) together, ORing against a "running result"
// as we go. The NOT of the running result is then the final result. So
// this way no conditionals or comparisons are required, and it amounts to
// just streaming through memory doing fast bitwise operations.
So, implement this (it shouldn't be hard), then benchmark it against the current conditional version to see whether it's worth keeping, and for which matrix sizes and architectures.
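
A minimal sketch of the idea in C, for reference. Everything about the data layout here is an assumption rather than the project's actual representation: it treats the operation as an equality test on two row-major bit matrices stored as 64-bit words, and reads "final masks" as a single mask selecting the valid bits in the last word of each row. The name `bitmat_equal_branchless` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: row-major bit matrix with `words_per_row` 64-bit words
 * per row. Only the bits set in `final_mask` are meaningful in the last word
 * of each row; all other words are compared in full. */
static bool bitmat_equal_branchless(const uint64_t *a, const uint64_t *b,
                                    size_t rows, size_t words_per_row,
                                    uint64_t final_mask)
{
    uint64_t diff = 0; /* running result: any set bit marks a mismatch */

    if (words_per_row == 0)
        return true; /* empty matrices compare equal */

    for (size_t r = 0; r < rows; r++) {
        const uint64_t *ra = a + r * words_per_row;
        const uint64_t *rb = b + r * words_per_row;
        size_t last = words_per_row - 1;

        /* XOR yields the differing bits; OR accumulates them. */
        for (size_t w = 0; w < last; w++)
            diff |= ra[w] ^ rb[w];

        /* Ignore padding bits in the final word of the row. */
        diff |= (ra[last] ^ rb[last]) & final_mask;
    }

    /* The logical NOT of the running result is the answer. */
    return !diff;
}
```

The only branches left are the loop conditions, which do not depend on the matrix contents and so should predict well. The trade-off to measure is that this version always reads both matrices in full, whereas a conditional version can exit at the first mismatch, so the benchmark should probably cover both equal and early-differing inputs.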