In #373 we've introduced a few loops for batch initialization of objects, and what do you know, they compile nicely into AVX instructions. This case has rather minimal impact on performance (the code is executed only a few times), but a similar principle could be applied elsewhere too.
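For illustration only (this is not the actual code from #373), the kind of loop that HotSpot's C2 compiler can auto-vectorize is a simple counted loop over a primitive array with independent, branch-free work per element:

```scala
object BatchInitSketch {
  // Hypothetical helper, just to show the loop shape that tends to
  // compile down to AVX instructions on x86-64.
  def initIds(ids: Array[Int], start: Int): Unit = {
    var i = 0
    while (i < ids.length) {
      ids(i) = start + i // independent per-element work, no branches
      i += 1
    }
  }
}
```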
A good case for that would be this: Jelly-RDF/jelly-protobuf#41
But it could be generalized to have the encoder process minibatches of triples (say, 8–16 triples at a time). This would reduce the number of calls on the stack and allow us to write mini-loops of 8–16 operations everywhere.
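A minimal sketch of the idea, assuming a placeholder `Triple` representation and a hypothetical `encodeBatch` entry point (neither is Jelly's actual API): the outer loop walks the input in fixed-size minibatches, and the inner mini-loop is a simple counted loop over arrays that the JIT can unroll and potentially vectorize.

```scala
object MinibatchEncoderSketch {
  final val BatchSize = 16

  // Placeholder: three term IDs per triple, standing in for real terms.
  final case class Triple(s: Int, p: Int, o: Int)

  // `out` is assumed to have room for triples.length * 3 ints.
  def encodeBatch(triples: Array[Triple], out: Array[Int]): Unit = {
    var base = 0
    while (base < triples.length) {
      val n = math.min(BatchSize, triples.length - base)
      // Mini-loop over one batch of up to 16 triples: branch-light,
      // array-indexed work done in one call instead of 16 separate calls.
      var i = 0
      while (i < n) {
        val t = triples(base + i)
        val j = (base + i) * 3
        out(j) = t.s
        out(j + 1) = t.p
        out(j + 2) = t.o
        i += 1
      }
      base += n
    }
  }
}
```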
Note that 16 × 4 B per pointer = 64 B, which is typically the size of one L1 cache line.
The widest SIMD ops on x86-64 (AVX-512) are 512 bits wide, which is exactly 64 B. That said, these tend to be not-so-efficient, and I think literally only one CPU line actually executes them in a single clock.