In #373 we've introduced a few loops for batch initialization of objects, and what do you know, they compile nicely into AVX instructions. This case has rather minimal impact on performance (the code is executed only a few times), but a similar principle could be applied elsewhere too.
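For illustration only (this is not the actual code from #373), the kind of loop that HotSpot's C2 compiler can auto-vectorize is a simple counted loop over a primitive array with independent, branch-free work per element:

```scala
object BatchInitSketch {
  // Hypothetical helper, just to show the loop shape that tends to
  // compile down to AVX instructions on x86-64.
  def initIds(ids: Array[Int], start: Int): Unit = {
    var i = 0
    while (i < ids.length) {
      ids(i) = start + i // independent per-element work, no branches
      i += 1
    }
  }
}
```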
A good case for that would be this: Jelly-RDF/jelly-protobuf#41
But it could be generalized to have the encoder process minibatches of triples (say, 8–16 triples at a time). This would reduce the number of calls on the stack and allow us to write mini-loops of 8–16 operations everywhere.
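A minimal sketch of the idea, assuming a placeholder `Triple` representation and a hypothetical `encodeBatch` entry point (neither is Jelly's actual API): the outer loop walks the input in fixed-size minibatches, and the inner mini-loop is a simple counted loop over arrays that the JIT can unroll and potentially vectorize.

```scala
object MinibatchEncoderSketch {
  final val BatchSize = 16

  // Placeholder: three term IDs per triple, standing in for real terms.
  final case class Triple(s: Int, p: Int, o: Int)

  // `out` is assumed to have room for triples.length * 3 ints.
  def encodeBatch(triples: Array[Triple], out: Array[Int]): Unit = {
    var base = 0
    while (base < triples.length) {
      val n = math.min(BatchSize, triples.length - base)
      // Mini-loop over one batch of up to 16 triples: branch-light,
      // array-indexed work done in one call instead of 16 separate calls.
      var i = 0
      while (i < n) {
        val t = triples(base + i)
        val j = (base + i) * 3
        out(j) = t.s
        out(j + 1) = t.p
        out(j + 2) = t.o
        i += 1
      }
      base += n
    }
  }
}
```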
Note that 16 × 4 B per pointer = 64 B, which is typically the size of one L1 cache line.
The widest SIMD ops on x86-64 (AVX-512) are 512 bits wide, which is exactly 64 B. That said, these tend to be not-so-efficient, and I think literally only one CPU line actually executes them in a single clock.