Hacky nonmult8 for VNNI by XapaJIaMnu · Pull Request #90 · kpu/intgemm

XapaJIaMnu · 2021-07-17T00:45:29Z

It's not a purr fect implementation, but it is a start...
This patch implements the following:

PrepareB for matrices with an arbitrary number of columns, for all architectures. The trailing columns that do not make up a multiple of eight are prepared and compressed as a small, independent width-by-8 matrix, with zeroed blocks of register_width stripped. Unfortunately, this is not done in place in the current implementation and involves memory copying. This can be improved in the future. I am using some inlined functions that don't have CPU_ATTR set, as I was lazy. I hope that inlining means they will be generated with the proper ISA limitations. Regardless, so far only the VNNI multiply is implemented anyway.
Avx512VNNI multiplication of matrices with an arbitrary number of columns, plus tests. The multiplication proceeds as normal until it reaches the trailing columns that do not make up a multiple of eight, and then handles those in a separate loop.

Example: if A is a 2x64 matrix and B is 64x9, we first multiply 2x64 by 64x8, and then 2x64 by 64x1 (to produce the last column).
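
A minimal sketch of the column split in the example, written as scalar C++ rather than VNNI intrinsics. `MultiplyColumns` and `MultiplySplit` are hypothetical names, not intgemm functions, and the plain row-major layout for A, B and C is an assumption made for illustration; the actual kernels operate on the PrepareB layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: fill columns [col_begin, col_end) of C = A * B.
// Assumed layout (illustration only): A is rows x width, B is width x cols,
// C is rows x cols, all row-major.
static void MultiplyColumns(const std::vector<int8_t> &A, const std::vector<int8_t> &B,
                            std::vector<int32_t> &C, std::size_t rows, std::size_t width,
                            std::size_t cols, std::size_t col_begin, std::size_t col_end) {
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = col_begin; c < col_end; ++c) {
      int32_t sum = 0;
      for (std::size_t k = 0; k < width; ++k) {
        sum += static_cast<int32_t>(A[r * width + k]) * static_cast<int32_t>(B[k * cols + c]);
      }
      C[r * cols + c] = sum;
    }
  }
}

// Hypothetical driver: full blocks of eight columns first, then the leftover columns.
std::vector<int32_t> MultiplySplit(const std::vector<int8_t> &A, const std::vector<int8_t> &B,
                                   std::size_t rows, std::size_t width, std::size_t cols) {
  assert(A.size() == rows * width && B.size() == width * cols);
  std::vector<int32_t> C(rows * cols, 0);
  const std::size_t full_cols = cols - cols % 8;  // largest multiple of 8 not exceeding cols
  // Main loop: the part that a multiple-of-eight kernel can already handle.
  for (std::size_t c0 = 0; c0 < full_cols; c0 += 8) {
    MultiplyColumns(A, B, C, rows, width, cols, c0, c0 + 8);
  }
  // Separate loop for the remaining (cols % 8) columns.
  if (full_cols < cols) {
    MultiplyColumns(A, B, C, rows, width, cols, full_cols, cols);
  }
  return C;
}
```

With `rows = 2`, `width = 64`, `cols = 9`, the main loop runs once over columns 0 to 7 and the second call fills only column 8.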

Unfortunately, now that matrices can have a number of columns that is not a multiple of eight, and the columns are no longer written consecutively, writes to the output become unaligned and we segfault. For this reason I have replaced the store routine with storeu.
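
To illustrate why the aligned store breaks, here is a small sketch of the address arithmetic, assuming a row-major int32 output whose base pointer is itself aligned. `RowByteOffset` and `RowsStayAligned` are hypothetical helpers, and the alignment value is a parameter because the width of the actual store is not stated here. Aligned store intrinsics such as `_mm256_store_si256` require an aligned address, while the `storeu` variants accept any address.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of the first element of `row` in a row-major int32 matrix
// with `cols` columns (assumed layout, for illustration only).
inline std::size_t RowByteOffset(std::size_t row, std::size_t cols) {
  return row * cols * sizeof(int32_t);
}

// True if every row starts on an `alignment`-byte boundary,
// assuming the base pointer of the matrix is aligned.
inline bool RowsStayAligned(std::size_t rows, std::size_t cols, std::size_t alignment) {
  for (std::size_t r = 0; r < rows; ++r) {
    if (RowByteOffset(r, cols) % alignment != 0) {
      return false;
    }
  }
  return true;
}
```

With 8 columns, row 1 starts at byte 32, so a 32-byte aligned store at the start of each row is still valid. With 9 columns, row 1 starts at byte 36, which is not a multiple of 32, so only an unaligned store is safe there.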

Preliminary performance benchmarks, using the built-in benchmark

intgemm/benchmarks/biasmultiply.cc

Line 267 in 6228d01

newTimeAVX512VNNI += testNew<AVX512VNNI::Kernels8>(8, 256, 256);

to check for any performance regressions. (This does not include irregularly shaped matrices whose column count is not a multiple of 8.)
This branch (n=1)

taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31014 seconds.
1000 iterations of SSSE3 took: 2.39446 seconds.
1000 iterations of Shifted SSSE3 took: 1.98965 seconds.
1000 iterations of AVX2 without bias took: 1.33628 seconds.
1000 iterations of AVX2 took: 1.33306 seconds.
1000 iterations of Shifted AVX2 took: 1.20668 seconds.
1000 iterations of AVX512 without bias took: 1.01728 seconds.
1000 iterations of AVX512 took: 1.04101 seconds.
1000 iterations of Shifted AVX512 took: 0.779364 seconds.
1000 iterations of AVX512VNNI without bias took: 0.754878 seconds.
1000 iterations of AVX512VNNI took: 0.771353 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.539761 seconds.

Master (n=1)

taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31003 seconds.
1000 iterations of SSSE3 took: 2.37843 seconds.
1000 iterations of Shifted SSSE3 took: 1.97674 seconds.
1000 iterations of AVX2 without bias took: 1.28795 seconds.
1000 iterations of AVX2 took: 1.33322 seconds.
1000 iterations of Shifted AVX2 took: 1.20815 seconds.
1000 iterations of AVX512 without bias took: 1.01804 seconds.
1000 iterations of AVX512 took: 1.06707 seconds.
1000 iterations of Shifted AVX512 took: 0.779698 seconds.
1000 iterations of AVX512VNNI without bias took: 0.776488 seconds.
1000 iterations of AVX512VNNI took: 0.772831 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.653334 seconds.

Speed seems to be even better, but I don't trust that. Maybe some instruction reordering makes the benchmark perform better. I will have to test it in a real-world situation later on.


XapaJIaMnu added 5 commits July 13, 2021 18:22

non mult8 prepare_b_8

743c04a

Add testing program

67cd72b

Manual test seems to work

058df55

One row working

19af7dc

Multiple rows working

4b54443

XapaJIaMnu requested a review from kpu July 17, 2021 00:45

XapaJIaMnu added 8 commits July 18, 2021 14:02

Fix msvcp and openmp compilation

9a76d76

Attempt to fix openmp again

676685d

Make quantisation great again

33bf772

Remove outdated assert

6c151e9

Meanstd arbitrary length and tests

e910430

Merge branch 'hacky_nonmult8' of github.com:kpu/intgemm into hacky_no…

c66a23f

…nmult8

try to fix windows compilation

b3b5071

windows compile v3

b99cc6c

XapaJIaMnu wants to merge 13 commits into master from hacky_nonmult8
