Kermac is a collection of fused CUDA kernels for fast, memory-efficient computation in kernel methods. Kermac makes heavy use of JIT (just-in-time) compilation to generate custom CUDA kernels on demand. The compiled kernels are stored in a cache database, so JIT costs are only incurred once. Passing `debug=True` to most kermac routines prints information about the compilation and caching of these JIT CUDA kernels.
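For example, a minimal sketch of the debug flag (the tensor layout and exact call shape of `kermac.cdist` here are assumptions; see the cdist.py example below for the authoritative usage):

```python
import torch
import kermac

# Hypothetical inputs; examples/cdist.py shows the layout kermac.cdist expects.
x = torch.randn(1024, 64, device='cuda')
z = torch.randn(2048, 64, device='cuda')

# First call: the fused kernel is JIT-compiled and cached; debug=True prints
# what was compiled and whether the cache was hit.
out = kermac.cdist(x, z, p=1.3, debug=True)

# Later calls with the same configuration reuse the cached kernel.
out = kermac.cdist(x, z, p=1.3)
```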
Kermac supports only NVIDIA cards with compute capability sm_80 or higher. This includes:
- Server cards like A10, A100, H100, B100
- Consumer cards like RTX 30xx, RTX 40xx, RTX 50xx
Kermac relies on cuda-core for JIT compilation, which supports CUDA toolkits 11.8 and 12.x. Because the cuda-core and nvmath packages compile at runtime, no C++ compilation step or per-platform wheel is needed to install this library.
```
pip install kermac[cu12]
```
or
```
pip install kermac[cu11]
```

The linalg functionality depends on nvmath-python. This isn't a required dependency. To run the linalg routines, please do:

```
pip install nvmath-python[cu12]
```
or
```
pip install nvmath-python[cu11]
```

From a fresh environment you can do:
```
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist.py
python cdist.py -d -p 1.0
```

```
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist_grad.py
python cdist_grad.py -d
```

Running build_a_kernel.py will batch compile quite a few different kernels on first run. Expect around 20 seconds of JIT compiling.

```
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/build_a_kernel.py
python build_a_kernel.py -d
```

```
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/linalg.py
python linalg.py
```

Solves a symmetric positive-definite system of equations, the equivalent of factoring with torch.linalg.cholesky and then solving with the factor. Wraps xpotrf and xpotrs from nvmath.bindings.cusolverDn. This implementation is special because it doesn't synchronize with the CPU on a failed Cholesky factorization. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each of the batches passed in, and requires a small workspace allocation for each stream. At the end of the routine it synchronizes those streams against the current stream.
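For reference, a minimal sketch of the operation this routine performs, written in plain PyTorch (this is not kermac's API; the tensor names are illustrative):

```python
import torch

# A: batch of symmetric positive-definite matrices, B: right-hand sides.
A = torch.randn(4, 128, 128, device='cuda', dtype=torch.float64)
A = A @ A.transpose(-1, -2) + 128 * torch.eye(128, device='cuda', dtype=torch.float64)
B = torch.randn(4, 128, 16, device='cuda', dtype=torch.float64)

L = torch.linalg.cholesky(A)      # factor; torch allocates a new tensor for L
X = torch.cholesky_solve(B, L)    # solve A X = B using the factor
```

The kermac version differs in that the factorization can overwrite the input matrix in place and a failed factorization does not force a CPU synchronization.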
Solves a general (not necessarily symmetric) system of equations like torch.linalg.solve.
Wraps xgetrf and xgetrs from nvmath.bindings.cusolverDn. This implementation is special because it doesn't synchronize with the CPU on a failed LU decomposition. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each of the batches passed in, and requires a small workspace allocation for each stream. At the end of the routine it synchronizes those streams against the current stream.
Computes the eigenvalues and eigenvectors of a symmetric matrix like torch.linalg.eigh.
Wraps xsyevd from nvmath.bindings.cusolverDn. This implementation is special because it doesn't synchronize with the CPU on a failed eigenvalue decomposition. Additionally, this routine can write the eigenvectors in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each of the batches passed in, and requires a small workspace allocation for each stream. At the end of the routine it synchronizes those streams against the current stream.
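All three routines describe the same stream pattern: one stream per batch entry, each with its own workspace, rejoined with the current stream at the end. A minimal sketch of that pattern in PyTorch (an illustration of the design only, not kermac's code; `factor_one_matrix` is a hypothetical stand-in for the per-batch cuSOLVER calls):

```python
import torch

def solve_batched(batch, factor_one_matrix):
    """Illustration: run one CUDA stream per batch entry, then rejoin the current stream."""
    current = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in range(batch.shape[0])]
    for i, stream in enumerate(streams):
        stream.wait_stream(current)      # start after work already queued on the current stream
        with torch.cuda.stream(stream):
            # Each per-batch call uses its own workspace and writes its result in place,
            # without copying error codes back to the CPU.
            factor_one_matrix(batch[i])
    for stream in streams:
        current.wait_stream(stream)      # the current stream waits; no CPU synchronization
```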
An implementation of torch.cdist. Computes fractional norms and supports batches and broadcasting. Aside from allocating the out tensor when out=None, it does not allocate.
Computes:

$$out_{n,m} = \left( \sum_{k} \left| b_{k,n} - a_{k,m} \right|^{p} \right)^{1/p}$$

If instead `skip_epilogue` is set it computes:

$$out_{n,m} = \sum_{k} \left| b_{k,n} - a_{k,m} \right|^{p}$$

Or expressed in c-style it efficiently computes:
```c
// a[K,M], b[K,N], out[N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int k = 0; k < K; k++) {
            out[n,m] += pow(abs(b[k,n] - a[k,m]), p);
        }
        if (!skip_epilogue) {
            out[n,m] = pow(out[n,m], 1.0/p);
        }
    }
}
```

It has special code paths for particular values of $p$.
The benchmarks below compare kermac.cdist against torch.cdist at a fixed problem size:
| GPU / p-norm | Speed-up (×) | kermac.cdist (ms) | torch.cdist (ms) |
|---|---|---|---|
| GH200 · p = 1.0 | 29.1× | 82 | 2,389 |
| GH200 · p = 1.3 | 9.6× | 453 | 4,360 |
| GH200 · p = 2.0 | 5.2× | 79 | 406 |
| H100-PCIe · p = 1.0 | 27.0× | 108 | 2,907 |
| H100-PCIe · p = 1.3 | 9.4× | 592 | 5,591 |
| H100-PCIe · p = 2.0 | 3.3× | 104 | 346 |
| A100 · p = 1.0 | 15.4× | 251 | 3,878 |
| A100 · p = 1.3 | 9.4× | 873 | 8,230 |
| A100 · p = 2.0 | 0.9× | 325 | 301 |
| RTX 4090 · p = 1.0 | 52.6× | 76 | 4,021 |
| RTX 4090 · p = 1.3 | 11.8× | 350 | 4,141 |
| RTX 4090 · p = 2.0 | 3.4× | 77 | 262 |
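A rough sketch of how such a comparison can be timed (the call shape of `kermac.cdist` below is an assumption; examples/cdist.py is the authoritative usage):

```python
import torch
import kermac

def time_ms(fn, iters=10):
    """Time a CUDA callable with events, returning milliseconds per call."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm up (includes any one-time JIT compilation)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x1 = torch.randn(8192, 512, device='cuda')
x2 = torch.randn(8192, 512, device='cuda')

print('torch :', time_ms(lambda: torch.cdist(x1, x2, p=1.3)))
print('kermac:', time_ms(lambda: kermac.cdist(x1, x2, p=1.3)))  # assumed signature
```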
This is a more customizable version of kermac.cdist; kermac.cdist is written on top of it. `run_kernel` accepts a descriptor as one of its arguments that can create fully fused kernel functions. You can specify the inner-norm type (`abs(x)`, `x*x`, or `pow(x,p)`), the outer-norm type (`x`, `sqrt(x)`, or `pow(x,1/p)`), and finally a Laplace or Gaussian epilogue. On first run, the fully fused kernel is JIT compiled and cached for future use. This function also allows broadcasting and batching of its input tensors; see build_a_kernel.py for various examples of usage. It also allows batching and broadcasting of its hyperparameters such as `p`, bandwidth, and regularization; see broadcast_kernel.py for examples of batching and broadcasting hyperparameters.
Computes the gradient of cdist, in the style of:

$$out_{o,n,m} = \sum_{k} c_{o,k} \, a_{k,m} \, \mathrm{sgn}\!\left(d_{n,m} - b_{n,k}\right) \left| d_{n,m} - b_{n,k} \right|^{p-1}$$
Or expressed in c-style it efficiently computes:
```c
// a[K,M], b[N,K], c[O,K], d[N,M], out[O,N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            for (int k = 0; k < K; k++) {
                float diff = d[n,m] - b[n,k];
                out[o,n,m] += c[o,k] * a[k,m] * signum(diff) * pow(abs(diff), p - 1.0);
            }
        }
    }
}
```

Aside from allocating the out tensor when out=None, it does not allocate.
It has special code paths for particular values of $p$.
It's supposed to be used like:
- $a_{k,m}$ is `grad_kernel_matrix`
- $b_{n,k}$ is `data_x`
- $c_{o,k}$ is `coefficients`
- $d_{n,m}$ is `data_z`
- $out_{o,n,m}$ is `gradient`
```python
# Given tensors a,b,c,d,out and sizes M,N,O,K
# K is the contracted mode
assert a.shape == torch.Size([K,M])
assert b.shape == torch.Size([N,K])
assert c.shape == torch.Size([O,K])
assert d.shape == torch.Size([N,M])
assert out.shape == torch.Size([O,N,M])
assert a.stride(1) == 1
assert b.stride(1) == 1
assert c.stride(1) == 1
assert d.stride(1) == 1
assert out.stride(2) == 1
out = kermac.cdist_grad(a,b,c,d,out=out) # OK
```
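For checking results on small sizes, here is a plain-PyTorch reference of the same contraction (a sketch based on the formula above; whether kermac.cdist_grad accepts `p` as a keyword argument is an assumption):

```python
import torch

def cdist_grad_reference(a, b, c, d, p):
    """Reference: out[o,n,m] = sum_k c[o,k] * a[k,m] * sgn(diff) * |diff|^(p-1), diff = d[n,m] - b[n,k]."""
    diff = d.unsqueeze(1) - b.unsqueeze(2)                                   # [N,K,M]: d[n,m] - b[n,k]
    weighted = a.unsqueeze(0) * torch.sign(diff) * diff.abs().pow(p - 1.0)   # [N,K,M]
    return torch.einsum('ok,nkm->onm', c, weighted)                          # [O,N,M]

# e.g. compare against kermac on small sizes (assuming p is a keyword argument):
# torch.testing.assert_close(kermac.cdist_grad(a, b, c, d, p=p), cdist_grad_reference(a, b, c, d, p))
```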