Skip to content

Commit d6aae1e

Browse files
authored
WebGPU GPU compiler backend improvements (and small CUDA) (#565)
* handle `nnkSym` in `getInnerPointerType` * fix `constructPtrSignature` after change from nil -> gtVoid for iTyp The type of the identifier is now always `gtVoid`, so the previous check `not idTyp.isNil` does not work anylonger. * handle `gpuCast` in determineSymKind/Mutability/Ident * allow determination of GPU type to fail in `nimToGpuType` This is only for the optional helper used on the WebGPU backend, where we try to determine the type in an infix expression. However, for some arguments to the infix this is not uniquely possible. I.e. we might encounter `SomeInteger`, which is not a unique type. In this case we just fall back to not assigning a known type. * handle `UncheckedArray` in `gpuTypeToString` on CUDA backend * rename constant to set workgroup size * minor cleanup * correctly handle `gtUA` (UncheckedArray) on CUDA backend The `gtUA` type enum element is new and was not correctly handled yet on the CUDA backend. * fix access of type for left ident in assignment * rewrite compound assignment operators in all functions i.e. x += 5 becomes x = x + 5 etc for any prefix `foo=` in x foo= y we generate x = x foo y * extend `GpuFieldInit` by a type field We need that type to determine information about what type we actually assign in a `gpuObjConstr`. * make sure to update symbols in global functions too Otherwise we don't have up to date type / symbol kind information in globals. * remove local `determineIdent` from `genWebGpu` * fix tree representation of GpuAst * add `mpairs`, `pairs` iterator for `GpuAst` * support default initialization in obj constr fields As structs only support 'constructible types' in their fields anyway, we can just default initialize all fields the user leaves out in an object constructor. * correctly handle `var foo {.constant.}` variables by ⇒ globals Those are intended for runtime constants, i.e. further storage buffers in the context of WGSL. In CUDA we'd use `copyToSymbol` to copy the data to them before execution. * move rewriting of compound assignment out of `storagePass` * implement lifting of struct pointer fields This gives us the convenience of passing around a struct with a pointer field, while preserving the restrictions of WebGPU. Because of the fact that storage buffers are global anyway, we can "pretend" the pointers are part of an object and simply replace them by the global when the fields are used. We throw a CT error if one tries to assign a non storage buffer pointer to a field. * add helper to check code for validity Currently only checks if we assign a pointer to a `var` variable (which is not allowed in WGSL). * add support for full Nim generics in the context of GPU code This means we can finally write something like: ``` proc foo[N: static int](ar: array[N, BigInt], ...) ``` We generate generic instantiations for every instantiation the Nim compiler produces (plus potentially additional ones for every pointer type argument the user passes into such a function). * implement support for type aliases So far only the code generation for WGSL is done, but the CUDA code gen is simple. * allow "pulling in" procs from outside `cuda` scope The generic logic is essentially what we need to pull in a proc defined outside the scope of the `cuda` macro into the code. Essentially when we encounter a function that is not known to us yet, we simply look up its implementation from the symbol. This is the first step towards untying the current `cuda` macro code from relying on having _everything_ be defined under that macro. In the future the idea is that essentially only the `{.global.}` procs strictly need to be defined in the macro. These then pull in everything they need (i.e. everything that is used and has been checked by the compiler to be used). * add pretty printer for GpuType This is lossy so it is not the default representation. * store all types in `ctx.types` table, support 'pulling' types... from outside the `cuda` scope. This is the second step towards making it so that the `cuda` macro only needs to contain the calling code (`{.global.}` proc in general). * support generic instantiations, producing unique WGSL types * add `{.builtin.}` pragma intended to specify builtin procs in CUDA/WGSL This is fundamentally doing the same as `{.nimonly.}`, but for this purpose the different name makes the intent clearer. * fix context version for `cuModuleGetGlobal` * allow to compile with `-d:debugCuda` to compile in debug mode * ignore `varargs` pragma in procs * make sure statically sized arrays are copied As they are passed by pointer in C/C++/CUDA, e.g. a `BigInt state[2]` decays to `BigInt *state` in a function argument behind the scenes. * make sure to catch `CUdeviceptr` arguments After our recent Nvidia API wrapper changes, it is now an alias and not a distinct type anymore. * change CUDA codegen to similar style as WGSL This allows us to make use of the Nim generic instantiation and regular procs we 'pick up' as well as types. In principle it also allow us to do dead code elimination by only generating what's actually called in the global functions (or single kernel, if one uses `toGpuAst` first and then manually calls `codegen`). * improve error message for `Dot` node if not ident/deref in WGSL * always also inject `num_workgroups` argument into WGSL globals When working with workgroups dispatched in an e.g. 2D grid, `passEncoder.dispatchWorkgroups(N, M)` one needs the number of workgroups to compute the unique thread ID. * rename WGSL `storagePass` to `preprocess` * remove `gpuTypeToString`, `genFunctionType` from `backends` In the end we never used those variants anyway, as the code is fully separate in their respective submodules now. * [CUDA] fix indentation for variable declarations * improve type deduction for types that look generic I.e. an array type with a constant in place for the array size (i.e. *not* a `static int` generic argument) will look like a generic in the sense that depending on how you look at the type with `getType*`, you will end up seeing `Ident "Foo"` for the constant. Therefore, we allow array length determination to fail and return `gtInvalid`. If we see that happens for a variable or parameter, we try to determine the type from the symbol of the parameter or variable. For some reason that tends to have the fully instantiated type. However, for *return types* this is not possible, as there is no easy symbol to look at (unless one were to go into the proc body and look at symbol that is being returned). Therefore, if we encounter this type of array as a return type (NOTE: most backends won't anyway allow to return arrays, but that's beside the point), we treat the function as a generic and look up its type when we encounter it in a `gpuCall`. Then, the type will be fully instantiated again. * clone `forwardDeclare` in GpuAst * support tuples by mapping to object types Tuples are essentially just objects anyway. So we make the transformation explicit and thus support tuples (including anonymous). Note: The Nim compiler already transforms tuple unpacking into temporary variable + statements to assign the fields. * support Nim's implicit `result` variable Finally supports the implicit `result`. :) * first steps towards supporting expressions This by itself does not do much. But in order to support (statement list / block) expressions, some more information about if something is an expression / if a call is an expression is going to be needed. * move `cIsExpr` in clone to correct branch Whoops * also catch tuple types from `nnkBracketExpr` * handle `inline` pragma same as `forceinline` * fix detection of custom `result` by looking into `gpuBlock` * fix handling of `skipSemicolon` for nested usage If nested, we must not reset the `skipSemicolon` back to false once the inner one is done! * if last expression _is_ return, we don't need `result` variable At least not in theory. We _could_ construct a function in which there _is_ a branch that returns `result` and another that returns something else, but uhh, I guess we can leave that for later. * remove debug output * doc comment updates for procs in gpu_field_ops * do not emit semicolon in binary operand child nodes * handle function overloads by using `iSym` if encountered * ignore `magic` pragma procs * ignore forward declarations in Nim -> GpuAst Any forward declaration will also have its regular implementation somewhere. Otherwise the code is invalid anyway. As we generate our own forward declarations, we can ignore them. * add BigInt comparison and `toCanonical` conversion We could consider to rename it `fromMont` to match the regular Constantine naming, but given that here we (currently) only have a single type to represent both, I picked a different name. * add FIPS Montgomery multiplication for fields without spare bits The Goldilocks field does not have any spare bits. As a result using CIOS leads to the wrong result. * [cuda] for now emit `constexpr` for a `gpuConstexpr` (i.e. Nim `const`) See the added TODO. The main point is that if we write things like ```nim const M = toBigInt(bigintToUint32Limbs(T.getModulus)) ``` we want to have the Nim compiler evaluate the RHS at compile time. If we were to make the user write ``` var M {.constant.} = toBigInt(bigintToUint32Limbs(T.getModulus)) ``` instead the Nim compiler would evaluate the RHS at runtime. We _could_ force the user to write ```nim const M {.constant.} = toBigInt(bigintToUint32Limbs(T.getModulus)) ``` instead though. It is no problem however, to emit `__constant__` if it's a global and `constexpr` for a local (where `__constant__` is anyhow forbidden in CUDA). The only minor annoyance is that _maybe_ someone wants to emit `constexpr` also in global scope. _Maybe_ we'll go with a design that does the above by default but allows you to overwrite it via ```nim const M {.constant.} = toBigInt(bigintToUint32Limbs(T.getModulus)) ``` -> explicitly force `__constant__` ```nim const M {.constexpr.} = toBigInt(bigintToUint32Limbs(T.getModulus)) ``` -> explicity force `constexpr` instead. * [cuda] handle generic instantiations and `UncheckedArray` * [wgsl] do not suffix generic inst types if they have no args * handle `raises`, `noinit` pragmas And use normalized strings because `noInit` and `noinit` both appears often enough. * use `getTypeName` for tuple type fields Otherwise we get a big mess for objects :) * map `ntyString` to string explicitly in gpu type kinds Not supported on some backends * map ntyUnused2 (lent T) to ptr T * generate type names for bracket expressions e.g. generics * improve generic inst -> gpu type by handling nnkSym This is not perfect yet (I think). Needs some tests for different situations. Works in practice for what I've used it for at least. * overwrite function names of problematic identifiers E.g. `[]` is not a sensible name on GPU backends. We rename it to `get` for example. Note that the binary operators there likely never appear there. We need to handle those via `gpuBinOp` (and call `maybePatchFnName` from there) * refactor proc signature parsing & unify adding types to `ctx.types` NOTE: We'll change the `maybeAddType` code in the future to be done in `nimToGpuType` directly. However, at the moment that produces a bit too much required change with having to add `GpuContext` to a whole bunch of functions that all call `nimToGpuType` internally. * handle recursive calls in GPU code Previously due to our parsing of procs whenever they are called, we would infinitely recurse in the parsing logic. We now record the function signature and function identifier so that we can avoid that. We need the function signature to get information about the return type before we actually start parsing a function. Otherwise _inside_ of the recursive function we wouldn't be able to determine the return type at the callsite of the recursive call (the initial parse hasn't been completed at that point yet, which would fill the proc into `allFnTab`) * fix `modadd` in debug builds on CUDA In a debug build (`-d:debugCuda`) modular addition produced off by one errors. As it turned out the problem was our calculation of `overflowedLimbs` inside of `finalSubMayOverflow`. The carry flag set in the last `add_cio` call in `modadd` does not reliably survive into the function call in a debug build. We compute it directly after computing the last `add_cio` call in `modadd` and simply pass it as an argument. This way arithmetic also works reliably on a debug build. * fix reassignment of `types` in RT codegen for aliases * handle `nnkObjConstr` correctly when encountering gtGenericInst * do not emit `f` suffix for float literals Nim float literals when converting to strings already come with a `f` suffix. * [wgsl] append `;` for aliases * rewrite infix as `gpuCall` if arguments not basic types The idea is that if the arguments are not basic types, we will need a custom function to perform the infix operation. Most backends however do not support custom operators and hence actual `gpuBinOp` are not valid for custom types in general. Hence, we rewrite them as `gpuCall` nodes with non-symbol based naming. NOTE: Currently this does not handle the case where we might use an inbuilt type like `vec3` and its implementation of infix operators that _may_ be defined after all. We need to find an elegant solution for that. Either by checking if argument are of basic types (like in this commit) or if they are a type annotated with `{.builtin.}`. Alternatively, we could force the user to define operators for such inbuilt types (i.e. wrap them) and then if there is a wrapper that is marked `{.builtin.}` we don't replace infix by `gpuCall` either. * support arbitrary values in array literals * handle generic instantiation with actual initializations I.e. instead of just: ```nim Vec[float32]() ``` being: ``` ObjConstr BracketExpr Sym "Vec3" Sym "float32" ``` Also handle the case of: ``` Vec3[float32](limbs: [1'f32, 1'f32, 1'f32]) ``` being: ``` ObjConstr BracketExpr Sym "Vec3" Sym "float32" ExprColonExpr Sym "limbs" Bracket Float32Lit 1.0 Float32Lit 1.0 Float32Lit 1.0 ``` * fix top level type definitions Because we still had the `farmTopLevel` adding a (now empty) element to the `globalBlocks` our array access to `ctx.globalBlocks[0]` to generate the types didn't actually emit anything. * [cuda] add pass to strip `deref` if found inside `index` for pointers But only strip if not pointer to an array type! * remove old comment * better handle replacement of derefs This allows us to make a better choice about when to replace and when not to replace. * remove Nim gensym'd suffix from `tmpTuple` variables I.e. variables that correspond to tuple unpacking in Nim * replace single letter strings by characters * extend `maybeAddType` to find types behind Ptr/UA/Array * move all `builtins` into separate files, one for each backend * update doc comment of `cuda` macro * improve explanation of `typeOfIndex` in CUDA's `getType` helper
1 parent b3f4ebd commit d6aae1e

File tree

15 files changed

+1922
-488
lines changed

15 files changed

+1922
-488
lines changed

constantine/math_compiler/experimental/backends/backends.nim

Lines changed: 3 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -14,27 +14,11 @@ when defined(cuda):
1414
else:
1515
const Backend* = bkWGSL
1616

17-
proc gpuTypeToString*(t: GpuTypeKind): string =
18-
case Backend
19-
of bkCuda: cuda.gpuTypeToString(t)
20-
of bkWGSL: wgsl.gpuTypeToString(t)
21-
22-
proc gpuTypeToString*(t: GpuType, ident = newGpuIdent(), allowArrayToPtr = false,
23-
allowEmptyIdent = false,
24-
): string =
25-
case Backend
26-
of bkCuda: cuda.gpuTypeToString(t, ident.ident(), allowArrayToPtr, allowEmptyIdent)
27-
of bkWGSL: wgsl.gpuTypeToString(t, ident, allowArrayToPtr, allowEmptyIdent)
28-
29-
proc genFunctionType*(typ: GpuType, fn: string, fnArgs: string): string =
30-
case Backend
31-
of bkCuda: cuda.genFunctionType(typ, fn, fnArgs)
32-
of bkWGSL: wgsl.genFunctionType(typ, fn, fnArgs)
33-
3417
proc codegen*(ctx: var GpuContext, ast: GpuAst, kernel: string = ""): string =
3518
case Backend
3619
of bkCuda:
37-
result = ctx.genCuda(ast)
20+
cuda.preprocess(ctx, ast, kernel)
21+
result = cuda.codegen(ctx)
3822
of bkWGSL:
39-
ctx.storagePass(ast, kernel)
23+
wgsl.preprocess(ctx, ast, kernel)
4024
result = wgsl.codegen(ctx)

constantine/math_compiler/experimental/backends/common_utils.nim

Lines changed: 29 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,9 +6,36 @@
66
# * Apache v2 license (license terms in the root directory or at http://www.apache.org/licenses/LICENSE-2.0).
77
# at your option. This file may not be copied, modified, or distributed except according to those terms.
88

9+
import std / tables
910
import ../gpu_types
10-
# import ./backends
1111

1212
proc address*(a: string): string = "&" & a
13-
1413
proc size*(a: string): string = "sizeof(" & a & ")"
14+
15+
proc isGlobal*(fn: GpuAst): bool =
16+
doAssert fn.kind == gpuProc, "Not a function, but: " & $fn.kind
17+
result = attGlobal in fn.pAttributes
18+
19+
proc farmTopLevel*(ctx: var GpuContext, ast: GpuAst, kernel: string, varBlock: var GpuAst) =
20+
## Farms the top level of the code for functions, variable and type definition.
21+
## All functions are added to the `allFnTab`, while only global ones (or even only
22+
## `kernel` if any) is added to the `fnTab` as the starting point for the remaining
23+
## logic.
24+
## Variables are collected in `varBlock`.
25+
case ast.kind
26+
of gpuProc:
27+
ctx.allFnTab[ast.pName] = ast
28+
if kernel.len > 0 and ast.pName.ident() == kernel and ast.isGlobal():
29+
ctx.fnTab[ast.pName] = ast.clone() # store global function extra
30+
elif kernel.len == 0 and ast.isGlobal():
31+
ctx.fnTab[ast.pName] = ast.clone() # store global function extra
32+
of gpuBlock:
33+
# could be a type definition or global variable
34+
for ch in ast:
35+
ctx.farmTopLevel(ch, kernel, varBlock)
36+
of gpuVar, gpuConstexpr:
37+
varBlock.statements.add ast
38+
of gpuTypeDef, gpuAlias:
39+
raiseAssert "Unexpected type def / alias def found. These should be in `ctx.types` now: " & $ast
40+
else:
41+
discard

0 commit comments

Comments
 (0)