Hi Dr. Wei,
I’m reaching out because I’ve encountered a non-deterministic CUDA error that appears to originate from Ash’s grid query during the backward pass. The relevant call is:
embeddings, masks = ash.grid(pcd, interpolation="linear")
The error message is as follows:
RuntimeError: CUDA error: an illegal memory access was encountered
...
File "/torch_ash/ash/grid_query.py", line 104, in backward
grad_embeddings, grad_offsets = SparseDenseGridQueryBackward.apply(...)
...
File "/torch_ash/ash/grid_query.py", line 193, in forward
grad_embeddings, grad_offsets = backend.query_backward_forward(...)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
When enabling both CUDA_LAUNCH_BLOCKING=1 and TORCH_USE_CUDA_DSA=1, I observed the following message from stdgpu:
Error : an illegal memory access was encountered
File : /.../torch-ash/ext/stdgpu/src/stdgpu/cuda/impl/memory.cpp:123
Function : void stdgpu::cuda::dispatch_memcpy(void*, const void*, ...)
terminate called without an active exception
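For anyone trying to reproduce the trace above, here is a minimal sketch of setting the blocking flag from inside the training script rather than the shell. It assumes the variable is only picked up if it is set before CUDA is initialized in the process, so the safest place is before the first import of torch. The helper name enable_blocking_launches is hypothetical. TORCH_USE_CUDA_DSA is left out on purpose: the error text says "Compile with", which suggests it is a build-time option rather than a runtime switch.

```python
import os
import sys


def enable_blocking_launches():
    """Set CUDA_LAUNCH_BLOCKING=1 for the current process.

    Returns True if torch had not been imported yet when the flag was set,
    which is the situation where the flag is most likely to take effect.
    (Hypothetical helper, not part of Ash or PyTorch.)
    """
    set_early = "torch" not in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return set_early


# Call this at the very top of the entry script, before `import torch`.
set_early = enable_blocking_launches()
if not set_early:
    print("warning: torch was already imported; the flag may be ignored")
```

With blocking launches enabled, the Python stack trace should point at the call that actually launched the failing kernel instead of a later, unrelated API call.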
In my setup, I compute embeddings and masks at each epoch from an input point cloud (pcd) that is randomly distributed in space, meaning that some points may fall inside the Ash grid while others may lie outside it.
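A sanity check along the following lines could be run on the point cloud just before each query, to record how many points are non-finite or fall outside the region the grid is expected to cover. This is only a sketch: the function name summarize_points and the axis-aligned bounds lo and hi are assumptions for illustration, not part of Ash, and whether such points are related to the crash is not known.

```python
import math


def summarize_points(points, lo, hi):
    """Count non-finite and out-of-bounds points.

    points: iterable of (x, y, z) coordinates, e.g. pcd.detach().cpu().tolist()
    lo, hi: assumed axis-aligned bounds of the grid region (hypothetical;
            replace with whatever extent the grid was actually built for)
    """
    total = non_finite = outside = 0
    for p in points:
        total += 1
        if not all(math.isfinite(c) for c in p):
            # NaN or +/-Inf in any coordinate
            non_finite += 1
        elif any(c < l or c > h for c, l, h in zip(p, lo, hi)):
            # finite, but outside the assumed bounds on at least one axis
            outside += 1
    return {
        "total": total,
        "non_finite": non_finite,
        "outside": outside,
        "inside": total - non_finite - outside,
    }
```

Logging this summary together with the epoch and scene index on every call would show whether the crashing iterations differ from the successful ones in their inputs, for example by containing NaN coordinates or an unusually large share of out-of-grid points.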
Interestingly, the failure is non-deterministic: sometimes training runs smoothly for many epochs, while other times it crashes at different iterations or on different scenes. Wrapping the call in a torch.no_grad() block does not prevent the issue, so it seems unrelated to autograd itself.
Since I’m not deeply familiar with Ash’s internals, I wonder whether this might be related to out-of-bounds accesses, race conditions, or some memory reuse issue. Do you have any suggestions on what might be causing this behavior, or guidance on how I could debug it further?
Thank you very much for your time and for your excellent work on Ash!