
CUDA illegal memory access in torch-ash during ash.grid() / SparseDenseGridQueryBackward #35

@duzh11

Description


Hi Dr. Wei,

I’m reaching out because I’ve encountered a non-deterministic CUDA error that appears to originate from Ash’s grid query during the backward pass. The relevant call is:
embeddings, masks = ash.grid(pcd, interpolation="linear")

The error message is as follows:
RuntimeError: CUDA error: an illegal memory access was encountered
...
File "/torch_ash/ash/grid_query.py", line 104, in backward
grad_embeddings, grad_offsets = SparseDenseGridQueryBackward.apply(...)
...
File "/torch_ash/ash/grid_query.py", line 193, in forward
grad_embeddings, grad_offsets = backend.query_backward_forward(...)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

When I enabled both CUDA_LAUNCH_BLOCKING=1 and TORCH_USE_CUDA_DSA=1, I observed the following message from stdgpu:
Error : an illegal memory access was encountered
File : /.../torch-ash/ext/stdgpu/src/stdgpu/cuda/impl/memory.cpp:123
Function : void stdgpu::cuda::dispatch_memcpy(void*, const void*, ...)
terminate called without an active exception
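For reference, this is how I set the two variables before launching training (train.py is just a placeholder for my actual entry point):

```shell
# Force synchronous kernel launches so the failing kernel is reported
# at the call site, and enable device-side assertions in PyTorch.
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1
# python train.py   # placeholder for my training script
```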

In my setup, I compute embeddings and masks at each epoch from an input point cloud (pcd) that is randomly distributed in space, meaning that some points may fall inside the Ash grid while others may lie outside it.

Interestingly, the failure is non-deterministic — sometimes training runs smoothly for many epochs, while other times it crashes at different iterations or scenes. Wrapping the call in a torch.no_grad() block does not prevent the issue, so it seems unrelated to autograd itself.

Since I’m not deeply familiar with the internal mechanisms of Ash, I wonder whether this might be caused by an out-of-bounds access, a race condition, or a memory-reuse issue. Do you have any suggestions on what might be causing this behavior, or guidance on how I could debug it further?
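In case it helps test the out-of-bounds hypothesis, one experiment I'm considering is pre-filtering the point cloud against the grid's bounding box before the query. A minimal sketch in plain Python — the helper name and the [0, 1]^3 bounds are mine, not part of the ash API; the real code would operate on the torch tensor using the actual grid extent:

```python
# Hypothetical workaround sketch: drop points that fall outside an
# axis-aligned bounding box before they ever reach ash.grid().
# The [0, 1]^3 bounds below are placeholders; the real extent would
# come from however the SparseDenseGrid was configured.

def filter_points(points, bbox_min, bbox_max):
    """Keep only points whose every coordinate lies in [bbox_min, bbox_max]."""
    return [
        p for p in points
        if all(lo <= c <= hi for c, lo, hi in zip(p, bbox_min, bbox_max))
    ]

pts = [(0.5, 0.5, 0.5),   # inside the box
       (2.0, 0.1, 0.3),   # x above the upper bound
       (-0.2, 0.4, 0.9)]  # x below the lower bound
kept = filter_points(pts, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
# kept == [(0.5, 0.5, 0.5)]
```

If the crash disappears once only in-bounds points are queried, that would point toward an out-of-bounds read in the backward kernel rather than a race or memory-reuse problem.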

Thank you very much for your time and for your excellent work on Ash!
