CUDA illegal memory access in torch-ash during ash.grid() / SparseDenseGridQueryBackward

Hi Dr. Wei,

I’m reaching out because I’ve encountered a non-deterministic CUDA error that appears to originate from Ash’s grid query during the backward pass. The relevant call is:
    embeddings, masks = ash.grid(pcd, interpolation="linear")

The error message is as follows:
    RuntimeError: CUDA error: an illegal memory access was encountered
    ...
    File "/torch_ash/ash/grid_query.py", line 104, in backward
        grad_embeddings, grad_offsets = SparseDenseGridQueryBackward.apply(...)
    ...
    File "/torch_ash/ash/grid_query.py", line 193, in forward
        grad_embeddings, grad_offsets = backend.query_backward_forward(...)
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call,
    so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When enabling both CUDA_LAUNCH_BLOCKING=1 and TORCH_USE_CUDA_DSA=1, I observed the following message from stdgpu:
    Error     : an illegal memory access was encountered
    File      : /.../torch-ash/ext/stdgpu/src/stdgpu/cuda/impl/memory.cpp:123
    Function  : void stdgpu::cuda::dispatch_memcpy(void*, const void*, ...)
    terminate called without an active exception
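
For anyone trying to reproduce this: the sketch below shows one way to make sure `CUDA_LAUNCH_BLOCKING=1` is really in effect from Python, assuming the variable has to be in the environment before the process makes its first CUDA call. The helper name `enable_blocking_launches` is hypothetical and not part of torch-ash or PyTorch. As the error text itself says, `TORCH_USE_CUDA_DSA` is a compile option, so it is not handled here.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in an environment mapping.

    Hypothetical helper for illustration. By default it edits os.environ,
    so it should run before any library initializes CUDA.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the training script.
enable_blocking_launches()

# Import torch / ash and touch the GPU only after this point, so that
# kernel launches are synchronous and the Python stack trace points at
# the call that actually failed.
```

Exporting the variable in the shell before launching the script has the same effect and avoids any dependence on import order.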

In my setup, **I compute embeddings and masks at each epoch from an input point cloud (pcd) that is randomly distributed in space**, meaning that some points may fall inside the Ash grid while others may lie outside it.
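
To help rule out bad inputs, the point cloud could be checked before it reaches the query. The following is only a minimal sketch under stated assumptions: `split_valid_points`, `bbox_min`, and `bbox_max` are hypothetical names, the bounds are chosen by the caller rather than read from the Ash grid, and `pcd` is assumed to be an `(N, 3)` floating-point tensor.

```python
import torch


def split_valid_points(pcd, bbox_min, bbox_max):
    """Return (valid_points, mask) for an (N, 3) point cloud.

    A point is kept only if all coordinates are finite and lie inside
    the caller-supplied axis-aligned box [bbox_min, bbox_max].
    """
    if pcd.ndim != 2 or pcd.shape[1] != 3:
        raise ValueError(f"expected shape (N, 3), got {tuple(pcd.shape)}")

    # NaN or Inf coordinates are a common source of invalid indices.
    finite = torch.isfinite(pcd).all(dim=1)

    # Comparisons against NaN are False, so such rows are rejected here too.
    inside = ((pcd >= bbox_min) & (pcd <= bbox_max)).all(dim=1)

    mask = finite & inside
    # Boolean indexing copies, so the result is a fresh contiguous tensor.
    return pcd[mask].contiguous(), mask


# Example with hypothetical bounds of [-1, 1] on every axis.
pcd = torch.tensor([
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],            # outside the box
    [float("nan"), 0.0, 0.0],   # non-finite
    [0.5, 0.5, 0.5],
])
valid, mask = split_valid_points(
    pcd, torch.tensor([-1.0, -1.0, -1.0]), torch.tensor([1.0, 1.0, 1.0])
)
```

If the crash disappears when only `valid` is passed to `ash.grid`, that would point toward out-of-range or non-finite inputs; if it persists, the input data is less likely to be the cause.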

Interestingly, **the failure is non-deterministic**: sometimes training runs smoothly for many epochs, while other times it crashes at different iterations or on different scenes. Wrapping the call in `with torch.no_grad():` does not prevent the issue, so it seems unrelated to autograd itself.

Since I’m not deeply familiar with the internal mechanisms of Ash, I wonder whether this might be related to out-of-bounds accesses, race conditions, or some memory reuse issue. Do you have any suggestions on what might be causing this behavior, or guidance on how I could debug it further?

Thank you very much for your time and for your excellent work on Ash!
