
PAPI rocm_r component segfaults in intercept mode #74

@gcongiu

Testing the PAPI rocm_r component (https://bitbucket.org/congiu/papi/branch/2022.01.11_rocm-rewrite) with the code at this link: https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp

on MI100 GPUs with rocm-4.5.0 and rocm-5.0.0 produces the behaviour reported below.

The following is the output of the kernel running with the PAPI rocm_r component in sample mode:

$ ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Second kernel run...
rocm:::SQ_INSTS_VMEM_RD:device=0 : 17179869184
rocm:::SQ_INSTS_VMEM_WR:device=0 : 1048576

And with the PAPI rocm_r component in intercept mode:

$ ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
Segmentation fault (core dumped)

Rerunning the above with gdb:

$ ROCP_HSA_INTERCEPT=1 gdb ./hip_matmul_single_gpu
...Starting program: /home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
/home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
[New Thread 0x7fffd303d700 (LWP 68100)]

Program received signal SIGSEGV, Segmentation fault.
0x00000000000127e0 in ?? ()
Missing separate debuginfos, use: debuginfo-install elfutils-libelf-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libdrm-2.4.97-2.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00000000000127e0 in ?? ()
#1  0x00007fffd30718b0 in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#2  0x00007fffd3077d6d in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#3  0x00007ffff6885947 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#4  0x00007ffff6899ce5 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#5  0x00007ffff68812ca in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#6  0x00007ffff686faa8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#7  0x00007ffff68143d1 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#8  0x00007ffff6814af8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#9  0x00007ffff68155da in hipStreamCreate () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#10 0x0000000000206bce in main ()

Interestingly, if I use MALLOC_CHECK_=1:

$ MALLOC_CHECK_=1 ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Error! Failed starting eventset, error=-8 -> 'Event exists, but cannot be counted due to hardware resource limits'

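The segmentation fault disappears. Since MALLOC_CHECK_ switches glibc to a different, more error-tolerant allocator implementation, this seems to indicate a memory error (e.g. heap corruption) in librocprofiler.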

Ignore the “Error! …” line. It is generated by PAPI because the EventSet that initially contained the VALU, SALU and WAVES events has been cleaned up and reused with different events (i.e. VMEM). Since rocprofiler does not allow changing the dispatch callbacks after they have been set, PAPI throws an error.
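
To make the EventSet reuse issue concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: `InterceptModel`, `start`, `cleanup_eventset` and `second_start_after_reuse` are hypothetical names, not part of the PAPI or rocprofiler APIs. The value -8 is taken from the log above (I believe it corresponds to PAPI_ECNFLCT). The rule that restarting with the same events still succeeds is an assumption of the model.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical status codes for this toy model; -8 mirrors the code in the log.
constexpr int kOk = 0;
constexpr int kConflict = -8;

// Toy model (NOT the rocprofiler or PAPI API): dispatch callbacks are
// registered on the first start and cannot be changed afterwards.
class InterceptModel {
public:
    int start(const std::set<std::string>& events) {
        assert(!events.empty());
        if (!registered_) {
            // First start: callbacks are registered for this set of events.
            events_ = events;
            registered_ = true;
            return kOk;
        }
        // Assumption: the same events can be restarted, different ones cannot.
        return events == events_ ? kOk : kConflict;
    }

    // Cleaning up the EventSet does not unregister the callbacks.
    void cleanup_eventset() {}

private:
    bool registered_ = false;
    std::set<std::string> events_;
};

// Replays the sequence from the test: start, clean up, reuse with new events.
int second_start_after_reuse() {
    InterceptModel model;
    int first = model.start({"SQ_INSTS_VALU", "SQ_INSTS_SALU", "SQ_WAVES"});
    assert(first == kOk);
    model.cleanup_eventset();
    return model.start({"SQ_INSTS_VMEM_RD", "SQ_INSTS_VMEM_WR"});
}
```

In this model `second_start_after_reuse()` returns -8, matching the “Error! …” line, which is why that line is unrelated to the segmentation fault.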
