Tensor size check failed in unit test #39

@JF-D

I ran the unit test with the command ROLE=joint RNIC=eth0 bash tests/utests/run_single_gpu_ut.sh and received the following log. At the end of the log, the worker process appears to crash because of empty push/pull tensors (the check at af_tensor_app.h:310 fails). Is this expected behavior? Every batch reports ALL PASS before the crash, so the test run itself seems to have completed successfully. I see the same behavior when running run_multi_gpu_ut.sh.

+ trap cleanup EXIT
+ export DMLC_NUM_WORKER=1
+ DMLC_NUM_WORKER=1
+ export DMLC_NUM_SERVER=1
+ DMLC_NUM_SERVER=1
+ export DMLC_INTERFACE=eth0
+ DMLC_INTERFACE=eth0
++ ip -o -4 addr
++ grep eth0
++ awk '{print $4}'
++ cut -d/ -f1
+ export DMLC_PS_ROOT_URI=22.11.148.38
+ DMLC_PS_ROOT_URI=22.11.148.38
+ export DMLC_PS_ROOT_PORT=12278
+ DMLC_PS_ROOT_PORT=12278
+ export STEPMESH_SPLIT_QP_LAG=1
+ STEPMESH_SPLIT_QP_LAG=1
+ export DMLC_NODE_RANK=0
+ DMLC_NODE_RANK=0
+ export DMLC_ENABLE_RDMA=ibverbs
+ DMLC_ENABLE_RDMA=ibverbs
+ echo 'SCHEDULER_IP is 22.11.148.38'
SCHEDULER_IP is 22.11.148.38
+ export DMLC_NODE_HOST=22.11.148.38
+ DMLC_NODE_HOST=22.11.148.38
+ cleanup
+ echo 'kill all testing process of ps lite for user root'
kill all testing process of ps lite for user root
+ pkill -9 -f ./cmake_build/tests/utests/ut_server
+ pkill -9 -f ./cmake_build/tests/utests/ut_scheduler
+ pkill -9 -f ./cmake_build/tests/utests/ut_tensor_worker
+ sleep 1
+ export STEPMESH_GPU=0
+ STEPMESH_GPU=0
+ DMLC_ROLE=scheduler
+ ./cmake_build/tests/utests/ut_scheduler
+ DMLC_ROLE=server
+ ./cmake_build/tests/utests/ut_server
+ sleep 1
+ export STEPMESH_GPU=0
+ STEPMESH_GPU=0
+ export DMLC_INTERFACE=auto
+ DMLC_INTERFACE=auto
+ DMLC_ROLE=worker
+ ./cmake_build/tests/utests/ut_tensor_worker
KVWorker instance_idx,0
GPU 0 Batch 1: ALL PASS duration=102939163ns
GPU 0 Batch 2: ALL PASS duration=27739833ns
GPU 0 Batch 3: ALL PASS duration=46168176ns
GPU 0 Batch 4: ALL PASS duration=27520080ns
GPU 0 Batch 5: ALL PASS duration=27165551ns
GPU 0 Batch 6: ALL PASS duration=27953516ns
GPU 0 Batch 7: ALL PASS duration=28203158ns
GPU 0 Batch 8: ALL PASS duration=28131956ns
GPU 0 Batch 9: ALL PASS duration=28194022ns
GPU 0 Batch 10: ALL PASS duration=45210068ns
GPU 0 Batch 11: ALL PASS duration=44272102ns
GPU 0 Batch 12: ALL PASS duration=28727110ns
GPU 0 Batch 13: ALL PASS duration=45771072ns
GPU 0 Batch 14: ALL PASS duration=28989824ns
GPU 0 Batch 15: ALL PASS duration=29023167ns
GPU 0 Batch 16: ALL PASS duration=28965286ns
GPU 0 Batch 17: ALL PASS duration=29427256ns
GPU 0 Batch 18: ALL PASS duration=48544087ns
GPU 0 Batch 19: ALL PASS duration=30062539ns
GPU 0 Batch 20: ALL PASS duration=47414593ns
GPU 0 Batch 21: ALL PASS duration=30887337ns
GPU 0 Batch 22: ALL PASS duration=30476398ns
GPU 0 Batch 23: ALL PASS duration=30219108ns
GPU 0 Batch 24: ALL PASS duration=30752724ns
GPU 0 Batch 25: ALL PASS duration=30469622ns
GPU 0 Batch 26: ALL PASS duration=32349774ns
GPU 0 Batch 27: ALL PASS duration=49403079ns
GPU 0 Batch 28: ALL PASS duration=32056304ns
GPU 0 Batch 29: ALL PASS duration=32781589ns
GPU 0 Batch 30: ALL PASS duration=32495211ns
GPU 0 Batch 31: ALL PASS duration=32362893ns
GPU 0 Batch 32: ALL PASS duration=32050631ns
GPU 0 Batch 33: ALL PASS duration=32501808ns
GPU 0 Batch 34: ALL PASS duration=32772405ns
GPU 0 Batch 35: ALL PASS duration=32757119ns
GPU 0 Batch 36: ALL PASS duration=32641362ns
GPU 0 Batch 37: ALL PASS duration=67196018ns
GPU 0 Batch 38: ALL PASS duration=33236069ns
GPU 0 Batch 39: ALL PASS duration=32761001ns
GPU 0 Batch 40: ALL PASS duration=33892548ns
GPU 0 Batch 41: ALL PASS duration=42562357ns
GPU 0 Batch 42: ALL PASS duration=41388467ns
GPU 0 Batch 43: ALL PASS duration=33617129ns
GPU 0 Batch 44: ALL PASS duration=33809077ns
GPU 0 Batch 45: ALL PASS duration=38755509ns
GPU 0 Batch 46: ALL PASS duration=34612357ns
GPU 0 Batch 47: ALL PASS duration=33614431ns
GPU 0 Batch 48: ALL PASS duration=33377249ns
GPU 0 Batch 49: ALL PASS duration=34635396ns
GPU 0 Batch 50: ALL PASS duration=34613911ns
GPU 0 Batch 51: ALL PASS duration=40757247ns
GPU 0 Batch 52: ALL PASS duration=47069682ns
GPU 0 Batch 53: ALL PASS duration=35675470ns
GPU 0 Batch 54: ALL PASS duration=35977120ns
GPU 0 Batch 55: ALL PASS duration=36102100ns
GPU 0 Batch 56: ALL PASS duration=36183584ns
GPU 0 Batch 57: ALL PASS duration=54167697ns
GPU 0 Batch 58: ALL PASS duration=37007098ns
GPU 0 Batch 59: ALL PASS duration=54641997ns
GPU 0 Batch 60: ALL PASS duration=36974136ns
GPU 0 Batch 61: ALL PASS duration=37319138ns
GPU 0 Batch 62: ALL PASS duration=43199335ns
GPU 0 Batch 63: ALL PASS duration=38930962ns
GPU 0 Batch 64: ALL PASS duration=70583732ns
GPU 0 Batch 65: ALL PASS duration=54357640ns
GPU 0 Batch 66: ALL PASS duration=37470205ns
GPU 0 Batch 67: ALL PASS duration=37186176ns
GPU 0 Batch 68: ALL PASS duration=55340401ns
GPU 0 Batch 69: ALL PASS duration=39065215ns
GPU 0 Batch 70: ALL PASS duration=37256671ns
GPU 0 Batch 71: ALL PASS duration=38823620ns
GPU 0 Batch 72: ALL PASS duration=38601652ns
GPU 0 Batch 73: ALL PASS duration=38330380ns
GPU 0 Batch 74: ALL PASS duration=53560578ns
GPU 0 Batch 75: ALL PASS duration=37433443ns
GPU 0 Batch 76: ALL PASS duration=44340984ns
GPU 0 Batch 77: ALL PASS duration=37434736ns
GPU 0 Batch 78: ALL PASS duration=37747351ns
GPU 0 Batch 79: ALL PASS duration=37771523ns
GPU 0 Batch 80: ALL PASS duration=54664266ns
GPU 0 Batch 81: ALL PASS duration=37565272ns
GPU 0 Batch 82: ALL PASS duration=38426003ns
GPU 0 Batch 83: ALL PASS duration=38178045ns
GPU 0 Batch 84: ALL PASS duration=55504985ns
GPU 0 Batch 85: ALL PASS duration=38289179ns
GPU 0 Batch 86: ALL PASS duration=38905454ns
GPU 0 Batch 87: ALL PASS duration=48633228ns
GPU 0 Batch 88: ALL PASS duration=38896169ns
GPU 0 Batch 89: ALL PASS duration=39307279ns
GPU 0 Batch 90: ALL PASS duration=38426712ns
GPU 0 Batch 91: ALL PASS duration=38698799ns
GPU 0 Batch 92: ALL PASS duration=38468385ns
GPU 0 Batch 93: ALL PASS duration=38803042ns
GPU 0 Batch 94: ALL PASS duration=39775753ns
GPU 0 Batch 95: ALL PASS duration=39201464ns
GPU 0 Batch 96: ALL PASS duration=38966659ns
GPU 0 Batch 97: ALL PASS duration=45183238ns
GPU 0 Batch 98: ALL PASS duration=39401135ns
GPU 0 Batch 99: ALL PASS duration=62850173ns
GPU 0 Batch 100: ALL PASS duration=39703706ns
GPU 0 Batch 101: ALL PASS duration=39490970ns
GPU 0 Batch 102: ALL PASS duration=40023853ns
GPU 0 Batch 103: ALL PASS duration=40426615ns
GPU 0 Batch 104: ALL PASS duration=41240478ns
GPU 0 Batch 105: ALL PASS duration=40971248ns
GPU 0 Batch 106: ALL PASS duration=40741029ns
GPU 0 Batch 107: ALL PASS duration=40566786ns
GPU 0 Batch 108: ALL PASS duration=74995922ns
GPU 0 Batch 109: ALL PASS duration=42169887ns
GPU 0 Batch 110: ALL PASS duration=40681037ns
GPU 0 Batch 111: ALL PASS duration=41022674ns
GPU 0 Batch 112: ALL PASS duration=59206563ns
GPU 0 Batch 113: ALL PASS duration=44751240ns
GPU 0 Batch 114: ALL PASS duration=41708476ns
GPU 0 Batch 115: ALL PASS duration=40811586ns
GPU 0 Batch 116: ALL PASS duration=46401737ns
GPU 0 Batch 117: ALL PASS duration=41851769ns
GPU 0 Batch 118: ALL PASS duration=41269446ns
GPU 0 Batch 119: ALL PASS duration=58507911ns
GPU 0 Batch 120: ALL PASS duration=42020185ns
GPU 0 Batch 121: ALL PASS duration=41769979ns
GPU 0 Batch 122: ALL PASS duration=42503107ns
GPU 0 Batch 123: ALL PASS duration=42387695ns
GPU 0 Batch 124: ALL PASS duration=43097052ns
GPU 0 Batch 125: ALL PASS duration=43215364ns
GPU 0 Batch 126: ALL PASS duration=42620700ns
GPU 0 Batch 127: ALL PASS duration=58966016ns
[17:36:24] worker 0 /mnt/workspace/duanjiangfei/StepMesh/include/dmlc/logging.h:301: [17:36:24] /mnt/workspace/duanjiangfei/StepMesh/include/ps/af_tensor_app.h:310: Check failed: (push_tensors.size() + pull_tensors.size()) >= (1) 

Stack trace returned 7 entries:
[bt] (0) ./cmake_build/tests/utests/ut_tensor_worker(+0x1373f) [0x563d7dde473f]
[bt] (1) ./cmake_build/tests/utests/ut_tensor_worker(+0x13a73) [0x563d7dde4a73]
[bt] (2) ./cmake_build/tests/utests/ut_tensor_worker(+0x1efcc) [0x563d7ddeffcc]
[bt] (3) ./cmake_build/tests/utests/ut_tensor_worker(+0x1f883) [0x563d7ddf0883]
[bt] (4) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc11ea6edb4]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc11e69caa4]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(+0x129c6c) [0x7fc11e729c6c]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [17:36:24] /mnt/workspace/duanjiangfei/StepMesh/include/ps/af_tensor_app.h:310: Check failed: (push_tensors.size() + pull_tensors.size()) >= (1) 

Stack trace returned 7 entries:
[bt] (0) ./cmake_build/tests/utests/ut_tensor_worker(+0x1373f) [0x563d7dde473f]
[bt] (1) ./cmake_build/tests/utests/ut_tensor_worker(+0x13a73) [0x563d7dde4a73]
[bt] (2) ./cmake_build/tests/utests/ut_tensor_worker(+0x1efcc) [0x563d7ddeffcc]
[bt] (3) ./cmake_build/tests/utests/ut_tensor_worker(+0x1f883) [0x563d7ddf0883]
[bt] (4) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc11ea6edb4]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc11e69caa4]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(+0x129c6c) [0x7fc11e729c6c]


tests/utests/run_single_gpu_ut.sh: line 45: 455872 Aborted                 DMLC_ROLE=worker $WORKER_BIN
+ cleanup
+ echo 'kill all testing process of ps lite for user root'
kill all testing process of ps lite for user root
+ pkill -9 -f ./cmake_build/tests/utests/ut_server
+ pkill -9 -f ./cmake_build/tests/utests/ut_scheduler
+ pkill -9 -f ./cmake_build/tests/utests/ut_tensor_worker
+ sleep 1
tests/utests/run_single_gpu_ut.sh: line 11: 455866 Killed                  DMLC_ROLE=server $SERVER_BIN
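
To make a log like the one above easier to triage, here is a minimal sketch that counts the passing batches and collects any distinct "Check failed" conditions. The function name summarize_worker_log and the regular expressions are my own assumptions based only on the line formats visible in this log; they are not part of StepMesh or its test scripts.

```python
import re

# Patterns inferred from the log lines in this issue (an assumption, not a
# documented StepMesh format).
BATCH_RE = re.compile(r"GPU (\d+) Batch (\d+): ALL PASS duration=(\d+)ns")
CHECK_RE = re.compile(r"Check failed: (.+?)\s*$")


def summarize_worker_log(text):
    """Return (passed_batches, last_batch, failed_checks) for a worker log."""
    passed = 0
    last_batch = None
    failed_checks = []
    for line in text.splitlines():
        batch = BATCH_RE.search(line)
        if batch:
            passed += 1
            last_batch = int(batch.group(2))
            continue
        check = CHECK_RE.search(line)
        # The same check is printed twice (log line and what()), so keep
        # each distinct condition only once.
        if check and check.group(1) not in failed_checks:
            failed_checks.append(check.group(1))
    return passed, last_batch, failed_checks


# Short excerpt copied from the log above.
sample = """\
GPU 0 Batch 126: ALL PASS duration=42620700ns
GPU 0 Batch 127: ALL PASS duration=58966016ns
[17:36:24] /mnt/workspace/duanjiangfei/StepMesh/include/ps/af_tensor_app.h:310: Check failed: (push_tensors.size() + pull_tensors.size()) >= (1)
"""

print(summarize_worker_log(sample))
```

Run against the full log, this would make it quick to confirm that the abort happens only after the last reported batch, which is what I am seeing here.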
