Description
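I ran the unit test with the command ROLE=joint RNIC=eth0 bash tests/utests/run_single_gpu_ut.sh and received the following log. At the end of the log, the worker process appears to crash because the push and pull tensor lists are both empty: the check (push_tensors.size() + pull_tensors.size()) >= (1) at af_tensor_app.h:310 fails. Is this expected behavior? Apart from that, the test run seems to have completed successfully, since every batch reports ALL PASS. I see the same behavior when running run_multi_gpu_ut.sh.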
+ trap cleanup EXIT
+ export DMLC_NUM_WORKER=1
+ DMLC_NUM_WORKER=1
+ export DMLC_NUM_SERVER=1
+ DMLC_NUM_SERVER=1
+ export DMLC_INTERFACE=eth0
+ DMLC_INTERFACE=eth0
++ ip -o -4 addr
++ grep eth0
++ awk '{print $4}'
++ cut -d/ -f1
+ export DMLC_PS_ROOT_URI=22.11.148.38
+ DMLC_PS_ROOT_URI=22.11.148.38
+ export DMLC_PS_ROOT_PORT=12278
+ DMLC_PS_ROOT_PORT=12278
+ export STEPMESH_SPLIT_QP_LAG=1
+ STEPMESH_SPLIT_QP_LAG=1
+ export DMLC_NODE_RANK=0
+ DMLC_NODE_RANK=0
+ export DMLC_ENABLE_RDMA=ibverbs
+ DMLC_ENABLE_RDMA=ibverbs
+ echo 'SCHEDULER_IP is 22.11.148.38'
SCHEDULER_IP is 22.11.148.38
+ export DMLC_NODE_HOST=22.11.148.38
+ DMLC_NODE_HOST=22.11.148.38
+ cleanup
+ echo 'kill all testing process of ps lite for user root'
kill all testing process of ps lite for user root
+ pkill -9 -f ./cmake_build/tests/utests/ut_server
+ pkill -9 -f ./cmake_build/tests/utests/ut_scheduler
+ pkill -9 -f ./cmake_build/tests/utests/ut_tensor_worker
+ sleep 1
+ export STEPMESH_GPU=0
+ STEPMESH_GPU=0
+ DMLC_ROLE=scheduler
+ ./cmake_build/tests/utests/ut_scheduler
+ DMLC_ROLE=server
+ ./cmake_build/tests/utests/ut_server
+ sleep 1
+ export STEPMESH_GPU=0
+ STEPMESH_GPU=0
+ export DMLC_INTERFACE=auto
+ DMLC_INTERFACE=auto
+ DMLC_ROLE=worker
+ ./cmake_build/tests/utests/ut_tensor_worker
KVWorker instance_idx,0
GPU 0 Batch 1: ALL PASS duration=102939163ns
GPU 0 Batch 2: ALL PASS duration=27739833ns
GPU 0 Batch 3: ALL PASS duration=46168176ns
GPU 0 Batch 4: ALL PASS duration=27520080ns
GPU 0 Batch 5: ALL PASS duration=27165551ns
GPU 0 Batch 6: ALL PASS duration=27953516ns
GPU 0 Batch 7: ALL PASS duration=28203158ns
GPU 0 Batch 8: ALL PASS duration=28131956ns
GPU 0 Batch 9: ALL PASS duration=28194022ns
GPU 0 Batch 10: ALL PASS duration=45210068ns
GPU 0 Batch 11: ALL PASS duration=44272102ns
GPU 0 Batch 12: ALL PASS duration=28727110ns
GPU 0 Batch 13: ALL PASS duration=45771072ns
GPU 0 Batch 14: ALL PASS duration=28989824ns
GPU 0 Batch 15: ALL PASS duration=29023167ns
GPU 0 Batch 16: ALL PASS duration=28965286ns
GPU 0 Batch 17: ALL PASS duration=29427256ns
GPU 0 Batch 18: ALL PASS duration=48544087ns
GPU 0 Batch 19: ALL PASS duration=30062539ns
GPU 0 Batch 20: ALL PASS duration=47414593ns
GPU 0 Batch 21: ALL PASS duration=30887337ns
GPU 0 Batch 22: ALL PASS duration=30476398ns
GPU 0 Batch 23: ALL PASS duration=30219108ns
GPU 0 Batch 24: ALL PASS duration=30752724ns
GPU 0 Batch 25: ALL PASS duration=30469622ns
GPU 0 Batch 26: ALL PASS duration=32349774ns
GPU 0 Batch 27: ALL PASS duration=49403079ns
GPU 0 Batch 28: ALL PASS duration=32056304ns
GPU 0 Batch 29: ALL PASS duration=32781589ns
GPU 0 Batch 30: ALL PASS duration=32495211ns
GPU 0 Batch 31: ALL PASS duration=32362893ns
GPU 0 Batch 32: ALL PASS duration=32050631ns
GPU 0 Batch 33: ALL PASS duration=32501808ns
GPU 0 Batch 34: ALL PASS duration=32772405ns
GPU 0 Batch 35: ALL PASS duration=32757119ns
GPU 0 Batch 36: ALL PASS duration=32641362ns
GPU 0 Batch 37: ALL PASS duration=67196018ns
GPU 0 Batch 38: ALL PASS duration=33236069ns
GPU 0 Batch 39: ALL PASS duration=32761001ns
GPU 0 Batch 40: ALL PASS duration=33892548ns
GPU 0 Batch 41: ALL PASS duration=42562357ns
GPU 0 Batch 42: ALL PASS duration=41388467ns
GPU 0 Batch 43: ALL PASS duration=33617129ns
GPU 0 Batch 44: ALL PASS duration=33809077ns
GPU 0 Batch 45: ALL PASS duration=38755509ns
GPU 0 Batch 46: ALL PASS duration=34612357ns
GPU 0 Batch 47: ALL PASS duration=33614431ns
GPU 0 Batch 48: ALL PASS duration=33377249ns
GPU 0 Batch 49: ALL PASS duration=34635396ns
GPU 0 Batch 50: ALL PASS duration=34613911ns
GPU 0 Batch 51: ALL PASS duration=40757247ns
GPU 0 Batch 52: ALL PASS duration=47069682ns
GPU 0 Batch 53: ALL PASS duration=35675470ns
GPU 0 Batch 54: ALL PASS duration=35977120ns
GPU 0 Batch 55: ALL PASS duration=36102100ns
GPU 0 Batch 56: ALL PASS duration=36183584ns
GPU 0 Batch 57: ALL PASS duration=54167697ns
GPU 0 Batch 58: ALL PASS duration=37007098ns
GPU 0 Batch 59: ALL PASS duration=54641997ns
GPU 0 Batch 60: ALL PASS duration=36974136ns
GPU 0 Batch 61: ALL PASS duration=37319138ns
GPU 0 Batch 62: ALL PASS duration=43199335ns
GPU 0 Batch 63: ALL PASS duration=38930962ns
GPU 0 Batch 64: ALL PASS duration=70583732ns
GPU 0 Batch 65: ALL PASS duration=54357640ns
GPU 0 Batch 66: ALL PASS duration=37470205ns
GPU 0 Batch 67: ALL PASS duration=37186176ns
GPU 0 Batch 68: ALL PASS duration=55340401ns
GPU 0 Batch 69: ALL PASS duration=39065215ns
GPU 0 Batch 70: ALL PASS duration=37256671ns
GPU 0 Batch 71: ALL PASS duration=38823620ns
GPU 0 Batch 72: ALL PASS duration=38601652ns
GPU 0 Batch 73: ALL PASS duration=38330380ns
GPU 0 Batch 74: ALL PASS duration=53560578ns
GPU 0 Batch 75: ALL PASS duration=37433443ns
GPU 0 Batch 76: ALL PASS duration=44340984ns
GPU 0 Batch 77: ALL PASS duration=37434736ns
GPU 0 Batch 78: ALL PASS duration=37747351ns
GPU 0 Batch 79: ALL PASS duration=37771523ns
GPU 0 Batch 80: ALL PASS duration=54664266ns
GPU 0 Batch 81: ALL PASS duration=37565272ns
GPU 0 Batch 82: ALL PASS duration=38426003ns
GPU 0 Batch 83: ALL PASS duration=38178045ns
GPU 0 Batch 84: ALL PASS duration=55504985ns
GPU 0 Batch 85: ALL PASS duration=38289179ns
GPU 0 Batch 86: ALL PASS duration=38905454ns
GPU 0 Batch 87: ALL PASS duration=48633228ns
GPU 0 Batch 88: ALL PASS duration=38896169ns
GPU 0 Batch 89: ALL PASS duration=39307279ns
GPU 0 Batch 90: ALL PASS duration=38426712ns
GPU 0 Batch 91: ALL PASS duration=38698799ns
GPU 0 Batch 92: ALL PASS duration=38468385ns
GPU 0 Batch 93: ALL PASS duration=38803042ns
GPU 0 Batch 94: ALL PASS duration=39775753ns
GPU 0 Batch 95: ALL PASS duration=39201464ns
GPU 0 Batch 96: ALL PASS duration=38966659ns
GPU 0 Batch 97: ALL PASS duration=45183238ns
GPU 0 Batch 98: ALL PASS duration=39401135ns
GPU 0 Batch 99: ALL PASS duration=62850173ns
GPU 0 Batch 100: ALL PASS duration=39703706ns
GPU 0 Batch 101: ALL PASS duration=39490970ns
GPU 0 Batch 102: ALL PASS duration=40023853ns
GPU 0 Batch 103: ALL PASS duration=40426615ns
GPU 0 Batch 104: ALL PASS duration=41240478ns
GPU 0 Batch 105: ALL PASS duration=40971248ns
GPU 0 Batch 106: ALL PASS duration=40741029ns
GPU 0 Batch 107: ALL PASS duration=40566786ns
GPU 0 Batch 108: ALL PASS duration=74995922ns
GPU 0 Batch 109: ALL PASS duration=42169887ns
GPU 0 Batch 110: ALL PASS duration=40681037ns
GPU 0 Batch 111: ALL PASS duration=41022674ns
GPU 0 Batch 112: ALL PASS duration=59206563ns
GPU 0 Batch 113: ALL PASS duration=44751240ns
GPU 0 Batch 114: ALL PASS duration=41708476ns
GPU 0 Batch 115: ALL PASS duration=40811586ns
GPU 0 Batch 116: ALL PASS duration=46401737ns
GPU 0 Batch 117: ALL PASS duration=41851769ns
GPU 0 Batch 118: ALL PASS duration=41269446ns
GPU 0 Batch 119: ALL PASS duration=58507911ns
GPU 0 Batch 120: ALL PASS duration=42020185ns
GPU 0 Batch 121: ALL PASS duration=41769979ns
GPU 0 Batch 122: ALL PASS duration=42503107ns
GPU 0 Batch 123: ALL PASS duration=42387695ns
GPU 0 Batch 124: ALL PASS duration=43097052ns
GPU 0 Batch 125: ALL PASS duration=43215364ns
GPU 0 Batch 126: ALL PASS duration=42620700ns
GPU 0 Batch 127: ALL PASS duration=58966016ns
[17:36:24] worker 0 /mnt/workspace/duanjiangfei/StepMesh/include/dmlc/logging.h:301: [17:36:24] /mnt/workspace/duanjiangfei/StepMesh/include/ps/af_tensor_app.h:310: Check failed: (push_tensors.size() + pull_tensors.size()) >= (1)
Stack trace returned 7 entries:
[bt] (0) ./cmake_build/tests/utests/ut_tensor_worker(+0x1373f) [0x563d7dde473f]
[bt] (1) ./cmake_build/tests/utests/ut_tensor_worker(+0x13a73) [0x563d7dde4a73]
[bt] (2) ./cmake_build/tests/utests/ut_tensor_worker(+0x1efcc) [0x563d7ddeffcc]
[bt] (3) ./cmake_build/tests/utests/ut_tensor_worker(+0x1f883) [0x563d7ddf0883]
[bt] (4) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc11ea6edb4]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc11e69caa4]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(+0x129c6c) [0x7fc11e729c6c]
terminate called after throwing an instance of 'dmlc::Error'
what(): [17:36:24] /mnt/workspace/duanjiangfei/StepMesh/include/ps/af_tensor_app.h:310: Check failed: (push_tensors.size() + pull_tensors.size()) >= (1)
Stack trace returned 7 entries:
[bt] (0) ./cmake_build/tests/utests/ut_tensor_worker(+0x1373f) [0x563d7dde473f]
[bt] (1) ./cmake_build/tests/utests/ut_tensor_worker(+0x13a73) [0x563d7dde4a73]
[bt] (2) ./cmake_build/tests/utests/ut_tensor_worker(+0x1efcc) [0x563d7ddeffcc]
[bt] (3) ./cmake_build/tests/utests/ut_tensor_worker(+0x1f883) [0x563d7ddf0883]
[bt] (4) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc11ea6edb4]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc11e69caa4]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(+0x129c6c) [0x7fc11e729c6c]
tests/utests/run_single_gpu_ut.sh: line 45: 455872 Aborted DMLC_ROLE=worker $WORKER_BIN
+ cleanup
+ echo 'kill all testing process of ps lite for user root'
kill all testing process of ps lite for user root
+ pkill -9 -f ./cmake_build/tests/utests/ut_server
+ pkill -9 -f ./cmake_build/tests/utests/ut_scheduler
+ pkill -9 -f ./cmake_build/tests/utests/ut_tensor_worker
+ sleep 1
tests/utests/run_single_gpu_ut.sh: line 11: 455866 Killed DMLC_ROLE=server $SERVER_BIN