@kylewanginchina
Contributor

This PR is for:

  • Agent

Support multi-threaded Python unwinding for distributed vLLM profiling.

Checklist

  • Added unit test.

Backport to branches

@kylewanginchina
Contributor Author

The original problem:
img_v3_02st_c4f75f0e-8f7a-4cd1-9f87-29263690f5fg
img_v3_02st_88f63cae-32c8-4f2c-96cc-92c13da59d7g
img_v3_02st_82c72bea-7223-4d9c-97fa-bdabd2b3b77g

After the fix:
img_v3_02t4_f699408b-7728-4538-9e7f-c8621b73940g
img_v3_02t4_77bd2a0d-7194-4b66-9bad-e906e9bbf01g
img_v3_02t4_5a5bdfa8-60d1-4d98-8525-de4140ad750g
img_v3_02t4_f0c76812-aff3-4e4d-ba45-29ec568d0efg

NCCL and the like all look normal.

@kylewanginchina force-pushed the fix-multithread-python-unwind branch 2 times, most recently from dedd7d1 to a327688, on December 23, 2025 09:46
@kylewanginchina
Contributor Author

The problem of missing Python thread function symbols when running inside a container has also been resolved:
[2025-12-23 17:34:11.905519 +08:00] DEBUG [crates/trace-utils/src/unwind/tpbase.rs:325] Failed to read kernel code for x86_fsbase_write_task: failed to fill whole buffer
[2025-12-23 17:34:12.034660 +08:00] DEBUG [crates/trace-utils/src/unwind/tpbase.rs:355] Extracted TPBASE offset 5416 (0x1528) from BTF
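
The two log lines above suggest an ordered fallback: reading the kernel code for `x86_fsbase_write_task` failed, and the TPBASE offset was then taken from BTF. A minimal Rust sketch of that "first strategy that succeeds wins" pattern is shown here. The names `Strategy`, `first_tpbase_offset`, `from_kernel_code` and `from_btf` are hypothetical and are not taken from `tpbase.rs`; the two stand-in functions only replay the outcome seen in the log.

```rust
/// A discovery strategy: a label for logging plus a function that returns
/// the offset in bytes or a reason for failing. (Hypothetical type.)
type Strategy = (&'static str, fn() -> Result<u64, String>);

/// Runs the strategies in order. Returns the label and offset of the first
/// success, together with the failure messages collected along the way.
fn first_tpbase_offset(strategies: &[Strategy]) -> (Option<(&'static str, u64)>, Vec<String>) {
    let mut failures = Vec::new();
    for (label, run) in strategies {
        match run() {
            Ok(offset) => return (Some((*label, offset)), failures),
            Err(reason) => failures.push(format!("{label}: {reason}")),
        }
    }
    (None, failures)
}

// Stand-ins that replay the outcome seen in the log. Real strategies would
// inspect kernel text or BTF data instead of returning constants.
fn from_kernel_code() -> Result<u64, String> {
    Err("failed to fill whole buffer".to_string())
}

fn from_btf() -> Result<u64, String> {
    Ok(5416) // 0x1528, the value reported in the log
}
```

Collecting the failure messages instead of discarding them keeps the DEBUG output useful when every strategy fails and a fixed default has to be used.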

@kylewanginchina
Contributor Author

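On some systems the code still falls back to fixed offsets by default; it should be possible to eliminate that fallback handling further.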

@kylewanginchina force-pushed the fix-multithread-python-unwind branch from a327688 to 0bb6564 on December 24, 2025 06:51
@kylewanginchina
Contributor Author

The cases where TSD decoding and autoTLSkey disassembly failed and fell back to default values are now handled better as well.

Previously it would fall back:

[2025-12-24 09:55:40.872406 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#167378 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 09:55:40.875547 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:330] Could not find autoTLSkey from disassembly, using fallback offset 0x555d1fa40d0c
[2025-12-24 09:55:40.877578 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#167378 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 09:55:40.879065 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:330] Could not find autoTLSkey from disassembly, using fallback offset 0x555d1fa40d0c
[2025-12-24 09:55:40.881692 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:676] Failed to extract TSD info for process#167378: Could not extract TSD info from x86_64 code (len=256). Dump: [f3, 0f, 1e, fa, 83, ff, 1f, 77, 37, 8d, 47, 31, 48, c1, e0, 04, 64, 48, 03, 04, 25, 10, 00, 00, 00, 4c, 8b, 40, 08, 4d, 85, c0, 74, 16, 89, ff, 48, 8d, 15, 95, 66, 18, 00, 48, 8b, 08, 48, c1, e7, 04, 48, 39, 0c, 3a, 75, 38, 4c, 89, c0, c3, 0f, 1f, 40, 00, 81, ff, ff, 03, 00, 00, 77, 38, 89, fa, 89, f8, 83, e2, 1f, c1, e8, 05, 64, 48, 8b, 04, c5, 10, 05, 00, 00, 49, 89, c0, 48, 85, c0, 74, d5, 48, c1, e2, 04, 48, 01, d0, eb, ad, 0f, 1f, 40, 00, 48, c7, 40, 08, 00, 00, 00, 00, 45, 31, c0, eb, bb, 0f, 1f, 00, 45, 31, c0, eb, b3, 66, 2e, 0f, 1f, 84, 00, 00, 00, 00, 00, 90, f3, 0f, 1e, fa, 41, b8, 01, 00, 00, 00, 31, c9, 31, d2, e9, 2d, 00, 00, 00, 66, 2e, 0f, 1f, 84, 00, 00, 00, 00, 00, 0f, 1f, 00, f3, 0f, 1e, fa, 48, 89, 7c, 24, f8, 31, d2, 64, 48, 8b, 04, 25, 10, 00, 00, 00, f0, 48, 0f, b1, 54, 24, f8, c3, 0f, 1f, 40, 00, f3, 0f, 1e, fa, 41, 57, 41, 56, 41, 55, 41, 54, 55, 53, 48, 83, ec, 48, 64, 48, 8b, 04, 25, 28, 00, 00, 00, 48, 89, 44, 24, 38, 31, c0, 48, 85, ff, 0f, 84, 55, 01, 00, 00, 8b, 87, d0, 02, 00], using defaults
[2025-12-24 09:55:40.883687 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:692] Loading Python unwind info for process#167378: autoTLSkey=0x555d1fa40d0c, version=0x30a
[2025-12-24 09:55:43.821644 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#167378 loaded 222 and reused 0 dwarf entry shards
[2025-12-24 09:55:43.821836 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#167555 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 09:55:43.822554 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:330] Could not find autoTLSkey from disassembly, using fallback offset 0x5612cd086d0c
[2025-12-24 09:55:43.822699 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#167555 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 09:55:43.823345 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:330] Could not find autoTLSkey from disassembly, using fallback offset 0x5612cd086d0c
[2025-12-24 09:55:43.823807 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:676] Failed to extract TSD info for process#167555: Could not extract TSD info from x86_64 code (len=256). Dump: [f3, 0f, 1e, fa, 83, ff, 1f, 77, 37, 8d, 47, 31, 48, c1, e0, 04, 64, 48, 03, 04, 25, 10, 00, 00, 00, 4c, 8b, 40, 08, 4d, 85, c0, 74, 16, 89, ff, 48, 8d, 15, 95, 66, 18, 00, 48, 8b, 08, 48, c1, e7, 04, 48, 39, 0c, 3a, 75, 38, 4c, 89, c0, c3, 0f, 1f, 40, 00, 81, ff, ff, 03, 00, 00, 77, 38, 89, fa, 89, f8, 83, e2, 1f, c1, e8, 05, 64, 48, 8b, 04, c5, 10, 05, 00, 00, 49, 89, c0, 48, 85, c0, 74, d5, 48, c1, e2, 04, 48, 01, d0, eb, ad, 0f, 1f, 40, 00, 48, c7, 40, 08, 00, 00, 00, 00, 45, 31, c0, eb, bb, 0f, 1f, 00, 45, 31, c0, eb, b3, 66, 2e, 0f, 1f, 84, 00, 00, 00, 00, 00, 90, f3, 0f, 1e, fa, 41, b8, 01, 00, 00, 00, 31, c9, 31, d2, e9, 2d, 00, 00, 00, 66, 2e, 0f, 1f, 84, 00, 00, 00, 00, 00, 0f, 1f, 00, f3, 0f, 1e, fa, 48, 89, 7c, 24, f8, 31, d2, 64, 48, 8b, 04, 25, 10, 00, 00, 00, f0, 48, 0f, b1, 54, 24, f8, c3, 0f, 1f, 40, 00, f3, 0f, 1e, fa, 41, 57, 41, 56, 41, 55, 41, 54, 55, 53, 48, 83, ec, 48, 64, 48, 8b, 04, 25, 28, 00, 00, 00, 48, 89, 44, 24, 38, 31, c0, 48, 85, ff, 0f, 84, 55, 01, 00, 00, 8b, 87, d0, 02, 00], using defaults
[2025-12-24 09:55:43.823900 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:692] Loading Python unwind info for process#167555: autoTLSkey=0x5612cd086d0c, version=0x30a
[2025-12-24 09:55:43.825627 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#167555 loaded 0 and reused 4 dwarf entry shards
[2025-12-24 09:55:43.827956 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#167556 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 09:55:43.828473 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:330] Could not find autoTLSkey from disassembly, using fallback offset 0x556cc5770d0c
[2025-12-24 09:55:43.830614 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#167556 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 09:55:43.831121 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:330] Could not find autoTLSkey from disassembly, using fallback offset 0x556cc5770d0c
[2025-12-24 09:55:43.833548 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:676] Failed to extract TSD info for process#167556: Could not extract TSD info from x86_64 code (len=256). Dump: [f3, 0f, 1e, fa, 83, ff, 1f, 77, 37, 8d, 47, 31, 48, c1, e0, 04, 64, 48, 03, 04, 25, 10, 00, 00, 00, 4c, 8b, 40, 08, 4d, 85, c0, 74, 16, 89, ff, 48, 8d, 15, 95, 66, 18, 00, 48, 8b, 08, 48, c1, e7, 04, 48, 39, 0c, 3a, 75, 38, 4c, 89, c0, c3, 0f, 1f, 40, 00, 81, ff, ff, 03, 00, 00, 77, 38, 89, fa, 89, f8, 83, e2, 1f, c1, e8, 05, 64, 48, 8b, 04, c5, 10, 05, 00, 00, 49, 89, c0, 48, 85, c0, 74, d5, 48, c1, e2, 04, 48, 01, d0, eb, ad, 0f, 1f, 40, 00, 48, c7, 40, 08, 00, 00, 00, 00, 45, 31, c0, eb, bb, 0f, 1f, 00, 45, 31, c0, eb, b3, 66, 2e, 0f, 1f, 84, 00, 00, 00, 00, 00, 90, f3, 0f, 1e, fa, 41, b8, 01, 00, 00, 00, 31, c9, 31, d2, e9, 2d, 00, 00, 00, 66, 2e, 0f, 1f, 84, 00, 00, 00, 00, 00, 0f, 1f, 00, f3, 0f, 1e, fa, 48, 89, 7c, 24, f8, 31, d2, 64, 48, 8b, 04, 25, 10, 00, 00, 00, f0, 48, 0f, b1, 54, 24, f8, c3, 0f, 1f, 40, 00, f3, 0f, 1e, fa, 41, 57, 41, 56, 41, 55, 41, 54, 55, 53, 48, 83, ec, 48, 64, 48, 8b, 04, 25, 28, 00, 00, 00, 48, 89, 44, 24, 38, 31, c0, 48, 85, ff, 0f, 84, 55, 01, 00, 00, 8b, 87, d0, 02, 00], using defaults
[2025-12-24 09:55:43.859635 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:692] Loading Python unwind info for process#167556: autoTLSkey=0x556cc5770d0c, version=0x30a
[2025-12-24 09:55:46.127457 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#167556 loaded 21 and reused 217 dwarf entry shards
[2025-12-24 09:55:46.127818 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:406] process#467351 exe: /deepflow-agent lib: n/a
[2025-12-24 09:55:46.184062 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#467351 loaded 4 and reused 1 dwarf entry shards
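
For reference, the first bytes of the dump in these log lines decode to `endbr64; cmp edi, 0x1f; ja ...; lea eax, [rdi+0x31]; shl rax, 4; add rax, fs:[0x10]; mov r8, [rax+8]`, that is, the value is loaded from `fs:[0x10] + (key + 0x31) * 16 + 8`. The sketch below is a toy byte-pattern matcher for exactly this instruction sequence and nothing more general. It is not the PR's decoder: `TsdInfo`, `operand_after` and `extract_tsd_info` are hypothetical names, and it assumes that `fs:[0x10]` holds the thread pointer itself, as in glibc's x86_64 TCB layout.

```rust
/// Hypothetical result type; the field names mirror the log output only.
#[derive(Debug, PartialEq)]
struct TsdInfo {
    offset: u64,
    multiplier: u64,
    indirect: bool,
}

/// First 29 bytes of the dump printed in the log.
const DUMP_PREFIX: [u8; 29] = [
    0xf3, 0x0f, 0x1e, 0xfa, 0x83, 0xff, 0x1f, 0x77, 0x37, 0x8d, 0x47, 0x31, 0x48, 0xc1, 0xe0,
    0x04, 0x64, 0x48, 0x03, 0x04, 0x25, 0x10, 0x00, 0x00, 0x00, 0x4c, 0x8b, 0x40, 0x08,
];

/// Finds the first occurrence of `prefix` and returns the index just past
/// its one-byte operand together with that operand.
fn operand_after(code: &[u8], prefix: &[u8]) -> Option<(usize, u8)> {
    let start = code
        .windows(prefix.len() + 1)
        .position(|w| w.starts_with(prefix))?;
    let end = start + prefix.len() + 1;
    Some((end, code[end - 1]))
}

/// Matches `lea eax,[rdi+disp8]`, `shl rax,imm8`, `add rax,fs:[abs32]`,
/// `mov r8,[rax+disp8]` in that order and folds them into offset/multiplier.
fn extract_tsd_info(code: &[u8]) -> Option<TsdInfo> {
    let (n, bias) = operand_after(code, &[0x8d, 0x47])?; // lea eax, [rdi + bias]
    let code = &code[n..];
    let (n, shift) = operand_after(code, &[0x48, 0xc1, 0xe0])?; // shl rax, shift
    let code = &code[n..];
    let (n, _) = operand_after(code, &[0x64, 0x48, 0x03, 0x04, 0x25])?; // add rax, fs:[..]
    let code = &code[n..];
    let (_, disp) = operand_after(code, &[0x4c, 0x8b, 0x40])?; // mov r8, [rax + disp]

    // disp8 operands are signed; this sketch only handles non-negative ones.
    if bias > 0x7f || disp > 0x7f {
        return None;
    }
    let multiplier = 1u64.checked_shl(shift as u32)?;
    let offset = (bias as u64).checked_mul(multiplier)?.checked_add(disp as u64)?;
    Some(TsdInfo { offset, multiplier, indirect: false })
}
```

Applied to `DUMP_PREFIX`, the matcher yields `0x31 * 16 + 8 = 792` for the offset and `16` for the multiplier, which agrees with the `offset=792, multiplier=16, indirect=0` values reported in the later log lines.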

After the improvement, both can be resolved:

[2025-12-24 17:36:32.923060 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:647] process#716079 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 17:36:32.923882 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:348] Found autoTLSkey address 0x563c423bfd0c (mov/lea from RAX base)
[2025-12-24 17:36:32.925889 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:647] process#716079 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 17:36:32.926559 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:348] Found autoTLSkey address 0x563c423bfd0c (mov/lea from RAX base)
[2025-12-24 17:36:32.929640 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:910] Extracted TSD info for process#716079: offset=792, multiplier=16, indirect=0
[2025-12-24 17:36:32.929671 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:933] Loading Python unwind info for process#716079: autoTLSkey=0x563c423bfd0c, version=0x30a
[2025-12-24 17:36:35.836817 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#716079 loaded 220 and reused 2 dwarf entry shards
[2025-12-24 17:36:35.837000 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:647] process#716157 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 17:36:35.837719 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:348] Found autoTLSkey address 0x55fd6237ad0c (mov/lea from RAX base)
[2025-12-24 17:36:35.837872 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:647] process#716157 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 17:36:35.838533 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:348] Found autoTLSkey address 0x55fd6237ad0c (mov/lea from RAX base)
[2025-12-24 17:36:35.839018 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:910] Extracted TSD info for process#716157: offset=792, multiplier=16, indirect=0
[2025-12-24 17:36:35.839037 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:933] Loading Python unwind info for process#716157: autoTLSkey=0x55fd6237ad0c, version=0x30a
[2025-12-24 17:36:35.841000 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#716157 loaded 0 and reused 5 dwarf entry shards
[2025-12-24 17:36:35.843305 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:647] process#716158 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 17:36:35.844015 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:348] Found autoTLSkey address 0x55719c484d0c (mov/lea from RAX base)
[2025-12-24 17:36:35.846249 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:647] process#716158 exe: /usr/bin/python3.10 lib: n/a
[2025-12-24 17:36:35.846863 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:348] Found autoTLSkey address 0x55719c484d0c (mov/lea from RAX base)
[2025-12-24 17:36:35.849387 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:910] Extracted TSD info for process#716158: offset=792, multiplier=16, indirect=0
[2025-12-24 17:36:35.849409 +08:00] DEBUG [crates/trace-utils/src/unwind/python.rs:933] Loading Python unwind info for process#716158: autoTLSkey=0x55719c484d0c, version=0x30a
[2025-12-24 17:36:38.306521 +08:00] DEBUG [crates/trace-utils/src/unwind.rs:369] process#716158 loaded 21 and reused 217 dwarf entry shards
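
One plausible reading of `offset=792, multiplier=16, indirect=0` is that the thread-specific value for a pthread key lives at `tpbase + offset + key * multiplier`, with one extra pointer dereference when `indirect` is set. That interpretation is an assumption based on the field names, not something confirmed by the logs. A minimal Rust sketch under that assumption follows; `TsdInfo`, `tsd_slot_address` and the `read_u64` callback are hypothetical and stand in for however the agent actually reads target memory.

```rust
/// Hypothetical mirror of the values printed in the log line.
#[derive(Clone, Copy, Debug)]
struct TsdInfo {
    offset: u64,     // byte offset from the thread pointer base
    multiplier: u64, // stride between consecutive key slots
    indirect: bool,  // true if the slot array is reached through a pointer
}

/// Computes the address of the slot that holds the value for `key`.
/// `read_u64` reads a pointer-sized value from the target process and is
/// only called when `info.indirect` is set.
fn tsd_slot_address(
    tpbase: u64,
    key: u64,
    info: TsdInfo,
    read_u64: impl Fn(u64) -> Option<u64>,
) -> Option<u64> {
    let mut base = tpbase.checked_add(info.offset)?;
    if info.indirect {
        // The location at tpbase + offset stores a pointer to the slot array.
        base = read_u64(base)?;
    }
    base.checked_add(key.checked_mul(info.multiplier)?)
}
```

With the values from the log and a key of 2, the slot would sit at `tpbase + 792 + 2 * 16 = tpbase + 824`. Checked arithmetic is used so that a bogus key or base address produces `None` rather than a wrapped address.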

@kylewanginchina force-pushed the fix-multithread-python-unwind branch from cbd7c5a to 325f4da on December 24, 2025 11:22