-
Notifications
You must be signed in to change notification settings - Fork 73
Work around ssh client hanging while opening a session #226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
zyga
wants to merge
2
commits into
canonical:master
Choose a base branch
from
zyga:fix/hang-on-reboot
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
sergiocazzolato
added a commit
to sergiocazzolato/spread
that referenced
this pull request
Mar 26, 2025
This is the change done by Zyga, which is merged in spread-plus
sergiocazzolato
added a commit
to sergiocazzolato/spread
that referenced
this pull request
Apr 1, 2025
This reverts commit 30423ae.
Contributor
Author
|
@cmatsuoka this requires small improvements to handle non-local case where instead talking to qemu, we talk across an actual network. |
When spread reboots it keeps trying to read the boot ID from the
system. This is done by opening a new "session" on the existing
connection and running "cat" on a file in /proc.
There are several problems with current approach:
1) The code to run arbitrary commands needed to read bootID ... also
reads bootID (calling itself) in some cases, which feels wrong.
2) SSH client does not seem to be resilient against SSHD that doesn't
respond at the exactly inconvenient time. As the remote machine is
rebooting we are still trying to create a new session, which requires
communication.
As a workaround, create a new SSH connection and a new session on _that_
connection in a separate go-routine. In the unfortunate case when the
go-routine hangs, we just try again.
With the crypto/ssh package modified (`const debugMux = true`) to log messages,
we can see this sequence of events when a hang occurs. The lines prefixed with
(spread) are from spread, the rest are from the crypto/ssh library:
```
(spread) creating new ssh session with client 0xc0002485f0
2025/03/25 18:58:20 send global(12): ssh.channelOpenMsg{ChanType:"session", PeersID:0xc, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2025/03/25 18:58:20 decoding(12): 80 &ssh.globalRequestMsg{Type:"hostkeys-00@openssh.com", WantReply:false, Data:[]uint8{0x0, 0x0, 0x1, 0x97, 0x0, 0x0, 0x0, 0x7, 0x73, 0x73, 0x68, 0x2d, 0x72, 0x73, 0x61, 0x0, 0x0, 0x0, 0x3, 0x1, 0x0, 0x1, 0x0, 0x0, 0x1, 0x81, 0x0, 0xc3, 0xf4, 0xe6, 0x5c, 0x1a, 0xfb, 0x5a, 0x3e, 0xa6, 0x4e, 0x28, 0xb5, 0x9d, 0xf9, 0xa1, 0x4b, 0x88, 0x9a, 0xe5, 0x6b, 0x89, 0x2f, 0x90, 0x48, 0x7d, 0x9b, 0x6c, 0xe6, 0x24, 0x55, 0x87, 0x82, 0x5d, 0x32, 0x54, 0x4e, 0xc6, 0xe8, 0x3a, 0x94, 0x14, 0xc9, 0x2c, 0x16, 0xac, 0x83, 0x86, 0x2e, 0xa6, 0xf6, 0x91, 0x92, 0x77, 0x81, 0xfe, 0x0, 0xf2, 0x29, 0xe6, 0xc9, 0x3e, 0xb9, 0x16, 0x35, 0xa4, 0x89, 0xdb, 0xc1, 0xf7, 0xf2, 0x9c, 0x6a, 0x66, 0xbf, 0x2, 0xc1, 0x38, 0x6c, 0xfc, 0xce, 0x21, 0x5b, 0x89, 0x6b, 0xcd, 0xc3, 0x42, 0xc0, 0xbc, 0x1e, 0x90, 0x46, 0x5a, 0x26, 0x82, 0x16, 0x4d, 0x35, 0x61, 0x2, 0x21, 0x9c, 0x4f, 0x8d, 0xa7, 0xf5, 0x18, 0x28, 0x22, 0xe8, 0x94, 0xca, 0xa2, 0xfa, 0xb3, 0xb0, 0xdf, 0x22, 0xc2, 0x9b, 0xf5, 0xdf, 0xd3, 0x25, 0xfa, 0x80, 0x20, 0x0, 0xa1, 0xb0, 0xc2, 0x39, 0xef, 0x9b, 0x88, 0x7f, 0x5c, 0x89, 0x36, 0xd0, 0x0, 0xd3, 0xa8, 0xb4, 0x92, 0x23, 0x48, 0xfa, 0xa3, 0x67, 0x4e, 0xea, 0xac, 0x35, 0x2e, 0xce, 0x13, 0x12, 0xbe, 0x1, 0xaa, 0x8d, 0x9f, 0xa5, 0x43, 0x16, 0x90, 0x9d, 0xa4, 0xf6, 0xa7, 0xaf, 0xc6, 0x1d, 0xd, 0xfb, 0x9d, 0xf2, 0x6b, 0xc1, 0x0, 0x30, 0x6, 0xbd, 0xa4, 0x25, 0xa5, 0xf0, 0x34, 0xff, 0xa4, 0x19, 0x14, 0x83, 0x64, 0x65, 0xbd, 0x4d, 0xa3, 0xf4, 0x5, 0xb6, 0x8d, 0x16, 0xf7, 0xcf, 0x5e, 0x2b, 0x7d, 0x15, 0x9d, 0x8e, 0x36, 0x17, 0x74, 0x82, 0x4d, 0xab, 0x61, 0x67, 0x20, 0x27, 0x3, 0xf, 0x59, 0xeb, 0x8b, 0x3e, 0xbd, 0x7d, 0xb9, 0xe3, 0xc6, 0x72, 0xf3, 0x7d, 0xdc, 0x6a, 0x6d, 0xd4, 0x4d, 0xb3, 0xfa, 0xaf, 0x1d, 0x2d, 0xb0, 0x92, 0x81, 0x61, 0xf3, 0x5e, 0xe6, 0xb3, 0xb8, 0xa, 0x58, 0x4f, 0xa5, 0x21, 0x4b, 0x8c, 0x98, 0x46, 0x44, 0x9b, 0xaf, 0x87, 0x9b, 0x21, 0xe5, 0x8b, 0x6b, 0xb0, 0x2a, 0x52, 0xc7, 0xb0, 0x23, 0x0, 0x72, 0x3c, 0xb2, 0x72, 0x6f, 0x5, 0xa, 0x80, 0x61, 0x9a, 0xb6, 0x5a, 0xea, 0xcf, 0x61, 0x3c, 0x4f, 0xfe, 0x3, 0xed, 0x85, 0xdb, 0x31, 0xad, 0x1f, 0xe4, 0xf5, 0x1a, 0xa4, 0xc, 0xe2, 0xc7, 0x57, 0x97, 0x81, 0x1e, 0x9, 0xa0, 0x83, 0xfc, 0x13, 0x9, 0x2c, 0x1, 0x5c, 0xc7, 0x4c, 0xee, 0x8, 0xcc, 0x2d, 0x3a, 0xea, 0xe1, 0xdf, 0x71, 0x88, 0xec, 0x3d, 0x5f, 0x1e, 0x8b, 0x2c, 0xbd, 0x17, 0x40, 0x65, 0x2f, 0x1b, 0x1a, 0x7, 0xac, 0x53, 0x76, 0x56, 0xf0, 0xbc, 0x0, 0xcc, 0x1f, 0xf0, 0xa5, 0xd7, 0xbf, 0xca, 0xf9, 0xbe, 0x0, 0x40, 0x2e, 0xaf, 0xfd, 0x9a, 0x90, 0xfa, 0xc6, 0x4e, 0xdf, 0x53, 0x44, 0xb8, 0x49, 0x15, 0xc9, 0x0, 0x0, 0x0, 0x68, 0x0, 0x0, 0x0, 0x13, 0x65, 0x63, 0x64, 0x73, 0x61, 0x2d, 0x73, 0x68, 0x61, 0x32, 0x2d, 0x6e, 0x69, 0x73, 0x74, 0x70, 0x32, 0x35, 0x36, 0x0, 0x0, 0x0, 0x8, 0x6e, 0x69, 0x73, 0x74, 0x70, 0x32, 0x35, 0x36, 0x0, 0x0, 0x0, 0x41, 0x4, 0xa9, 0xc8, 0x76, 0xe9, 0xa0, 0x5d, 0xb, 0x3b, 0x47, 0x1a, 0x19, 0xe5, 0x66, 0x4c, 0xed, 0xf0, 0x9b, 0xe7, 0xb7, 0xa2, 0xc0, 0x2a, 0xc3, 0x40, 0xa4, 0x5, 0x3b, 0x38, 0xa4, 0x17, 0xa4, 0x2b, 0x7f, 0x44, 0x9c, 0xfd, 0xfb, 0xf8, 0x7b, 0xd8, 0xaf, 0xde, 0x94, 0x59, 0x8d, 0xf5, 0xb9, 0x9a, 0xf3, 0x5b, 0x9a, 0x4f, 0x2b, 0x9d, 0xcb, 0x72, 0xdc, 0xd6, 0x44, 0xce, 0x9a, 0xf3, 0x7b, 0x2e, 0x0, 0x0, 0x0, 0x33, 0x0, 0x0, 0x0, 0xb, 0x73, 0x73, 0x68, 0x2d, 0x65, 0x64, 0x32, 0x35, 0x35, 0x31, 0x39, 0x0, 0x0, 0x0, 0x20, 0xa0, 0x8d, 0x14, 0x41, 0xcd, 0x24, 0x4e, 0x40, 0x8d, 0x36, 0xb9, 0xda, 0x34, 0xd1, 0x1d, 0xb3, 0x1e, 0x2f, 0xee, 0x97, 0xc8, 0x78, 0x6b, 0xbd, 0x56, 0xa5, 0x76, 0x53, 0xb, 0x12, 0xf7, 0x1a}} - 603 bytes
2025/03/25 18:58:20 decoding(12): 91 &ssh.channelOpenConfirmMsg{PeersID:0xc, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
(spread) ssh session ready
2025/03/25 18:58:20 send(12): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x13, 0x73, 0x75, 0x64, 0x6f, 0x20, 0x2d, 0x69, 0x20, 0x2f, 0x62, 0x69, 0x6e, 0x2f, 0x62, 0x61, 0x73, 0x68, 0x20, 0x2d}}
2025/03/25 18:58:20 decoding(12): 93 &ssh.windowAdjustMsg{PeersID:0xc, AdditionalBytes:0x200000} - 9 bytes
2025/03/25 18:58:20 decoding(12): 99 &ssh.channelRequestSuccessMsg{PeersID:0xc} - 5 bytes
2025/03/25 18:58:20 send(12): ssh.channelEOFMsg{PeersID:0x0}
```
For comparison, a successful exchange looks like this:
```
(spread) creating new ssh session with client 0xc0002484b0
2025/03/25 18:58:20 send global(11): ssh.channelOpenMsg{ChanType:"session", PeersID:0xb, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2025/03/25 18:58:20 loop exit read tcp 127.0.0.1:50842->127.0.0.1:5000: use of closed network connection
2025/03/25 18:58:20 decoding(11): 80 &ssh.globalRequestMsg{Type:"hostkeys-00@openssh.com", WantReply:false, Data:[]uint8{0x0, 0x0, 0x1, 0x97, 0x0, 0x0, 0x0, 0x7, 0x73, 0x73, 0x68, 0x2d, 0x72, 0x73, 0x61, 0x0, 0x0, 0x0, 0x3, 0x1, 0x0, 0x1, 0x0, 0x0, 0x1, 0x81, 0x0, 0xc3, 0xf4, 0xe6, 0x5c, 0x1a, 0xfb, 0x5a, 0x3e, 0xa6, 0x4e, 0x28, 0xb5, 0x9d, 0xf9, 0xa1, 0x4b, 0x88, 0x9a, 0xe5, 0x6b, 0x89, 0x2f, 0x90, 0x48, 0x7d, 0x9b, 0x6c, 0xe6, 0x24, 0x55, 0x87, 0x82, 0x5d, 0x32, 0x54, 0x4e, 0xc6, 0xe8, 0x3a, 0x94, 0x14, 0xc9, 0x2c, 0x16, 0xac, 0x83, 0x86, 0x2e, 0xa6, 0xf6, 0x91, 0x92, 0x77, 0x81, 0xfe, 0x0, 0xf2, 0x29, 0xe6, 0xc9, 0x3e, 0xb9, 0x16, 0x35, 0xa4, 0x89, 0xdb, 0xc1, 0xf7, 0xf2, 0x9c, 0x6a, 0x66, 0xbf, 0x2, 0xc1, 0x38, 0x6c, 0xfc, 0xce, 0x21, 0x5b, 0x89, 0x6b, 0xcd, 0xc3, 0x42, 0xc0, 0xbc, 0x1e, 0x90, 0x46, 0x5a, 0x26, 0x82, 0x16, 0x4d, 0x35, 0x61, 0x2, 0x21, 0x9c, 0x4f, 0x8d, 0xa7, 0xf5, 0x18, 0x28, 0x22, 0xe8, 0x94, 0xca, 0xa2, 0xfa, 0xb3, 0xb0, 0xdf, 0x22, 0xc2, 0x9b, 0xf5, 0xdf, 0xd3, 0x25, 0xfa, 0x80, 0x20, 0x0, 0xa1, 0xb0, 0xc2, 0x39, 0xef, 0x9b, 0x88, 0x7f, 0x5c, 0x89, 0x36, 0xd0, 0x0, 0xd3, 0xa8, 0xb4, 0x92, 0x23, 0x48, 0xfa, 0xa3, 0x67, 0x4e, 0xea, 0xac, 0x35, 0x2e, 0xce, 0x13, 0x12, 0xbe, 0x1, 0xaa, 0x8d, 0x9f, 0xa5, 0x43, 0x16, 0x90, 0x9d, 0xa4, 0xf6, 0xa7, 0xaf, 0xc6, 0x1d, 0xd, 0xfb, 0x9d, 0xf2, 0x6b, 0xc1, 0x0, 0x30, 0x6, 0xbd, 0xa4, 0x25, 0xa5, 0xf0, 0x34, 0xff, 0xa4, 0x19, 0x14, 0x83, 0x64, 0x65, 0xbd, 0x4d, 0xa3, 0xf4, 0x5, 0xb6, 0x8d, 0x16, 0xf7, 0xcf, 0x5e, 0x2b, 0x7d, 0x15, 0x9d, 0x8e, 0x36, 0x17, 0x74, 0x82, 0x4d, 0xab, 0x61, 0x67, 0x20, 0x27, 0x3, 0xf, 0x59, 0xeb, 0x8b, 0x3e, 0xbd, 0x7d, 0xb9, 0xe3, 0xc6, 0x72, 0xf3, 0x7d, 0xdc, 0x6a, 0x6d, 0xd4, 0x4d, 0xb3, 0xfa, 0xaf, 0x1d, 0x2d, 0xb0, 0x92, 0x81, 0x61, 0xf3, 0x5e, 0xe6, 0xb3, 0xb8, 0xa, 0x58, 0x4f, 0xa5, 0x21, 0x4b, 0x8c, 0x98, 0x46, 0x44, 0x9b, 0xaf, 0x87, 0x9b, 0x21, 0xe5, 0x8b, 0x6b, 0xb0, 0x2a, 0x52, 0xc7, 0xb0, 0x23, 0x0, 0x72, 0x3c, 0xb2, 0x72, 0x6f, 0x5, 0xa, 0x80, 0x61, 0x9a, 0xb6, 0x5a, 0xea, 0xcf, 0x61, 0x3c, 0x4f, 0xfe, 0x3, 0xed, 0x85, 0xdb, 0x31, 0xad, 0x1f, 0xe4, 0xf5, 0x1a, 0xa4, 0xc, 0xe2, 0xc7, 0x57, 0x97, 0x81, 0x1e, 0x9, 0xa0, 0x83, 0xfc, 0x13, 0x9, 0x2c, 0x1, 0x5c, 0xc7, 0x4c, 0xee, 0x8, 0xcc, 0x2d, 0x3a, 0xea, 0xe1, 0xdf, 0x71, 0x88, 0xec, 0x3d, 0x5f, 0x1e, 0x8b, 0x2c, 0xbd, 0x17, 0x40, 0x65, 0x2f, 0x1b, 0x1a, 0x7, 0xac, 0x53, 0x76, 0x56, 0xf0, 0xbc, 0x0, 0xcc, 0x1f, 0xf0, 0xa5, 0xd7, 0xbf, 0xca, 0xf9, 0xbe, 0x0, 0x40, 0x2e, 0xaf, 0xfd, 0x9a, 0x90, 0xfa, 0xc6, 0x4e, 0xdf, 0x53, 0x44, 0xb8, 0x49, 0x15, 0xc9, 0x0, 0x0, 0x0, 0x68, 0x0, 0x0, 0x0, 0x13, 0x65, 0x63, 0x64, 0x73, 0x61, 0x2d, 0x73, 0x68, 0x61, 0x32, 0x2d, 0x6e, 0x69, 0x73, 0x74, 0x70, 0x32, 0x35, 0x36, 0x0, 0x0, 0x0, 0x8, 0x6e, 0x69, 0x73, 0x74, 0x70, 0x32, 0x35, 0x36, 0x0, 0x0, 0x0, 0x41, 0x4, 0xa9, 0xc8, 0x76, 0xe9, 0xa0, 0x5d, 0xb, 0x3b, 0x47, 0x1a, 0x19, 0xe5, 0x66, 0x4c, 0xed, 0xf0, 0x9b, 0xe7, 0xb7, 0xa2, 0xc0, 0x2a, 0xc3, 0x40, 0xa4, 0x5, 0x3b, 0x38, 0xa4, 0x17, 0xa4, 0x2b, 0x7f, 0x44, 0x9c, 0xfd, 0xfb, 0xf8, 0x7b, 0xd8, 0xaf, 0xde, 0x94, 0x59, 0x8d, 0xf5, 0xb9, 0x9a, 0xf3, 0x5b, 0x9a, 0x4f, 0x2b, 0x9d, 0xcb, 0x72, 0xdc, 0xd6, 0x44, 0xce, 0x9a, 0xf3, 0x7b, 0x2e, 0x0, 0x0, 0x0, 0x33, 0x0, 0x0, 0x0, 0xb, 0x73, 0x73, 0x68, 0x2d, 0x65, 0x64, 0x32, 0x35, 0x35, 0x31, 0x39, 0x0, 0x0, 0x0, 0x20, 0xa0, 0x8d, 0x14, 0x41, 0xcd, 0x24, 0x4e, 0x40, 0x8d, 0x36, 0xb9, 0xda, 0x34, 0xd1, 0x1d, 0xb3, 0x1e, 0x2f, 0xee, 0x97, 0xc8, 0x78, 0x6b, 0xbd, 0x56, 0xa5, 0x76, 0x53, 0xb, 0x12, 0xf7, 0x1a}} - 603 bytes
2025/03/25 18:58:20 decoding(11): 91 &ssh.channelOpenConfirmMsg{PeersID:0xb, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
(spread) ssh session ready
2025/03/25 18:58:20 send(11): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x13, 0x73, 0x75, 0x64, 0x6f, 0x20, 0x2d, 0x69, 0x20, 0x2f, 0x62, 0x69, 0x6e, 0x2f, 0x62, 0x61, 0x73, 0x68, 0x20, 0x2d}}
2025/03/25 18:58:20 decoding(11): 93 &ssh.windowAdjustMsg{PeersID:0xb, AdditionalBytes:0x200000} - 9 bytes
2025/03/25 18:58:20 decoding(11): 99 &ssh.channelRequestSuccessMsg{PeersID:0xb} - 5 bytes
2025/03/25 18:58:20 send(11): ssh.channelEOFMsg{PeersID:0x0}
2025/03/25 18:58:20 decoding(11): data packet - 46 bytes
2025/03/25 18:58:20 decoding(11): 96 &ssh.channelEOFMsg{PeersID:0xb} - 5 bytes
2025/03/25 18:58:20 decoding(11): 98 &ssh.channelRequestMsg{PeersID:0xb, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2025/03/25 18:58:20 decoding(11): 97 &ssh.channelCloseMsg{PeersID:0xb} - 5 bytes
2025/03/25 18:58:20 send(11): ssh.channelCloseMsg{PeersID:0x0}
2025/03/25 18:58:20 send(11): ssh.channelEOFMsg{PeersID:0x0}
2025/03/25 18:58:20 send(11): ssh.channelCloseMsg{PeersID:0x0}
```
I've tested this locally with a Debian 12 system running in Qemu. The project
reboots once in `prepare`. Without this patch the hang occurs every several
minutes on a Core Ultra 125H system. With this patch I was unable to reproduce
the hang for over an hour.
Signed-off-by: Zygmunt Krynicki <me@zygoon.pl>
Signed-off-by: Zygmunt Krynicki <me@zygoon.pl>
6fe0c97 to
3791148
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When spread reboots it keeps trying to read the boot ID from the system. This is done by opening a new "session" on the existing connection and running "cat" on a file in /proc.
There are several problems with current approach:
The code to run arbitrary commands needed to read bootID ... also
reads bootID (calling itself) in some cases, which feels wrong.
SSH client does not seem to be resilient against SSHD that doesn't
respond at the exactly inconvenient time. As the remote machine is
rebooting we are still trying to create a new session, which requires
communication.
As a workaround, create a new SSH connection and a new session on that connection in a separate go-routine. In the unfortunate case when the go-routine hangs, we just try again.
With the crypto/ssh package modified (
const debugMux = true) to log messages, we can see this sequence of events when a hang occurs. The lines prefixed with (spread) are from spread, the rest are from the crypto/ssh library:For comparison, a successful exchange looks like this:
I've tested this locally with a Debian 12 system running in Qemu. The project reboots once in
prepare. Without this patch the hang occurs every several minutes on a Core Ultra 125H system. With this patch I was unable to reproduce the hang for over an hour.