Async signal/bail safety #940
Merged
This draft PR documents an attempt to make road promotion safer with respect to async signal handling and u3m_bail calls.
jets.c
nock.c
Problem
In Vere we have nonlinear control flow: exceptions are raised with u3m_bail and caught in various wrappers like u3m_soft_run, and signals are handled via u3m_soft_top/u3m_signal. So there are three ways we can "return" from a wrapped computation:
u3m_bail call. The exception will either be turned into a noun, allowing Nock virtualization via metacircular jets, or be raised again to bubble up to the top, eventually crashing the event transaction and printing a stack trace. We still promote and integrate the accumulated persistent state;
I noticed the latter when experimenting with the Ford Lightning build system: hitting ^C mid-build would interrupt it, and a retried build would take less time than an uninterrupted first build. This is because some persistent caches had already been accumulated and promoted to the home road during SIGINT handling.
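The bail-and-catch pattern described above can be sketched with standard setjmp/longjmp. This is a minimal illustration and not Vere's implementation: sketch_bail, sketch_soft_run, active_frame, and bail_code are hypothetical names, and the real wrappers also manage road state, which is omitted here.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not Vere's implementation. */
static jmp_buf* active_frame = NULL;  /* innermost catch point */
static int      bail_code    = 0;     /* code carried by the last bail */

/* Raise an "exception": unwind to the innermost wrapper. */
static void sketch_bail(int code)
{
  assert(active_frame != NULL && code != 0);
  bail_code = code;
  longjmp(*active_frame, 1);
}

/* Run fun(arg) under a catch point. Returns 0 and stores the result
 * in *out on a normal return, or returns the bail code if fun bailed
 * (in which case *out is left untouched). */
static int sketch_soft_run(int (*fun)(int), int arg, int* out)
{
  jmp_buf  frame;
  jmp_buf* outer = active_frame;  /* remember the enclosing wrapper */
  int      code;

  active_frame = &frame;
  if (setjmp(frame) == 0) {
    *out = fun(arg);              /* normal path */
    code = 0;
  } else {
    code = bail_code;             /* we arrived here via longjmp */
  }
  active_frame = outer;           /* restore the enclosing wrapper */
  return code;
}

/* Example computation: bails with code 7 on odd input. */
static int half_or_bail(int n)
{
  if (n % 2 != 0) {
    sketch_bail(7);
  }
  return n / 2;
}
```

Calling sketch_soft_run(half_or_bail, 10, &out) returns 0 with out set to 5, while an odd argument returns 7 and leaves out unchanged. Any state that half_or_bail had half-modified before bailing would stay half-modified, which is the hazard this PR is concerned with.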
This rang an alarm bell in my head: for this to work properly, we would need to make sure that all functions that modify road state (u3h_*, functions that modify bytecode programs) are async-signal safe; otherwise, we might be promoting state that does not uphold its invariants.
After a discussion at core blitz on ~2026.1.9, @joemfb raised a concern that the issue has a broader scope, as signals are not the only piece of nonlinear control flow in the codebase. Functions that modify road state must also be exception safe: at each u3m_bail call site the road state needs to conform to its invariants too.
It is worth noting that not all invariants need to be upheld at either a bail or a signal: for HAMTs it is sufficient to be walkable and have valid values. An example of a HAMT having an invalid value: if a SIGINT comes between lines 1502-1506 here:
vere/pkg/noun/jets.c
Lines 1502 to 1506 in 33671ea
the site struct will have a bytecode program pointer that does not correspond to the battery stored in the bat field.
(This particular example cannot produce a bug, because these fields are erased anyway when we call a program from a senior road.)
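The kind of window described above, where two related fields are written one after the other, can be illustrated with a generic sketch. The types and function names below are hypothetical stand-ins and do not reproduce the referenced lines of jets.c; the steps parameter only mimics an interruption after a given number of writes. Real async-signal safety would additionally need the compiler and hardware to respect the write order, which this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not Vere's types. */
typedef struct { int bat; } prog_t;                     /* program compiled for battery `bat` */
typedef struct { int bat; const prog_t* pog; } site_t;  /* call site with a cached program */

/* Invariant: the cached program is absent or matches the battery. */
static int site_ok(const site_t* s)
{
  return s->pog == NULL || s->pog->bat == s->bat;
}

/* Fragile order: after the first write the old program is still
 * cached against the new battery. `steps` is the number of writes
 * performed before the simulated interruption. */
static void update_fragile(site_t* s, int bat, const prog_t* pog, int steps)
{
  if (steps-- <= 0) return;
  s->bat = bat;
  if (steps-- <= 0) return;
  s->pog = pog;
}

/* Ordered update: drop the cached program first, so the invariant
 * holds after every individual write. */
static void update_ordered(site_t* s, int bat, const prog_t* pog, int steps)
{
  if (steps-- <= 0) return;
  s->pog = NULL;
  if (steps-- <= 0) return;
  s->bat = bat;
  if (steps-- <= 0) return;
  s->pog = pog;
}
```

Interrupting update_fragile after one write leaves a site for which site_ok is false, whereas update_ordered satisfies site_ok at every stopping point. This matches the weaker requirement stated above: the structure only has to stay walkable with valid values, not fully up to date.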
Solution
I decided to minimize the surface area of code where the async-signal safety is required by only promoting the road state if:
With that, the only thing we do when an async signal or a nondeterministic error is raised is copy out nouns.
Nouns are (almost) async-signal safe
The only operation that is not async-signal safe is u3i_edit, if it is performed on a mutable noun (refcount of 1). The only place where it is used is the Nock bytecode interpreter, and in case of any crash the product from the road stack is not promoted anyway, so the stack traces and error reports are safe.
HAMTs are fine
The only calls to u3m_bail in hashtable.c were in u3_assert (bail with the c3__oops mote, which is not recoverable) and in the u3h/t macros used to disassemble key-value pairs. These are always cells by construction, so no change is necessary, but I replaced them with asserting versions for documentation purposes, just to be sure.
Nock interpreter: to verify
A cursory look at nock.c/jets.c didn't raise any alarms, but I might be missing something. In jets.c, u3h/t macros are used everywhere to disassemble nouns from jet state and to disassemble cores for Nock 9 calls. The former should be infallible, and the latter could legitimately crash, so more attention is necessary there.
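The distinction between infallible and fallible disassembly can be sketched as two accessors over a toy noun model. Everything here is hypothetical (noun_t, head_sure, head_soft); in particular, the fallible version returns NULL only to keep the sketch self-contained, whereas the macros discussed above bail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical noun model, not Vere's representation. */
typedef struct noun {
  int                is_cell;  /* 1 for a cell, 0 for an atom */
  const struct noun* hed;      /* head, valid only for cells */
  const struct noun* tal;      /* tail, valid only for cells */
  unsigned           val;      /* value, used only by atoms */
} noun_t;

/* Asserting accessor: for internal state that is a cell by
 * construction. A failure here is a bug, not a recoverable error. */
static const noun_t* head_sure(const noun_t* n)
{
  assert(n->is_cell);
  return n->hed;
}

/* Fallible accessor: for nouns of unknown shape, such as a core
 * supplied by the caller. The failure is reported to the caller,
 * who can react before touching any shared state. */
static const noun_t* head_soft(const noun_t* n)
{
  return n->is_cell ? n->hed : NULL;
}
```

Using the asserting form for jet state documents that the shape is guaranteed, so the remaining fallible call sites stand out as the places where state must already be consistent if the access fails.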