Skip to content

Conversation

@myk002
Copy link
Member

@myk002 myk002 commented Nov 29, 2024

@myk002 myk002 merged commit 33acc1f into develop Nov 29, 2024
25 of 26 checks passed
@myk002 myk002 deleted the revert-5040-fix_console_autocomplete branch November 29, 2024 21:05
lethosor added a commit to lethosor/dfhack that referenced this pull request Nov 30, 2024
Previously, calling a Lua C API function that throws an error (e.g. with any of
the `lua_error()` family of functions, which use `luaD_throw()` internally) in a
function that also uses our own `StackUnwinder` would deadlock, because:

- `luaD_throw()` calls `lua_lock()` but apparently expects something else to
  call `lua_unlock()`
- `~StackUnwinder()` calls `lua_settop()` to rewind the stack, which internally
  also calls `lua_lock()` and `lua_unlock()`. This is called before the mutex
  can be unlocked.

This was the root cause of the issue behind DFHack#5040, DFHack#5055, DFHack#5056 - `GetVector()`
was called with an invalid index (only when invoked from `gui/launcher`, because
this changed the data on the Lua stack) and threw an exception, which caused the
`~StackUnwinder()` destructor to run while the Lua state was still locked.

Other things to note:
- We control the locking/unlocking implementation - the default, defined in
  `llimits.h`, is a _no-op_.
- @ab9rf points out that the `CRITICAL_SECTION` API we're using on Windows
  already appears to be equivalent to a recursive mutex, so this issue likely
  would not have occurred there.

The issue can be reproduced relatively easily with a simple test command, e.g.:

```diff
@@ -716,6 +723,15 @@ void ls_helper(color_ostream &con, const std::vector<std::string> &params) {
 command_result Core::runCommand(color_ostream &con, const std::string &first_, std::vector<std::string> &parts)
 {
     std::string first = first_;
+    if (first == "luaerror") {
+        auto L = Lua::Core::State;
+        Lua::StackUnwinder top(L);
+
+        luaL_error(L, "test error");
+
+        return CR_OK;
+    }
+
     CommandDepthCounter counter;
     if (!counter.ok())
     {
```

In `gui/launcher`:
- before this change, invoking `luaerror` would hang DF.
- after this change, `luaerror` prints `nil` in red to the native console and
  does nothing. This comes from the last-resort error handler in
  `LuaTools.cpp:report_error()`, and is equivalent to how other unexpected
  errors are handled in code invoked from Lua. Not ideal, but better than a
  crash.

From the native console, invoking `luaerror` crashes DF in both cases (SIGABRT),
and logs `PANIC: unprotected error in call to Lua API (test error)` to
`stderr.log`. The difference in behavior is because `report_error()` above is
part of the error handling system present when code is called _from Lua_,
through variants of `SafeCall`. There is no Lua layer involved in the native
console. (Note: the same crash would have been observed in the original issue
in DFHack#5056 et. al. if the error had occurred when invoked through the console.)
myk002 pushed a commit that referenced this pull request Dec 1, 2024
Previously, calling a Lua C API function that throws an error (e.g. with any of
the `lua_error()` family of functions, which use `luaD_throw()` internally) in a
function that also uses our own `StackUnwinder` would deadlock, because:

- `luaD_throw()` calls `lua_lock()` but apparently expects something else to
  call `lua_unlock()`
- `~StackUnwinder()` calls `lua_settop()` to rewind the stack, which internally
  also calls `lua_lock()` and `lua_unlock()`. This is called before the mutex
  can be unlocked.

This was the root cause of the issue behind #5040, #5055, #5056 - `GetVector()`
was called with an invalid index (only when invoked from `gui/launcher`, because
this changed the data on the Lua stack) and threw an exception, which caused the
`~StackUnwinder()` destructor to run while the Lua state was still locked.

Other things to note:
- We control the locking/unlocking implementation - the default, defined in
  `llimits.h`, is a _no-op_.
- @ab9rf points out that the `CRITICAL_SECTION` API we're using on Windows
  already appears to be equivalent to a recursive mutex, so this issue likely
  would not have occurred there.

The issue can be reproduced relatively easily with a simple test command, e.g.:

```diff
@@ -716,6 +723,15 @@ void ls_helper(color_ostream &con, const std::vector<std::string> &params) {
 command_result Core::runCommand(color_ostream &con, const std::string &first_, std::vector<std::string> &parts)
 {
     std::string first = first_;
+    if (first == "luaerror") {
+        auto L = Lua::Core::State;
+        Lua::StackUnwinder top(L);
+
+        luaL_error(L, "test error");
+
+        return CR_OK;
+    }
+
     CommandDepthCounter counter;
     if (!counter.ok())
     {
```

In `gui/launcher`:
- before this change, invoking `luaerror` would hang DF.
- after this change, `luaerror` prints `nil` in red to the native console and
  does nothing. This comes from the last-resort error handler in
  `LuaTools.cpp:report_error()`, and is equivalent to how other unexpected
  errors are handled in code invoked from Lua. Not ideal, but better than a
  crash.

From the native console, invoking `luaerror` crashes DF in both cases (SIGABRT),
and logs `PANIC: unprotected error in call to Lua API (test error)` to
`stderr.log`. The difference in behavior is because `report_error()` above is
part of the error handling system present when code is called _from Lua_,
through variants of `SafeCall`. There is no Lua layer involved in the native
console. (Note: the same crash would have been observed in the original issue
in #5056 et. al. if the error had occurred when invoked through the console.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants