Skip to content

Fixed issue with exception in server run method not being logged#5796

Merged
dlmarion merged 5 commits intoapache:mainfrom
dlmarion:server-failure-not-logged
Aug 19, 2025
Merged

Fixed issue with exception in server run method not being logged#5796
dlmarion merged 5 commits intoapache:mainfrom
dlmarion:server-failure-not-logged

Conversation

@dlmarion
Copy link
Contributor

The use of try-with-resources in the server process' main method causes the server's close method to always be called. When an exception occurs in the runServer method it may not be logged because the invocation of the close method consumes enough time that the lock watcher is fired which ends up halting the JVM.

Closes #5794

The use of try-with-resources in the server process' main method
causes the server's close method to always be called. When an
exception occurs in the runServer method it may not be logged
because the invocation of the close method consumes enough time
that the lock watcher is fired which ends up halting the JVM.

Closes apache#5794
@dlmarion dlmarion added this to the 4.0.0 milestone Aug 13, 2025
@dlmarion dlmarion self-assigned this Aug 13, 2025
@dlmarion
Copy link
Contributor Author

I found this issue because I started 4.0 software on a 2.1.4 instance without running the upgrade --start command. I expected to see the following in the error log:

Thread 'manager' died.
java.lang.IllegalStateException: Upgrade not started, /upgrade_progress node does not exist. Did you run 'accumulo upgrade --start'?
        at org.apache.accumulo.server.util.upgrade.UpgradeProgressTracker.continueUpgrade(UpgradeProgressTracker.java:81)
        at org.apache.accumulo.manager.upgrade.UpgradeCoordinator.continueUpgrade(UpgradeCoordinator.java:159)
        at org.apache.accumulo.manager.Manager.setManagerState(Manager.java:365)
        at org.apache.accumulo.manager.Manager.run(Manager.java:953)
        at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
        at java.base/java.lang.Thread.run(Thread.java:840)

The above was never logged, instead the only thing that I saw in the log was the fact that the server's lock was gone and that the JVM was being halted.

AbstractServer.close was an empty method until #5726, which was included as part of the 2.1.4 release.

@ddanielr
Copy link
Contributor

ddanielr commented Aug 13, 2025

If an exception is thrown in the .runServer() methods, then .close() is never called. Does that cause any issues?

The above was never logged, instead the only thing that I saw in the log was the fact that the server's lock was gone and that the JVM was being halted.

Is this due to the ZK lock being lost by closing the client context? I'm curious that one logger is able to print a message showing the lock is gone, but the exception message wasn't flushed.

I thought try-with-resources will append any exception in the close() method with the one from the original resource.
I wonder if it "holding onto" the original exception message is why it's not ending up in the error log.

In general, if we don't care about any cleanup actions being called in the .close() then this change seems fine, but I'm not 100% sure it solves the problem and without code comments, I could see this change getting accidentally refactored back into try-with resource blocks.

Could the exception be logged in AbstractServer.runServer() before throwing the exception up to the try-with-resources block in the various Main methods?

    Throwable thrown = err.get();
    if (thrown != null) {
      log.error(thrown.getMessage());
      if (thrown instanceof Error) {
        throw (Error) thrown;
      }
      if (thrown instanceof Exception) {
        throw (Exception) thrown;
      }
      throw new RuntimeException("Weird throwable type thrown", thrown);
    }

@dlmarion
Copy link
Contributor Author

If an exception is thrown in the .runServer() methods, then .close() is never called. Does that cause any issues?

I don't think so. ServerContext.close does call close on the shared conditional writers for the user and metadata tables, but I think if an exception is thrown from server's run method, then we want to exit as quickly as possible.

The above was never logged, instead the only thing that I saw in the log was the fact that the server's lock was gone and that the JVM was being halted.

Is this due to the ZK lock being lost by closing the client context? I'm curious that one logger is able to print a message showing the lock is gone, but the exception message wasn't flushed.

It's due to close being called and the JVM halting due to the lock watcher firing before the code in Main.die executes.

I thought try-with-resources will append any exception in the close() method with the one from the original resource. I wonder if it "holding onto" the original exception message is why it's not ending up in the error log.

In general, if we don't care about any cleanup actions being called in the .close() then this change seems fine, but I'm not 100% sure it solves the problem and without code comments, I could see this change getting accidentally refactored back into try-with resource blocks.

Could the exception be logged in AbstractServer.runServer() before throwing the exception up to the try-with-resources block in the various Main methods?

    Throwable thrown = err.get();
    if (thrown != null) {
      log.error(thrown.getMessage());
      if (thrown instanceof Error) {
        throw (Error) thrown;
      }
      if (thrown instanceof Exception) {
        throw (Exception) thrown;
      }
      throw new RuntimeException("Weird throwable type thrown", thrown);
    }

Yes, we could do that. I did do that and I saw the logging show up. The place where the logging is done currently is in Main.die, called from Main.execKeyword.

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this fix is working only because we are bypassing the cleanup in the close method, so it never triggers the Halt in the background thread, and instead exits abnormally through the main method with an exception and non-zero exit code (the usual way Java processes abnormally terminate).

But the trade off is that we entirely skip any attempt to clean up any resources in use by ZooKeeper, ServerContext, AbstractServer, etc., in their close methods.

If we are resigned to do that, then why bother implementing any close methods at all? And not even bother to make them AutoCloseable. If the only way to exit abnormally is from the background thread that ends up calling Halt, we could just do that every time, without bothering to write or maintain any cleanup close methods.

The quick fix is to log the message first and then try to clean up in the close method... which will trigger the halt.

A more complicated fix would be to track that the main method is about to exit through some atomic state variable, and then to skip any attempt to halt in the background thread... but I'm not sure if that's a good idea.

This also makes me question whether we can even get graceful shutdowns at all with a zero exit code. If the close method is what is triggering the eventual Halt, then it's possible that the JVM will halt during cleanup, even when runServer exits without an exception (such as due to a graceful shutdown), and then the process exits with a non-zero exit code even though it was intended to be a graceful shutdown.

@dlmarion
Copy link
Contributor Author

The quick fix is to log the message first and then try to clean up in the close method... which will trigger the halt.

Yeah, that's fair. I can modify this PR to do just that.

@dlmarion dlmarion requested a review from ctubbsii August 18, 2025 17:32
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there it would be easy to write a small final method on the abstract server base class to call that would then call runServer and do this logging, so there's less copy pasted boilerplate code.

I'm not sure how best to handle the non zero exit code. Normally, it's bad form to call System.exit, and we'd normally want to just let the exception propagate, but in this case, that's not happening always because of the background thread doing the halting. Still, if the background thread doesn't halt first, we don't want to exit with a zero exit code here. Re throwing the logged exception would be fine, but may result in redundant messages to STDOUT. That's probably fine though.

I'm still not sure how the try with resources affects a graceful shutdown... it seems likely to also cause a halt, but this PR doesn't have to address that right now.

Comment on lines +1070 to +1073
System.err.println("Compactor died, exception thrown from runServer.");
e.printStackTrace();
LOG.error("Compactor died, exception thrown from runServer.", e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine but causes it to exit with a normal exit code, 0. Normally an exception being thrown through the main class would cause a non-zero exit code.

@dlmarion dlmarion requested a review from ctubbsii August 19, 2025 11:53
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: The consolidation is good, and it rethrows in case of an exception so the exit code is always correct.

Neutral: This doesn't use try-with-resources, but it could instead of the finally block. I think the only difference is the log message that says that the exception happened in the close, but I don't think that difference in log messages is all that critical.

Bad: In the case of a clean exit, for a graceful shutdown, I think the background thread could still halt the process and produce an exception and non-zero exit code, but as I've mentioned, it's okay if this PR doesn't try to address that specific case.

@dlmarion
Copy link
Contributor Author

Bad: In the case of a clean exit, for a graceful shutdown, I think the background thread could still halt the process and produce an exception and non-zero exit code, but as I've mentioned, it's okay if this PR doesn't try to address that specific case.

The lock watcher may halt the process, but would do so with a 0 exit code. All of the server processes run methods end in this same general pattern:

context.getLowMemoryDetector().logGCInfo(getConfiguration());
super.close();
getShutdownComplete().set(true);
log.info("TServerInfo: stop requested. exiting ... ");
try {
tabletServerLock.unlock();
} catch (Exception e) {
log.warn("Failed to release tablet server lock", e);
}

The last thing that happens is unlocking the lock, which may end up causing the ServiceLockWatcher to fire before the JVM ends. However, it uses a 0 exit code in this case because the shutdownComplete object is set to true.

if (shutdownComplete.get()) {
Halt.halt(0,
server + " lost lock (reason = " + reason
+ "), exiting cleanly because shutdown is complete.",
() -> lostLockAction.accept(server));
} else {

@dlmarion dlmarion merged commit 7960039 into apache:main Aug 19, 2025
8 checks passed
@dlmarion dlmarion deleted the server-failure-not-logged branch August 19, 2025 14:56
keith-turner added a commit to keith-turner/accumulo that referenced this pull request Aug 19, 2025
Follow up to apache#5796 that attempts to handle the following case.

 1. server.runServer() throws an error
 2. server.close() throws en exception

When the above happens the error will never be logged and its also not
attached to the exception thrown in the finally block by server.close().
Therefore the error is lost and there is no indication it happeneded.
With these changes the exception thrown by server.close() will be
attached to the error thrown by server.runServer().
keith-turner added a commit that referenced this pull request Aug 19, 2025
Follow up to #5796 that attempts to handle the following case.

 1. server.runServer() throws an error
 2. server.close() throws en exception or error

When the above happens the error will never be logged and its also not
attached to the exception thrown in the finally block by server.close().
Therefore the error is lost and there is no indication it happened. Changed 
the code to catch throwable to avoid loss of information in this case.


Co-authored-by: Dave Marion <dlmarion@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4.0 Upgrade error being suppressed somewhere

3 participants