Fixed issue with exception in server run method not being logged by dlmarion · Pull Request #5796 · apache/accumulo

dlmarion · 2025-08-13T13:00:19Z

The use of try-with-resources in the server process' main method causes the server's close method to always be called. When an exception occurs in the runServer method it may not be logged because the invocation of the close method consumes enough time that the lock watcher is fired which ends up halting the JVM.

Closes #5794

The use of try-with-resources in the server process' main method causes the server's close method to always be called. When an exception occurs in the runServer method it may not be logged because the invocation of the close method consumes enough time that the lock watcher is fired which ends up halting the JVM. Closes apache#5794

dlmarion · 2025-08-13T13:16:05Z

I found this issue because I started 4.0 software on a 2.1.4 instance without running the upgrade --start command. I expected to see the following in the error log:

Thread 'manager' died.
java.lang.IllegalStateException: Upgrade not started, /upgrade_progress node does not exist. Did you run 'accumulo upgrade --start'?
        at org.apache.accumulo.server.util.upgrade.UpgradeProgressTracker.continueUpgrade(UpgradeProgressTracker.java:81)
        at org.apache.accumulo.manager.upgrade.UpgradeCoordinator.continueUpgrade(UpgradeCoordinator.java:159)
        at org.apache.accumulo.manager.Manager.setManagerState(Manager.java:365)
        at org.apache.accumulo.manager.Manager.run(Manager.java:953)
        at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
        at java.base/java.lang.Thread.run(Thread.java:840)

The above was never logged, instead the only thing that I saw in the log was the fact that the server's lock was gone and that the JVM was being halted.

AbstractServer.close was an empty method until #5726, which was included as part of the 2.1.4 release.

ddanielr · 2025-08-13T16:10:38Z

If an exception is thrown in the .runServer() methods, then .close() is never called. Does that cause any issues?

The above was never logged, instead the only thing that I saw in the log was the fact that the server's lock was gone and that the JVM was being halted.

Is this due to the ZK lock being lost by closing the client context? I'm curious that one logger is able to print a message showing the lock is gone, but the exception message wasn't flushed.

I thought try-with-resources will append any exception in the close() method with the one from the original resource.
I wonder if it "holding onto" the original exception message is why it's not ending up in the error log.

In general, if we don't care about any cleanup actions being called in the .close() then this change seems fine, but I'm not 100% sure it solves the problem and without code comments, I could see this change getting accidentally refactored back into try-with resource blocks.

Could the exception be logged in AbstractServer.runServer() before throwing the exception up to the try-with-resources block in the various Main methods?

    Throwable thrown = err.get();
    if (thrown != null) {
      log.error(thrown.getMessage());
      if (thrown instanceof Error) {
        throw (Error) thrown;
      }
      if (thrown instanceof Exception) {
        throw (Exception) thrown;
      }
      throw new RuntimeException("Weird throwable type thrown", thrown);
    }

dlmarion · 2025-08-13T16:32:07Z

If an exception is thrown in the .runServer() methods, then .close() is never called. Does that cause any issues?

I don't think so. ServerContext.close does call close on the shared conditional writers for the user and metadata tables, but I think if an exception is thrown from server's run method, then we want to exit as quickly as possible.

The above was never logged, instead the only thing that I saw in the log was the fact that the server's lock was gone and that the JVM was being halted.

Is this due to the ZK lock being lost by closing the client context? I'm curious that one logger is able to print a message showing the lock is gone, but the exception message wasn't flushed.

It's due to close being called and the JVM halting due to the lock watcher firing before the code in Main.die executes.

I thought try-with-resources will append any exception in the close() method with the one from the original resource. I wonder if it "holding onto" the original exception message is why it's not ending up in the error log.

In general, if we don't care about any cleanup actions being called in the .close() then this change seems fine, but I'm not 100% sure it solves the problem and without code comments, I could see this change getting accidentally refactored back into try-with resource blocks.

Could the exception be logged in AbstractServer.runServer() before throwing the exception up to the try-with-resources block in the various Main methods?
    Throwable thrown = err.get();
    if (thrown != null) {
      log.error(thrown.getMessage());
      if (thrown instanceof Error) {
        throw (Error) thrown;
      }
      if (thrown instanceof Exception) {
        throw (Exception) thrown;
      }
      throw new RuntimeException("Weird throwable type thrown", thrown);
    }

Yes, we could do that. I did do that and I saw the logging show up. The place where the logging is done currently is in Main.die, called from Main.execKeyword.

ctubbsii

It seems to me that this fix is working only because we are bypassing the cleanup in the close method, so it never triggers the Halt in the background thread, and instead exits abnormally through the main method with an exception and non-zero exit code (the usual way Java processes abnormally terminate).

But the trade off is that we entirely skip any attempt to clean up any resources in use by ZooKeeper, ServerContext, AbstractServer, etc., in their close methods.

If we are resigned to do that, then why bother implementing any close methods at all? And not even bother to make them AutoCloseable. If the only way to exit abnormally is from the background thread that ends up calling Halt, we could just do that every time, without bothering to write or maintain any cleanup close methods.

The quick fix is to log the message first and then try to clean up in the close method... which will trigger the halt.

A more complicated fix would be to track that the main method is about to exit through some atomic state variable, and then to skip any attempt to halt in the background thread... but I'm not sure if that's a good idea.

This also makes me question whether we can even get graceful shutdowns at all with a zero exit code. If the close method is what is triggering the eventual Halt, then it's possible that the JVM will halt during cleanup, even when runServer exits without an exception (such as due to a graceful shutdown), and then the process exits with a non-zero exit code even though it was intended to be a graceful shutdown.

dlmarion · 2025-08-15T12:46:47Z

The quick fix is to log the message first and then try to clean up in the close method... which will trigger the halt.

Yeah, that's fair. I can modify this PR to do just that.

ctubbsii

It seems like there it would be easy to write a small final method on the abstract server base class to call that would then call runServer and do this logging, so there's less copy pasted boilerplate code.

I'm not sure how best to handle the non zero exit code. Normally, it's bad form to call System.exit, and we'd normally want to just let the exception propagate, but in this case, that's not happening always because of the background thread doing the halting. Still, if the background thread doesn't halt first, we don't want to exit with a zero exit code here. Re throwing the logged exception would be fine, but may result in redundant messages to STDOUT. That's probably fine though.

I'm still not sure how the try with resources affects a graceful shutdown... it seems likely to also cause a halt, but this PR doesn't have to address that right now.

ctubbsii · 2025-08-19T03:47:00Z

server/compactor/src/main/java/org/apache/accumulo/compactor/Compactor.java

+        System.err.println("Compactor died, exception thrown from runServer.");
+        e.printStackTrace();
+        LOG.error("Compactor died, exception thrown from runServer.", e);
+      }


This is fine but causes it to exit with a normal exit code, 0. Normally an exception being thrown through the main class would cause a non-zero exit code.

ctubbsii

Good: The consolidation is good, and it rethrows in case of an exception so the exit code is always correct.

Neutral: This doesn't use try-with-resources, but it could instead of the finally block. I think the only difference is the log message that says that the exception happened in the close, but I don't think that difference in log messages is all that critical.

Bad: In the case of a clean exit, for a graceful shutdown, I think the background thread could still halt the process and produce an exception and non-zero exit code, but as I've mentioned, it's okay if this PR doesn't try to address that specific case.

dlmarion · 2025-08-19T14:50:06Z

Bad: In the case of a clean exit, for a graceful shutdown, I think the background thread could still halt the process and produce an exception and non-zero exit code, but as I've mentioned, it's okay if this PR doesn't try to address that specific case.

The lock watcher may halt the process, but would do so with a 0 exit code. All of the server processes run methods end in this same general pattern:

accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java

Lines 712 to 720 in 2570c50

    
           context.getLowMemoryDetector().logGCInfo(getConfiguration()); 
        
           super.close(); 
        
           getShutdownComplete().set(true); 
        
           log.info("TServerInfo: stop requested. exiting ... "); 
        
           try { 
        
             tabletServerLock.unlock(); 
        
           } catch (Exception e) { 
        
             log.warn("Failed to release tablet server lock", e); 
        
           }

The last thing that happens is unlocking the lock, which may end up causing the ServiceLockWatcher to fire before the JVM ends. However, it uses a 0 exit code in this case because the shutdownComplete object is set to true.

accumulo/core/src/main/java/org/apache/accumulo/core/lock/ServiceLockSupport.java

Lines 165 to 170 in 2570c50

    
           if (shutdownComplete.get()) { 
        
             Halt.halt(0, 
        
                 server + " lost lock (reason = " + reason 
        
                     + "), exiting cleanly because shutdown is complete.", 
        
                 () -> lostLockAction.accept(server)); 
        
           } else {

Follow up to apache#5796 that attempts to handle the following case. 1. server.runServer() throws an error 2. server.close() throws en exception When the above happens the error will never be logged and its also not attached to the exception thrown in the finally block by server.close(). Therefore the error is lost and there is no indication it happeneded. With these changes the exception thrown by server.close() will be attached to the error thrown by server.runServer().

Follow up to #5796 that attempts to handle the following case. 1. server.runServer() throws an error 2. server.close() throws en exception or error When the above happens the error will never be logged and its also not attached to the exception thrown in the finally block by server.close(). Therefore the error is lost and there is no indication it happened. Changed the code to catch throwable to avoid loss of information in this case. Co-authored-by: Dave Marion <dlmarion@apache.org>

dlmarion added this to the 4.0.0 milestone Aug 13, 2025

dlmarion self-assigned this Aug 13, 2025

ctubbsii reviewed Aug 13, 2025

View reviewed changes

dlmarion added 2 commits August 15, 2025 12:50

Merge branch 'main' into server-failure-not-logged

78b68af

Reverted change and instead just added logging to runServer failures

0101783

dlmarion requested a review from ctubbsii August 18, 2025 17:32

ctubbsii requested changes Aug 19, 2025

View reviewed changes

dlmarion added 2 commits August 19, 2025 11:26

Merge branch 'main' into server-failure-not-logged

ab037f3

Consolidated logic to a single method, modified to throw errors

0e49979

dlmarion requested a review from ctubbsii August 19, 2025 11:53

ctubbsii approved these changes Aug 19, 2025

View reviewed changes

dlmarion merged commit 7960039 into apache:main Aug 19, 2025
8 checks passed

dlmarion deleted the server-failure-not-logged branch August 19, 2025 14:56

keith-turner mentioned this pull request Aug 19, 2025

avoid loss of error thrown from server run() #5808

Merged

dlmarion mentioned this pull request Aug 21, 2025

Added ExitCodesIT to test process exit codes under various conditions #5811

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed issue with exception in server run method not being logged#5796

Fixed issue with exception in server run method not being logged#5796
dlmarion merged 5 commits intoapache:mainfrom
dlmarion:server-failure-not-logged

dlmarion commented Aug 13, 2025

Uh oh!

dlmarion commented Aug 13, 2025

Uh oh!

ddanielr commented Aug 13, 2025 •

edited

Loading

Uh oh!

dlmarion commented Aug 13, 2025

Uh oh!

ctubbsii left a comment

Uh oh!

dlmarion commented Aug 15, 2025

Uh oh!

ctubbsii left a comment

Uh oh!

ctubbsii Aug 19, 2025

Uh oh!

ctubbsii left a comment

Uh oh!

dlmarion commented Aug 19, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dlmarion commented Aug 13, 2025

Uh oh!

dlmarion commented Aug 13, 2025

Uh oh!

ddanielr commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dlmarion commented Aug 13, 2025

Uh oh!

ctubbsii left a comment

Choose a reason for hiding this comment

Uh oh!

dlmarion commented Aug 15, 2025

Uh oh!

ctubbsii left a comment

Choose a reason for hiding this comment

Uh oh!

ctubbsii Aug 19, 2025

Choose a reason for hiding this comment

Uh oh!

ctubbsii left a comment

Choose a reason for hiding this comment

Uh oh!

dlmarion commented Aug 19, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ddanielr commented Aug 13, 2025 •

edited

Loading