Added ExitCodesIT to test process exit codes under various conditions #5811

Merged
dlmarion merged 11 commits into apache:main from dlmarion:exit-codes-it
Aug 27, 2025

Conversation

@dlmarion
Contributor

No description provided.

@dlmarion added this to the 4.0.0 milestone Aug 20, 2025
@dlmarion requested a review from ctubbsii August 20, 2025 21:24
@dlmarion self-assigned this Aug 20, 2025
zooCache.get().close();
}
if (zooKeeperOpened.get()) {
zooSession.get().close();
Contributor Author

This needed to be moved to the end because it closes ZooKeeper, which deletes all ephemeral nodes and fires any watchers.

} // end while
} catch (Exception e) {
LOG.error("Unhandled error occurred in Compactor", e);
} finally {
Contributor Author

The change here was to remove the finally block. It caused the normal close code to run even in the Exception or Error case.

assertEquals(1, exitValue);
}
} else {
// TODO:
Contributor Author

This should be resolved; I wasn't quite sure how to fix it.

@dlmarion
Contributor Author

Best viewed using the "Hide Whitespace" option.

Member

@ctubbsii left a comment

Neat strategy for testing exit codes... I'm not sure how easy it's going to be to maintain, as it involves a bit of a learning curve to understand how it works. I'm also wondering if it might be fragile when we make general improvements. Even so, it might be worth it.


if (acquiredLock) {
- Halt.halt(-1,
+ Halt.halt(1,
Member

👍 to using positive numbers.
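
One possible reason positive values are easier to reason about, sketched here under the assumption of a POSIX-like platform where the parent process only observes the low 8 bits of a child's exit status: a requested status of -1 is reported as 255. The class and method names below are illustrative and are not part of Accumulo.

```java
/** Illustrative helper; not part of Accumulo. */
final class ExitStatusSketch {

  /** Models how a POSIX parent observes a child's exit status (low 8 bits only). */
  static int observedOnPosix(int requestedStatus) {
    return requestedStatus & 0xFF;
  }

  public static void main(String[] args) {
    System.out.println(observedOnPosix(-1)); // prints 255
    System.out.println(observedOnPosix(1)); // prints 1
  }
}
```

With a positive status, the value the process requests and the value a test asserts on are the same number.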

updateIdleStatus(true);
final AtomicReference<Throwable> err = new AtomicReference<>();
final LogSorter logSorter = new LogSorter(this);
long nextSortLogsCheckTime = System.currentTimeMillis();
Member

Not for this PR, but I wonder if this needs to use the system clock or can be made to use relative time (nanoTime).
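
A minimal sketch of what a relative-time check could look like, assuming the goal is only to decide when an interval has elapsed. `RelativeDeadline` and its methods are hypothetical names and do not come from Accumulo; the current time is passed in so the logic can be exercised without sleeping.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only; names are not from Accumulo. */
final class RelativeDeadline {
  private final long intervalNanos;
  private long nextCheckNanos;

  RelativeDeadline(long interval, TimeUnit unit, long nowNanos) {
    this.intervalNanos = unit.toNanos(interval);
    this.nextCheckNanos = nowNanos + intervalNanos;
  }

  /** Returns true and schedules the next check once the interval has elapsed. */
  boolean isDue(long nowNanos) {
    // Compare by subtraction so the check stays correct if nanoTime overflows.
    if (nowNanos - nextCheckNanos >= 0) {
      nextCheckNanos = nowNanos + intervalNanos;
      return true;
    }
    return false;
  }
}
```

In real use the caller would pass `System.nanoTime()` for `nowNanos`. Unlike `System.currentTimeMillis()`, `System.nanoTime()` is not affected by wall-clock adjustments, which is the usual argument for using it to measure elapsed time.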

Comment on lines +355 to +360
// Must set shutdown as completed before calling super.close().
// super.close() calls ServerContext.close() ->
// ClientContext.close() -> ZooSession.close() which removes
// all of the ephemeral nodes and forces the watches to fire.
getShutdownComplete().set(true);
log.info("stop requested. exiting ... ");
try {
gcLock.unlock();
} catch (Exception e) {
log.warn("Failed to release GarbageCollector lock", e);
}

super.close();
Member

Is that something that we always do together? If so, I wonder if it can be done inside the close, or if the getShutdownComplete() stuff is needed at all.
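
The ordering concern in the code comment above can be sketched as follows. This is a simplified model, not Accumulo code: `FakeSession` is a hypothetical stand-in for a ZooKeeper session whose close fires every watcher, and the watcher is assumed to consult a shutdown flag before treating the event as an unexpected lock loss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class CloseOrderingSketch {

  /** Hypothetical session: closing it fires every registered watcher. */
  static final class FakeSession {
    private final List<Runnable> watchers = new ArrayList<>();

    void watch(Runnable watcher) {
      watchers.add(watcher);
    }

    void close() {
      watchers.forEach(Runnable::run);
    }
  }

  /** Returns true if the watcher treated the close as an unexpected lock loss. */
  static boolean lockLossReported(boolean markShutdownFirst) {
    AtomicBoolean shutdownComplete = new AtomicBoolean(false);
    AtomicBoolean reported = new AtomicBoolean(false);
    FakeSession session = new FakeSession();
    // Assumed watcher behavior: only complain if shutdown was not already complete.
    session.watch(() -> {
      if (!shutdownComplete.get()) {
        reported.set(true);
      }
    });
    if (markShutdownFirst) {
      shutdownComplete.set(true);
    }
    session.close();
    shutdownComplete.set(true);
    return reported.get();
  }
}
```

In this model, `lockLossReported(true)` returns false and `lockLossReported(false)` returns true, which is why the flag has to be set before the session is closed.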

serverClass = TabletServer.class;
methodName = "updateIdleStatus";
ctorParams = new Class<?>[] {ConfigOpts.class, Function.class, String[].class};
ctorArgs = new Object[] {new ConfigOpts(), new ServerContextFunction(),
Member

Can't you inline the ServerContextFunction like the server processes normally do, as in:

Suggested change
- ctorArgs = new Object[] {new ConfigOpts(), new ServerContextFunction(),
+ ctorArgs = new Object[] {new ConfigOpts(), ServerContext::new,

Contributor Author

I tried this initially, and I got a compilation error that says "The target type of this expression must be a functional interface".
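
For context, this error is what the Java compiler reports when a method reference appears where the expected type is `Object`, as it is for an element of an `Object[]` initializer. A cast to a functional interface supplies the missing target type. The sketch below uses a hypothetical `Context` class in place of `ServerContext`; whether a cast like this would be preferable in ExitCodesIT is not something the thread establishes.

```java
import java.util.function.Function;

final class TargetTypeSketch {

  /** Hypothetical stand-in for a class whose constructor is referenced. */
  static final class Context {
    final String name;

    Context(String name) {
      this.name = name;
    }
  }

  static Object[] buildArgs() {
    // "new Object[] {Context::new}" does not compile: Object is not a functional
    // interface, so the method reference has no target type. The cast provides one.
    return new Object[] {"opts", (Function<String, Context>) Context::new};
  }

  @SuppressWarnings("unchecked")
  static Context invoke(Object[] args, String input) {
    return ((Function<String, Context>) args[1]).apply(input);
  }
}
```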

Comment on lines +103 to +106
// Determine the constructor arguments and parameters for each server class.
// Find a method with no-args that does not return anything that is
// called during the servers run method that we can intercept to signal
// shutdown, exception, or error.
Member

Good comment... but this is wild 😅

@dlmarion
Contributor Author

I'm seeing an inconsistency in how we handle Exceptions and Errors in the server threads (AbstractServer.run).

In all other threads the AccumuloUncaughtExceptionHandler will:

  1. Log an error that the thread is dead on an uncaught Exception
  2. Print the stack trace and a message to stderr then halt the VM on an uncaught Error

For critical tasks put into a ThreadPool, we have a method that watches them and throws an ExecutionError if the task has failed. This Error is uncaught and would bubble up to the AccumuloUncaughtExceptionHandler, terminating the VM.

We made some changes recently to AbstractServer in #5796 and #5808 such that a new method AbstractServer.startServer exists because exceptions and errors being thrown from AbstractServer.runServer were not being logged. This method looks like:

  public static void startServer(AbstractServer server, Logger LOG) throws Exception {
    try {
      server.runServer();
    } catch (Throwable e) {
      System.err
          .println(server.getClass().getSimpleName() + " died, exception thrown from runServer.");
      e.printStackTrace();
      LOG.error("{} died, exception thrown from runServer.", server.getClass().getSimpleName(), e);
      throw e;
    } finally {
      try {
        server.close();
      } catch (Throwable e) {
        System.err.println("Exception thrown while closing " + server.getClass().getSimpleName());
        e.printStackTrace();
        LOG.error("Exception thrown while closing {}", server.getClass().getSimpleName(), e);
        throw e;
      }
    }
  }

With this new method, if AbstractServer.runServer completes successfully, then the server process does a normal shutdown finishing with a call to AbstractServer.close. Due to the finally block in AbstractServer.startServer, the server's close method will be called again. If AbstractServer.runServer throws an Exception or Error, then the server's close method will be called by the finally block in AbstractServer.startServer, but the other normal shutdown stuff won't occur.
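
The two cases described above can be reproduced in a small model. `FakeServer` is a hypothetical stand-in, assumed to call `close()` at the end of its normal run path as the paragraph describes; it is not the real AbstractServer.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class DoubleCloseSketch {

  /** Hypothetical server whose normal shutdown path ends with close(). */
  static final class FakeServer {
    final AtomicInteger closeCalls = new AtomicInteger();
    final boolean failInRun;

    FakeServer(boolean failInRun) {
      this.failInRun = failInRun;
    }

    void runServer() {
      if (failInRun) {
        throw new IllegalStateException("simulated failure in run");
      }
      close(); // normal shutdown finishes with close()
    }

    void close() {
      closeCalls.incrementAndGet();
    }
  }

  /** Same try/finally shape as the startServer method discussed above. */
  static void startWithFinally(FakeServer server) {
    try {
      server.runServer();
    } finally {
      server.close();
    }
  }
}
```

In this model a normal run ends with `closeCalls` at 2, while a failing run ends with `closeCalls` at 1 and none of the normal shutdown steps executed.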

Looking at one of the logs from ExitCodesIT we see the following:

2025-08-21T12:59:37,729 12 [exit_codes.TabletServer_ERROR] INFO : TabletServer_ERROR process shut down.
2025-08-21T12:59:37,729 12 [tserver.TabletServer] ERROR: TabletServer_ERROR died, exception thrown from runServer.
java.lang.StackOverflowError: throwing unknown error
	at org.apache.accumulo.test.functional.exit_codes.TabletServer_ERROR.updateIdleStatus(Unknown Source) ~[?:?]
	at org.apache.accumulo.tserver.TabletServer.run(TabletServer.java:607) ~[classes/:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[classes/:?]
	at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
2025-08-21T12:59:37,733 12 [metrics.MetricsInfoImpl] INFO : Closing metrics registry
2025-08-21T12:59:37,750 12 [server.ServerContext] DEBUG: Shutting down shared executor pool
2025-08-21T12:59:37,750 12 [clientImpl.ClientContext] DEBUG: Closing ZooCache
2025-08-21T12:59:37,750 12 [clientImpl.ClientContext] DEBUG: Closing ZooSession
2025-08-21T12:59:37,752 21 [lock.ServiceLock] DEBUG: [zlock#5c6acefc-a1c0-4fdc-8c81-a7acd162c56e#] zlock#5c6acefc-a1c0-4fdc-8c81-a7acd162c56e#0000000000 was deleted; WatchedEvent state:SyncConnected type:NodeDeleted path:/tservers/TEST/ip-10-113-15-120.evoforge.org:33045/zlock#5c6acefc-a1c0-4fdc-8c81-a7acd162c56e#0000000000 zxid: 827
2025-08-21T12:59:37,754 21 [util.Halt] ERROR: FATAL TABLET_SERVER lost lock (reason = LOCK_DELETED), exiting.

In any other thread, the VM would be halted without closing. I'm thinking that the finally block from AbstractServer.startServer should be removed. Thoughts?

@keith-turner
Contributor

> In any other thread, the VM would be halted without closing. I'm thinking that the finally block from AbstractServer.startServer should be removed. Thoughts?

Are you thinking of something like the following? This would only call close when there is no exception in runServer.

  public static void startServer(AbstractServer server, Logger LOG) throws Exception {
    try {
      server.runServer();
      server.close();
    } catch (Exception e) {
      System.err
          .println(server.getClass().getSimpleName() + " died, exception thrown from runServer.");
      e.printStackTrace();
      LOG.error("{} died, exception thrown from runServer.", server.getClass().getSimpleName(), e);
      throw e;
    }
  }

Or is the call to close() not needed at all in this code because of the comment about AbstractServer.close()?

@dlmarion
Contributor Author

Yeah, I don't think server.close is needed. If runServer exits normally without Exception/Error, then it's already been called.

e.printStackTrace();
LOG.error("{} died, exception thrown from runServer.", server.getClass().getSimpleName(), e);
throw e;
} finally {
Contributor

A bit further up, we could change the catch to catch(Exception e) instead of catch(Throwable e). The only reason it was catching Throwable was in case the finally block threw an exception.

Contributor Author

I thought it was for the logging of Error, not just Exception.

Contributor

It was for logging an error in the case where the finally block threw an exception, in which case the error would be lost and would not propagate. Without the finally block, the error will propagate now. So it depends on what the outer code does with it.
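
The masking behavior described here is standard Java: when both the try body and the finally block throw, only the exception from the finally block reaches the caller. A minimal sketch with illustrative names:

```java
final class MaskedErrorSketch {

  /** Throws from both the body and the finally block. */
  static void bodyAndFinallyBothThrow() {
    try {
      throw new StackOverflowError("original failure");
    } finally {
      // Completing the finally block abruptly discards the pending Error above.
      throw new IllegalStateException("failure while closing");
    }
  }

  /** Returns whatever Throwable actually propagates to the caller. */
  static Throwable whatTheCallerSees() {
    try {
      bodyAndFinallyBothThrow();
      return null;
    } catch (Throwable t) {
      return t;
    }
  }
}
```

Here the caller sees the IllegalStateException, and the StackOverflowError is not attached to it as a suppressed exception, so it is lost unless it was logged before the finally block ran.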

Contributor Author

Fixed in ec5cd1d

Contributor

Do we still want to catch an exception and log here to achieve the goals of #5796? I was commenting about catching Throwable because it was only added in #5808 because of the finally. So not sure we need to catch Throwable, but may still want to catch exception.

Contributor Author

I don't think we need to. I checked the logs and the Exception and Error are logged correctly. The issue in #5796 was an exception in the try-with-resources was being suppressed when an exception happened in the close. try-with-resources is no longer being used.


context.getLowMemoryDetector().logGCInfo(getConfiguration());

// Must set shutdown as completed before calling super.close().
Contributor

This comment mentions calling super.close(), but I do not see that call.

Contributor Author

Updated comment in ec5cd1d

}
continue;
} finally {
currentCompactionId.set(null);
Contributor

Does not seem like this should be cleared here because it's used for dead compaction detection and it should be set for the entire time that a compactor is running a compaction.

Contributor Author

Yep, good catch.

Contributor Author

Removed finally block in ec5cd1d

LOG.warn("Failed to close filesystem : {}", e.getMessage(), e);
}

super.close();
Contributor

Seems like the new pattern in this PR is that super.close() is always called at the end of each run() method. Wondering if it would make sense to push this into AbstractServer like the following. This would centralize the pattern to one place in the code.

  public void runServer() throws Exception {
    final AtomicReference<Throwable> err = new AtomicReference<>();
    serverThread = new Thread(TraceUtil.wrap(()->{
      this.run();
      close();
    }), applicationName);

Contributor Author

Implemented in ec5cd1d

compactorArgs.add("-o");
compactorArgs.add(Property.COMPACTOR_GROUP_NAME.getKey() + "=TEST");
serverClass = Compactor.class;
methodName = "updateIdleStatus";
Contributor

What happens w/ the test if this method does not exist? Does the test fail?
Seems like maybe it would fail because the SHUTDOWN case would expect an exit code of zero but would see some other exit code. But not sure.

Wondering if, instead of this run-time overriding, it would be possible to do it at compile time w/ something like the following. Then if someone changes a method name they will get a compile-time error.

class ExitCompactor extends Compactor {

  private final TerminalBehavior terminalBehavior;

  protected ExitCompactor(ConfigOpts opts, String[] args, TerminalBehavior terminalBehavior) {
    super(opts, args);
    this.terminalBehavior = terminalBehavior;
  }

  @Override
  public void updateIdleStatus(boolean idle){
    switch (terminalBehavior){
      case SHUTDOWN:
        super.requestShutdownForTests();
        break;
      case ERROR:
        throw new StackOverflowError();
      case EXCEPTION:
        throw new RuntimeException();
    }
  }
}

Contributor Author

> What happens w/ the test if this method does not exist? Does the test fail?
> Seems like maybe it would fail because the SHUTDOWN case would expect an exit code of zero but would see some other exit code. But not sure.

I can make it fail explicitly by adding the following to ProcessProxy.main:

diff --git a/test/src/main/java/org/apache/accumulo/test/functional/ExitCodesIT.java b/test/src/main/java/org/apache/accumulo/test/functional/ExitCodesIT.java
index 4f8f948d26..c17a30ab6e 100644
--- a/test/src/main/java/org/apache/accumulo/test/functional/ExitCodesIT.java
+++ b/test/src/main/java/org/apache/accumulo/test/functional/ExitCodesIT.java
@@ -155,6 +155,15 @@ public class ExitCodesIT extends SharedMiniClusterBase {
           throw new UnsupportedOperationException(st + " is not currently supported");
       }
 
+      // Check that methodName exists on serverClass
+      try {
+        @SuppressWarnings("unused")
+        var ignored = serverClass.getDeclaredMethod(methodName);
+      } catch (NoSuchMethodException nsme) {
+        nsme.printStackTrace();
+        System.exit(42);
+      }
+

> Wondering if, instead of this run-time overriding, it would be possible to do it at compile time w/ something like the following. Then if someone changes a method name they will get a compile-time error.

I think I could make something that is functionally equivalent without using ByteBuddy, but it would be more code. I started with ByteBuddy initially because I wasn't sure of the design when I started, but I knew that I basically needed a proxy for each server class.
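
The run-time existence check in the diff above relies on `Class.getDeclaredMethod`, which, when called with only a name, looks for a method with no parameters declared directly on that class (inherited methods are not considered). A self-contained sketch of the same idea, where `Target` and `hasNoArgMethod` are hypothetical names:

```java
final class MethodExistsSketch {

  /** Hypothetical class to probe. */
  static class Target {
    void tick() {}

    void withArg(boolean flag) {}
  }

  /** Returns true if clazz itself declares a no-arg method with the given name. */
  static boolean hasNoArgMethod(Class<?> clazz, String name) {
    try {
      clazz.getDeclaredMethod(name);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }
}
```

A method that exists but takes parameters, such as `withArg` here, is reported as missing by this check, so the parameter types would need to be passed to `getDeclaredMethod` to match it.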

Contributor

> I think I could make something that is functionally equivalent without using ByteBuddy, but it would be more code.

If it's not much more code then there are benefits. Can fully navigate and manipulate the code in an IDE and easily see what the test references. Also, if a method is renamed, we will get a quick compile-time failure instead of this test failing and then trying to figure out why.
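
A self-contained version of the subclassing idea, for illustration only. `FakeServer` is a hypothetical stand-in for a real server class such as Compactor, and `requestShutdown` is an assumed hook; the point is that `@Override` ties the test subclass to the method name at compile time.

```java
final class CompileTimeOverrideSketch {

  enum TerminalBehavior { SHUTDOWN, EXCEPTION, ERROR }

  /** Hypothetical stand-in for a server class. */
  static class FakeServer {
    boolean shutdownRequested;

    void updateIdleStatus(boolean idle) {
      // normal behavior elided
    }

    void requestShutdown() {
      shutdownRequested = true;
    }
  }

  /** If updateIdleStatus is renamed in FakeServer, @Override makes this fail to compile. */
  static final class ExitServer extends FakeServer {
    private final TerminalBehavior behavior;

    ExitServer(TerminalBehavior behavior) {
      this.behavior = behavior;
    }

    @Override
    void updateIdleStatus(boolean idle) {
      switch (behavior) {
        case SHUTDOWN:
          requestShutdown();
          break;
        case EXCEPTION:
          throw new RuntimeException("simulated exception");
        case ERROR:
          throw new StackOverflowError("simulated error");
      }
    }
  }
}
```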

Contributor Author

Removed ByteBuddy and subclassed server processes in ec5cd1d

// We need to let this time out and then
// terminate the process.
IllegalStateException ise = assertThrows(IllegalStateException.class,
() -> Wait.waitFor(() -> !pi.getProcess().isAlive(), 120_000));
Contributor

Does this always wait for 2 mins? If so, could we lower time to wait to 30s or 60s?

Contributor Author

I can change it to 1m. The other two tests for the GC take 30-35s.

@dlmarion merged commit 501efad into apache:main Aug 27, 2025
8 checks passed
@dlmarion deleted the exit-codes-it branch August 27, 2025 17:53
3 participants