
Introduce Compactor configuration for dealing with consecutive failures #5726

Merged
dlmarion merged 21 commits into apache:2.1 from dlmarion:compactor-die-on-cl-error
Jul 18, 2025

Conversation

@dlmarion
Contributor

@dlmarion dlmarion commented Jul 8, 2025

This change propagates exceptions from the classloading related code. Prior to this change some of the exceptions raised by the classloading code would be caught early and a RuntimeException would be raised instead.

This change also modifies the Compactor to conditionally delay execution of the next compaction job when the Compactor has failed to complete consecutive prior compactions. Four new properties control the behavior of the new delay logic.
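To illustrate the delay logic described above, here is a minimal sketch assuming a linear scheme where the wait grows with the consecutive failure count. The class and method names are invented for illustration; the actual property names and algorithm are defined in the PR.

```java
// Hypothetical sketch: delay the next compaction job after consecutive
// failures, assuming a linear scheme where the wait grows with the count.
public class BackoffSketch {

  // e.g. with a 10s interval, the 4th consecutive failure waits 40s
  static long backoffMillis(long intervalMillis, int consecutiveFailures) {
    return intervalMillis * consecutiveFailures;
  }
}
```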

This change also modifies the API between the Compactor and Coordinator when compactions have failed. The exception class is now relayed to the coordinator, which is tracking and periodically logging a summary of the failures.

AccumuloVFSClassLoader.getContextClassLoader throws an
UncheckedIOException when there is an issue getting the
ClassLoader for a context name. This runtime exception
escapes all the way up to the calling code in Compactor
and TabletServer, which fails the compaction and moves
on to the next compaction. If there is an issue getting
the ClassLoader for the context once, then it's likely
to happen again. It's probably not safe to terminate
the TabletServer in this case, but is likely safe for
the Compactor.

This change captures the RuntimeException in the
FileCompactor.compactLocalityGroup where the iterator
stack is created and raises a new checked exception
which is handled in the calling code.
@dlmarion dlmarion added this to the 2.1.4 milestone Jul 8, 2025
@dlmarion dlmarion requested a review from keith-turner July 8, 2025 18:58
@dlmarion dlmarion self-assigned this Jul 8, 2025
@dlmarion dlmarion changed the base branch from main to 2.1 July 8, 2025 18:59
.convertItersAndLoad(env.getIteratorScope(), cfsi, acuTableConf, iterators, iterEnv));
SortedKeyValueIterator<Key,Value> stack = null;
try {
stack = IteratorConfigUtil.convertItersAndLoad(env.getIteratorScope(), cfsi, acuTableConf,
Contributor

If a table is configured w/ an incorrect iterator class name what would happen in that case with these changes?

Contributor Author

Looking at the code path, ClassLoaderUtil.loadClass would get the ClassLoader, but calling ClassLoader.loadClass would throw a ClassNotFoundException, which is not a RuntimeException. I think it would fail the compaction, pick up the next one, and keep failing until the configuration is fixed. Do you think we should add ReflectiveOperationException to this catch clause so that Compactors die if there is any issue loading iterator classes?

Contributor Author

Ignore what I said above. I wrote an IT and it turns out that an iterator class in the configuration that does not exist will be caught here. IteratorConfigUtil.loadIterators catches ReflectiveOperationException and raises a RuntimeException. I wonder if we just want to catch Exception here instead of RuntimeException.

Contributor

Do you think we should add ReflectiveOperationException to this catch clause so that Compactors die if there is any issue loading iterator classes?

Personally I would not want all compactors to die if bad config w/ an incorrect class name was placed on a table's iterator settings.

Contributor

Specifically in the context of compactors, I think we would want bad configuration to cause the process to die. This would only be expected at the start of a compaction with a new configuration. Since the sole responsibility of the compactor is to compact, it terminating is a very clear message it cannot do that process. My understanding is that it would not impact existing compactions in progress. In the absence of something like this we'd need additional metrics that capture failed compactions so we could monitor that state in addition to busy/idle. As-is we have to monitor for the impacts of failing compactions cascading across the cluster, then go back to individual compactor logs to determine who is healthy and who isn't after the fact. This is much more complex than just checking which processes are down and pulling their recent log history.

Contributor

Wondering if this fix is too narrow. Maybe we want to do something more general like the following.

  • Have a configurable consecutive compaction failure count that causes process death
  • Do exponential backoff between failed compactions.

This would more gracefully deal with consistently failing compactions that happen for any reason. Like if a compactor fails to compact 10 times in a row after backing off between each attempt, just exit the process.
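A minimal sketch of this suggestion, assuming exponential doubling of the wait with a cap; all names and thresholds here are hypothetical, not the PR's actual implementation.

```java
// Hypothetical sketch: exponential backoff between failed compactions,
// exiting the process after a configurable consecutive-failure count.
public class FailureBackoff {

  // double the wait on each consecutive failure, capped at capMillis
  static long backoffMillis(int consecutiveFailures, long baseMillis, long capMillis) {
    long delay = baseMillis << Math.min(consecutiveFailures - 1, 30);
    return Math.min(delay, capMillis);
  }

  // e.g. after 10 consecutive failures, just exit the process
  static boolean shouldExit(int consecutiveFailures, int maxConsecutiveFailures) {
    return consecutiveFailures >= maxConsecutiveFailures;
  }
}
```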

Member

I don't think we should kill compactors when there is a bad compaction configuration. That compaction should certainly be aborted, though. I think that in general, the service should remain available for future compactions, if at all possible.

Contributor Author

The code currently in this PR (as of commit 7c7ecef) no longer kills the Compactors.

Contributor Author

The code in this PR as of commit 95ebecf conditionally kills the compactor.

@dlmarion
Contributor Author

7c7ecef pushes the class of the exception that occurred on the Compactor back to the Coordinator. This is used for logging and for incrementing failure counters for the queue, compactor, and table. Subsequent compaction successes will decrement the counters for the queue, compactor, and table. Using these counters we can return an empty job back to the Compactor when the current error rate is over some threshold. We could also emit metrics from the Coordinator based on these failure counts.

The logic and accounting in 7c7ecef is not 100% correct. I pushed it up as-is to get feedback on if and how we should move forward with this idea.

@dlmarion dlmarion changed the title Cause Compactor to exit when error loading classes Introduce configurable wait into Compactor for dealing with consecutive failures Jul 14, 2025
@dlmarion
Contributor Author

6563aee modifies the Compactor such that it can be configured to wait progressively longer on consecutive compaction failures. The failures are reported to the Coordinator, which logs a failure summary every 5 minutes.

@dlmarion dlmarion changed the title Introduce configurable wait into Compactor for dealing with consecutive failures Introduce Compactor configuration for dealing with consecutive failures Jul 14, 2025
@keith-turner
Contributor

Created some test scripts to explore compaction failures so I can see what the current code does. I can also try running them against these changes.

apache/accumulo-testing#295

@dlmarion
Contributor Author

I have updated the description to match what the code in this PR currently does. I'm going to work on modifying the IT to try and test some of this new code.

@keith-turner - Except for changes from testing, and comments from PR review, I don't think I have any major changes planned for this PR if you want to start testing it with your new scripts. I did change the implementation since your last review, so it might be good to review this first before testing.

Contributor

@keith-turner keith-turner left a comment

Took a look through the changes, going to try running some tests w/ these changes.

Set<String> allCompactorAddrs = new HashSet<>();
allCompactors.values().forEach(l -> l.forEach(c -> allCompactorAddrs.add(c.toString())));
failingCompactors.keySet().retainAll(allCompactorAddrs);

Contributor

Another way this tracking could work is that it counts successes and failures. Each time it logs, it takes a snapshot of the counts, logs them, and then deducts the snapshot it logged. This will give an indication of the successes and failures since the last time this function ran. Could also only log when there is a failure, and maybe log a line per thing (interpreting large maps in the log can be difficult). Maybe something like the following.

// this is not really correct, it's assuming we also get a snapshot of the values in the map, but that is not really true
Map<String,SuccessFailureCounts> queueSnapshot = Map.copyOf(failingQueues);

queueSnapshot.forEach((queue, counts) -> {
    if (counts.failures > 0) {
        LOG.warn("Queue {} had {} successes and {} failures in the last {}ms", queue, counts.successes, counts.failures, logInterval);
        // TODO decrement counts logged from failingQueues; by decrementing only the
        // counts logged we do not lose any concurrent increments made while logging
    }
});
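A more concrete sketch of the snapshot-and-deduct idea, assuming per-queue AtomicLong counters; the class and method names here are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of snapshot-and-deduct failure counting per queue.
public class QueueFailureLog {

  static final class Counts {
    final AtomicLong successes = new AtomicLong();
    final AtomicLong failures = new AtomicLong();
  }

  private final Map<String,Counts> queues = new ConcurrentHashMap<>();

  public void recordSuccess(String queue) {
    queues.computeIfAbsent(queue, q -> new Counts()).successes.incrementAndGet();
  }

  public void recordFailure(String queue) {
    queues.computeIfAbsent(queue, q -> new Counts()).failures.incrementAndGet();
  }

  long pendingFailures(String queue) {
    Counts c = queues.get(queue);
    return c == null ? 0 : c.failures.get();
  }

  // Log only queues that saw failures, then deduct exactly what was logged so
  // concurrent increments made while logging are not lost.
  public void logSince(long intervalMillis) {
    queues.forEach((queue, counts) -> {
      long f = counts.failures.get();
      long s = counts.successes.get();
      if (f > 0) {
        System.out.printf("Queue %s had %d successes and %d failures in the last %dms%n",
            queue, s, f, intervalMillis);
        counts.failures.addAndGet(-f);
        counts.successes.addAndGet(-s);
      }
    });
  }
}
```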

Contributor

Created this PR dlmarion#56, but I have not tested the changes.

+ " again, then it will wait 40s before starting the next compaction.",
"2.1.4"),
@Experimental
COMPACTOR_FAILURE_BACKOFF_RESET("compactor.failure.backoff.reset", "10m",
Contributor

Not recommending any changes here, was just pondering something. Another way this could work is that it could set a max backoff time instead of a reset time. Once we get to that max time we stop incrementing, but do not reset until a success is seen. Not coming up w/ any advantages for this other approach though. Wondering if there is any particular reason this reset after time approach was chosen?

Contributor Author

I chose to reset to 0 instead of staying at the max time as a way to try to recover more quickly in the event that the issue was fixed. This was also the reason I didn't do exponential backoff. I'm assuming that the user will fix the issue.

* @return the class loader for the given contextName
*/
ClassLoader getClassLoader(String contextName);
ClassLoader getClassLoader(String contextName) throws IOException, ReflectiveOperationException;
Contributor

@keith-turner keith-turner Jul 15, 2025

Is there a benefit to these two specific exceptions? If we want information to travel through the code via a checked exception, then it may be better to create a very specific exception related to this SPI. This allows knowing that class loader creation failed w/o trying to guess at the specific reasons/exceptions that could cause it to fail; the specific reason should be in the cause. In general we may want to know this type of failure happened, but we probably do not care too much why it happened. Whenever it happens, for any reason, it's not good.

// maybe this should extend Exception
/**
 * @since 2.1.4
 */
public static class ClassLoaderCreationFailed extends AccumuloException {
  public ClassLoaderCreationFailed(Throwable cause) {
    super(cause);
  }
  public ClassLoaderCreationFailed(String msg, Throwable cause) {
    super(msg, cause);
  }
}

ClassLoader getClassLoader(String contextName) throws ClassLoaderCreationFailed;

We could also leave this SPI as is and create a new internal exception that is always thrown when class loader creation fails. This allows this very specific and important information to travel in the internal code. Could do the more minimal change below in 2.1 and add the checked exception to the SPI in 4.0. Not opposed to adding a checked exception in 2.1.4 to the SPI though, would need to document the breaking change in the release notes.

public class ClassLoaderUtil {
  // create this class outside of the public API... any code that attempts to create a
  // classloader and fails should throw this exception
  // this could be a checked or runtime exception... not sure which is best
  public static class ClassLoaderCreationFailed extends RuntimeException {
    public ClassLoaderCreationFailed(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  public static ClassLoader getClassLoader(String context) {
    try {
      return FACTORY.getClassLoader(context);
    } catch (Exception e) {
      throw new ClassLoaderCreationFailed("Failed to create context " + context, e);
    }
  }
}

Contributor Author

I don't have an opinion on this either way, except that whatever is thrown should be a checked exception so that it must be handled. Using a RuntimeException is part of the reason for this PR.

Contributor

If there is no benefit to using IOException or ReflectiveOperationException, then IMO creating a new checked exception specific to the situation would be more informative for someone looking at logs or for someone reading code that throws it.

2:security.TCredentials credentials
3:string externalCompactionId
4:data.TKeyExtent extent
5:string exceptionClassName
Contributor

This may be a breaking change to the thrift RPC? Or maybe the field will be null/ignored when there is a difference, so maybe it's ok. Also, the coordinator is experimental.

Contributor Author

If using a 2.1.4 Compactor and 2.1.3 Coordinator, then I don't think this is an issue. Not sure about the reverse, maybe it's always null?

errorHistory.values().stream().mapToLong(p -> p.getSecond().get()).sum();
if (totalFailures > 0) {
LOG.warn("This Compactor has had {} consecutive failures. Failures: {}", totalFailures,
errorHistory);
Contributor

Wondering how this will look; it's always re-logging the entire error history. Also it seems like it will only log the first exception seen for a table, is that the intent? I will know more about how this looks when I run some tests.

Contributor Author

Yeah, I guess that I'm assuming the Throwable in Map<TableId,Pair<Throwable,AtomicLong>> would not really change in consecutive failures. I guess we could change it to List<Pair<>>, or remove the Throwable entirely and just make sure we log it. The exception class does get logged in the Coordinator when the compaction fails.

Contributor

This is what the logging looks like

2025-07-15T17:46:35,339 [compactor.Compactor] WARN : This Compactor has had 1 consecutive failures. Failures: {1=(java.lang.ClassNotFoundException: org.apache.accumulo.testing.continuous.ValidatingIterator,1)}

Contributor

The issue w/ the property type caused the compactor to die, so have not seen a count greater than one yet.

Contributor

The summaries may be nice if you know to look for them. Could do something like Map<TableId, Map<ClassName, FailureCount>>
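As a hedged sketch, the suggested structure might look like the following; the class and method names are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the suggested error history structure:
// tableId -> exception class name -> failure count.
public class ErrorHistorySketch {

  private final Map<String,Map<String,AtomicLong>> history = new ConcurrentHashMap<>();

  public void addError(String tableId, Throwable t) {
    history.computeIfAbsent(tableId, id -> new ConcurrentHashMap<>())
        .computeIfAbsent(t.getClass().getName(), c -> new AtomicLong()).incrementAndGet();
  }

  // reset on a successful compaction
  public void clear() {
    history.clear();
  }

  public long totalFailures() {
    return history.values().stream().flatMap(m -> m.values().stream())
        .mapToLong(AtomicLong::get).sum();
  }

  @Override
  public String toString() {
    return history.toString();
  }
}
```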

Contributor Author

I'm thinking that I should just remove Throwable from the errorHistory map. The specific exception that caused the compaction to fail is logged at error at Compactor.java line 594.

Contributor Author

Ok, I missed your last comment before I posted mine. I can work on improving the error history map.

Contributor Author

Updated error history map in 3d71032

@keith-turner
Contributor

keith-turner commented Jul 15, 2025

Seeing some nice results experimenting with these changes. Ran a test w/ 100 tablets w/ continual bulk import into 20 random tablets. There were 16 compactors, 8 of which would always fail. After 15 mins saw the following counts using the default settings.

$ grep "Compaction completed" coordinator.log | wc
    552    7176  145757
$ grep "Compaction failed" coordinator.log | wc
   1270   15240  292997

Restarted the same test setting compactor.failure.backoff.interval=3s and saw the following counts after 15 mins.

$ grep "Compaction completed" coordinator.log | wc
    322    4186   84946
$ grep "Compaction failed" coordinator.log | wc
     28     336    6464

My suspicion is there were more successful compactions in the first case because the failed compactions delayed compactions, driving the average files per tablet higher. Going to do some more digging and see if that is the case.

dlmarion added 2 commits July 15, 2025 20:33
Log4J was not invoking ErrorHistory.toString when logging, so explicitly
called toString when logging. Also, Throwable doesn't override hashCode,
so the HashMap wasn't working as expected. Changed the map key from
Throwable to String to fix this.
@keith-turner
Contributor

Ran the two tests again tracking average files per tablet. With compactor.failure.backoff.interval=3s, seeing much better numbers for average files per tablet and max files per tablet. Looking at the logs, the coordinator scans the tservers every 60s by default. If a bad compactor takes a tablet's job after that scan, then it will not be found until the next scan by the coordinator. Some tablets would keep getting unlucky and picked up by bad compactors after each scan, so they would not compact for 3 or 4 minutes even as new bulk import files kept rolling in.

I know it would be a change in behavior, but wondering if the default settings should make compactors backoff. Could be slight like compactor.failure.backoff.interval=100ms.

@keith-turner
Contributor

keith-turner commented Jul 15, 2025

Tried running another test that did the following. For both tables below, 100 tablets were created.

  1. Started 16 compactors, 8 of which will always fail any compaction they get
  2. Started ingest into table ci1 which has no problems and can compact on the good compactors. This ingest ran forever, always bulk importing into 20 tablets.
  3. Started ingest into table ci2 which has a misconfigured iterator that prevents any tablet on the table from compacting on any compactor. This ingest bulk imported into 20 random tablets in a loop for a bit and then stopped.
  4. Set compactor.failure.backoff.interval=3s

In this case the tablets in table ci2 built up a lot of files because they could never compact. This caused those tablets to have higher priority. Their higher priority would cause the good compactors to get those bad tablets first, and then they would observe lots of consecutive failures. Eventually the tablets for ci1 would build up enough files that they would have a higher priority and compact.

The test has been running for a bit, seeing table ci1 with these files per tablet stats min:14 avg:28.47 max:35 and table ci2 has min:7 avg:15.22 max:35 which is never changing (ingest was stopped on this table).

.description("Number compactions that have succeeded on this compactor").register(registry);
FunctionCounter.builder(METRICS_COMPACTOR_COMPACTIONS_FAILED, this, Compactor::getFailures)
.description("Number compactions that have failed on this compactor").register(registry);
FunctionCounter.builder(METRICS_COMPACTOR_FAILURES_TERMINATION, this, Compactor::getTerminated)
Contributor

Seems like this metric will usually not be seen as 1, if the compactor exits before the metric system polls the 1.

Contributor Author

The changes in my last commits dealing with AbstractServer.close were actually to close the ServerContext so that the MeterRegistry gets closed. Closing the MeterRegistry may end up doing a final poll on the metrics before closing down. It looks like the StatsD implementation does that anyway. But I agree, it's best effort and may not be guaranteed.

Contributor

I was wondering what the close changes were for

Contributor Author

That metric does show up in the IT FWIW

@dlmarion
Contributor Author

Full IT build completed successfully

@dlmarion dlmarion merged commit b2c5fc4 into apache:2.1 Jul 18, 2025
8 checks passed
@dlmarion dlmarion deleted the compactor-die-on-cl-error branch July 18, 2025 13:12