Skip to content

Shard aggregator lock#351

Merged
StephenWakely merged 3 commits intomasterfrom
carlosroman/AGTMETRICS-340-opsengine-shard-aggregator
Feb 2, 2026
Merged

Shard aggregator lock#351
StephenWakely merged 3 commits intomasterfrom
carlosroman/AGTMETRICS-340-opsengine-shard-aggregator

Conversation

@carlosroman
Copy link
Contributor

@carlosroman carlosroman commented Jan 19, 2026

Updated version of #343 (fixes git conflicts)


This PR introduces optional sharding for client-side aggregation locks to significantly improve throughput in high-concurrency scenarios.

Problem:
The aggregator previously used a single sync.RWMutex for each metric type (counts, gauges, sets). Under high concurrency (many goroutines reporting metrics), this single lock became a major contention point, limiting throughput even when reporting unique metrics.

Solution:
We have introduced a sharding mechanism for these metric maps.

  • The number of shards is configurable via a new option: WithAggregatorShardCount(int).
  • The default shard count is 1, preserving the existing behavior and performance characteristics.
  • When shardCount > 1, metrics are distributed across shards based on a hash of their context (name + tags), reducing lock contention by a factor roughly equal to the shard count.

Performance:
Micro-benchmarks demonstrate significant improvements in high-contention scenarios (M4 Max, 14 threads):

High Contention (Concurrent updates to unique metrics):

  • 1 Shard: ~481 ns/op
  • 32 Shards: ~78 ns/op
    -> ~6x throughput improvement

Overhead (Single thread, no contention):

  • Legacy Implementation: ~31.46 ns/op
  • New Implementation (1 Shard): ~31.47 ns/op
    -> Zero overhead introduced for the default configuration.

The optimization ensures that the hashing cost is completely bypassed when sharding is disabled (default), guaranteeing no regression for existing users.

@carlosroman carlosroman force-pushed the carlosroman/AGTMETRICS-340-opsengine-shard-aggregator branch from b9b43a0 to de50ead Compare January 19, 2026 18:04
@carlosroman
Copy link
Contributor Author

Still need to investigate thread-safety of the sample() methods.

@carlosroman carlosroman force-pushed the carlosroman/AGTMETRICS-340-opsengine-shard-aggregator branch from c97a88f to 6fedd7a Compare January 20, 2026 15:46
@carlosroman
Copy link
Contributor Author

Calling sample() method is fine. Just tidied up the typos and comments on the PR.

@carlosroman carlosroman marked this pull request as ready for review January 20, 2026 15:48
@carlosroman carlosroman requested a review from a team as a code owner January 20, 2026 15:48
Copy link
Contributor

@StephenWakely StephenWakely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a couple of very minor nits.

i := 0
for pb.Next() {
i++
name := fmt.Sprintf("metric.%d", i%100000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is mostly going to be benchmarking the string allocation here. We should instead preallocate a big list of these strings before b.ResetTimer().

defaultOriginDetection = true
defaultChannelModeErrorsWhenFull = false
defaultErrorHandler = func(error) {}
defaultTagCardinality = CardinalityNotSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultTagCardinality isn't used anywhere, so should be removed.

It was removed in a previous PR because the value is now 0, but arguably it should be used and set in resolveOptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking that up. I forgot doing something with that value.

Copy link
Contributor

@StephenWakely StephenWakely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make this mistake every time too...

Copy link
Contributor

@StephenWakely StephenWakely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this..

channelModeErrorsWhenFull: defaultChannelModeErrorsWhenFull,
errorHandler: defaultErrorHandler,
aggregatorShardCount: defaultAggregatorShardCount,
tagCardinality: &defaultTagCardinality,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, the defaultTagCardinality should be nil, we use nil to mean it hasn't been set here and should be loaded from the environment.

@carlosroman carlosroman force-pushed the carlosroman/AGTMETRICS-340-opsengine-shard-aggregator branch from cff244f to b1f2f9f Compare January 30, 2026 17:10
Copy link
Contributor

@StephenWakely StephenWakely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@StephenWakely StephenWakely merged commit b808812 into master Feb 2, 2026
54 checks passed
@StephenWakely StephenWakely mentioned this pull request Feb 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants