statsd: improve worst-case latency for heavily contended aggregator#349

Draft
nsrip-dd wants to merge 3 commits into DataDog:master from nsrip-dd:nick.ripley/reduce-worst-case-contention
@nsrip-dd nsrip-dd commented Jan 5, 2026

WIP: not ready for review. This still needs good benchmarks, I overlooked an obvious race in the first version, etc.

The aggregator in the statsd package uses a sync.RWMutex to guard the count, gauge, and set maps. When a metric is updated, the locking works like so:

  • Take a read lock and check if the metric is there.
  • If so, update the metric, release the read lock, and return.
  • Otherwise, release the read lock, take a write lock, create the metric, and add it to the map. Then release the lock and return.
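
For readers who don't have the code open, here is a minimal sketch of that locking pattern. It is not the package's actual code: `rwAggregator`, `countMetric`, `count`, and `get` are hypothetical names, and only a single count map is modeled. The sketch also re-checks the map after taking the write lock, which the steps above don't spell out; that part is my assumption about how to avoid overwriting a metric another goroutine just created.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countMetric is a hypothetical stand-in for the library's count metric.
type countMetric struct {
	value atomic.Int64
}

// rwAggregator is a hypothetical, simplified aggregator with one map.
type rwAggregator struct {
	mu     sync.RWMutex
	counts map[string]*countMetric
}

func newRWAggregator() *rwAggregator {
	return &rwAggregator{counts: make(map[string]*countMetric)}
}

// count adds delta to the named metric, creating the metric on first use.
func (a *rwAggregator) count(name string, delta int64) {
	// Fast path: the metric already exists, so a read lock is enough.
	a.mu.RLock()
	if m, ok := a.counts[name]; ok {
		m.value.Add(delta)
		a.mu.RUnlock()
		return
	}
	a.mu.RUnlock()

	// Slow path: take the write lock to insert the metric.
	a.mu.Lock()
	defer a.mu.Unlock()
	// Assumption: re-check, since another goroutine may have inserted the
	// metric between RUnlock and Lock.
	if m, ok := a.counts[name]; ok {
		m.value.Add(delta)
		return
	}
	m := &countMetric{}
	m.value.Store(delta)
	a.counts[name] = m
}

// get returns the current value of the named metric, or 0 if it is absent.
func (a *rwAggregator) get(name string) int64 {
	a.mu.RLock()
	defer a.mu.RUnlock()
	if m, ok := a.counts[name]; ok {
		return m.value.Load()
	}
	return 0
}

func main() {
	agg := newRWAggregator()
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				agg.count("requests", 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("requests =", agg.get("requests"))
}
```

Every goroutine that misses on the fast path ends up queued on the write lock, and pending writers also hold up new readers, which is where the contention comes from.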

It turns out that this pattern scales very poorly under heavy contention. At Datadog, we've seen more than 100ms of latency for a single metric update. We can see contention when lots of goroutines are creating new metrics, or when lots of goroutines race to create the same metric (the maps are regularly cleared). See golang/go#76808 for more details.

This PR attempts to improve the worst-case latency by switching to sync.Map. According to the docs:

The Map type is optimized for two common use cases: (1) when the entry for a given key is only ever written once but read many times, as in caches that only grow, or (2) when multiple goroutines read, write, and overwrite entries for disjoint sets of keys.

I don't think either case exactly describes this library, but they're both kinda close. And in the linked GitHub issue, we've seen promising results switching to a sync.Map implementation similar to what this PR does.

This PR still needs a convincing benchmark. We'll also want to consider the overhead of wasted allocations if goroutines race to add a new metric. With this new implementation, the racing goroutines will all allocate a new metric and then add it to the map, but only the winner's allocation is actually used.
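
To make the wasted-allocation concern concrete, here is a rough sketch of the sync.Map approach. It is not the code in this PR: `mapAggregator`, `countMetric`, `count`, `get`, and the `wasted` counter are hypothetical names, and `wasted` exists only to make discarded allocations visible. The sketch also ignores the periodic clearing of the maps, which a real implementation would have to handle.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countMetric is a hypothetical stand-in for the library's count metric.
type countMetric struct {
	value atomic.Int64
}

// mapAggregator is a hypothetical, simplified aggregator built on sync.Map.
type mapAggregator struct {
	counts sync.Map     // effectively map[string]*countMetric
	wasted atomic.Int64 // sketch-only: allocations that lost the race
}

// count adds delta to the named metric, creating the metric on first use.
func (a *mapAggregator) count(name string, delta int64) {
	// Fast path: look up an existing metric without taking an explicit lock.
	if v, ok := a.counts.Load(name); ok {
		v.(*countMetric).value.Add(delta)
		return
	}

	// Slow path: allocate first, then try to publish the new metric.
	fresh := &countMetric{}
	actual, loaded := a.counts.LoadOrStore(name, fresh)
	if loaded {
		// Another goroutine stored its metric first; fresh is discarded.
		a.wasted.Add(1)
	}
	actual.(*countMetric).value.Add(delta)
}

// get returns the current value of the named metric, or 0 if it is absent.
func (a *mapAggregator) get(name string) int64 {
	if v, ok := a.counts.Load(name); ok {
		return v.(*countMetric).value.Load()
	}
	return 0
}

func main() {
	agg := &mapAggregator{}
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				agg.count("requests", 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("requests =", agg.get("requests"))
	fmt.Println("wasted allocations =", agg.wasted.Load())
}
```

The number of wasted allocations depends on scheduling, so it varies from run to run. A benchmark for this PR would ideally report it alongside latency.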

Fixed with the help of Mr. Claude. This isn't exhaustive; it's just
doing enough for now to make sure that we can change the counts, gauges,
and sets implementations without breaking the tests for non-functional
reasons.
TODO actual description
TODO testing
TODO benchmarks
@nsrip-dd nsrip-dd force-pushed the nick.ripley/reduce-worst-case-contention branch from 8b6e4bc to 5b8c4d8 Compare January 6, 2026 14:26