statsd: improve worst-case latency for heavily contended aggregator #349
Draft
nsrip-dd wants to merge 3 commits into DataDog:master from
Conversation
Fixed with the help of Mr. Claude. This isn't exhaustive; it's just doing enough for now to make sure that we can change the counts, gauges, and sets implementations without breaking the tests for non-functional reasons.
TODO: actual description
TODO: testing
TODO: benchmarks
WIP - not ready for review, needs good benchmarks, I've overlooked an obvious race in the first version, etc...
The aggregator in the statsd package uses a sync.RWMutex to guard the count, gauge, and set maps. When a metric is updated, the locking works roughly like so: take the read lock to look up and update an existing metric, and fall back to the write lock to insert one that doesn't exist yet (see the sketch below). It turns out that this pattern scales very poorly under heavy contention. We've seen >100ms latency to update a metric at Datadog. We see contention when lots of goroutines are creating new metrics, or when lots of goroutines race to create the same metric (the maps are regularly cleared). See golang/go#76808 for more details.
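For concreteness, here is a minimal sketch of that read-lock/write-lock pattern. The type and field names (`aggregator`, `countMetric`, `countsM`) are illustrative assumptions, not the library's actual identifiers:

```go
package statsd

import (
	"sync"
	"sync/atomic"
)

// Illustrative types; names are assumptions, not the real datadog-go identifiers.
type countMetric struct{ value int64 }

func (c *countMetric) add(v int64) { atomic.AddInt64(&c.value, v) }

type aggregator struct {
	countsM sync.RWMutex
	counts  map[string]*countMetric
}

// count takes the read lock on the fast path (metric already exists) and
// upgrades to the write lock only to insert a metric that is missing.
func (a *aggregator) count(key string, value int64) {
	a.countsM.RLock()
	if c, ok := a.counts[key]; ok {
		c.add(value)
		a.countsM.RUnlock()
		return
	}
	a.countsM.RUnlock()

	a.countsM.Lock()
	// Re-check: another goroutine may have inserted the metric between
	// our read unlock and write lock.
	if c, ok := a.counts[key]; ok {
		c.add(value)
	} else {
		a.counts[key] = &countMetric{value: value}
	}
	a.countsM.Unlock()
}
```

The write lock serializes every goroutine that needs to create a metric, which is where the latency spikes come from.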
This PR attempts to improve the worst-case latency by switching to sync.Map. According to the docs, sync.Map is optimized for two use cases: entries that are written once and then read many times (as in caches that only grow), or goroutines that read, write, and overwrite disjoint sets of keys. I don't think either case exactly describes this library, but they're both kinda close. And in the linked GitHub issue, we've seen promising results switching to a sync.Map implementation similar to what this PR does.

This PR still needs a convincing benchmark. We'll also want to consider the overhead of wasted allocations when goroutines race to add a new metric: with this new implementation, the racing goroutines will all allocate a new metric and then try to add it to the map, but only the winner's allocation is actually used.
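As a rough illustration of the direction (not the exact code in this PR), a sync.Map-based aggregator might look like the following; names are again illustrative assumptions:

```go
package statsd

import (
	"sync"
	"sync/atomic"
)

// Illustrative types; names are assumptions, not the exact code in this PR.
type countMetric struct{ value int64 }

func (c *countMetric) add(v int64) { atomic.AddInt64(&c.value, v) }

type aggregator struct {
	counts sync.Map // string -> *countMetric
}

// count never takes an exclusive lock. If the metric is missing, every racing
// goroutine allocates a candidate, but LoadOrStore keeps only the first one
// stored; the losers' allocations become garbage, which is the overhead
// mentioned above.
func (a *aggregator) count(key string, value int64) {
	if c, ok := a.counts.Load(key); ok {
		c.(*countMetric).add(value)
		return
	}
	fresh := &countMetric{}
	actual, _ := a.counts.LoadOrStore(key, fresh)
	actual.(*countMetric).add(value)
}
```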
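One possible shape for such a benchmark, purely as a sketch against the sync.Map aggregator above (the benchmark name and key scheme are made up for illustration): exercise both kinds of contention, many goroutines hitting one hot key and many goroutines creating fresh keys, and report allocations:

```go
package statsd

import (
	"strconv"
	"sync/atomic"
	"testing"
)

// Sketch only: hammers one hot key (same-key contention) and a stream of
// fresh keys (new-metric contention) from many goroutines at once.
func BenchmarkContendedCount(b *testing.B) {
	var a aggregator
	var next int64
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			a.count("hot", 1)
			a.count(strconv.FormatInt(atomic.AddInt64(&next, 1), 10), 1)
		}
	})
}
```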