statsd: improve worst-case latency for heavily contended aggregator #349
Draft
nsrip-dd wants to merge 3 commits into DataDog:master from
Conversation
Fixed with the help of Mr. Claude. This isn't exhaustive; it's just doing enough for now to make sure that we can change the counts, gauges, and sets implementations without breaking the tests for non-functional reasons.
TODO: actual description
TODO: testing
TODO: benchmarks
WIP - not ready for review, needs good benchmarks, I've overlooked an obvious race in the first version, etc...
The aggregator in the statsd package uses a sync.RWMutex to guard the count, gauge, and set maps. When a metric is updated, the locking works roughly like so: take the read lock to look up and update an existing metric, and fall back to the write lock to insert one that doesn't exist yet (see the sketch below). It turns out that this pattern scales very poorly under heavy contention. We've seen >100ms latency to update a metric at Datadog. We see contention when lots of goroutines are creating new metrics, or when lots of goroutines race to create the same metric (the maps are regularly cleared). See golang/go#76808 for more details.
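For concreteness, here is a minimal sketch of that read-lock/write-lock pattern. The type and field names (`aggregator`, `countMetric`, `countsM`) are illustrative assumptions, not the library's actual identifiers:

```go
package statsd

import (
	"sync"
	"sync/atomic"
)

// Illustrative types; names are assumptions, not the real datadog-go identifiers.
type countMetric struct{ value int64 }

func (c *countMetric) add(v int64) { atomic.AddInt64(&c.value, v) }

type aggregator struct {
	countsM sync.RWMutex
	counts  map[string]*countMetric
}

// count takes the read lock on the fast path (metric already exists) and
// upgrades to the write lock only to insert a metric that is missing.
func (a *aggregator) count(key string, value int64) {
	a.countsM.RLock()
	if c, ok := a.counts[key]; ok {
		c.add(value)
		a.countsM.RUnlock()
		return
	}
	a.countsM.RUnlock()

	a.countsM.Lock()
	// Re-check: another goroutine may have inserted the metric between
	// our read unlock and write lock.
	if c, ok := a.counts[key]; ok {
		c.add(value)
	} else {
		a.counts[key] = &countMetric{value: value}
	}
	a.countsM.Unlock()
}
```

The write lock serializes every goroutine that needs to create a metric, which is where the latency spikes come from.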
This PR attempts to improve the worst-case latency by switching to sync.Map. According to the docs, sync.Map is optimized for two use cases: entries that are written once and then read many times (as in caches that only grow), or goroutines that read, write, and overwrite disjoint sets of keys. I don't think either case exactly describes this library, but they're both kinda close. And in the linked GitHub issue, we've seen promising results switching to a sync.Map implementation similar to what this PR does.

This PR still needs a convincing benchmark. We'll also want to consider the overhead of wasted allocations when goroutines race to add a new metric: with this new implementation, the racing goroutines will all allocate a new metric and then try to add it to the map, but only the winner's allocation is actually used.
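As a rough illustration of the direction (not the exact code in this PR), a sync.Map-based aggregator might look like the following; names are again illustrative assumptions:

```go
package statsd

import (
	"sync"
	"sync/atomic"
)

// Illustrative types; names are assumptions, not the exact code in this PR.
type countMetric struct{ value int64 }

func (c *countMetric) add(v int64) { atomic.AddInt64(&c.value, v) }

type aggregator struct {
	counts sync.Map // string -> *countMetric
}

// count never takes an exclusive lock. If the metric is missing, every racing
// goroutine allocates a candidate, but LoadOrStore keeps only the first one
// stored; the losers' allocations become garbage, which is the overhead
// mentioned above.
func (a *aggregator) count(key string, value int64) {
	if c, ok := a.counts.Load(key); ok {
		c.(*countMetric).add(value)
		return
	}
	fresh := &countMetric{}
	actual, _ := a.counts.LoadOrStore(key, fresh)
	actual.(*countMetric).add(value)
}
```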
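One possible shape for such a benchmark, purely as a sketch against the sync.Map aggregator above (the benchmark name and key scheme are made up for illustration): exercise both kinds of contention, many goroutines hitting one hot key and many goroutines creating fresh keys, and report allocations:

```go
package statsd

import (
	"strconv"
	"sync/atomic"
	"testing"
)

// Sketch only: hammers one hot key (same-key contention) and a stream of
// fresh keys (new-metric contention) from many goroutines at once.
func BenchmarkContendedCount(b *testing.B) {
	var a aggregator
	var next int64
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			a.count("hot", 1)
			a.count(strconv.FormatInt(atomic.AddInt64(&next, 1), 10), 1)
		}
	})
}
```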