
Conversation

@bitemyapp (Contributor) commented Dec 29, 2017:

Okay, so while this does use the bulk API, I didn't see an appreciable throughput difference, because of the load-shedding behavior of the default scribe. Hypothetically, it should let you do more logging with less stress on your ES shards.

However, the real purpose for this was that I wanted a logger that wouldn't load-shed. As far as I could tell from my testing, no documents get lost by the bulk scribe as long as it is permitted to finish a final sync before closing the scribe.

Attrition rates for the regular scribe were as follows in my testing:

No added delay to simulate work (12 microsecond average iteration time): 99.5% loss. The loss rate is pretty intense without a delay.

1 millisecond delay per iteration: 86% loss.

5 millisecond delay: 18% loss.

10 millisecond delay: 0% loss.

At four log messages per iteration, a 10 millisecond delay works out to one log message (document) every 2.5 milliseconds, so hucking documents faster than that will start resulting in load-shedding. I used Elasticsearch 5.5.0 running locally for the test.

The bulk scribe didn't lose anything as long as I gave it a second or two before closing the scribe.

My next concern was throughput. The regular and bulk scribe were broadly comparable in throughput, with the bulk scribe occasionally being a mite slower.

Then came tail latency / outlier iteration time.

There were time-to-complete outliers when the delay time was 1, 5, or 10 milliseconds per four log messages. At 100 milliseconds the outlier was barely greater than the average. I took the max of all runtimes to get the outlier, so this is the worst-case scenario.

For what it's worth, I think it is sensible to have both of these scribes but I am not sure which one to recommend. I feel like the trade-offs would ideally be more explicit than "default" and "bulk."

Profiling the bulk scribe didn't reveal anything interesting. This was a real pleasure to write in Haskell, and being able to mimic the structure of the original scribe saved me a lot of grief too.

Please let me know what you think.

@bitemyapp (Contributor, Author):

Repeating here for posterity:

The findings aren't from the benchmarks I wrote. The benchmarks aren't super helpful in their current state, and I should think about changing them to match the private stress test I made to measure loss, tail latency, and throughput.

@MichaelXavier (Collaborator) left a comment:

Overall looks good to me aside from a few comments here and there. I will probably need a cleanup pass on comments. I think we should also workshop what we should recommend to users. I agree that it sucks to load-shed in a logger, and if using the bulk API gets us out of that, is there a compelling reason not to make that the one supported path, even if it isn't a significant performance boon, or is even a slight penalty?

-- internal mechanisms that will change without notice.
module Katip.Core where

-------------------------------------------------------------------------------
Collaborator:

Is there an automatic process here that's removing all of these comment sections? This is a coding style thing we do at Soostone.

Contributor Author:

No, just me. I usually avoid doing petty stuff like this but they rankled me. Sorry about that.


go :: Int -> [V5.BulkOperation] -> Int -> IO ()
go delayTime xs len = do
  result <- race (UNB.readChan outChan) (threadDelay delayTime)
Collaborator:

I'm a little bit worried about the safety of this race. IIRC race will run 2 async threads and will cancel the losing thread after the winner returns. Is it possible for the delay to hit simultaneously with an item being removed from the queue effectfully, the thread to be canceled and that item to be dropped on the floor?

It seems like maybe readChanOnException could be used to make this async-exception safe (and the ThreadKilled exception would be delivered asynchronously). I guess the danger there is if the chan is full and the readChan was blocking: in the cleanup function we provide to readChanOnException, you'd have to do a tryWriteChan to make a best-effort attempt to give the item back without blocking.
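
Roughly what I have in mind, as a sketch only (assuming UNB is Control.Concurrent.Chan.Unagi.Bounded; the helper names are made up):

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race)
import Control.Monad (void)
import qualified Control.Concurrent.Chan.Unagi.Bounded as UNB

-- What the PR does now: the losing readChan gets cancelled by race, which is
-- where the worry about dropping a claimed item comes from.
readOrGiveUp :: UNB.OutChan a -> Int -> IO (Maybe a)
readOrGiveUp outChan delayMicros =
  either Just (const Nothing) <$>
    race (UNB.readChan outChan) (threadDelay delayMicros)

-- The readChanOnException variant: the handler is passed an action yielding
-- the element the read would have returned, and we make a best-effort,
-- non-blocking attempt to put it back with tryWriteChan.
readOrRequeue :: UNB.InChan a -> UNB.OutChan a -> Int -> IO (Maybe a)
readOrRequeue inChan outChan delayMicros =
  either Just (const Nothing) <$>
    race (UNB.readChanOnException outChan giveBack) (threadDelay delayMicros)
  where
    giveBack grab = grab >>= void . UNB.tryWriteChan inChan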

Contributor Author:

I talked to Brandon Simmons (the unagi-chan author); he said to spin tryRead in a loop for this purpose.

Collaborator:

I suppose that makes sense. I'm kind of curious how hard that would be on the CPU. We're not in STM, so we don't have a lot of options for waking up the loop only when things change. Does he suggest a check delay or anything to mitigate that?

Contributor Author:

Exact wording, hopefully he won't mind: "So if you've got a single reader then the clearest way to do what you want would probably be to use tryReadChan and make sure that if a blocking read times out that you retry (that's what tryReadChan gives you: the ability to block, or busy-wait and retry as much as you like), before proceeding with another readChan on the queue."

Sounds like busy-spin as we discussed. It's not clear to me how this avoids a faulty read if the raced timeout lands in the middle of the readChan.

Some arithmetic could be used to wait rather than attempting the genuine read if the time differential is within #{X} nanos of the timeout deadline but I don't like hitting up the clock in a hot loop like this.

Collaborator:

Thanks for the additional clarification. Sounds good. I say let's move forward with this simple fallback-to-spin-loop approach and see where it gets us.

Contributor Author:

https://github.com/bitemyapp/break-unagi-chan has some results from my attempt at shaking out how unagi-chan would behave in the pessimal case.

I'm not 100% certain how to interpret the results (I'd love to know what you think), but I got basically the same results from the regular and "exception safe" variants.

As it stands, I'd say that I would personally be comfortable going forward with this variant as a harm reduction relative to the current scribe's behavior (the load-shedding can get extreme).

All my alternatives would involve load-shedding (STM of bounded collection), livelock (TQueue, TBQueue), or unbounded memory use (STM of unbounded collection). I'd also have to be careful to not make the same mistake, whatever it was, that induces livelock in TQueue/TBQueue/TBMQueue.

I'll focus on exposing configuration bits for the moment; I'd like to know what you think in the meantime.

Collaborator:

Thanks for the research! When you say "going forward with this variant," do you mean with the tryReadChan as suggested by the unagi-chan author?

@bitemyapp (Contributor, Author) commented Jan 12, 2018:

https://github.com/bitemyapp/break-unagi-chan/blob/master/src/Lib.hs#L59-L60

you mean with the tryReadChan as suggested by the unagi-chan author?

I'm not 100% certain of what they meant, but it's definitely not a busy-spin. Instead, I'm using timeout on the blocking IO action returned by tryReadChan, or just readChan. System.Timeout.timeout seems to behave better than race'ing a threadDelay, and I would really like to know why.
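
Something along these lines, as a sketch only rather than the exact code in the branch (assuming UNB is Control.Concurrent.Chan.Unagi.Bounded; the helper name is made up):

import System.Timeout (timeout)
import qualified Control.Concurrent.Chan.Unagi.Bounded as UNB

readUpTo :: Int -> UNB.OutChan a -> IO (Maybe a)
readUpTo deadlineMicros outChan = do
  (_element, blockingRead) <- UNB.tryReadChan outChan
  -- Time-limit the blocking read; on Nothing the caller should retry against
  -- the same blocking action (or the Element) rather than calling tryReadChan
  -- again, so the reserved slot isn't abandoned.
  timeout deadlineMicros blockingRead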

I'm also not completely certain of the validity or meaning of my results in the README of the repository. I'd like a second pair of eyes on how I'm shaking it out if you or anyone else has the time.

I'd say that, worst case, we're still losing data, but it's far, far less than with the existing load-shedding behavior, and it's overall a more efficient scribe in terms of ES resources.

Collaborator:

Incoming text wall.

Took a closer look at the code, and I'm starting to be convinced this is pretty hard to test. In your code you fill a chan with all the data that will ever be pushed to it. You then try your algorithms and give them decreasing "patience" for how long they'll wait to read data that we're certain is in the chan, down to something like 1 microsecond (if I understand correctly), which I think may even be below the practicality of the scheduler. Eventually I would expect this to break down, but I think this may just be testing the constant factors of each mechanism and not the safety.

The hypothetical data loss scenario we're talking about is when data is available to the chan at almost exactly the same time as the worker will give up, with the idea that a read from the chan isn't atomic and could get interrupted by a timeout and drop the message on the floor. In order to compare this across implementations somewhat deterministically I think you'd start with an empty queue and then coordinate many pushes to it that are all right around the time where it'd give up (and then have your implementation retry in any case). Like running a particle accelerator and looking for collisions. I think that means your implementation under test would need to keep track of the last number it saw so it could simply abort on first anomaly. That way, it could differentiate between successfully timing out but getting the data the next time and actually losing data.

So it seems like from the code there are 3 variations we want to consider:

  1. race blocking read with threadDelay
  2. race blocking read w/ attempted re-queueing with thread delay (so, theoretically safer, maybe bad perf, maybe not effective at preventing lost writes)
  3. blocking version of tryReadChan.

I think we can probably eliminate number 3. It seems like in the unagi code, that's exactly equivalent to 1:

tryReadChan :: OutChan a -> IO (UT.Element a, IO a)
tryReadChan oc = do -- no mask necessary
    (seg,segIx) <- startReadChan oc
    return ( ... elided, promise version unused ... , readSegIxUnmasked id (seg,segIx))

readChan :: OutChan a -> IO a
readChan = \oc-> startReadChan oc >>= readSegIxUnmasked id

So now we're down to: is readChanOnException with timeout safer than readChan with timeout? If so, is there a performance impact to using readChanOnException?

For checking safety, I'm not sure the best way but maybe we should just batter a queue with data and abort on first anomaly. So maybe:

  1. Spawn a writer with random short publish delays roughly around the timeout in the reader.
  2. Spawn a reader that retries, and remembers the last number it saw and blows up on first out of sequence.
  3. Run the program for a while and see if we can ever produce the bug. We would do this with plain readChan first.

I feel like if we can get up to some high number of attempts without creating this bug, then we can probably just stop worrying about it, race readChan and call it a day. If we do create the bug, see if we can recreate it using the "safe" readChan. If not, see if perf disqualifies the safe variant.
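
Roughly the shape of the rig I'm imagining, as a sketch only (assumes Control.Concurrent.Chan.Unagi.Bounded and async; all names are made up):

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (concurrently_, race)
import Control.Monad (forM_)
import qualified Control.Concurrent.Chan.Unagi.Bounded as UNB
import System.Random (randomRIO)

collide :: Int -> Int -> IO ()
collide messageCount readerTimeoutMicros = do
  (inChan, outChan) <- UNB.newChan 4096
  concurrently_ (writer inChan) (reader outChan 0)
  where
    -- Publish delays jittered to land right around the reader's timeout, to
    -- maximize the chance that a timeout fires in the middle of a read.
    writer inChan = forM_ [1 .. messageCount] $ \i -> do
      jitter <- randomRIO (readerTimeoutMicros - 5, readerTimeoutMicros + 5)
      threadDelay (max 1 jitter)
      UNB.writeChan inChan i

    -- Reader retries on timeout, remembers the last number it saw, and blows
    -- up on the first out-of-sequence read.
    reader outChan lastSeen
      | lastSeen >= messageCount = pure ()
      | otherwise = do
          result <- race (UNB.readChan outChan) (threadDelay readerTimeoutMicros)
          case result of
            Right () -> reader outChan lastSeen
            Left i
              | i == lastSeen + 1 -> reader outChan i
              | otherwise ->
                  error ("lost message(s): expected " ++ show (lastSeen + 1)
                         ++ ", got " ++ show i)

We'd run that with plain readChan first (as in the sketch), then swap in the "safe" variant if it ever trips.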

I'll take a crack locally at your repo to see if I can rig this test setup.

(V5.DocId docId) val

discrimLen l
  | l >= 500 = SendIt
Collaborator:

I wonder if we should make this configurable.

Contributor Author:

The figure was the product of throughput testing. I don't know why someone would want to change it, but yeah, it probably should be configurable.
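
Hypothetically something like this (all names besides SendIt are made up; Buffer stands in for whatever the non-send branch is called in the real code):

newtype BulkFlushThreshold = BulkFlushThreshold Int

defaultBulkFlushThreshold :: BulkFlushThreshold
defaultBulkFlushThreshold = BulkFlushThreshold 500

data FlushDecision = SendIt | Buffer

discrimLen :: BulkFlushThreshold -> Int -> FlushDecision
discrimLen (BulkFlushThreshold n) l
  | l >= n    = SendIt
  | otherwise = Buffer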

startBulkWorker _ env mapping dieSignal outChan = do
  -- We need to randomize upload delay
  -- so that workers do not stampede the nodes
  delayTime' <- randomRIO (100, 2000)
Collaborator:

Should we make this configurable?

Contributor Author:

Almost certainly yes, I'll lift it out.
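
Probably something shaped like this (illustrative names only):

import System.Random (randomRIO)

newtype UploadDelayRange = UploadDelayRange (Int, Int)

defaultUploadDelayRange :: UploadDelayRange
defaultUploadDelayRange = UploadDelayRange (100, 2000)

-- inside startBulkWorker, instead of the hard-coded range:
randomDelay :: UploadDelayRange -> IO Int
randomDelay (UploadDelayRange range) = randomRIO range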

@bitemyapp changed the title from "Bulk elasticsearch scribe" to "WIP: Bulk elasticsearch scribe" on Apr 8, 2018

@bitemyapp (Contributor, Author) commented:

Still working on this. Pretty close to having something we could choose to release which would be strictly better than the old scribe.

However, the benchmark is still showing tremendous data loss (albeit with 10x more retained than the old scribe), and I'd really like to sort that out before pushing toward a release.

