Description
Motivation:
Understanding millions of related events and their relationships in a dataflow is extremely hard. Events span serverless functions, stream processors and microservices, so building that story and extracting instrumentation knowledge is virtually impossible today. It should be possible to build a queryable dataflow model that supports instrumentation. The result would be the ability to scan millions of events (historically and in real time), break them down by different attributes, identify anomalies, drill into the set of interest, and then down to a single dataflow with relevant contextual information.
For example:
- millions of IoT events -> those devices generating errors or performing slowly
- millions of payments -> system latency degradation over time, identifying the types of payments or attributes affecting performance.
Proposed change
The ability to build a dataflow model that supports analytics such as the number of flows and the latency between steps. Break down millions of dataflows against different attributes. Support filtering by analytics, attributes, latency and others. In some ways it is equivalent to distributed tracing, except that it
- also includes related log lines against a correlation-id (provides context)
- doesn’t require special infrastructure (JSON data stored on cloud storage)
- is extensible (i.e. can be updated to support cloud events in future)
- has low impact on existing systems
V1 Mechanics:
Note limitations:
- supports a single correlation-id (no propagation fan-out etc.)
- correlation data is rewritten (duplication)
Stage 1. Rewrite dataflows by time-correlation
Format: {timestamp}-{corr-id}.log -> records[timestamp:data]
This turns each dataflow into an append-only log for its correlation id, with one record per stage of the flow. Timestamps distinguish between the different states of the flow.
For example:
time-1: corr-a data-1
time-2: corr-b data-2
time-3: corr-a data-3
Yields 2 data files:
time-1-corr-a.log -> records for corr-a
time-2-corr-b.log -> records for corr-b
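The rewrite in Stage 1 can be sketched as below. This is only an illustration under assumptions: events are held in memory as (timestamp, corr-id, data) tuples that arrive in time order, and a dict stands in for the files on cloud storage. The function name `partition_by_correlation` is hypothetical.

```python
from collections import defaultdict

def partition_by_correlation(events):
    """Group (timestamp, corr_id, data) events into per-flow append-only logs.

    Each log is named {timestamp}-{corr-id}.log, where the timestamp is the
    first one seen for that correlation id (assumes time-ordered input).
    """
    first_seen = {}           # corr_id -> timestamp of its first event
    logs = defaultdict(list)  # file name -> [(timestamp, data), ...]
    for timestamp, corr_id, data in events:
        first_seen.setdefault(corr_id, timestamp)
        name = f"{first_seen[corr_id]}-{corr_id}.log"
        logs[name].append((timestamp, data))  # append-only: never rewritten
    return dict(logs)

# The three events from the example above.
events = [
    ("time-1", "corr-a", "data-1"),
    ("time-2", "corr-b", "data-2"),
    ("time-3", "corr-a", "data-3"),
]
files = partition_by_correlation(events)
```

Here `files` holds two entries, one per correlation id, matching the two data files listed above.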
Stage 2. Calculate dataflow-level analysis (index)
For each dataflow, generate span information into a time-n-corr-m.index file.
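One possible shape for the per-dataflow index is sketched below. The layout of the index (the field names `start`, `end`, `elapsed`, `spans`) is an assumption, not something fixed by this proposal, and `build_index` is a hypothetical helper. Timestamps are treated as integers in whatever unit the logs use.

```python
import json

def build_index(corid, records):
    """Build span information for one dataflow.

    records: list of (timestamp, stage) tuples belonging to a single corid.
    """
    ordered = sorted(records, key=lambda r: r[0])
    # One span per pair of consecutive stages.
    spans = [
        {"from": s0, "to": s1, "elapsed": t1 - t0}
        for (t0, s0), (t1, s1) in zip(ordered, ordered[1:])
    ]
    return {
        "corid": corid,
        "start": ordered[0][0],
        "end": ordered[-1][0],
        "elapsed": ordered[-1][0] - ordered[0][0],
        "spans": spans,
    }

# Illustrative records for a single correlation id.
records = [
    (1234011, "request-calc"),
    (1234021, "bootstrap"),
    (1234022, "starting"),
    (1234024, "finished"),
    (1234025, "calc-finished"),
]
index = build_index("aaaaa", records)
index_body = json.dumps(index)  # what would be written to the .index file
```

Keeping the index as plain JSON means Stage 3 only needs to read small index files rather than the full logs.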
Stage 3. Build per-day aggregate view of correlation data
- Scan all index files to extract histogram analytics including:
  - number of dataflows
  - breakdown by min, max, avg, 95th percentile and 99th percentile
- The output histogram will be stored against the time range of the query so that it does not need to be recalculated.
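The aggregate step could look roughly like this. It is a sketch only: it assumes each index file contributes one end-to-end elapsed value, uses the nearest-rank method for percentiles (other definitions exist), and leaves out storing the result by time range. The names `percentile` and `aggregate` are hypothetical.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def aggregate(elapsed_values):
    """Summarise the elapsed values of many dataflows."""
    values = sorted(elapsed_values)
    return {
        "count": len(values),  # number of dataflows
        "min": values[0],
        "max": values[-1],
        "avg": sum(values) / len(values),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }

# Synthetic input: 100 dataflows with elapsed values 1..100.
stats = aggregate(range(1, 101))
```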
Datamodel:
- RPC call
I.e. Client->function(request)->client(result)
Client-app1.log
{"ts": "1234010", "corid": "aaaaa"}
{"ts": "1234011", "corid": "aaaaa", "stage": "request-calc"}
ServerlessFunction.log
{"ts": "1234021", "corid": "aaaaa", "stage": "bootstrap", "serverlessContextId": "1234:name:version"}
{"ts": "1234022", "corid": "aaaaa", "stage": "starting", "serverlessContextId": "1234"}
{"ts": "1234023", "corid": "aaaaa", "msg": "processing some stuff for things"}
{"ts": "1234024", "corid": "aaaaa", "stage": "finished", "serverlessContextId": "1234"}
Lambda finished processing: memory:1234 time:2345
Client-app1.log
{"ts": "1234025", "corid": "aaaaa", "stage": "calc-finished"}
Single call attributes collected (raw):
Request start-time (corid, timestamp, stage)
Function load-time (fnId, timestamp)
Function handler start (corid, fnId, timestamp, stage)
Function handler end (corid, fnId, timestamp, stage, fn-stats)
Response received-time (corid, timestamp, stage)
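Collecting these raw attributes means merging lines from several source files by correlation id. A minimal sketch follows, assuming every structured log line is valid JSON with `ts` and `corid` fields and that unstructured lines (such as the function summary line) can be skipped. `collect_by_corid` is a hypothetical name.

```python
import json
from collections import defaultdict

def collect_by_corid(sources):
    """sources: mapping of source file name -> iterable of raw log lines."""
    flows = defaultdict(list)
    for source, lines in sources.items():
        for line in lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # not JSON, e.g. a plain-text summary line
            if not isinstance(record, dict) or "corid" not in record:
                continue
            record["source"] = source  # remember where the line came from
            flows[record["corid"]].append(record)
    for records in flows.values():
        records.sort(key=lambda r: int(r["ts"]))  # order each flow by time
    return dict(flows)

sources = {
    "Client-app1.log": [
        '{"ts": "1234011", "corid": "aaaaa", "stage": "request-calc"}',
        '{"ts": "1234025", "corid": "aaaaa", "stage": "calc-finished"}',
    ],
    "ServerlessFunction.log": [
        '{"ts": "1234022", "corid": "aaaaa", "stage": "starting"}',
        "Lambda finished processing: memory:1234 time:2345",
    ],
}
flows = collect_by_corid(sources)
```

The plain-text summary line is dropped here; extracting fn-stats from it would need a separate parser.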
Analytics
Call chain analytics
Stage1->stage2 elapsed (stages are labels on the log line)
Stage2->stage3 elapsed
...etc...
Simple timeseries analysis:
Represented as candle-chart?
Individual stage analytics using percentile/max/avg
stageA->stageB elapsed percentiles (label: stageA->stageB)
Expression: analytics.dataflow.stageStats(dataset-name?)
End-to-end stage analytics:
Stage1->Stage4 elapsed using percentiles
Expression: analytics.dataflow.endToEndStats(dataset-name?)
Filter stage-stage latency by value:
Stage1->Stage4 elapsed > 100ms
Expression: analytics.dataflow.stageFilter(stage1->stage4 > 100) analytics.dataflow.endToEndStats(dataset-name?)
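To make the filter concrete, here is a rough sketch of what a stage-to-stage latency filter might do. The functions `stage_elapsed` and `stage_filter` are hypothetical stand-ins for the expressions above, not an existing API, and the second flow (`bbbbb`) is invented data for illustration.

```python
def stage_elapsed(flow, start_stage, end_stage):
    """Elapsed time between two labelled stages of one flow, or None."""
    times = {}
    for record in flow:
        stage = record.get("stage")
        if stage is not None:
            times.setdefault(stage, int(record["ts"]))  # first occurrence wins
    if start_stage in times and end_stage in times:
        return times[end_stage] - times[start_stage]
    return None

def stage_filter(flows, start_stage, end_stage, min_elapsed):
    """Return the corids whose start->end elapsed exceeds min_elapsed."""
    selected = []
    for corid, flow in flows.items():
        elapsed = stage_elapsed(flow, start_stage, end_stage)
        if elapsed is not None and elapsed > min_elapsed:
            selected.append(corid)
    return selected

flows = {
    "aaaaa": [
        {"ts": "1234011", "corid": "aaaaa", "stage": "request-calc"},
        {"ts": "1234025", "corid": "aaaaa", "stage": "calc-finished"},
    ],
    "bbbbb": [  # invented slow flow
        {"ts": "1234011", "corid": "bbbbb", "stage": "request-calc"},
        {"ts": "1234211", "corid": "bbbbb", "stage": "calc-finished"},
    ],
}
slow = stage_filter(flows, "request-calc", "calc-finished", 100)
```

Only `bbbbb` passes the filter, so end-to-end statistics computed afterwards would cover the slow flows only.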
Branch RPC
Client-app1.log
{"ts": "1234010", "corid": "aaaaa"}
{"ts": "1234011", "corid": "aaaaa", "stage": "request-calc-1"}
{"ts": "1234011", "corid": "aaaaa", "stage": "request-calc-2"}
{"ts": "1234011", "corid": "aaaaa", "stage": "request-calc-3"}
ServerFunction.logs (single source file)
{"ts": "1234021", "corid": "aaaaa", "stage": "bootstrap", "serverlessContextId": "1111:name:version"}
{"ts": "1234022", "corid": "aaaaa", "stage": "starting", "serverlessContextId": "1111"}
{"ts": "1234024", "corid": "aaaaa", "stage": "finished", "serverlessContextId": "1111"}
{"ts": "1234021", "corid": "aaaaa", "stage": "bootstrap", "serverlessContextId": "2222:name:version"}
{"ts": "1234022", "corid": "aaaaa", "stage": "starting", "serverlessContextId": "2222"}
{"ts": "1234024", "corid": "aaaaa", "stage": "finished", "serverlessContextId": "2222"}
{"ts": "1234021", "corid": "aaaaa", "stage": "bootstrap", "serverlessContextId": "3333:name:version"}
{"ts": "1234022", "corid": "aaaaa", "stage": "starting", "serverlessContextId": "3333"}
{"ts": "1234024", "corid": "aaaaa", "stage": "finished", "serverlessContextId": "3333"}
Client-app1.log
{"ts": "1234025", "corid": "aaaaa", "stage": "calc-finished"}
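Because all three branches share one correlation id, something else has to tell them apart. One option, sketched below, is to group the function records by serverlessContextId. This assumes the part of the id before the first ':' identifies the invocation, which the logs above suggest but do not guarantee. `split_branches` and `branch_elapsed` are hypothetical helpers.

```python
from collections import defaultdict

def split_branches(records):
    """Group records of one corid by the invocation part of their context id."""
    branches = defaultdict(list)
    for record in records:
        context = record.get("serverlessContextId")
        if context is None:
            continue  # client-side lines carry no context id
        invocation = context.split(":", 1)[0]  # "1111:name:version" -> "1111"
        branches[invocation].append(record)
    return dict(branches)

def branch_elapsed(branch):
    """Elapsed time from the first to the last record of a branch."""
    times = [int(r["ts"]) for r in branch]
    return max(times) - min(times)

records = [{"ts": "1234011", "corid": "aaaaa", "stage": "request-calc-1"}]
for ctx in ("1111", "2222", "3333"):
    records += [
        {"ts": "1234021", "corid": "aaaaa", "stage": "bootstrap",
         "serverlessContextId": f"{ctx}:name:version"},
        {"ts": "1234022", "corid": "aaaaa", "stage": "starting",
         "serverlessContextId": ctx},
        {"ts": "1234024", "corid": "aaaaa", "stage": "finished",
         "serverlessContextId": ctx},
    ]

branches = split_branches(records)
elapsed = {ctx: branch_elapsed(b) for ctx, b in branches.items()}
```

Each branch can then be analysed like a single RPC call, with the client-side request and response lines attached to the flow as a whole.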