
FIP: Dataflow topology builder and analytics #50

@bluemonk3y

Description


Motivation:

Understanding millions of related events and their relationships in a dataflow is effectively impossible today. Events span serverless functions, stream processors and microservices, and stitching that story together to extract instrumentation knowledge is impractical. It should be possible to build a queryable dataflow model that supports instrumentation. The result would be the ability to scan millions of events (historically and in realtime), break them down by different attributes, identify anomalies, drill into the set of interest, and then down to a single dataflow with the relevant contextual information.

For example:

  • millions of IoT events -> those devices generating errors or performing slowly
  • millions of payments -> system latency degradation over time, identifying the types of payments or attributes affecting performance

Proposed change

The ability to build a dataflow model that supports analytics such as the number of flows and the latency between steps. Break down millions of dataflows against different attributes. Support filtering by analytics, attributes, latency and more. In some ways it is equivalent to distributed tracing, except it:

  • also includes related log lines against a correlation-id (provides context)
  • doesn’t require special infrastructure (JSON data stored on cloud-storage)
  • is extensible (i.e. can be updated to support CloudEvents in future)
  • has low impact on existing systems

V1 Mechanics:

Known limitations:

  1. supports a single correlation-id only (no propagation, fan-out etc.)
  2. rewrites correlation data (duplication)

Stage 1. Rewrite dataflows by time-correlation
Format: {timestamp}-corr-id.log -> records[timestamp:data]

This makes each stage of the dataflow an append-only log of that stage of the correlation’s dataflow. Timestamps distinguish between different states of the flow.

For example.

time-1: corr-a data-1
time-2: corr-b data-2
time-3: corr-a data-3

yields two data files:

time-1-corr-a.log -> records for corr-a
time-2-corr-b.log -> records for corr-b
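The Stage 1 rewrite above can be sketched as follows. This is a minimal illustration, not the proposed implementation; the record fields (`ts`, `corid`, `data`) and the in-memory return value (rather than actual cloud-storage writes) are assumptions for the example.

```python
def rewrite_by_correlation(records):
    """Group a time-ordered record stream into per-correlation logs.

    Each record is a dict like {"ts": "time-1", "corid": "corr-a", ...}.
    Returns a mapping of '{first-timestamp}-{corr-id}.log' -> records,
    mirroring the '{timestamp}-corr-id.log -> records[timestamp:data]'
    format; a real implementation would write these files to storage.
    """
    flows = {}  # corr-id -> (first timestamp seen, accumulated records)
    for rec in records:
        corr = rec["corid"]
        if corr not in flows:
            flows[corr] = (rec["ts"], [])
        flows[corr][1].append(rec)
    return {f"{first_ts}-{corr}.log": recs
            for corr, (first_ts, recs) in flows.items()}

files = rewrite_by_correlation([
    {"ts": "time-1", "corid": "corr-a", "data": "data-1"},
    {"ts": "time-2", "corid": "corr-b", "data": "data-2"},
    {"ts": "time-3", "corid": "corr-a", "data": "data-3"},
])
# yields 'time-1-corr-a.log' (2 records) and 'time-2-corr-b.log' (1 record)
```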

Stage 2. Calculate dataflow-level analysis index

For each dataflow, generate span information into a time-n-corr-m.index file.
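A sketch of what the per-dataflow span index could contain, assuming numeric `ts` values and `stage` labels on the log lines; the index field names here are illustrative, since the proposal does not fix an index schema.

```python
def build_span_index(records):
    """Derive span information for one dataflow's append-only log.

    Produces one span per consecutive stage pair plus an end-to-end
    total, the shape that would be serialised into a
    time-n-corr-m.index file.
    """
    staged = [r for r in records if "stage" in r]
    spans = [
        {"from": a["stage"], "to": b["stage"], "elapsed": b["ts"] - a["ts"]}
        for a, b in zip(staged, staged[1:])
    ]
    return {
        "corid": records[0]["corid"],
        "spans": spans,
        "total_elapsed": staged[-1]["ts"] - staged[0]["ts"] if staged else 0,
    }

index = build_span_index([
    {"ts": 1234011, "corid": "aaaaa", "stage": "request-calc"},
    {"ts": 1234022, "corid": "aaaaa", "stage": "starting"},
    {"ts": 1234025, "corid": "aaaaa", "stage": "calc-finished"},
])
# two spans (elapsed 11 and 3), total_elapsed 14
```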

Stage 3. Build a per-day aggregate view of correlation data

  1. Scan all index files to extract histogram analytics, including:
    • number of dataflows
    • min, max, avg, 95th-percentile and 99th-percentile breakdowns
  2. The output histogram is stored keyed by the timerange of the query, so repeated queries do not require recalculation.
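The Stage 3 scan could look roughly like this. The input/output shapes are assumptions for illustration, and the nearest-rank percentile used here is just one possible estimator.

```python
import statistics

def aggregate_indexes(indexes, time_range):
    """Fold per-dataflow .index files into one histogram record.

    'indexes' are span-index dicts (as produced in Stage 2) and
    'time_range' is the (start, end) of the query; the result is keyed
    by that range so a repeated query can reuse it instead of
    recalculating.
    """
    totals = sorted(ix["total_elapsed"] for ix in indexes)

    def pct(p):
        # nearest-rank percentile over the sorted elapsed times
        return totals[min(len(totals) - 1, int(p / 100 * len(totals)))]

    return {
        "time_range": time_range,
        "dataflows": len(totals),
        "min": totals[0],
        "max": totals[-1],
        "avg": statistics.mean(totals),
        "p95": pct(95),
        "p99": pct(99),
    }

view = aggregate_indexes(
    [{"total_elapsed": t} for t in (5, 7, 9, 11, 120)],
    ("2024-01-01", "2024-01-02"),
)
# 5 dataflows; the 120ms outlier dominates p95/p99 while avg stays 30.4
```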

Datamodel:

  1. RPC call

I.e. Client -> function(request) -> client(result)

Client-app1.log
{ ts:'1234010', corid:'aaaaa' }
{ ts:'1234011', corid:'aaaaa', stage:'request-calc' }

ServerlessFunction.log
{ ts:'1234021', corid:'aaaaa', stage:'bootstrap', serverlessContextId:'1234:name:version' }
{ ts:'1234022', corid:'aaaaa', stage:'starting', serverlessContextId:'1234' }
{ ts:'1234023', corid:'aaaaa', msg:'processing some stuff for things' }
{ ts:'1234024', corid:'aaaaa', stage:'finished', serverlessContextId:'1234' }
Lambda finished processing: memory:1234 time:2345

Client-app1.log
{ ts:'1234025', corid:'aaaaa', stage:'calc-finished' }

Single-call attributes collected (raw):
Request start-time (corrid, timestamp, stage)
Fn load-time (fnId, timestamp)
Fn handler start (corrid, fnId, timestamp, stage)
Fn handler end (corrid, fnId, timestamp, stage, fn-stats)
Response received-time (corrid, timestamp, stage)

Analytics

Call-chain analytics
Stage1->Stage2 elapsed (stages are labels on the log line)
Stage2->Stage3 elapsed
...etc...
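Computing those stage-to-stage elapsed times across many dataflows could be sketched like this; the per-label series it produces is the shape a candle chart or percentile summary would consume. The input layout (corid -> time-ordered records) is an assumption for the example.

```python
from collections import defaultdict

def call_chain_elapsed(flow_logs):
    """Compute Stage1->Stage2 style elapsed times across many dataflows.

    The stage labels on each log line become the span label
    (e.g. 'request-calc->bootstrap'); the output is one elapsed-time
    series per label, across all correlation ids.
    """
    series = defaultdict(list)
    for corid, records in flow_logs.items():
        staged = [r for r in records if "stage" in r]
        for a, b in zip(staged, staged[1:]):
            series[f"{a['stage']}->{b['stage']}"].append(b["ts"] - a["ts"])
    return dict(series)

series = call_chain_elapsed({
    "aaaaa": [
        {"ts": 1234011, "corid": "aaaaa", "stage": "request-calc"},
        {"ts": 1234021, "corid": "aaaaa", "stage": "bootstrap"},
        {"ts": 1234024, "corid": "aaaaa", "stage": "finished"},
    ],
    "bbbbb": [
        {"ts": 1234111, "corid": "bbbbb", "stage": "request-calc"},
        {"ts": 1234119, "corid": "bbbbb", "stage": "bootstrap"},
        {"ts": 1234125, "corid": "bbbbb", "stage": "finished"},
    ],
})
# series["request-calc->bootstrap"] == [10, 8]
```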

Simple timeseries analysis:
Represented as a candle-chart?

  1. Individual stage analytics using percentile/max/avg
    stageA-stageB elapsed percentiles (label stageA->stageB)
    Expression: analytics.dataflow.stageStats(dataset-name?)

  2. End-to-end stage analytics:
    Stage1->Stage4 elapsed using percentiles
    Expression: analytics.dataflow.endToEndStats(dataset-name?)

  3. Filter stage-to-stage latency by value:
    Stage1->Stage4 elapsed > 100ms
    Expression: analytics.dataflow.stageFilter(stage1 - stage4 > 100) analytics.dataflow.endToEndStats(dataset-name?)

  4. Branch RPC
    Client-app1.log
    { ts:'1234010', corid:'aaaaa' }
    { ts:'1234011', corid:'aaaaa', stage:'request-calc-1' }
    { ts:'1234011', corid:'aaaaa', stage:'request-calc-2' }
    { ts:'1234011', corid:'aaaaa', stage:'request-calc-3' }

ServerFunction.logs (single source file)
{ ts:'1234021', corid:'aaaaa', stage:'bootstrap', serverlessContextId:'1111:name:version' }
{ ts:'1234022', corid:'aaaaa', stage:'starting', serverlessContextId:'1111' }
{ ts:'1234024', corid:'aaaaa', stage:'finished', serverlessContextId:'1111' }
{ ts:'1234021', corid:'aaaaa', stage:'bootstrap', serverlessContextId:'2222:name:version' }
{ ts:'1234022', corid:'aaaaa', stage:'starting', serverlessContextId:'2222' }
{ ts:'1234024', corid:'aaaaa', stage:'finished', serverlessContextId:'2222' }
{ ts:'1234021', corid:'aaaaa', stage:'bootstrap', serverlessContextId:'3333:name:version' }
{ ts:'1234022', corid:'aaaaa', stage:'starting', serverlessContextId:'3333' }
{ ts:'1234024', corid:'aaaaa', stage:'finished', serverlessContextId:'3333' }

Client-app1.log
{ ts:'1234025', corid:'aaaaa', stage:'calc-finished' }
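In the branch-RPC case the three serverless invocations all log under the same correlation id (the single-correlation-id limitation noted earlier), so the serverlessContextId is what tells the branches apart. A sketch of that grouping, assuming the context id's first ':'-separated segment identifies the invocation:

```python
from collections import defaultdict

def split_branches(records):
    """Separate fan-out branches that share one correlation id.

    Records without a serverlessContextId (the client's own lines) are
    grouped under a synthetic 'client' branch for illustration.
    """
    branches = defaultdict(list)
    for rec in records:
        ctx = rec.get("serverlessContextId", "client").split(":")[0]
        branches[ctx].append(rec)
    return dict(branches)

branches = split_branches([
    {"ts": 1234021, "corid": "aaaaa", "stage": "bootstrap",
     "serverlessContextId": "1111:name:version"},
    {"ts": 1234024, "corid": "aaaaa", "stage": "finished",
     "serverlessContextId": "1111"},
    {"ts": 1234021, "corid": "aaaaa", "stage": "bootstrap",
     "serverlessContextId": "2222:name:version"},
    {"ts": 1234024, "corid": "aaaaa", "stage": "finished",
     "serverlessContextId": "2222"},
])
# two branches, '1111' and '2222', each with its bootstrap/finished pair
```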

Labels: enhancement (New feature or request)