
Benchmark Ideas #9

@salman1993

Description

Really cool project! I am curious about the benchmarks, especially around reliability across runs.

Say we have workflows like these:

  # Slack Channel Summary DM
  # Fetches messages from #general and #random, summarizes each, and DMs you

  import "slack" from "mcp:remote-slack-server"

  agent fetcher:
    model: haiku
    skills: ["slack"]
    prompt: "You fetch Slack messages"

  agent summarizer:
    model: sonnet
    skills: ["slack"]
    prompt: "You create concise one-line summaries"

  # Fetch both channels in parallel
  parallel:
    general_msgs = session: fetcher
      prompt: "Get messages from #general channel for the last 7 days"

    random_msgs = session: fetcher
      prompt: "Get messages from #random channel for the last 7 days"

  # Summarize each channel in parallel
  parallel:
    general_summary = session: summarizer
      prompt: "Write a single one-line summary of the key activity/themes"
      context: general_msgs

    random_summary = session: summarizer
      prompt: "Write a single one-line summary of the key activity/themes"
      context: random_msgs

  # DM the summaries
  session: summarizer
    prompt: """
      Send me a Slack DM with this format:

      📊 *7-Day Channel Summary*

      *#general:* {general_summary}
      *#random:* {random_summary}
    """
    context: { general_summary, random_summary }

I am curious how prose programs perform compared to detailed English instructions, since the program is still interpreted by LLMs:

  1. Call slack__get_msgs with channel='general', num_days=7
  2. Call slack__get_msgs with channel='random', num_days=7
  3. Write a one-line summary of #general activity
  4. Write a one-line summary of #random activity
  5. Call slack__send_dm to me with both summaries formatted as:
     "📊 7-Day Channel Summary"
     "#general: <summary>"
     "#random: <summary>"

The example above is obviously really simple, but if we were to do this over 100 Slack channels, it gets hard and there's a lot of variance across runs (for example, completing without traversing all channels). Have you looked into this at all, or can you share anything?
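
To make that concrete, here's roughly the kind of check I'm imagining, just as a sketch: run the same workflow many times, collect each run's tool-call trace, and measure what fraction of the expected channels actually got a slack__get_msgs call. (run_workflow and the trace shape below are made up for illustration, not your actual API.)

  # Sketch of a reliability metric for repeated runs: channel coverage per run.
  # Everything below is illustrative. run_workflow() and the trace shape are
  # assumptions about what the runtime might expose, not the project's real API.

  import statistics

  def channels_fetched(trace: list[dict]) -> set[str]:
      # Channels that actually received a slack__get_msgs call in this run's trace.
      return {
          call["args"]["channel"]
          for call in trace
          if call["tool"] == "slack__get_msgs"
      }

  def coverage_stats(traces: list[list[dict]], expected: set[str]) -> dict:
      # Per-run fraction of expected channels touched, spread across runs,
      # and how often a run traversed every channel before finishing.
      per_run = [len(channels_fetched(t) & expected) / len(expected) for t in traces]
      return {
          "mean_coverage": statistics.mean(per_run),
          "stdev": statistics.stdev(per_run) if len(per_run) > 1 else 0.0,
          "full_traversal_rate": sum(c == 1.0 for c in per_run) / len(per_run),
      }

  # Usage (run_workflow is a hypothetical runner returning a tool-call trace):
  # traces = [run_workflow("slack_channel_summary") for _ in range(20)]
  # print(coverage_stats(traces, expected={"general", "random"}))

The same harness could also score the prose-program version against the numbered-instructions version to see which one is more consistent across runs.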
