
Batch discovery pipeline #202

Merged: dimavrem22 merged 46 commits into main from batch-discovery-pipeline on Feb 27, 2026

dimavrem22 (Contributor) commented Feb 24, 2026

Agent Refactor

  • Collapsed AbstractSpecialist into AbstractAgent — all specialists now inherit directly from AbstractAgent, eliminating the intermediate layer
  • Autonomous loop (run_autonomous, finalize tools, iteration tracking, urgency notices, output schema injection) lives entirely in AbstractAgent
  • @agent_tool gained persist (NEVER/ALWAYS/OVERFLOW), max_characters, and token_optimized parameters:
    • Token optimization: token_optimized=True encodes results via toon encoder for cheaper token usage
    • Workspace persistence: persist=ALWAYS saves every tool result as a workspace artifact
    • Character overflow management: persist=OVERFLOW auto-saves results exceeding max_characters to raw/ as artifacts, returning an 800-char preview + artifact ID pointer to the LLM instead of blowing up context
  • All agents can be attached to a workspace and execute Python unless explicitly disabled
  • Each concrete subclass must declare an AGENT_CARD (enforced by __init_subclass__)
  • _collect_tools is now cached with @lru_cache per subclass to avoid repeating the dir(cls) traversal on every LLM call
  • Auto-wrap fix: if the LLM passes finalize output fields as top-level kwargs instead of nested under "output", the base class rewraps automatically
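
The overflow behaviour described for persist=OVERFLOW can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: Persist, FakeWorkspace, and present_tool_result are hypothetical names, the artifact ID format is invented, and the handling of persist=ALWAYS (save, then return the full result) is an assumption. Only the three persist modes, max_characters, the raw/ location, and the 800-character preview come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Persist(Enum):
    NEVER = "never"
    ALWAYS = "always"
    OVERFLOW = "overflow"


PREVIEW_CHARS = 800  # preview length stated in the PR description


@dataclass
class FakeWorkspace:
    """Hypothetical in-memory stand-in for the real workspace."""

    artifacts: dict = field(default_factory=dict)

    def save_artifact(self, content: str) -> str:
        # Invented ID scheme; the real artifact IDs may look different.
        artifact_id = f"raw/artifact-{len(self.artifacts) + 1}"
        self.artifacts[artifact_id] = content
        return artifact_id


def present_tool_result(result: str, persist: Persist,
                        max_characters: int, workspace: FakeWorkspace) -> str:
    """Return the text the LLM would see for one tool result."""
    if persist is Persist.ALWAYS:
        workspace.save_artifact(result)  # assumed: save, still return everything
        return result
    if persist is Persist.OVERFLOW and len(result) > max_characters:
        artifact_id = workspace.save_artifact(result)
        preview = result[:PREVIEW_CHARS]
        return f"{preview}\n[truncated; full result saved as {artifact_id}]"
    return result


ws = FakeWorkspace()
short_view = present_tool_result("ok", Persist.OVERFLOW, 1000, ws)
long_view = present_tool_result("x" * 5000, Persist.OVERFLOW, 1000, ws)
```

In this sketch the short result passes through untouched, while the 5000-character result is stored once and replaced by a preview plus a pointer.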

Workspace Refactor

  • Workspace is now an artifact-oriented system with a strict directory layout
  • Allows mounting external files in read-only mode via hardlinks (no copying of large capture files)
  • Each workspace has these directories:
    • raw/ (read-only): tool result artifacts and mounted external files
    • output/: agent-generated deliverables (written by tools, read by humans)
    • context/: reusable notes/context saved for later use in the same run
    • meta/: system-managed metadata (manifest.jsonl, input_mounts.jsonl) — not editable
    • scratch/: ephemeral scratch space
  • save_artifact() is the core write API — records provenance in meta/manifest.jsonl with SHA-256, size, content type, timestamp, and optional tool/code-run metadata
  • Snapshot/diff primitives (snapshot_paths, diff_snapshot) for tracking which output files changed during a tool call
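
For reviewers who want a mental model of the provenance record, here is a rough sketch of what a save_artifact() call might append to meta/manifest.jsonl. The function signature and the JSON field names are assumptions made for illustration; only the recorded facts (SHA-256, size, content type, timestamp, optional tool metadata) and the manifest path are taken from the description above.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_artifact(root: Path, rel_path: str, data: bytes,
                  content_type: str, tool_name: Optional[str] = None) -> dict:
    """Write the artifact and append one provenance line to the manifest."""
    target = root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)

    record = {
        "path": rel_path,  # hypothetical field names throughout
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "content_type": content_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
    }
    manifest = root / "meta" / "manifest.jsonl"
    manifest.parent.mkdir(parents=True, exist_ok=True)
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")  # one JSON object per line
    return record


demo_root = Path(tempfile.mkdtemp())
demo_record = save_artifact(demo_root, "raw/result.json", b"{}",
                            "application/json", tool_name="demo_tool")
```

Appending one JSON line per write keeps the manifest cheap to update and easy to replay, which is presumably why a .jsonl file was chosen.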

API Indexing Pipeline

Overview

End-to-end pipeline that turns raw CDP captures into a catalog of executable routines, fully autonomous.

Phase 1 — Exploration (4 specialists run in parallel via ThreadPoolExecutor):

  • NetworkSpecialist → NetworkExplorationSummary
  • ValueTraceResolverSpecialist → StorageExplorationSummary
  • DOMSpecialist → DOMExplorationSummary
  • InteractionSpecialist → UIExplorationSummary

Each filters thousands of raw events down to the 5–15 endpoints/tokens/forms that actually matter.
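
The Phase 1 fan-out can be pictured with a small ThreadPoolExecutor sketch. The run_exploration helper and the lambda stand-ins are hypothetical; the real specialists are LLM-backed agents, and how the pipeline actually submits and collects them is not shown in this description.

```python
from concurrent.futures import ThreadPoolExecutor


def run_exploration(specialists: dict) -> dict:
    """Run each specialist callable in its own thread; return summaries by name."""
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        futures = {name: pool.submit(fn) for name, fn in specialists.items()}
        # result() blocks until that specialist finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins that just name the summary type each specialist produces.
specialists = {
    "network": lambda: {"summary": "NetworkExplorationSummary"},
    "storage": lambda: {"summary": "StorageExplorationSummary"},
    "dom": lambda: {"summary": "DOMExplorationSummary"},
    "ui": lambda: {"summary": "UIExplorationSummary"},
}
summaries = run_exploration(specialists)
```

Threads are a reasonable fit here because each specialist spends most of its time waiting on LLM API calls rather than on CPU work.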

Phase 2 — Routine Construction (PI orchestrator loop):

  • PrincipalInvestigator reads all 4 exploration summaries, plans a routine catalog, dispatches experiments to concurrent ExperimentWorker agents
  • Workers have live browser tools + recorded capture lookup tools, execute experiments, report structured findings
  • PI reviews results, accumulates proven artifacts, assembles routines, and submits to RoutineInspector for quality gating
  • Routines that pass inspection ship; those that fail get iterated on
  • Full incremental persistence: every experiment, attempt, routine, and agent thread is written to disk as it happens
  • PI crash recovery: if the PI dies (context exhaustion, API error), a fresh PI is constructed from the persisted DiscoveryLedger and continues where the previous one left off (up to 3 attempts)
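
The crash-recovery loop in the last bullet can be sketched as below. Ledger is a hypothetical stand-in for the persisted DiscoveryLedger, and run_with_recovery and make_flaky_pi are invented for this illustration; the only facts taken from the PR are that a fresh PI is rebuilt from persisted state and that the default limit is 3 attempts.

```python
from dataclasses import dataclass, field

MAX_PI_ATTEMPTS = 3  # default from the PR; a commit adds --max-pi-attempts


@dataclass
class Ledger:
    """Hypothetical stand-in for the persisted DiscoveryLedger."""

    shipped: list = field(default_factory=list)
    pending: list = field(default_factory=list)


def run_with_recovery(ledger, make_pi, max_attempts=MAX_PI_ATTEMPTS):
    """Rebuild a fresh PI from the ledger after each crash; return attempts used."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        pi = make_pi(ledger)  # fresh PI, constructed from persisted state only
        try:
            pi()
            return attempt
        except RuntimeError as exc:  # assumed crash type for this sketch
            last_error = exc
    raise RuntimeError(f"PI failed after {max_attempts} attempts") from last_error


def make_flaky_pi(crash_on):
    """Build a PI factory whose PIs crash on the given global step numbers."""
    steps = {"n": 0}

    def factory(ledger):
        def pi():
            while ledger.pending:
                steps["n"] += 1
                if steps["n"] in crash_on:
                    raise RuntimeError("context exhausted")
                ledger.shipped.append(ledger.pending.pop(0))
        return pi

    return factory


ledger = Ledger(pending=["login", "search", "details"])
attempts_used = run_with_recovery(ledger, make_flaky_pi(crash_on={2}))
```

Because progress lives in the ledger rather than in the PI's context, the second PI in this sketch resumes with "search" instead of redoing "login".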

New Agents

  • PrincipalInvestigator: orchestrator with no browser access — plans routines, dispatches workers, reviews results, assembles and ships routines
  • ExperimentWorker: browser-capable execution agent with live browser_* tools (navigate, eval JS, raw CDP) and recorded capture lookup tools — executes experiments, does NOT make strategic decisions
  • RoutineInspector: independent quality gate — scores routines on 6 dimensions (task completion, data quality, parameter coverage, robustness, structural correctness, documentation), hard-fails on 4xx/5xx responses or unresolved placeholders
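
A rough sketch of how the RoutineInspector's hard-fail rules could sit alongside the six scored dimensions. Everything beyond the dimension names and the two hard-fail conditions is an assumption: the {{name}} placeholder syntax, the 0 to 1 score scale, the averaging, and the 0.7 threshold are all hypothetical, and the real inspector is an LLM-backed agent rather than a pure function.

```python
import re

DIMENSIONS = (
    "task_completion",
    "data_quality",
    "parameter_coverage",
    "robustness",
    "structural_correctness",
    "documentation",
)

# Assumed placeholder syntax; the real routines may use a different one.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")


def hard_fail_reasons(status_codes, rendered_request):
    """Collect reasons that fail a routine regardless of its scores."""
    reasons = []
    bad_codes = [code for code in status_codes if 400 <= code <= 599]
    if bad_codes:
        reasons.append(f"HTTP error responses: {bad_codes}")
    leftovers = PLACEHOLDER.findall(rendered_request)
    if leftovers:
        reasons.append(f"unresolved placeholders: {leftovers}")
    return reasons


def inspect_routine(scores, status_codes, rendered_request, pass_threshold=0.7):
    """Combine hard-fail checks with an average over the six dimensions."""
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    reasons = hard_fail_reasons(status_codes, rendered_request)
    overall = sum(scores[name] for name in DIMENSIONS) / len(DIMENSIONS)
    passed = not reasons and overall >= pass_threshold
    return {"passed": passed, "overall": overall, "hard_fail_reasons": reasons}


good_scores = {name: 0.9 for name in DIMENSIONS}
clean = inspect_routine(good_scores, [200], "GET /api/items?q=shoes")
broken = inspect_routine(good_scores, [200, 404], "GET /api/items?q={{query}}")
```

The point of the sketch is the ordering: a hard failure overrides even high scores, so a routine that returned a 404 cannot ship on documentation quality alone.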

How to Run

bluebox-api-index \
  --cdp-captures-dir ./cdp_captures \
  --task "Recover and validate routines from this captured session. Get all routines that deliver useful data to the user!" \
  --output-dir ./api_indexing_output \
  --model gpt-5.2 \
  --post-run-analysis

Other Changes

  • DOM Data Loader: new DOMDataLoader for dom/events.jsonl — parses full DOM string-interning tables, classifies elements by tag family
  • Code Execution Sandbox: added Lambda backend (BLUEBOX_SANDBOX_MODE=lambda), auto mode (Lambda > Docker > blocklist), read_only_paths support for workspace safety, expanded blocked-module workaround hints
  • New data models: DiscoveryLedger, ExperimentEntry, RoutineSpec, RoutineAttempt, RoutineCatalog, RoutineInspectionResult in orchestration/; exploration summaries in api_indexing/
  • Agent docs: runtime-searchable markdown docs (agent_docs/) for auth token resolution, naming conventions, CORS workarounds
  • Deleted AbstractSpecialist, RoutineDiscoveryAgentBeta, and all old docs/ planning files
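
The sandbox auto mode (Lambda > Docker > blocklist) amounts to a preference-ordered fallback. A minimal sketch, assuming the mode names are the lowercase strings below, that "auto" is the default when BLUEBOX_SANDBOX_MODE is unset, and that the blocklist backend is always available as a last resort; resolve_sandbox_mode is a hypothetical helper, not a function from this PR.

```python
import os

PREFERENCE = ("lambda", "docker", "blocklist")  # order stated in the PR


def resolve_sandbox_mode(available, env=None):
    """Pick a backend: an explicit mode wins, otherwise the first available one."""
    env = os.environ if env is None else env
    requested = env.get("BLUEBOX_SANDBOX_MODE", "auto").lower()  # assumed default
    if requested != "auto":
        if requested not in PREFERENCE:
            raise ValueError(f"unknown sandbox mode: {requested}")
        return requested
    for mode in PREFERENCE:
        if mode in available:
            return mode
    return "blocklist"  # assumed to need no external infrastructure


auto_choice = resolve_sandbox_mode({"docker", "blocklist"}, env={})
forced_choice = resolve_sandbox_mode(set(), env={"BLUEBOX_SANDBOX_MODE": "lambda"})
```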

Post-Review Edits

Merged run_python_code into execute_python

BlueBoxAgent exposed two Python execution tools to the LLM (execute_python from AbstractAgent and run_python_code defined locally), which was confusing. Merged them:

  • Moved file-tracking logic (workspace snapshot/diff) from BlueBoxAgent._run_python_code into the base AbstractAgent._execute_python — all workspace-backed agents now get file-tracking automatically
  • Deleted _run_python_code and its imports from BlueBoxAgent
  • Updated BlueBoxAgent.SYSTEM_PROMPT to reference execute_python everywhere
  • Updated tests in test_blocklist_hints.py to use execute_python

Workspace file-tracking covers entire workspace

Previously execute_python only snapshotted output/ — files written to context/ or scratch/ via Python code were never diffed or uploaded to S3. Fixed by snapshotting the entire workspace root before and after execution, so every file change is captured regardless of directory.
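
A simplified sketch of whole-root tracking, to make the fix concrete. snapshot_paths and diff_snapshot are the primitive names from this PR, but the signatures, the hash-based comparison, and the return shape here are assumptions; the real implementation may compare mtimes or sizes instead.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_paths(root: Path) -> dict:
    """Map every file under root (relative POSIX path) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshot(before: dict, after: dict) -> dict:
    """Classify paths as created, modified, or deleted between two snapshots."""
    return {
        "created": sorted(set(after) - set(before)),
        "modified": sorted(p for p in before.keys() & after.keys()
                           if before[p] != after[p]),
        "deleted": sorted(set(before) - set(after)),
    }


# Simulate one execute_python call that touches output/ and context/.
root = Path(tempfile.mkdtemp())
(root / "output").mkdir()
(root / "context").mkdir()
(root / "output" / "report.txt").write_text("v1", encoding="utf-8")

before = snapshot_paths(root)
(root / "output" / "report.txt").write_text("v2", encoding="utf-8")
(root / "context" / "notes.md").write_text("remember this", encoding="utf-8")
changes = diff_snapshot(before, snapshot_paths(root))
```

Snapshotting only output/ would have missed context/notes.md in this example, which is the gap the change closes.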

Removed _after_chat_added wrapper

Removed the _after_chat_added method from AbstractAgent that silently swallowed all exceptions and ignored the Chat argument. The on_chat_added callback is now called directly in _add_chat with the Chat object passed through. Updated PI's lambda wiring to accept the _chat param.
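
The resulting wiring looks roughly like this. Chat and Agent are stripped-down hypothetical stand-ins for the real classes; the parts taken from the PR are the Callable[[Chat], None] callback type, the direct call inside _add_chat, and the _chat parameter name on the lambda.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Chat:
    role: str
    content: str


@dataclass
class Agent:
    on_chat_added: Optional[Callable[[Chat], None]] = None
    chats: list = field(default_factory=list)

    def _add_chat(self, chat: Chat) -> None:
        self.chats.append(chat)
        if self.on_chat_added is not None:
            # Called directly: no wrapper, so callback errors are not swallowed.
            self.on_chat_added(chat)


seen = []
agent = Agent(on_chat_added=lambda _chat: seen.append(_chat.content))
agent._add_chat(Chat(role="user", content="hello"))
```

With the wrapper gone, a failing callback raises out of _add_chat instead of disappearing silently, so persistence bugs show up immediately.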

dimavrem22 force-pushed the batch-discovery-pipeline branch 2 times, most recently from e05deca to 0aed6ce, on February 26, 2026 20:33
dimavrem22 and others added 28 commits February 26, 2026 22:57
Create DOMDataLoader for parsing CDP DOMSnapshot.captureSnapshot data with
element extraction (forms, inputs, buttons, links, tables, headings, clickable),
plus new methods for meta tags, script tags (with inline content for __NEXT_DATA__
etc.), and hidden inputs for token/key discovery. Add DOMSpecialist agent with
15 tools wrapping the loader, TUI run script, HTTP adapter integration, and
exploration scripts for storage/network/UI domains. Includes 89 unit tests
and API indexing data models for the exploration phase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ation

- Rename UISpecialist → InteractionSpecialist (DOM tools optional, interactions primary)
- Delete old InteractionSpecialist (subset of new one)
- Clean run_dom_exploration.py to DOM-only (no interaction auto-upgrade)
- Create run_ui_exploration.py using InteractionSpecialist for user intent
- Add UIExplorationSummary model, remove user_inputs/inferred_intent from DOMExplorationSummary
- Update InterestLevel enum in network exploration prompt
- Add exploration_output/ with all 4 Premier League capture results
- Add planning docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ements

- Fix pipeline diagram to show workers report to PI, not directly to inspector
- Clarify iteration definition (one LLM API call, not one tool call)
- Remove fabricated anti-bot bullet from DOM exploration; add to spec_v2 as improvement
- Clarify PI-side quality gates are pure Python static checks (no LLM)
- Clarify auth-first ordering is prompt-only, not enforced in code
- Fix UI exploration data source: InteractionSpecialist also gets DOM loader
- Add --max-pi-attempts CLI flag (was hardcoded MAX_PI_ATTEMPTS = 3)
- Add docs/api_indexing_spec_v2/potential_improvements.md with 3 improvements:
  1. WindowProperty exploration specialist
  2. True pipeline resumability and agent thread replay
  3. Anti-bot detection as first-class exploration output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rkers only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ts improvement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…provement (#5)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…orrect test params

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, tool observability, output schemas, PI execution visibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All system prompts from every agent in the API indexing pipeline in one
file for easy auditing: exploration specialists, PI, worker, inspector,
plus dynamic sections and schemas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dd impact/effort ratings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimavrem22 and others added 5 commits February 27, 2026 00:19
…oilerplate

DRY up the ensure-browser / timeout / error envelope duplicated across all
browser tools in ExperimentWorker and related specialists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimavrem22 marked this pull request as ready for review on February 27, 2026 06:50
…ompt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alex-w-99

This comment was marked as resolved.

BlueBoxAgent exposed two Python execution tools to the LLM, which was
confusing. Move the file-tracking logic (output/ snapshot/diff) from
BlueBoxAgent._run_python_code into the base AbstractAgent._execute_python
so all workspace-backed agents benefit and BlueBoxAgent only exposes one
Python tool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimavrem22 and others added 4 commits February 27, 2026 14:25
…tput/

Files written to context/ or scratch/ via execute_python were never
snapshotted or diffed, so S3Workspace never uploaded them. Add
WRITABLE_ROOTS class attribute to AgentWorkspace ABC and use it in
_execute_python so all writable directories are tracked consistently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the _after_chat_added method that silently swallowed all exceptions
and ignored the Chat argument. The on_chat_added callback is now called
directly in _add_chat with the Chat object passed through. Also fix the
callback type from Callable[[], None] to Callable[[Chat], None].

Also add WRITABLE_ROOTS to AgentWorkspace ABC so the agent doesn't
hardcode which directories are writable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove WRITABLE_ROOTS and snapshot the whole workspace root so every
file created or modified by execute_python gets diffed and uploaded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimavrem22 merged commit 130e9fb into main on Feb 27, 2026
7 checks passed
2 participants