[SANDBOX-1531] Introduce a new EP for alert session scoring API. #180
base: master
Conversation
This is just the core API; follow-up EPs will flesh out more functionality and UI integration on top of it.
**Walkthrough**: Adds the EP-0028 Phase 1 design: an async, LLM-based Alert Session Scoring API with multi-dimensional scoring (logical_flow, consistency, tool_relevance, synthesis_quality), criteria versioning via SHA256-hashed prompts, REST/WebSocket interactions, storage of freeform analyses, and a six-phase implementation roadmap.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as Scoring API
    participant Service as Scoring Service
    participant LLM
    participant DB as Storage
    Client->>API: POST /score-session (session_id, alert_data)
    API->>Service: validate request, enqueue scoring job
    Service->>DB: fetch session/context (if needed)
    Service->>LLM: send constructed multi-turn prompt (SESSION_CONVERSATION, ALERT_DATA, OUTPUT_SCHEMA)
    LLM-->>Service: structured judge response (JSON per schema)
    Service->>Service: normalize/compute total_score, verify criteria SHA256
    Service->>DB: persist score record, freeform analysis, metadata
    Service-->>API: return scoring initiation/response
    API-->>Client: 200 OK or job accepted, WebSocket progress updates as available
```
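For orientation, the accept-and-enqueue portion of this flow could look roughly like the FastAPI sketch below. The route path, request model, and helper functions are assumptions for illustration, not code from this PR:

```python
# Illustrative sketch only -- route path, helpers and models are assumptions, not code from this PR.
from fastapi import APIRouter, BackgroundTasks, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/scoring")


class ScoreRequest(BaseModel):
    session_id: str
    force_rescore: bool = False


async def fetch_completed_session(session_id: str):
    """Placeholder: look up a completed session via the History Service."""
    ...


async def run_scoring_job(session_id: str, force_rescore: bool) -> None:
    """Placeholder: build the judge prompt, call the LLM, persist the score."""
    ...


@router.post("/score-session", status_code=202)
async def score_session(req: ScoreRequest, background_tasks: BackgroundTasks):
    # Validate the request and confirm the session exists and is completed.
    if await fetch_completed_session(req.session_id) is None:
        raise HTTPException(status_code=400, detail="session not found or not completed")
    # Run the judge LLM off the request path; the client polls or listens on the WebSocket.
    background_tasks.add_task(run_scoring_job, req.session_id, req.force_rescore)
    return {"session_id": req.session_id, "status": "accepted"}
```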
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧠 Learnings (6)
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Design section with detailed technical design focusing on HOW to implement the solution with clean, practical design
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include detailed explanations of WHY (keep it brief) in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Problem Statement section with a brief description of what needs to be solved
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include irrelevant context or noise in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~839-~839: Ensure spelling is correct
Context: ...on (LLM-dependent) - Score retrieval: < 100ms (database query) - Support for concurre...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
17-17: Bare URL used
(MD034, no-bare-urls)
447-447: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
196-325: Verify referenced file paths and implementation status. The EP references several Python files for Phase 1 implementation:

- `backend/tarsy/services/scoring_service.py`
- `backend/tarsy/models/scoring_models.py`
- `backend/tarsy/controllers/scoring_controller.py`
- `backend/tarsy/repositories/scoring_repository.py`
- `config/scoring_config.yaml`

Confirm that:

- Directory structure (`services/`, `models/`, `controllers/`, `repositories/`) follows established patterns in the codebase
- Configuration file location (`config/scoring_config.yaml`) aligns with existing config patterns
- Main application file (`backend/tarsy/main.py`) exists and can register new routes
- File paths are consistent with the current project structure
> ## Experimental Background
>
> **Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
Format bare URL as markdown link.
Line 17 contains a bare URL that should be properly formatted as a markdown link to comply with MD034.
```diff
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [https://github.com/metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
```

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
17-17: Bare URL used
(MD034, no-bare-urls)
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around line 17, the
comment contains a bare URL which violates MD034; update the line to replace the
bare URL with a markdown link by wrapping an appropriate link text in square
brackets followed by the URL in parentheses (e.g., use descriptive text like the
repository name as the link text), ensuring the link is inline and not a raw
URL.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #180      +/-   ##
==========================================
+ Coverage   83.94%   83.97%   +0.02%
==========================================
  Files         124      124
  Lines       11947    11947
==========================================
+ Hits        10029    10032       +3
+ Misses       1918     1915       -3
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
♻️ Duplicate comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
62-62: Format bare URL as markdown link to comply with MD034. Line 62 contains a bare URL that should be wrapped as a markdown link.

```diff
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [https://github.com/metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
```
🧹 Nitpick comments (4)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (4)
38-1182: Consolidate design and implementation content into explicit formal sections. The document contains comprehensive design and implementation details but lacks formal section headers for "Design" and "Implementation Plan" as required by EP guidelines. Current design content is scattered across subsections (System Architecture, Data Design, API Design, Configuration, Security, Performance, etc.), and implementation details are embedded in Implementation Status rather than as a distinct section.

Reorganize as follows:

After "Use Cases" (line 149), add a clear section:

```
## Design
```

Then consolidate existing design subsections (System Architecture through Documentation Requirements) under this header.

Replace "Implementation Status" (line 1195) with formal sections:

```
## Implementation Plan

### Phases & Dependencies
[expand "Next Steps" into phased breakdown with dependencies]

## Status & Tracking

**Status:** Pending
[rest of current status info]
```

This provides the structure: Problem Statement → Goals → Overview → Design → Use Cases → Implementation Plan → Success Criteria → Status & Tracking, improving navigability and alignment with EP standards.
499-530: Clarify mutable vs. immutable aspects of scoring criteria. The design states that "Scoring criteria are defined in the freeform `judge_prompt` field" and are "flexible and can evolve without code changes" (line 500). However, the content-hashing mechanism (lines 599-601) treats the entire configuration as immutable once hashed. This creates potential ambiguity:

- Question: If criteria evolve and the hash changes, how do operators know when to re-score historical sessions? What's the recommended workflow for criteria migration?
- Suggestion: Add a section clarifying the operational workflow when criteria change (e.g., "Operators re-score historical sessions with `force_rescore=true` to compare old vs. new scores and assess the impact of criteria evolution").

This is especially important for use case #4 (Criteria Evolution Tracking, lines 115-121), which relies on clear criteria change management.
622-894: Judge prompt template is comprehensive but may need validation against actual LLM behavior. The judge prompt (lines 622-894) is extensive (~270 lines) and includes detailed scoring criteria, evaluation philosophy, and output format instructions. The design relies heavily on LLM compliance with these instructions, particularly:
- Scoring calibration (lines 843-865): Assumes LLM will produce average scores of 55-75
- Missing tools identification (lines 758-778): Requires LLM to distinguish between "critical" vs. "nice-to-have" tools
- Alternative approaches (lines 780-808): Requires LLM to generate realistic, tool-specific steps
Consideration: Phase 1 testing should validate that the chosen judge LLM (e.g., Gemini Flash, GPT-4) consistently produces scores and feedback that align with these expectations. If variance is high, the prompt may need refinement.
Recommendation: Add a section in Testing Strategy (line 973) specifically for "Judge LLM Consistency Validation" testing the same session 3-5 times to measure score variance and feedback stability. Use case #7 (line 141-147) hints at this but doesn't formalize it as a testing requirement.
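Such a consistency check could be a small pytest that scores one fixed session several times and bounds the variance. A sketch with a hypothetical fixture and illustrative thresholds:

```python
import statistics


def test_judge_score_consistency(score_session_with_judge):
    """`score_session_with_judge` is a hypothetical pytest fixture that runs the real
    judge LLM against one fixed, completed session and returns the parsed score."""
    totals = [score_session_with_judge("session-123")["total_score"] for _ in range(5)]
    assert all(0 <= t <= 100 for t in totals)
    # Illustrative threshold; tune once real variance data exists.
    assert statistics.pstdev(totals) <= 5, f"judge scores vary too much: {totals}"
```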
1195-1207: Expand "Next Steps" into a formal Implementation Plan with phases and dependencies. The "Next Steps" section (lines 1199-1207) lists 9 tasks sequentially but lacks explicit phase information, dependencies, and estimated effort. For an Enhancement Proposal, a more structured Implementation Plan would clarify:
- Phase breakdown: Which tasks are Phase 1a (database), 1b (service), 1c (API), 1d (testing)?
- Critical path: Which tasks block others (e.g., database migration must precede service implementation)?
- Dependency map: Does "Write unit and integration tests" depend on "Implement API endpoints"?
- Ownership: Is there a suggested owner or team for each phase?
Suggestion: Restructure "Implementation Status" section as:
```
## Implementation Plan

### Phased Tasks

**Phase 1a: Infrastructure**
- Create Alembic migration for all scoring tables
- (Dependency: none; blocks Phase 1b)

**Phase 1b: Service Layer**
- Implement scoring service and models
- (Dependency: Phase 1a; blocks Phase 1c)

[continue for all phases...]

## Status & Tracking

**Status:** Pending
...
```

This clarifies sequencing and dependencies for implementation teams.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~950-~950: Ensure spelling is correct
Context: ...on (LLM-dependent) - Score retrieval: < 100ms (database query) - Support for concurre...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
62-62: Bare URL used
(MD034, no-bare-urls)
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
108-108: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
116-116: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
124-124: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
134-134: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
558-558: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
155-167: Verify referenced components exist in codebase. The design references existing components (History Service at `backend/tarsy/services/history_service.py`, LLM Client at `backend/tarsy/integrations/llm_client.py`) and proposes new components following the pattern `backend/tarsy/{services,models,controllers,repositories}`. Confirm:

- History Service exists and exposes the required data retrieval logic for complete session data
- LLM Client exists and can be reused for judge LLM calls as proposed
- Codebase directory structure follows the backend/tarsy pattern for new service components
- Main application (`backend/tarsy/main.py`) exists and supports route registration
> ### Primary Use Cases
>
> 1. **Manual Quality Review**
Specify language identifier for all fenced code blocks to comply with MD040.
Use cases embed code blocks without language specifications. Add a `text` identifier to each opening fence:

````diff
 1. **Manual Quality Review**
-```
+```text
 Operator: Reviews completed security investigation
````

Apply the same change to all 7 use case blocks (lines 92, 100, 108, 116, 124, 134, 142).
Also applies to: 100-100, 108-108, 116-116, 124-124, 134-134, 142-142
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines
92,100,108,116,124,134,142 the fenced code blocks are missing a language
identifier which violates MD040; update each opening triple-backtick fence at
those lines to use ```text so every use-case code block begins with ```text
(apply this change to all seven specified lines).
alexeykazakov
left a comment
Thanks for the document! I've started the review (not finished yet). Here are just a few first comments.
> - **Criteria Evolution**: Content-addressed configuration enables tracking scoring criteria changes over time
> - **Flexible Scoring**: JSONB score breakdowns adapt to evolving evaluation dimensions without schema migrations
> - **Actionable Feedback**: Normalized storage of missing tools and alternative approaches enables analytics
> - **Manual Control**: Phase 1 is operator-triggered only; automation deferred to future phases
Do you think it's better to support triggering evaluation manually only for now? I wonder if automated evaluation would actually be easier...
My idea here was that a manually invokable REST endpoint would be easier to play with than something running on a schedule. Not much further thought behind it.
> ## Non-Goals (Future Phases)
>
> - UI integration for displaying scores
Well... I would include some simple/primitive UI in Phase 1. Otherwise we would have to use the API any time we want to see the score/analysis. So we would have to build something anyway.
Co-authored-by: Alexey Kazakov <alkazako@redhat.com>
Actionable comments posted: 1
🧹 Nitpick comments (5)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (5)
62-62: Format bare URL as markdown link (MD034 compliance). Line 62 contains a bare URL that should be wrapped in markdown link syntax to comply with MD034.

```diff
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [GitHub: metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
```
92-92: Add language identifier to all use case code blocks (MD040 compliance). Use case blocks at these lines lack language identifiers. Standardize by adding a `text` identifier:

````diff
 1. **Manual Quality Review**
-```
+```text
 Operator: Reviews completed security investigation
````

Apply the same change to all seven use case blocks (lines 92, 100, 108, 116, 124, 134, 142).
Also applies to: 100-100, 108-108, 116-116, 124-124, 134-134, 142-142
558-558: Add language identifier to the Example Prompt Flow code block (MD040 compliance):

````diff
 **Example Prompt Flow:**
-```
+```text
 [Criteria and instructions]
````
1-1200: Reorganize into formal Design and Implementation Plan sections per EP guidelines. The document still lacks the explicit structure required by learnings. While comprehensive design content exists, it is scattered across subsections. Consolidate as follows:
Current structure issue: Content mixes design details (System Architecture, Data Design, API Design, etc.) without a consolidated "Design" header, and implementation roadmap is buried in "Implementation Status" with minimal detail.
Required restructuring:
After "Problem Statement" (line 31): Add new Design section that consolidates:
- Design Overview & Principles (currently lines 38-51)
- Design Goals (currently lines 52-59)
- System Architecture → System Architecture subsection
- Data Design → Data Design subsection
- API Design → API Design subsection
- Configuration → Configuration subsection
- Security → Security subsection
- Performance & Scalability → Performance subsection
- Error Handling → Error Handling subsection
- Monitoring & Observability → Monitoring subsection
After Design section: Add new Implementation Plan section that expands:
- Migration & Deployment → Migration subsection
- Testing Strategy → Testing subsection (move from current location)
- Move "Next Steps" (lines 1199-1209) into Implementation Plan with phased breakdown
Keep: Success Criteria (lines 1183-1192)
Rename: "Implementation Status" to "Status & Tracking" with cleaner metadata
This provides the required EP structure: Problem Statement → Design → Implementation Plan → Success Criteria → Status & Tracking
Based on learnings, this reorganization is essential for AI consumption of the document.
935-944: Detail input validation strategy for API endpoints. Line 940 mentions "Input validation on all API endpoints (session_id format, request body schema)" but provides no specifics:

- What is the session_id format validation? (UUID regex, length, allowed characters?)
- What schema validation is performed on the request body?
- Are there limits on field sizes (e.g., `force_rescore` as boolean)?
- What about integer constraints (e.g., total_score bounds check before storage)?

Recommend specifying validation rules explicitly in the API Design section (lines 438-496), possibly with examples.
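If the EP adopts explicit rules, they could be encoded directly in the request/response models. A Pydantic sketch assuming UUID session IDs and a 0-100 score range (both assumptions the EP would need to confirm):

```python
from uuid import UUID

from pydantic import BaseModel, Field


class ScoreSessionRequest(BaseModel):
    # Assumes session IDs are UUIDs; adjust if the history service uses another format.
    session_id: UUID
    force_rescore: bool = False


class ScoreRecord(BaseModel):
    total_score: int = Field(ge=0, le=100)                    # bounds-checked before storage
    criteria_hash: str = Field(min_length=64, max_length=64)  # SHA256 hex digest
```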
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🔇 Additional comments (6)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (6)
170-188: Clarify History Service integration and error handling for incomplete sessions. Line 174 references retrieving session data via "History Service (via `/final-analysis` endpoint logic)", but the integration details are sparse:

- How is the History Service called? Direct method call or HTTP endpoint?
- What if a session is not yet completed? It should return 400 (bad request) per line 484, but error handling details are missing.
- What constitutes "completed session data"? Is there a status check needed?
- How are partial or failed sessions handled?
Recommend adding clarification to the "Data Flow" section (lines 170-188) about:
- Exact method of retrieving session data from History Service
- Validation of session completion status
- Error response for incomplete or non-existent sessions
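A sketch of the behaviour this comment asks for, assuming an in-process call to a hypothetical `get_session_with_details` method on the History Service:

```python
from fastapi import HTTPException


async def load_completed_session(history_service, session_id: str):
    """Fetch full session data and reject sessions that are missing or not completed.
    `get_session_with_details` is a hypothetical History Service method."""
    session = await history_service.get_session_with_details(session_id)
    if session is None:
        raise HTTPException(status_code=404, detail=f"session {session_id} not found")
    if session.status != "completed":
        # EP line 484 maps not-yet-completed sessions to 400 Bad Request.
        raise HTTPException(status_code=400, detail="session is not completed yet")
    return session
```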
448-453: Clarify force_rescore semantics and score lifecycle. The `force_rescore` flag (line 451) is mentioned but its behavior is underspecified:

- If `force_rescore=true` and criteria_hash has changed, does it:
  - Create a new score_id?
  - Update the existing score_id with new criteria_hash (breaking historical tracking)?
  - Store both old and new scores with different score_ids?
- Data flow (line 172) mentions "Return existing score" but doesn't address criteria evolution.
- API response model (lines 401-413) includes an `is_current_criteria` field, suggesting multiple scores per session are possible, but this contradicts the `UNIQUE` constraint on `session_id` in line 215.

The unique constraint on `session_id` (line 215) suggests only one score per session, but historical tracking across criteria versions requires multiple scores per session.

Recommend clarifying:

- Database schema: should `session_id` be unique, or should (session_id, criteria_hash) be unique instead?
- Update the data flow to explain the multi-criteria scoring lifecycle
- Clarify the `is_current_criteria` computation logic

Also applies to: 451-452
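If the intent is one score per (session, criteria version), a composite unique constraint expresses that directly. A SQLAlchemy-style sketch with assumed table and column names:

```python
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SessionScore(Base):
    __tablename__ = "session_scores"
    __table_args__ = (
        # One score per (session, criteria version) instead of one per session.
        UniqueConstraint("session_id", "criteria_hash", name="uq_session_criteria"),
    )

    score_id = Column(Integer, primary_key=True)
    session_id = Column(String, nullable=False, index=True)
    criteria_hash = Column(String(64), nullable=False)
    total_score = Column(Integer, nullable=False)
```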
155-167: Specify which components are modified vs. new in the main.py integration. Line 162 mentions "Register scoring API routes" in `backend/tarsy/main.py` but doesn't specify:

- Are existing routes in main.py? How do they currently load modules?
- Is this a simple `app.include_router()` call, or does it require changes to initialization logic?
- Are there existing FastAPI router patterns in the codebase to follow?
Recommend adding a concrete code example of the main.py integration pattern for clarity during implementation.
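For reference, if the controller exposes a module-level router, the registration is likely a one-liner. A sketch with assumed module and prefix names (not verified against the actual main.py):

```python
# In backend/tarsy/main.py (illustrative wiring; module path and prefix are assumptions)
from fastapi import FastAPI

from tarsy.controllers.scoring_controller import router as scoring_router  # hypothetical module

app = FastAPI()
app.include_router(scoring_router, prefix="/api/v1", tags=["scoring"])
```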
954-956: Clarify asynchronous scoring implementation details. Line 955 mentions "Asynchronous scoring using background tasks (FastAPI background tasks)" but implementation details are missing:

- Are background tasks the right choice, or should this be a job queue (Celery, RQ)?
- How does the caller get notified when async scoring completes? Polling? Webhooks? (not mentioned)
- What is the timeout for the async task? (no timeout specified)
- If the async task fails, how is retry handled? (lines 962-965 specify retry for the API layer, but what about the background-task level?)
- Can a user call GET /score while scoring is in progress? Return 202 Accepted?
Recommend expanding Performance section with async scoring flow and polling/notification strategy.
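One possible shape for the timeout/retry questions above, sketched with illustrative values and placeholder helpers (none of these names come from the EP):

```python
import asyncio

JUDGE_TIMEOUT_SECONDS = 120  # illustrative cap on a single judge LLM call
MAX_ATTEMPTS = 3             # illustrative retry budget for transient failures


async def call_judge_llm(session_id: str) -> dict:
    """Placeholder: build the prompt and call the judge LLM."""
    ...


async def persist_score(session_id: str, result: dict) -> None:
    """Placeholder: write the score record to the database."""
    ...


async def mark_scoring_failed(session_id: str) -> None:
    """Placeholder: record a failed state so GET /score can report it."""
    ...


async def run_scoring_job(session_id: str) -> None:
    """Background-task sketch: judge LLM call with a timeout and simple retry."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = await asyncio.wait_for(call_judge_llm(session_id), JUDGE_TIMEOUT_SECONDS)
            await persist_score(session_id, result)
            return
        except Exception:  # real code would narrow this to timeout/transient LLM errors
            if attempt == MAX_ATTEMPTS:
                await mark_scoring_failed(session_id)
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff between retries
```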
622-895: Judge prompt design is comprehensive and well-structured. ✓ The judge prompt template demonstrates excellent prompt engineering with clear evaluation philosophy, detailed scoring criteria, explicit language patterns, and comprehensive guidance for missing-tool and alternative-approach identification. The emphasis on rigor and critical evaluation aligns well with the stated design goals.

Note: The prompt is intentionally lengthy (~300 lines) to support comprehensive evaluation. Per prototype findings (line 68), session data plus the prompt fits comfortably in a 1M-token window, so this is acceptable for Phase 1.
156-166: Correct the file path reference for the LLM integration. The EP references `backend/tarsy/integrations/llm_client.py` (line 165), but this file does not exist. The LLM client is implemented as a package at `backend/tarsy/integrations/llm/` with modules like `client.py` and `manager.py`. Update the EP to reference `backend/tarsy/integrations/llm/` (or the specific module: `llm/manager.py`, `llm/client.py`).

All other referenced paths are correct or will be created with the proper directory structure:

- `backend/tarsy/services/scoring_service.py` ✓ (directory exists)
- `backend/tarsy/models/scoring_models.py` ✓ (directory exists)
- `backend/tarsy/controllers/scoring_controller.py` ✓ (directory exists)
- `backend/tarsy/repositories/scoring_repository.py` ✓ (directory exists)
- `backend/tarsy/services/history_service.py` ✓ (file exists)
- `backend/tarsy/main.py` ✓ (file exists)

Likely an incorrect or invalid review comment.
> ## Security
>
> ### Authentication & Authorization
>
> **Authentication**: Handled at infrastructure level by oauth2-proxy (reverse proxy)
> - Users authenticated via OAuth (GitHub, Google, etc.) or JWT tokens
> - Application trusts `X-Forwarded-User` and `X-Forwarded-Email` headers from oauth2-proxy
> - No application-level authentication middleware required
>
> **Authorization**: Not implemented at application level
> - All authenticated users have full access to all endpoints
> - No role-based access control (RBAC) currently exists
> - Authorization is a future enhancement (referenced in EP-0004 as "may be added in future phases")
>
> **User Attribution**:
> - User identification extracted from oauth2-proxy headers via `extract_author_from_request()` helper
> - Used for audit trails and tracking who triggered scoring operations
> - Follows same pattern as alert submission (see `alert_controller.py`)
>
> ### Security Controls
>
> - Input validation on all API endpoints (session_id format, request body schema)
> - Rate limiting on scoring endpoints to prevent LLM API abuse (implemented at oauth2-proxy/infrastructure level)
> - Judge LLM prompts sanitized to prevent prompt injection
> - Score data treated with same sensitivity as session data (no additional PII)
> - HTTPS/WSS enforced for all communication (infrastructure level)
Add explicit secret management and LLM API key handling strategy.
Security section lacks detail on LLM API key management:
- How are LLM provider credentials (API keys) managed? (vault, environment variables, etc.)
- Is key rotation supported?
- Are API keys logged or exposed in error messages? (line 965 mentions logging LLM responses - truncated at 1000 chars - but are keys included?)
- How are failed LLM requests logged without leaking credentials?
Recommend adding explicit guidance:
- Use secure credential storage (e.g., HashiCorp Vault, AWS Secrets Manager)
- Never log full prompts if they contain sensitive data
- API error responses should not expose keys or internal prompts
Based on learnings, security posture gaps degrade guarantees.
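A minimal pattern for the key-handling and redaction points above, assuming the key is injected as an environment variable from a secret store (variable name and logging format are illustrative):

```python
import logging
import os

logger = logging.getLogger(__name__)


def get_judge_api_key() -> str:
    # Assumption: the key is injected by the platform (e.g. from Vault / AWS Secrets Manager)
    # as an environment variable and is never committed to repo config files.
    key = os.environ.get("JUDGE_LLM_API_KEY")
    if not key:
        raise RuntimeError("JUDGE_LLM_API_KEY is not configured")
    return key


def log_llm_failure(session_id: str, prompt: str, error: Exception) -> None:
    # Log only non-sensitive metadata and a truncated prompt snippet; never the API key.
    logger.error(
        "judge LLM call failed: session=%s prompt_snippet=%r error=%s",
        session_id,
        prompt[:200],
        type(error).__name__,
    )
```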
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines
918–944, add an explicit LLM API key and secret management section: state that
provider credentials must be stored in a managed secrets store (e.g., HashiCorp
Vault, AWS Secrets Manager) with IAM-restricted access, enforce key rotation
policies, and avoid embedding keys in environment/config files except as
short-lived tokens; require that all logs and error handlers redact or omit API
keys and full prompts (log only truncated/sanitized prompt snippets and
non-sensitive metadata), ensure failed LLM request logging strips credentials
and sensitive inputs, document recommended operational controls (audit logging
of secret access, least-privilege service accounts, CI checks for accidental
secret exposure), and reference these requirements as mandatory for production
deployments.
> 7. Store score in database:
>    - Insert `session_scores` record with score_id, total_score, score_breakdown, etc.
>    - Insert `score_missing_tools` records (one per missing tool)
>    - Insert `score_alternative_approaches` records (one per alternative)
>    - Insert `score_alternative_approach_steps` records (one per step in each alternative)
Let's think about the pros and cons of a detailed DB schema vs. a simplified one with a flexible JSON or even string field as the model.
How are we going to consume the stored sections of the score report? My understanding is that it's mostly for later AI analysis. In that case a JSON/string field should be good enough, right?
Pros of the proposed approach:
- Easy to query score reports for individual sections.
Cons:
- Requires initial parsing of the AI-generated report.
- Have to deal with formatting mistakes of AI (because of parsing)
- Any report template changes are very expensive: they require changes to the parser, models, and DB schema (including migration).
Alternative approaches.
Option 1:
- We keep detailed API/internal model with breakdown:
- score (int)
- missing_tools
- alternative_approaches
- etc.
- The DB model/schema is simplified and has:
- score (int)
- report (JSON) - serialized API/internal model (w/o score int as it's in a dedicated field)
Pros: we can always change the internal model and API w/o changing the DB schema.
Cons: harder to migrate existing records when the API changes w/o breaking de-serialization (from DB to API model).
Option 2:
- All models and API:
- score (int)
- report (string) - an MD document with all components like missing tools, alternative approaches, etc. Generated by AI.
Pros: very flexible. We can change the format of the report. No db migration ever needed.
Cons: You can't easily query the DB for score reports with specific sections without parsing the report. But do we really need this?
I personally think the best option is Option 2. We just let the AI generate the score report. We extract the score number and store the entire report as an MD document (string). We will use that report as-is later on for AI analysis. Even if the format of the document is not 100% consistent, it's not a big problem for AI. But it gives us a lot of flexibility to change the template. We also don't have to deal with parsing the initial report and making sure that the AI follows the format properly.
I think I'd go half-way actually :)
I have a clearer vision for the missing tools, so let me describe why we need at least a little bit of structure (name + description) there and what it would be used for.
We'd aggregate all the missing tools from all the scores and let the LLM cluster them together semantically.
Once we have that clustering, we can "link" the missing tools in the scores with their "representative tools".
We then use just the "representative tools" (and the counts of the linked missing tools from the scores) for the higher level report on the "most missing tools".
Comparing the representative missing tools with the current tool "menu" that the agent provides (again using an LLM to do a semantic comparison) can enable us to mark the tools as implemented.
This information can then for example mark scores as obsolete (we can't do much more at that point unfortunately because we cannot just re-trigger sessions with new tools - the payloads that they analyzed will have been long gone). But we can for example exclude such obsolete scores from the quality reporting because they are no longer relevant to the current state of the tooling available to TARSy.
With alternative approaches we could maybe still do with name + description, too. This would again enable us to semantically cluster them together and then maybe report on analyses where the LLM most frequently chooses a non-optimal approach. I have not developed that idea too much in my mind though - so I'm not sure if and how that would be useful.
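A sketch of the minimal structure this clustering idea needs; all type and field names below are assumptions for illustration, not part of the EP:

```python
from dataclasses import dataclass


@dataclass
class MissingTool:
    """Stored per score: just enough structure for later semantic clustering."""
    name: str
    description: str


@dataclass
class RepresentativeTool:
    """Produced by an LLM clustering pass over all MissingTool entries."""
    name: str
    description: str
    linked_missing_tool_count: int = 0
    implemented: bool = False  # set when it matches a tool in the agent's current menu


def top_missing_tools(representatives: list, n: int = 10) -> list:
    """Input for a higher-level 'most missing tools' report."""
    return sorted(representatives, key=lambda r: r.linked_missing_tool_count, reverse=True)[:n]
```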
> The scoring system uses a YAML configuration file following TARSy's established configuration patterns (similar to `agents.yaml` and `llm_providers.yaml`). This file contains the judge prompt template and scoring parameters. The entire configuration is content-addressed using a hash to track criteria evolution.
>
> **Example Configuration:**
I would rather:
- Hard code the template and access it via the prompt_builder. Pros: works OOTB. No need to provide your own template to use this feature.
- Keep the configuration minimal and move it to agents.yaml:
- disabled/enabled - Optional. Disabled by default.
- llm_provider (the LLM provider includes the model name, so no need to keep it separate) - Optional. Use the system default provider by default.
I would definitely want the judge prompt to be part of the default configuration so that people don't have to provide their own (not sure what the prompt_builder is).
I agree we can simplify the config and move it to the agents.yaml but I think it is important to keep the judge prompt configurable:
- so that people can play with it locally
- so that when it changes (and I hope it will be evolving), we can mark scores as obsolete and let the user re-run them.
For the latter point, I think it is also useful to keep the judge prompts in the DB so that we can, e.g., have multiple scores for a single session, each done by a different judge prompt, so that we can compare and contrast the effect the different judge prompts have on the score.
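Modeled in Python for illustration, the minimal configuration surface discussed here might look roughly like this; the field names, defaults, and placeholder prompt text are assumptions, not an agreed design:

```python
from typing import Optional

from pydantic import BaseModel

# Shipped with the backend so scoring works out of the box; overridable for experimentation.
DEFAULT_JUDGE_PROMPT = "You are a strict judge evaluating an alert-handling session..."  # placeholder text


class ScoringConfig(BaseModel):
    enabled: bool = False               # disabled by default, per the suggestion above
    llm_provider: Optional[str] = None  # None -> fall back to the system default provider
    judge_prompt: str = DEFAULT_JUDGE_PROMPT  # hashed and stored alongside each score
```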
Actionable comments posted: 3
♻️ Duplicate comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
9-37: Add explicit Problem Statement and Design/Implementation Plan headings per EP guidelines. This EP-*.md starts with Overview and then jumps into configuration/API/implementation details. Per the EP learnings for
/docs/enhancements/pending/EP-*.md, each EP should explicitly include:
- A concise Problem Statement section near the top (what needs to be solved).
- A distinct Design section (how the solution works).
- A clear Implementation Plan section (phased tasks).
You already have good content for all three (Overview, Implementation Tasks, Implementation Phases). Consider:
- Adding a short
## Problem Statementimmediately after the metadata, summarizing the lack of systematic session-quality scoring.- Introducing a
## Designheading before “Configuration Syntax” and treating Configuration/API Usage/Scoring Output Schema/Use Cases as its subsections.- Renaming “## Implementation Phases” to
## Implementation Plan(keeping phases as subheadings).This keeps the doc aligned with the standard EP structure and improves AI consumption.
(prior review requested similar structure; this refines it further based on current content. As per coding guidelines, EP-*.md should expose these sections explicitly.)
🧹 Nitpick comments (2)
docs/enhancements/pending/tmp.md (2)
62-69: Format prototype URL as a markdown link to satisfy MD034

The prototype reference currently uses a bare URL, which markdownlint flags (MD034). Consider changing it to an inline link for consistency:
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
91-147: Add language identifiers to fenced blocks to comply with MD040

The narrative "scenario" blocks under Use Cases (e.g. the "Manual Quality Review" scenario starting "Operator: Reviews completed security investigation ...") and the Example Prompt Flow fence use bare ``` fences without a language, which triggers MD040 and reduces readability. Suggest annotating them as plain text by changing the bare fences to fences with a `text` language tag, and doing the same for the other use-case blocks and the "Example Prompt Flow" snippet. Also applies to: 558-566.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
docs/enhancements/pending/tmp.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
docs/enhancements/pending/tmp.md
🪛 LanguageTool
docs/enhancements/pending/tmp.md
[grammar] ~950-~950: Ensure spelling is correct
Context: ...on (LLM-dependent) - Score retrieval: < 100ms (database query) - Support for concurre...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/tmp.md
62-62: Bare URL used
(MD034, no-bare-urls)
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
108-108: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
116-116: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
124-124: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
134-134: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
558-558: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
docs/enhancements/pending/tmp.md
Outdated
## Security

### Authentication & Authorization

**Authentication**: Handled at infrastructure level by oauth2-proxy (reverse proxy)
- Users authenticated via OAuth (GitHub, Google, etc.) or JWT tokens
- Application trusts `X-Forwarded-User` and `X-Forwarded-Email` headers from oauth2-proxy
- No application-level authentication middleware required

**Authorization**: Not implemented at application level
- All authenticated users have full access to all endpoints
- No role-based access control (RBAC) currently exists
- Authorization is a future enhancement (referenced in EP-0004 as "may be added in future phases")

**User Attribution**:
- User identification extracted from oauth2-proxy headers via `extract_author_from_request()` helper
- Used for audit trails and tracking who triggered scoring operations
- Follows same pattern as alert submission (see `alert_controller.py`)

### Security Controls

- Input validation on all API endpoints (session_id format, request body schema)
- Rate limiting on scoring endpoints to prevent LLM API abuse (implemented at oauth2-proxy/infrastructure level)
- Judge LLM prompts sanitized to prevent prompt injection
- Score data treated with same sensitivity as session data (no additional PII)
- HTTPS/WSS enforced for all communication (infrastructure level)
Clarify LLM API key and secret management in Security section
The Security section covers authn/z and general controls but does not specify how LLM provider credentials (API keys) are stored, rotated, and protected in logs/errors. For production use, the EP should explicitly require:
- Storing LLM API keys in a managed secrets store (Vault/Secrets Manager/Kubernetes secrets) with least-privilege access.
- Supporting key rotation (short-lived tokens or documented rotation process).
- Redacting keys and full prompts from logs and error messages (log only sanitized/truncated context and non-sensitive metadata).
Adding a short “LLM Credentials & Secrets” subsection here will close this design gap and align security posture with the rest of the system.
734aac2 to 835d202 (Compare)
Actionable comments posted: 3
♻️ Duplicate comments (2)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (2)
147-150: Add 404 error response documentation to POST endpoint for consistency.

The POST `/api/v1/scoring/sessions/{session_id}/score` endpoint omits `404 Not Found` from its Error Responses, while the GET endpoint (lines 160–163) includes it. For API contract consistency, add:
**Error Responses:**
+* `404 Not Found` - Session not found
* `400 Bad Request` - Session not completed or invalid state
* `500 Internal Server Error` - LLM API failure or database error
39-607: Reorganize into explicit "Design" and "Implementation Plan" sections per EP structure.

The document mixes design and implementation details without formal section headers. Per learnings, EP documents must have distinct sections: Problem Statement → Goals → Design → Implementation Plan → Success Criteria → Status & Tracking.
Current issue: Design details (Configuration, API Usage, Output Schema) are scattered, and implementation roadmap (Tasks + Phases) lacks a unified section header.
Recommended restructuring:
After "Problem Statement" (new), add section: "Goals" (extract/synthesize from Overview, lines 13–23)
Add new "Design" section (after Goals, consolidates lines 39–209):
- Move "Configuration Syntax" (39–86) → subsection
- Move "API Usage" (89–164) → subsection
- Move "Scoring Output Schema" (167–195) → subsection
- Move "Use Cases" (199–209) → subsection
- Add new subsections for: Security (with secret management strategy), Performance, Error Handling, Monitoring, Key Design Decisions
Add new "Implementation Plan" section (consolidates lines 212–607):
- Move "Implementation Tasks" (212–416) → Phase-by-phase breakdown with task details
- Integrate "Implementation Phases" (431–607) into the Implementation Plan structure
- Each phase should list: goals, dependencies, task checklist, deliverables, milestones
- Add explicit owners/teams if applicable
Rename "Next Steps" section (609–617) → "Status & Tracking" with metadata: last updated, current phase, blockers, next milestone
Optional: Add "Success Criteria" section (if not already defined in Overview)
This restructuring ensures: Design captures what and how (system architecture, API contracts, configuration); Implementation Plan captures phased execution (tasks, dependencies, deliverables, milestones).
🧹 Nitpick comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
385-428: Clarify file path intent and add configuration option migration note.

The "Files to Modify" section (lines 385–428) lists paths for new files that will be created during implementation. This is appropriate for a design document. However, note one important detail for Phase 2:
Per learnings and past review comments, there was discussion about whether to hardcode the judge prompt in code or keep it configurable. The current design assumes `config/scoring_config.yaml` with a user-configurable judge prompt (line 402). This is mentioned in past comments where the preference was stated: "keep the judge prompt configurable so people can play with it locally and compare scoring across different prompts."
Confirm during implementation that:
- Default prompt is shipped with the application (not requiring users to provide one to get started)
- Configuration path supports override: `config/scoring_config.yaml` is optional; fall back to a hardcoded/bundled default if missing
- Prompt versioning is tracked: Judge prompts in DB enable historical comparison across scoring iterations
Consider adding a note to the "Configuration Loading" section (Phase 2, lines 454–471) clarifying this intent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
35-35: Fix fenced code blocks to include language identifiers.

The MD040 rule requires fenced code blocks to have a language specified. The file has bare closing fences (without language identifiers) at lines 58, 71, 105, 143, 193, and 893. Add appropriate language identifiers to each fence.
Note: Line 35 is already compliant—bare URLs wrapped in angle brackets satisfy MD034 requirements and do not need to be converted to markdown link syntax.
Likely an incorrect or invalid review comment.
# EP-0028 Phase 1: Alert Session Scoring API

**Status:** Pending
**Created:** 2025-12-05
**Phase:** Design

---

## Overview

Add systematic quality assessment for TARSy alert analysis sessions through an LLM-based judge that critically evaluates investigation methodology and provides actionable feedback.

**Core Capabilities:**

1. **LLM-Based Scoring**: Judge LLM evaluates completed sessions using rigorous scoring framework
2. **Critical Evaluation**: Methodology-focused assessment with strict standards (4 dimensions: logical flow, consistency, tool relevance, synthesis quality)
3. **Actionable Feedback**:
   * Identify missing MCP tools that should have been used
   * Suggest alternative investigation approaches with detailed steps
   * Provide improvement recommendations
4. **Criteria Evolution**: Content-addressed configuration (SHA256 hash) tracks scoring criteria changes over time
5. **Quality Metrics**: Quantitative scores (0-100) enable tracking improvements and identifying patterns
6. **Non-Intrusive**: Post-session scoring with zero impact on alert processing
7. **Manual Control**: Operator-triggered scoring only (Phase 1 scope)

**Primary Use Cases:**

* Agent development feedback and improvement tracking
* MCP server prioritization based on gap analysis
* Training data curation for high-quality sessions
* Investigation methodology pattern discovery

**Scope:** On-demand scoring API for individual sessions via REST endpoints and basic UI integration. Scheduled batch scoring, analytics aggregation and reporting are left for the future.

**POC Reference:** Initial prototype at <https://github.com/metlos/tarsy-response-score> proved judge LLM provides critical scoring even for self-produced analyses.

---

## Configuration Syntax

**Configuration File:** `config/scoring_config.yaml`

```yaml
# Scoring service settings (supports template variable substitution)
scoring:
  enabled: ${SCORING_ENABLED:-true}
  llm_provider: ${SCORING_LLM_PROVIDER:-${DEFAULT_LLM_PROVIDER}}
  llm_model: ${SCORING_LLM_MODEL:-}  # Empty = use provider default

  # Judge prompt template with placeholders
  # The entire configuration (including this prompt) is hashed to track criteria evolution.
  judge_prompt: |
    A prompt explaining how to perform the score. This prompt includes placeholders
    to be replaced with the actual data from the session. See below for more details.

    The prompt used with good success in the POC is included in the "attachment"
    section at the end of this EP.
```

**Environment Variables** (optional overrides):

```bash
# Enable/disable scoring (default: true)
SCORING_ENABLED=true

# LLM provider for scoring (default: falls back to DEFAULT_LLM_PROVIDER)
SCORING_LLM_PROVIDER=google-default

# Model for scoring (default: provider-specific default)
SCORING_LLM_MODEL=gemini-2.0-flash-exp
```

**Placeholder System:**

The judge prompt supports three placeholders that are replaced at runtime:

* `{{SESSION_CONVERSATION}}` - Complete conversation from History Service (includes all MCP tool usage)
* `{{ALERT_DATA}}` - Original alert data for reference
* `{{OUTPUT_SCHEMA}}` - JSON schema specification (automatically injected from `JudgeOutputSchema` class)
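To make the placeholder mechanism concrete, here is a minimal sketch of how the runtime substitution could look. It is illustrative only: the function name `build_judge_prompt` and its arguments are hypothetical, and the real implementation would live in the scoring service described later in this EP.

```python
import json

# Hypothetical helper; a sketch of the placeholder substitution described above,
# not code that exists in the TARSy codebase.
def build_judge_prompt(template: str, session_conversation: str,
                       alert_data: dict, output_schema: dict) -> str:
    """Replace the three documented placeholders in the judge prompt template."""
    return (
        template
        .replace("{{SESSION_CONVERSATION}}", session_conversation)
        .replace("{{ALERT_DATA}}", json.dumps(alert_data, indent=2))
        .replace("{{OUTPUT_SCHEMA}}", json.dumps(output_schema, indent=2))
    )
```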
**Design Rationale:**

* **Separation of Concerns**: Configuration defines how to *evaluate* (criteria), code defines how to *report* it (output schema)
* **Parsing Reliability**: Output format guaranteed to match code expectations
* **Criteria Evolution**: Full prompt is hashed (SHA256) to track changes; output schema changes require code updates
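A minimal sketch of the content-addressing idea follows. It assumes the fully resolved configuration is available as a Python dict; the canonical JSON serialization used here is one possible choice for achieving deterministic hashing, not a decision this EP has made.

```python
import hashlib
import json

def compute_criteria_hash(resolved_config: dict) -> str:
    """Hash the fully resolved scoring configuration (criteria versioning).

    Serializing with sorted keys keeps the hash deterministic regardless of
    key ordering in the YAML file. Illustrative sketch only.
    """
    canonical = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```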
---

## API Usage

**Note:** Phase 6 provides dashboard UI for these endpoints with visual score display and manual triggering.

### Score Session

**Endpoint:** `POST /api/v1/scoring/sessions/{session_id}/score`

**Purpose:** Trigger scoring for a specific session (operator-initiated or UI-triggered)

**Request Body:**

```json
{
  "force_rescore": false  // optional: re-score even if score exists
}
```

**Response (200 OK):**

```json
{
  "score_id": "550e8400-e29b-41d4-a716-446655440000",
  "session_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "criteria_hash": "a3f5b2c1d4e7f9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d1e3f5a7b9c1d3e5f7a9",
  "total_score": 67,
  "score_breakdown": {
    "logical_flow": 15,
    "consistency": 18,
    "tool_relevance": 14,
    "synthesis_quality": 20
  },
  "score_reasoning": "The investigation demonstrates adequate methodology with notable gaps...",
  "missing_tools": [
    {
      "tool_name": "list-processes-in-pod",
      "rationale": "Would have confirmed whether the suspicious binary was actively running..."
    }
  ],
  "alternative_approaches": [
    {
      "name": "File-First Forensic Approach",
      "description": "Systematically enumerate all suspicious files before attempting execution verification",
      "steps": [
        "Use list-files to discover all files in suspicious directory",
        "Read content of each file to identify actual malware vs false positives",
        "Correlate file timestamps with alert time window"
      ]
    }
  ],
  "scored_triggered_by": "alice@example.com",
  "scored_at": "2025-12-05T10:30:00Z",
  "is_current_criteria": true
}
```

**Note:** The `score_breakdown` structure is flexible and depends on scoring criteria. Structure may evolve as criteria change.

**Error Responses:**

* `400 Bad Request` - Session not completed or invalid state
* `500 Internal Server Error` - LLM API failure or database error

### Get Session Score

**Endpoint:** `GET /api/v1/scoring/sessions/{session_id}/score`

**Purpose:** Retrieve existing score for a session

**Response:** Same as Score Session endpoint

**Error Responses:**

* `404 Not Found` - Session not found or not yet scored
* `500 Internal Server Error` - LLM API failure or database error
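For illustration, a client could drive these two endpoints roughly as sketched below. This is not shipped tooling: the base URL and the choice of `httpx` are assumptions, and authentication is normally handled by oauth2-proxy in front of the backend rather than by the client itself.

```python
import httpx

BASE_URL = "http://localhost:8000"  # assumption: local TARSy backend

def score_session(session_id: str, force_rescore: bool = False) -> dict:
    """Trigger scoring for a completed session."""
    resp = httpx.post(
        f"{BASE_URL}/api/v1/scoring/sessions/{session_id}/score",
        json={"force_rescore": force_rescore},
        timeout=120.0,  # scoring waits on a judge LLM call
    )
    resp.raise_for_status()
    return resp.json()

def get_session_score(session_id: str) -> dict:
    """Fetch an existing score; raises on 404 if the session is unscored."""
    resp = httpx.get(f"{BASE_URL}/api/v1/scoring/sessions/{session_id}/score")
    resp.raise_for_status()
    return resp.json()
```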
---

## Scoring Output Schema

The judge LLM must return JSON matching this structure (defined in code as `JudgeOutputSchema`):

```python
class MissingTool(BaseModel):
    """Missing tool that should have been used in the investigation."""
    tool_name: str = Field(..., description="Name of the missing MCP tool")
    rationale: str = Field(..., description="Explanation of why this tool was needed")

class AlternativeApproach(BaseModel):
    """Alternative investigation approach that could have been more efficient/rigorous."""
    name: str = Field(..., description="Brief descriptive title of the approach")
    description: str = Field(..., description="1-2 sentences explaining the strategy")
    steps: List[str] = Field(..., description="Ordered list of specific actions")

class JudgeOutputSchema(BaseModel):
    """
    Output schema for judge LLM - defined in code for parsing reliability.
    Injected into judge prompt at {{OUTPUT_SCHEMA}} placeholder.
    """
    total_score: int = Field(..., ge=0, le=100, description="Overall score 0-100")
    score_breakdown: Dict[str, Any] = Field(..., description="Flexible dimension scores (e.g., {'logical_flow': 18, ...})")
    score_reasoning: str = Field(..., description="Detailed explanation (200+ words recommended)")
    missing_tools: List[MissingTool] = Field(default_factory=list, description="Tools that should have been used")
    alternative_approaches: List[AlternativeApproach] = Field(default_factory=list, description="Alternative investigation approaches")
```

**Note:** Only `total_score` is required. The `score_breakdown` dimensions can evolve as criteria change (stored as JSONB for flexibility). The nested models (`MissingTool`, `AlternativeApproach`) ensure type safety and clean database mapping.
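Two touchpoints of this schema are worth spelling out: injecting it into the prompt at `{{OUTPUT_SCHEMA}}` and validating the judge's reply. The sketch below uses standard Pydantic v2 calls; the import path is the module this EP proposes (it does not exist yet), and the raw LLM text is assumed to already be stripped of any markdown fences.

```python
from pydantic import ValidationError

# Proposed module from this EP (backend/tarsy/models/scoring_models.py); not yet in the codebase.
from tarsy.models.scoring_models import JudgeOutputSchema

# 1. Inject the JSON schema into the judge prompt at {{OUTPUT_SCHEMA}}.
output_schema_json = JudgeOutputSchema.model_json_schema()

# 2. Parse and validate the judge LLM's raw JSON reply.
def parse_judge_response(raw_text: str) -> JudgeOutputSchema:
    try:
        return JudgeOutputSchema.model_validate_json(raw_text)
    except ValidationError as exc:
        # Matches the EP's error handling: log the raw response and surface a 500.
        raise ValueError(f"Judge LLM returned invalid output: {exc}") from exc
```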
---

## Use Cases

**Primary Use Cases:**

* **Manual Quality Review**: Operators score completed investigations to understand quality and identify improvement areas
* **Agent Development Feedback**: Developers quantify agent improvements after code changes and identify specific gaps (e.g., "missing file-read tools")
* **MCP Server Prioritization**: Team analyzes `score_missing_tools` frequency to make data-driven decisions on which tools to develop next
* **Criteria Evolution Tracking**: Operators re-score historical sessions with new criteria (different `criteria_hash`) to validate criteria changes
* **Investigation Methodology Improvement**: Operators review alternative approaches from scores to learn investigation patterns
* **Training Data Curation**: ML engineers identify high-quality sessions (total_score >= 85) for fine-tuning datasets (see the query sketch below)
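As a rough illustration of how the normalized tables support the last two use cases, queries along these lines could pull high-scoring sessions and the most frequently missing tools. The model names follow the schema proposed later in this EP and do not exist yet; treat this as a sketch, not repository code.

```python
from sqlalchemy import func, select

# Proposed models from this EP (backend/tarsy/models/scoring_models.py); not yet implemented.
from tarsy.models.scoring_models import MissingToolDB, SessionScoreDB

def curation_and_gap_queries(session):
    # Sessions good enough for fine-tuning datasets (total_score >= 85).
    high_quality_session_ids = session.execute(
        select(SessionScoreDB.session_id).where(SessionScoreDB.total_score >= 85)
    ).scalars().all()

    # Most frequently reported missing tools, for MCP server prioritization.
    missing_tool_counts = session.execute(
        select(MissingToolDB.tool_name, func.count().label("occurrences"))
        .group_by(MissingToolDB.tool_name)
        .order_by(func.count().desc())
    ).all()

    return high_quality_session_ids, missing_tool_counts
```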
---

## Implementation Tasks

### Task 1: Database Schema

**Tables to Create:** (a partial SQLModel sketch follows this section)

1. **`scoring_criteria_definitions`**
   * Primary key: `criteria_hash` (VARCHAR 64, SHA256 of full config)
   * `criteria_content` (JSONB) - Complete scoring configuration
   * `created_at` (TIMESTAMP)
   * Index on `created_at`

2. **`session_scores`**
   * Primary key: `score_id` (UUID)
   * Unique constraint: `session_id`
   * Foreign keys: `session_id` → alert_sessions, `criteria_hash` → scoring_criteria_definitions
   * `total_score` (INTEGER, 0-100)
   * `score_breakdown` (JSONB) - Flexible dimension scores
   * `score_reasoning` (TEXT)
   * `scored_triggered_by` (VARCHAR 255) - User who triggered scoring (from X-Forwarded-User header)
   * `scored_at` (TIMESTAMP)
   * Indexes on: `session_id`, `criteria_hash`, `total_score`, `scored_at`

3. **`score_missing_tools`**
   * Primary key: `id` (UUID)
   * Foreign key: `score_id` → session_scores
   * `tool_name` (VARCHAR 255)
   * `rationale` (TEXT)
   * Indexes on: `score_id`, `tool_name`

4. **`score_alternative_approaches`**
   * Primary key: `id` (UUID)
   * Foreign key: `score_id` → session_scores
   * `name` (VARCHAR 255)
   * `description` (TEXT)
   * Indexes on: `score_id`, `name`

5. **`score_alternative_approach_steps`**
   * Primary key: `id` (UUID)
   * Foreign key: `approach_id` → score_alternative_approaches
   * `step_order` (INTEGER)
   * `step_description` (TEXT)
   * Index on: `approach_id`

**Alembic Migration:**

* Create forward migration with all tables and indexes
* Create rollback migration (drop tables in reverse dependency order)
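To make the intended shape concrete, a partial SQLModel sketch of two of these tables is shown below. Field names mirror the lists above; the referenced column name on `alert_sessions`, the exact column types, and the relationship configuration are assumptions left to the implementation.

```python
from datetime import datetime
from uuid import uuid4

from sqlalchemy import JSON, Column
from sqlmodel import Field, SQLModel

class SessionScoreDB(SQLModel, table=True):
    """Sketch of the session_scores table (not final implementation)."""
    __tablename__ = "session_scores"

    score_id: str = Field(default_factory=lambda: str(uuid4()), primary_key=True)
    session_id: str = Field(unique=True, index=True,
                            foreign_key="alert_sessions.session_id")  # assumed column name
    criteria_hash: str = Field(index=True,
                               foreign_key="scoring_criteria_definitions.criteria_hash")
    total_score: int = Field(ge=0, le=100, index=True)
    score_breakdown: dict = Field(sa_column=Column(JSON))  # JSONB on PostgreSQL
    score_reasoning: str
    scored_triggered_by: str = Field(max_length=255)
    scored_at: datetime = Field(index=True)

class MissingToolDB(SQLModel, table=True):
    """Sketch of the score_missing_tools table."""
    __tablename__ = "score_missing_tools"

    id: str = Field(default_factory=lambda: str(uuid4()), primary_key=True)
    score_id: str = Field(foreign_key="session_scores.score_id", index=True)
    tool_name: str = Field(max_length=255, index=True)
    rationale: str
```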
### Task 2: Data Models

**Database Models (SQLModel):**

* `ScoringCriteriaDefinitionDB` - Criteria storage
* `SessionScoreDB` - Main score record with relationships
* `MissingToolDB` - Missing tool entries
* `AlternativeApproachDB` - Alternative approach entries
* `AlternativeApproachStepDB` - Approach step entries

**API Models (Pydantic):**

* `SessionScore` - API response with nested objects
* `MissingTool` - Missing tool representation
* `AlternativeApproach` - Alternative approach with steps list
* `ScoringCriteriaDefinition` - Criteria definition response
* `JudgeOutputSchema` - For parsing LLM JSON responses

**Repository Layer:** (see the conversion sketch after this list)

* Convert database models (with relationships) to API models (with nested objects)
* Handle eager loading for efficiency
* Map `is_current_criteria` by comparing hash to current config
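The DB-to-API conversion could look roughly like this. It assumes the DB model exposes `missing_tools`, `alternative_approaches`, and `steps` relationships (eagerly loaded), that the API model fields mirror the response JSON shown earlier, and that the current criteria hash is supplied by the configuration layer; the helper name is illustrative.

```python
# Proposed models from this EP; none of these exist in the codebase yet.
from tarsy.models.scoring_models import (
    AlternativeApproach,
    MissingTool,
    SessionScore,
    SessionScoreDB,
)

def to_api_model(db_score: SessionScoreDB, current_criteria_hash: str) -> SessionScore:
    """Convert a SessionScoreDB row (with loaded relationships) into the API model."""
    return SessionScore(
        score_id=db_score.score_id,
        session_id=db_score.session_id,
        criteria_hash=db_score.criteria_hash,
        total_score=db_score.total_score,
        score_breakdown=db_score.score_breakdown,
        score_reasoning=db_score.score_reasoning,
        missing_tools=[
            MissingTool(tool_name=t.tool_name, rationale=t.rationale)
            for t in db_score.missing_tools
        ],
        alternative_approaches=[
            AlternativeApproach(
                name=a.name,
                description=a.description,
                steps=[s.step_description
                       for s in sorted(a.steps, key=lambda s: s.step_order)],
            )
            for a in db_score.alternative_approaches
        ],
        scored_triggered_by=db_score.scored_triggered_by,
        scored_at=db_score.scored_at,
        is_current_criteria=(db_score.criteria_hash == current_criteria_hash),
    )
```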
### Task 3: Scoring Service

**Core Logic (`backend/tarsy/services/scoring_service.py`):**

1. **Configuration Loading**:
   * Load `config/scoring_config.yaml` with template variable substitution
   * Compute SHA256 hash of full configuration (deterministic criteria versioning)
   * Store criteria definition in database if not exists

2. **Session Data Retrieval**:
   * Call History Service to get complete session conversation
   * Equivalent to `/final-analysis` endpoint data (~25k tokens)

3. **Judge Prompt Construction**:
   * Replace `{{SESSION_CONVERSATION}}` with full conversation
   * Replace `{{ALERT_DATA}}` with original alert
   * Replace `{{OUTPUT_SCHEMA}}` with JSON schema from `JudgeOutputSchema`

4. **LLM Integration**:
   * Send prompt to judge LLM (configurable provider/model)
   * Retry with exponential backoff (max 3 retries: 1s, 2s, 4s; see the sketch after this list)
   * Circuit breaker: open after 5 consecutive failures

5. **Response Parsing**:
   * Parse JSON response into `JudgeOutputSchema`
   * Validate total_score, score_breakdown, missing_tools, alternative_approaches

6. **Database Storage**:
   * Insert `session_scores` record
   * Insert `score_missing_tools` records (bulk)
   * Insert `score_alternative_approaches` and `score_alternative_approach_steps` (bulk)
   * Return populated `SessionScore` API model

7. **Error Handling**:
   * Invalid JSON: Log raw response, return 500 with details
   * Database failures: Retry once, then return 500
   * Missing session: Return 404
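The retry behaviour in step 4 could be sketched as follows. The delays match the EP (1s, 2s, 4s); the circuit-breaker state handling is omitted here, and `call_judge_llm` is a placeholder for the real LLM client call, so treat this purely as an illustration.

```python
import asyncio

MAX_RETRIES = 3
BACKOFF_SECONDS = [1, 2, 4]

async def call_with_retry(call_judge_llm, prompt: str) -> str:
    """Retry the judge LLM call with exponential backoff (sketch only)."""
    last_error: Exception | None = None
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus three retries
        try:
            return await call_judge_llm(prompt)
        except Exception as exc:  # real code would catch provider-specific errors
            last_error = exc
            if attempt < MAX_RETRIES:
                await asyncio.sleep(BACKOFF_SECONDS[attempt])
    raise RuntimeError("Judge LLM call failed after retries") from last_error
```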
### Task 4: API Controller

**Controller (`backend/tarsy/controllers/scoring_controller.py`):**

* `POST /api/v1/scoring/sessions/{session_id}/score`
  * Check if score exists (unless `force_rescore=true`)
  * Execute scoring via background task (FastAPI BackgroundTasks)
  * Return score immediately if exists, or start scoring and poll/wait

* `GET /api/v1/scoring/sessions/{session_id}/score`
  * Retrieve score from repository
  * Compute `is_current_criteria` by comparing hash
  * Return 404 if not scored

**User Attribution:**

* Extract user from `X-Forwarded-User` header (oauth2-proxy)
* Store in `scored_triggered_by` field of session_scores table for audit trail (see the controller sketch below)
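A skeleton of the POST endpoint, including the oauth2-proxy header extraction and background execution, might look like this. Only the header name and the endpoint path come from this EP; the router prefix, the dependency stub, and the `get_existing_score`/`score_session` service methods are assumptions, and the EP leaves open whether the endpoint waits for the result or returns immediately.

```python
from fastapi import APIRouter, BackgroundTasks, Depends, Request
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/scoring", tags=["scoring"])

class ScoreRequest(BaseModel):
    force_rescore: bool = False

def extract_author_from_request(request: Request) -> str:
    """User attribution via oauth2-proxy headers (same pattern as alert submission)."""
    return request.headers.get("X-Forwarded-User", "unknown")

def get_scoring_service():
    """Dependency stub; real wiring would return the scoring service instance."""
    raise NotImplementedError

@router.post("/sessions/{session_id}/score")
async def score_session(
    session_id: str,
    payload: ScoreRequest,
    request: Request,
    background_tasks: BackgroundTasks,
    scoring_service=Depends(get_scoring_service),
):
    triggered_by = extract_author_from_request(request)
    existing = await scoring_service.get_existing_score(session_id)  # assumed method
    if existing and not payload.force_rescore:
        return existing
    # Run the slow judge-LLM scoring outside the request/response cycle.
    background_tasks.add_task(scoring_service.score_session, session_id, triggered_by)
    return {"status": "scoring_started", "session_id": session_id}
```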
### Task 5: Configuration Loading

**Config Service Updates:**

* Load `scoring_config.yaml` similar to `agents.yaml` and `llm_providers.yaml`
* Template variable substitution (e.g., `${SCORING_LLM_PROVIDER:-${DEFAULT_LLM_PROVIDER}}`)
* SHA256 hashing logic: Hash full resolved YAML content for deterministic `criteria_hash`
* Store criteria definition on first use (idempotent)
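The `${VAR:-default}` syntax could be resolved with a small helper along these lines before hashing; nested defaults such as `${SCORING_LLM_PROVIDER:-${DEFAULT_LLM_PROVIDER}}` are handled by resolving the innermost placeholders first and repeating until nothing changes. This mirrors the existing agents.yaml pattern only in spirit; the actual implementation may differ.

```python
import os
import re

# ${VAR} or ${VAR:-default}; the default may not itself contain $, {, or },
# which forces innermost-first resolution of nested placeholders.
_VAR_PATTERN = re.compile(r"\$\{([A-Z0-9_]+)(?::-([^${}]*))?\}")

def substitute_template_vars(text: str) -> str:
    """Resolve ${VAR} / ${VAR:-default} placeholders against the environment (sketch)."""
    def _replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2) or ""
        return os.environ.get(name, default)

    previous = None
    while previous != text:  # repeat so nested defaults get resolved too
        previous = text
        text = _VAR_PATTERN.sub(_replace, text)
    return text
```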
### Task 6: Testing

**Unit Tests:**

* Model validation (JudgeOutputSchema, SessionScore)
* Configuration hashing (deterministic results)
* Prompt placeholder substitution
* JSON response parsing (valid and invalid cases)
* Repository conversion (DB models → API models)

**Integration Tests:**

* Full scoring flow: session retrieval → LLM call → database storage
* Mock LLM responses for consistency
* Re-scoring with `force_rescore=true`
* Error handling: missing session, invalid JSON, database failures

**Test Coverage Target:** 80% minimum for new components, 100% for critical paths (scoring logic, database operations)

---
## Key Design Decisions

1. **Content-Addressed Criteria**: SHA256 hash of full configuration eliminates manual version management and provides automatic obsolescence detection
2. **JSONB Flexibility vs Normalized Feedback**: Score dimensions in flexible JSONB (criteria evolve), but missing tools and alternatives in normalized tables (enables analytics)
3. **Output Schema in Code, Criteria in Config**: Separation of concerns - config defines *what* to evaluate, code defines *how* to report it; ensures parsing reliability
4. **Placeholder System**: Enables optimal prompt engineering (preamble → session data → refocusing instructions → schema)
5. **Two-Layer Model Architecture**: Database models (SQLModel with relationships) → Repository conversion → API models (Pydantic with nested objects)
6. **Non-Intrusive Operation**: Post-session scoring, zero impact on alert processing performance
7. **Background Task Execution**: Scoring runs asynchronously to avoid blocking API responses
8. **Manual Control**: Phase 1 is operator-triggered only; automation deferred to future phases for cost control

---
## Files to Modify

### Backend Models

* **`backend/tarsy/models/scoring_models.py`** (new) - All database and API models

### Backend Services & Repositories

* **`backend/tarsy/services/scoring_service.py`** (new) - Core scoring logic
* **`backend/tarsy/repositories/scoring_repository.py`** (new) - Database operations

### Backend Controllers

* **`backend/tarsy/controllers/scoring_controller.py`** (new) - REST API endpoints

### Configuration

* **`config/scoring_config.yaml`** (new) - Scoring configuration with judge prompt

### Database

* **`backend/alembic/versions/XXXX_add_scoring_tables.py`** (new) - Database migration

### Main Application

* **`backend/tarsy/main.py`** - Register scoring routes

### Backend Tests

* **`backend/tests/unit/test_scoring_service.py`** (new) - Unit tests
* **`backend/tests/integration/test_scoring_api.py`** (new) - Integration tests

### Dashboard Components (Phase 6)

* **`dashboard/src/components/scoring/ScoreDetailView.tsx`** (new) - Main score view
* **`dashboard/src/components/scoring/ScoreBreakdownCard.tsx`** (new) - Score breakdown display
* **`dashboard/src/components/scoring/MissingToolsCard.tsx`** (new) - Missing tools list
* **`dashboard/src/components/scoring/AlternativeApproachesCard.tsx`** (new) - Alternative approaches
* **`dashboard/src/components/scoring/ScoreBadge.tsx`** (new) - Reusable color-coded badge
* **`dashboard/src/components/HistoricalAlertsList.tsx`** - Add Score column
* **`dashboard/src/components/SessionDetailPageBase.tsx`** - Add Score toggle
* **`dashboard/src/types/`** - Add score-related TypeScript interfaces
* **`dashboard/src/api/`** - Add scoring API client functions

---
## Implementation Phases

### Phase 1: Database Schema & Models

**Goal:** Establish data structures and persistence layer

* [ ] Create Alembic migration for all scoring tables (criteria_definitions, session_scores, missing_tools, alternative_approaches, approach_steps)
* [ ] Implement database models (SQLModel): ScoringCriteriaDefinitionDB, SessionScoreDB, MissingToolDB, AlternativeApproachDB, AlternativeApproachStepDB
* [ ] Implement API models (Pydantic): SessionScore, MissingTool, AlternativeApproach, ScoringCriteriaDefinition, JudgeOutputSchema
* [ ] Create repository layer with DB → API model conversion logic
* [ ] Test database schema and model conversions

**Dependencies:** None (foundational work)

**Deliverables:**

* Database migration script
* Complete data model definitions
* Repository with conversion logic
* Unit tests for models

---

### Phase 2: Configuration & Hashing

**Goal:** Implement configuration loading and criteria versioning

* [ ] Implement config loading from `scoring_config.yaml`
* [ ] Add template variable substitution (follows existing pattern from `agents.yaml`)
* [ ] Implement SHA256 hashing logic for full configuration content
* [ ] Add criteria definition storage (idempotent insert)
* [ ] Test configuration loading and hash determinism

**Dependencies:** Phase 1 (needs ScoringCriteriaDefinitionDB model)

**Deliverables:**

* Configuration loading service
* Hashing implementation with tests
* Criteria storage logic

---

### Phase 3: Scoring Service & LLM Integration

**Goal:** Implement core scoring logic and judge LLM integration

* [ ] Implement session data retrieval from History Service
* [ ] Build judge prompt construction with placeholder substitution
* [ ] Integrate LLM client for judge calls
* [ ] Implement JSON response parsing (JudgeOutputSchema)
* [ ] Add retry logic with exponential backoff (1s, 2s, 4s)
* [ ] Implement circuit breaker pattern (open after 5 failures)
* [ ] Implement database storage logic (scores, missing tools, alternatives)
* [ ] Test end-to-end scoring flow with mocked LLM responses

**Dependencies:** Phases 1-2 (needs models and configuration)

**Deliverables:**

* Complete scoring service
* LLM integration with resilience patterns
* Integration tests with mocked LLM

---
### Phase 4: API Endpoints

**Goal:** Expose scoring functionality via REST API

* [ ] Implement scoring controller (`backend/tarsy/controllers/scoring_controller.py`)
* [ ] Add POST /score endpoint with background task execution
* [ ] Add GET /score endpoint for retrieval
* [ ] Implement error handling and validation
* [ ] Add user attribution from oauth2-proxy headers
* [ ] Register routes in main.py
* [ ] Test API endpoints (success and error cases)

**Dependencies:** Phase 3 (needs scoring service)

**Deliverables:**

* REST API endpoints
* Error handling and validation
* API integration tests

---

### Phase 5: Testing & Documentation

**Goal:** Comprehensive testing and documentation

* [ ] Complete unit test coverage for all components (target: 80% overall, 100% critical paths)
* [ ] Complete integration tests for full scoring flow
* [ ] Add API documentation (OpenAPI/Swagger via FastAPI)
* [ ] Update CLAUDE.md with scoring system overview (if needed)
* [ ] Create example scoring_config.yaml in repo
* [ ] Document scoring criteria evolution workflow

**Dependencies:** Phases 1-4 (all implementation complete)

**Deliverables:**

* Comprehensive test suite
* API documentation
* User-facing documentation
* Configuration examples

---
### Phase 6: UI Integration

**Goal:** Provide basic scoring visualization in TARSy dashboard

* [ ] Create reusable ScoreBadge component with 5-tier color coding
* [ ] Add Score column to HistoricalAlertsList with click navigation
* [ ] Implement ScoreDetailView with score breakdown, reasoning, missing tools, and alternatives
* [ ] Add Score toggle option to SessionDetailPageBase
* [ ] Implement "Score Session" button with API integration
* [ ] Add TypeScript interfaces for score data models
* [ ] Update dashboard API client with scoring endpoints
* [ ] Test UI with various score ranges and edge cases (unscored sessions, errors)

**Dependencies:** Phases 1-5 (requires backend API endpoints)

**Deliverables:**

* Score visualization in session list
* Comprehensive score detail view
* Manual scoring trigger UI
* TypeScript types and API integration

**UI Features:**

1. **Session List Enhancement:**
   * New "Score" column with color-coded badges
   * Color scheme (matches judge prompt scoring philosophy; see the tier sketch after this list):
     * 0-44: Red (failed investigation)
     * 45-59: Orange (weak investigation)
     * 60-74: Yellow (adequate investigation)
     * 75-89: Light Green (good investigation)
     * 90-100: Dark Green (near-perfect investigation)
   * "Not Scored" badge for sessions without scores
   * Clickable scores navigate to score detail view

2. **Score Detail Page:**
   * New "Score" toggle option (alongside Conversation/Technical)
   * Displays:
     * Alert detail (reuses OriginalAlertCard)
     * Final analysis (reuses FinalAnalysisCard)
     * Score breakdown card (new component)
     * Score reasoning card (new component)
     * Missing tools list (new component)
     * Alternative approaches with steps (new component)

3. **Manual Scoring Trigger:**
   * "Score Session" button in session detail page
   * Calls `POST /api/v1/scoring/sessions/{session_id}/score`
   * Loading state during scoring
   * Auto-refresh after completion
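The badge colour tiers above reduce to a simple threshold lookup; a Python rendition of the same ranges is shown below purely to pin down the boundaries (the dashboard itself would implement this in TypeScript inside ScoreBadge).

```python
def score_tier(total_score: int) -> str:
    """Map a 0-100 total score to the badge tier used in the session list."""
    if total_score >= 90:
        return "dark-green"   # near-perfect investigation
    if total_score >= 75:
        return "light-green"  # good investigation
    if total_score >= 60:
        return "yellow"       # adequate investigation
    if total_score >= 45:
        return "orange"       # weak investigation
    return "red"              # failed investigation
```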
**New Dashboard Components:**

* `dashboard/src/components/scoring/ScoreDetailView.tsx` - Main score view
* `dashboard/src/components/scoring/ScoreBreakdownCard.tsx` - Score breakdown display
* `dashboard/src/components/scoring/MissingToolsCard.tsx` - Missing tools list
* `dashboard/src/components/scoring/AlternativeApproachesCard.tsx` - Alternative approaches
* `dashboard/src/components/scoring/ScoreBadge.tsx` - Reusable color-coded badge

**Modified Dashboard Components:**

* `dashboard/src/components/HistoricalAlertsList.tsx` - Add Score column
* `dashboard/src/components/SessionDetailPageBase.tsx` - Add Score toggle
* `dashboard/src/types/` - Add score-related TypeScript interfaces
* `dashboard/src/api/` - Add scoring API client functions

---

## Next Steps

1. Review and approve Phase 1 design
2. Begin Phase 1 implementation (database schema and models)
3. Iterate through phases with testing at each stage
4. Deploy to staging for user acceptance testing
5. Implement Phase 6 UI integration after backend API is validated
6. Production deployment after validation
| ## Attachment: The Judge Prompt from the POC | ||
|
|
||
| ```yaml | ||
| You are a computer security expert specializing in DevOps and Kubernetes security operations. | ||
|
|
||
| Your role is to evaluate SRE investigations with EXTREME CRITICAL RIGOR. You are a methodology-focused perfectionist who: | ||
| - Demands optimal investigation paths, not just successful outcomes | ||
| - Penalizes ANY logical shortcuts, premature conclusions, or incomplete exploration | ||
| - Holds investigations to the highest professional standards | ||
| - Identifies what SHOULD HAVE been done, not just what WAS done | ||
|
|
||
| ## EVALUATION PHILOSOPHY | ||
|
|
||
| You prefer CRITICISM over PRAISE. Your default stance is skeptical. When an investigation reaches a conclusion: | ||
| 1. First ask: "Was ALL available evidence gathered?" | ||
| 2. Then ask: "Were ALL available tools explored?" | ||
| 3. Then ask: "Does the confidence level match the evidence quality?" | ||
| 4. Finally ask: "What alternative approaches would have been faster or more rigorous?" | ||
|
|
||
| ## SCORING FRAMEWORK | ||
|
|
||
| You will score investigations across 4 categories, each worth 25 points (100 total): | ||
|
|
||
| ### 1. LOGICAL FLOW (0-25 points) | ||
| **What to evaluate:** | ||
| - Did the investigation follow optimal reasoning paths? | ||
| - Were steps sequenced efficiently, or was there trial-and-error waste? | ||
| - Did the agent pivot at the right time, or give up prematurely/persist too long? | ||
| - Are there logical leaps or shortcuts that bypass valuable investigative steps? | ||
|
|
||
| **Deduct heavily for:** | ||
| - Jumping to conclusions without exhausting investigation paths | ||
| - Repeated failed attempts without strategy adjustment | ||
| - Premature abandonment when alternative approaches exist | ||
| - Not using information already available (e.g., timestamp in alert) | ||
| - Trial-and-error guessing instead of systematic discovery | ||
|
|
||
| **Typical score range:** 10-22 points. Award 23+ only for near-flawless investigation flow. | ||
|
|
||
| ### 2. CONSISTENCY (0-25 points) | ||
| **What to evaluate:** | ||
| - Do observations logically support conclusions? | ||
| - Is confidence level justified by evidence gathered? | ||
| - Are there contradictions between stated limitations and claimed certainty? | ||
| - Does the classification match the evidence severity? | ||
|
|
||
| **Deduct heavily for:** | ||
| - HIGH confidence with incomplete evidence gathering | ||
| - Claiming "malicious" without verifying execution vs. mere file presence | ||
| - Contradictions like "pod is terminated" + "zero evidence" → "high confidence MALICIOUS" | ||
| - Over-interpreting weak signals (e.g., dictionary word = software installation) | ||
| - Under-interpreting strong signals (e.g., dismissing repeated failures) | ||
|
|
||
| **Typical score range:** 15-22 points. Award 23+ only for ironclad logical consistency. | ||
|
|
||
| ### 3. TOOL RELEVANCE (0-25 points) | ||
| **What to evaluate:** | ||
| - Were the MOST appropriate tools selected for each investigation phase? | ||
| - Was tool failure handled by trying alternative tools, or by giving up? | ||
| - Were tools used efficiently (right parameters, right sequence)? | ||
|
|
||
| **Deduct heavily for:** | ||
| - Not attempting to access logs, files, processes or other relevant evidence when tools exist | ||
| - Guessing parameters instead of discovering correct values first | ||
| - Not checking historical data when live resources are unavailable | ||
|
|
||
| **Typical score range:** 10-22 points. Award 23+ only if tool selection was optimal AND comprehensive. | ||
|
|
||
| ### 4. SYNTHESIS QUALITY (0-25 points) | ||
| **What to evaluate:** | ||
| - Is the final analysis supported by DIRECT evidence, not just inference? | ||
| - Does the report acknowledge gaps and limitations appropriately? | ||
| - Are recommendations proportional to evidence strength? | ||
| - Does the synthesis integrate ALL gathered data, not just selected pieces? | ||
|
|
||
| **Deduct heavily for:** | ||
| - Conclusions based on circumstantial evidence when direct evidence was accessible | ||
| - Severe recommendations (BAN, MALICIOUS) without verification of actual execution/harm | ||
| - Not acknowledging critical investigation gaps | ||
| - Failing to consider benign alternative explanations | ||
| - Ignoring contradictory evidence | ||
|
|
||
| **Typical score range:** 8-20 points. Award 21+ only for evidence-rich, nuanced synthesis. | ||
|
|
||
| ## CRITICAL EVALUATION CHECKLIST | ||
|
|
||
| For each investigation, systematically check: | ||
|
|
||
| **Evidence Quality:** | ||
| - [ ] Was direct evidence gathered (logs, files, processes) or only metrics? | ||
| - [ ] Were files READ and verified, or only detected by filename? | ||
| - [ ] Were processes INSPECTED, or activity inferred from resource usage? | ||
| - [ ] Was execution CONFIRMED, or assumed from file presence? | ||
|
|
||
| **Tool Completeness:** | ||
| - [ ] Were forensic tools used? (list-files, read-file, grep-files) | ||
| - [ ] Were historical tools attempted? (terminated pod logs, event history) | ||
| - [ ] Were cross-reference tools used? (correlating network + CPU data) | ||
|
|
||
| **Logical Rigor:** | ||
| - [ ] Did each step build on previous findings, or was there random exploration? | ||
| - [ ] Were failed attempts analyzed to inform next steps? | ||
| - [ ] Was the investigation abandoned prematurely when alternatives existed? | ||
| - [ ] Were time windows adjusted appropriately based on alert timestamps? | ||
|
|
||
| **Confidence Calibration:** | ||
| - [ ] Does HIGH confidence have comprehensive verification? | ||
| - [ ] Does MEDIUM confidence acknowledge specific gaps? | ||
| - [ ] Are limitations explicitly stated when evidence is incomplete? | ||
|
|
||
| ## LANGUAGE PATTERNS TO USE | ||
|
|
||
| When critiquing, use these patterns: | ||
|
|
||
| **Identifying problems:** | ||
| - "However, there are significant logical issues..." | ||
| - "This represents a critical logical shortcut because..." | ||
| - "The agent should have immediately..." | ||
| - "A rigorous investigation would have..." | ||
| - "The conclusion jumps to X despite never actually..." | ||
| - "Tool selection is severely inadequate because..." | ||
|
|
||
| **Highlighting missed opportunities:** | ||
| - "The agent failed to use available tools such as..." | ||
| - "After identifying X, the agent should have..." | ||
| - "The agent never attempted to..." | ||
| - "A more logical approach would have been..." | ||
|
|
||
| **Pointing out inconsistencies:** | ||
| - "There's a significant inconsistency: the agent reports X but concludes Y" | ||
| - "The agent paradoxically expresses HIGH confidence despite..." | ||
| - "This contradicts the earlier observation that..." | ||
|
|
||
| **Avoid excessive praise:** | ||
| - Don't use "excellent" or "flawless" unless truly warranted (rare) | ||
| - Replace "good" with "reasonable" or "adequate" | ||
| - Temper positives: "While the tool selection was appropriate initially, however..." | ||
|
|
||
| ## IDENTIFYING MISSING TOOLS | ||
|
|
||
| After reviewing the investigation, identify tools that SHOULD have been used but WEREN'T. | ||
|
|
||
| **What qualifies as a "missing tool"?** | ||
|
|
||
| Include tools that would have: | ||
| - Provided DIRECT evidence instead of circumstantial inference | ||
| - Enabled verification of assumptions that were left unverified | ||
| - Revealed information that was guessed or inferred | ||
| - Made the investigation more efficient or systematic | ||
| - Eliminated ambiguity in the findings | ||
|
|
||
| **DO NOT include:** | ||
| - Nice-to-have tools that wouldn't significantly change the analysis | ||
| - Tools that are redundant with what was already done | ||
| - Tools for information that was already conclusively obtained another way | ||
|
|
||
| **For each missing tool, provide:** | ||
| - **tool_name**: Specific tool that should have been used (e.g., "read-file", "kubectl-logs", "list-processes") | ||
| - **rationale**: Explain what evidence it would have provided and why it was needed. Be specific about what gap this would have filled. | ||
|
|
||
| ## CONSTRUCTING ALTERNATIVE APPROACHES | ||
|
|
||
| Provide 1-3 alternative investigation paths that would have been MORE EFFICIENT or MORE RIGOROUS than what was actually done. | ||
|
|
||
| **What makes a good alternative approach?** | ||
|
|
||
| An alternative approach should: | ||
| - Be more EFFICIENT (fewer steps to reach same or better conclusion) | ||
| - Be more RIGOROUS (gather more direct evidence) | ||
| - Address specific gaps you identified in the actual investigation | ||
| - Be realistic given the available tools shown in the session | ||
|
|
||
| **Structure each alternative as:** | ||
| - **name**: Brief, descriptive title (e.g., "File-First Forensic Approach", "Systematic Resource Analysis") | ||
| - **description**: 1-2 sentences explaining the strategy and why it's better than what was done | ||
| - **steps**: Ordered list of specific actions with tool names and expected outcomes | ||
|
|
||
| **Guidelines for steps:** | ||
| - Each step should be specific and actionable | ||
| - Include actual tool names you saw available in the session (not generic "check logs") | ||
| - Show logical progression: discovery → verification → conclusion | ||
| - Demonstrate how this avoids the problems you criticized | ||
| - Each step should build on the previous step's findings | ||
|
|
||
| **How many alternatives to provide:** | ||
| - Provide 0-3 alternative approaches | ||
| - Only include alternatives that are genuinely BETTER than what was done | ||
| - If the investigation was reasonably optimal, provide fewer alternatives (or even zero) | ||
| - Each alternative should address different weaknesses (e.g., one for efficiency, one for rigor) | ||
|
|
||
| ## SESSION DATA | ||
|
|
||
| Below is the complete conversation from the alert analysis session. | ||
| The conversation includes all MCP tool interactions and their results. | ||
|
|
||
| {{SESSION_CONVERSATION}} | ||
|
|
||
| Original Alert: | ||
| {{ALERT_DATA}} | ||
|
|
||
| ## YOUR TASK NOW | ||
|
|
||
| You have now seen the complete investigation from start to finish. | ||
|
|
||
| Provide your critical evaluation following the methodology-focused framework given at the start. | ||
|
|
||
| **BEFORE YOU SCORE, ASK YOURSELF:** | ||
|
|
||
| 1. **Evidence Gathering**: Did they gather DIRECT evidence (read files, check logs, inspect processes) or just rely on metrics and alerts? | ||
|
|
||
| 2. **Tool Completeness**: List ALL tools they COULD have used but DIDN'T. For each unused tool, deduct points. | ||
|
|
||
| 3. **Logical Shortcuts**: Identify ANY place where they: | ||
| - Jumped to a conclusion without verification | ||
| - Gave up when alternatives existed | ||
| - Repeated failed attempts without pivoting | ||
| - Used trial-and-error instead of systematic discovery | ||
|
|
||
| 4. **Confidence vs Evidence**: Does their confidence level (HIGH/MEDIUM/LOW) match the evidence they actually gathered? If they claim HIGH confidence with gaps in investigation, deduct heavily. | ||
|
|
||
| 5. **Efficiency**: Could they have reached the same conclusion faster with better tool selection or sequencing? | ||
|
|
||
| **SCORING CALIBRATION REMINDER:** | ||
|
|
||
| - If you're scoring above 70, you're being too lenient. Re-examine for missed opportunities. | ||
| - If they didn't exhaust investigation paths, score should be ≤ 60 | ||
| - If they made logical leaps without evidence, score should be ≤ 55 | ||
| - If tool usage was incomplete, deduct 5-10 points from tool_relevance | ||
| - If confidence doesn't match evidence, deduct 5-10 points from consistency | ||
|
|
||
| **SCORING PHILOSOPHY:** | ||
|
|
||
| Your average score should be 55-75 out of 100. This reflects professional standards where: | ||
| - 90-100: Near-perfect investigation (extremely rare - reserve for exemplary cases) | ||
| - 75-89: Good investigation with minor issues | ||
| - 60-74: Adequate investigation with notable gaps | ||
| - 45-59: Weak investigation with major methodology problems | ||
| - 0-44: Failed investigation (reserve for truly incomplete work) | ||
|
|
||
| Remember: If you're tempted to give a high score, ask yourself: | ||
| - "Is any evidence missing?" | ||
| - "Is there ANY more efficient approach?" | ||
| - "Is confidence level FULLY justified by direct evidence?" | ||
| - "Could the investigation have been more thorough?" | ||
|
|
||
| If the answer to any is "yes," deduct more points. | ||
|
|
||
| **CRITICAL REMINDERS:** | ||
|
|
||
| 1. Process > Outcome: Reaching the right conclusion via inefficient/incomplete methods still deserves criticism | ||
| 2. Direct > Circumstantial: Correlation is not verification. Demand direct evidence. | ||
| 3. Explore > Conclude: Premature conclusion is worse than over-investigation | ||
| 4. Evidence > Confidence: High confidence requires comprehensive evidence gathering | ||
|
|
||
| **Your evaluation must:** | ||
| - Be at least 200 words in the reasoning section | ||
| - Use critical language ("however," "failed to," "should have," "never attempted") | ||
| - Explain point deductions explicitly for each category | ||
|
|
||
| **MISSING TOOLS REMINDER:** | ||
| - Review the "IDENTIFYING MISSING TOOLS" section above | ||
| - List ONLY tools that would have provided direct evidence or filled critical gaps | ||
| - For each tool: provide specific tool name and explain what evidence gap it would have filled | ||
| - If no critical tools are missing, provide an empty array | ||
|
|
||
| **ALTERNATIVE APPROACHES REMINDER:** | ||
| - Review the "CONSTRUCTING ALTERNATIVE APPROACHES" section above | ||
| - Provide 1-3 alternatives ONLY if they would be genuinely more efficient or rigorous | ||
| - Each alternative must include: name, description, and specific ordered steps with tool names | ||
| - Show how each alternative addresses specific weaknesses you criticized | ||
| - If the investigation was reasonably optimal, provide fewer alternatives (or zero) | ||
|
|
||
| ## OUTPUT FORMAT | ||
|
|
||
| {{OUTPUT_SCHEMA}} |
Add explicit secret management and LLM API key handling strategy to Design section.
The document lacks critical security guidance for LLM provider credentials:
- How are LLM API keys managed (vault, environment variables, temporary tokens)?
- Is key rotation supported and documented?
- How are API keys protected from logs, error messages, and debugging output?
- What are the operational controls (audit logging, least-privilege IAM, CI secret scanning)?
Per learnings, security posture gaps degrade guarantees. Recommend adding a new Security subsection to the Design section (once reorganized) with:
## Security
### LLM API Key Management
- **Credential Storage**: LLM provider credentials (API keys) **must** be stored in a managed secrets store (e.g., HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets) with IAM-restricted access. Never embed keys in configuration files, environment variables (except short-lived CI tokens), or application code.
- **Key Rotation**: Support periodic key rotation with zero-downtime reloading. Document rotation procedures and testing.
- **Logging & Error Handling**:
- Never log full prompts if they contain sensitive data or PII
- Redact or omit API keys and credentials from all error responses
- Log only truncated/sanitized prompt snippets and non-sensitive metadata
- Ensure failed LLM request logging strips credentials and sensitive user inputs
- **Operational Controls**:
- Enable audit logging for all secret access (who accessed the key, when, from where)
- Use service accounts with least-privilege IAM roles
- Implement CI/CD checks to prevent accidental secret exposure in commits
- Regularly scan dependencies for known vulnerabilities in LLM client libraries
| # EP-0028 Phase 1: Alert Session Scoring API | ||
|
|
||
| **Status:** Pending | ||
| **Created:** 2025-12-05 | ||
| **Phase:** Design | ||
|
|
||
| --- | ||
|
|
||
| ## Overview | ||
|
|
||
| Add systematic quality assessment for TARSy alert analysis sessions through an LLM-based judge that critically evaluates investigation methodology and provides actionable feedback. | ||
|
|
||
| **Core Capabilities:** | ||
|
|
||
| 1. **LLM-Based Scoring**: Judge LLM evaluates completed sessions using rigorous scoring framework | ||
| 2. **Critical Evaluation**: Methodology-focused assessment with strict standards (4 dimensions: logical flow, consistency, tool relevance, synthesis quality) | ||
| 3. **Actionable Feedback**: | ||
| * Identify missing MCP tools that should have been used | ||
| * Suggest alternative investigation approaches with detailed steps | ||
| * Provide improvement recommendations | ||
| 4. **Criteria Evolution**: Content-addressed configuration (SHA256 hash) tracks scoring criteria changes over time (see the hashing sketch after this list) | ||
| 5. **Quality Metrics**: Quantitative scores (0-100) enable tracking improvements and identifying patterns | ||
| 6. **Non-Intrusive**: Post-session scoring with zero impact on alert processing | ||
| 7. **Manual Control**: Operator-triggered scoring only (Phase 1 scope) | ||
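To make capability 4 concrete, the criteria hash can be computed directly from the hardcoded judge prompts; a minimal sketch (the constant names follow the EP, `compute_criteria_hash` is a hypothetical helper):

```python
import hashlib

# Hardcoded Phase 1 prompts; the full text lives with the prompt constants.
JUDGE_PROMPT_SCORE = "You are a computer security expert..."            # truncated here
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "Identify tools that SHOULD..."   # truncated here


def compute_criteria_hash(*prompts: str) -> str:
    """Content-address the scoring criteria: any prompt edit produces a new hash."""
    digest = hashlib.sha256()
    for prompt in prompts:
        digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()  # 64 hex characters


CRITERIA_HASH = compute_criteria_hash(JUDGE_PROMPT_SCORE, JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS)
```

Storing this hash alongside every score keeps old scores attributable to the exact criteria that produced them.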
|
|
||
| **Primary Use Cases:** | ||
|
|
||
| * Agent development feedback and improvement tracking | ||
| * MCP server prioritization based on gap analysis | ||
| * Training data curation for high-quality sessions | ||
| * Investigation methodology pattern discovery | ||
|
|
||
| **Scope:** On-demand scoring API for individual sessions via REST endpoints and basic UI integration. Scheduled batch scoring, analytics aggregation and reporting are left for the future. | ||
|
|
||
| **POC Reference:** Initial prototype at <https://github.com/metlos/tarsy-response-score> proved that the judge LLM provides critical scoring even for self-produced analyses. |
Missing explicit "Problem Statement" section per EP guidelines.
The document lacks a formal "Problem Statement" section after metadata (line 6) and before "Overview." Per learnings and coding guidelines, EP documents must include a clear Problem Statement that briefly describes what needs to be solved.
Insert a new "Problem Statement" section between the metadata separator (line 7) and the Overview (line 9). Suggested placement and content:
---
+## Problem Statement
+
+TARSy cannot systematically evaluate alert analysis session quality. Manual quality assessment capability is needed to:
+- Critically evaluate individual alert analysis sessions across multiple dimensions
+- Provide structured feedback on investigation methodology
+- Identify missing investigation tools and gaps
+- Enable future aggregation and trend analysis on quality metrics
+
## Overview
Keep this section brief and focused on what needs to be solved (not why), as per EP document guidelines.
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 1 to
35, add a new "Problem Statement" section between the metadata separator (after
line 7) and the "Overview" (before line 9): create a short, focused paragraph
that states what must be solved (e.g., need for an on-demand, non-intrusive,
operator-triggered API to assess the methodological quality of completed TARSy
alert investigation sessions using an LLM judge and produce quantitative scores
and actionable feedback), keeping it strictly about the problem (not rationale
or implementation) and consistent with EP guidelines.
Actionable comments posted: 2
♻️ Duplicate comments (2)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (2)
1-35: Problem Statement section marked as addressed but still missing.
Past review comments indicate a formal Problem Statement section was added (commit b78f203) between metadata and Overview, but the current document shows Overview starting immediately at line 9 without a dedicated Problem Statement section. Either:
- The commit was not applied to this branch, or
- The fix was reverted in a later commit.
Per learnings and EP guidelines, a formal Problem Statement section must be present after the metadata separator (line 7) and before the Overview (line 9). Restore this section to match the marked completion.
🔎 Restore the missing Problem Statement section
---
+## Problem Statement
+
+TARSy cannot systematically evaluate alert analysis session quality. Manual quality assessment capability is needed to:
+- Critically evaluate individual alert analysis sessions across multiple dimensions
+- Provide structured feedback on investigation methodology
+- Identify missing investigation tools and gaps
+- Enable future aggregation and trend analysis on quality metrics
+
 ## Overview
36-142: Design section is not explicitly consolidated; content remains scattered.
Past review comments indicate Design and Implementation Plan sections should be formally reorganized (commit 835d202), with a single consolidated Design section containing System Architecture, Data Design, API Design, Configuration, Security, Performance, Error Handling, and Monitoring subsections. The current document has:
- Configuration (line 36)
- API Usage (line 67)
- Key Design Decisions (line 123)
But lacks a cohesive "## Design" section grouping these. The structure should be:
- Problem Statement
- Overview
- Goals (if present)
- Design ← Single section with subsections for Configuration, API Design, Data Models, Security, etc.
- Implementation Plan ← Currently at line 139, which is correct placement
- Success Criteria
- Status & Tracking
Reorganize to consolidate design-related subsections under a single "## Design" header.
🔎 Proposed consolidation structure
 ## Problem Statement
 [...]
 ## Overview
 [...]
+## Design
+
+### Configuration
 [move lines 36-65 content here]
-## API Usage
+### API Usage
 [move lines 67-120 content here]
-## Key Design Decisions
+### Key Design Decisions
 [move lines 123-137 content here]
 ## Implementation Plan
 [keep as top-level]
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧠 Learnings (19)
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Design section with detailed technical design focusing on HOW to implement the solution with clean, practical design
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Enhancement Proposal documents should be written as focused, technical, and actionable implementation guides for AI consumption
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include detailed explanations of WHY (keep it brief) in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include irrelevant context or noise in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Problem Statement section with a brief description of what needs to be solved
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: Fix obvious errors or typos in EP design documents (e.g., typos in code snippets, incorrect function names, syntax errors) and explicitly highlight the correction
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: When analyzing EP documents, be thorough and check all referenced files, be specific with citations (line numbers, file paths, exact quotes), distinguish between intentional improvements and genuine misalignments, and provide actionable recommendations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: When EP references misaligned patterns (functions that have been renamed, uses deprecated patterns, or conflicts with existing architecture), fix the issue and document the change if clear, or stop and ask for clarification if unclear
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include project management sections for humans in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Create Enhancement Proposal (EP) documents in `/docs/enhancements/pending` following the EP-xxxx naming convention
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Use configuration over code for new agents when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Applies to config/agents.yaml : Define agents in YAML configuration using template variable support `${VAR}` instead of hardcoding when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~173-~173: Use a hyphen to join words.
Context: ...or is_current_criteria * Database to API model mapping * [ ] Test database s...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
472-472: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
708-708: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
139-442: Implementation Plan structure is solid and actionable.
The six-phase implementation roadmap (lines 139–442) is well-organized with clear dependencies, deliverables, and task breakdowns. Each phase is concrete and implementable:
- Phase 1 establishes the database layer and models
- Phase 2 hardcodes prompts and hashing logic
- Phase 3 implements scoring service and LLM integration (multi-turn conversation)
- Phase 4 exposes REST API endpoints
- Phase 5 covers testing and documentation
- Phase 6 adds basic UI integration
Dependencies are explicitly tracked, enabling parallel work where appropriate. Deliverables are clear and testable. This aligns well with EP guidelines requiring step-by-step phases for complex implementations.
alexeykazakov
left a comment
Looks good overall! I reviewed it up to Phase 3 and left some comments.
Phase 3 (the actual agent implementation) requires more details, but let's leave it for later. We can proceed with the overall design and Phases 1 & 2 and then discuss Phase 3 separately. Maybe in a separate document. I would like to follow the existing agent patterns as much as possible (re-use the BaseAgent and LLMClient, which already provide a lot of built-in functionality like automatic retries, observability, etc). But again, we can discuss it separately.
| **Error Responses:** | ||
|
|
||
| * `400 Bad Request` - Session not completed or invalid state | ||
| * `500 Internal Server Error` - LLM API failure or database error |
It can also return:
- 401 - if not authenticated
- 404 - session not found
| } | ||
| ``` | ||
|
|
||
| **Response (200 OK):** |
We can't return the actual scoring to this POST request. It will take some time to process the request.
We should return 202 Accepted with the score_id only instead. The client should poll GET /api/v1/scoring/sessions/{session_id}/score to get the actual result. We need to keep in mind that if we are re-scoring then GET will return the old score until re-scoring is complete. So the client should either compare the score ID from POST with the score ID from GET or we should introduce an endpoint to get the score by score ID in addition to GET the score by session ID.
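A minimal FastAPI sketch of that 202 flow (the endpoint path follows the EP; `run_scoring` and the inline guards are placeholders for whatever the scoring service ends up exposing):

```python
import uuid

from fastapi import APIRouter, BackgroundTasks, status

router = APIRouter(prefix="/api/v1/scoring")


@router.post("/sessions/{session_id}/score", status_code=status.HTTP_202_ACCEPTED)
async def trigger_scoring(session_id: str, background_tasks: BackgroundTasks):
    # A real implementation would 404 for unknown sessions, 400 for sessions
    # that are not completed, and create the pending score row before returning.
    score_id = str(uuid.uuid4())
    background_tasks.add_task(run_scoring, score_id, session_id)  # hypothetical worker
    return {"score_id": score_id, "session_id": session_id, "status": "pending"}


async def run_scoring(score_id: str, session_id: str) -> None:
    """Background job: call the judge LLM, then persist the completed/failed score row."""
    ...
```

The client then polls `GET /api/v1/scoring/sessions/{session_id}/score` and compares the returned score ID against the one from the POST (or a get-by-score-ID endpoint is added), exactly as described above.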
|
|
||
| 1. **Multi-Turn Conversation**: Single conversation context enables LLM to reference its own scoring when analyzing missing tools; reduces repetitive context loading | ||
| 2. **Freeform Text Storage**: Reduces LLM parsing errors compared to strict JSON schemas; enables natural language analysis and flexibility as criteria evolve | ||
| 3. **Number-on-Last-Line Pattern**: Simple, reliable score extraction using regex (`r'(\d+)\s*$'`) without JSON parsing complexity |
We can keep it this way for now. Later we can improve the score number thing. If the score is not detected then we can send a separate score-number-only request:
- Ask LLM to generate the score document with the score number in it.
- If no score number detected: Ask LLM to generate the score number only for the score document.
- Repeat step 2 ^^^ until we get the score number.
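Sketched in Python, that extraction-plus-follow-up loop might look like this (the regex is the one from the EP; `llm_generate` and the bounded retry cap are assumptions, since the comment above suggests repeating until a number appears):

```python
import re

SCORE_PATTERN = re.compile(r"(\d+)\s*$")  # number on the last line, per the EP


def extract_score(text: str) -> int | None:
    match = SCORE_PATTERN.search(text.strip())
    return int(match.group(1)) if match else None


async def score_with_retry(llm_generate, score_document: str, max_attempts: int = 3) -> int:
    """If the score number is missing, ask for the number only, based on the same document."""
    score = extract_score(score_document)
    attempts = 0
    while score is None and attempts < max_attempts:
        reply = await llm_generate(
            "Return ONLY the final 0-100 score for the following evaluation:\n\n" + score_document
        )
        score = extract_score(reply)
        attempts += 1
    if score is None:
        raise ValueError("Judge response did not contain a score")
    return score
```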
| * Columns: | ||
| * `score_id` (UUID, primary key) | ||
| * `session_id` (UUID, unique constraint, foreign key → alert_sessions) | ||
| * `criteria_hash` (VARCHAR 64) - SHA256 hash of hardcoded judge prompts |
We will have two prompts: one for the main score analysis and one for the missing tools analysis. But in the model there is only one hash.
I would also rename the field(s) so it's clear what they're about.
Something like: score_analysis_prompt_hash and missing_tools_analysis_prompt_hash
Or judge_prompt_hash and missing_tools_prompt_hash or something like this.
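With that split, the score model would carry one hash per prompt; a minimal SQLModel-style sketch (class and field names follow the suggestion above and are not final):

```python
from sqlmodel import Field, SQLModel


class SessionScore(SQLModel, table=True):
    """Subset of the Phase 1 score table relevant to criteria versioning."""

    score_id: str = Field(primary_key=True)
    session_id: str = Field(foreign_key="alert_sessions.session_id", unique=True)
    judge_prompt_hash: str = Field(max_length=64, index=True)          # SHA256 of the score prompt
    missing_tools_prompt_hash: str = Field(max_length=64, index=True)  # SHA256 of the follow-up prompt
```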
| * Constants: | ||
| * `JUDGE_PROMPT_SCORE` - Main scoring evaluation prompt (see Attachment section) | ||
| * `JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS` - Missing tools analysis prompt |
Let's keep the prompts and the methods to build them in the prompts package: https://github.com/codeready-toolchain/tarsy-bot/tree/master/backend/tarsy/agents/prompts
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md:
- Around line 343-344: Remove the duplicate bullet in the Phase 3 LLM
integration list: delete the redundant line that repeats "Support conversation
history across turns" (the duplicate of the previous bullet), leaving only a
single instance and ensuring the list now contains "Support conversation history
across turns" and "Use LLM client with conversation history support" once each.
- Around line 324-340: The turn numbering is inconsistent: you list only Turn 1
(Score Prompt) and Turn 2 (Missing Tools Prompt) but later refer to "turn 3"
when storing missing_tools_analysis; update the text so the store step
references "turn 2" (i.e., change "from turn 3" to "from turn 2") OR add a new
"Turn 3" subsection describing the additional follow-up if you intend three
turns; ensure the symbol missing_tools_analysis is documented under the correct
turn label to keep the flow consistent.
- Around line 238-243: Rename the grammatically awkward field
`scored_triggered_by` to `score_triggered_by` everywhere: update the schema
definition (replace `scored_triggered_by` with `score_triggered_by`), update the
JSON response example that references `scored_triggered_by`, update the Phase 1
deliverable reference, and update the implementation note; ensure consistency in
related names (e.g., keep `score_id`, `score_analysis` style) and adjust any
explanatory text or references to the old name so all four occurrences are
replaced.
- Around line 404-414: The Phase 4 checklist is ambiguous about immediate async
return for POST /api/v1/scoring/sessions/{session_id}/score; update the
implementation bullets to explicitly split the two flows: (1) if a score already
exists and force_rescore is false, return the existing score immediately with
full data, and (2) if no score exists or force_rescore is true, enqueue the
scoring work via FastAPI BackgroundTasks and return immediately with a response
indicating status: "pending" (and any job/session id), not waiting for
background completion; make the wording align with the response schema in the
earlier section and reference the endpoint path POST
/api/v1/scoring/sessions/{session_id}/score and the request body
{"force_rescore": false}.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧠 Learnings (19)
📓 Common learnings
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Design section with detailed technical design focusing on HOW to implement the solution with clean, practical design
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include detailed explanations of WHY (keep it brief) in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include irrelevant context or noise in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Problem Statement section with a brief description of what needs to be solved
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Enhancement Proposal documents should be written as focused, technical, and actionable implementation guides for AI consumption
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: Fix obvious errors or typos in EP design documents (e.g., typos in code snippets, incorrect function names, syntax errors) and explicitly highlight the correction
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: When analyzing EP documents, be thorough and check all referenced files, be specific with citations (line numbers, file paths, exact quotes), distinguish between intentional improvements and genuine misalignments, and provide actionable recommendations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: When EP references misaligned patterns (functions that have been renamed, uses deprecated patterns, or conflicts with existing architecture), fix the issue and document the change if clear, or stop and ask for clarification if unclear
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include project management sections for humans in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Create Enhancement Proposal (EP) documents in `/docs/enhancements/pending` following the EP-xxxx naming convention
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Use configuration over code for new agents when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Applies to config/agents.yaml : Define agents in YAML configuration using template variable support `${VAR}` instead of hardcoding when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~268-~268: Use a hyphen to join words.
Context: ...or is_current_criteria * Database to API model mapping * [ ] Test database s...
(QB_NEW_EN_HYPHEN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (5)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (5)
229-248: Approve database schema design with comprehensive indexing strategy.
The Phase 1 database schema is well-designed with:
- Clear field definitions with proper types and nullability constraints
- Comprehensive indexes supporting common query patterns (session_id, status, criteria_hash, total_score)
- Unique partial constraint preventing duplicate in-progress scorings
- Explicit async state tracking fields (status, started_at_us, completed_at_us, error_message)
- Proper foreign key to alert_sessions table
The schema supports the designed async workflow and concurrent scoring scenarios effectively.
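The "unique partial constraint preventing duplicate in-progress scorings" mentioned above can be expressed as a partial unique index; a sketch in SQLAlchemy terms (index name illustrative, declared via the score model's `__table_args__`):

```python
from sqlalchemy import Index, text

# At most one pending/in_progress scoring row per session; completed and failed
# rows are not restricted, so historical scores can accumulate.
ACTIVE_SCORE_UNIQUE = Index(
    "ux_session_scores_active",
    "session_id",
    unique=True,
    postgresql_where=text("status IN ('pending', 'in_progress')"),
)
```

For SQLite-backed tests the equivalent `sqlite_where` argument would be needed as well.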
185-204: Excellent design decisions documentation.
The 15 design decisions (lines 185-204) are clearly articulated and well-reasoned. They effectively balance:
- Simplicity (hardcoded prompts for Phase 1, freeform text storage, number-on-last-line pattern)
- Auditability (SHA256 criteria versioning, user attribution, status tracking)
- Async resilience (background task execution, state transitions, orphan cleanup)
- Extensibility (explicit mention that automation and concurrency limits are deferred to future phases)
This demonstrates thoughtful architectural planning with clear acknowledgment of Phase 1 scope constraints.
563-799: Judge prompt design is thorough and operationally sound.
The scoring framework (lines 567-799) demonstrates:
- Clear 4-dimensional scoring rubric (Logical Flow, Consistency, Tool Relevance, Synthesis Quality)
- Specific deduction criteria with realistic score ranges (typical 10-22 points per category)
- Self-correction mechanisms ("if you're scoring above 70, you're being too lenient")
- Operational guidance on score calibration (90-100 rare, typical 55-75)
- Comprehensive evaluation checklist with evidence quality, tool completeness, logical rigor
The prompt emphasizes methodology over outcome and direct evidence over inference, which aligns with the stated goal of rigorous agent development feedback.
290-301: File path pattern is correctly specified and aligns with existing codebase conventions.
The specified path `backend/tarsy/services/scoring_service.py` matches the existing directory structure and service naming conventions used throughout the codebase. Services like `data_masking_service.py` already follow this `*_service.py` pattern in `backend/tarsy/services/`.
342-345: LLM client infrastructure and conversation history support are verified; placeholder substitution implementation is pending.
The LLMClient with multi-turn conversation history support exists in the codebase (`backend/tarsy/integrations/llm/client.py`, `LLMManager.generate_response()`) and can be configured with default LLM settings via `settings.llm_provider`. LLMConversation model supports multi-turn dialogue with message persistence across turns, and the History Service is available for session conversation retrieval.
backend/tarsy/integrations/llm/client.py,LLMManager.generate_response()) and can be configured with default LLM settings viasettings.llm_provider. LLMConversation model supports multi-turn dialogue with message persistence across turns, and the History Service is available for session conversation retrieval.However, the specific placeholder substitution pattern described in lines 324–340 ({{SESSION_CONVERSATION}}, {{ALERT_DATA}}, {{OUTPUT_SCHEMA}}) and the associated prompt constants (JUDGE_PROMPT_SCORE, JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) are not yet implemented in the codebase. These remain TODO items in the EP's Phase 1 implementation checklist.
| - No request timeouts on long-running LLM operations | ||
| - Real-time progress feedback via WebSocket events | ||
| - Consistent UX with alert processing workflow | ||
| - Reuses existing event infrastructure (no new systems needed) |
🧩 Analysis chain
🏁 Script executed:
# First, let's read the EP document to understand the context
cat -n docs/enhancements/pending/EP-0028-phase-1-scoring-api.md | sed -n '40,180p'
Repository: codeready-toolchain/tarsy-bot
Length of output: 6825
🏁 Script executed:
# Search for WebSocket/event infrastructure in the backend
find backend -type f -name "*.py" | head -20
Repository: codeready-toolchain/tarsy-bot
Length of output: 1769
🏁 Script executed:
# Look for event-related classes and functions
rg -n "class.*Event|def.*broadcast|def.*emit|def.*publish" backend/ --type py -A 1 | head -50Repository: codeready-toolchain/tarsy-bot
Length of output: 4379
🏁 Script executed:
# Search for channel patterns
rg -n "channel|session:" backend/ --type py -i | head -30Repository: codeready-toolchain/tarsy-bot
Length of output: 3556
🏁 Script executed:
# Search for broadcast_to_channel usage and event publishing
rg -n "broadcast_to_channel|publish.*event|emit.*event" backend/ --type py -B 2 -A 2 | head -100Repository: codeready-toolchain/tarsy-bot
Length of output: 8626
🏁 Script executed:
# Look for specific channel names like "sessions" or "session:"
rg -n '"sessions"|\'sessions\'|"session:|\$.*session' backend/ --type py | head -40Repository: codeready-toolchain/tarsy-bot
Length of output: 141
🏁 Script executed:
# Check the websocket_connection_manager.py file more thoroughly
cat -n backend/tarsy/services/websocket_connection_manager.py
Repository: codeready-toolchain/tarsy-bot
Length of output: 5140
🏁 Script executed:
# Search for event payload/data structure definitions
rg -n "class.*Event|type.*timestamp|score_id|session_id" backend/ --type py | head -50Repository: codeready-toolchain/tarsy-bot
Length of output: 4651
🏁 Script executed:
# Search for scoring event implementations
rg -n "scoring\.(started|progress|completed|failed)|score_id|ScoringEvent" backend/ --type py | head -50Repository: codeready-toolchain/tarsy-bot
Length of output: 55
🏁 Script executed:
# Look for scoring service or scoring-related event publishing
rg -n "scoring|score" backend/tarsy/services/ --type py -l | head -20Repository: codeready-toolchain/tarsy-bot
Length of output: 55
🏁 Script executed:
# Check if there's a scoring service that might publish these events
find backend -name "*scoring*" -type fRepository: codeready-toolchain/tarsy-bot
Length of output: 55
🏁 Script executed:
# Look at event-related files to understand event structure patterns
rg -n "type.*timestamp|timestamp_us|event.*payload" backend/ --type py | head -30Repository: codeready-toolchain/tarsy-bot
Length of output: 3350
🏁 Script executed:
# Look at event publisher implementation to understand event structure
cat -n backend/tarsy/services/events/publisher.py | head -100
Repository: codeready-toolchain/tarsy-bot
Length of output: 4522
🏁 Script executed:
# Search for existing event class definitions
rg -n "class.*Event.*:" backend/tarsy --type py | head -40Repository: codeready-toolchain/tarsy-bot
Length of output: 3076
🏁 Script executed:
# Check the event models and payload structure
find backend -path "*/events/*" -name "*.py" -type f | head -20
Repository: codeready-toolchain/tarsy-bot
Length of output: 509
🏁 Script executed:
# Look at the event models to understand the structure
cat -n backend/tarsy/models/event_models.py | head -150
Repository: codeready-toolchain/tarsy-bot
Length of output: 6626
🏁 Script executed:
# Check the channels definition
cat -n backend/tarsy/services/events/channels.py
Repository: codeready-toolchain/tarsy-bot
Length of output: 1421
🏁 Script executed:
# Look for BaseEvent to understand the base structure
rg -n "class BaseEvent" backend/ --type py -A 10Repository: codeready-toolchain/tarsy-bot
Length of output: 895
🏁 Script executed:
# Search for any scoring-related code or references in the codebase
rg -n "scoring" backend/ --type py -i | head -30Repository: codeready-toolchain/tarsy-bot
Length of output: 55
Implement scoring-specific events following the existing TARSy event pattern.
The WebSocket infrastructure supports the channel patterns mentioned (the global `sessions` channel and session-specific `session:{session_id}` channels). The event payload structure (`type`, `timestamp_us`, plus additional fields) aligns with existing events like `SessionCompletedEvent` and `SessionFailedEvent`.
However, the scoring-specific event classes (`scoring.started`, `scoring.progress`, `scoring.completed`, `scoring.failed`) do not exist in the codebase and must be created. Add them to `backend/tarsy/models/event_models.py` following the existing pattern: each should inherit from `BaseEvent`, include `session_id`, `score_id`, and `timestamp_us` fields, and use a `Literal` type for the `type` field. Implement publishing of these events in the scoring service using `EventPublisher`, following the pattern in `backend/tarsy/services/events/publisher.py`, with the channel determined by whether it is a global event (`sessions`) or a session-specific detail (`session:{session_id}`).
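As a rough illustration of that recommendation, the sketch below shows what such event models and channel selection could look like. `BaseEvent` is stubbed as a plain Pydantic model because its real definition is not reproduced here; the class names, the extra `total_score` field, and the helper function are assumptions, not the actual contents of `event_models.py`.

```python
# Illustrative sketch only: BaseEvent is stubbed, and the scoring event
# classes/fields are assumptions modeled on the pattern described above.
from typing import Literal

from pydantic import BaseModel


class BaseEvent(BaseModel):
    """Stand-in for tarsy's BaseEvent; the real class lives in event_models.py."""
    timestamp_us: int


class ScoringStartedEvent(BaseEvent):
    type: Literal["scoring.started"] = "scoring.started"
    session_id: str
    score_id: str


class ScoringCompletedEvent(BaseEvent):
    type: Literal["scoring.completed"] = "scoring.completed"
    session_id: str
    score_id: str
    total_score: int


def scoring_channel(session_id: str, session_detail: bool) -> str:
    """Pick the channel: global 'sessions' vs. per-session 'session:{id}'."""
    return f"session:{session_id}" if session_detail else "sessions"
```

The `Literal` defaults keep the `type` discriminator fixed per event class, mirroring how the existing session events are described above.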
* `scored_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
* `scored_at_us` (BIGINT) - When scoring was triggered (microseconds)
* **Async fields:**
* `status` (VARCHAR 50, NOT NULL) - pending | in_progress | completed | failed
* `started_at_us` (BIGINT, NOT NULL) - When status became in_progress
* `completed_at_us` (BIGINT, nullable) - When status became completed/failed
🧩 Analysis chain
🏁 Script executed:
# First, let's find and verify the file exists
find . -name "EP-0028-phase-1-scoring-api.md" -type fRepository: codeready-toolchain/tarsy-bot
Length of output: 131
🏁 Script executed:
# Read the relevant sections of the file to verify the field naming
# Lines around 238-243
sed -n '235,245p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 978
🏁 Script executed:
# Lines around 141
sed -n '138,145p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 419
🏁 Script executed:
# Lines around 366
sed -n '363,369p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 426
🏁 Script executed:
# Let's also check the overall naming patterns used in the document
# Look for other field names with "score" in them
grep -n "score_" docs/enhancements/pending/EP-0028-phase-1-scoring-api.md | head -20Repository: codeready-toolchain/tarsy-bot
Length of output: 1547
🏁 Script executed:
# Check for "scored_" patterns to see if there are other grammatically awkward names
grep -n "scored_" docs/enhancements/pending/EP-0028-phase-1-scoring-api.mdRepository: codeready-toolchain/tarsy-bot
Length of output: 430
Standardize field naming for grammatical consistency.
The field `scored_triggered_by` is grammatically awkward (it mixes a past participle with the action being described). Rename it to `score_triggered_by` to align with the document's naming patterns (e.g., `score_id`, `score_analysis`) and improve clarity.
Update all 4 occurrences:
- Line 238: schema definition
- Line 141: JSON response example
- Line 366: Phase 1 deliverable reference
- Line 434: implementation note
Proposed rename
- `scored_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
+ `score_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header

- "scored_triggered_by": "alice@example.com"
+ "score_triggered_by": "alice@example.com"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
* `scored_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
* `scored_at_us` (BIGINT) - When scoring was triggered (microseconds)
* **Async fields:**
* `status` (VARCHAR 50, NOT NULL) - pending | in_progress | completed | failed
* `started_at_us` (BIGINT, NOT NULL) - When status became in_progress
* `completed_at_us` (BIGINT, nullable) - When status became completed/failed

After:
* `score_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
* `scored_at_us` (BIGINT) - When scoring was triggered (microseconds)
* **Async fields:**
* `status` (VARCHAR 50, NOT NULL) - pending | in_progress | completed | failed
* `started_at_us` (BIGINT, NOT NULL) - When status became in_progress
* `completed_at_us` (BIGINT, nullable) - When status became completed/failed
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 238 -
243, Rename the grammatically awkward field `scored_triggered_by` to
`score_triggered_by` everywhere: update the schema definition (replace
`scored_triggered_by` with `score_triggered_by`), update the JSON response
example that references `scored_triggered_by`, update the Phase 1 deliverable
reference, and update the implementation note; ensure consistency in related
names (e.g., keep `score_id`, `score_analysis` style) and adjust any explanatory
text or references to the old name so all four occurrences are replaced.
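For orientation only, this is roughly how the quoted columns could be declared with the suggested `score_triggered_by` name. The sketch uses plain SQLAlchemy; the table name, primary key, and base class are assumptions and may not match tarsy's actual persistence models.

```python
# Sketch only: table/PK names are hypothetical; column names and types follow
# the schema quoted above, with scored_triggered_by renamed per the suggestion.
from sqlalchemy import BigInteger, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SessionScore(Base):
    __tablename__ = "session_scores"  # hypothetical table name

    score_id = Column(String(255), primary_key=True)   # assumed primary key
    session_id = Column(String(255), nullable=False)

    # Audit fields
    score_triggered_by = Column(String(255))    # user from X-Forwarded-User header
    scored_at_us = Column(BigInteger)            # when scoring was triggered (microseconds)

    # Async fields
    status = Column(String(50), nullable=False)           # pending|in_progress|completed|failed
    started_at_us = Column(BigInteger, nullable=False)    # when status became in_progress
    completed_at_us = Column(BigInteger, nullable=True)   # when status became completed/failed
```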
* [ ] Build multi-turn judge prompt construction with placeholder substitution
* **Turn 1 - Score Prompt:**
* Start with `JUDGE_PROMPT_SCORE` constant
* Replace `{{SESSION_CONVERSATION}}` with full conversation from History Service
* Replace `{{ALERT_DATA}}` with original alert data
* Replace `{{OUTPUT_SCHEMA}}` with: "You MUST end your response with a single line containing ONLY the total score as an integer (0-100)"
* Send to LLM, receive score analysis response
* **Score Extraction:**
* Extract `total_score` from last line using regex: `r'(\d+)\s*$'`
* Store full response (minus last line) as `score_analysis`
* **Turn 2 - Missing Tools Prompt:**
* Build prompt from `JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS` constant
* Send as continuation of conversation (LLM has context of its own scoring)
* Receive missing tools analysis response
* **Store Missing Tools:**
* Store full response as `missing_tools_analysis` (freeform text)
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get its size
wc -l docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 133
🏁 Script executed:
# Read the relevant sections to verify turn numbering
sed -n '320,370p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 2405
Correct turn numbering in multi-turn conversation flow.
The description identifies only 2 turns (lines 325 and 334), but line 365 references "turn 3" when storing the missing_tools_analysis. The turn numbering is inconsistent.
Current state:
- Line 325: "Turn 1 - Score Prompt:"
- Line 334: "Turn 2 - Missing Tools Prompt:"
- Line 365: "`missing_tools_analysis` (freeform text from turn 3)"
Expected: Either correct line 365 to reference "turn 2" if there are only 2 turns, or add a description of Turn 3 in the Phase 1 multi-turn flow section (lines 324-340).
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 324 -
340, The turn numbering is inconsistent: you list only Turn 1 (Score Prompt) and
Turn 2 (Missing Tools Prompt) but later refer to "turn 3" when storing
missing_tools_analysis; update the text so the store step references "turn 2"
(i.e., change "from turn 3" to "from turn 2") OR add a new "Turn 3" subsection
describing the additional follow-up if you intend three turns; ensure the symbol
missing_tools_analysis is documented under the correct turn label to keep the
flow consistent.
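Whichever turn numbering is kept, the flow itself is two calls in one conversation. The sketch below shows the score extraction and follow-up turn under that reading; `send_turn` is a stand-in for tarsy's LLM client (whose real interface is not shown here), so treat the function shapes as assumptions.

```python
# Sketch: two judge turns plus score extraction; send_turn is a placeholder for
# the real LLM client call that continues one conversation across turns.
import re
from typing import Callable


def extract_score(turn1_response: str) -> tuple[int, str]:
    # EP rule: the last line must contain only the integer score (regex r'(\d+)\s*$').
    text = turn1_response.strip()
    match = re.search(r"(\d+)\s*$", text)
    if match is None:
        raise ValueError("judge response did not end with an integer score")
    total_score = int(match.group(1))
    score_analysis = text.rsplit("\n", 1)[0]  # full response minus the score line
    return total_score, score_analysis


def run_judge(
    send_turn: Callable[[str], str],
    score_prompt: str,
    followup_missing_tools_prompt: str,
) -> dict[str, object]:
    turn1 = send_turn(score_prompt)                    # first turn: score prompt
    total_score, score_analysis = extract_score(turn1)

    turn2 = send_turn(followup_missing_tools_prompt)   # follow-up: missing tools
    return {
        "total_score": total_score,
        "score_analysis": score_analysis,
        "missing_tools_analysis": turn2,               # stored as freeform text
    }
```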
* Support conversation history across turns
* Use LLM client with conversation history support
Remove redundant line in Phase 3 LLM integration.
Line 344 restates line 343. Remove the redundant bullet.
Proposed fix
* [ ] Integrate LLM client for multi-turn conversation
* Use TARSy's default LLM configuration via existing LLM client infrastructure
* Support conversation history across turns
- * Use LLM client with conversation history support

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
* Support conversation history across turns
* Use LLM client with conversation history support

After:
* [ ] Integrate LLM client for multi-turn conversation
* Use TARSy's default LLM configuration via existing LLM client infrastructure
* Support conversation history across turns
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 343 -
344, Remove the duplicate bullet in the Phase 3 LLM integration list: delete the
redundant line that repeats "Support conversation history across turns" (the
duplicate of the previous bullet), leaving only a single instance and ensuring
the list now contains "Support conversation history across turns" and "Use LLM
client with conversation history support" once each.
* [ ] Add POST /score endpoint with background task execution
* Endpoint: `POST /api/v1/scoring/sessions/{session_id}/score`
* Request body: `{"force_rescore": false}` (optional)
* Implementation:
* Check if score exists (unless `force_rescore=true`)
* If exists and not forcing rescore, return existing score immediately
* Otherwise, execute scoring via FastAPI `BackgroundTasks`
* Return score result (immediate if exists, or after background task completion)
* Error responses:
* `400 Bad Request` - Session not completed or invalid state
* `500 Internal Server Error` - LLM API failure or database error
🧩 Analysis chain
🏁 Script executed:
# Find the EP document
find . -name "EP-0028-phase-1-scoring-api.md" -type fRepository: codeready-toolchain/tarsy-bot
Length of output: 131
🏁 Script executed:
# Read the entire file to see the context
wc -l ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 135
🏁 Script executed:
# Read lines around 91-93 to see the async pattern reference
sed -n '85,100p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 664
🏁 Script executed:
# Read the Phase 4 section around lines 404-414
sed -n '396,420p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1232
🏁 Script executed:
# Check lines around 150-160 to see if there are more details about the async pattern
sed -n '150,165p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 624
🏁 Script executed:
# Check if there's a response schema or more details about what the endpoint returns
sed -n '410,445p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1540
🏁 Script executed:
# Let me check if there are any response examples or schemas
sed -n '100,130p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 993
🏁 Script executed:
# Let me check the complete response schema to see what the endpoint actually returns
sed -n '100,145p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1619
🏁 Script executed:
# Let me verify the implementation pattern described in the document
sed -n '85,115p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1076
🏁 Script executed:
# Double-check by reading the exact lines mentioned in the review (91, 93)
sed -n '88,95p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 444
Phase 4 checklist could be more explicit about immediate async return.
Lines 404-414 describe the `POST /score` endpoint implementation less explicitly than the earlier sections covering the async pattern. The phrase "Return score result (immediate if exists, or after background task completion)" could be more specific. The response schema in lines 100-115 and the endpoint description in lines 88-95 make clear that the endpoint returns immediately with `status: "pending"` for new scoring requests, and Phase 4 should match that explicitness.
Consider rewording the Phase 4 implementation bullets to clearly separate: (1) existing score cases that return immediately with data, and (2) new scoring cases that launch background tasks and return immediately with pending status.
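A sketch of those two paths with FastAPI `BackgroundTasks` follows. Storage and the scoring job are stubbed in memory and the request body is modeled as an optional Pydantic model; none of this mirrors tarsy's real services, it only demonstrates the immediate-return behaviour being asked for.

```python
# Sketch only: in-memory stubs stand in for the real session/score storage and
# the LLM scoring job; the point is the two immediate-return paths.
import uuid
from typing import Optional

from fastapi import APIRouter, BackgroundTasks, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/scoring")

_completed_sessions = {"sess-1"}        # stub: sessions known to be completed
_scores: dict[str, dict] = {}           # stub: session_id -> score record


class ScoreRequest(BaseModel):
    force_rescore: bool = False


def _run_scoring_job(session_id: str) -> None:
    # Placeholder for the background LLM scoring work.
    _scores[session_id]["status"] = "completed"
    _scores[session_id]["total_score"] = 85


@router.post("/sessions/{session_id}/score")
async def score_session(
    session_id: str,
    background_tasks: BackgroundTasks,
    request: Optional[ScoreRequest] = None,
):
    if session_id not in _completed_sessions:
        raise HTTPException(status_code=400, detail="Session not completed")

    force_rescore = bool(request and request.force_rescore)
    existing = _scores.get(session_id)
    if existing and not force_rescore:
        return existing                             # path 1: existing score, returned as-is

    score_id = str(uuid.uuid4())                    # path 2: new scoring run
    _scores[session_id] = {"score_id": score_id, "session_id": session_id,
                           "status": "pending"}
    background_tasks.add_task(_run_scoring_job, session_id)
    return _scores[session_id]                      # returns immediately with status "pending"
```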
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 404 -
414, The Phase 4 checklist is ambiguous about immediate async return for POST
/api/v1/scoring/sessions/{session_id}/score; update the implementation bullets
to explicitly split the two flows: (1) if a score already exists and
force_rescore is false, return the existing score immediately with full data,
and (2) if no score exists or force_rescore is true, enqueue the scoring work
via FastAPI BackgroundTasks and return immediately with a response indicating
status: "pending" (and any job/session id), not waiting for background
completion; make the wording align with the response schema in the earlier
section and reference the endpoint path POST
/api/v1/scoring/sessions/{session_id}/score and the request body
{"force_rescore": false}.
This is just the core API that will be followed by further EPs that will flesh out more functionality and UI integration on top of it.
Assisted-by: Claude Code
JIRA: https://issues.redhat.com/browse/SANDBOX-1531