[SANDBOX-1531] Introduce a new EP for alert session scoring API. #180
base: master
Conversation
This is just the core API; follow-up EPs will flesh out more functionality and UI integration on top of it.
**Walkthrough**: Adds the EP-0028 Phase 1 design: an async, LLM-based Alert Session Scoring API with multi-dimensional scoring (logical_flow, consistency, tool_relevance, synthesis_quality), criteria versioning via SHA256-hashed prompts, REST/WebSocket interactions, storage of freeform analyses, and a six-phase implementation roadmap.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as Scoring API
    participant Service as Scoring Service
    participant LLM
    participant DB as Storage
    Client->>API: POST /score-session (session_id, alert_data)
    API->>Service: validate request, enqueue scoring job
    Service->>DB: fetch session/context (if needed)
    Service->>LLM: send constructed multi-turn prompt (SESSION_CONVERSATION, ALERT_DATA, OUTPUT_SCHEMA)
    LLM-->>Service: structured judge response (JSON per schema)
    Service->>Service: normalize/compute total_score, verify criteria SHA256
    Service->>DB: persist score record, freeform analysis, metadata
    Service-->>API: return scoring initiation/response
    API-->>Client: 200 OK or job accepted, WebSocket progress updates as available
```
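For orientation, the accept-and-enqueue portion of this flow could look roughly like the FastAPI sketch below. The route path, request model, and helper functions are assumptions for illustration, not code from this PR:

```python
# Illustrative sketch only -- route path, helpers and models are assumptions, not code from this PR.
from fastapi import APIRouter, BackgroundTasks, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/scoring")


class ScoreRequest(BaseModel):
    session_id: str
    force_rescore: bool = False


async def fetch_completed_session(session_id: str):
    """Placeholder: look up a completed session via the History Service."""
    ...


async def run_scoring_job(session_id: str, force_rescore: bool) -> None:
    """Placeholder: build the judge prompt, call the LLM, persist the score."""
    ...


@router.post("/score-session", status_code=202)
async def score_session(req: ScoreRequest, background_tasks: BackgroundTasks):
    # Validate the request and confirm the session exists and is completed.
    if await fetch_completed_session(req.session_id) is None:
        raise HTTPException(status_code=400, detail="session not found or not completed")
    # Run the judge LLM off the request path; the client polls or listens on the WebSocket.
    background_tasks.add_task(run_scoring_job, req.session_id, req.force_rescore)
    return {"session_id": req.session_id, "status": "accepted"}
```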
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧠 Learnings (6)
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Design section with detailed technical design focusing on HOW to implement the solution with clean, practical design
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include detailed explanations of WHY (keep it brief) in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Problem Statement section with a brief description of what needs to be solved
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include irrelevant context or noise in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~839-~839: Ensure spelling is correct
Context: ...on (LLM-dependent) - Score retrieval: < 100ms (database query) - Support for concurre...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
17-17: Bare URL used
(MD034, no-bare-urls)
447-447: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
196-325: Verify referenced file paths and implementation status. The EP references several Python files for Phase 1 implementation:

- `backend/tarsy/services/scoring_service.py`
- `backend/tarsy/models/scoring_models.py`
- `backend/tarsy/controllers/scoring_controller.py`
- `backend/tarsy/repositories/scoring_repository.py`
- `config/scoring_config.yaml`

Confirm that:

- Directory structure (`services/`, `models/`, `controllers/`, `repositories/`) follows established patterns in the codebase
- Configuration file location (`config/scoring_config.yaml`) aligns with existing config patterns
- Main application file (`backend/tarsy/main.py`) exists and can register new routes
- File paths are consistent with the current project structure
> ## Experimental Background
>
> **Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
Format bare URL as markdown link.
Line 17 contains a bare URL that should be properly formatted as a markdown link to comply with MD034.
```diff
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [https://github.com/metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
```

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
17-17: Bare URL used
(MD034, no-bare-urls)
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around line 17, the
comment contains a bare URL which violates MD034; update the line to replace the
bare URL with a markdown link by wrapping an appropriate link text in square
brackets followed by the URL in parentheses (e.g., use descriptive text like the
repository name as the link text), ensuring the link is inline and not a raw
URL.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #180      +/-   ##
==========================================
+ Coverage   83.94%   83.97%   +0.02%
==========================================
  Files         124      124
  Lines       11947    11947
==========================================
+ Hits        10029    10032       +3
+ Misses       1918     1915       -3
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
♻️ Duplicate comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
62-62: Format bare URL as markdown link to comply with MD034. Line 62 contains a bare URL that should be wrapped as a markdown link.

```diff
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [https://github.com/metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
```
🧹 Nitpick comments (4)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (4)
38-1182: Consolidate design and implementation content into explicit formal sections. The document contains comprehensive design and implementation details but lacks formal section headers for "Design" and "Implementation Plan" as required by EP guidelines. Current design content is scattered across subsections (System Architecture, Data Design, API Design, Configuration, Security, Performance, etc.), and implementation details are embedded in Implementation Status rather than as a distinct section.

Reorganize as follows:

After "Use Cases" (line 149), add a clear section:

```
## Design
```

Then consolidate existing design subsections (System Architecture through Documentation Requirements) under this header.

Replace "Implementation Status" (line 1195) with formal sections:

```
## Implementation Plan

### Phases & Dependencies
[expand "Next Steps" into phased breakdown with dependencies]

## Status & Tracking

**Status:** Pending
[rest of current status info]
```

This provides the structure: Problem Statement → Goals → Overview → Design → Use Cases → Implementation Plan → Success Criteria → Status & Tracking, improving navigability and alignment with EP standards.
499-530: Clarify mutable vs. immutable aspects of scoring criteria. The design states that "Scoring criteria are defined in the freeform `judge_prompt` field" and are "flexible and can evolve without code changes" (line 500). However, the content-hashing mechanism (lines 599-601) treats the entire configuration as immutable once hashed. This creates potential ambiguity:

- Question: If criteria evolve and the hash changes, how do operators know when to re-score historical sessions? What's the recommended workflow for criteria migration?
- Suggestion: Add a section clarifying the operational workflow when criteria change (e.g., "Operators re-score historical sessions with `force_rescore=true` to compare old vs. new scores and assess the impact of criteria evolution").

This is especially important for use case #4 (Criteria Evolution Tracking, lines 115-121), which relies on clear criteria change management.
622-894: Judge prompt template is comprehensive but may need validation against actual LLM behavior. The judge prompt (lines 622-894) is extensive (~270 lines) and includes detailed scoring criteria, evaluation philosophy, and output format instructions. The design relies heavily on LLM compliance with these instructions, particularly:
- Scoring calibration (lines 843-865): Assumes LLM will produce average scores of 55-75
- Missing tools identification (lines 758-778): Requires LLM to distinguish between "critical" vs. "nice-to-have" tools
- Alternative approaches (lines 780-808): Requires LLM to generate realistic, tool-specific steps
Consideration: Phase 1 testing should validate that the chosen judge LLM (e.g., Gemini Flash, GPT-4) consistently produces scores and feedback that align with these expectations. If variance is high, the prompt may need refinement.
Recommendation: Add a section in Testing Strategy (line 973) specifically for "Judge LLM Consistency Validation" testing the same session 3-5 times to measure score variance and feedback stability. Use case #7 (line 141-147) hints at this but doesn't formalize it as a testing requirement.
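Such a consistency check could be a small pytest that scores one fixed session several times and bounds the variance. A sketch with a hypothetical fixture and illustrative thresholds:

```python
import statistics


def test_judge_score_consistency(score_session_with_judge):
    """`score_session_with_judge` is a hypothetical pytest fixture that runs the real
    judge LLM against one fixed, completed session and returns the parsed score."""
    totals = [score_session_with_judge("session-123")["total_score"] for _ in range(5)]
    assert all(0 <= t <= 100 for t in totals)
    # Illustrative threshold; tune once real variance data exists.
    assert statistics.pstdev(totals) <= 5, f"judge scores vary too much: {totals}"
```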
1195-1207: Expand "Next Steps" into a formal Implementation Plan with phases and dependencies. The "Next Steps" section (lines 1199-1207) lists 9 tasks sequentially but lacks explicit phase information, dependencies, and estimated effort. For an Enhancement Proposal, a more structured Implementation Plan would clarify:
- Phase breakdown: Which tasks are Phase 1a (database), 1b (service), 1c (API), 1d (testing)?
- Critical path: Which tasks block others (e.g., database migration must precede service implementation)?
- Dependency map: Does "Write unit and integration tests" depend on "Implement API endpoints"?
- Ownership: Is there a suggested owner or team for each phase?
Suggestion: Restructure "Implementation Status" section as:
```
## Implementation Plan

### Phased Tasks

**Phase 1a: Infrastructure**
- Create Alembic migration for all scoring tables
- (Dependency: none; blocks Phase 1b)

**Phase 1b: Service Layer**
- Implement scoring service and models
- (Dependency: Phase 1a; blocks Phase 1c)

[continue for all phases...]

## Status & Tracking

**Status:** Pending
...
```

This clarifies sequencing and dependencies for implementation teams.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~950-~950: Ensure spelling is correct
Context: ...on (LLM-dependent) - Score retrieval: < 100ms (database query) - Support for concurre...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
62-62: Bare URL used
(MD034, no-bare-urls)
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
108-108: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
116-116: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
124-124: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
134-134: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
558-558: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
155-167: Verify referenced components exist in codebase. The design references existing components (History Service at `backend/tarsy/services/history_service.py`, LLM Client at `backend/tarsy/integrations/llm_client.py`) and proposes new components following the pattern `backend/tarsy/{services,models,controllers,repositories}`. Confirm:

- History Service exists and exposes the required data retrieval logic for complete session data
- LLM Client exists and can be reused for judge LLM calls as proposed
- Codebase directory structure follows the backend/tarsy pattern for new service components
- Main application (`backend/tarsy/main.py`) exists and supports route registration
> ### Primary Use Cases
>
> 1. **Manual Quality Review**
Specify language identifier for all fenced code blocks to comply with MD040.
Use cases embed code blocks without language specifications. Add a `text` identifier to each opening fence:

````diff
 1. **Manual Quality Review**
-```
+```text
 Operator: Reviews completed security investigation
````

Apply the same change to all 7 use case blocks (lines 92, 100, 108, 116, 124, 134, 142).
Also applies to: 100-100, 108-108, 116-116, 124-124, 134-134, 142-142
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines
92,100,108,116,124,134,142 the fenced code blocks are missing a language
identifier which violates MD040; update each opening triple-backtick fence at
those lines to use ```text so every use-case code block begins with ```text
(apply this change to all seven specified lines).
alexeykazakov
left a comment
Thanks for the document! I've started the review (not finished yet). Here are just a few first comments.
> - **Criteria Evolution**: Content-addressed configuration enables tracking scoring criteria changes over time
> - **Flexible Scoring**: JSONB score breakdowns adapt to evolving evaluation dimensions without schema migrations
> - **Actionable Feedback**: Normalized storage of missing tools and alternative approaches enables analytics
> - **Manual Control**: Phase 1 is operator-triggered only; automation deferred to future phases
Do you think it's better to support triggering evaluation manually only for now? I wonder if automated evaluation would actually be easier...
My idea here was that a manually invokable REST endpoint would be easier to play with than something running on a schedule. Not much further thought behind it.
> ## Non-Goals (Future Phases)
>
> - UI integration for displaying scores
Well... I would include some simple/primitive UI in Phase 1. Otherwise we would have to use the API any time we want to see the score/analysis. So we would have to build something anyway.
Co-authored-by: Alexey Kazakov <alkazako@redhat.com>
Actionable comments posted: 1
🧹 Nitpick comments (5)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (5)
62-62: Format bare URL as markdown link (MD034 compliance). Line 62 contains a bare URL that should be wrapped in markdown link syntax to comply with MD034.

```diff
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [GitHub: metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
```
92-92: Add language identifier to all use case code blocks (MD040 compliance). Use case blocks at these lines lack language identifiers. Standardize by adding a `text` identifier:

````diff
 1. **Manual Quality Review**
-```
+```text
 Operator: Reviews completed security investigation
````

Apply the same change to all seven use case blocks (lines 92, 100, 108, 116, 124, 134, 142).
Also applies to: 100-100, 108-108, 116-116, 124-124, 134-134, 142-142
558-558: Add language identifier to the Example Prompt Flow code block (MD040 compliance):

````diff
 **Example Prompt Flow:**
-```
+```text
 [Criteria and instructions]
````
1-1200: Reorganize into formal Design and Implementation Plan sections per EP guidelines. The document still lacks the explicit structure required by learnings. While comprehensive design content exists, it is scattered across subsections. Consolidate as follows:
Current structure issue: Content mixes design details (System Architecture, Data Design, API Design, etc.) without a consolidated "Design" header, and implementation roadmap is buried in "Implementation Status" with minimal detail.
Required restructuring:
After "Problem Statement" (line 31): Add new Design section that consolidates:
- Design Overview & Principles (currently lines 38-51)
- Design Goals (currently lines 52-59)
- System Architecture → System Architecture subsection
- Data Design → Data Design subsection
- API Design → API Design subsection
- Configuration → Configuration subsection
- Security → Security subsection
- Performance & Scalability → Performance subsection
- Error Handling → Error Handling subsection
- Monitoring & Observability → Monitoring subsection
After Design section: Add new Implementation Plan section that expands:
- Migration & Deployment → Migration subsection
- Testing Strategy → Testing subsection (move from current location)
- Move "Next Steps" (lines 1199-1209) into Implementation Plan with phased breakdown
Keep: Success Criteria (lines 1183-1192)
Rename: "Implementation Status" to "Status & Tracking" with cleaner metadata
This provides the required EP structure: Problem Statement → Design → Implementation Plan → Success Criteria → Status & Tracking
Based on learnings, this reorganization is essential for AI consumption of the document.
935-944: Detail input validation strategy for API endpoints. Line 940 mentions "Input validation on all API endpoints (session_id format, request body schema)" but provides no specifics:

- What is the session_id format validation? (UUID regex, length, allowed characters?)
- What schema validation is performed on the request body?
- Are there limits on field sizes (e.g., `force_rescore` as boolean)?
- What about integer constraints (e.g., total_score bounds check before storage)?

Recommend specifying validation rules explicitly in the API Design section (lines 438-496), possibly with examples.
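If the EP adopts explicit rules, they could be encoded directly in the request/response models. A Pydantic sketch assuming UUID session IDs and a 0-100 score range (both assumptions the EP would need to confirm):

```python
from uuid import UUID

from pydantic import BaseModel, Field


class ScoreSessionRequest(BaseModel):
    # Assumes session IDs are UUIDs; adjust if the history service uses another format.
    session_id: UUID
    force_rescore: bool = False


class ScoreRecord(BaseModel):
    total_score: int = Field(ge=0, le=100)                    # bounds-checked before storage
    criteria_hash: str = Field(min_length=64, max_length=64)  # SHA256 hex digest
```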
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🔇 Additional comments (6)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (6)
170-188: Clarify History Service integration and error handling for incomplete sessions. Line 174 references retrieving session data via "History Service (via `/final-analysis` endpoint logic)", but the integration details are sparse:

- How is the History Service called? Direct method call or HTTP endpoint?
- What if a session is not yet completed? It should return 400 (bad request) per line 484, but error handling details are missing.
- What constitutes "completed session data"? Is there a status check needed?
- How are partial or failed sessions handled?
Recommend adding clarification to the "Data Flow" section (lines 170-188) about:
- Exact method of retrieving session data from History Service
- Validation of session completion status
- Error response for incomplete or non-existent sessions
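A sketch of the behaviour this comment asks for, assuming an in-process call to a hypothetical `get_session_with_details` method on the History Service:

```python
from fastapi import HTTPException


async def load_completed_session(history_service, session_id: str):
    """Fetch full session data and reject sessions that are missing or not completed.
    `get_session_with_details` is a hypothetical History Service method."""
    session = await history_service.get_session_with_details(session_id)
    if session is None:
        raise HTTPException(status_code=404, detail=f"session {session_id} not found")
    if session.status != "completed":
        # EP line 484 maps not-yet-completed sessions to 400 Bad Request.
        raise HTTPException(status_code=400, detail="session is not completed yet")
    return session
```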
448-453: Clarify force_rescore semantics and score lifecycle. The `force_rescore` flag (line 451) is mentioned but its behavior is underspecified:

- If `force_rescore=true` and criteria_hash has changed, does it:
  - Create a new score_id?
  - Update the existing score_id with new criteria_hash (breaking historical tracking)?
  - Store both old and new scores with different score_ids?
- Data flow (line 172) mentions "Return existing score" but doesn't address criteria evolution.
- API response model (lines 401-413) includes an `is_current_criteria` field, suggesting multiple scores per session are possible, but this contradicts the `UNIQUE` constraint on `session_id` in line 215.

The unique constraint on `session_id` (line 215) suggests only one score per session, but historical tracking across criteria versions requires multiple scores per session.

Recommend clarifying:

- Database schema: should `session_id` be unique, or should (session_id, criteria_hash) be unique instead?
- Update the data flow to explain the multi-criteria scoring lifecycle
- Clarify the `is_current_criteria` computation logic

Also applies to: 451-452
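If the intent is one score per (session, criteria version), a composite unique constraint expresses that directly. A SQLAlchemy-style sketch with assumed table and column names:

```python
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SessionScore(Base):
    __tablename__ = "session_scores"
    __table_args__ = (
        # One score per (session, criteria version) instead of one per session.
        UniqueConstraint("session_id", "criteria_hash", name="uq_session_criteria"),
    )

    score_id = Column(Integer, primary_key=True)
    session_id = Column(String, nullable=False, index=True)
    criteria_hash = Column(String(64), nullable=False)
    total_score = Column(Integer, nullable=False)
```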
155-167: Specify which components are modified vs. new in the main.py integration. Line 162 mentions "Register scoring API routes" in `backend/tarsy/main.py` but doesn't specify:

- Are existing routes in main.py? How do they currently load modules?
- Is this a simple `app.include_router()` call, or does it require changes to initialization logic?
- Are there existing FastAPI router patterns in the codebase to follow?
Recommend adding a concrete code example of the main.py integration pattern for clarity during implementation.
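For reference, if the controller exposes a module-level router, the registration is likely a one-liner. A sketch with assumed module and prefix names (not verified against the actual main.py):

```python
# In backend/tarsy/main.py (illustrative wiring; module path and prefix are assumptions)
from fastapi import FastAPI

from tarsy.controllers.scoring_controller import router as scoring_router  # hypothetical module

app = FastAPI()
app.include_router(scoring_router, prefix="/api/v1", tags=["scoring"])
```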
954-956: Clarify asynchronous scoring implementation details. Line 955 mentions "Asynchronous scoring using background tasks (FastAPI background tasks)" but implementation details are missing:

- Are background tasks the right choice, or should this be a job queue (Celery, RQ)?
- How does the caller get notified when async scoring completes? Polling? Webhooks? (not mentioned)
- What is the timeout for the async task? (no timeout specified)
- If the async task fails, how is retry handled? (lines 962-965 specify retry for the API layer, but what about the background-task level?)
- Can a user call GET /score while scoring is in progress? Return 202 Accepted?
Recommend expanding Performance section with async scoring flow and polling/notification strategy.
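One possible shape for the timeout/retry questions above, sketched with illustrative values and placeholder helpers (none of these names come from the EP):

```python
import asyncio

JUDGE_TIMEOUT_SECONDS = 120  # illustrative cap on a single judge LLM call
MAX_ATTEMPTS = 3             # illustrative retry budget for transient failures


async def call_judge_llm(session_id: str) -> dict:
    """Placeholder: build the prompt and call the judge LLM."""
    ...


async def persist_score(session_id: str, result: dict) -> None:
    """Placeholder: write the score record to the database."""
    ...


async def mark_scoring_failed(session_id: str) -> None:
    """Placeholder: record a failed state so GET /score can report it."""
    ...


async def run_scoring_job(session_id: str) -> None:
    """Background-task sketch: judge LLM call with a timeout and simple retry."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            result = await asyncio.wait_for(call_judge_llm(session_id), JUDGE_TIMEOUT_SECONDS)
            await persist_score(session_id, result)
            return
        except Exception:  # real code would narrow this to timeout/transient LLM errors
            if attempt == MAX_ATTEMPTS:
                await mark_scoring_failed(session_id)
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff between retries
```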
622-895: Judge prompt design is comprehensive and well-structured. ✓ The judge prompt template demonstrates excellent prompt engineering with clear evaluation philosophy, detailed scoring criteria, explicit language patterns, and comprehensive guidance for missing-tool and alternative-approach identification. The emphasis on rigor and critical evaluation aligns well with the stated design goals.

Note: The prompt is intentionally lengthy (~300 lines) to support comprehensive evaluation. Per prototype findings (line 68), session data plus the prompt fits comfortably in a 1M-token window, so this is acceptable for Phase 1.
156-166: Correct the file path reference for the LLM integration. The EP references `backend/tarsy/integrations/llm_client.py` (line 165), but this file does not exist. The LLM client is implemented as a package at `backend/tarsy/integrations/llm/` with modules like `client.py` and `manager.py`. Update the EP to reference `backend/tarsy/integrations/llm/` (or the specific module: `llm/manager.py`, `llm/client.py`).

All other referenced paths are correct or will be created with the proper directory structure:

- `backend/tarsy/services/scoring_service.py` ✓ (directory exists)
- `backend/tarsy/models/scoring_models.py` ✓ (directory exists)
- `backend/tarsy/controllers/scoring_controller.py` ✓ (directory exists)
- `backend/tarsy/repositories/scoring_repository.py` ✓ (directory exists)
- `backend/tarsy/services/history_service.py` ✓ (file exists)
- `backend/tarsy/main.py` ✓ (file exists)

Likely an incorrect or invalid review comment.
> ## Security
>
> ### Authentication & Authorization
>
> **Authentication**: Handled at infrastructure level by oauth2-proxy (reverse proxy)
> - Users authenticated via OAuth (GitHub, Google, etc.) or JWT tokens
> - Application trusts `X-Forwarded-User` and `X-Forwarded-Email` headers from oauth2-proxy
> - No application-level authentication middleware required
>
> **Authorization**: Not implemented at application level
> - All authenticated users have full access to all endpoints
> - No role-based access control (RBAC) currently exists
> - Authorization is a future enhancement (referenced in EP-0004 as "may be added in future phases")
>
> **User Attribution**:
> - User identification extracted from oauth2-proxy headers via `extract_author_from_request()` helper
> - Used for audit trails and tracking who triggered scoring operations
> - Follows same pattern as alert submission (see `alert_controller.py`)
>
> ### Security Controls
>
> - Input validation on all API endpoints (session_id format, request body schema)
> - Rate limiting on scoring endpoints to prevent LLM API abuse (implemented at oauth2-proxy/infrastructure level)
> - Judge LLM prompts sanitized to prevent prompt injection
> - Score data treated with same sensitivity as session data (no additional PII)
> - HTTPS/WSS enforced for all communication (infrastructure level)
Add explicit secret management and LLM API key handling strategy.
Security section lacks detail on LLM API key management:
- How are LLM provider credentials (API keys) managed? (vault, environment variables, etc.)
- Is key rotation supported?
- Are API keys logged or exposed in error messages? (line 965 mentions logging LLM responses - truncated at 1000 chars - but are keys included?)
- How are failed LLM requests logged without leaking credentials?
Recommend adding explicit guidance:
- Use secure credential storage (e.g., HashiCorp Vault, AWS Secrets Manager)
- Never log full prompts if they contain sensitive data
- API error responses should not expose keys or internal prompts
Based on learnings, security posture gaps degrade guarantees.
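A minimal pattern for the key-handling and redaction points above, assuming the key is injected as an environment variable from a secret store (variable name and logging format are illustrative):

```python
import logging
import os

logger = logging.getLogger(__name__)


def get_judge_api_key() -> str:
    # Assumption: the key is injected by the platform (e.g. from Vault / AWS Secrets Manager)
    # as an environment variable and is never committed to repo config files.
    key = os.environ.get("JUDGE_LLM_API_KEY")
    if not key:
        raise RuntimeError("JUDGE_LLM_API_KEY is not configured")
    return key


def log_llm_failure(session_id: str, prompt: str, error: Exception) -> None:
    # Log only non-sensitive metadata and a truncated prompt snippet; never the API key.
    logger.error(
        "judge LLM call failed: session=%s prompt_snippet=%r error=%s",
        session_id,
        prompt[:200],
        type(error).__name__,
    )
```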
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines
918–944, add an explicit LLM API key and secret management section: state that
provider credentials must be stored in a managed secrets store (e.g., HashiCorp
Vault, AWS Secrets Manager) with IAM-restricted access, enforce key rotation
policies, and avoid embedding keys in environment/config files except as
short-lived tokens; require that all logs and error handlers redact or omit API
keys and full prompts (log only truncated/sanitized prompt snippets and
non-sensitive metadata), ensure failed LLM request logging strips credentials
and sensitive inputs, document recommended operational controls (audit logging
of secret access, least-privilege service accounts, CI checks for accidental
secret exposure), and reference these requirements as mandatory for production
deployments.
> 7. Store score in database:
>    - Insert `session_scores` record with score_id, total_score, score_breakdown, etc.
>    - Insert `score_missing_tools` records (one per missing tool)
>    - Insert `score_alternative_approaches` records (one per alternative)
>    - Insert `score_alternative_approach_steps` records (one per step in each alternative)
Let's think about the pros and cons of a detailed DB schema vs. a simplified one with a flexible JSON or even string field as the model.
How are we going to consume the stored sections of the score report? My understanding is that it's mostly for later AI analysis. In that case a JSON/string field should be good enough, right?
Pros of the proposed approach:
- Easy to query score reports for individual sections.
Cons:
- Requires initial parsing of the AI-generated report.
- Have to deal with formatting mistakes of AI (because of parsing)
- Any report template changes are very expensive: they require changes to the parser, models, and DB schema (including migration).
Alternative approaches.
Option 1:
- We keep detailed API/internal model with breakdown:
- score (int)
- missing_tools
- alternative_approaches
- etc.
- The DB model/schema is simplified and has:
- score (int)
- report (JSON) - serialized API/internal model (w/o score int as it's in a dedicated field)
Pros: we can always change the internal model and API w/o changing the DB schema.
Cons: harder to migrate existing records when the API changes w/o breaking de-serialization (from DB to API model).
Option 2:
- All models and API:
- score (int)
- report (string) - an MD document with all components like missing tools, alternative approaches, etc. Generated by AI.
Pros: very flexible. We can change the format of the report. No db migration ever needed.
Cons: You can't easily query the DB for score reports with specific sections without parsing the report. But do we really need this?
I personally think the best option is Option 2. We just let the AI generate the score report. We extract the score number and store the entire report as an MD document (string). We will use that report as-is later on for AI analysis. Even if the format of the document is not 100% consistent, it's not a big problem for AI. But it gives us a lot of flexibility to change the template. We also don't have to deal with parsing the initial report and making sure that the AI follows the format properly.
I think I'd go half-way actually :)
I have a clearer vision for the missing tools, so let me describe why we need at least a little bit of structure (name + description) there and what it would be used for.
We'd aggregate all the missing tools from all the scores and let the LLM cluster them together semantically.
Once we have that clustering, we can "link" the missing tools in the scores with their "representative tools".
We then use just the "representative tools" (and the counts of the linked missing tools from the scores) for the higher level report on the "most missing tools".
Comparing the representative missing tools with the current tool "menu" that the agent provides (again using an LLM to do a semantic comparison) can enable us to mark the tools as implemented.
This information can then for example mark scores as obsolete (we can't do much more at that point unfortunately because we cannot just re-trigger sessions with new tools - the payloads that they analyzed will have been long gone). But we can for example exclude such obsolete scores from the quality reporting because they are no longer relevant to the current state of the tooling available to TARSy.
With alternative approaches we could maybe still do with name + description, too. This would again enable us to semantically cluster them together and then maybe report on analyses where the LLM most frequently chooses a non-optimal approach. I have not developed that idea too much in my mind though - so I'm not sure if and how that would be useful.
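A sketch of the minimal structure this clustering idea needs; all type and field names below are assumptions for illustration, not part of the EP:

```python
from dataclasses import dataclass


@dataclass
class MissingTool:
    """Stored per score: just enough structure for later semantic clustering."""
    name: str
    description: str


@dataclass
class RepresentativeTool:
    """Produced by an LLM clustering pass over all MissingTool entries."""
    name: str
    description: str
    linked_missing_tool_count: int = 0
    implemented: bool = False  # set when it matches a tool in the agent's current menu


def top_missing_tools(representatives: list, n: int = 10) -> list:
    """Input for a higher-level 'most missing tools' report."""
    return sorted(representatives, key=lambda r: r.linked_missing_tool_count, reverse=True)[:n]
```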
> The scoring system uses a YAML configuration file following TARSy's established configuration patterns (similar to `agents.yaml` and `llm_providers.yaml`). This file contains the judge prompt template and scoring parameters. The entire configuration is content-addressed using a hash to track criteria evolution.
>
> **Example Configuration:**
I would rather:
- Hard code the template and access it via the prompt_builder. Pros: works OOTB. No need to provide your own template to use this feature.
- Keep the configuration minimal and move it to agents.yaml:
- disabled/enabled - Optional. Disabled by default.
- llm_provider (the LLM provider includes the model name, so no need to keep it separate) - Optional. Use the system default provider by default.
I would definitely want the judge prompt to be part of the default configuration so that people don't have to provide their own (not sure what the prompt_builder is).
I agree we can simplify the config and move it to the agents.yaml but I think it is important to keep the judge prompt configurable:
- so that people can play with it locally
- so that when it changes (and I hope it will be evolving), we can mark scores as obsolete and let the user re-run them.
For the latter point, I think it is also useful to keep the judge prompts in the DB so that we can, e.g., have multiple scores for a single session, each done by a different judge prompt, so that we can compare and contrast the effect the different judge prompts have on the score.
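Modeled in Python for illustration, the minimal configuration surface discussed here might look roughly like this; the field names, defaults, and placeholder prompt text are assumptions, not an agreed design:

```python
from typing import Optional

from pydantic import BaseModel

# Shipped with the backend so scoring works out of the box; overridable for experimentation.
DEFAULT_JUDGE_PROMPT = "You are a strict judge evaluating an alert-handling session..."  # placeholder text


class ScoringConfig(BaseModel):
    enabled: bool = False               # disabled by default, per the suggestion above
    llm_provider: Optional[str] = None  # None -> fall back to the system default provider
    judge_prompt: str = DEFAULT_JUDGE_PROMPT  # hashed and stored alongside each score
```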
Actionable comments posted: 3
♻️ Duplicate comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
9-37: Add explicit Problem Statement and Design/Implementation Plan headings per EP guidelines. This EP-*.md starts with Overview and then jumps into configuration/API/implementation details. Per the EP learnings for
/docs/enhancements/pending/EP-*.md, each EP should explicitly include:
- A concise Problem Statement section near the top (what needs to be solved).
- A distinct Design section (how the solution works).
- A clear Implementation Plan section (phased tasks).
You already have good content for all three (Overview, Implementation Tasks, Implementation Phases). Consider:
- Adding a short
## Problem Statementimmediately after the metadata, summarizing the lack of systematic session-quality scoring.- Introducing a
## Designheading before “Configuration Syntax” and treating Configuration/API Usage/Scoring Output Schema/Use Cases as its subsections.- Renaming “## Implementation Phases” to
## Implementation Plan(keeping phases as subheadings).This keeps the doc aligned with the standard EP structure and improves AI consumption.
(prior review requested similar structure; this refines it further based on current content. As per coding guidelines, EP-*.md should expose these sections explicitly.)
🧹 Nitpick comments (2)
docs/enhancements/pending/tmp.md (2)
62-69: Format prototype URL as a markdown link to satisfy MD034

The prototype reference currently uses a bare URL, which markdownlint flags (MD034). Consider changing it to an inline link for consistency:
-**Note**: Initial prototype exists at https://github.com/metlos/TARSy-response-score (Go implementation). This EP proposes native integration into TARSy backend.
+**Note**: Initial prototype exists at [metlos/TARSy-response-score](https://github.com/metlos/TARSy-response-score) (Go implementation). This EP proposes native integration into TARSy backend.
91-147: Add language identifiers to fenced blocks to comply with MD040

The narrative "scenario" blocks under Use Cases (e.g. the "Manual Quality Review" scenario starting "Operator: Reviews completed security investigation ...") and the Example Prompt Flow fence use bare ``` fences without a language, which triggers MD040 and reduces readability. Suggest annotating them as plain text by changing the bare fences to fences with a `text` language tag, and doing the same for the other use-case blocks and the "Example Prompt Flow" snippet. Also applies to: 558-566.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
docs/enhancements/pending/tmp.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
docs/enhancements/pending/tmp.md
🪛 LanguageTool
docs/enhancements/pending/tmp.md
[grammar] ~950-~950: Ensure spelling is correct
Context: ...on (LLM-dependent) - Score retrieval: < 100ms (database query) - Support for concurre...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/tmp.md
62-62: Bare URL used
(MD034, no-bare-urls)
92-92: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
108-108: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
116-116: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
124-124: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
134-134: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
142-142: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
558-558: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
docs/enhancements/pending/tmp.md
Outdated
## Security

### Authentication & Authorization

**Authentication**: Handled at infrastructure level by oauth2-proxy (reverse proxy)
- Users authenticated via OAuth (GitHub, Google, etc.) or JWT tokens
- Application trusts `X-Forwarded-User` and `X-Forwarded-Email` headers from oauth2-proxy
- No application-level authentication middleware required

**Authorization**: Not implemented at application level
- All authenticated users have full access to all endpoints
- No role-based access control (RBAC) currently exists
- Authorization is a future enhancement (referenced in EP-0004 as "may be added in future phases")

**User Attribution**:
- User identification extracted from oauth2-proxy headers via `extract_author_from_request()` helper
- Used for audit trails and tracking who triggered scoring operations
- Follows same pattern as alert submission (see `alert_controller.py`)

### Security Controls

- Input validation on all API endpoints (session_id format, request body schema)
- Rate limiting on scoring endpoints to prevent LLM API abuse (implemented at oauth2-proxy/infrastructure level)
- Judge LLM prompts sanitized to prevent prompt injection
- Score data treated with same sensitivity as session data (no additional PII)
- HTTPS/WSS enforced for all communication (infrastructure level)
Clarify LLM API key and secret management in Security section
The Security section covers authn/z and general controls but does not specify how LLM provider credentials (API keys) are stored, rotated, and protected in logs/errors. For production use, the EP should explicitly require:
- Storing LLM API keys in a managed secrets store (Vault/Secrets Manager/Kubernetes secrets) with least-privilege access.
- Supporting key rotation (short-lived tokens or documented rotation process).
- Redacting keys and full prompts from logs and error messages (log only sanitized/truncated context and non-sensitive metadata).
Adding a short “LLM Credentials & Secrets” subsection here will close this design gap and align security posture with the rest of the system.
734aac2 to 835d202 (Compare)
Actionable comments posted: 3
♻️ Duplicate comments (2)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (2)
147-150: Add 404 error response documentation to POST endpoint for consistency.

The POST `/api/v1/scoring/sessions/{session_id}/score` endpoint omits `404 Not Found` from its Error Responses, while the GET endpoint (lines 160–163) includes it. For API contract consistency, add:
**Error Responses:**
+* `404 Not Found` - Session not found
* `400 Bad Request` - Session not completed or invalid state
* `500 Internal Server Error` - LLM API failure or database error
39-607: Reorganize into explicit "Design" and "Implementation Plan" sections per EP structure.

The document mixes design and implementation details without formal section headers. Per learnings, EP documents must have distinct sections: Problem Statement → Goals → Design → Implementation Plan → Success Criteria → Status & Tracking.
Current issue: Design details (Configuration, API Usage, Output Schema) are scattered, and implementation roadmap (Tasks + Phases) lacks a unified section header.
Recommended restructuring:
After "Problem Statement" (new), add section: "Goals" (extract/synthesize from Overview, lines 13–23)
Add new "Design" section (after Goals, consolidates lines 39–209):
- Move "Configuration Syntax" (39–86) → subsection
- Move "API Usage" (89–164) → subsection
- Move "Scoring Output Schema" (167–195) → subsection
- Move "Use Cases" (199–209) → subsection
- Add new subsections for: Security (with secret management strategy), Performance, Error Handling, Monitoring, Key Design Decisions
Add new "Implementation Plan" section (consolidates lines 212–607):
- Move "Implementation Tasks" (212–416) → Phase-by-phase breakdown with task details
- Integrate "Implementation Phases" (431–607) into the Implementation Plan structure
- Each phase should list: goals, dependencies, task checklist, deliverables, milestones
- Add explicit owners/teams if applicable
Rename "Next Steps" section (609–617) → "Status & Tracking" with metadata: last updated, current phase, blockers, next milestone
Optional: Add "Success Criteria" section (if not already defined in Overview)
This restructuring ensures: Design captures what and how (system architecture, API contracts, configuration); Implementation Plan captures phased execution (tasks, dependencies, deliverables, milestones).
🧹 Nitpick comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
385-428: Clarify file path intent and add configuration option migration note.

The "Files to Modify" section (lines 385–428) lists paths for new files that will be created during implementation. This is appropriate for a design document. However, note one important detail for Phase 2:
Per learnings and past review comments, there was discussion about whether to hardcode the judge prompt in code or keep it configurable. The current design assumes `config/scoring_config.yaml` with a user-configurable judge prompt (line 402). This is mentioned in past comments where the preference was stated: "keep the judge prompt configurable so people can play with it locally and compare scoring across different prompts."
Confirm during implementation that:
- Default prompt is shipped with the application (not requiring users to provide one to get started)
- Configuration path supports override: `config/scoring_config.yaml` is optional; fall back to a hardcoded/bundled default if missing
- Prompt versioning is tracked: Judge prompts in DB enable historical comparison across scoring iterations
Consider adding a note to the "Configuration Loading" section (Phase 2, lines 454–471) clarifying this intent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
35-35: Fix fenced code blocks to include language identifiers.

The MD040 rule requires fenced code blocks to have a language specified. The file has bare closing fences (without language identifiers) at lines 58, 71, 105, 143, 193, and 893. Add appropriate language identifiers to each fence.
Note: Line 35 is already compliant—bare URLs wrapped in angle brackets satisfy MD034 requirements and do not need to be converted to markdown link syntax.
Likely an incorrect or invalid review comment.
# EP-0028 Phase 1: Alert Session Scoring API

**Status:** Pending
**Created:** 2025-12-05
**Phase:** Design

---

## Overview

Add systematic quality assessment for TARSy alert analysis sessions through an LLM-based judge that critically evaluates investigation methodology and provides actionable feedback.

**Core Capabilities:**

1. **LLM-Based Scoring**: Judge LLM evaluates completed sessions using rigorous scoring framework
2. **Critical Evaluation**: Methodology-focused assessment with strict standards (4 dimensions: logical flow, consistency, tool relevance, synthesis quality)
3. **Actionable Feedback**:
   * Identify missing MCP tools that should have been used
   * Suggest alternative investigation approaches with detailed steps
   * Provide improvement recommendations
4. **Criteria Evolution**: Content-addressed configuration (SHA256 hash) tracks scoring criteria changes over time
5. **Quality Metrics**: Quantitative scores (0-100) enable tracking improvements and identifying patterns
6. **Non-Intrusive**: Post-session scoring with zero impact on alert processing
7. **Manual Control**: Operator-triggered scoring only (Phase 1 scope)

**Primary Use Cases:**

* Agent development feedback and improvement tracking
* MCP server prioritization based on gap analysis
* Training data curation for high-quality sessions
* Investigation methodology pattern discovery

**Scope:** On-demand scoring API for individual sessions via REST endpoints and basic UI integration. Scheduled batch scoring, analytics aggregation and reporting are left for the future.

**POC Reference:** Initial prototype at <https://github.com/metlos/tarsy-response-score> proved judge LLM provides critical scoring even for self-produced analyses.

---

## Configuration Syntax

**Configuration File:** `config/scoring_config.yaml`

```yaml
# Scoring service settings (supports template variable substitution)
scoring:
  enabled: ${SCORING_ENABLED:-true}
  llm_provider: ${SCORING_LLM_PROVIDER:-${DEFAULT_LLM_PROVIDER}}
  llm_model: ${SCORING_LLM_MODEL:-}  # Empty = use provider default

  # Judge prompt template with placeholders
  # The entire configuration (including this prompt) is hashed to track criteria evolution.
  judge_prompt: |
    A prompt explaining how to perform the score. This prompt includes placeholders
    to be replaced with the actual data from the session. See below for more details.

    The prompt used with good success in the POC is included in the "attachment"
    section at the end of this EP.
```

**Environment Variables** (optional overrides):

```bash
# Enable/disable scoring (default: true)
SCORING_ENABLED=true

# LLM provider for scoring (default: falls back to DEFAULT_LLM_PROVIDER)
SCORING_LLM_PROVIDER=google-default

# Model for scoring (default: provider-specific default)
SCORING_LLM_MODEL=gemini-2.0-flash-exp
```

**Placeholder System:**

The judge prompt supports three placeholders that are replaced at runtime:

* `{{SESSION_CONVERSATION}}` - Complete conversation from History Service (includes all MCP tool usage)
* `{{ALERT_DATA}}` - Original alert data for reference
* `{{OUTPUT_SCHEMA}}` - JSON schema specification (automatically injected from `JudgeOutputSchema` class)
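To make the placeholder mechanism concrete, here is a minimal sketch of how the runtime substitution could look. It is illustrative only: the function name `build_judge_prompt` and its arguments are hypothetical, and the real implementation would live in the scoring service described later in this EP.

```python
import json

# Hypothetical helper; a sketch of the placeholder substitution described above,
# not code that exists in the TARSy codebase.
def build_judge_prompt(template: str, session_conversation: str,
                       alert_data: dict, output_schema: dict) -> str:
    """Replace the three documented placeholders in the judge prompt template."""
    return (
        template
        .replace("{{SESSION_CONVERSATION}}", session_conversation)
        .replace("{{ALERT_DATA}}", json.dumps(alert_data, indent=2))
        .replace("{{OUTPUT_SCHEMA}}", json.dumps(output_schema, indent=2))
    )
```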
**Design Rationale:**

* **Separation of Concerns**: Configuration defines how to *evaluate* (criteria), code defines how to *report* it (output schema)
* **Parsing Reliability**: Output format guaranteed to match code expectations
* **Criteria Evolution**: Full prompt is hashed (SHA256) to track changes; output schema changes require code updates
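A minimal sketch of the content-addressing idea follows. It assumes the fully resolved configuration is available as a Python dict; the canonical JSON serialization used here is one possible choice for achieving deterministic hashing, not a decision this EP has made.

```python
import hashlib
import json

def compute_criteria_hash(resolved_config: dict) -> str:
    """Hash the fully resolved scoring configuration (criteria versioning).

    Serializing with sorted keys keeps the hash deterministic regardless of
    key ordering in the YAML file. Illustrative sketch only.
    """
    canonical = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```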
---

## API Usage

**Note:** Phase 6 provides dashboard UI for these endpoints with visual score display and manual triggering.

### Score Session

**Endpoint:** `POST /api/v1/scoring/sessions/{session_id}/score`

**Purpose:** Trigger scoring for a specific session (operator-initiated or UI-triggered)

**Request Body:**

```json
{
  "force_rescore": false  // optional: re-score even if score exists
}
```

**Response (200 OK):**

```json
{
  "score_id": "550e8400-e29b-41d4-a716-446655440000",
  "session_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "criteria_hash": "a3f5b2c1d4e7f9a1b3c5d7e9f1a3b5c7d9e1f3a5b7c9d1e3f5a7b9c1d3e5f7a9",
  "total_score": 67,
  "score_breakdown": {
    "logical_flow": 15,
    "consistency": 18,
    "tool_relevance": 14,
    "synthesis_quality": 20
  },
  "score_reasoning": "The investigation demonstrates adequate methodology with notable gaps...",
  "missing_tools": [
    {
      "tool_name": "list-processes-in-pod",
      "rationale": "Would have confirmed whether the suspicious binary was actively running..."
    }
  ],
  "alternative_approaches": [
    {
      "name": "File-First Forensic Approach",
      "description": "Systematically enumerate all suspicious files before attempting execution verification",
      "steps": [
        "Use list-files to discover all files in suspicious directory",
        "Read content of each file to identify actual malware vs false positives",
        "Correlate file timestamps with alert time window"
      ]
    }
  ],
  "scored_triggered_by": "alice@example.com",
  "scored_at": "2025-12-05T10:30:00Z",
  "is_current_criteria": true
}
```

**Note:** The `score_breakdown` structure is flexible and depends on scoring criteria. Structure may evolve as criteria change.

**Error Responses:**

* `400 Bad Request` - Session not completed or invalid state
* `500 Internal Server Error` - LLM API failure or database error

### Get Session Score

**Endpoint:** `GET /api/v1/scoring/sessions/{session_id}/score`

**Purpose:** Retrieve existing score for a session

**Response:** Same as Score Session endpoint

**Error Responses:**

* `404 Not Found` - Session not found or not yet scored
* `500 Internal Server Error` - LLM API failure or database error
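For illustration, a client could drive these two endpoints roughly as sketched below. This is not shipped tooling: the base URL and the choice of `httpx` are assumptions, and authentication is normally handled by oauth2-proxy in front of the backend rather than by the client itself.

```python
import httpx

BASE_URL = "http://localhost:8000"  # assumption: local TARSy backend

def score_session(session_id: str, force_rescore: bool = False) -> dict:
    """Trigger scoring for a completed session."""
    resp = httpx.post(
        f"{BASE_URL}/api/v1/scoring/sessions/{session_id}/score",
        json={"force_rescore": force_rescore},
        timeout=120.0,  # scoring waits on a judge LLM call
    )
    resp.raise_for_status()
    return resp.json()

def get_session_score(session_id: str) -> dict:
    """Fetch an existing score; raises on 404 if the session is unscored."""
    resp = httpx.get(f"{BASE_URL}/api/v1/scoring/sessions/{session_id}/score")
    resp.raise_for_status()
    return resp.json()
```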
---

## Scoring Output Schema

The judge LLM must return JSON matching this structure (defined in code as `JudgeOutputSchema`):

```python
class MissingTool(BaseModel):
    """Missing tool that should have been used in the investigation."""
    tool_name: str = Field(..., description="Name of the missing MCP tool")
    rationale: str = Field(..., description="Explanation of why this tool was needed")

class AlternativeApproach(BaseModel):
    """Alternative investigation approach that could have been more efficient/rigorous."""
    name: str = Field(..., description="Brief descriptive title of the approach")
    description: str = Field(..., description="1-2 sentences explaining the strategy")
    steps: List[str] = Field(..., description="Ordered list of specific actions")

class JudgeOutputSchema(BaseModel):
    """
    Output schema for judge LLM - defined in code for parsing reliability.
    Injected into judge prompt at {{OUTPUT_SCHEMA}} placeholder.
    """
    total_score: int = Field(..., ge=0, le=100, description="Overall score 0-100")
    score_breakdown: Dict[str, Any] = Field(..., description="Flexible dimension scores (e.g., {'logical_flow': 18, ...})")
    score_reasoning: str = Field(..., description="Detailed explanation (200+ words recommended)")
    missing_tools: List[MissingTool] = Field(default_factory=list, description="Tools that should have been used")
    alternative_approaches: List[AlternativeApproach] = Field(default_factory=list, description="Alternative investigation approaches")
```

**Note:** Only `total_score` is required. The `score_breakdown` dimensions can evolve as criteria change (stored as JSONB for flexibility). The nested models (`MissingTool`, `AlternativeApproach`) ensure type safety and clean database mapping.
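Two touchpoints of this schema are worth spelling out: injecting it into the prompt at `{{OUTPUT_SCHEMA}}` and validating the judge's reply. The sketch below uses standard Pydantic v2 calls; the import path is the module this EP proposes (it does not exist yet), and the raw LLM text is assumed to already be stripped of any markdown fences.

```python
from pydantic import ValidationError

# Proposed module from this EP (backend/tarsy/models/scoring_models.py); not yet in the codebase.
from tarsy.models.scoring_models import JudgeOutputSchema

# 1. Inject the JSON schema into the judge prompt at {{OUTPUT_SCHEMA}}.
output_schema_json = JudgeOutputSchema.model_json_schema()

# 2. Parse and validate the judge LLM's raw JSON reply.
def parse_judge_response(raw_text: str) -> JudgeOutputSchema:
    try:
        return JudgeOutputSchema.model_validate_json(raw_text)
    except ValidationError as exc:
        # Matches the EP's error handling: log the raw response and surface a 500.
        raise ValueError(f"Judge LLM returned invalid output: {exc}") from exc
```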
---

## Use Cases

**Primary Use Cases:**

* **Manual Quality Review**: Operators score completed investigations to understand quality and identify improvement areas
* **Agent Development Feedback**: Developers quantify agent improvements after code changes and identify specific gaps (e.g., "missing file-read tools")
* **MCP Server Prioritization**: Team analyzes `score_missing_tools` frequency to make data-driven decisions on which tools to develop next
* **Criteria Evolution Tracking**: Operators re-score historical sessions with new criteria (different `criteria_hash`) to validate criteria changes
* **Investigation Methodology Improvement**: Operators review alternative approaches from scores to learn investigation patterns
* **Training Data Curation**: ML engineers identify high-quality sessions (total_score >= 85) for fine-tuning datasets (see the query sketch below)
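As a rough illustration of how the normalized tables support the last two use cases, queries along these lines could pull high-scoring sessions and the most frequently missing tools. The model names follow the schema proposed later in this EP and do not exist yet; treat this as a sketch, not repository code.

```python
from sqlalchemy import func, select

# Proposed models from this EP (backend/tarsy/models/scoring_models.py); not yet implemented.
from tarsy.models.scoring_models import MissingToolDB, SessionScoreDB

def curation_and_gap_queries(session):
    # Sessions good enough for fine-tuning datasets (total_score >= 85).
    high_quality_session_ids = session.execute(
        select(SessionScoreDB.session_id).where(SessionScoreDB.total_score >= 85)
    ).scalars().all()

    # Most frequently reported missing tools, for MCP server prioritization.
    missing_tool_counts = session.execute(
        select(MissingToolDB.tool_name, func.count().label("occurrences"))
        .group_by(MissingToolDB.tool_name)
        .order_by(func.count().desc())
    ).all()

    return high_quality_session_ids, missing_tool_counts
```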
---

## Implementation Tasks

### Task 1: Database Schema

**Tables to Create:** (a partial SQLModel sketch follows this section)

1. **`scoring_criteria_definitions`**
   * Primary key: `criteria_hash` (VARCHAR 64, SHA256 of full config)
   * `criteria_content` (JSONB) - Complete scoring configuration
   * `created_at` (TIMESTAMP)
   * Index on `created_at`

2. **`session_scores`**
   * Primary key: `score_id` (UUID)
   * Unique constraint: `session_id`
   * Foreign keys: `session_id` → alert_sessions, `criteria_hash` → scoring_criteria_definitions
   * `total_score` (INTEGER, 0-100)
   * `score_breakdown` (JSONB) - Flexible dimension scores
   * `score_reasoning` (TEXT)
   * `scored_triggered_by` (VARCHAR 255) - User who triggered scoring (from X-Forwarded-User header)
   * `scored_at` (TIMESTAMP)
   * Indexes on: `session_id`, `criteria_hash`, `total_score`, `scored_at`

3. **`score_missing_tools`**
   * Primary key: `id` (UUID)
   * Foreign key: `score_id` → session_scores
   * `tool_name` (VARCHAR 255)
   * `rationale` (TEXT)
   * Indexes on: `score_id`, `tool_name`

4. **`score_alternative_approaches`**
   * Primary key: `id` (UUID)
   * Foreign key: `score_id` → session_scores
   * `name` (VARCHAR 255)
   * `description` (TEXT)
   * Indexes on: `score_id`, `name`

5. **`score_alternative_approach_steps`**
   * Primary key: `id` (UUID)
   * Foreign key: `approach_id` → score_alternative_approaches
   * `step_order` (INTEGER)
   * `step_description` (TEXT)
   * Index on: `approach_id`

**Alembic Migration:**

* Create forward migration with all tables and indexes
* Create rollback migration (drop tables in reverse dependency order)
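To make the intended shape concrete, a partial SQLModel sketch of two of these tables is shown below. Field names mirror the lists above; the referenced column name on `alert_sessions`, the exact column types, and the relationship configuration are assumptions left to the implementation.

```python
from datetime import datetime
from uuid import uuid4

from sqlalchemy import JSON, Column
from sqlmodel import Field, SQLModel

class SessionScoreDB(SQLModel, table=True):
    """Sketch of the session_scores table (not final implementation)."""
    __tablename__ = "session_scores"

    score_id: str = Field(default_factory=lambda: str(uuid4()), primary_key=True)
    session_id: str = Field(unique=True, index=True,
                            foreign_key="alert_sessions.session_id")  # assumed column name
    criteria_hash: str = Field(index=True,
                               foreign_key="scoring_criteria_definitions.criteria_hash")
    total_score: int = Field(ge=0, le=100, index=True)
    score_breakdown: dict = Field(sa_column=Column(JSON))  # JSONB on PostgreSQL
    score_reasoning: str
    scored_triggered_by: str = Field(max_length=255)
    scored_at: datetime = Field(index=True)

class MissingToolDB(SQLModel, table=True):
    """Sketch of the score_missing_tools table."""
    __tablename__ = "score_missing_tools"

    id: str = Field(default_factory=lambda: str(uuid4()), primary_key=True)
    score_id: str = Field(foreign_key="session_scores.score_id", index=True)
    tool_name: str = Field(max_length=255, index=True)
    rationale: str
```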
### Task 2: Data Models

**Database Models (SQLModel):**

* `ScoringCriteriaDefinitionDB` - Criteria storage
* `SessionScoreDB` - Main score record with relationships
* `MissingToolDB` - Missing tool entries
* `AlternativeApproachDB` - Alternative approach entries
* `AlternativeApproachStepDB` - Approach step entries

**API Models (Pydantic):**

* `SessionScore` - API response with nested objects
* `MissingTool` - Missing tool representation
* `AlternativeApproach` - Alternative approach with steps list
* `ScoringCriteriaDefinition` - Criteria definition response
* `JudgeOutputSchema` - For parsing LLM JSON responses

**Repository Layer:** (see the conversion sketch after this list)

* Convert database models (with relationships) to API models (with nested objects)
* Handle eager loading for efficiency
* Map `is_current_criteria` by comparing hash to current config
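The DB-to-API conversion could look roughly like this. It assumes the DB model exposes `missing_tools`, `alternative_approaches`, and `steps` relationships (eagerly loaded), that the API model fields mirror the response JSON shown earlier, and that the current criteria hash is supplied by the configuration layer; the helper name is illustrative.

```python
# Proposed models from this EP; none of these exist in the codebase yet.
from tarsy.models.scoring_models import (
    AlternativeApproach,
    MissingTool,
    SessionScore,
    SessionScoreDB,
)

def to_api_model(db_score: SessionScoreDB, current_criteria_hash: str) -> SessionScore:
    """Convert a SessionScoreDB row (with loaded relationships) into the API model."""
    return SessionScore(
        score_id=db_score.score_id,
        session_id=db_score.session_id,
        criteria_hash=db_score.criteria_hash,
        total_score=db_score.total_score,
        score_breakdown=db_score.score_breakdown,
        score_reasoning=db_score.score_reasoning,
        missing_tools=[
            MissingTool(tool_name=t.tool_name, rationale=t.rationale)
            for t in db_score.missing_tools
        ],
        alternative_approaches=[
            AlternativeApproach(
                name=a.name,
                description=a.description,
                steps=[s.step_description
                       for s in sorted(a.steps, key=lambda s: s.step_order)],
            )
            for a in db_score.alternative_approaches
        ],
        scored_triggered_by=db_score.scored_triggered_by,
        scored_at=db_score.scored_at,
        is_current_criteria=(db_score.criteria_hash == current_criteria_hash),
    )
```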
### Task 3: Scoring Service

**Core Logic (`backend/tarsy/services/scoring_service.py`):**

1. **Configuration Loading**:
   * Load `config/scoring_config.yaml` with template variable substitution
   * Compute SHA256 hash of full configuration (deterministic criteria versioning)
   * Store criteria definition in database if not exists

2. **Session Data Retrieval**:
   * Call History Service to get complete session conversation
   * Equivalent to `/final-analysis` endpoint data (~25k tokens)

3. **Judge Prompt Construction**:
   * Replace `{{SESSION_CONVERSATION}}` with full conversation
   * Replace `{{ALERT_DATA}}` with original alert
   * Replace `{{OUTPUT_SCHEMA}}` with JSON schema from `JudgeOutputSchema`

4. **LLM Integration**:
   * Send prompt to judge LLM (configurable provider/model)
   * Retry with exponential backoff (max 3 retries: 1s, 2s, 4s; see the sketch after this list)
   * Circuit breaker: open after 5 consecutive failures

5. **Response Parsing**:
   * Parse JSON response into `JudgeOutputSchema`
   * Validate total_score, score_breakdown, missing_tools, alternative_approaches

6. **Database Storage**:
   * Insert `session_scores` record
   * Insert `score_missing_tools` records (bulk)
   * Insert `score_alternative_approaches` and `score_alternative_approach_steps` (bulk)
   * Return populated `SessionScore` API model

7. **Error Handling**:
   * Invalid JSON: Log raw response, return 500 with details
   * Database failures: Retry once, then return 500
   * Missing session: Return 404
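The retry behaviour in step 4 could be sketched as follows. The delays match the EP (1s, 2s, 4s); the circuit-breaker state handling is omitted here, and `call_judge_llm` is a placeholder for the real LLM client call, so treat this purely as an illustration.

```python
import asyncio

MAX_RETRIES = 3
BACKOFF_SECONDS = [1, 2, 4]

async def call_with_retry(call_judge_llm, prompt: str) -> str:
    """Retry the judge LLM call with exponential backoff (sketch only)."""
    last_error: Exception | None = None
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus three retries
        try:
            return await call_judge_llm(prompt)
        except Exception as exc:  # real code would catch provider-specific errors
            last_error = exc
            if attempt < MAX_RETRIES:
                await asyncio.sleep(BACKOFF_SECONDS[attempt])
    raise RuntimeError("Judge LLM call failed after retries") from last_error
```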
### Task 4: API Controller

**Controller (`backend/tarsy/controllers/scoring_controller.py`):**

* `POST /api/v1/scoring/sessions/{session_id}/score`
  * Check if score exists (unless `force_rescore=true`)
  * Execute scoring via background task (FastAPI BackgroundTasks)
  * Return score immediately if exists, or start scoring and poll/wait

* `GET /api/v1/scoring/sessions/{session_id}/score`
  * Retrieve score from repository
  * Compute `is_current_criteria` by comparing hash
  * Return 404 if not scored

**User Attribution:**

* Extract user from `X-Forwarded-User` header (oauth2-proxy)
* Store in `scored_triggered_by` field of session_scores table for audit trail (see the controller sketch below)
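A skeleton of the POST endpoint, including the oauth2-proxy header extraction and background execution, might look like this. Only the header name and the endpoint path come from this EP; the router prefix, the dependency stub, and the `get_existing_score`/`score_session` service methods are assumptions, and the EP leaves open whether the endpoint waits for the result or returns immediately.

```python
from fastapi import APIRouter, BackgroundTasks, Depends, Request
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/scoring", tags=["scoring"])

class ScoreRequest(BaseModel):
    force_rescore: bool = False

def extract_author_from_request(request: Request) -> str:
    """User attribution via oauth2-proxy headers (same pattern as alert submission)."""
    return request.headers.get("X-Forwarded-User", "unknown")

def get_scoring_service():
    """Dependency stub; real wiring would return the scoring service instance."""
    raise NotImplementedError

@router.post("/sessions/{session_id}/score")
async def score_session(
    session_id: str,
    payload: ScoreRequest,
    request: Request,
    background_tasks: BackgroundTasks,
    scoring_service=Depends(get_scoring_service),
):
    triggered_by = extract_author_from_request(request)
    existing = await scoring_service.get_existing_score(session_id)  # assumed method
    if existing and not payload.force_rescore:
        return existing
    # Run the slow judge-LLM scoring outside the request/response cycle.
    background_tasks.add_task(scoring_service.score_session, session_id, triggered_by)
    return {"status": "scoring_started", "session_id": session_id}
```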
### Task 5: Configuration Loading

**Config Service Updates:**

* Load `scoring_config.yaml` similar to `agents.yaml` and `llm_providers.yaml`
* Template variable substitution (e.g., `${SCORING_LLM_PROVIDER:-${DEFAULT_LLM_PROVIDER}}`)
* SHA256 hashing logic: Hash full resolved YAML content for deterministic `criteria_hash`
* Store criteria definition on first use (idempotent)
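The `${VAR:-default}` syntax could be resolved with a small helper along these lines before hashing; nested defaults such as `${SCORING_LLM_PROVIDER:-${DEFAULT_LLM_PROVIDER}}` are handled by resolving the innermost placeholders first and repeating until nothing changes. This mirrors the existing agents.yaml pattern only in spirit; the actual implementation may differ.

```python
import os
import re

# ${VAR} or ${VAR:-default}; the default may not itself contain $, {, or },
# which forces innermost-first resolution of nested placeholders.
_VAR_PATTERN = re.compile(r"\$\{([A-Z0-9_]+)(?::-([^${}]*))?\}")

def substitute_template_vars(text: str) -> str:
    """Resolve ${VAR} / ${VAR:-default} placeholders against the environment (sketch)."""
    def _replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2) or ""
        return os.environ.get(name, default)

    previous = None
    while previous != text:  # repeat so nested defaults get resolved too
        previous = text
        text = _VAR_PATTERN.sub(_replace, text)
    return text
```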
### Task 6: Testing

**Unit Tests:**

* Model validation (JudgeOutputSchema, SessionScore)
* Configuration hashing (deterministic results)
* Prompt placeholder substitution
* JSON response parsing (valid and invalid cases)
* Repository conversion (DB models → API models)

**Integration Tests:**

* Full scoring flow: session retrieval → LLM call → database storage
* Mock LLM responses for consistency
* Re-scoring with `force_rescore=true`
* Error handling: missing session, invalid JSON, database failures

**Test Coverage Target:** 80% minimum for new components, 100% for critical paths (scoring logic, database operations)

---
## Key Design Decisions

1. **Content-Addressed Criteria**: SHA256 hash of full configuration eliminates manual version management and provides automatic obsolescence detection
2. **JSONB Flexibility vs Normalized Feedback**: Score dimensions in flexible JSONB (criteria evolve), but missing tools and alternatives in normalized tables (enables analytics)
3. **Output Schema in Code, Criteria in Config**: Separation of concerns - config defines *what* to evaluate, code defines *how* to report it; ensures parsing reliability
4. **Placeholder System**: Enables optimal prompt engineering (preamble → session data → refocusing instructions → schema)
5. **Two-Layer Model Architecture**: Database models (SQLModel with relationships) → Repository conversion → API models (Pydantic with nested objects)
6. **Non-Intrusive Operation**: Post-session scoring, zero impact on alert processing performance
7. **Background Task Execution**: Scoring runs asynchronously to avoid blocking API responses
8. **Manual Control**: Phase 1 is operator-triggered only; automation deferred to future phases for cost control

---
## Files to Modify

### Backend Models

* **`backend/tarsy/models/scoring_models.py`** (new) - All database and API models

### Backend Services & Repositories

* **`backend/tarsy/services/scoring_service.py`** (new) - Core scoring logic
* **`backend/tarsy/repositories/scoring_repository.py`** (new) - Database operations

### Backend Controllers

* **`backend/tarsy/controllers/scoring_controller.py`** (new) - REST API endpoints

### Configuration

* **`config/scoring_config.yaml`** (new) - Scoring configuration with judge prompt

### Database

* **`backend/alembic/versions/XXXX_add_scoring_tables.py`** (new) - Database migration

### Main Application

* **`backend/tarsy/main.py`** - Register scoring routes

### Backend Tests

* **`backend/tests/unit/test_scoring_service.py`** (new) - Unit tests
* **`backend/tests/integration/test_scoring_api.py`** (new) - Integration tests

### Dashboard Components (Phase 6)

* **`dashboard/src/components/scoring/ScoreDetailView.tsx`** (new) - Main score view
* **`dashboard/src/components/scoring/ScoreBreakdownCard.tsx`** (new) - Score breakdown display
* **`dashboard/src/components/scoring/MissingToolsCard.tsx`** (new) - Missing tools list
* **`dashboard/src/components/scoring/AlternativeApproachesCard.tsx`** (new) - Alternative approaches
* **`dashboard/src/components/scoring/ScoreBadge.tsx`** (new) - Reusable color-coded badge
* **`dashboard/src/components/HistoricalAlertsList.tsx`** - Add Score column
* **`dashboard/src/components/SessionDetailPageBase.tsx`** - Add Score toggle
* **`dashboard/src/types/`** - Add score-related TypeScript interfaces
* **`dashboard/src/api/`** - Add scoring API client functions

---
## Implementation Phases

### Phase 1: Database Schema & Models

**Goal:** Establish data structures and persistence layer

* [ ] Create Alembic migration for all scoring tables (criteria_definitions, session_scores, missing_tools, alternative_approaches, approach_steps)
* [ ] Implement database models (SQLModel): ScoringCriteriaDefinitionDB, SessionScoreDB, MissingToolDB, AlternativeApproachDB, AlternativeApproachStepDB
* [ ] Implement API models (Pydantic): SessionScore, MissingTool, AlternativeApproach, ScoringCriteriaDefinition, JudgeOutputSchema
* [ ] Create repository layer with DB → API model conversion logic
* [ ] Test database schema and model conversions

**Dependencies:** None (foundational work)

**Deliverables:**

* Database migration script
* Complete data model definitions
* Repository with conversion logic
* Unit tests for models

---

### Phase 2: Configuration & Hashing

**Goal:** Implement configuration loading and criteria versioning

* [ ] Implement config loading from `scoring_config.yaml`
* [ ] Add template variable substitution (follows existing pattern from `agents.yaml`)
* [ ] Implement SHA256 hashing logic for full configuration content
* [ ] Add criteria definition storage (idempotent insert)
* [ ] Test configuration loading and hash determinism

**Dependencies:** Phase 1 (needs ScoringCriteriaDefinitionDB model)

**Deliverables:**

* Configuration loading service
* Hashing implementation with tests
* Criteria storage logic

---

### Phase 3: Scoring Service & LLM Integration

**Goal:** Implement core scoring logic and judge LLM integration

* [ ] Implement session data retrieval from History Service
* [ ] Build judge prompt construction with placeholder substitution
* [ ] Integrate LLM client for judge calls
* [ ] Implement JSON response parsing (JudgeOutputSchema)
* [ ] Add retry logic with exponential backoff (1s, 2s, 4s)
* [ ] Implement circuit breaker pattern (open after 5 failures)
* [ ] Implement database storage logic (scores, missing tools, alternatives)
* [ ] Test end-to-end scoring flow with mocked LLM responses

**Dependencies:** Phases 1-2 (needs models and configuration)

**Deliverables:**

* Complete scoring service
* LLM integration with resilience patterns
* Integration tests with mocked LLM

---
### Phase 4: API Endpoints

**Goal:** Expose scoring functionality via REST API

* [ ] Implement scoring controller (`backend/tarsy/controllers/scoring_controller.py`)
* [ ] Add POST /score endpoint with background task execution
* [ ] Add GET /score endpoint for retrieval
* [ ] Implement error handling and validation
* [ ] Add user attribution from oauth2-proxy headers
* [ ] Register routes in main.py
* [ ] Test API endpoints (success and error cases)

**Dependencies:** Phase 3 (needs scoring service)

**Deliverables:**

* REST API endpoints
* Error handling and validation
* API integration tests

---

### Phase 5: Testing & Documentation

**Goal:** Comprehensive testing and documentation

* [ ] Complete unit test coverage for all components (target: 80% overall, 100% critical paths)
* [ ] Complete integration tests for full scoring flow
* [ ] Add API documentation (OpenAPI/Swagger via FastAPI)
* [ ] Update CLAUDE.md with scoring system overview (if needed)
* [ ] Create example scoring_config.yaml in repo
* [ ] Document scoring criteria evolution workflow

**Dependencies:** Phases 1-4 (all implementation complete)

**Deliverables:**

* Comprehensive test suite
* API documentation
* User-facing documentation
* Configuration examples

---
### Phase 6: UI Integration

**Goal:** Provide basic scoring visualization in TARSy dashboard

* [ ] Create reusable ScoreBadge component with 5-tier color coding
* [ ] Add Score column to HistoricalAlertsList with click navigation
* [ ] Implement ScoreDetailView with score breakdown, reasoning, missing tools, and alternatives
* [ ] Add Score toggle option to SessionDetailPageBase
* [ ] Implement "Score Session" button with API integration
* [ ] Add TypeScript interfaces for score data models
* [ ] Update dashboard API client with scoring endpoints
* [ ] Test UI with various score ranges and edge cases (unscored sessions, errors)

**Dependencies:** Phases 1-5 (requires backend API endpoints)

**Deliverables:**

* Score visualization in session list
* Comprehensive score detail view
* Manual scoring trigger UI
* TypeScript types and API integration

**UI Features:**

1. **Session List Enhancement:**
   * New "Score" column with color-coded badges
   * Color scheme (matches judge prompt scoring philosophy; see the tier sketch after this list):
     * 0-44: Red (failed investigation)
     * 45-59: Orange (weak investigation)
     * 60-74: Yellow (adequate investigation)
     * 75-89: Light Green (good investigation)
     * 90-100: Dark Green (near-perfect investigation)
   * "Not Scored" badge for sessions without scores
   * Clickable scores navigate to score detail view

2. **Score Detail Page:**
   * New "Score" toggle option (alongside Conversation/Technical)
   * Displays:
     * Alert detail (reuses OriginalAlertCard)
     * Final analysis (reuses FinalAnalysisCard)
     * Score breakdown card (new component)
     * Score reasoning card (new component)
     * Missing tools list (new component)
     * Alternative approaches with steps (new component)

3. **Manual Scoring Trigger:**
   * "Score Session" button in session detail page
   * Calls `POST /api/v1/scoring/sessions/{session_id}/score`
   * Loading state during scoring
   * Auto-refresh after completion
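The badge colour tiers above reduce to a simple threshold lookup; a Python rendition of the same ranges is shown below purely to pin down the boundaries (the dashboard itself would implement this in TypeScript inside ScoreBadge).

```python
def score_tier(total_score: int) -> str:
    """Map a 0-100 total score to the badge tier used in the session list."""
    if total_score >= 90:
        return "dark-green"   # near-perfect investigation
    if total_score >= 75:
        return "light-green"  # good investigation
    if total_score >= 60:
        return "yellow"       # adequate investigation
    if total_score >= 45:
        return "orange"       # weak investigation
    return "red"              # failed investigation
```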
**New Dashboard Components:**

* `dashboard/src/components/scoring/ScoreDetailView.tsx` - Main score view
* `dashboard/src/components/scoring/ScoreBreakdownCard.tsx` - Score breakdown display
* `dashboard/src/components/scoring/MissingToolsCard.tsx` - Missing tools list
* `dashboard/src/components/scoring/AlternativeApproachesCard.tsx` - Alternative approaches
* `dashboard/src/components/scoring/ScoreBadge.tsx` - Reusable color-coded badge

**Modified Dashboard Components:**

* `dashboard/src/components/HistoricalAlertsList.tsx` - Add Score column
* `dashboard/src/components/SessionDetailPageBase.tsx` - Add Score toggle
* `dashboard/src/types/` - Add score-related TypeScript interfaces
* `dashboard/src/api/` - Add scoring API client functions

---

## Next Steps

1. Review and approve Phase 1 design
2. Begin Phase 1 implementation (database schema and models)
3. Iterate through phases with testing at each stage
4. Deploy to staging for user acceptance testing
5. Implement Phase 6 UI integration after backend API is validated
6. Production deployment after validation
| ## Attachment: The Judge Prompt from the POC | ||
|
|
||
| ```yaml | ||
| You are a computer security expert specializing in DevOps and Kubernetes security operations. | ||
|
|
||
| Your role is to evaluate SRE investigations with EXTREME CRITICAL RIGOR. You are a methodology-focused perfectionist who: | ||
| - Demands optimal investigation paths, not just successful outcomes | ||
| - Penalizes ANY logical shortcuts, premature conclusions, or incomplete exploration | ||
| - Holds investigations to the highest professional standards | ||
| - Identifies what SHOULD HAVE been done, not just what WAS done | ||
|
|
||
| ## EVALUATION PHILOSOPHY | ||
|
|
||
| You prefer CRITICISM over PRAISE. Your default stance is skeptical. When an investigation reaches a conclusion: | ||
| 1. First ask: "Was ALL available evidence gathered?" | ||
| 2. Then ask: "Were ALL available tools explored?" | ||
| 3. Then ask: "Does the confidence level match the evidence quality?" | ||
| 4. Finally ask: "What alternative approaches would have been faster or more rigorous?" | ||
|
|
||
| ## SCORING FRAMEWORK | ||
|
|
||
| You will score investigations across 4 categories, each worth 25 points (100 total): | ||
|
|
||
| ### 1. LOGICAL FLOW (0-25 points) | ||
| **What to evaluate:** | ||
| - Did the investigation follow optimal reasoning paths? | ||
| - Were steps sequenced efficiently, or was there trial-and-error waste? | ||
| - Did the agent pivot at the right time, or give up prematurely/persist too long? | ||
| - Are there logical leaps or shortcuts that bypass valuable investigative steps? | ||
|
|
||
| **Deduct heavily for:** | ||
| - Jumping to conclusions without exhausting investigation paths | ||
| - Repeated failed attempts without strategy adjustment | ||
| - Premature abandonment when alternative approaches exist | ||
| - Not using information already available (e.g., timestamp in alert) | ||
| - Trial-and-error guessing instead of systematic discovery | ||
|
|
||
| **Typical score range:** 10-22 points. Award 23+ only for near-flawless investigation flow. | ||
|
|
||
| ### 2. CONSISTENCY (0-25 points) | ||
| **What to evaluate:** | ||
| - Do observations logically support conclusions? | ||
| - Is confidence level justified by evidence gathered? | ||
| - Are there contradictions between stated limitations and claimed certainty? | ||
| - Does the classification match the evidence severity? | ||
|
|
||
| **Deduct heavily for:** | ||
| - HIGH confidence with incomplete evidence gathering | ||
| - Claiming "malicious" without verifying execution vs. mere file presence | ||
| - Contradictions like "pod is terminated" + "zero evidence" → "high confidence MALICIOUS" | ||
| - Over-interpreting weak signals (e.g., dictionary word = software installation) | ||
| - Under-interpreting strong signals (e.g., dismissing repeated failures) | ||
|
|
||
| **Typical score range:** 15-22 points. Award 23+ only for ironclad logical consistency. | ||
|
|
||
| ### 3. TOOL RELEVANCE (0-25 points) | ||
| **What to evaluate:** | ||
| - Were the MOST appropriate tools selected for each investigation phase? | ||
| - Was tool failure handled by trying alternative tools, or by giving up? | ||
| - Were tools used efficiently (right parameters, right sequence)? | ||
|
|
||
| **Deduct heavily for:** | ||
| - Not attempting to access logs, files, processes or other relevant evidence when tools exist | ||
| - Guessing parameters instead of discovering correct values first | ||
| - Not checking historical data when live resources are unavailable | ||
|
|
||
| **Typical score range:** 10-22 points. Award 23+ only if tool selection was optimal AND comprehensive. | ||
|
|
||
| ### 4. SYNTHESIS QUALITY (0-25 points) | ||
| **What to evaluate:** | ||
| - Is the final analysis supported by DIRECT evidence, not just inference? | ||
| - Does the report acknowledge gaps and limitations appropriately? | ||
| - Are recommendations proportional to evidence strength? | ||
| - Does the synthesis integrate ALL gathered data, not just selected pieces? | ||
|
|
||
| **Deduct heavily for:** | ||
| - Conclusions based on circumstantial evidence when direct evidence was accessible | ||
| - Severe recommendations (BAN, MALICIOUS) without verification of actual execution/harm | ||
| - Not acknowledging critical investigation gaps | ||
| - Failing to consider benign alternative explanations | ||
| - Ignoring contradictory evidence | ||
|
|
||
| **Typical score range:** 8-20 points. Award 21+ only for evidence-rich, nuanced synthesis. | ||
|
|
||
| ## CRITICAL EVALUATION CHECKLIST | ||
|
|
||
| For each investigation, systematically check: | ||
|
|
||
| **Evidence Quality:** | ||
| - [ ] Was direct evidence gathered (logs, files, processes) or only metrics? | ||
| - [ ] Were files READ and verified, or only detected by filename? | ||
| - [ ] Were processes INSPECTED, or activity inferred from resource usage? | ||
| - [ ] Was execution CONFIRMED, or assumed from file presence? | ||
|
|
||
| **Tool Completeness:** | ||
| - [ ] Were forensic tools used? (list-files, read-file, grep-files) | ||
| - [ ] Were historical tools attempted? (terminated pod logs, event history) | ||
| - [ ] Were cross-reference tools used? (correlating network + CPU data) | ||
|
|
||
| **Logical Rigor:** | ||
| - [ ] Did each step build on previous findings, or was there random exploration? | ||
| - [ ] Were failed attempts analyzed to inform next steps? | ||
| - [ ] Was the investigation abandoned prematurely when alternatives existed? | ||
| - [ ] Were time windows adjusted appropriately based on alert timestamps? | ||
|
|
||
| **Confidence Calibration:** | ||
| - [ ] Does HIGH confidence have comprehensive verification? | ||
| - [ ] Does MEDIUM confidence acknowledge specific gaps? | ||
| - [ ] Are limitations explicitly stated when evidence is incomplete? | ||
|
|
||
| ## LANGUAGE PATTERNS TO USE | ||
|
|
||
| When critiquing, use these patterns: | ||
|
|
||
| **Identifying problems:** | ||
| - "However, there are significant logical issues..." | ||
| - "This represents a critical logical shortcut because..." | ||
| - "The agent should have immediately..." | ||
| - "A rigorous investigation would have..." | ||
| - "The conclusion jumps to X despite never actually..." | ||
| - "Tool selection is severely inadequate because..." | ||
|
|
||
| **Highlighting missed opportunities:** | ||
| - "The agent failed to use available tools such as..." | ||
| - "After identifying X, the agent should have..." | ||
| - "The agent never attempted to..." | ||
| - "A more logical approach would have been..." | ||
|
|
||
| **Pointing out inconsistencies:** | ||
| - "There's a significant inconsistency: the agent reports X but concludes Y" | ||
| - "The agent paradoxically expresses HIGH confidence despite..." | ||
| - "This contradicts the earlier observation that..." | ||
|
|
||
| **Avoid excessive praise:** | ||
| - Don't use "excellent" or "flawless" unless truly warranted (rare) | ||
| - Replace "good" with "reasonable" or "adequate" | ||
| - Temper positives: "While the tool selection was appropriate initially, however..." | ||
|
|
||
| ## IDENTIFYING MISSING TOOLS | ||
|
|
||
| After reviewing the investigation, identify tools that SHOULD have been used but WEREN'T. | ||
|
|
||
| **What qualifies as a "missing tool"?** | ||
|
|
||
| Include tools that would have: | ||
| - Provided DIRECT evidence instead of circumstantial inference | ||
| - Enabled verification of assumptions that were left unverified | ||
| - Revealed information that was guessed or inferred | ||
| - Made the investigation more efficient or systematic | ||
| - Eliminated ambiguity in the findings | ||
|
|
||
| **DO NOT include:** | ||
| - Nice-to-have tools that wouldn't significantly change the analysis | ||
| - Tools that are redundant with what was already done | ||
| - Tools for information that was already conclusively obtained another way | ||
|
|
||
| **For each missing tool, provide:** | ||
| - **tool_name**: Specific tool that should have been used (e.g., "read-file", "kubectl-logs", "list-processes") | ||
| - **rationale**: Explain what evidence it would have provided and why it was needed. Be specific about what gap this would have filled. | ||
|
|
||
| ## CONSTRUCTING ALTERNATIVE APPROACHES | ||
|
|
||
| Provide 1-3 alternative investigation paths that would have been MORE EFFICIENT or MORE RIGOROUS than what was actually done. | ||
|
|
||
| **What makes a good alternative approach?** | ||
|
|
||
| An alternative approach should: | ||
| - Be more EFFICIENT (fewer steps to reach same or better conclusion) | ||
| - Be more RIGOROUS (gather more direct evidence) | ||
| - Address specific gaps you identified in the actual investigation | ||
| - Be realistic given the available tools shown in the session | ||
|
|
||
| **Structure each alternative as:** | ||
| - **name**: Brief, descriptive title (e.g., "File-First Forensic Approach", "Systematic Resource Analysis") | ||
| - **description**: 1-2 sentences explaining the strategy and why it's better than what was done | ||
| - **steps**: Ordered list of specific actions with tool names and expected outcomes | ||
|
|
||
| **Guidelines for steps:** | ||
| - Each step should be specific and actionable | ||
| - Include actual tool names you saw available in the session (not generic "check logs") | ||
| - Show logical progression: discovery → verification → conclusion | ||
| - Demonstrate how this avoids the problems you criticized | ||
| - Each step should build on the previous step's findings | ||
|
|
||
| **How many alternatives to provide:** | ||
| - Provide 0-3 alternative approaches | ||
| - Only include alternatives that are genuinely BETTER than what was done | ||
| - If the investigation was reasonably optimal, provide fewer alternatives (or even zero) | ||
| - Each alternative should address different weaknesses (e.g., one for efficiency, one for rigor) | ||
|
|
||
| ## SESSION DATA | ||
|
|
||
| Below is the complete conversation from the alert analysis session. | ||
| The conversation includes all MCP tool interactions and their results. | ||
|
|
||
| {{SESSION_CONVERSATION}} | ||
|
|
||
| Original Alert: | ||
| {{ALERT_DATA}} | ||
|
|
||
| ## YOUR TASK NOW | ||
|
|
||
| You have now seen the complete investigation from start to finish. | ||
|
|
||
| Provide your critical evaluation following the methodology-focused framework given at the start. | ||
|
|
||
| **BEFORE YOU SCORE, ASK YOURSELF:** | ||
|
|
||
| 1. **Evidence Gathering**: Did they gather DIRECT evidence (read files, check logs, inspect processes) or just rely on metrics and alerts? | ||
|
|
||
| 2. **Tool Completeness**: List ALL tools they COULD have used but DIDN'T. For each unused tool, deduct points. | ||
|
|
||
| 3. **Logical Shortcuts**: Identify ANY place where they: | ||
| - Jumped to a conclusion without verification | ||
| - Gave up when alternatives existed | ||
| - Repeated failed attempts without pivoting | ||
| - Used trial-and-error instead of systematic discovery | ||
|
|
||
| 4. **Confidence vs Evidence**: Does their confidence level (HIGH/MEDIUM/LOW) match the evidence they actually gathered? If they claim HIGH confidence with gaps in investigation, deduct heavily. | ||
|
|
||
| 5. **Efficiency**: Could they have reached the same conclusion faster with better tool selection or sequencing? | ||
|
|
||
| **SCORING CALIBRATION REMINDER:** | ||
|
|
||
| - If you're scoring above 70, you're being too lenient. Re-examine for missed opportunities. | ||
| - If they didn't exhaust investigation paths, score should be ≤ 60 | ||
| - If they made logical leaps without evidence, score should be ≤ 55 | ||
| - If tool usage was incomplete, deduct 5-10 points from tool_relevance | ||
| - If confidence doesn't match evidence, deduct 5-10 points from consistency | ||
|
|
||
| **SCORING PHILOSOPHY:** | ||
|
|
||
| Your average score should be 55-75 out of 100. This reflects professional standards where: | ||
| - 90-100: Near-perfect investigation (extremely rare - reserve for exemplary cases) | ||
| - 75-89: Good investigation with minor issues | ||
| - 60-74: Adequate investigation with notable gaps | ||
| - 45-59: Weak investigation with major methodology problems | ||
| - 0-44: Failed investigation (reserve for truly incomplete work) | ||
|
|
||
| Remember: If you're tempted to give a high score, ask yourself: | ||
| - "Is any evidence missing?" | ||
| - "Is there ANY more efficient approach?" | ||
| - "Is confidence level FULLY justified by direct evidence?" | ||
| - "Could the investigation have been more thorough?" | ||
|
|
||
| If the answer to any is "yes," deduct more points. | ||
|
|
||
| **CRITICAL REMINDERS:** | ||
|
|
||
| 1. Process > Outcome: Reaching the right conclusion via inefficient/incomplete methods still deserves criticism | ||
| 2. Direct > Circumstantial: Correlation is not verification. Demand direct evidence. | ||
| 3. Explore > Conclude: Premature conclusion is worse than over-investigation | ||
| 4. Evidence > Confidence: High confidence requires comprehensive evidence gathering | ||
|
|
||
| **Your evaluation must:** | ||
| - Be at least 200 words in the reasoning section | ||
| - Use critical language ("however," "failed to," "should have," "never attempted") | ||
| - Explain point deductions explicitly for each category | ||
|
|
||
| **MISSING TOOLS REMINDER:** | ||
| - Review the "IDENTIFYING MISSING TOOLS" section above | ||
| - List ONLY tools that would have provided direct evidence or filled critical gaps | ||
| - For each tool: provide specific tool name and explain what evidence gap it would have filled | ||
| - If no critical tools are missing, provide an empty array | ||
|
|
||
| **ALTERNATIVE APPROACHES REMINDER:** | ||
| - Review the "CONSTRUCTING ALTERNATIVE APPROACHES" section above | ||
| - Provide 1-3 alternatives ONLY if they would be genuinely more efficient or rigorous | ||
| - Each alternative must include: name, description, and specific ordered steps with tool names | ||
| - Show how each alternative addresses specific weaknesses you criticized | ||
| - If the investigation was reasonably optimal, provide fewer alternatives (or zero) | ||
|
|
||
| ## OUTPUT FORMAT | ||
|
|
||
| {{OUTPUT_SCHEMA}} |
Add explicit secret management and LLM API key handling strategy to Design section.
The document lacks critical security guidance for LLM provider credentials:
- How are LLM API keys managed (vault, environment variables, temporary tokens)?
- Is key rotation supported and documented?
- How are API keys protected from logs, error messages, and debugging output?
- What are the operational controls (audit logging, least-privilege IAM, CI secret scanning)?
Per learnings, security posture gaps degrade guarantees. Recommend adding a new Security subsection to the Design section (once reorganized) with:
## Security
### LLM API Key Management
- **Credential Storage**: LLM provider credentials (API keys) **must** be stored in a managed secrets store (e.g., HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets) with IAM-restricted access. Never embed keys in configuration files, environment variables (except short-lived CI tokens), or application code.
- **Key Rotation**: Support periodic key rotation with zero-downtime reloading. Document rotation procedures and testing.
- **Logging & Error Handling**:
- Never log full prompts if they contain sensitive data or PII
- Redact or omit API keys and credentials from all error responses
- Log only truncated/sanitized prompt snippets and non-sensitive metadata
- Ensure failed LLM request logging strips credentials and sensitive user inputs
- **Operational Controls**:
- Enable audit logging for all secret access (who accessed the key, when, from where)
- Use service accounts with least-privilege IAM roles
- Implement CI/CD checks to prevent accidental secret exposure in commits
- Regularly scan dependencies for known vulnerabilities in LLM client libraries
| # EP-0028 Phase 1: Alert Session Scoring API | ||
|
|
||
| **Status:** Pending | ||
| **Created:** 2025-12-05 | ||
| **Phase:** Design | ||
|
|
||
| --- | ||
|
|
||
| ## Overview | ||
|
|
||
| Add systematic quality assessment for TARSy alert analysis sessions through an LLM-based judge that critically evaluates investigation methodology and provides actionable feedback. | ||
|
|
||
| **Core Capabilities:** | ||
|
|
||
| 1. **LLM-Based Scoring**: Judge LLM evaluates completed sessions using rigorous scoring framework | ||
| 2. **Critical Evaluation**: Methodology-focused assessment with strict standards (4 dimensions: logical flow, consistency, tool relevance, synthesis quality) | ||
| 3. **Actionable Feedback**: | ||
| * Identify missing MCP tools that should have been used | ||
| * Suggest alternative investigation approaches with detailed steps | ||
| * Provide improvement recommendations | ||
| 4. **Criteria Evolution**: Content-addressed configuration (SHA256 hash) tracks scoring criteria changes over time (see the hashing sketch after this list) | ||
| 5. **Quality Metrics**: Quantitative scores (0-100) enable tracking improvements and identifying patterns | ||
| 6. **Non-Intrusive**: Post-session scoring with zero impact on alert processing | ||
| 7. **Manual Control**: Operator-triggered scoring only (Phase 1 scope) | ||
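To make capability 4 concrete, the criteria hash can be computed directly from the hardcoded judge prompts; a minimal sketch (the constant names follow the EP, `compute_criteria_hash` is a hypothetical helper):

```python
import hashlib

# Hardcoded Phase 1 prompts; the full text lives with the prompt constants.
JUDGE_PROMPT_SCORE = "You are a computer security expert..."            # truncated here
JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS = "Identify tools that SHOULD..."   # truncated here


def compute_criteria_hash(*prompts: str) -> str:
    """Content-address the scoring criteria: any prompt edit produces a new hash."""
    digest = hashlib.sha256()
    for prompt in prompts:
        digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()  # 64 hex characters


CRITERIA_HASH = compute_criteria_hash(JUDGE_PROMPT_SCORE, JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS)
```

Storing this hash alongside every score keeps old scores attributable to the exact criteria that produced them.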
|
|
||
| **Primary Use Cases:** | ||
|
|
||
| * Agent development feedback and improvement tracking | ||
| * MCP server prioritization based on gap analysis | ||
| * Training data curation for high-quality sessions | ||
| * Investigation methodology pattern discovery | ||
|
|
||
| **Scope:** On-demand scoring API for individual sessions via REST endpoints and basic UI integration. Scheduled batch scoring, analytics aggregation and reporting are left for the future. | ||
|
|
||
| **POC Reference:** Initial prototype at <https://github.com/metlos/tarsy-response-score> proved that the judge LLM provides critical scoring even for self-produced analyses. |
Missing explicit "Problem Statement" section per EP guidelines.
The document lacks a formal "Problem Statement" section after metadata (line 6) and before "Overview." Per learnings and coding guidelines, EP documents must include a clear Problem Statement that briefly describes what needs to be solved.
Insert a new "Problem Statement" section between the metadata separator (line 7) and the Overview (line 9). Suggested placement and content:
---
+## Problem Statement
+
+TARSy cannot systematically evaluate alert analysis session quality. Manual quality assessment capability is needed to:
+- Critically evaluate individual alert analysis sessions across multiple dimensions
+- Provide structured feedback on investigation methodology
+- Identify missing investigation tools and gaps
+- Enable future aggregation and trend analysis on quality metrics
+
## Overview
Keep this section brief and focused on what needs to be solved (not why), as per EP document guidelines.
🤖 Prompt for AI Agents
In docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 1 to
35, add a new "Problem Statement" section between the metadata separator (after
line 7) and the "Overview" (before line 9): create a short, focused paragraph
that states what must be solved (e.g., need for an on-demand, non-intrusive,
operator-triggered API to assess the methodological quality of completed TARSy
alert investigation sessions using an LLM judge and produce quantitative scores
and actionable feedback), keeping it strictly about the problem (not rationale
or implementation) and consistent with EP guidelines.
Actionable comments posted: 2
♻️ Duplicate comments (2)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (2)
1-35: Problem Statement section marked as addressed but still missing.
Past review comments indicate a formal Problem Statement section was added (commit b78f203) between metadata and Overview, but the current document shows Overview starting immediately at line 9 without a dedicated Problem Statement section. Either:
- The commit was not applied to this branch, or
- The fix was reverted in a later commit.
Per learnings and EP guidelines, a formal Problem Statement section must be present after the metadata separator (line 7) and before the Overview (line 9). Restore this section to match the marked completion.
🔎 Restore the missing Problem Statement section
---
+## Problem Statement
+
+TARSy cannot systematically evaluate alert analysis session quality. Manual quality assessment capability is needed to:
+- Critically evaluate individual alert analysis sessions across multiple dimensions
+- Provide structured feedback on investigation methodology
+- Identify missing investigation tools and gaps
+- Enable future aggregation and trend analysis on quality metrics
+
 ## Overview
36-142: Design section is not explicitly consolidated; content remains scattered.
Past review comments indicate Design and Implementation Plan sections should be formally reorganized (commit 835d202), with a single consolidated Design section containing System Architecture, Data Design, API Design, Configuration, Security, Performance, Error Handling, and Monitoring subsections. The current document has:
- Configuration (line 36)
- API Usage (line 67)
- Key Design Decisions (line 123)
But lacks a cohesive "## Design" section grouping these. The structure should be:
- Problem Statement
- Overview
- Goals (if present)
- Design ← Single section with subsections for Configuration, API Design, Data Models, Security, etc.
- Implementation Plan ← Currently at line 139, which is correct placement
- Success Criteria
- Status & Tracking
Reorganize to consolidate design-related subsections under a single "## Design" header.
🔎 Proposed consolidation structure
 ## Problem Statement
 [...]
 ## Overview
 [...]
+## Design
+
+### Configuration
 [move lines 36-65 content here]
-## API Usage
+### API Usage
 [move lines 67-120 content here]
-## Key Design Decisions
+### Key Design Decisions
 [move lines 123-137 content here]
 ## Implementation Plan
 [keep as top-level]
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧠 Learnings (19)
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Design section with detailed technical design focusing on HOW to implement the solution with clean, practical design
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Enhancement Proposal documents should be written as focused, technical, and actionable implementation guides for AI consumption
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include detailed explanations of WHY (keep it brief) in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include irrelevant context or noise in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Problem Statement section with a brief description of what needs to be solved
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: Fix obvious errors or typos in EP design documents (e.g., typos in code snippets, incorrect function names, syntax errors) and explicitly highlight the correction
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: When analyzing EP documents, be thorough and check all referenced files, be specific with citations (line numbers, file paths, exact quotes), distinguish between intentional improvements and genuine misalignments, and provide actionable recommendations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: When EP references misaligned patterns (functions that have been renamed, uses deprecated patterns, or conflicts with existing architecture), fix the issue and document the change if clear, or stop and ask for clarification if unclear
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include project management sections for humans in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Create Enhancement Proposal (EP) documents in `/docs/enhancements/pending` following the EP-xxxx naming convention
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Use configuration over code for new agents when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Applies to config/agents.yaml : Define agents in YAML configuration using template variable support `${VAR}` instead of hardcoding when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~173-~173: Use a hyphen to join words.
Context: ...or is_current_criteria * Database to API model mapping * [ ] Test database s...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
472-472: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
708-708: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (1)
139-442: Implementation Plan structure is solid and actionable.
The six-phase implementation roadmap (lines 139–442) is well-organized with clear dependencies, deliverables, and task breakdowns. Each phase is concrete and implementable:
- Phase 1 establishes the database layer and models
- Phase 2 hardcodes prompts and hashing logic
- Phase 3 implements scoring service and LLM integration (multi-turn conversation)
- Phase 4 exposes REST API endpoints
- Phase 5 covers testing and documentation
- Phase 6 adds basic UI integration
Dependencies are explicitly tracked, enabling parallel work where appropriate. Deliverables are clear and testable. This aligns well with EP guidelines requiring step-by-step phases for complex implementations.
alexeykazakov
left a comment
Looks good overall! I reviewed it up to Phase 3 and left some comments.
Phase 3 (the actual agent implementation) requires more details, but let's leave it for later. We can proceed with the overall design and Phases 1 & 2 and then discuss Phase 3 separately. Maybe in a separate document. I would like to follow the existing agent patterns as much as possible (re-use the BaseAgent and LLMClient, which already provide a lot of built-in functionality like automatic retries, observability, etc). But again, we can discuss it separately.
| **Error Responses:** | ||
|
|
||
| * `400 Bad Request` - Session not completed or invalid state | ||
| * `500 Internal Server Error` - LLM API failure or database error |
It can also return:
- 401 - if not authenticated
- 404 - session not found
| } | ||
| ``` | ||
|
|
||
| **Response (200 OK):** |
We can't return the actual scoring to this POST request. It will take some time to process the request.
We should return 202 Accepted with the score_id only instead. The client should poll GET /api/v1/scoring/sessions/{session_id}/score to get the actual result. We need to keep in mind that if we are re-scoring then GET will return the old score until re-scoring is complete. So the client should either compare the score ID from POST with the score ID from GET or we should introduce an endpoint to get the score by score ID in addition to GET the score by session ID.
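A minimal FastAPI sketch of that 202 flow (the endpoint path follows the EP; `run_scoring` and the inline guards are placeholders for whatever the scoring service ends up exposing):

```python
import uuid

from fastapi import APIRouter, BackgroundTasks, status

router = APIRouter(prefix="/api/v1/scoring")


@router.post("/sessions/{session_id}/score", status_code=status.HTTP_202_ACCEPTED)
async def trigger_scoring(session_id: str, background_tasks: BackgroundTasks):
    # A real implementation would 404 for unknown sessions, 400 for sessions
    # that are not completed, and create the pending score row before returning.
    score_id = str(uuid.uuid4())
    background_tasks.add_task(run_scoring, score_id, session_id)  # hypothetical worker
    return {"score_id": score_id, "session_id": session_id, "status": "pending"}


async def run_scoring(score_id: str, session_id: str) -> None:
    """Background job: call the judge LLM, then persist the completed/failed score row."""
    ...
```

The client then polls `GET /api/v1/scoring/sessions/{session_id}/score` and compares the returned score ID against the one from the POST (or a get-by-score-ID endpoint is added), exactly as described above.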
|
|
||
| 1. **Multi-Turn Conversation**: Single conversation context enables LLM to reference its own scoring when analyzing missing tools; reduces repetitive context loading | ||
| 2. **Freeform Text Storage**: Reduces LLM parsing errors compared to strict JSON schemas; enables natural language analysis and flexibility as criteria evolve | ||
| 3. **Number-on-Last-Line Pattern**: Simple, reliable score extraction using regex (`r'(\d+)\s*$'`) without JSON parsing complexity |
We can keep it this way for now. Later we can improve the score number thing. If the score is not detected then we can send a separate score-number-only request:
- Ask LLM to generate the score document with the score number in it.
- If no score number detected: Ask LLM to generate the score number only for the score document.
- Repeat step 2 ^^^ until we get the score number.
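Sketched in Python, that extraction-plus-follow-up loop might look like this (the regex is the one from the EP; `llm_generate` and the bounded retry cap are assumptions, since the comment above suggests repeating until a number appears):

```python
import re

SCORE_PATTERN = re.compile(r"(\d+)\s*$")  # number on the last line, per the EP


def extract_score(text: str) -> int | None:
    match = SCORE_PATTERN.search(text.strip())
    return int(match.group(1)) if match else None


async def score_with_retry(llm_generate, score_document: str, max_attempts: int = 3) -> int:
    """If the score number is missing, ask for the number only, based on the same document."""
    score = extract_score(score_document)
    attempts = 0
    while score is None and attempts < max_attempts:
        reply = await llm_generate(
            "Return ONLY the final 0-100 score for the following evaluation:\n\n" + score_document
        )
        score = extract_score(reply)
        attempts += 1
    if score is None:
        raise ValueError("Judge response did not contain a score")
    return score
```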
| * Columns: | ||
| * `score_id` (UUID, primary key) | ||
| * `session_id` (UUID, unique constraint, foreign key → alert_sessions) | ||
| * `criteria_hash` (VARCHAR 64) - SHA256 hash of hardcoded judge prompts |
We will have two prompts: one for the main score analysis and one for the missing tools analysis. But in the model there is only one hash.
I would also rename the field(s) so it's clear what they're about.
Something like: score_analysis_prompt_hash and missing_tools_analysis_prompt_hash
Or judge_prompt_hash and missing_tools_prompt_hash or something like this.
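With that split, the score model would carry one hash per prompt; a minimal SQLModel-style sketch (class and field names follow the suggestion above and are not final):

```python
from sqlmodel import Field, SQLModel


class SessionScore(SQLModel, table=True):
    """Subset of the Phase 1 score table relevant to criteria versioning."""

    score_id: str = Field(primary_key=True)
    session_id: str = Field(foreign_key="alert_sessions.session_id", unique=True)
    judge_prompt_hash: str = Field(max_length=64, index=True)          # SHA256 of the score prompt
    missing_tools_prompt_hash: str = Field(max_length=64, index=True)  # SHA256 of the follow-up prompt
```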
| * Constants: | ||
| * `JUDGE_PROMPT_SCORE` - Main scoring evaluation prompt (see Attachment section) | ||
| * `JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS` - Missing tools analysis prompt |
Let's keep the prompts and the methods to build them in the prompts package: https://github.com/codeready-toolchain/tarsy-bot/tree/master/backend/tarsy/agents/prompts
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md:
- Around line 343-344: Remove the duplicate bullet in the Phase 3 LLM
integration list: delete the redundant line that repeats "Support conversation
history across turns" (the duplicate of the previous bullet), leaving only a
single instance and ensuring the list now contains "Support conversation history
across turns" and "Use LLM client with conversation history support" once each.
- Around line 324-340: The turn numbering is inconsistent: you list only Turn 1
(Score Prompt) and Turn 2 (Missing Tools Prompt) but later refer to "turn 3"
when storing missing_tools_analysis; update the text so the store step
references "turn 2" (i.e., change "from turn 3" to "from turn 2") OR add a new
"Turn 3" subsection describing the additional follow-up if you intend three
turns; ensure the symbol missing_tools_analysis is documented under the correct
turn label to keep the flow consistent.
- Around line 238-243: Rename the grammatically awkward field
`scored_triggered_by` to `score_triggered_by` everywhere: update the schema
definition (replace `scored_triggered_by` with `score_triggered_by`), update the
JSON response example that references `scored_triggered_by`, update the Phase 1
deliverable reference, and update the implementation note; ensure consistency in
related names (e.g., keep `score_id`, `score_analysis` style) and adjust any
explanatory text or references to the old name so all four occurrences are
replaced.
- Around line 404-414: The Phase 4 checklist is ambiguous about immediate async
return for POST /api/v1/scoring/sessions/{session_id}/score; update the
implementation bullets to explicitly split the two flows: (1) if a score already
exists and force_rescore is false, return the existing score immediately with
full data, and (2) if no score exists or force_rescore is true, enqueue the
scoring work via FastAPI BackgroundTasks and return immediately with a response
indicating status: "pending" (and any job/session id), not waiting for
background completion; make the wording align with the response schema in the
earlier section and reference the endpoint path POST
/api/v1/scoring/sessions/{session_id}/score and the request body
{"force_rescore": false}.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧰 Additional context used
📓 Path-based instructions (1)
docs/enhancements/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/ep-design-consistancy-check.mdc)
docs/enhancements/**/*.md: Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Verify that classes, functions, method signatures, type hints, database schemas, and API endpoints described in EP documents are actually implemented in the codebase
Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🧠 Learnings (19)
📓 Common learnings
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Design section with detailed technical design focusing on HOW to implement the solution with clean, practical design
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check EP (Enhancement Proposal) documents for consistency with codebase implementation, validating structural consistency, implementation fidelity, behavioral alignment, documentation accuracy, and evolution tracking
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include an Implementation Plan section with step-by-step phases for complex implementations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include detailed explanations of WHY (keep it brief) in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include irrelevant context or noise in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Enhancement Proposal documents must include a Problem Statement section with a brief description of what needs to be solved
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Enhancement Proposal documents should be written as focused, technical, and actionable implementation guides for AI consumption
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Ensure event flows, configuration options, integration patterns, and error handling in the codebase match what is specified in EP documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: Fix obvious errors or typos in EP design documents (e.g., typos in code snippets, incorrect function names, syntax errors) and explicitly highlight the correction
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: When analyzing EP documents, be thorough and check all referenced files, be specific with citations (line numbers, file paths, exact quotes), distinguish between intentional improvements and genuine misalignments, and provide actionable recommendations
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Check that code comments, docstrings, and examples in EP documents are accurate and reflect the current codebase implementation, and that EP status tags are correct
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:51.044Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-implementation.mdc:0-0
Timestamp: 2025-11-24T20:22:51.044Z
Learning: When EP references misaligned patterns (functions that have been renamed, uses deprecated patterns, or conflicts with existing architecture), fix the issue and document the change if clear, or stop and ask for clarification if unclear
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Do not include project management sections for humans in Enhancement Proposal documents
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Validate that file paths mentioned in EP documents exist in the codebase and directory structure matches EP specifications
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:41.519Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design.mdc:0-0
Timestamp: 2025-11-24T20:22:41.519Z
Learning: Applies to /docs/enhancements/pending/EP-*.md : Create Enhancement Proposal (EP) documents in `/docs/enhancements/pending` following the EP-xxxx naming convention
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:22:32.529Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: .cursor/rules/ep-design-consistancy-check.mdc:0-0
Timestamp: 2025-11-24T20:22:32.529Z
Learning: Applies to docs/enhancements/**/*.md : Track implementation improvements and bug fixes in the codebase that may not be documented in the corresponding EP document
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Use configuration over code for new agents when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
📚 Learning: 2025-11-24T20:23:41.441Z
Learnt from: CR
Repo: codeready-toolchain/tarsy-bot PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T20:23:41.441Z
Learning: Applies to config/agents.yaml : Define agents in YAML configuration using template variable support `${VAR}` instead of hardcoding when possible
Applied to files:
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
🪛 LanguageTool
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
[grammar] ~268-~268: Use a hyphen to join words.
Context: ...or is_current_criteria * Database to API model mapping * [ ] Test database s...
(QB_NEW_EN_HYPHEN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (5)
docs/enhancements/pending/EP-0028-phase-1-scoring-api.md (5)
229-248: Approve database schema design with comprehensive indexing strategy.
The Phase 1 database schema is well-designed with:
- Clear field definitions with proper types and nullability constraints
- Comprehensive indexes supporting common query patterns (session_id, status, criteria_hash, total_score)
- Unique partial constraint preventing duplicate in-progress scorings
- Explicit async state tracking fields (status, started_at_us, completed_at_us, error_message)
- Proper foreign key to alert_sessions table
The schema supports the designed async workflow and concurrent scoring scenarios effectively.
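The "unique partial constraint preventing duplicate in-progress scorings" mentioned above can be expressed as a partial unique index; a sketch in SQLAlchemy terms (index name illustrative, declared via the score model's `__table_args__`):

```python
from sqlalchemy import Index, text

# At most one pending/in_progress scoring row per session; completed and failed
# rows are not restricted, so historical scores can accumulate.
ACTIVE_SCORE_UNIQUE = Index(
    "ux_session_scores_active",
    "session_id",
    unique=True,
    postgresql_where=text("status IN ('pending', 'in_progress')"),
)
```

For SQLite-backed tests the equivalent `sqlite_where` argument would be needed as well.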
185-204: Excellent design decisions documentation.
The 15 design decisions (lines 185-204) are clearly articulated and well-reasoned. They effectively balance:
- Simplicity (hardcoded prompts for Phase 1, freeform text storage, number-on-last-line pattern)
- Auditability (SHA256 criteria versioning, user attribution, status tracking)
- Async resilience (background task execution, state transitions, orphan cleanup)
- Extensibility (explicit mention that automation and concurrency limits are deferred to future phases)
This demonstrates thoughtful architectural planning with clear acknowledgment of Phase 1 scope constraints.
563-799: Judge prompt design is thorough and operationally sound.
The scoring framework (lines 567-799) demonstrates:
- Clear 4-dimensional scoring rubric (Logical Flow, Consistency, Tool Relevance, Synthesis Quality)
- Specific deduction criteria with realistic score ranges (typical 10-22 points per category)
- Self-correction mechanisms ("if you're scoring above 70, you're being too lenient")
- Operational guidance on score calibration (90-100 rare, typical 55-75)
- Comprehensive evaluation checklist with evidence quality, tool completeness, logical rigor
The prompt emphasizes methodology over outcome and direct evidence over inference, which aligns with the stated goal of rigorous agent development feedback.
290-301: File path pattern is correctly specified and aligns with existing codebase conventions.
The specified path `backend/tarsy/services/scoring_service.py` matches the existing directory structure and service naming conventions used throughout the codebase. Services like `data_masking_service.py` already follow this `*_service.py` pattern in `backend/tarsy/services/`.
342-345: LLM client infrastructure and conversation history support are verified; placeholder substitution implementation is pending.
The LLMClient with multi-turn conversation history support exists in the codebase (`backend/tarsy/integrations/llm/client.py`, `LLMManager.generate_response()`) and can be configured with default LLM settings via `settings.llm_provider`. LLMConversation model supports multi-turn dialogue with message persistence across turns, and the History Service is available for session conversation retrieval.
backend/tarsy/integrations/llm/client.py,LLMManager.generate_response()) and can be configured with default LLM settings viasettings.llm_provider. LLMConversation model supports multi-turn dialogue with message persistence across turns, and the History Service is available for session conversation retrieval.However, the specific placeholder substitution pattern described in lines 324–340 ({{SESSION_CONVERSATION}}, {{ALERT_DATA}}, {{OUTPUT_SCHEMA}}) and the associated prompt constants (JUDGE_PROMPT_SCORE, JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS) are not yet implemented in the codebase. These remain TODO items in the EP's Phase 1 implementation checklist.
| - No request timeouts on long-running LLM operations | ||
| - Real-time progress feedback via WebSocket events | ||
| - Consistent UX with alert processing workflow | ||
| - Reuses existing event infrastructure (no new systems needed) |
🧩 Analysis chain
🏁 Script executed:
# First, let's read the EP document to understand the context
cat -n docs/enhancements/pending/EP-0028-phase-1-scoring-api.md | sed -n '40,180p'
Repository: codeready-toolchain/tarsy-bot
Length of output: 6825
🏁 Script executed:
# Search for WebSocket/event infrastructure in the backend
find backend -type f -name "*.py" | head -20
Repository: codeready-toolchain/tarsy-bot
Length of output: 1769
🏁 Script executed:
# Look for event-related classes and functions
rg -n "class.*Event|def.*broadcast|def.*emit|def.*publish" backend/ --type py -A 1 | head -50Repository: codeready-toolchain/tarsy-bot
Length of output: 4379
🏁 Script executed:
# Search for channel patterns
rg -n "channel|session:" backend/ --type py -i | head -30Repository: codeready-toolchain/tarsy-bot
Length of output: 3556
🏁 Script executed:
# Search for broadcast_to_channel usage and event publishing
rg -n "broadcast_to_channel|publish.*event|emit.*event" backend/ --type py -B 2 -A 2 | head -100Repository: codeready-toolchain/tarsy-bot
Length of output: 8626
🏁 Script executed:
# Look for specific channel names like "sessions" or "session:"
rg -n '"sessions"|\'sessions\'|"session:|\$.*session' backend/ --type py | head -40Repository: codeready-toolchain/tarsy-bot
Length of output: 141
🏁 Script executed:
# Check the websocket_connection_manager.py file more thoroughly
cat -n backend/tarsy/services/websocket_connection_manager.py
Repository: codeready-toolchain/tarsy-bot
Length of output: 5140
🏁 Script executed:
# Search for event payload/data structure definitions
rg -n "class.*Event|type.*timestamp|score_id|session_id" backend/ --type py | head -50Repository: codeready-toolchain/tarsy-bot
Length of output: 4651
🏁 Script executed:
# Search for scoring event implementations
rg -n "scoring\.(started|progress|completed|failed)|score_id|ScoringEvent" backend/ --type py | head -50Repository: codeready-toolchain/tarsy-bot
Length of output: 55
🏁 Script executed:
# Look for scoring service or scoring-related event publishing
rg -n "scoring|score" backend/tarsy/services/ --type py -l | head -20Repository: codeready-toolchain/tarsy-bot
Length of output: 55
🏁 Script executed:
# Check if there's a scoring service that might publish these events
find backend -name "*scoring*" -type fRepository: codeready-toolchain/tarsy-bot
Length of output: 55
🏁 Script executed:
# Look at event-related files to understand event structure patterns
rg -n "type.*timestamp|timestamp_us|event.*payload" backend/ --type py | head -30Repository: codeready-toolchain/tarsy-bot
Length of output: 3350
🏁 Script executed:
# Look at event publisher implementation to understand event structure
cat -n backend/tarsy/services/events/publisher.py | head -100
Repository: codeready-toolchain/tarsy-bot
Length of output: 4522
🏁 Script executed:
# Search for existing event class definitions
rg -n "class.*Event.*:" backend/tarsy --type py | head -40Repository: codeready-toolchain/tarsy-bot
Length of output: 3076
🏁 Script executed:
# Check the event models and payload structure
find backend -path "*/events/*" -name "*.py" -type f | head -20
Repository: codeready-toolchain/tarsy-bot
Length of output: 509
🏁 Script executed:
# Look at the event models to understand the structure
cat -n backend/tarsy/models/event_models.py | head -150
Repository: codeready-toolchain/tarsy-bot
Length of output: 6626
🏁 Script executed:
# Check the channels definition
cat -n backend/tarsy/services/events/channels.py
Repository: codeready-toolchain/tarsy-bot
Length of output: 1421
🏁 Script executed:
# Look for BaseEvent to understand the base structure
rg -n "class BaseEvent" backend/ --type py -A 10Repository: codeready-toolchain/tarsy-bot
Length of output: 895
🏁 Script executed:
# Search for any scoring-related code or references in the codebase
rg -n "scoring" backend/ --type py -i | head -30Repository: codeready-toolchain/tarsy-bot
Length of output: 55
Implement scoring-specific events following the existing TARSy event pattern.
The WebSocket infrastructure supports the channel patterns mentioned (the global `sessions` channel and session-specific `session:{session_id}` channels). The event payload structure (`type`, `timestamp_us`, plus additional fields) aligns with existing events like `SessionCompletedEvent` and `SessionFailedEvent`.
However, the scoring-specific event classes (`scoring.started`, `scoring.progress`, `scoring.completed`, `scoring.failed`) do not exist in the codebase and must be created. Add them to `backend/tarsy/models/event_models.py` following the existing pattern: each should inherit from `BaseEvent`, include `session_id`, `score_id`, and `timestamp_us` fields, and use a `Literal` type for the `type` field. Implement publishing of these events in the scoring service using `EventPublisher`, following the pattern in `backend/tarsy/services/events/publisher.py`, with the channel determined by whether it is a global event (`sessions`) or a session-specific detail (`session:{session_id}`).
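As a rough illustration of that recommendation, the sketch below shows what such event models and channel selection could look like. `BaseEvent` is stubbed as a plain Pydantic model because its real definition is not reproduced here; the class names, the extra `total_score` field, and the helper function are assumptions, not the actual contents of `event_models.py`.

```python
# Illustrative sketch only: BaseEvent is stubbed, and the scoring event
# classes/fields are assumptions modeled on the pattern described above.
from typing import Literal

from pydantic import BaseModel


class BaseEvent(BaseModel):
    """Stand-in for tarsy's BaseEvent; the real class lives in event_models.py."""
    timestamp_us: int


class ScoringStartedEvent(BaseEvent):
    type: Literal["scoring.started"] = "scoring.started"
    session_id: str
    score_id: str


class ScoringCompletedEvent(BaseEvent):
    type: Literal["scoring.completed"] = "scoring.completed"
    session_id: str
    score_id: str
    total_score: int


def scoring_channel(session_id: str, session_detail: bool) -> str:
    """Pick the channel: global 'sessions' vs. per-session 'session:{id}'."""
    return f"session:{session_id}" if session_detail else "sessions"
```

The `Literal` defaults keep the `type` discriminator fixed per event class, mirroring how the existing session events are described above.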
* `scored_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
* `scored_at_us` (BIGINT) - When scoring was triggered (microseconds)
* **Async fields:**
* `status` (VARCHAR 50, NOT NULL) - pending | in_progress | completed | failed
* `started_at_us` (BIGINT, NOT NULL) - When status became in_progress
* `completed_at_us` (BIGINT, nullable) - When status became completed/failed
🧩 Analysis chain
🏁 Script executed:
# First, let's find and verify the file exists
find . -name "EP-0028-phase-1-scoring-api.md" -type fRepository: codeready-toolchain/tarsy-bot
Length of output: 131
🏁 Script executed:
# Read the relevant sections of the file to verify the field naming
# Lines around 238-243
sed -n '235,245p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 978
🏁 Script executed:
# Lines around 141
sed -n '138,145p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 419
🏁 Script executed:
# Lines around 366
sed -n '363,369p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 426
🏁 Script executed:
# Let's also check the overall naming patterns used in the document
# Look for other field names with "score" in them
grep -n "score_" docs/enhancements/pending/EP-0028-phase-1-scoring-api.md | head -20Repository: codeready-toolchain/tarsy-bot
Length of output: 1547
🏁 Script executed:
# Check for "scored_" patterns to see if there are other grammatically awkward names
grep -n "scored_" docs/enhancements/pending/EP-0028-phase-1-scoring-api.mdRepository: codeready-toolchain/tarsy-bot
Length of output: 430
Standardize field naming for grammatical consistency.
The field `scored_triggered_by` is grammatically awkward (it mixes a past participle with the action being described). Rename it to `score_triggered_by` to align with the document's naming patterns (e.g., `score_id`, `score_analysis`) and improve clarity.
Update all 4 occurrences:
- Line 238: schema definition
- Line 141: JSON response example
- Line 366: Phase 1 deliverable reference
- Line 434: implementation note
Proposed rename
- `scored_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
+ `score_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header

- "scored_triggered_by": "alice@example.com"
+ "score_triggered_by": "alice@example.com"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
* `scored_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
* `scored_at_us` (BIGINT) - When scoring was triggered (microseconds)
* **Async fields:**
* `status` (VARCHAR 50, NOT NULL) - pending | in_progress | completed | failed
* `started_at_us` (BIGINT, NOT NULL) - When status became in_progress
* `completed_at_us` (BIGINT, nullable) - When status became completed/failed

After:
* `score_triggered_by` (VARCHAR 255) - User from X-Forwarded-User header
* `scored_at_us` (BIGINT) - When scoring was triggered (microseconds)
* **Async fields:**
* `status` (VARCHAR 50, NOT NULL) - pending | in_progress | completed | failed
* `started_at_us` (BIGINT, NOT NULL) - When status became in_progress
* `completed_at_us` (BIGINT, nullable) - When status became completed/failed
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 238 -
243, Rename the grammatically awkward field `scored_triggered_by` to
`score_triggered_by` everywhere: update the schema definition (replace
`scored_triggered_by` with `score_triggered_by`), update the JSON response
example that references `scored_triggered_by`, update the Phase 1 deliverable
reference, and update the implementation note; ensure consistency in related
names (e.g., keep `score_id`, `score_analysis` style) and adjust any explanatory
text or references to the old name so all four occurrences are replaced.
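For orientation only, this is roughly how the quoted columns could be declared with the suggested `score_triggered_by` name. The sketch uses plain SQLAlchemy; the table name, primary key, and base class are assumptions and may not match tarsy's actual persistence models.

```python
# Sketch only: table/PK names are hypothetical; column names and types follow
# the schema quoted above, with scored_triggered_by renamed per the suggestion.
from sqlalchemy import BigInteger, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SessionScore(Base):
    __tablename__ = "session_scores"  # hypothetical table name

    score_id = Column(String(255), primary_key=True)   # assumed primary key
    session_id = Column(String(255), nullable=False)

    # Audit fields
    score_triggered_by = Column(String(255))    # user from X-Forwarded-User header
    scored_at_us = Column(BigInteger)            # when scoring was triggered (microseconds)

    # Async fields
    status = Column(String(50), nullable=False)           # pending|in_progress|completed|failed
    started_at_us = Column(BigInteger, nullable=False)    # when status became in_progress
    completed_at_us = Column(BigInteger, nullable=True)   # when status became completed/failed
```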
* [ ] Build multi-turn judge prompt construction with placeholder substitution
* **Turn 1 - Score Prompt:**
* Start with `JUDGE_PROMPT_SCORE` constant
* Replace `{{SESSION_CONVERSATION}}` with full conversation from History Service
* Replace `{{ALERT_DATA}}` with original alert data
* Replace `{{OUTPUT_SCHEMA}}` with: "You MUST end your response with a single line containing ONLY the total score as an integer (0-100)"
* Send to LLM, receive score analysis response
* **Score Extraction:**
* Extract `total_score` from last line using regex: `r'(\d+)\s*$'`
* Store full response (minus last line) as `score_analysis`
* **Turn 2 - Missing Tools Prompt:**
* Build prompt from `JUDGE_PROMPT_FOLLOWUP_MISSING_TOOLS` constant
* Send as continuation of conversation (LLM has context of its own scoring)
* Receive missing tools analysis response
* **Store Missing Tools:**
* Store full response as `missing_tools_analysis` (freeform text)
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get its size
wc -l docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 133
🏁 Script executed:
# Read the relevant sections to verify turn numbering
sed -n '320,370p' docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 2405
Correct turn numbering in multi-turn conversation flow.
The description identifies only 2 turns (lines 325 and 334), but line 365 references "turn 3" when storing the missing_tools_analysis. The turn numbering is inconsistent.
Current state:
- Line 325: "Turn 1 - Score Prompt:"
- Line 334: "Turn 2 - Missing Tools Prompt:"
- Line 365: "`missing_tools_analysis` (freeform text from turn 3)"
Expected: Either correct line 365 to reference "turn 2" if there are only 2 turns, or add a description of Turn 3 in the Phase 1 multi-turn flow section (lines 324-340).
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 324 -
340, The turn numbering is inconsistent: you list only Turn 1 (Score Prompt) and
Turn 2 (Missing Tools Prompt) but later refer to "turn 3" when storing
missing_tools_analysis; update the text so the store step references "turn 2"
(i.e., change "from turn 3" to "from turn 2") OR add a new "Turn 3" subsection
describing the additional follow-up if you intend three turns; ensure the symbol
missing_tools_analysis is documented under the correct turn label to keep the
flow consistent.
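Whichever turn numbering is kept, the flow itself is two calls in one conversation. The sketch below shows the score extraction and follow-up turn under that reading; `send_turn` is a stand-in for tarsy's LLM client (whose real interface is not shown here), so treat the function shapes as assumptions.

```python
# Sketch: two judge turns plus score extraction; send_turn is a placeholder for
# the real LLM client call that continues one conversation across turns.
import re
from typing import Callable


def extract_score(turn1_response: str) -> tuple[int, str]:
    # EP rule: the last line must contain only the integer score (regex r'(\d+)\s*$').
    text = turn1_response.strip()
    match = re.search(r"(\d+)\s*$", text)
    if match is None:
        raise ValueError("judge response did not end with an integer score")
    total_score = int(match.group(1))
    score_analysis = text.rsplit("\n", 1)[0]  # full response minus the score line
    return total_score, score_analysis


def run_judge(
    send_turn: Callable[[str], str],
    score_prompt: str,
    followup_missing_tools_prompt: str,
) -> dict[str, object]:
    turn1 = send_turn(score_prompt)                    # first turn: score prompt
    total_score, score_analysis = extract_score(turn1)

    turn2 = send_turn(followup_missing_tools_prompt)   # follow-up: missing tools
    return {
        "total_score": total_score,
        "score_analysis": score_analysis,
        "missing_tools_analysis": turn2,               # stored as freeform text
    }
```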
* Support conversation history across turns
* Use LLM client with conversation history support
Remove redundant line in Phase 3 LLM integration.
Line 344 restates line 343. Remove the redundant bullet.
Proposed fix
* [ ] Integrate LLM client for multi-turn conversation
* Use TARSy's default LLM configuration via existing LLM client infrastructure
* Support conversation history across turns
- * Use LLM client with conversation history support

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
* Support conversation history across turns
* Use LLM client with conversation history support

After:
* [ ] Integrate LLM client for multi-turn conversation
* Use TARSy's default LLM configuration via existing LLM client infrastructure
* Support conversation history across turns
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 343 -
344, Remove the duplicate bullet in the Phase 3 LLM integration list: delete the
redundant line that repeats "Support conversation history across turns" (the
duplicate of the previous bullet), leaving only a single instance and ensuring
the list now contains "Support conversation history across turns" and "Use LLM
client with conversation history support" once each.
* [ ] Add POST /score endpoint with background task execution
* Endpoint: `POST /api/v1/scoring/sessions/{session_id}/score`
* Request body: `{"force_rescore": false}` (optional)
* Implementation:
* Check if score exists (unless `force_rescore=true`)
* If exists and not forcing rescore, return existing score immediately
* Otherwise, execute scoring via FastAPI `BackgroundTasks`
* Return score result (immediate if exists, or after background task completion)
* Error responses:
* `400 Bad Request` - Session not completed or invalid state
* `500 Internal Server Error` - LLM API failure or database error
🧩 Analysis chain
🏁 Script executed:
# Find the EP document
find . -name "EP-0028-phase-1-scoring-api.md" -type fRepository: codeready-toolchain/tarsy-bot
Length of output: 131
🏁 Script executed:
# Read the entire file to see the context
wc -l ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 135
🏁 Script executed:
# Read lines around 91-93 to see the async pattern reference
sed -n '85,100p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 664
🏁 Script executed:
# Read the Phase 4 section around lines 404-414
sed -n '396,420p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1232
🏁 Script executed:
# Check lines around 150-160 to see if there are more details about the async pattern
sed -n '150,165p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 624
🏁 Script executed:
# Check if there's a response schema or more details about what the endpoint returns
sed -n '410,445p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1540
🏁 Script executed:
# Let me check if there are any response examples or schemas
sed -n '100,130p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 993
🏁 Script executed:
# Let me check the complete response schema to see what the endpoint actually returns
sed -n '100,145p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1619
🏁 Script executed:
# Let me verify the implementation pattern described in the document
sed -n '85,115p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 1076
🏁 Script executed:
# Double-check by reading the exact lines mentioned in the review (91, 93)
sed -n '88,95p' ./docs/enhancements/pending/EP-0028-phase-1-scoring-api.md
Repository: codeready-toolchain/tarsy-bot
Length of output: 444
Phase 4 checklist could be more explicit about immediate async return.
Lines 404-414 describe the `POST /score` endpoint implementation less explicitly than the earlier sections covering the async pattern. The phrase "Return score result (immediate if exists, or after background task completion)" could be more specific. The response schema in lines 100-115 and the endpoint description in lines 88-95 make clear that the endpoint returns immediately with `status: "pending"` for new scoring requests, and Phase 4 should match that explicitness.
Consider rewording the Phase 4 implementation bullets to clearly separate: (1) existing score cases that return immediately with data, and (2) new scoring cases that launch background tasks and return immediately with pending status.
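A sketch of those two paths with FastAPI `BackgroundTasks` follows. Storage and the scoring job are stubbed in memory and the request body is modeled as an optional Pydantic model; none of this mirrors tarsy's real services, it only demonstrates the immediate-return behaviour being asked for.

```python
# Sketch only: in-memory stubs stand in for the real session/score storage and
# the LLM scoring job; the point is the two immediate-return paths.
import uuid
from typing import Optional

from fastapi import APIRouter, BackgroundTasks, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1/scoring")

_completed_sessions = {"sess-1"}        # stub: sessions known to be completed
_scores: dict[str, dict] = {}           # stub: session_id -> score record


class ScoreRequest(BaseModel):
    force_rescore: bool = False


def _run_scoring_job(session_id: str) -> None:
    # Placeholder for the background LLM scoring work.
    _scores[session_id]["status"] = "completed"
    _scores[session_id]["total_score"] = 85


@router.post("/sessions/{session_id}/score")
async def score_session(
    session_id: str,
    background_tasks: BackgroundTasks,
    request: Optional[ScoreRequest] = None,
):
    if session_id not in _completed_sessions:
        raise HTTPException(status_code=400, detail="Session not completed")

    force_rescore = bool(request and request.force_rescore)
    existing = _scores.get(session_id)
    if existing and not force_rescore:
        return existing                             # path 1: existing score, returned as-is

    score_id = str(uuid.uuid4())                    # path 2: new scoring run
    _scores[session_id] = {"score_id": score_id, "session_id": session_id,
                           "status": "pending"}
    background_tasks.add_task(_run_scoring_job, session_id)
    return _scores[session_id]                      # returns immediately with status "pending"
```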
🤖 Prompt for AI Agents
In @docs/enhancements/pending/EP-0028-phase-1-scoring-api.md around lines 404 -
414, The Phase 4 checklist is ambiguous about immediate async return for POST
/api/v1/scoring/sessions/{session_id}/score; update the implementation bullets
to explicitly split the two flows: (1) if a score already exists and
force_rescore is false, return the existing score immediately with full data,
and (2) if no score exists or force_rescore is true, enqueue the scoring work
via FastAPI BackgroundTasks and return immediately with a response indicating
status: "pending" (and any job/session id), not waiting for background
completion; make the wording align with the response schema in the earlier
section and reference the endpoint path POST
/api/v1/scoring/sessions/{session_id}/score and the request body
{"force_rescore": false}.
This is just the core API that will be followed by further EPs that will flesh out more functionality and UI integration on top of it.
Assisted-by: Claude Code
JIRA: https://issues.redhat.com/browse/SANDBOX-1531