feat: trying the FactualCorrectness! #223

Merged

amindadgar merged 3 commits into main from fix/222-incorrenct-eval-scoring on Oct 12, 2025

Conversation

@amindadgar (Member) commented Oct 9, 2025

  • The other scores should be tweaked as well.
  • FactualScore link is https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/factual_correctness/#factual-correctness

Summary by CodeRabbit

  • New Features

    • Added Factual Correctness metric to evaluation output.
    • New multi-phase evaluation: factual correctness computed for all items, core metrics run on filtered items, then results merged.
    • Added a “metrics-only” evaluation path to run metrics against pre-populated responses without re-querying.
  • Bug Fixes / Improvements

    • Consolidated cost/token reporting across multiple cost callbacks for unified summaries.
    • Better handling of NO_ANSWER references to reduce noise in core metric calculations.

@coderabbitai coderabbitai bot (Contributor) commented Oct 9, 2025

Walkthrough

Implements a multi-phase evaluation: batched execution to populate responses, compute factual_correctness across all items, filter out rows with NO_ANSWER_REFERENCE for core metrics (faithfulness, answer_relevancy, context_precision, context_recall), then merge metrics and persist aggregated costs across multiple Cost callbacks.
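
As a rough, hedged sketch of the filter-and-merge step described above (not the project's actual implementation; the sentinel value assigned to NO_ANSWER_REFERENCE, the helper name merge_metric_frames, and the to_pandas() calls in the usage comment are assumptions):

```python
import pandas as pd

# Illustrative sentinel; the real constant lives in evaluation/evaluation.py.
NO_ANSWER_REFERENCE = "NO_ANSWER"


def merge_metric_frames(
    df_fc: pd.DataFrame, df_core: pd.DataFrame, core_cols: list[str]
) -> pd.DataFrame:
    """Left-join the core-metric columns onto the factual-correctness frame.

    df_fc holds factual_correctness for every row; df_core holds the core
    metrics computed only on rows whose reference is not NO_ANSWER_REFERENCE,
    so unmatched rows keep NaN in the core-metric columns.
    """
    keys = ["user_input", "reference", "response"]
    return df_fc.merge(df_core[keys + core_cols], on=keys, how="left")


# Usage sketch (assumes the result objects expose a to_pandas() conversion):
# df_fc = result_fc.to_pandas()                      # factual_correctness on all rows
# subset = df_fc[df_fc["reference"] != NO_ANSWER_REFERENCE]
# df_core = result_core.to_pandas()                  # core metrics on the filtered subset
# merged = merge_metric_frames(
#     df_fc, df_core,
#     ["faithfulness", "answer_relevancy", "context_precision", "context_recall"],
# )
# merged.to_csv("results.csv", index=False)
```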

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Evaluation Core Flow (`evaluation/evaluation.py`) | Replaced single-pass evaluation with a multi-phase flow using Executor/RunConfig and EvaluationDataset: populate responses in batches, compute FactualCorrectness for all rows, filter out NO_ANSWER_REFERENCE rows, run core metrics on the filtered subset, and merge results back into the full dataset. Added `_evaluate_metrics_only` to run metrics on pre-populated responses. |
| Cost Aggregation & Persistence (`evaluation/evaluation.py`) | `_persist_cost` updated to accept either a single EvaluationResult or a list; aggregates input/output tokens and total costs across multiple Cost callbacks and serializes a merged payload instead of handling a single callback (a sketch of this aggregation follows the table). |
| Public Interface / Signatures & Imports (`evaluation/evaluation.py`) | Updated `_evaluate(...)` to accept an optional `metrics_override` parameter (`list[str]` or `None`) and adjusted imports to support the new flow. |
| Data Handling & Helpers (`evaluation/evaluation.py`) | Added pandas-based intermediate handling for filtering/merging results, a robust context-parsing helper, use of NO_ANSWER_REFERENCE for filtering, and cost-persistence semantics that merge multiple callback payloads. |
| CI / Metrics Inclusion (`.github/workflows/evaluation.yml`) | Appends `factual_correctness(mode=f1)` to the metrics list used when computing averages from results.csv, so averaged outputs include the new metric if present. |
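
For the cost-aggregation row above, a minimal sketch of how token totals from several cost callbacks might be merged into one payload; TokenUsage, aggregate_costs, and the per-token prices are illustrative names, and only cb.total_tokens() mirrors the interface visible in the review comments below:

```python
import json
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Stand-in for the token-usage object a cost callback returns (illustrative)."""
    input_tokens: int = 0
    output_tokens: int = 0


def aggregate_costs(
    cost_callbacks,
    cost_per_input_token: float,
    cost_per_output_token: float,
    path: str = "results_cost.json",
) -> None:
    """Sum token counts across callbacks, derive a total cost, and write one merged payload."""
    total = TokenUsage()
    for cb in cost_callbacks:
        usage = cb.total_tokens()  # assumed interface, mirrored from the review diff further below
        if usage is not None:
            total.input_tokens += usage.input_tokens
            total.output_tokens += usage.output_tokens
    total_cost = (
        total.input_tokens * cost_per_input_token
        + total.output_tokens * cost_per_output_token
    )
    payload = {
        "input_tokens": total.input_tokens,
        "output_tokens": total.output_tokens,
        "total_cost": total_cost,
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```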

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Orchestrator as Evaluation
  participant Executor
  participant Engine as LLM Engine
  participant EvalDS as EvaluationDataset
  participant MetricsRunner as Metrics
  participant Storage as Persist

  rect rgba(200,230,255,0.3)
  Note over Orchestrator,Executor: Phase 1 — Populate responses (batched)
  Orchestrator->>Executor: build RunConfig & batch queries
  Executor->>Engine: execute queries
  Engine-->>Executor: responses per item
  Executor-->>Orchestrator: populated EvalDS
  end

  rect rgba(235,255,200,0.3)
  Note over Orchestrator,MetricsRunner: Phase 2 — Factual correctness (all rows)
  Orchestrator->>MetricsRunner: run FactualCorrectness on EvalDS
  MetricsRunner->>Engine: evaluator LLM calls (if required)
  MetricsRunner-->>Orchestrator: factual_correctness results
  end

  rect rgba(255,230,230,0.3)
  Note over Orchestrator,EvalDS: Phase 3 — Filter & core metrics
  Orchestrator->>EvalDS: filter out `NO_ANSWER_REFERENCE` rows
  Orchestrator->>MetricsRunner: run core metrics on filtered EvalDS
  MetricsRunner-->>Orchestrator: core metric results
  end

  rect rgba(240,240,240,0.4)
  Note over Orchestrator,Storage: Phase 4 — Merge & persist
  Orchestrator->>Orchestrator: merge factual + core metrics into full table
  Orchestrator->>Storage: _persist_cost(results:list|single) -> aggregated cost payload
  Storage-->>Orchestrator: persisted
  end
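
To make Phase 1 of the diagram concrete, here is a small, generic batching sketch; run_in_batches, query_fn, and batch_size are illustrative and do not reflect the actual Executor/RunConfig API used in evaluation/evaluation.py:

```python
from typing import Callable


def run_in_batches(
    queries: list[str], query_fn: Callable[[str], str], batch_size: int = 8
) -> list[str]:
    """Send queries in fixed-size batches and keep responses in submission order."""
    responses: list[str] = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start : start + batch_size]
        # In the real flow, the Executor/RunConfig machinery dispatches the batch here.
        responses.extend(query_fn(q) for q in batch)
    return responses
```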

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • WIP: Added the evaluation process! #200 — Overlaps on evaluation/evaluation.py changes around evaluation orchestration, cost persistence, and _parse_contexts; likely touches similar code paths and helpers.

Poem

I hop through rows, nibbling each clue,
Truth for all first, then metrics for a few.
I gather tokens, costs in a heap,
Merge them back where results sleep.
A rabbit’s hop—efficient and new. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title Check | ❓ Inconclusive | The title is somewhat related but is vague and informal, using “trying” instead of stating the actual change. It fails to clearly summarize the main addition of the FactualCorrectness metric and leaves ambiguity about the precise scope. Adopting a more descriptive phrasing would help teammates quickly understand the primary goal of this pull request. | Consider renaming the title to clearly describe the change. For example, “Add FactualCorrectness metric to evaluation workflow” would succinctly convey the main update. This would improve clarity for anyone scanning the repository history. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8a0b9 and 1cddc2e.

📒 Files selected for processing (1)
  • .github/workflows/evaluation.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: evaluate
  • GitHub Check: ci / test / Test
  • GitHub Check: ci / lint / Lint


- Added functionality to evaluate metrics on datasets with populated responses without re-querying the engine.
- Implemented filtering for core metrics to exclude rows where both reference and response are marked as NO_ANSWER_REFERENCE.
- Updated cost persistence to handle multiple evaluation results and improved logging for better traceability.
- Refactored the evaluation method to allow for metric overrides, enhancing flexibility in evaluation configurations.
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
evaluation/evaluation.py (5)

82-82: Remove extraneous f prefix from logging statement.

The f-string has no placeholders.

Apply this diff:

-        logging.info(f"Running queries to get responses for all rows...")
+        logging.info("Running queries to get responses for all rows...")

171-171: Consider using iterable unpacking for concatenation.

While the current concatenation is functional, iterable unpacking would be more Pythonic and slightly more efficient.

Apply this diff:

         merged = df_fc.merge(
-            df_core[["user_input", "reference", "response"] + core_cols],
+            df_core[[*["user_input", "reference", "response"], *core_cols]],
             on=["user_input", "reference", "response"],
             how="left",
         )

176-176: Remove extraneous f prefixes from logging statements.

These f-strings have no placeholders.

Apply this diff:

-        logging.info(f"Persisting results to results.csv")
+        logging.info("Persisting results to results.csv")
         merged.to_csv("results.csv", index=False)
 
-        logging.info(f"Persisting cost information to results_cost.json...")
+        logging.info("Persisting cost information to results_cost.json...")
         self._persist_cost([result_fc, result_core], "results_cost.json")

Also applies to: 179-179


215-216: Log exceptions instead of silently passing.

Silently catching and ignoring exceptions when computing token usage makes debugging difficult and could hide issues with cost tracking.

Apply this diff:

             try:
                 tokens = cb.total_tokens()
                 if tokens is not None:
                     if total_tokens_obj is None:
                         total_tokens_obj = tokens
                     else:
                         # merge
                         total_tokens_obj.input_tokens += tokens.input_tokens
                         total_tokens_obj.output_tokens += tokens.output_tokens
-            except Exception:
-                pass
+            except Exception:
+                logging.exception("Failed computing total tokens from cost callback")

269-293: Consider extracting metric mapping to reduce code duplication.

The name_to_metric dictionary construction is duplicated between _evaluate() and _evaluate_metrics_only(). Extracting this logic to a helper method would improve maintainability.

Consider refactoring like this:

def _get_metrics(self, metrics_override: list[str] | None = None):
    """Build and return the requested metrics."""
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=self.model))
    name_to_metric = {
        "faithfulness": Faithfulness(llm=evaluator_llm),
        "answer_relevancy": AnswerRelevancy(llm=evaluator_llm),
        "context_precision": ContextPrecision(llm=evaluator_llm),
        "context_recall": ContextRecall(llm=evaluator_llm),
        "factual_correctness": FactualCorrectness(llm=evaluator_llm),
    }
    if metrics_override is None:
        return list(name_to_metric.values())
    else:
        return [name_to_metric[m] for m in metrics_override]

def _evaluate(self, wrapped_engine, evaluation_dataset, metrics_override: list[str] | None = None) -> EvaluationResult:
    metrics = self._get_metrics(metrics_override)
    result = evaluate(
        query_engine=wrapped_engine,
        metrics=metrics,
        dataset=evaluation_dataset,
        token_usage_parser=get_token_usage_for_openai,
    )
    return result

def _evaluate_metrics_only(self, evaluation_dataset, metrics_override: list[str] | None = None) -> EvaluationResult:
    """
    Evaluate metrics on a dataset that already has responses populated.
    Does not re-query the engine.
    """
    metrics = self._get_metrics(metrics_override)
    result = ragas_evaluate(
        dataset=evaluation_dataset,
        metrics=metrics,
        token_usage_parser=get_token_usage_for_openai,
    )
    return result
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74548d3 and 1b8a0b9.

📒 Files selected for processing (1)
  • evaluation/evaluation.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
evaluation/evaluation.py (1)
worker/tasks.py (1)
  • query_data_sources (61-140)
🪛 Ruff (0.13.3)
evaluation/evaluation.py

82-82: f-string without any placeholders

Remove extraneous f prefix

(F541)


171-171: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


176-176: f-string without any placeholders

Remove extraneous f prefix

(F541)


179-179: f-string without any placeholders

Remove extraneous f prefix

(F541)


215-216: try-except-pass detected, consider logging the exception

(S110)


215-215: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: evaluate
  • GitHub Check: ci / test / Test
🔇 Additional comments (3)
evaluation/evaluation.py (3)

170-174: Verify merge logic handles duplicate user_input+reference+response combinations.

The merge uses ["user_input", "reference", "response"] as join keys. If the filtered dataset contains duplicate combinations of these three fields, the merge could produce unexpected duplicate rows or incorrect metric associations.

Consider adding a check before the merge to ensure uniqueness:

+        # Verify uniqueness of join keys in both dataframes
+        if df_fc[["user_input", "reference", "response"]].duplicated().any():
+            logging.warning("Duplicate join keys found in factual_correctness results")
+        if df_core[["user_input", "reference", "response"]].duplicated().any():
+            logging.warning("Duplicate join keys found in core metrics results")
+        
         merged = df_fc.merge(
             df_core[["user_input", "reference", "response"] + core_cols],
             on=["user_input", "reference", "response"],
             how="left",
         )

247-267: LGTM!

The addition of metrics_override parameter provides flexibility in selecting which metrics to evaluate. The implementation correctly maps metric names to their instances and handles both full evaluation and subset evaluation cases.


82-106: Verify Executor.results() ordering
The code assumes exec.results() returns results in the same order as submissions, mapping directly to evaluation_dataset_all.samples. Confirm that Executor.results() preserves submission order or adapt the logic to align by task name (e.g., query-{i}).
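
If submission order turns out not to be guaranteed, one generic way to realign by the numeric suffix of the task name could look like the sketch below; it assumes results are available as (name, response) pairs such as ("query-3", ...), which is an assumption rather than the Executor's actual return type:

```python
def align_by_task_index(
    named_results: list[tuple[str, str]], num_samples: int
) -> list[str | None]:
    """Place each response at the sample index encoded in its task name ("query-{i}")."""
    aligned: list[str | None] = [None] * num_samples
    for name, response in named_results:
        idx = int(name.rsplit("-", 1)[1])  # "query-3" -> 3
        aligned[idx] = response
    return aligned


# Example: out-of-order results are re-seated by index.
# align_by_task_index([("query-1", "b"), ("query-0", "a")], 2) == ["a", "b"]
```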

- Updated the evaluation metrics to include 'factual_correctness(mode=f1)', enhancing the assessment capabilities of the evaluation workflow (see the averaging sketch below).
- This addition allows for a more comprehensive analysis of the evaluation results.
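
A hedged sketch of what the averaging step over results.csv could look like; the column names mirror the descriptions above, but the actual logic in .github/workflows/evaluation.yml may differ:

```python
import pandas as pd

METRIC_COLUMNS = [
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "factual_correctness(mode=f1)",  # the newly appended metric
]


def compute_averages(csv_path: str = "results.csv") -> dict[str, float]:
    """Average each metric column that is actually present in the results file."""
    df = pd.read_csv(csv_path)
    present = [c for c in METRIC_COLUMNS if c in df.columns]
    return {c: float(df[c].mean()) for c in present}


if __name__ == "__main__":
    for name, value in compute_averages().items():
        print(f"{name}: {value:.4f}")
```
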
amindadgar merged commit d864a43 into main on Oct 12, 2025
4 checks passed


Development

Successfully merging this pull request may close these issues.

Incorrect scoring for out-of-context (OOC) questions in RAG evaluation — low scores instead of high

1 participant