feat: trying the FactualCorrectness! #223

Merged

amindadgar merged 3 commits into main from fix/222-incorrenct-eval-scoring on Oct 12, 2025

Conversation

@amindadgar (Member) commented Oct 9, 2025

  • The other scores should be tweaked as well.
  • FactualScore link is https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/factual_correctness/#factual-correctness

Summary by CodeRabbit

  • New Features

    • Added Factual Correctness metric to evaluation output.
    • New multi-phase evaluation: factual correctness computed for all items, core metrics run on filtered items, then results merged.
    • Added a “metrics-only” evaluation path to run metrics against pre-populated responses without re-querying.
  • Bug Fixes / Improvements

    • Consolidated cost/token reporting across multiple cost callbacks for unified summaries.
    • Better handling of NO_ANSWER references to reduce noise in core metric calculations.

@coderabbitai coderabbitai bot (Contributor) commented Oct 9, 2025

Walkthrough

Implements a multi-phase evaluation: batched execution to populate responses, compute factual_correctness across all items, filter out rows with NO_ANSWER_REFERENCE for core metrics (faithfulness, answer_relevancy, context_precision, context_recall), then merge metrics and persist aggregated costs across multiple Cost callbacks.
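
As a rough, hedged sketch of the filter-and-merge step described above (not the project's actual implementation; the sentinel value assigned to NO_ANSWER_REFERENCE, the helper name merge_metric_frames, and the to_pandas() calls in the usage comment are assumptions):

```python
import pandas as pd

# Illustrative sentinel; the real constant lives in evaluation/evaluation.py.
NO_ANSWER_REFERENCE = "NO_ANSWER"


def merge_metric_frames(
    df_fc: pd.DataFrame, df_core: pd.DataFrame, core_cols: list[str]
) -> pd.DataFrame:
    """Left-join the core-metric columns onto the factual-correctness frame.

    df_fc holds factual_correctness for every row; df_core holds the core
    metrics computed only on rows whose reference is not NO_ANSWER_REFERENCE,
    so unmatched rows keep NaN in the core-metric columns.
    """
    keys = ["user_input", "reference", "response"]
    return df_fc.merge(df_core[keys + core_cols], on=keys, how="left")


# Usage sketch (assumes the result objects expose a to_pandas() conversion):
# df_fc = result_fc.to_pandas()                      # factual_correctness on all rows
# subset = df_fc[df_fc["reference"] != NO_ANSWER_REFERENCE]
# df_core = result_core.to_pandas()                  # core metrics on the filtered subset
# merged = merge_metric_frames(
#     df_fc, df_core,
#     ["faithfulness", "answer_relevancy", "context_precision", "context_recall"],
# )
# merged.to_csv("results.csv", index=False)
```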

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Evaluation Core Flow (`evaluation/evaluation.py`) | Replaced single-pass evaluation with a multi-phase flow using Executor/RunConfig and EvaluationDataset: populate responses in batches, compute FactualCorrectness for all rows, filter out NO_ANSWER_REFERENCE rows, run core metrics on the filtered subset, and merge results back into the full dataset. Added `_evaluate_metrics_only` to run metrics on pre-populated responses. |
| Cost Aggregation & Persistence (`evaluation/evaluation.py`) | `_persist_cost` updated to accept either a single EvaluationResult or a list; aggregates input/output tokens and total costs across multiple Cost callbacks and serializes a merged payload instead of handling a single callback (a sketch of this aggregation follows the table). |
| Public Interface / Signatures & Imports (`evaluation/evaluation.py`) | Updated `_evaluate(...)` to accept an optional `metrics_override` parameter (`list[str]` or `None`) and adjusted imports to support the new flow. |
| Data Handling & Helpers (`evaluation/evaluation.py`) | Added pandas-based intermediate handling for filtering/merging results, a robust context-parsing helper, use of NO_ANSWER_REFERENCE for filtering, and cost-persistence semantics that merge multiple callback payloads. |
| CI / Metrics Inclusion (`.github/workflows/evaluation.yml`) | Appends `factual_correctness(mode=f1)` to the metrics list used when computing averages from results.csv, so averaged outputs include the new metric if present. |
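
For the cost-aggregation row above, a minimal sketch of how token totals from several cost callbacks might be merged into one payload; TokenUsage, aggregate_costs, and the per-token prices are illustrative names, and only cb.total_tokens() mirrors the interface visible in the review comments below:

```python
import json
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Stand-in for the token-usage object a cost callback returns (illustrative)."""
    input_tokens: int = 0
    output_tokens: int = 0


def aggregate_costs(
    cost_callbacks,
    cost_per_input_token: float,
    cost_per_output_token: float,
    path: str = "results_cost.json",
) -> None:
    """Sum token counts across callbacks, derive a total cost, and write one merged payload."""
    total = TokenUsage()
    for cb in cost_callbacks:
        usage = cb.total_tokens()  # assumed interface, mirrored from the review diff further below
        if usage is not None:
            total.input_tokens += usage.input_tokens
            total.output_tokens += usage.output_tokens
    total_cost = (
        total.input_tokens * cost_per_input_token
        + total.output_tokens * cost_per_output_token
    )
    payload = {
        "input_tokens": total.input_tokens,
        "output_tokens": total.output_tokens,
        "total_cost": total_cost,
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```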

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Orchestrator as Evaluation
  participant Executor
  participant Engine as LLM Engine
  participant EvalDS as EvaluationDataset
  participant MetricsRunner as Metrics
  participant Storage as Persist

  rect rgba(200,230,255,0.3)
  Note over Orchestrator,Executor: Phase 1 — Populate responses (batched)
  Orchestrator->>Executor: build RunConfig & batch queries
  Executor->>Engine: execute queries
  Engine-->>Executor: responses per item
  Executor-->>Orchestrator: populated EvalDS
  end

  rect rgba(235,255,200,0.3)
  Note over Orchestrator,MetricsRunner: Phase 2 — Factual correctness (all rows)
  Orchestrator->>MetricsRunner: run FactualCorrectness on EvalDS
  MetricsRunner->>Engine: evaluator LLM calls (if required)
  MetricsRunner-->>Orchestrator: factual_correctness results
  end

  rect rgba(255,230,230,0.3)
  Note over Orchestrator,EvalDS: Phase 3 — Filter & core metrics
  Orchestrator->>EvalDS: filter out `NO_ANSWER_REFERENCE` rows
  Orchestrator->>MetricsRunner: run core metrics on filtered EvalDS
  MetricsRunner-->>Orchestrator: core metric results
  end

  rect rgba(240,240,240,0.4)
  Note over Orchestrator,Storage: Phase 4 — Merge & persist
  Orchestrator->>Orchestrator: merge factual + core metrics into full table
  Orchestrator->>Storage: _persist_cost(results:list|single) -> aggregated cost payload
  Storage-->>Orchestrator: persisted
  end
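
To make Phase 1 of the diagram concrete, here is a small, generic batching sketch; run_in_batches, query_fn, and batch_size are illustrative and do not reflect the actual Executor/RunConfig API used in evaluation/evaluation.py:

```python
from typing import Callable


def run_in_batches(
    queries: list[str], query_fn: Callable[[str], str], batch_size: int = 8
) -> list[str]:
    """Send queries in fixed-size batches and keep responses in submission order."""
    responses: list[str] = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start : start + batch_size]
        # In the real flow, the Executor/RunConfig machinery dispatches the batch here.
        responses.extend(query_fn(q) for q in batch)
    return responses
```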

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • WIP: Added the evaluation process! #200 — Overlaps on evaluation/evaluation.py changes around evaluation orchestration, cost persistence, and _parse_contexts; likely touches similar code paths and helpers.

Poem

I hop through rows, nibbling each clue,
Truth for all first, then metrics for a few.
I gather tokens, costs in a heap,
Merge them back where results sleep.
A rabbit’s hop—efficient and new. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title Check | ❓ Inconclusive | The title is somewhat related but is vague and informal, using “trying” instead of stating the actual change. It fails to clearly summarize the main addition of the FactualCorrectness metric and leaves ambiguity about the precise scope. Adopting a more descriptive phrasing would help teammates quickly understand the primary goal of this pull request. | Consider renaming the title to clearly describe the change. For example, “Add FactualCorrectness metric to evaluation workflow” would succinctly convey the main update. This would improve clarity for anyone scanning the repository history. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8a0b9 and 1cddc2e.

📒 Files selected for processing (1)
  • .github/workflows/evaluation.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: evaluate
  • GitHub Check: ci / test / Test
  • GitHub Check: ci / lint / Lint


- Added functionality to evaluate metrics on datasets with populated responses without re-querying the engine.
- Implemented filtering for core metrics to exclude rows where both reference and response are marked as NO_ANSWER_REFERENCE.
- Updated cost persistence to handle multiple evaluation results and improved logging for better traceability.
- Refactored the evaluation method to allow for metric overrides, enhancing flexibility in evaluation configurations.
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
evaluation/evaluation.py (5)

82-82: Remove extraneous f prefix from logging statement.

The f-string has no placeholders.

Apply this diff:

-        logging.info(f"Running queries to get responses for all rows...")
+        logging.info("Running queries to get responses for all rows...")

171-171: Consider using iterable unpacking for concatenation.

While the current concatenation is functional, iterable unpacking would be more Pythonic and slightly more efficient.

Apply this diff:

         merged = df_fc.merge(
-            df_core[["user_input", "reference", "response"] + core_cols],
+            df_core[[*["user_input", "reference", "response"], *core_cols]],
             on=["user_input", "reference", "response"],
             how="left",
         )

176-176: Remove extraneous f prefixes from logging statements.

These f-strings have no placeholders.

Apply this diff:

-        logging.info(f"Persisting results to results.csv")
+        logging.info("Persisting results to results.csv")
         merged.to_csv("results.csv", index=False)
 
-        logging.info(f"Persisting cost information to results_cost.json...")
+        logging.info("Persisting cost information to results_cost.json...")
         self._persist_cost([result_fc, result_core], "results_cost.json")

Also applies to: 179-179


215-216: Log exceptions instead of silently passing.

Silently catching and ignoring exceptions when computing token usage makes debugging difficult and could hide issues with cost tracking.

Apply this diff:

             try:
                 tokens = cb.total_tokens()
                 if tokens is not None:
                     if total_tokens_obj is None:
                         total_tokens_obj = tokens
                     else:
                         # merge
                         total_tokens_obj.input_tokens += tokens.input_tokens
                         total_tokens_obj.output_tokens += tokens.output_tokens
-            except Exception:
-                pass
+            except Exception:
+                logging.exception("Failed computing total tokens from cost callback")

269-293: Consider extracting metric mapping to reduce code duplication.

The name_to_metric dictionary construction is duplicated between _evaluate() and _evaluate_metrics_only(). Extracting this logic to a helper method would improve maintainability.

Consider refactoring like this:

def _get_metrics(self, metrics_override: list[str] | None = None):
    """Build and return the requested metrics."""
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=self.model))
    name_to_metric = {
        "faithfulness": Faithfulness(llm=evaluator_llm),
        "answer_relevancy": AnswerRelevancy(llm=evaluator_llm),
        "context_precision": ContextPrecision(llm=evaluator_llm),
        "context_recall": ContextRecall(llm=evaluator_llm),
        "factual_correctness": FactualCorrectness(llm=evaluator_llm),
    }
    if metrics_override is None:
        return list(name_to_metric.values())
    else:
        return [name_to_metric[m] for m in metrics_override]

def _evaluate(self, wrapped_engine, evaluation_dataset, metrics_override: list[str] | None = None) -> EvaluationResult:
    metrics = self._get_metrics(metrics_override)
    result = evaluate(
        query_engine=wrapped_engine,
        metrics=metrics,
        dataset=evaluation_dataset,
        token_usage_parser=get_token_usage_for_openai,
    )
    return result

def _evaluate_metrics_only(self, evaluation_dataset, metrics_override: list[str] | None = None) -> EvaluationResult:
    """
    Evaluate metrics on a dataset that already has responses populated.
    Does not re-query the engine.
    """
    metrics = self._get_metrics(metrics_override)
    result = ragas_evaluate(
        dataset=evaluation_dataset,
        metrics=metrics,
        token_usage_parser=get_token_usage_for_openai,
    )
    return result
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74548d3 and 1b8a0b9.

📒 Files selected for processing (1)
  • evaluation/evaluation.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
evaluation/evaluation.py (1)
worker/tasks.py (1)
  • query_data_sources (61-140)
🪛 Ruff (0.13.3)
evaluation/evaluation.py

82-82: f-string without any placeholders

Remove extraneous f prefix

(F541)


171-171: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


176-176: f-string without any placeholders

Remove extraneous f prefix

(F541)


179-179: f-string without any placeholders

Remove extraneous f prefix

(F541)


215-216: try-except-pass detected, consider logging the exception

(S110)


215-215: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: evaluate
  • GitHub Check: ci / test / Test
🔇 Additional comments (3)
evaluation/evaluation.py (3)

170-174: Verify merge logic handles duplicate user_input+reference+response combinations.

The merge uses ["user_input", "reference", "response"] as join keys. If the filtered dataset contains duplicate combinations of these three fields, the merge could produce unexpected duplicate rows or incorrect metric associations.

Consider adding a check before the merge to ensure uniqueness:

+        # Verify uniqueness of join keys in both dataframes
+        if df_fc[["user_input", "reference", "response"]].duplicated().any():
+            logging.warning("Duplicate join keys found in factual_correctness results")
+        if df_core[["user_input", "reference", "response"]].duplicated().any():
+            logging.warning("Duplicate join keys found in core metrics results")
+        
         merged = df_fc.merge(
             df_core[["user_input", "reference", "response"] + core_cols],
             on=["user_input", "reference", "response"],
             how="left",
         )

247-267: LGTM!

The addition of metrics_override parameter provides flexibility in selecting which metrics to evaluate. The implementation correctly maps metric names to their instances and handles both full evaluation and subset evaluation cases.


82-106: Verify Executor.results() ordering
The code assumes exec.results() returns results in the same order as submissions, mapping directly to evaluation_dataset_all.samples. Confirm that Executor.results() preserves submission order or adapt the logic to align by task name (e.g., query-{i}).
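
If submission order turns out not to be guaranteed, one generic way to realign by the numeric suffix of the task name could look like the sketch below; it assumes results are available as (name, response) pairs such as ("query-3", ...), which is an assumption rather than the Executor's actual return type:

```python
def align_by_task_index(
    named_results: list[tuple[str, str]], num_samples: int
) -> list[str | None]:
    """Place each response at the sample index encoded in its task name ("query-{i}")."""
    aligned: list[str | None] = [None] * num_samples
    for name, response in named_results:
        idx = int(name.rsplit("-", 1)[1])  # "query-3" -> 3
        aligned[idx] = response
    return aligned


# Example: out-of-order results are re-seated by index.
# align_by_task_index([("query-1", "b"), ("query-0", "a")], 2) == ["a", "b"]
```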

- Updated the evaluation metrics to include 'factual_correctness(mode=f1)', enhancing the assessment capabilities of the evaluation workflow (see the averaging sketch below).
- This addition allows for a more comprehensive analysis of the evaluation results.
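
A hedged sketch of what the averaging step over results.csv could look like; the column names mirror the descriptions above, but the actual logic in .github/workflows/evaluation.yml may differ:

```python
import pandas as pd

METRIC_COLUMNS = [
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "factual_correctness(mode=f1)",  # the newly appended metric
]


def compute_averages(csv_path: str = "results.csv") -> dict[str, float]:
    """Average each metric column that is actually present in the results file."""
    df = pd.read_csv(csv_path)
    present = [c for c in METRIC_COLUMNS if c in df.columns]
    return {c: float(df[c].mean()) for c in present}


if __name__ == "__main__":
    for name, value in compute_averages().items():
        print(f"{name}: {value:.4f}")
```
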
amindadgar merged commit d864a43 into main on Oct 12, 2025
4 checks passed


Development

Successfully merging this pull request may close these issues.

Incorrect scoring for out-of-context (OOC) questions in RAG evaluation — low scores instead of high

1 participant