📈 Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
- DeepVerifier enables self-evolving Deep Research Agents (DRAs) by verifying an agent's draft answer, generating rubric-guided feedback, and iterating, which yields an inference-time scaling effect without additional training (see the sketch right after this list).
- We build an automatically constructed DRA Failure Taxonomy (5 major classes, 13 subclasses) and derive structured rubrics to make verification and feedback more targeted and reliable.
- Across challenging benchmarks (e.g., GAIA / XBench-DeepSearch / BrowseComp), DeepVerifier improves verification quality and supports multi-round refinement for stronger final task accuracy.
- 🧠 Verification via Asymmetry + Decomposition: breaks hard verification into small, source-checkable questions.
- 📜 Rubric-Guided Feedback: taxonomy-derived rubrics produce actionable, structured corrections (not just “judge” scores).
- 🔌 Plug-and-Play Test-Time Self-Evolution: integrates into existing agent pipelines as a verifier + feedback module.
- 📦 DeepVerifier-4K Release: a curated SFT dataset (4,646 pairs) to train stronger reflection and self-critique in open models.
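A minimal sketch of the test-time loop described above, assuming hypothetical `run_agent` and `verify_with_rubrics` interfaces (the actual verifier lives in `System/ckv3/DeepVerifier/verifier.py`):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    correct: bool          # verifier's judgement of the draft answer
    feedback: str = ""     # rubric-guided, actionable corrections

def self_evolve(task, run_agent, verify_with_rubrics, max_retries: int = 3) -> dict:
    """Run the agent, verify its draft answer, and retry with feedback until accepted."""
    attempts, feedback = [], None
    for _ in range(max_retries + 1):
        answer = run_agent(task, feedback=feedback)    # draft answer, optionally conditioned on feedback
        verdict = verify_with_rubrics(task, answer)    # verification decomposed into small, source-checkable questions
        attempts.append({"answer": answer,
                         "correct": verdict.correct,
                         "feedback": verdict.feedback})
        if verdict.correct:
            break
        feedback = verdict.feedback                    # structured corrections guide the next try
    return {"task": task,
            "final_answer": attempts[-1]["answer"],
            "attempts": attempts}                      # mirrors the "attempts" field described below
```

Running more verification-and-retry rounds trades extra inference compute for accuracy, which is the inference-time scaling effect referred to above.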
- All related code is in `System/ckv3/DeepVerifier/verifier.py`.
- Datasets are in `data/dataset`.
- Please install necessary dependencies following Cognitive Kernel-Pro.
There are three running modes:
1. Run verification
2. Run the CK Agent
3. Analyze the outputs of 1 or 2 and calculate accuracy
Run verification: the input is a ck_agent trajectory (jsonl) and the output is a verifier trajectory (jsonl).
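Trajectories are plain jsonl (one JSON object per line), so you can quickly inspect a file before or after verification; the exact field names depend on the agent/verifier, so this sketch only prints the keys:

```python
import json

# Peek at the schema of a ck_agent or verifier trajectory (jsonl: one JSON object per line).
path = "path/to/trajectory.jsonl"   # replace with your ck_agent or verifier trajectory
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records")
print("keys in the first record:", sorted(records[0].keys()))
```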
There are three different verifiers:
- LLM Verifier
- Simple Agent Verifier
- Deep Verifier
To run these verifiers, first `cd run_scripts/`, then:
- Export environment variables (OpenAI, Azure OpenAI, or AWS keys):
export OPENAI_API_KEY="YOUR_API_KEY"
export OPENAI_ENDPOINT="YOUR_ENDPOINT"
export OPENAI_API_VERSION="YOUR_API_VERSION"
export AZURE_OPENAI_ENDPOINT="YOUR_ENDPOINT"
export AZURE_OPENAI_API_KEY="YOUR_API_KEY"
export AZURE_OPENAI_API_VERSION="YOUR_API_VERSION"
export AWS_ACCESS_KEY="YOUR_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
- In verify.sh: modify the `default_input` and `default_output` parameters in lines 49-50, and `--project_path` in line 67.
- Run the following command:
bash verify.sh 0 deep_verifier # for deep verifier
bash verify.sh 0 llm_verifier # for llm verifier
bash verify.sh 0 simple_verifier # for simple verifier
# the first argument is the port index of the headless browser: 0 is 3000, 1 is 3001, x is 300x

Run CK Agent: the input is a GAIA query (jsonl) or a **ck_agent trajectory** (jsonl); the output is also a ck_agent trajectory (jsonl).
To run the CK Agent, first `cd run_scripts/`, then:
- In verify.sh: modify the `default_input` and `default_output` parameters in lines 49-50, and other parameters in lines 57-67:
--verify bool # Whether to use DeepVerifier to verify the ck_agent's answer and retry if the answer is incorrect.
--provide_feedback # Whether to provide feedback to ck_agent in the next try when the answer is incorrect.
--max_retries int # The maximum number of retries for ck_agent when the answer is incorrect.
# all retries will be recorded in the ck_agent's trajectory in the "attempts" field.
- Run the following command:
bash verify.sh 0 ck_agent

Analyze: the input is a ck_agent trajectory (jsonl) or a verifier trajectory (jsonl); the output is a CSV file and the printed accuracy.
bash verify.sh analyze path/to/the/jsonl_file
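Beyond the aggregate accuracy printed by the analyze mode, you can inspect the retries recorded in each record's "attempts" field. A small sketch, assuming one entry per try in that field (other field names in your trajectories may differ):

```python
import json
from collections import Counter

# Summarize how many tries each task needed, using the "attempts" field documented above.
path = "path/to/ck_agent_trajectory.jsonl"
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

tries = Counter(len(r.get("attempts", [])) or 1 for r in records)
for n_attempts, n_tasks in sorted(tries.items()):
    print(f"{n_tasks} task(s) finished after {n_attempts} attempt(s)")
```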
bash run_script/split.sh: split a jsonl into multiple parts for parallel running
bash run_script/merge.sh: merge multiple parts into a single file
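One possible parallelization pattern, sketched below under assumptions: split the input jsonl with split.sh, launch one verify.sh shard per headless-browser port, then merge the outputs with merge.sh. The split.sh/merge.sh arguments are not shown here (check the scripts), and each shard must be configured with its own `default_input` / `default_output`.

```python
import subprocess

# Launch several verify.sh shards in parallel, one headless-browser port per shard
# (port index 0 -> 3000, 1 -> 3001, ...). Splitting the input and merging the outputs
# is done with run_script/split.sh and run_script/merge.sh (arguments not shown here).
# Run this from the directory that contains verify.sh, with each shard pointed at its
# own default_input / default_output.
NUM_SHARDS = 3
procs = [subprocess.Popen(["bash", "verify.sh", str(i), "deep_verifier"])
         for i in range(NUM_SHARDS)]
for p in procs:
    p.wait()   # block until every shard finishes before merging outputs
```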
- Deep Research Agent framework: Cognitive Kernel-Pro
- Agent Self-Evolving Research, including WebEvolver, WebCoT, WebVoyager, OpenWebVoyager, WebAggregatorQA.
@misc{wan2026inference,
title={Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification},
author={Wan, Yuxuan and Fang, Tianqing and Li, Zaitang and Huo, Yintong and Wang, Wenxuan and Mi, Haitao and Yu, Dong and Lyu, Michael R},
year={2026},
eprint={2601.15808},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.15808},
}
@misc{fang2025cognitivekernelpro,
title={Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training},
author={Tianqing Fang and Zhisong Zhang and Xiaoyang Wang and Rui Wang and Can Qin and Yuxuan Wan and Jun-Yu Ma and Ce Zhang and Jiaqi Chen and Xiyun Li and Hongming Zhang and Haitao Mi and Dong Yu},
year={2025},
eprint={2508.00414},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.00414},
}
