This repo contains the code for the NeurIPS 2025 paper: Conformal Linguistic Calibration: Trading-off between Factuality and Specificity.
Install the required packages using pip:

```shell
pip install -r requirements.txt
```

Note: This repository depends on private/external GitHub repositories (`tasker` and `langchain-interface`). Ensure you have SSH access configured for GitHub to install these dependencies.
(Optional) Make sure that `$(pwd)` is in your `PYTHONPATH`. You can do this by running:

```shell
export PYTHONPATH=$(pwd):$PYTHONPATH
```

The `configs/` directory contains runnable configurations for the different steps of the experiments. Each configuration file is a YAML file that specifies the hyperparameters for an experiment. You can use the scripts in the `scripts` directory to run these configurations:

```shell
python scripts/run_task.py --config-path configs/<config_file>.yaml
```

Some of the tasks require an OpenAI-compatible server (e.g., vLLM) running locally. You may need a separate environment for running the vLLM server.
The project is structured around a task-based architecture where each step of the experiment is a distinct "Task" defined in src/tasks.
- `src/tasks/`: Contains the Python logic for each task. Each task inherits from `BaseTask` (from the `tasker` library).
- `configs/`: YAML files that define the parameters (inputs, outputs, model settings) for each task. The `dependencies` key in a config file often points to the config of the preceding step.
- `scripts/run_task.py`: The generic entry point to run any task defined in a config file.
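For orientation, a task config generally looks something like the sketch below. All field names and values here are hypothetical illustrations; consult the actual files under `configs/` for the real schema.

```yaml
# Hypothetical task config -- field names and values are illustrative only;
# check the real files under configs/ for the actual schema.
task: AnswerSamplingTask        # a task class defined in src/tasks/
dependencies:
  - configs/simpleqa/question_generation.yaml   # config of the preceding step
output_path: data/task_outputs/sample_answer
model:
  name: meta-llama/Meta-Llama-3-8B-Instruct
  temperature: 1.0
  num_samples: 20
```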
Note on Outputs: All task outputs will be written under the `data/task_outputs/` directory.
First, get your raw data ready here.
The experiments generally follow this linear sequence of tasks. You run each step using the run_task.py script:
- Answer Sampling (`simpleqa/sample_answer.yaml`): Samples multiple phrase-level answers for the generated questions.
- Summarization (`simpleqa/simpleqa_answer_summarization.yaml`): Summarizes the sampled answers to reduce redundancy.
- Clustering (`simpleqa/simpleqa_attach_answer_to_clusters.yaml` -> `simpleqa/simpleqa_iterative_clustering.yaml`): Groups answers into semantic clusters to identify distinct meanings.
  - Attaching answers to clusters estimates the multiplicity of each unique answer; this helps us identify the majority answer and later lets us estimate coverage over possible answers.
  - Iterative clustering groups answers into semantically similar nested clusters, so that the declarative rewriter can better cover the different meanings.
- Backoff Claim Generation (`simpleqa/simpleqa_backoff.yaml`): Generates backoff claims from the clustered answers to ensure coverage of all answer meanings.
- Evaluating Factuality (`simpleqa/simpleqa_scoring.yaml`): Scores the generated claims for factuality using a separate evaluation model.
- LTT Risk Control (`simpleqa/simpleqa_ltt.yaml`): Applies the Learn-Then-Test (LTT) method to balance factuality and specificity at the specified risk levels.
After this step, you should have the necessary outputs to reproduce the results in Section 4.1 (SimpleQA) of the paper.
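For intuition about the risk-control step, Learn-Then-Test can be sketched as fixed-sequence testing over candidate back-off levels: each level's observed factual-error count on calibration data is tested against a target risk level α via a binomial tail p-value, and the sweep from most hedged to most specific stops at the first failed test. The sketch below is a self-contained toy, not the repo's implementation; the candidate levels and error counts are made up.

```python
import math

def binom_pvalue(n_errors: int, n: int, alpha: float) -> float:
    """P-value for H0: true error rate > alpha, given n_errors errors
    out of n calibration examples (binomial tail test)."""
    # P[Bin(n, alpha) <= n_errors]: small when n_errors is well below alpha * n
    return sum(math.comb(n, k) * alpha**k * (1 - alpha)**(n - k)
               for k in range(n_errors + 1))

def ltt_select(error_counts, n, alpha, delta):
    """Fixed-sequence Learn-Then-Test: walk candidate levels from the most
    hedged to the most specific, keep each level whose p-value is at most
    delta, and stop at the first failure."""
    selected = []
    for level, errs in error_counts:
        if binom_pvalue(errs, n, alpha) <= delta:
            selected.append(level)
        else:
            break
    return selected

# Toy calibration data: (back-off level, factual errors out of n=200 claims),
# ordered from most hedged (fewest errors) to most specific (most errors).
counts = [(0.9, 2), (0.7, 8), (0.5, 15), (0.3, 40)]
valid = ltt_select(counts, n=200, alpha=0.15, delta=0.05)
# The most specific level in `valid` is the one to deploy.
```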
First, download the open dev set of Natural Questions from here and preprocess it into the format required by the repo (preprocessing just requires untarring the download).
- `nq/sample_questions.yaml`: Samples questions from the NQ dataset.
- `nq/sample_answers.yaml`
- `nq/nq_answer_summarization.yaml`
- `nq/nq_attach_answer_to_clusters.yaml` -> `nq/nq_iterative_clustering.yaml`
- `nq/nq_backoff.yaml`
- `nq/nq_scoring.yaml`
- `nq/nq_ltt.yaml`
These steps mirror the SimpleQA pipeline described above.
The following tasks can be used to reproduce the LLM fine-tuning experiments in Section 4.3 (you need to create your own dataset and config files for training on other datasets).
- SFT (`prepare_claim_rewriter_dataset.yaml` -> `train_claim_rewriter.yaml`): Prepares data and trains the claim rewriter model (e.g., Llama-3) on the clustered data.
- *PO tuning (`prepare_dpo_claim_rewriter_dataset.yaml` -> `dpo_train_claim_rewriter.yaml`): Further fine-tunes the claim rewriter using preference optimization (PO) techniques. Note that if you want to do SFT + DPO, there is also a configuration file that chains both steps: `dp_train_claim_rewriter_from_sft.yaml`.
- ORPO training is also available: `orpo_train_claim_rewriter.yaml`.
Example command to run the first step:
```shell
python scripts/run_task.py --config-path configs/question_generation.yaml
```

Check the `dependencies` field in each YAML file to see which previous task's output it requires.
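Since each config's `dependencies` field points at the preceding step, you can recover the full pipeline order for any config by walking that field transitively. Below is a minimal sketch: the `dependencies` key follows the repo's convention, but the loader and the toy configs are illustrative stand-ins for parsed YAML files.

```python
def dependency_chain(config_path, load_config):
    """Return config paths from the earliest dependency to `config_path`,
    following each config's `dependencies` field.

    `load_config` maps a path to its parsed-YAML dict; it is injected here
    to keep the sketch self-contained (a real version would read the file
    with yaml.safe_load).
    """
    chain, stack, seen = [], [config_path], set()
    while stack:
        path = stack.pop()
        if path in seen:
            continue  # guard against cyclic dependencies
        seen.add(path)
        chain.append(path)
        stack.extend(load_config(path).get("dependencies", []))
    return list(reversed(chain))

# Toy in-memory "config files" standing in for parsed YAML:
configs = {
    "configs/simpleqa/simpleqa_ltt.yaml":
        {"dependencies": ["configs/simpleqa/simpleqa_scoring.yaml"]},
    "configs/simpleqa/simpleqa_scoring.yaml":
        {"dependencies": ["configs/simpleqa/simpleqa_backoff.yaml"]},
    "configs/simpleqa/simpleqa_backoff.yaml": {"dependencies": []},
}
chain = dependency_chain("configs/simpleqa/simpleqa_ltt.yaml", configs.__getitem__)
```

Running each path in `chain` order with `scripts/run_task.py` reproduces the pipeline for that config.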
Section 4.2 uses some recently released models from Hugging Face and does not adhere to the general task-based structure of the repo. These experiments can be run directly with the following scripts:
```shell
python scripts/clc_rewrite.py --input-path <input_path> --output-path <output_path>
python scripts/clc_info_scoring.py --input-path <input_path>
```
Note that the input files are the outputs of the LTT tasks, or comparable rewrites produced by the trained models.
```bibtex
@inproceedings{
jiang2025conformal,
title={Conformal Linguistic Calibration: Trading-off between Factuality and Specificity},
author={Zhengping Jiang and Anqi Liu and Benjamin Van Durme},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=MWF1ZzYnxJ}
}
```