zipJiang/CLC

This repo contains the code for the NeurIPS 2025 paper: Conformal Linguistic Calibration: Trading-off between Factuality and Specificity.

Setup

Install the required packages using pip:

pip install -r requirements.txt

Note: This repository depends on private/external GitHub repositories (tasker and langchain-interface). Ensure you have SSH access configured for GitHub to install these dependencies.

(Optional) Make sure that $(pwd) is in your PYTHONPATH. You can do this by running:

export PYTHONPATH=$(pwd):$PYTHONPATH

General Structure of the repo

configs

This directory contains runnable configurations for the different steps of the experiments. Each configuration file is a YAML file that specifies the hyperparameters for an experiment. You can use the scripts in the scripts directory to run these configurations:

python scripts/run_task.py --config-path configs/<config_file>.yaml

Some of the tasks require an OpenAI-compatible server (e.g., vLLM) running locally. You may need a separate environment for running the vLLM server.

Architecture & Experiment Pipeline

The project is structured around a task-based architecture where each step of the experiment is a distinct "Task" defined in src/tasks.

Core Components

  • src/tasks/: Contains the python logic for each task. Each task inherits from BaseTask (from the tasker library).
  • configs/: YAML files that define the parameters (inputs, outputs, model settings) for each task. The dependencies key in a config file often points to the config of the preceding step.
  • scripts/run_task.py: The generic entry point to run any task defined in a config file.

Note on Outputs: All task outputs will be written under the data/task_outputs/ directory.

Standard Pipeline Flow (SimpleQA)

First, get your raw data ready here. The experiments generally follow this linear sequence of tasks; run each step using the run_task.py script:

  1. Answer Sampling (simpleqa/sample_answer.yaml): Samples multiple phrase-level answers for the generated questions.
  2. Summarization (simpleqa/simpleqa_answer_summarization.yaml): Summarizes the sampled answers to reduce redundancy.
  3. Clustering (simpleqa/simpleqa_attach_answer_to_clusters.yaml -> simpleqa/simpleqa_iterative_clustering.yaml): Groups answers into semantic clusters to identify distinct meanings.
    1. Attaching answers to clusters estimates the multiplicity of each unique answer; this helps identify the majority answer and later allows us to estimate coverage over possible answers.
    2. Iterative clustering groups answers into semantically similar nested clusters so that the declarative rewriter can better cover the different meanings.
  4. Backoff Claim Generation (simpleqa/simpleqa_backoff.yaml): Generates backoff claims based on clustered answers to ensure coverage of all answer meanings.
  5. Evaluating Factuality (simpleqa/simpleqa_scoring.yaml): Scores the generated claims for factuality using a separate evaluation model.
  6. LTT Risk Control (simpleqa/simpleqa_ltt.yaml): Applies the Learn-Then-Test (LTT) procedure to trade off factuality against specificity at the chosen risk levels.
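The final step can be illustrated with a minimal sketch of generic Learn-Then-Test selection (a Hoeffding-bound p-value plus fixed-sequence testing). This is an illustration of the LTT idea only, not the repo's actual implementation; all function names and inputs are hypothetical:

```python
import math

def hoeffding_pvalue(emp_risk, n, alpha):
    # P-value for the null "true risk > alpha", from Hoeffding's inequality
    # applied to the empirical risk over n calibration points.
    return math.exp(-2 * n * max(0.0, alpha - emp_risk) ** 2)

def ltt_select(lambdas, risks, n, alpha, delta):
    """Fixed-sequence testing: scan thresholds from most to least
    conservative, keep each whose null is rejected at level delta,
    and stop at the first failure."""
    valid = []
    for lam, emp_risk in zip(lambdas, risks):
        if hoeffding_pvalue(emp_risk, n, alpha) <= delta:
            valid.append(lam)
        else:
            break
    return valid
```

Any threshold in the returned set then controls the chosen risk (here, a factuality error rate) with high probability, and one can pick the most specific among them.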

After these steps, you should have the necessary outputs to reproduce the results in Section 4.1 (SimpleQA) of the paper.

Natural Questions (NQ) Pipeline

First, download the open dev set of Natural Questions from here and preprocess it into the format required by the repo (preprocessing just requires untarring the archive).

  1. nq/sample_questions.yaml: Samples questions from the NQ dataset.
  2. nq/sample_answers.yaml
  3. nq/nq_answer_summarization.yaml
  4. nq/nq_attach_answer_to_clusters.yaml -> nq/nq_iterative_clustering.yaml
  5. nq/nq_backoff.yaml
  6. nq/nq_scoring.yaml
  7. nq/nq_ltt.yaml

As with SimpleQA, these steps produce the outputs needed to reproduce the NQ results in the paper.
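Since the NQ steps mirror the SimpleQA pipeline, a small driver sketch can assemble the run commands in order (config names taken from the list above, the invocation from the run_task.py usage shown earlier; running them is left to your shell or job scheduler):

```python
# Ordered NQ configs, as listed above.
NQ_CONFIGS = [
    "nq/sample_questions.yaml",
    "nq/sample_answers.yaml",
    "nq/nq_answer_summarization.yaml",
    "nq/nq_attach_answer_to_clusters.yaml",
    "nq/nq_iterative_clustering.yaml",
    "nq/nq_backoff.yaml",
    "nq/nq_scoring.yaml",
    "nq/nq_ltt.yaml",
]

def run_commands(configs):
    """Build the shell command for each pipeline step, in order."""
    return [f"python scripts/run_task.py --config-path configs/{c}" for c in configs]
```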

Pipeline for LLM fine-tuning

The following tasks reproduce the LLM fine-tuning experiments in Section 4.3 (you need to create your own dataset and config files to train on other datasets).

  1. SFT (prepare_claim_rewriter_dataset.yaml -> train_claim_rewriter.yaml): Prepares data and trains the claim rewriter model (e.g., Llama-3) on the clustered data.
  2. *PO tuning (prepare_dpo_claim_rewriter_dataset.yaml -> dpo_train_claim_rewriter.yaml): Further fine-tunes the claim rewriter using preference optimization (PO) techniques. Note that if you want to do SFT + DPO, there is also a configuration file that chains both steps: dp_train_claim_rewriter_from_sft.yaml.
  3. ORPO training is also available: orpo_train_claim_rewriter.yaml.

Example command to run the first step:

python scripts/run_task.py --config-path configs/question_generation.yaml
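Preference-optimization methods like DPO typically train on (prompt, chosen, rejected) triples. A hedged sketch of how such pairs might be assembled from scored rewrites; the field names and scoring scheme are hypothetical and not the repo's actual dataset schema:

```python
def build_preference_pairs(samples):
    """Turn scored claim rewrites into DPO-style preference records.

    Each sample is assumed to hold a source claim and a list of candidate
    rewrites with scalar quality scores (illustrative format only).
    """
    pairs = []
    for s in samples:
        ranked = sorted(s["rewrites"], key=lambda r: r["score"], reverse=True)
        # Need at least two distinct-quality candidates to form a pair.
        if len(ranked) >= 2 and ranked[0]["score"] > ranked[-1]["score"]:
            pairs.append({
                "prompt": s["claim"],
                "chosen": ranked[0]["text"],
                "rejected": ranked[-1]["text"],
            })
    return pairs
```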

Understanding Task Dependencies

Check the dependencies field in each YAML file to see which previous task's output it requires.
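A hypothetical example of what such a dependency chain might look like; only the `dependencies` key and the data/task_outputs/ location are confirmed by this README, the other keys are illustrative:

```yaml
# configs/nq/nq_scoring.yaml -- illustrative keys, not the actual schema
task: scoring
dependencies:
  - configs/nq/nq_backoff.yaml   # the preceding step's output is this task's input
output_dir: data/task_outputs/nq_scoring
```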

Special Instructions for Running Experiments for Section 4.2

Section 4.2 uses some recently released models from Hugging Face and does not follow the general task-based structure of the repo. These experiments can be run directly with the following scripts:

python scripts/clc_rewrite.py --input-path <input_path> --output-path <output_path>

python scripts/clc_info_scoring.py --input-path <input_path>

Note that the input files are the outputs of the LTT tasks, or similar rewrites produced by the trained models.

Citation

@inproceedings{jiang2025conformal,
  title={Conformal Linguistic Calibration: Trading-off between Factuality and Specificity},
  author={Zhengping Jiang and Anqi Liu and Benjamin Van Durme},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=MWF1ZzYnxJ}
}

About

Official Repo for "Conformal Linguistic Calibration: Trading-off between Factuality and Specificity"
