Repository for the paper "Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference".
- (Optional) Create a virtual environment.
- Install necessary packages via `pip install -r requirements.txt`.
- Download datasets via Hugging Face; a Hugging Face token is required. Set the env variable `HF_TOKEN` or alternatively put your key in the `hf_token.key` file. After setting up the token, you can run `python huggingface_download.py` to download the MMLU and PopQA datasets.
First, we need to run the evaluation using only the P(T) confidence.
For example, the following command evaluates the LLaMA 3 3B model on the MMLU dataset.
If you want to evaluate on PopQA, make sure to change the prompt path (`prompt/popqa-chat.txt`) and the task type (`generative`).
```
python run_eval.py --model_name llama_3b \
    --dataset_name mmlu \
    --dataset_split test \
    --prompt_path prompt/mmlu.txt \
    --task_type multiple_choice
```

The evaluation result will be stored in `./eval_results/{dataset_name}/{model}.tsv`.
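As a sketch of how the per-sample results in such a TSV might be consumed downstream (the `correct` column name is an assumption; check the actual TSV header written by `run_eval.py`):

```python
import csv

def accuracy_from_tsv(path: str, correct_col: str = "correct") -> float:
    """Compute overall accuracy from a per-sample TSV with a 0/1 correctness column."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return sum(int(r[correct_col]) for r in rows) / len(rows)
```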
After obtaining the evaluation results (the ground truth for training the classifier), we still need the classifier's input: the model's hidden states.
```
python extract_hidden_states.py --model_name llama_3b \
    --dataset_name mmlu \
    --dataset_split test \
    --prompt_path prompt/mmlu.txt
```

Now we can train the MLP using the following command:
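Conceptually, the classifier input is one hidden-state vector per example, pooled from a transformer layer's output. A minimal NumPy sketch of last-token pooling follows; the actual layer and pooling choices in `extract_hidden_states.py` may differ:

```python
import numpy as np

def pool_last_token(hidden_states: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Select the hidden state of each sequence's last real (non-padding) token.

    hidden_states: (batch, seq_len, dim) array from one transformer layer.
    lengths: (batch,) number of non-padding tokens per sequence.
    """
    batch = np.arange(hidden_states.shape[0])
    return hidden_states[batch, lengths - 1]  # (batch, dim)
```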
```
python train_classifier.py --model_name llama_3b \
    --dataset_name mmlu \
    --dataset_split test \
    --dataset_group_by subject
```

For PopQA, note that the `--dataset_group_by` flag should be `prop` instead.
The trained model will be stored in `./mlp/{dataset_name}/{model_name}`.
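The P(IK) classifier is an MLP over pooled hidden states. A minimal sketch with scikit-learn is shown below on synthetic stand-in data; the architecture and hyperparameters in the repository's `train_classifier.py` are not reproduced here:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))      # stand-in for hidden-state features
y = (X[:, 0] > 0).astype(int)       # stand-in for "model answered correctly"

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
p_ik = clf.predict_proba(X)[:, 1]   # P(IK): probability the model knows the answer
```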
Use `analysis.py` to view the aggregated results:

```
python analysis.py --models llama_3b,llama_8b \
    --dataset_name mmlu \
    --use_ik
```

Example output:

```
Model: llama_3b
Overall (1430) Accuracy: 0.6392
Confident (587) Accuracy: 0.8620
Not Confident (843) Accuracy: 0.4840

Model: llama_8b
Overall (1430) Accuracy: 0.6986
Confident (734) Accuracy: 0.9019
Not Confident (696) Accuracy: 0.4842

Total evaluated samples: 1430

| Models               | Accuracy | Cost  |
| -------------------- | -------- | ----- |
| llama_3b             | 0.6392   | 4290  |
| llama_8b             | 0.6986   | 11440 |
| llama_3b -> llama_8b | 0.6902   | 10662 |

Query Distribution:
source
llama_8b    843
llama_3b    587
```
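The cascade row (`llama_3b -> llama_8b`) can be understood with a simple cost model: the small model answers the queries it is confident about and routes the rest to the large model. The sketch below uses illustrative per-query costs and subgroup accuracies, not the exact accounting in `analysis.py`:

```python
def cascade_stats(n_confident, acc_small_conf, n_routed, acc_large_routed,
                  cost_small, cost_large):
    """Accuracy and cost of a two-model cascade.

    Assumes every query pays the small model's cost, and routed queries
    additionally pay the large model's cost.
    """
    n_total = n_confident + n_routed
    correct = n_confident * acc_small_conf + n_routed * acc_large_routed
    cost = n_total * cost_small + n_routed * cost_large
    return correct / n_total, cost

# Illustrative numbers (routed-subset accuracy and cost units are assumptions):
acc, cost = cascade_stats(587, 0.8620, 843, 0.57, cost_small=3, cost_large=8)
```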
- The results vary due to the random seed used while training the MLP; your results might be better or worse than those reported in the paper.
- `--use_ik`: This flag tells the script to use the P(IK) classifier's result.
- `--use_whole_dataset`: Set this flag to evaluate the whole dataset (primarily for testing P(T)'s effect).
- `--confidence_threshold=0.9`: Customize the confidence threshold.