MSR14/FineSurE-NLP
FineSurE: Fine-grained Summarization Evaluation using LLMs (ACL'24-main, Long Paper)

Here is the link to the paper we are reproducing on arXiv: [link]

The structure of the project:

  • dataset: Contains the FRANK and REALSumm datasets in JSON format.
  • reproduce: Includes the code for reproducing the results from FineSurE as shown in Table 1 and Table 2.
  • finesure: Contains the implementation of the FineSurE method, used for evaluating summaries generated by language models.

Highlight

FineSurE is a multi-dimensional, fine-grained automated evaluation framework for text summarization. It covers three distinct evaluation dimensions: faithfulness, completeness, and conciseness. These dimensions are crucial for assessing the summarization capability of modern language models, which are susceptible to incorrect statements, information omission, and verbosity. FineSurE (Fine-grained Summarization Evaluation using LLMs) is a novel automated approach that evaluates summarization quality at a fine-grained level, operating on summary sentences, keyfacts, and the keyfact alignment between them. It uses human-generated keyfacts for evaluation.

The FineSurE framework breaks a complicated evaluation process down into two simple, human-like evaluation tasks performed by LLMs.

  • Fact Checking: This task recasts fact checking as a nine-way categorization problem: the seven factuality error types, plus the extra categories "other error" for mistakes outside those seven and "no error" for sentences in which no error was found. Given a pair of input text and model summary, the LLM is expected to output, for every summary sentence, the error type assigned to one of the nine categories, along with a succinct explanation.

  • Keyfact Alignment: This task resolves the alignment problem through keyfact matching, which comprises two sequential subtasks: confirming whether each keyfact is inferred from the summary and, if so, providing the line numbers of all relevant summary sentences. Given a keyfact list and a model summary, the output is therefore a binary label for each keyfact together with the line numbers of all matching summary sentences.
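As a rough sketch of how the two LLM outputs described above can be parsed and validated (not the repository's actual code; the JSON field names and the seven error-type strings are illustrative assumptions):

```python
import json

# Nine fact-checking categories: seven factuality error types plus the two
# extra labels described above. The seven names below are assumptions for
# illustration; the exact strings live in finesure/fact-checking.py.
CATEGORIES = {
    "out-of-context error", "entity error", "predicate error",
    "circumstance error", "grammatical error", "coreference error",
    "linking error", "other error", "no error",
}

def parse_fact_checking(raw: str) -> list[dict]:
    """Parse the LLM's JSON: one {"sentence", "reason", "category"} per summary sentence."""
    items = json.loads(raw)
    for item in items:
        if item["category"] not in CATEGORIES:
            item["category"] = "other error"  # fall back on unrecognized labels
    return items

def parse_keyfact_alignment(raw: str) -> list[dict]:
    """Parse the LLM's JSON: one {"keyfact", "matched", "line_numbers"} per keyfact."""
    items = json.loads(raw)
    for item in items:
        item["matched"] = bool(item["matched"])
        item["line_numbers"] = [int(n) for n in item.get("line_numbers", [])]
    return items
```

Normalizing the raw LLM response this way keeps downstream scoring robust to minor formatting drift in the model's output.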

Running FineSurE on Model Summaries

Sample datasets with 10 examples each are provided for the fact-checking and keyfact-alignment tasks.

Replace the OpenAI API key with your own in finesure/fact-checking.py and finesure/keyfact-alignment.py before executing the files to obtain scores.

Running Command:

python finesure/fact-checking.py [input-path] [output-folder]

# example code for fact checking on sampled data.
python finesure/fact-checking.py dataset/frank/frank-data-sample-10.json result/fact-checking

Running Command:

python finesure/keyfact-alignment.py [input-path] [keyfact-path] [output-folder]

# example code for keyfact alignment on sampled data.
python finesure/keyfact-alignment.py dataset/realsumm/realsumm-data-sample-10.json dataset/realsumm/human-keyfact-list.json result/keyfact-alignment

Logs:

The results are saved in the result directory. Example results:

  • Fact Checking Task:
[Evaluation Results]
* sentence-level factuality error ratio per model (lower is better)
bert_sum	0.0%
bus	33.3%
pgn	16.7%
s2s	83.3%
bart	33.3%

* summary-level faithfulness score per model (higher is better)
bert_sum	100.0%
bus	66.7%
pgn	83.3%
s2s	16.7%
bart	75.0%

* system-level model ranking (left is better)
['bert_sum', 'pgn', 'bart', 'bus', 's2s']
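The two fact-checking metrics in the log above can be computed from per-sentence error labels roughly as follows. This is a sketch of the scoring scheme, not the repository's exact code:

```python
def faithfulness_score(labels):
    """Summary-level faithfulness: fraction of sentences labeled 'no error'."""
    return sum(label == "no error" for label in labels) / len(labels)

def sentence_error_ratio(all_labels):
    """Sentence-level factuality error ratio over every sentence a model produced."""
    sentences = [label for summary in all_labels for label in summary]
    return sum(label != "no error" for label in sentences) / len(sentences)

def model_faithfulness(all_labels):
    """Per-model score: average summary-level faithfulness across summaries."""
    return sum(faithfulness_score(s) for s in all_labels) / len(all_labels)
```

The system-level ranking is then obtained by sorting models by their average faithfulness, highest first.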
  • Keyfact Alignment Task:
[Evaluation Results]

* completeness score per model (higher is better)
unilm_out_v2	45.5%
t5_out_large	59.0%

* completeness model ranking (left is better)
['t5_out_large', 'unilm_out_v2']

* conciseness score per model (higher is better)
unilm_out_v2	76.0%
t5_out_large	81.7%

* conciseness model ranking (left is better)
['t5_out_large', 'unilm_out_v2']
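The completeness and conciseness scores above can be derived from the keyfact-alignment output roughly as follows. This is a sketch consistent with the definitions in the paper; the field names are assumptions:

```python
def completeness(alignment, num_keyfacts):
    """Fraction of keyfacts that the summary covers."""
    return sum(1 for a in alignment if a["matched"]) / num_keyfacts

def conciseness(alignment, num_sentences):
    """Fraction of summary sentences that support at least one keyfact."""
    used_lines = set()
    for a in alignment:
        if a["matched"]:
            used_lines.update(a["line_numbers"])
    return len(used_lines) / num_sentences
```

Intuitively, completeness penalizes information omission, while conciseness penalizes verbosity: summary sentences matched to no keyfact drag the conciseness score down.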

Reproduce the Main Table of the Paper

cd reproduce
python reproduce-main-results.py results/frank-result-by-gpt4-w-finesure.json results/realsumm-result-by-gpt4-w-finesure.json

To generate the datasets for the robustness analysis, refer to the robustness.py file.
