This repository is about the paper
"CompanyRehearsal: Retrieval-Augmented Financial QA with Knowledge Graph Grounding".
It includes the datasets used in the study.
The project explores Retrieval-Augmented Generation (RAG) techniques that leverage knowledge graphs (KGs) or past earnings call Q&A to improve factual accuracy and reasoning in the financial domain, particularly in the earnings call transcript (ECC) setting.
We provide ECC QA pairs, the knowledge graphs used in the experiments, and financial terminology resources designed to support research on financial question answering.
Please cite the following references if you use the released data.
@inproceedings{shih2025company,
title={Company-Specific Knowledge Matters: Retrieval-Augmented Generation for Earnings Call Answer Rehearsal},
author={Shih, Yung-Yu and Chen, Yun-Nung and Chen, Chung-Chi},
booktitle={Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
pages={5243--5247},
year={2025}
}
Contains all raw and processed data used in the RAG experiments.
Knowledge graphs used for retrieval and reasoning tasks.
- cause_effect_sentence_pairs.txt: Sentence-level cause-effect relationships.

  | Column | Description |
  | --- | --- |
  | Cause sentence | The initiating or source statement (e.g., '2Q18 earnings met expectations') |
  | Effect sentence | The resulting or affected statement (e.g., 'FY18 EPS estimate remains unchanged at $4.10') |

- cause_effect_term_pairs.csv: Term-level cause-effect pairs, formatted as (head → cause, tail → effect, weight). Each row represents a directed relationship from a cause term to an effect term, along with its frequency count in the corpus.

  | Column | Description |
  | --- | --- |
  | Cause | The source or initiating term (e.g., growth) |
  | Effect | The resulting or affected term (e.g., revenue) |
  | Count | The number of times the cause-effect pair appears in the dataset |
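
The term-level pairs map naturally onto a weighted directed graph. The following is a minimal sketch of one way to load them, not the code used in the paper. It assumes a comma-delimited file with a header row, columns in the order Cause, Effect, Count, and integer counts; the function names `load_term_graph` and `top_effects` are hypothetical, and the sample rows (including their counts) are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_term_graph(lines, has_header=True):
    """Build {cause: {effect: count}} from rows of (cause, effect, count).

    Assumption: comma-delimited rows in the order Cause, Effect, Count.
    """
    graph = defaultdict(dict)
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed rows
        cause, effect, count = (cell.strip() for cell in row[:3])
        # Sum counts in case the same pair appears on more than one row.
        graph[cause][effect] = graph[cause].get(effect, 0) + int(count)
    return dict(graph)


def top_effects(graph, cause, k=3):
    """Return up to k (effect, count) pairs for a cause, highest count first."""
    effects = graph.get(cause, {})
    return sorted(effects.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up rows for illustration only; real counts come from the released CSV.
sample = io.StringIO(
    "Cause,Effect,Count\n"
    "growth,revenue,12\n"
    "growth,margin,5\n"
    "demand,revenue,7\n"
)
graph = load_term_graph(sample)
print(top_effects(graph, "growth"))
```

To read the released file instead of the in-memory sample, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to wherever cause_effect_term_pairs.csv sits in your checkout.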
QuestionâAnswer pairs derived from financial earnings call transcripts.
- all_qa/: QA pairs collected across all companies.
- cs_qa(aapl)/: QA pairs extracted from Apple earnings call transcripts, covering 2022 Q1 to 2024 Q3 (11 sessions in total).
Each file (e.g., A_q4_2020.txt) contains structured JSON data with detailed QA annotations, including term-level mappings and semantic analysis results.
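
A QA file can be read with a standard JSON parser even though it carries a .txt extension. The sketch below is illustrative rather than the authors' loader. It assumes the layout of this README's example record: a top-level object keyed by numeric strings, where each record has a `question` string, an `answer` list of speaker turns, and term lists such as `KG_terms`. The function name `load_qa_records` and the placeholder sample text are hypothetical.

```python
import json


def load_qa_records(text):
    """Parse the text of one QA file into a list of records ordered by key.

    Assumption: top-level keys are numeric strings such as "0", "1", ...
    """
    data = json.loads(text)
    records = []
    for key in sorted(data, key=int):
        rec = data[key]
        records.append({
            "id": int(key),
            "question": rec["question"],
            # The answer is stored as a list of turns; join them for convenience.
            "answer": " ".join(rec.get("answer", [])),
            "kg_terms": rec.get("KG_terms", []),
        })
    return records


# Placeholder content for illustration only, not taken from the dataset.
sample_text = json.dumps({
    "1": {"question": "placeholder question B", "answer": ["only turn"]},
    "0": {
        "question": "placeholder question A",
        "answer": ["first turn", "second turn"],
        "KG_terms": ["demand", "market"],
    },
})
records = load_qa_records(sample_text)
for rec in records:
    print(rec["id"], rec["question"], rec["kg_terms"])
```

For a real file, read its contents first, for example with `pathlib.Path(path).read_text(encoding="utf-8")`, and pass the resulting string to `load_qa_records`.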
Example: A_q4_2020.txt
{
"0": {
"question": "A couple of questions from me, Mike, maybe on -- first on the guidance part here. I guess, the Q1 guidance of 4.5% to 5.5% core, does it have -- does it assume any co-tailwinds, because, I guess, if you look at Q4, I mean 6% core, any reason why the core should slow down sequentially?",
"answer": [
"Yeah. Let me start, before Bob. So again, thanks for the earlier comments, Vijay. So how we characterize our Q1 guide is positive, but we use a very prudent approach...",
"Yeah, Vijay. I think a couple of things. The thing that I would say is, we didn't end the year with emptying the tank out and feel really good about that..."
],
"KG_terms": ["management", "demand", "shareholder", "results", "market", "backlog", "uncertainty", "visibility", "prudent"],
"question_terms": ["core", "mean", "guidance", "down", "slow", "part"],
"answer_terms": ["year", "upside", "order", "business", "quarter", "recovery", "visibility", "prudent"],
"in_A_in_Q": [],
"in_A_not_in_Q": ["year", "upside", "order", "business", "recovery", "visibility", "prudent"]
}
}
A curated list of domain-specific financial terms used to enhance retrieval accuracy, entity linking, and knowledge grounding in financial text analysis.
Each line represents a single term.
Example entries: liquidity coverage ratio, cost of goods sold(COGS), Treasuries, NCO (Net Charge-Offs), P/S, ...
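
Because the list holds one term per line, it can be loaded and used for simple term spotting with the standard library alone. The sketch below is one possible approach and is not described in the paper: it assumes case-insensitive, whole-term matching, and the function names `load_terms` and `find_terms` are hypothetical. The sample terms are taken from the example entries above; the sample sentence is made up.

```python
import re


def load_terms(lines):
    """Return the non-empty, stripped terms, dropping case-insensitive repeats."""
    terms = []
    seen = set()
    for line in lines:
        term = line.strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms


def find_terms(text, terms):
    """Return the terms that occur in text as whole tokens, ignoring case."""
    found = []
    for term in terms:
        # Lookarounds stop a term from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Any iterable of lines works here, including an open file handle.
terms = load_terms(
    ["liquidity coverage ratio\n", "Treasuries\n", "P/S\n", "\n", "treasuries\n"]
)
hits = find_terms(
    "The bank's liquidity coverage ratio improved as it added Treasuries.",
    terms,
)
print(hits)
```

Case-insensitive matching is a convenience here; short abbreviations may need case-sensitive matching to avoid false positives.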