Company-Specific Knowledge Matters: Retrieval-Augmented Generation for Earnings Call Answer Rehearsal

MiuLab/CompanyRehearsal

CompanyRehearsal: Retrieval-Augmented Financial QA with Knowledge Graph Grounding

This repository accompanies the paper
“CompanyRehearsal: Retrieval-Augmented Financial QA with Knowledge Graph Grounding” and includes the datasets used in the study. The project explores Retrieval-Augmented Generation (RAG) techniques that draw on knowledge graphs (KGs) or past earnings call Q&A to improve factual accuracy and reasoning in the financial domain, particularly for earnings call transcripts (ECC).

We provide ECC QA pairs, the knowledge graphs used in the experiments, and financial terminology resources designed to support research on financial question answering.

Please cite the following paper if you use the released data.

@inproceedings{shih2025company,
  title={Company-Specific Knowledge Matters: Retrieval-Augmented Generation for Earnings Call Answer Rehearsal},
  author={Shih, Yung-Yu and Chen, Yun-Nung and Chen, Chung-Chi},
  booktitle={Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  pages={5243--5247},
  year={2025}
}

📁 Repository Structure

data/

Contains all raw and processed data used in the RAG experiments.

knowledge_graph/

Knowledge graphs used for retrieval and reasoning tasks.

  • cause_effect_sentence_pairs.txt — Sentence-level cause–effect relationships.

    Column | Description
    Cause sentence | The initiating or source statement (e.g., '2Q18 earnings met expectations')
    Effect sentence | The resulting or affected statement (e.g., 'FY18 EPS estimate remains unchanged at $4.10')
  • cause_effect_term_pairs.csv — Term-level cause–effect pairs, formatted as (head → cause, tail → effect, weight). Each row represents a directed relationship from a cause term to an effect term, along with its frequency count in the corpus.

    Column | Description
    Cause | The source or initiating term (e.g., growth)
    Effect | The resulting or affected term (e.g., revenue)
    Count | The number of times the cause–effect pair appears in the dataset
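The term-level pairs can be read as a weighted directed graph. The following is a minimal sketch of that idea, not the loading code used in the paper: it assumes each row holds exactly three comma-separated fields (cause, effect, count) and that the presence of a header row is unknown, so a hypothetical `has_header` flag is provided. The function names and the sample rows are illustrative; the counts shown are made up and do not come from the released file.

```python
import csv
import io
from collections import defaultdict


def load_term_graph(lines, has_header=False):
    """Build {cause: {effect: count}} from (cause, effect, count) CSV rows."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # assumption: the first row may be column names
    graph = defaultdict(dict)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        cause, effect, count = row[0].strip(), row[1].strip(), int(row[2])
        # Sum counts in case the same pair appears on more than one row.
        graph[cause][effect] = graph[cause].get(effect, 0) + count
    return dict(graph)


def top_effects(graph, cause, k=5):
    """Return up to k (effect, count) pairs for a cause, most frequent first."""
    return sorted(graph.get(cause, {}).items(), key=lambda kv: -kv[1])[:k]


# Illustrative rows only; these counts are invented for the example.
sample = io.StringIO("growth,revenue,3\ngrowth,margin,1\ndemand,backlog,2\n")
graph = load_term_graph(sample)
print(top_effects(graph, "growth"))
```

To read the released file instead of the in-memory sample, pass an open file handle (for example, one opened with `newline=""` and `encoding="utf-8"`) as `lines`.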

earnings_call_qa/

Question–Answer pairs derived from financial earnings call transcripts.

  • all_qa/ — QA pairs collected across all companies.
  • cs_qa(aapl)/ — QA pairs extracted from Apple earnings call transcripts, covering 2022 Q1 to 2024 Q3 (11 sessions in total).

Each file (e.g., A_q4_2020.txt) contains structured JSON data with detailed QA annotations, including term-level mappings and semantic analysis results.

Example: A_q4_2020.txt

{
  "0": {
    "question": "A couple of questions from me, Mike, maybe on -- first on the guidance part here. I guess, the Q1 guidance of 4.5% to 5.5% core, does it have -- does it assume any co-tailwinds, because, I guess, if you look at Q4, I mean 6% core, any reason why the core should slow down sequentially?",
    "answer": [
      "Yeah. Let me start, before Bob. So again, thanks for the earlier comments, Vijay. So how we characterize our Q1 guide is positive, but we use a very prudent approach...",
      "Yeah, Vijay. I think a couple of things. The thing that I would say is, we didn't end the year with emptying the tank out and feel really good about that..."
    ],
    "KG_terms": ["management", "demand", "shareholder", "results", "market", "backlog", "uncertainty", "visibility", "prudent"],
    "question_terms": ["core", "mean", "guidance", "down", "slow", "part"],
    "answer_terms": ["year", "upside", "order", "business", "quarter", "recovery", "visibility", "prudent"],
    "in_A_in_Q": [],
    "in_A_not_in_Q": ["year", "upside", "order", "business", "recovery", "visibility", "prudent"]
  }
}
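A file in this layout could be parsed along the following lines. This is a sketch under stated assumptions: the top-level keys are taken to be integer-like strings (only "0" is shown above), `answer` is taken to be a list of speaker turns, and the function name `load_qa_records` is hypothetical. The sample string is a shortened version of the example above, not a complete record.

```python
import json


def load_qa_records(text):
    """Parse one QA file's JSON text into a list of flat records."""
    data = json.loads(text)
    records = []
    # Assumption: keys are string indices such as "0", "1", ...
    for key in sorted(data, key=int):
        item = data[key]
        records.append({
            "id": int(key),
            "question": item["question"],
            # Join the individual answer turns into a single string.
            "answer": " ".join(item["answer"]),
            "question_terms": set(item.get("question_terms", [])),
            "answer_terms": set(item.get("answer_terms", [])),
        })
    return records


# Shortened from the README example for illustration.
sample = """
{
  "0": {
    "question": "Any reason why the core should slow down sequentially?",
    "answer": ["Yeah. Let me start, before Bob.", "Yeah, Vijay. I think a couple of things."],
    "question_terms": ["core", "guidance", "slow"],
    "answer_terms": ["year", "upside", "prudent"]
  }
}
"""
records = load_qa_records(sample)
print(records[0]["id"], len(records[0]["answer"]))
```

For a released file, read its contents as UTF-8 text and pass the resulting string to `load_qa_records`.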

financial_terms.txt

A curated list of domain-specific financial terms used to enhance retrieval accuracy, entity linking, and knowledge grounding in financial text analysis.
Each line represents a single term.

Example entries: liquidity coverage ratio, cost of goods sold(COGS), Treasuries, NCO (Net Charge-Offs), P/S, ...
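Because the file holds one term per line, a simple way to use it is to load the lines and look for each term in a passage. The sketch below illustrates that under assumptions of its own: matching is case-insensitive and whole-term, which may differ from how the terms were applied in the paper, and the function names `load_terms` and `find_terms` are hypothetical. The three sample terms are taken from the example entries above.

```python
import re


def load_terms(lines):
    """Return non-empty, stripped terms, one per input line."""
    return [line.strip() for line in lines if line.strip()]


def find_terms(text, terms):
    """Return the terms that occur in text as whole tokens, ignoring case."""
    found = []
    for term in terms:
        # Lookarounds keep 'P/S' from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# In-memory stand-in for a few lines of financial_terms.txt.
terms = load_terms(["liquidity coverage ratio\n", "Treasuries\n", "P/S\n", "\n"])
text = "The liquidity coverage ratio improved while treasuries rallied."
print(find_terms(text, terms))
```

For a long term list, compiling the patterns once (or building a single alternation) would avoid recompiling on every call; the per-term loop is kept here for readability.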
