
MultiRAG: Multi-source Retrieval-Augmented Generation Framework

Project Introduction

MultiRAG is a retrieval-augmented generation framework designed for knowledge fusion and reasoning over multiple sources. It achieves efficient and accurate multi-source fusion and reasoning through three components: Multi-source Line Graph (MLG) construction, Multi-level Confidence Calculation (MCC), and the Multi-source Knowledge Linear Graph Path (MKLGP) algorithm.

Directory Structure

Multi-RAG/
├── data/                # Stores raw datasets and preprocessed data
├── src/                 # Core code
│   ├── data_processing/ # Data preprocessing (format conversion, knowledge extraction)
│   ├── mka/             # Multi-source knowledge aggregation (MLG construction, subgraph matching)
│   ├── mcc/             # Multi-level confidence calculation
│   ├── mklgp/           # MKLGP algorithm implementation
│   └── evaluation/      # Evaluation metric calculation
├── experiments/         # Experiment configurations and result files
├── requirements.txt     # Dependency configuration
└── README.md            # Project description

Environment Requirements

Hardware Requirements

  • CPU: 8 cores or more (Intel i7/Ryzen 7 or higher)
  • Memory: 32GB (64GB+ recommended to avoid OOM when processing large multi-source data)
  • Storage: 100GB+ (for storing datasets, model weights, and preprocessed files)

Software Requirements

  • Python 3.10
  • CUDA 11.6
  • PyTorch 2.0.1
  • Transformers 4.37.2
  • Other dependencies listed in requirements.txt
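The pinned versions above imply a requirements.txt along these lines. This is an illustrative fragment, not the repository's actual file; only the PyTorch and Transformers pins are taken from this README, and the authoritative list ships with the repository:

```
# Illustrative pins derived from the README's software requirements
torch==2.0.1
transformers==4.37.2
```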

Dataset Download

1. FusionDatasets

# Download FusionDatasets
wget https://lunadong.com/fusiondatasets
unzip fusiondatasets -d data/raw/

2. HotpotQA

# Download HotpotQA validation set
wget https://raw.githubusercontent.com/hotpotqa/hotpotqa/master/hotpot_dev_distractor_v1.json
mkdir -p data/raw/hotpotqa
mv hotpot_dev_distractor_v1.json data/raw/hotpotqa/

3. 2WikiMultiHopQA

# Clone 2WikiMultiHopQA repository
git clone https://github.com/Alab-NII/2wikimultihop.git
mkdir -p data/raw/2wikimultihop
cp 2wikimultihop/data/dev.json data/raw/2wikimultihop/
rm -rf 2wikimultihop

Usage

1. Data Preprocessing

Format Conversion

python src/data_processing/format_converter.py

Knowledge Extraction

python src/data_processing/knowledge_extractor.py

2. Run Complete Experiment

python experiments/run_multirag.py

Core Module Description

1. Multi-source Knowledge Aggregation (MKA)

  • MLG Construction: Represents multi-source knowledge as line graphs, where nodes are triples and edges are shared entities
  • Subgraph Matching: Matches homologous subgraphs (SVs) and isolated vertices (LVs) based on query entities
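The line-graph idea above can be sketched in a few lines: each triple becomes a node, and two nodes are linked whenever their triples share an entity. This is a minimal illustration of the construction, not the repository's implementation; the function name and data layout are assumptions.

```python
from itertools import combinations

def build_mlg(triples):
    """Build a multi-source line graph: each triple is a node, and two
    nodes are connected by an edge when the triples share an entity
    (as subject or object). Returns (i, j, shared_entities) edges."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(triples), 2):
        shared = {a[0], a[2]} & {b[0], b[2]}
        if shared:
            edges.append((i, j, shared))
    return edges

triples = [
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
    ("Berlin", "capital_of", "Germany"),
]
print(build_mlg(triples))  # → [(0, 1, {'France'})]
```

Subgraph matching then selects the connected regions of this graph that touch the query entities.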

2. Multi-level Confidence Calculation (MCC)

  • Graph-level Confidence: Calculated based on node similarity within subgraphs
  • Node-level Confidence: Calculated based on consistency scores and authority scores
  • Subgraph Filtering: Filters low-confidence subgraphs and nodes
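The filtering step can be sketched as follows. The weighting scheme and threshold here are illustrative assumptions, not the paper's actual formulas: node confidence is shown as a simple weighted mix of consistency and authority, and a subgraph is kept when its mean node confidence clears a threshold.

```python
def node_confidence(consistency, authority, alpha=0.5):
    """Illustrative node-level confidence: a weighted combination of a
    consistency score and an authority score (alpha is assumed)."""
    return alpha * consistency + (1 - alpha) * authority

def filter_subgraphs(subgraphs, threshold=0.6):
    """Keep subgraphs whose mean node confidence meets the threshold.
    subgraphs: list of (subgraph_id, [node_confidence, ...]) pairs."""
    kept = []
    for sg_id, scores in subgraphs:
        if sum(scores) / len(scores) >= threshold:
            kept.append(sg_id)
    return kept

scored = [("sg1", [0.9, 0.8]), ("sg2", [0.3, 0.4])]
print(filter_subgraphs(scored))  # → ['sg1']
```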

3. MKLGP Algorithm

  • Prompt Construction: Builds prompts based on filtered high-confidence subgraphs
  • Answer Generation: Generates accurate answers using Llama3-8B-Instruct
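Prompt construction from the filtered subgraphs might look like the sketch below. The serialization format is an assumption for illustration; the actual template used with Llama3-8B-Instruct lives in the repository's mklgp module.

```python
def build_prompt(question, subgraphs):
    """Serialize high-confidence evidence triples into a generation prompt.
    subgraphs: list of subgraphs, each a list of (subject, predicate,
    object) triples. The wording of the template is illustrative."""
    lines = ["Answer the question using the evidence triples below.",
             "Evidence:"]
    for sg in subgraphs:
        for s, p, o in sg:
            lines.append(f"- ({s}, {p}, {o})")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt("What is the capital of France?",
                      [[("Paris", "capital_of", "France")]])
print(prompt)
```

The resulting string is then passed to the language model for answer generation.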

Evaluation Metrics

  • F1 Score: Measures the accuracy and completeness of answers
  • Recall@K: Measures the proportion of gold (relevant) items recovered among the top K retrieved results
  • Precision: Measures the proportion of correct parts in generated answers
  • Recall: Measures the proportion of correct answers covered by generated answers
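Token-level F1, precision, and recall for QA are standardly computed from the bag-of-words overlap between the predicted and gold answers, as sketched below (a minimal version that lowercases and splits on whitespace; the repository's evaluation module may apply additional normalization):

```python
from collections import Counter

def f1_score(prediction, gold):
    """Token-level F1, precision, and recall between a predicted answer
    and a gold answer, using bag-of-words overlap."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall

f1, p, r = f1_score("the capital of France", "capital of France")
print(round(f1, 3), p, r)  # → 0.857 0.75 1.0
```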

Experimental Results

Expected Performance

  • Multi-source query datasets: F1 score ≥10% higher than baseline models
  • HotpotQA: Recall@5 ≥62.7%, Precision ≥59.3%
  • Efficiency: Query time one order of magnitude lower than baseline models

Notes

  1. Data Preparation: Ensure that datasets in the correct format are placed in the data/raw/ directory
  2. Model Weights: Llama3-8B-Instruct model weights need to be downloaded
  3. VRAM Requirements: 16GB+ VRAM GPU is recommended for processing large-scale data
  4. Path Settings: Ensure all file paths are set correctly, especially data and model paths

Citation

If you use this project in your research, please cite the MultiRAG paper (ICDE 2025).

License

This project is licensed under the MIT License.

Contact

For questions or suggestions, please contact the project maintainers.

About

Code of MultiRAG, ICDE 2025
