MultiRAG is a multi-source retrieval-augmented generation framework for knowledge fusion and reasoning across heterogeneous sources. It achieves efficient and accurate multi-source knowledge fusion and reasoning through three components: Multi-source Line Graph (MLG) construction, Multi-level Confidence Calculation (MCC), and the Multi-source Knowledge Linear Graph Path (MKLGP) algorithm.
```
Multi-RAG/
├── data/                 # Stores raw datasets and preprocessed data
├── src/                  # Core code
│   ├── data_processing/  # Data preprocessing (format conversion, knowledge extraction)
│   ├── mka/              # Multi-source knowledge aggregation (MLG construction, subgraph matching)
│   ├── mcc/              # Multi-level confidence calculation
│   ├── mklgp/            # MKLGP algorithm implementation
│   └── evaluation/       # Evaluation metric calculation
├── experiments/          # Experiment configurations and result files
├── requirements.txt      # Dependency configuration
└── README.md             # Project description
```
- CPU: 8 cores or more (Intel i7/Ryzen 7 or higher)
- Memory: 32GB (64GB+ recommended to avoid OOM when processing large multi-source data)
- Storage: 100GB+ (for storing datasets, model weights, and preprocessed files)
- Python 3.10
- CUDA 11.6
- PyTorch 2.0.1
- Transformers 4.37.2
- Other dependencies listed in requirements.txt
```shell
# Download FusionDatasets
wget https://lunadong.com/fusiondatasets
unzip fusiondatasets -d data/raw/

# Download HotpotQA validation set
wget https://raw.githubusercontent.com/hotpotqa/hotpotqa/master/hotpot_dev_distractor_v1.json
mkdir -p data/raw/hotpotqa
mv hotpot_dev_distractor_v1.json data/raw/hotpotqa/

# Clone 2WikiMultiHopQA repository
git clone https://github.com/Alab-NII/2wikimultihop.git
mkdir -p data/raw/2wikimultihop
cp 2wikimultihop/data/dev.json data/raw/2wikimultihop/
rm -rf 2wikimultihop

# Preprocess data and run MultiRAG
python src/data_processing/format_converter.py
python src/data_processing/knowledge_extractor.py
python experiments/run_multirag.py
```
- MLG Construction: Represents multi-source knowledge as line graphs, where nodes are triples and edges are shared entities
- Subgraph Matching: Matches homologous subgraphs (SVs) and isolated vertices (LVs) based on query entities
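The MLG construction step above can be sketched as follows. This is a minimal illustration, not the project's actual implementation in `src/mka/`; the triple format `(head, relation, tail)` and the shared-entity rule are assumptions based on the description above.

```python
from itertools import combinations

def build_mlg(triples):
    """Build a multi-source line graph: each (head, relation, tail) triple
    becomes a node; two nodes are connected if they share an entity."""
    nodes = list(triples)
    edges = []
    for i, j in combinations(range(len(nodes)), 2):
        h1, _, t1 = nodes[i]
        h2, _, t2 = nodes[j]
        if {h1, t1} & {h2, t2}:  # shared entity -> edge in the line graph
            edges.append((i, j))
    return nodes, edges

triples = [
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
    ("Berlin", "capital_of", "Germany"),
]
nodes, edges = build_mlg(triples)
# triples 0 and 1 share the entity "France", so they are linked; triple 2 is isolated
```

In this line-graph view, an isolated vertex (no shared entities with any other triple) corresponds to what the list above calls an LV.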
- Graph-level Confidence: Calculated based on node similarity within subgraphs
- Node-level Confidence: Calculated based on consistency scores and authority scores
- Subgraph Filtering: Filters low-confidence subgraphs and nodes
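A hedged sketch of the two confidence levels and the filtering step described above (the weighting, the mean aggregation, and the 0.6 threshold are all illustrative choices, not the paper's exact formulas):

```python
def node_confidence(consistency, authority, alpha=0.5):
    """Node-level confidence as a weighted mix of consistency and
    authority scores (the weighting scheme here is illustrative)."""
    return alpha * consistency + (1 - alpha) * authority

def graph_confidence(node_scores):
    """Graph-level confidence aggregated from node scores within a subgraph
    (a simple mean; the similarity-based formula in the paper may differ)."""
    return sum(node_scores) / len(node_scores) if node_scores else 0.0

def filter_subgraphs(subgraphs, threshold=0.6):
    """Keep only subgraphs whose aggregate confidence clears the threshold."""
    return [sg for sg in subgraphs if graph_confidence(sg["scores"]) >= threshold]

subgraphs = [
    {"id": "sv1", "scores": [0.9, 0.8, 0.7]},  # mean 0.8 -> kept
    {"id": "sv2", "scores": [0.3, 0.4]},       # mean 0.35 -> filtered out
]
kept = filter_subgraphs(subgraphs)
```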
- Prompt Construction: Builds prompts based on filtered high-confidence subgraphs
- Answer Generation: Generates accurate answers using Llama3-8B-Instruct
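The prompt construction step might look like the sketch below, which serializes the filtered high-confidence subgraphs into a context block for the generator. The template wording and subgraph representation are assumptions; the actual prompt lives in the project code.

```python
def build_prompt(question, subgraphs):
    """Serialize high-confidence subgraphs (lists of triples) into a fact
    block and prepend it to the question (template wording is illustrative)."""
    lines = []
    for sg in subgraphs:
        for head, rel, tail in sg:
            lines.append(f"({head}, {rel}, {tail})")
    context = "\n".join(lines)
    return (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What country is Paris the capital of?",
    [[("Paris", "capital_of", "France")]],
)
```

The resulting prompt would then be passed to Llama3-8B-Instruct (e.g. via Hugging Face Transformers) for answer generation.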
- F1 Score: Measures the accuracy and completeness of answers
- Recall@K: Measures the proportion of queries for which a correct answer appears among the top K results
- Precision: Measures the proportion of correct parts in generated answers
- Recall: Measures the proportion of correct answers covered by generated answers
- Multi-source query datasets: F1 score ≥10% higher than baseline models
- HotpotQA: Recall@5 ≥62.7%, Precision ≥59.3%
- Efficiency: Query time one order of magnitude lower than baseline models
- Data Preparation: Ensure that datasets in the correct format are placed in the data/raw/ directory
- Model Weights: Llama3-8B-Instruct model weights need to be downloaded
- VRAM Requirements: 16GB+ VRAM GPU is recommended for processing large-scale data
- Path Settings: Ensure all file paths are set correctly, especially data and model paths
If you use this project in your research, please cite the relevant papers.
This project is licensed under the MIT License.
For questions or suggestions, please contact the project maintainers.