SetForge is a sophisticated, two-part data processing pipeline designed for a research project under the MIT License. Its primary mission is to generate high-quality, instruction-formatted Question-Answer (Q&A) datasets for fine-tuning Large Language Models (LLMs), specifically focusing on providing guidance about Indian universities to Bangladeshi students.
The core objective of SetForge is to create a culturally-aware and contextually accurate AI counselor. By fine-tuning models like Mistral 7B on the dataset generated by this pipeline, we aim to produce an AI that can outperform general-purpose models like GPT-4 on this specific domain, offering nuanced and practical advice to students.
The primary output of this pipeline is a high-quality dataset hosted on the Hugging Face Hub.
➡️ View the dataset: millat/indian_university_guidance_for_bangladeshi_students
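If you only want to use the published data, it can be pulled directly with the Hugging Face `datasets` library. This is a minimal usage sketch, assuming `datasets` is installed and the data is published under a single `train` split:

```python
# Minimal sketch: load the published dataset from the Hugging Face Hub.
# Assumes the `datasets` library is installed and the default split is "train".
from datasets import load_dataset

dataset = load_dataset(
    "millat/indian_university_guidance_for_bangladeshi_students",
    split="train",
)
print(dataset[0])  # Inspect one Q&A record
```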
The initial data collection for this project was performed using WebScrape, a purpose-built Chrome extension for extracting clean, structured content from educational websites and PDFs.
The pipeline is split into two distinct, synergistic parts, The Knowledge Forge and The Q&A Forge, followed by a final quality assurance pass that packages the dataset.
- Automated Content Extraction: Intelligently extracts main content from raw HTML, discarding boilerplate.
- AI-Powered Document Triage: Uses an LLM to segment documents into logical, topic-based chunks.
- Topic-Specific Structuring: Applies dedicated prompts and JSON schemas to each chunk for highly structured, relevant information extraction.
- Resilient & Resumable: Features robust error handling, retries with exponential backoff, a dead-letter queue, and checkpointing to resume failed runs.
- Knowledge Aggregation: Merges all structured topic chunks into a single, coherent JSON file per source document.
- Context-Aware Generation: Uses the full context of a structured JSON file to generate holistic Q&A pairs with rich metadata.
- Instruction-Formatted Output: Creates Q&A pairs that include a question, a direct answer, context, source, and metadata (see the example record after this list).
- High-Throughput Processing: Employs `asyncio` for concurrent processing.
- Scalable & Resumable: Appends records to a `.jsonl` file and uses checkpoints to track progress.
- Automated Deduplication: Detects and removes duplicates and near-duplicates using semantic similarity.
- Quality Validation: Enforces thresholds for semantic relevance, extractive overlap, cultural sensitivity, and practicality.
- Metadata Enrichment: Computes quality scores, adds flags for manual review, and preserves provenance.
- Final Packaging: Outputs a clean, fine-tuning-ready `.jsonl` dataset.
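To make the record shape concrete, here is one hypothetical `.jsonl` line. The key names follow the fields listed above (question, answer, context, source, metadata); all values are invented, and the authoritative schema is whatever the Q&A Forge actually emits:

```json
{
  "question": "What documents do Bangladeshi students typically need when applying to an Indian university?",
  "answer": "Typically a valid passport, attested academic transcripts, ...",
  "context": "Admissions requirements section for international applicants.",
  "source": "data_structured/example_university_admissions.json",
  "metadata": {"topic": "admissions", "quality_score": 0.92, "needs_review": false}
}
```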
The entire process is a sequential flow from raw data to a fine-tuning dataset, orchestrated by three entry-point scripts, one per stage.
graph TD
subgraph "Part 1: The Knowledge Forge"
A["Raw Data (.html, .txt)"] --> B{"Content Extraction & Cleaning"};
B --> C["Cleaned Text"];
C --> D{"LLM-based Triage & Splitting"};
D --> E["Topic-based Text Chunks"];
E --> F{"Topic-Specific Structuring via LLM"};
F --> G["Structured JSON Chunks"];
G --> H{"Knowledge Aggregation"};
H --> I["Structured JSON Knowledge Base"];
end
subgraph "Part 2: The Q&A Forge"
J["Structured JSON Knowledge Base"] --> K{"Concurrent File Processing"};
K --> L{"Context-Aware Q&A Generation via LLM"};
L --> M{"Validate & Parse Response"};
M --> N["Append to Raw .jsonl Dataset"];
end
subgraph "Part 3: The QA Forge"
O["Raw .jsonl Dataset"] --> P{"Deduplication & Quality Checks"};
P --> Q["Final Dataset"];
end
I --> J;
N --> O;
Follow these steps to set up and run the SetForge pipeline.
- Python 3.9+
- Access to a Google AI Studio API key.
# Clone the repository
git clone <your-repo-url>
cd SetForge
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install the required dependencies
pip install -r requirements.txt

Configuration is centralized in config.yaml.
- API Keys: Add your Google AI Studio API key(s) to the `api_providers` list. You can add multiple keys, and the system will rotate through them (see the rotation sketch after this list).

      api_providers:
        - name: "studio_tier_1"
          provider: "google_ai_studio"
          model: "gemini-1.5-flash"  # Or your preferred model
          api_key: "YOUR_API_KEY_HERE"
          tier: "paid"  # or "free"
          rpm: 60
          tpm: 1000000
- Data Directories: The `data_config` section defines the input and output directories for each stage of the pipeline. The defaults are recommended.

      data_config:
        raw_dir: "data_raw"
        cleaned_dir: "data_cleaned"
        structured_dir: "data_structured"
        qa_dir: "data_qa"
- Environment Variables: For the main runner (`run.py`), you need to set a few environment variables.

      export GEMINI_API_KEY_1="YOUR_API_KEY_HERE"
      export VERTEX_AI_PROJECT="dummy-project"
      export VERTEX_AI_LOCATION="us-central1"
      export VERTEX_AI_MODEL="dummy-model"
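The key rotation mentioned in the API Keys item can be pictured as a simple round-robin over the configured keys. The following is only an illustrative sketch; the actual behaviour lives in the pipeline's API client manager under `src/utils/`:

```python
from itertools import cycle

# Hypothetical illustration of round-robin rotation across configured API keys.
# The real logic is implemented by the pipeline's APIClientManager.
api_keys = ["KEY_1", "KEY_2", "KEY_3"]
key_pool = cycle(api_keys)

def next_api_key() -> str:
    """Return the next key in round-robin order."""
    return next(key_pool)

for _ in range(5):
    print(next_api_key())  # KEY_1, KEY_2, KEY_3, KEY_1, KEY_2
```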
You can run each part of the pipeline separately.
Place your raw .html or .txt files in the data_raw directory.
# This command will clean the raw files and structure them into JSON.
python3 run.py --mode=production --steps clean structure

The Q&A Forge pipeline (Part 2) runs after Part 1 has successfully generated files in the data_structured directory.
# This command will generate Q&A pairs from all new structured files.
python3 run_qa_pipeline.py
# To run in a test mode on a small sample of files:
python3 run_qa_pipeline.py --test

The raw dataset will be created at data_qa/qna_dataset.jsonl.
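The feature list above notes that the Q&A Forge relies on `asyncio` for high-throughput processing. A rough sketch of that bounded-concurrency pattern is shown below; the function names and concurrency limit are hypothetical, not the pipeline's actual implementation:

```python
import asyncio
from pathlib import Path

# Rough sketch of bounded-concurrency processing of structured files.
# Names and the concurrency limit are illustrative only.
MAX_CONCURRENT = 5

async def generate_qa_for_file(path: Path, semaphore: asyncio.Semaphore) -> None:
    async with semaphore:
        # ... call the LLM, validate the response, append to the .jsonl file ...
        await asyncio.sleep(0)  # placeholder for the real async work

async def main() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    files = sorted(Path("data_structured").glob("*.json"))
    await asyncio.gather(*(generate_qa_for_file(f, semaphore) for f in files))

if __name__ == "__main__":
    asyncio.run(main())
```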
This final step cleans the generated dataset.
# This command will remove duplicates and save the final dataset.
python3 run_quality_assurance.py

The final, clean dataset will be saved to dataset/dataset.jsonl.
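The deduplication performed in this step is described as semantic-similarity based. One way such a near-duplicate check can work is sketched below; it assumes the sentence-transformers library and an arbitrary 0.90 threshold, neither of which is necessarily what the pipeline actually uses:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical near-duplicate check: embed questions and flag any pair whose
# cosine similarity exceeds a threshold. The pipeline's real method may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")
questions = [
    "What is the application deadline for the fall intake?",
    "When is the last date to apply for the fall intake?",
    "How much is the annual tuition fee?",
]
embeddings = model.encode(questions, convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)

THRESHOLD = 0.90
for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        if similarity[i][j] >= THRESHOLD:
            print(f"Near-duplicate: {questions[i]!r} ~ {questions[j]!r}")
```

Records flagged this way would then be dropped or routed for manual review, in line with the metadata enrichment described earlier.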
.
├── config.yaml                 # Main configuration file
├── data_raw/                   # Input for raw data
├── data_cleaned/               # Output of the cleaning process
├── data_structured/            # Output of the Knowledge Forge (Part 1)
├── data_qa/                    # Output of the Q&A Forge (Part 2)
├── dataset/                    # Final, cleaned dataset output (Part 3)
├── docs/                       # Detailed documentation
├── run.py                      # Main entry point for the Part 1 pipeline
├── run_qa_pipeline.py          # Entry point for the Part 2 pipeline
├── run_quality_assurance.py    # Entry point for the Part 3 pipeline
├── src/
│   ├── components/             # Core logic components (e.g., DocumentSplitter)
│   ├── pipeline/               # Pipeline-specific modules (e.g., QAGenerator)
│   ├── prompts/                # Prompt templates for LLM interaction
│   ├── schemas/                # JSON schemas for data structuring
│   └── utils/                  # Utility modules (e.g., APIClientManager)
└── requirements.txt            # Project dependencies
This is a research project, and contributions are welcome. Please open an issue to discuss any changes or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.