A production-ready tool for generating high-quality Q&A datasets for Mistral 7B fine-tuning, specifically designed for Bangladeshi educational content and Indian university guidance.

SetForge: AI-Powered Dataset Generation Pipeline


SetForge is a sophisticated, three-stage data processing pipeline, built for a research project and released under the MIT License. Its primary mission is to generate high-quality, instruction-formatted question-answer (Q&A) datasets for fine-tuning large language models (LLMs), with a specific focus on providing guidance about Indian universities to Bangladeshi students.


🎯 Mission & Vision

The core objective of SetForge is to create a culturally-aware and contextually accurate AI counselor. By fine-tuning models like Mistral 7B on the dataset generated by this pipeline, we aim to produce an AI that can outperform general-purpose models like GPT-4 on this specific domain, offering nuanced and practical advice to students.

✅ Generated Dataset

The primary output of this pipeline is a high-quality dataset hosted on the Hugging Face Hub.

➡️ View the dataset: millat/indian_university_guidance_for_bangladeshi_students

🛠️ Associated Tooling

The initial data collection for this project was performed using WebScrape, a purpose-built Chrome extension for extracting clean, structured content from educational websites and PDFs.

✨ Features

The pipeline is split into three distinct, synergistic parts: the Knowledge Forge, the Q&A Forge, and the Quality Assurance (QA) Forge.

Part 1: The Knowledge Forge (Data Structuring)

  • Automated Content Extraction: Intelligently extracts main content from raw HTML, discarding boilerplate.
  • AI-Powered Document Triage: Uses an LLM to segment documents into logical, topic-based chunks.
  • Topic-Specific Structuring: Applies dedicated prompts and JSON schemas to each chunk for highly structured, relevant information extraction.
  • Resilient & Resumable: Features robust error handling, retries with exponential backoff, a dead-letter queue, and checkpointing to resume failed runs.
  • Knowledge Aggregation: Merges all structured topic chunks into a single, coherent JSON file per source document.
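The retry behavior described above can be sketched with a minimal helper. The function and parameter names here are illustrative, not the pipeline's actual API:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, dead_letter=None, sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff plus jitter.

    If the final attempt fails, the exception is recorded in the dead-letter
    list (when provided) and re-raised.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                if dead_letter is not None:
                    dead_letter.append(exc)  # park the failure for later inspection
                raise
            # Delays grow as base_delay * 2^attempt, with up to 1s of random jitter
            sleep(base_delay * (2 ** attempt) + random.random())
```

Injecting the `sleep` callable keeps the helper testable without real delays.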

Part 2: The Q&A Forge (Dataset Generation)

  • Context-Aware Generation: Uses the full context of a structured JSON file to generate holistic Q&A pairs with rich metadata.
  • Instruction-Formatted Output: Creates Q&A pairs that include a question, a direct answer, context, source, and metadata.
  • High-Throughput Processing: Employs asyncio for concurrent processing.
  • Scalable & Resumable: Appends records to a .jsonl file and uses checkpoints to track progress.
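A rough sketch of the concurrency and append-only output described above (names and record shape are illustrative, and the placeholder stands in for the real LLM call):

```python
import asyncio
import json

async def process_file(path, semaphore, out_path):
    async with semaphore:
        # Placeholder for the actual LLM-backed Q&A generation step
        record = {"source": path, "qa_pairs": []}
        # Append one JSON object per line so completed work survives a crash
        with open(out_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

async def run_all(paths, out_path, max_concurrency=8):
    # The semaphore caps in-flight work while asyncio.gather runs tasks concurrently
    semaphore = asyncio.Semaphore(max_concurrency)
    await asyncio.gather(*(process_file(p, semaphore, out_path) for p in paths))
```

In a real pipeline the file writes would go through a single writer (or async file I/O) rather than blocking calls inside coroutines.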

Part 3: The QA Forge (Quality Assurance)

  • Automated Deduplication: Detects and removes duplicates and near-duplicates using semantic similarity.
  • Quality Validation: Enforces thresholds for semantic relevance, extractive overlap, cultural sensitivity, and practicality.
  • Metadata Enrichment: Computes quality scores, adds flags for manual review, and preserves provenance.
  • Final Packaging: Outputs a clean, fine-tuning-ready .jsonl dataset.
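The deduplication step might look roughly like the following sketch, which uses a bag-of-words cosine similarity in place of the pipeline's actual semantic embeddings:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dedupe(questions, threshold=0.9):
    """Keep the first question in each group of near-duplicates."""
    kept, vectors = [], []
    for question in questions:
        vec = Counter(question.lower().split())
        if all(cosine(vec, seen) < threshold for seen in vectors):
            kept.append(question)
            vectors.append(vec)
    return kept
```

Swapping `Counter` vectors for sentence embeddings turns this into true semantic deduplication without changing the control flow.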

πŸ›οΈ System Architecture & Workflow

The entire process is a sequential flow from raw data to a fine-tuning dataset, orchestrated by the three pipelines.

```mermaid
graph TD
    subgraph "Part 1: The Knowledge Forge"
        A["Raw Data (.html, .txt)"] --> B{"Content Extraction & Cleaning"};
        B --> C["Cleaned Text"];
        C --> D{"LLM-based Triage & Splitting"};
        D --> E["Topic-based Text Chunks"];
        E --> F{"Topic-Specific Structuring via LLM"};
        F --> G["Structured JSON Chunks"];
        G --> H{"Knowledge Aggregation"};
        H --> I["Structured JSON Knowledge Base"];
    end

    subgraph "Part 2: The Q&A Forge"
        J["Structured JSON Knowledge Base"] --> K{"Concurrent File Processing"};
        K --> L{"Context-Aware Q&A Generation via LLM"};
        L --> M{"Validate & Parse Response"};
        M --> N["Append to Raw .jsonl Dataset"];
    end

    subgraph "Part 3: The QA Forge"
        O["Raw .jsonl Dataset"] --> P{"Deduplication & Quality Checks"};
        P --> Q["Final Dataset"];
    end

    I --> J;
    N --> O;
```

🚀 Getting Started

Follow these steps to set up and run the SetForge pipeline.

1. Prerequisites

  • Python 3.9+
  • Access to a Google AI Studio API key.

2. Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd SetForge

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the required dependencies
pip install -r requirements.txt
```

3. Configuration

Configuration is centralized in config.yaml.

  • API Keys: Add your Google AI Studio API key(s) to the api_providers list. You can add multiple keys, and the system will rotate through them.
    api_providers:
      - name: "studio_tier_1"
        provider: "google_ai_studio"
        model: "gemini-1.5-flash" # Or your preferred model
        api_key: "YOUR_API_KEY_HERE"
        tier: "paid" # or "free"
        rpm: 60
        tpm: 1000000
  • Data Directories: The data_config section defines the input and output directories for each stage of the pipeline. The defaults are recommended.
    data_config:
      raw_dir: "data_raw"
      cleaned_dir: "data_cleaned"
      structured_dir: "data_structured"
      qa_dir: "data_qa"
  • Environment Variables: For the main runner (run.py), you need to set a few environment variables.
    export GEMINI_API_KEY_1="YOUR_API_KEY_HERE"
    export VERTEX_AI_PROJECT="dummy-project"
    export VERTEX_AI_LOCATION="us-central1"
    export VERTEX_AI_MODEL="dummy-model"
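The key rotation mentioned above can be as simple as a round-robin cycle over the configured providers. This is only a sketch; the project's actual APIClientManager likely also tracks per-key rate limits:

```python
from itertools import cycle

class KeyRotator:
    """Hand out configured API keys in round-robin order."""

    def __init__(self, providers):
        # providers: a list of dicts shaped like the api_providers entries above
        self._providers = cycle(providers)

    def next_key(self) -> str:
        return next(self._providers)["api_key"]
```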

⚙️ Usage

You can run each part of the pipeline separately.

Running Part 1: The Knowledge Forge

Place your raw .html or .txt files in the data_raw directory.

```bash
# This command will clean the raw files and structure them into JSON.
python3 run.py --mode=production --steps clean structure
```

Running Part 2: The Q&A Forge

This pipeline runs after Part 1 has successfully generated files in the data_structured directory.

```bash
# This command will generate Q&A pairs from all new structured files.
python3 run_qa_pipeline.py

# To run in a test mode on a small sample of files:
python3 run_qa_pipeline.py --test
```

The raw dataset will be created at data_qa/qna_dataset.jsonl.
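Each line of the .jsonl file is a single Q&A record. The field names below illustrate the question/answer/context/source/metadata shape described earlier; they are a hypothetical example, not a guaranteed schema:

```json
{"question": "Can Bangladeshi students apply for B.Tech admission without IELTS?", "answer": "…", "context": "…", "source": "data_structured/example_university.json", "metadata": {"topic": "admissions", "quality_score": 0.92}}
```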

Running Part 3: The Quality Assurance Forge

This final step cleans the generated dataset.

```bash
# This command will remove duplicates and save the final dataset.
python3 run_quality_assurance.py
```

The final, clean dataset will be saved to dataset/dataset.jsonl.

📂 Project Structure

```
.
├── config.yaml               # Main configuration file
├── data_raw/                 # Input for raw data
├── data_cleaned/             # Output of the cleaning process
├── data_structured/          # Output of the Knowledge Forge (Part 1)
├── data_qa/                  # Output of the Q&A Forge (Part 2)
├── dataset/                  # Final, cleaned dataset output (Part 3)
├── docs/                     # Detailed documentation
├── run.py                    # Main entry point for the Part 1 pipeline
├── run_qa_pipeline.py        # Entry point for the Part 2 pipeline
├── run_quality_assurance.py  # Entry point for the Part 3 pipeline
├── src/
│   ├── components/           # Core logic components (e.g., DocumentSplitter)
│   ├── pipeline/             # Pipeline-specific modules (e.g., QAGenerator)
│   ├── prompts/              # Prompt templates for LLM interaction
│   ├── schemas/              # JSON schemas for data structuring
│   └── utils/                # Utility modules (e.g., APIClientManager)
└── requirements.txt          # Project dependencies
```

🤝 Contributing

This is a research project, and contributions are welcome. Please open an issue to discuss any changes or submit a pull request.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.
