SetForge is a sophisticated, two-part data processing pipeline designed for a research project under the MIT License. Its primary mission is to generate high-quality, instruction-formatted Question-Answer (Q&A) datasets for fine-tuning Large Language Models (LLMs), specifically focusing on providing guidance about Indian universities to Bangladeshi students.
The core objective of SetForge is to create a culturally-aware and contextually accurate AI counselor. By fine-tuning models like Mistral 7B on the dataset generated by this pipeline, we aim to produce an AI that can outperform general-purpose models like GPT-4 on this specific domain, offering nuanced and practical advice to students.
The primary output of this pipeline is a high-quality dataset hosted on the Hugging Face Hub.
➡️ View the dataset: millat/indian_university_guidance_for_bangladeshi_students
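If you only want to use the published data, it can be pulled directly with the Hugging Face `datasets` library. This is a minimal usage sketch, assuming `datasets` is installed and the data is published under a single `train` split:

```python
# Minimal sketch: load the published dataset from the Hugging Face Hub.
# Assumes the `datasets` library is installed and the default split is "train".
from datasets import load_dataset

dataset = load_dataset(
    "millat/indian_university_guidance_for_bangladeshi_students",
    split="train",
)
print(dataset[0])  # Inspect one Q&A record
```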
The initial data collection for this project was performed using WebScrape, a purpose-built Chrome extension for extracting clean, structured content from educational websites and PDFs.
The pipeline is split into two distinct, synergistic parts, The Knowledge Forge and The Q&A Forge, followed by a final quality assurance pass that packages the dataset.
- Automated Content Extraction: Intelligently extracts main content from raw HTML, discarding boilerplate.
- AI-Powered Document Triage: Uses an LLM to segment documents into logical, topic-based chunks.
- Topic-Specific Structuring: Applies dedicated prompts and JSON schemas to each chunk for highly structured, relevant information extraction.
- Resilient & Resumable: Features robust error handling, retries with exponential backoff, a dead-letter queue, and checkpointing to resume failed runs.
- Knowledge Aggregation: Merges all structured topic chunks into a single, coherent JSON file per source document.
- Context-Aware Generation: Uses the full context of a structured JSON file to generate holistic Q&A pairs with rich metadata.
- Instruction-Formatted Output: Creates Q&A pairs that include a question, a direct answer, context, source, and metadata (see the example record after this list).
- High-Throughput Processing: Employs `asyncio` for concurrent processing.
- Scalable & Resumable: Appends records to a `.jsonl` file and uses checkpoints to track progress.
- Automated Deduplication: Detects and removes duplicates and near-duplicates using semantic similarity.
- Quality Validation: Enforces thresholds for semantic relevance, extractive overlap, cultural sensitivity, and practicality.
- Metadata Enrichment: Computes quality scores, adds flags for manual review, and preserves provenance.
- Final Packaging: Outputs a clean, fine-tuning-ready `.jsonl` dataset.
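To make the record shape concrete, here is one hypothetical `.jsonl` line. The key names follow the fields listed above (question, answer, context, source, metadata); all values are invented, and the authoritative schema is whatever the Q&A Forge actually emits:

```json
{
  "question": "What documents do Bangladeshi students typically need when applying to an Indian university?",
  "answer": "Typically a valid passport, attested academic transcripts, ...",
  "context": "Admissions requirements section for international applicants.",
  "source": "data_structured/example_university_admissions.json",
  "metadata": {"topic": "admissions", "quality_score": 0.92, "needs_review": false}
}
```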
The entire process is a sequential flow from raw data to a fine-tuning dataset, orchestrated by three entry-point scripts, one per stage.
graph TD
subgraph "Part 1: The Knowledge Forge"
A["Raw Data (.html, .txt)"] --> B{"Content Extraction & Cleaning"};
B --> C["Cleaned Text"];
C --> D{"LLM-based Triage & Splitting"};
D --> E["Topic-based Text Chunks"];
E --> F{"Topic-Specific Structuring via LLM"};
F --> G["Structured JSON Chunks"];
G --> H{"Knowledge Aggregation"};
H --> I["Structured JSON Knowledge Base"];
end
subgraph "Part 2: The Q&A Forge"
J["Structured JSON Knowledge Base"] --> K{"Concurrent File Processing"};
K --> L{"Context-Aware Q&A Generation via LLM"};
L --> M{"Validate & Parse Response"};
M --> N["Append to Raw .jsonl Dataset"];
end
subgraph "Part 3: The QA Forge"
O["Raw .jsonl Dataset"] --> P{"Deduplication & Quality Checks"};
P --> Q["Final Dataset"];
end
I --> J;
N --> O;
Follow these steps to set up and run the SetForge pipeline.
- Python 3.9+
- Access to a Google AI Studio API key.
# Clone the repository
git clone <your-repo-url>
cd SetForge
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install the required dependencies
pip install -r requirements.txt

Configuration is centralized in config.yaml.
- API Keys: Add your Google AI Studio API key(s) to the `api_providers` list. You can add multiple keys, and the system will rotate through them (see the rotation sketch after this list).

      api_providers:
        - name: "studio_tier_1"
          provider: "google_ai_studio"
          model: "gemini-1.5-flash"  # Or your preferred model
          api_key: "YOUR_API_KEY_HERE"
          tier: "paid"  # or "free"
          rpm: 60
          tpm: 1000000
- Data Directories: The `data_config` section defines the input and output directories for each stage of the pipeline. The defaults are recommended.

      data_config:
        raw_dir: "data_raw"
        cleaned_dir: "data_cleaned"
        structured_dir: "data_structured"
        qa_dir: "data_qa"
- Environment Variables: For the main runner (`run.py`), you need to set a few environment variables.

      export GEMINI_API_KEY_1="YOUR_API_KEY_HERE"
      export VERTEX_AI_PROJECT="dummy-project"
      export VERTEX_AI_LOCATION="us-central1"
      export VERTEX_AI_MODEL="dummy-model"
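The key rotation mentioned in the API Keys item can be pictured as a simple round-robin over the configured keys. The following is only an illustrative sketch; the actual behaviour lives in the pipeline's API client manager under `src/utils/`:

```python
from itertools import cycle

# Hypothetical illustration of round-robin rotation across configured API keys.
# The real logic is implemented by the pipeline's APIClientManager.
api_keys = ["KEY_1", "KEY_2", "KEY_3"]
key_pool = cycle(api_keys)

def next_api_key() -> str:
    """Return the next key in round-robin order."""
    return next(key_pool)

for _ in range(5):
    print(next_api_key())  # KEY_1, KEY_2, KEY_3, KEY_1, KEY_2
```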
You can run each part of the pipeline separately.
Place your raw .html or .txt files in the data_raw directory.
# This command will clean the raw files and structure them into JSON.
python3 run.py --mode=production --steps clean structure

The Q&A Forge pipeline (Part 2) runs after Part 1 has successfully generated files in the data_structured directory.
# This command will generate Q&A pairs from all new structured files.
python3 run_qa_pipeline.py
# To run in a test mode on a small sample of files:
python3 run_qa_pipeline.py --test

The raw dataset will be created at data_qa/qna_dataset.jsonl.
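The feature list above notes that the Q&A Forge relies on `asyncio` for high-throughput processing. A rough sketch of that bounded-concurrency pattern is shown below; the function names and concurrency limit are hypothetical, not the pipeline's actual implementation:

```python
import asyncio
from pathlib import Path

# Rough sketch of bounded-concurrency processing of structured files.
# Names and the concurrency limit are illustrative only.
MAX_CONCURRENT = 5

async def generate_qa_for_file(path: Path, semaphore: asyncio.Semaphore) -> None:
    async with semaphore:
        # ... call the LLM, validate the response, append to the .jsonl file ...
        await asyncio.sleep(0)  # placeholder for the real async work

async def main() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    files = sorted(Path("data_structured").glob("*.json"))
    await asyncio.gather(*(generate_qa_for_file(f, semaphore) for f in files))

if __name__ == "__main__":
    asyncio.run(main())
```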
This final step cleans the generated dataset.
# This command will remove duplicates and save the final dataset.
python3 run_quality_assurance.py

The final, clean dataset will be saved to dataset/dataset.jsonl.
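The deduplication performed in this step is described as semantic-similarity based. One way such a near-duplicate check can work is sketched below; it assumes the sentence-transformers library and an arbitrary 0.90 threshold, neither of which is necessarily what the pipeline actually uses:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical near-duplicate check: embed questions and flag any pair whose
# cosine similarity exceeds a threshold. The pipeline's real method may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")
questions = [
    "What is the application deadline for the fall intake?",
    "When is the last date to apply for the fall intake?",
    "How much is the annual tuition fee?",
]
embeddings = model.encode(questions, convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)

THRESHOLD = 0.90
for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        if similarity[i][j] >= THRESHOLD:
            print(f"Near-duplicate: {questions[i]!r} ~ {questions[j]!r}")
```

Records flagged this way would then be dropped or routed for manual review, in line with the metadata enrichment described earlier.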
.
├── config.yaml                 # Main configuration file
├── data_raw/                   # Input for raw data
├── data_cleaned/               # Output of the cleaning process
├── data_structured/            # Output of the Knowledge Forge (Part 1)
├── data_qa/                    # Output of the Q&A Forge (Part 2)
├── dataset/                    # Final, cleaned dataset output (Part 3)
├── docs/                       # Detailed documentation
├── run.py                      # Main entry point for the Part 1 pipeline
├── run_qa_pipeline.py          # Entry point for the Part 2 pipeline
├── run_quality_assurance.py    # Entry point for the Part 3 pipeline
├── src/
│   ├── components/             # Core logic components (e.g., DocumentSplitter)
│   ├── pipeline/               # Pipeline-specific modules (e.g., QAGenerator)
│   ├── prompts/                # Prompt templates for LLM interaction
│   ├── schemas/                # JSON schemas for data structuring
│   └── utils/                  # Utility modules (e.g., APIClientManager)
└── requirements.txt            # Project dependencies
This is a research project, and contributions are welcome. Please open an issue to discuss any changes or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.