Dataset Generator

A web scraping tool that automatically generates question-answer datasets, with LLM integration and export to multiple formats.

🎯 Objective

Create quality datasets for training AI models by automatically scraping reliable sources and generating contextualized question-answer pairs. Export datasets to multiple formats including Langfuse for training data management.

⚡ Quick Start

# Configuration
cp .env.example .env
# Edit .env with your API keys

# Launch
make start

🏗️ Architecture

The project is split into modular components, each with a single responsibility:

Scraper: Retrieval of web content from specified URLs
LLM Client: Interaction with language models to generate question-answer pairs
Data Manager: Data management and dataset storage with multiple export formats
Pipeline: Orchestration of the complete dataset generation process
Export Module: Advanced dataset export to various platforms (Langfuse, JSON, CSV, etc.)
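
As a rough illustration of how these components could fit together, here is a minimal sketch in Python. All names (`QAPair`, `run_pipeline`, the callable signatures) are hypothetical and not taken from the project's code; the scraper and LLM client are replaced by stubs so the example runs without network access or API keys.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class QAPair:
    """One generated question-answer pair plus the page it came from."""
    question: str
    answer: str
    source_url: str


# Hypothetical component signatures; the real project may use classes instead.
Scraper = Callable[[str], str]                    # URL -> raw page text
QAGenerator = Callable[[str, str], list[QAPair]]  # (text, URL) -> QA pairs
Exporter = Callable[[list[QAPair]], None]         # stores or uploads the pairs


def run_pipeline(
    urls: Iterable[str],
    scrape: Scraper,
    generate: QAGenerator,
    export: Exporter,
) -> list[QAPair]:
    """Scrape each URL, generate QA pairs from its text, then export everything."""
    pairs: list[QAPair] = []
    for url in urls:
        text = scrape(url)
        pairs.extend(generate(text, url))
    export(pairs)
    return pairs


# Stubs standing in for the real scraper and LLM client (assumptions).
def fake_scrape(url: str) -> str:
    return f"Content of {url}"


def fake_generate(text: str, url: str) -> list[QAPair]:
    return [QAPair(f"What does {url} contain?", text, url)]


exported: list[QAPair] = []
pairs = run_pipeline(["https://example.com/a"], fake_scrape, fake_generate, exported.extend)
print(pairs[0].question)
```

Passing the components in as callables keeps the pipeline independent of any particular scraper, model provider, or export target, which is one way to get the separation of concerns described above.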

✨ Key Features

Multi-source Scraping: Support for various web sources and content types
AI-Powered QA Generation: Leverage state-of-the-art LLMs for intelligent question-answer pair creation
Multi-language Support: Generate datasets in French, English, Spanish, and German
Langfuse Integration: Direct export to Langfuse for dataset management and training workflows
Multiple Export Formats: JSON, CSV, JSONL, and platform-specific formats
Quality Control: Automated validation and filtering of generated content
Batch Processing: Efficient handling of large-scale data generation
API Interface: RESTful API for programmatic access and integration

🔄 Workflow

Scraping: Retrieving raw web data from multiple sources
Cleaning: Processing and normalizing text to extract relevant content
QA Generation: Creating high-quality question-answer pairs via LLMs with configurable prompts
Quality Assurance: Automated validation and filtering of generated datasets
Export: Multi-format export including Langfuse integration for seamless training workflows
Storage: Persistent storage with metadata tracking and version control
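
The cleaning and quality-assurance steps can be sketched with the standard library alone. This is an illustrative example, not the project's implementation: the function names, the regex-based tag stripping, and the validation rules (question ends with `?`, minimum answer length, no duplicate questions) are all assumptions chosen for brevity.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace.

    Regex tag stripping is crude; a real scraper would likely use an HTML parser.
    """
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()


def is_valid_pair(question: str, answer: str, min_answer_chars: int = 10) -> bool:
    """Hypothetical rule: the question ends with '?' and the answer is long enough."""
    return question.strip().endswith("?") and len(answer.strip()) >= min_answer_chars


def filter_pairs(
    pairs: list[tuple[str, str]], min_answer_chars: int = 10
) -> list[tuple[str, str]]:
    """Keep valid pairs and drop questions that repeat (case-insensitive)."""
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for question, answer in pairs:
        key = question.strip().lower()
        if key in seen or not is_valid_pair(question, answer, min_answer_chars):
            continue
        seen.add(key)
        kept.append((question.strip(), answer.strip()))
    return kept


print(clean_text("<p>Hello &amp; welcome</p>\n\n<p>World</p>"))
```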

📊 Export Options

Langfuse: Direct integration for training data management
JSON/JSONL: Standard formats for data interchange
CSV: Tabular format for analysis and review
Custom Formats: Extensible export system for specific requirements
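
To make the file-based formats concrete, the sketch below serializes the same records to JSON, JSONL, and CSV using only the standard library. The `EXPORTERS` registry and the function names are hypothetical; they only illustrate how an extensible export system might register a custom format. Langfuse export is left out because it depends on the Langfuse SDK and credentials.

```python
import csv
import io
import json


def to_json(records: list[dict]) -> str:
    """Serialize all records as one JSON array."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def to_csv(records: list[dict], fieldnames: tuple[str, ...] = ("question", "answer")) -> str:
    """Serialize records as CSV with a header row; extra keys are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical registry: adding a custom format means adding one entry here.
EXPORTERS = {"json": to_json, "jsonl": to_jsonl, "csv": to_csv}


def export(records: list[dict], fmt: str) -> str:
    """Dispatch to the exporter registered under `fmt`."""
    try:
        exporter = EXPORTERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported export format: {fmt!r}") from None
    return exporter(records)


records = [{"question": "What is uv?", "answer": "A Python package manager."}]
print(export(records, "jsonl"))
```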

🔧 Configuration

The tool supports extensive configuration options for:

LLM model selection and parameters
Export format preferences
Quality thresholds and validation rules
Batch processing settings
API rate limiting and retry policies
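
As one possible shape for these settings, the sketch below groups them into dataclasses and shows a retry policy with exponential backoff. Every field name and default value here is an assumption made for illustration; the project's actual configuration keys live in its own `.env` and settings files.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings with exponential backoff."""
    max_attempts: int = 3
    base_delay_s: float = 0.5
    backoff_factor: float = 2.0

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")


@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical top-level settings mirroring the options listed above."""
    model_name: str = "example-model"  # placeholder, not a real model ID
    temperature: float = 0.2
    export_format: str = "jsonl"
    min_answer_chars: int = 10
    batch_size: int = 20
    retry: RetryPolicy = field(default_factory=RetryPolicy)


def call_with_retry(
    func: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on any exception with growing delays."""
    for attempt in range(policy.max_attempts):
        try:
            return func()
        except Exception:
            if attempt == policy.max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(policy.base_delay_s * policy.backoff_factor ** attempt)
    raise AssertionError("unreachable: max_attempts is validated to be >= 1")


config = GeneratorConfig()
print(config.retry.max_attempts, config.export_format)
```

The `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting.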

🌍 Supported Languages

French (fr): French language dataset generation
English (en): English language dataset generation
Spanish (es): Spanish language dataset generation
German (de): German language dataset generation
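
A small sketch of how the language codes above might be validated and turned into a prompt instruction. The mapping reflects the four codes listed; the function name and the instruction wording are hypothetical.

```python
SUPPORTED_LANGUAGES = {
    "fr": "French",
    "en": "English",
    "es": "Spanish",
    "de": "German",
}


def language_instruction(code: str) -> str:
    """Return a prompt instruction for a supported language code."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return f"Write every question and answer in {SUPPORTED_LANGUAGES[normalized]}."


print(language_instruction("fr"))
```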

🧪 Testing & Coverage

This project tracks test coverage and enforces a minimum threshold in CI.

# Run tests with coverage (HTML report)
make test

# Run tests for CI (XML report, enforces 70% minimum)
make test-ci

# Run pre-commit hooks (includes tests on push)
uv run prek run --all-files

Coverage Reports

Local: After running tests, open htmlcov/index.html for a detailed coverage report
CI/CD: Coverage reports are automatically generated and uploaded on every PR
Codecov: View detailed coverage on Codecov

Current coverage threshold: CI requires at least 70% coverage to pass

Repository: KevinDeBenedetti/dataset-generator

Folders and files

Latest commit

History

Repository files navigation

Dataset Generator

🎯 Objective

⚡ Quick Start

🏗️ Architecture

✨ Key Features

🔄 Workflow

📊 Export Options

🔧 Configuration

🌍 Supported Languages

🧪 Testing & Coverage

Coverage Reports

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 15

Uh oh!

Contributors 5

Uh oh!

Languages