DataGen - By TeichAI

An easy-to-use CLI that generates JSONL datasets from a TXT file of prompts using LLMs.

Install

npm i -g @teichai/datagen

Or install locally and run via npx:

npm i -D @teichai/datagen
npx datagen --help

Run tests:

npm test

Usage

Set your OpenRouter API key:

export API_KEY="your_openrouter_key"

Create a prompts file where each line is a prompt:

Explain the CAP theorem in simple terms.
Write a Python function to reverse a linked list.
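
As a minimal sketch, the prompts file can be created straight from the shell. The file name prompts.txt matches the run command in this README. The line count at the end is only an illustrative sanity check, not something datagen requires.

```shell
# Write one prompt per line to prompts.txt.
cat > prompts.txt <<'EOF'
Explain the CAP theorem in simple terms.
Write a Python function to reverse a linked list.
EOF

# Count the non-empty lines, i.e. the number of prompts in the file.
prompt_count=$(grep -c . prompts.txt)
echo "prompts: $prompt_count"
```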

Run:

datagen --model openai/gpt-4o-mini --prompts prompts.txt

Configuration File

You can also use a YAML config file:

model: openai/gpt-4o-mini
prompts: ./prompts.txt
out: ./dataset.jsonl
concurrent: 5
openrouter:
  providerSort: throughput

Run with:

datagen --config config.yaml

Note: On startup, datagen makes a quick, best-effort check for a newer version on npm and prints an upgrade command if one is available. Disable the check by setting DATAGEN_DISABLE_UPDATE_CHECK=1.
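
For example, in a CI job or any non-interactive shell, the variable could be exported once before datagen is invoked. This is a sketch that uses only the variable name given in the note above.

```shell
# Disable the startup update check for the rest of this shell session.
export DATAGEN_DISABLE_UPDATE_CHECK=1
```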

Development (build + run once):

API_KEY="your_openrouter_key" npm run dev -- --model openai/gpt-4o-mini --prompts prompts.txt

Options

  • --help: show the help message and exit.
  • --version: print the CLI version and exit.
  • --config <file>: path to a YAML config file.
  • --model <name>: required model name.
  • --prompts <file>: required prompts file.
  • --out <file>: output JSONL (default dataset.jsonl).
  • --api <baseUrl>: API base (default OpenRouter).
  • --system <text>: optional system prompt.
  • --store-system true|false: store system message in output (default true).
  • --concurrent <num>: number of in-flight requests (default 1).
  • --openrouter.provider <slugs>: comma-separated provider slugs to try in order (OpenRouter only).
  • --openrouter.providerSort <price|throughput|latency>: provider routing sort (OpenRouter only).
  • --reasoningEffort <none|minimal|low|medium|high|xhigh>: pass through as reasoning.effort.
  • --no-progress: disable the progress bar.
  • --timeout <ms>: request timeout in milliseconds.
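
As a sketch of how these flags combine, the command below mirrors the YAML config example using flags from the list above. The 60000 ms timeout is an arbitrary illustrative value, and the guard around the final call is an addition for this example so that nothing runs when datagen, API_KEY, or the prompts file is missing.

```shell
# Flags taken from the options list; values mirror the YAML config example.
set -- \
  --model openai/gpt-4o-mini \
  --prompts ./prompts.txt \
  --out ./dataset.jsonl \
  --concurrent 5 \
  --openrouter.providerSort throughput \
  --timeout 60000

# Show the full command before running it.
cmd="datagen $*"
echo "$cmd"

# Illustrative guard: only call the CLI when it is installed, a key is set, and the prompts file exists.
if command -v datagen >/dev/null 2>&1 && [ -n "${API_KEY:-}" ] && [ -f ./prompts.txt ]; then
  datagen "$@"
fi
```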
