Add SDG starter kit #31
base: main
Conversation
Create a Starter Kit which deploys an LLM as an Endpoint for synthetic data generation and sends sample requests to the deployed model.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Greptile Overview
Greptile Summary
This PR adds a new Jupyter notebook starter kit (synthetic-qa-data-gen-with-nemotron.ipynb) to the lepton/starterkits directory. The notebook demonstrates end-to-end synthetic data generation using DGX Cloud Lepton Endpoints with the Nemotron-nano-9b-v2 model. It guides users through deploying an LLM endpoint, generating subtopics and questions about CUDA programming, creating paired responses, generating math problems, and saving the results to JSONL format for downstream training tasks like reward modeling or DPO. The notebook leverages AsyncOpenAI for concurrent API calls, implements semaphore-based rate limiting, and includes detailed prompt templates for each generation stage. It fits into the existing starterkits collection by providing a practical example of synthetic data workflows on DGX Cloud Lepton infrastructure, complementing other starter kits that focus on fine-tuning, evaluation, and model deployment patterns.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| lepton/starterkits/synthetic-qa-data-gen-with-nemotron.ipynb | 3/5 | New starter kit notebook demonstrating synthetic data generation pipeline with Nemotron-nano-9b-v2 via Lepton Endpoints, including async request handling and JSONL output |
Confidence score: 3/5
- This PR introduces pedagogical content with several implementation patterns that may cause runtime issues or silent data loss if users run it without modification.
- Score reflects five distinct concerns: (1) empty `SAVE_DIRECTORY` defaults to the current directory, risking data overwrite; (2) `wait_for_endpoint` may return a URL before the endpoint is fully ready due to loop logic; (3) `generate_response` creates a new semaphore on every call when `sem=None`, breaking concurrency control; (4) `zip(..., strict=False)` silently drops data if list lengths mismatch; (5) bare exception handlers in `generate_math` can hide critical failures and leave incomplete output.
- Pay close attention to `lepton/starterkits/synthetic-qa-data-gen-with-nemotron.ipynb`, particularly the configuration cell (lines 84-91), the `wait_for_endpoint` function (lines 138-157), the `generate_response` function (lines 521-532), the zip operation (lines 566-576), and the exception handling in `generate_math` (lines 709-712).
Sequence Diagram
sequenceDiagram
participant User
participant Notebook
participant LeptonCLI as "Lepton CLI"
participant Endpoint as "Model Endpoint"
participant OpenAIClient as "OpenAI Client"
participant LLM as "Nemotron LLM"
participant FileSystem as "File System"
User->>Notebook: "Configure environment variables"
User->>LeptonCLI: "lep login -c $LEPTON_KEY"
LeptonCLI-->>User: "Authenticated"
User->>LeptonCLI: "lep endpoint create"
LeptonCLI->>Endpoint: "Deploy model with vLLM"
Notebook->>Endpoint: "Wait for endpoint to be ready"
Endpoint-->>Notebook: "Endpoint URL"
User->>OpenAIClient: "Initialize client with endpoint URL"
OpenAIClient-->>User: "Client ready"
User->>OpenAIClient: "Generate subtopics from topic"
OpenAIClient->>LLM: "POST /v1/chat/completions (TOPIC_GENERATION_PROMPT)"
LLM-->>OpenAIClient: "Subtopics list"
OpenAIClient-->>Notebook: "Parse and store subtopics"
User->>Notebook: "Generate questions for each subtopic"
loop "For each subtopic (async)"
Notebook->>OpenAIClient: "generate_questions(subtopic)"
OpenAIClient->>LLM: "POST /v1/chat/completions (QUESTION_PROMPT)"
LLM-->>OpenAIClient: "Questions list"
OpenAIClient-->>Notebook: "Store questions"
end
User->>Notebook: "Generate responses for questions"
loop "For each question (async with semaphore)"
Notebook->>OpenAIClient: "generate_response(question)"
OpenAIClient->>LLM: "POST /v1/chat/completions (RESPONSE_PROMPT)"
LLM-->>OpenAIClient: "Response A & B"
OpenAIClient-->>Notebook: "Store response pair"
end
Notebook->>FileSystem: "Write synthetic_data.jsonl"
FileSystem-->>User: "Dataset saved"
opt "Generate math problems"
User->>Notebook: "Generate math problems"
Notebook->>OpenAIClient: "generate_math(topic)"
OpenAIClient->>LLM: "POST /v1/chat/completions (MATH_PROMPT)"
LLM-->>OpenAIClient: "Math problems and solutions"
OpenAIClient-->>Notebook: "Parse problem-solution pairs"
Notebook-->>User: "Display math problems"
end
1 file reviewed, 7 comments
| "\n", | ||
| "BASE_MODEL = \"nvidia/nvidia-nemotron-nano-9b-v2\" # Optionally use a different model for requests\n", | ||
| "ACCESS_TOKEN = \"my-access-token\" # Set the password for authenticating endpoint requests\n", | ||
| "SAVE_DIRECTORY = \"\" # Specify the absolute path to save the generated data. To save on shared storage, must be a mounted storage path\n", |
logic: An empty `SAVE_DIRECTORY` causes `os.path.join` to write into the current working directory. Consider a default like `"./output"` or add validation to ensure users specify a path. Should this fail early if empty, or is saving to the current working directory acceptable?
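A minimal sketch of one possible guard, shown for illustration only (the `./output` fallback and the `os.makedirs` call are assumptions, not part of the notebook):

```python
import os

SAVE_DIRECTORY = ""  # value from the notebook's configuration cell

# Hypothetical guard: fall back to an explicit ./output directory instead of
# silently writing into whatever the current working directory happens to be.
if not SAVE_DIRECTORY:
    SAVE_DIRECTORY = os.path.join(os.getcwd(), "output")
os.makedirs(SAVE_DIRECTORY, exist_ok=True)
print(f"Generated data will be written to {SAVE_DIRECTORY}")
```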
| "def wait_for_endpoint(endpoint_name: str, interval: int = 10) -> str:\n", | ||
| " command = [\"lep\", \"endpoint\", \"status\", \"-n\", endpoint_name, \"--detail\"]\n", | ||
| " while True:\n", | ||
| " result = subprocess.run(command, capture_output=True, text=True, check=True)\n", | ||
| " for line in result.stdout.split(\"\\n\"):\n", | ||
| " if line.startswith(\"State\"):\n", | ||
| " _, state = line.strip().rsplit(\" \", maxsplit=1)\n", | ||
| " if \"LeptonDeploymentState.Ready\" in state:\n", | ||
| " print(\"Endpoint deployed!\")\n", | ||
| " else:\n", | ||
| " break\n", | ||
| " url_match = re.search(r'https://[\\w\\d\\.\\-]+', line)\n", | ||
| " if url_match:\n", | ||
| " print(f\"URL: {url_match[0]}\")\n", | ||
| " return url_match[0]\n", | ||
| " print(f\"Waiting for endpoint {endpoint_name} to be ready...\")\n", | ||
| " time.sleep(interval)\n", |
logic: `wait_for_endpoint` runs `subprocess.run` with `check=True` but doesn't handle `CalledProcessError`. If the CLI command fails, this will crash instead of retrying or logging.
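One way to make this resilient, sketched under the assumption that simply retrying a failed `lep` call is acceptable (the wrapper name is hypothetical):

```python
import subprocess
import time

def run_status_command(command: list[str], interval: int = 10) -> str:
    """Hypothetical wrapper: retry a transient CLI failure instead of crashing."""
    while True:
        try:
            result = subprocess.run(command, capture_output=True, text=True, check=True)
            return result.stdout
        except subprocess.CalledProcessError as err:
            print(f"'lep endpoint status' failed (exit {err.returncode}): {err.stderr.strip()}")
            time.sleep(interval)
```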
| " url_match = re.search(r'https://[\\w\\d\\.\\-]+', line)\n", | ||
| " if url_match:\n", | ||
| " print(f\"URL: {url_match[0]}\")\n", | ||
| " return url_match[0]\n", |
logic: Returns from inside the `for line` loop. If `url_match` is found before the state check, the endpoint may not actually be ready yet.
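A possible restructuring, sketched below, confirms the Ready state across the whole status output before searching for the URL; the function name is hypothetical and the CLI output format is assumed to match the original code:

```python
import re
import subprocess
import time

def wait_for_endpoint_ready(endpoint_name: str, interval: int = 10) -> str:
    """Hypothetical restructuring: confirm the Ready state first, then look for the URL."""
    command = ["lep", "endpoint", "status", "-n", endpoint_name, "--detail"]
    while True:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        lines = result.stdout.split("\n")
        # Only search for a URL once the State line reports Ready.
        ready = any(
            line.startswith("State") and "LeptonDeploymentState.Ready" in line
            for line in lines
        )
        if ready:
            for line in lines:
                url_match = re.search(r"https://[\w\d\.\-]+", line)
                if url_match:
                    print(f"Endpoint deployed! URL: {url_match[0]}")
                    return url_match[0]
        print(f"Waiting for endpoint {endpoint_name} to be ready...")
        time.sleep(interval)
```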
| "outputs": [], | ||
| "source": [ | ||
| "responses = await generate_subtopics(client, topic=topic, n_subtopics=n_subtopics)\n", | ||
| "nonreasoning_answer = re.sub(r'.*</think>', \"\", responses.choices[0].message.content, flags=re.DOTALL).strip()" |
style: The regex assumes reasoning is enclosed in a closing `</think>` tag. If the model formats its output differently (or emits no reasoning), this will strip nothing or strip incorrectly.
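A more defensive variant, sketched here with a hypothetical helper name, only strips when the closing tag is actually present:

```python
def strip_reasoning(content: str) -> str:
    """Hypothetical helper: strip the reasoning block only when </think> is present."""
    if "</think>" in content:
        # Keep only the text after the last closing tag.
        return content.rsplit("</think>", 1)[-1].strip()
    return content.strip()
```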
| "source": [ | ||
| "async def generate_response(client, question, sem=None):\n", | ||
| " prompt = RESPONSE_PROMPT_TEMPLATE.format(question=question)\n", | ||
| " async with sem or asyncio.Semaphore(1): \n", |
logic: `async with sem or asyncio.Semaphore(1)` creates a new semaphore on each call if `sem=None`. This defeats concurrency control; pass a real semaphore or use a single shared default instead.
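A sketch of the shared-semaphore pattern, assuming `RESPONSE_PROMPT_TEMPLATE`, `BASE_MODEL`, and `client` are defined in earlier notebook cells; the exact request parameters are illustrative, not the notebook's:

```python
import asyncio

# Hypothetical rewrite: the caller owns one semaphore shared by every task, so a
# fresh Semaphore can no longer be created per call when sem is None.
async def generate_response(client, question, sem: asyncio.Semaphore):
    prompt = RESPONSE_PROMPT_TEMPLATE.format(question=question)  # template from the notebook
    async with sem:
        return await client.chat.completions.create(
            model=BASE_MODEL,  # notebook configuration value
            messages=[{"role": "user", "content": prompt}],
        )

# Usage sketch: one shared semaphore caps concurrency across all questions.
# sem = asyncio.Semaphore(8)
# responses = await asyncio.gather(*(generate_response(client, q, sem) for q in question_list))
```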
| "source": [ | ||
| "question_response_pair_list = []\n", | ||
| "\n", | ||
| "for question, response_set in zip(question_list_formatted, question_response_list, strict=False):\n", |
style: `strict=False` in `zip` silently truncates if the lists differ in length. This may hide data-loss bugs if generation failed for some questions.
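A small illustration of the difference, using made-up data rather than the notebook's lists:

```python
question_list_formatted = ["Q1", "Q2", "Q3"]            # illustrative data only
question_response_list = [("a1", "b1"), ("a2", "b2")]   # one response pair is missing

# strict=True (Python 3.10+) raises ValueError on the length mismatch instead of
# silently dropping "Q3" the way strict=False would.
pairs = list(zip(question_list_formatted, question_response_list, strict=True))
```

Running this raises `ValueError` at the mismatch rather than producing a shorter dataset with no warning.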
| " except Exception as e:\n", | ||
| " print(f\"Attempt {attempt+1}/{n_retries} failed: {e}\")\n", |
style: A broad `except Exception` silently swallows all errors, including programming bugs unrelated to transient API failures. Consider catching specific exceptions or re-raising after logging.
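One alternative, sketched with a hypothetical helper and assuming the requests go through the `openai` v1 client (so `openai.APIError` is the relevant base class):

```python
import asyncio
import openai

# Hypothetical retry helper: catch only API errors worth retrying and re-raise the
# last one once attempts are exhausted, rather than swallowing every Exception.
async def call_with_retries(make_request, n_retries: int = 3, backoff: float = 2.0):
    last_err = None
    for attempt in range(n_retries):
        try:
            return await make_request()
        except openai.APIError as err:
            last_err = err
            print(f"Attempt {attempt + 1}/{n_retries} failed: {err}")
            await asyncio.sleep(backoff * (attempt + 1))
    raise last_err
```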
Create a Starter Kit which deploys an LLM as an Endpoint for synthetic data generation and sends sample requests to the deployed model.
Created on behalf of Anna Ollerenshaw at NVIDIA.