Add SDG starter kit v2 #32
base: main
Conversation
Create a Starter Kit which deploys an LLM as an Endpoint for synthetic data generation and sends sample requests to the deployed model.

Signed-off-by: Robert Clark <roclark@nvidia.com>
Greptile Overview
Greptile Summary
This PR adds a new Jupyter notebook starter kit that demonstrates synthetic data generation for chat QA datasets using NVIDIA's Nemotron model on DGX Cloud Lepton. The notebook provides an end-to-end pipeline from deploying model endpoints to generating structured question-answer pairs for training reward models or DPO optimization. It includes two generation modes: general QA generation and math problem generation, showcasing versatility for different synthetic data use cases.
The notebook integrates with the existing DGX Cloud Lepton starter kit collection by following the established pattern of providing complete, executable workflows. It demonstrates key platform capabilities including endpoint deployment, async request handling with concurrency controls, and structured data export. The implementation uses modern async OpenAI client patterns with proper semaphore-based rate limiting to handle large-scale data generation efficiently.
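As a quick illustration of that pattern, here is a minimal sketch of semaphore-gated async generation with the OpenAI client; the endpoint URL, API key variable, and prompt list are placeholders, not values taken from the notebook:

```python
import asyncio

from openai import AsyncOpenAI

# Placeholders: substitute the deployed endpoint URL and your LEPTON_KEY.
client = AsyncOpenAI(base_url="https://<endpoint-url>/v1", api_key="<LEPTON_KEY>")
semaphore = asyncio.Semaphore(5)  # cap in-flight requests at 5

async def generate(prompt: str) -> str:
    # The semaphore blocks here once 5 requests are already in flight,
    # which is the rate-limiting pattern the summary describes.
    async with semaphore:
        resp = await client.chat.completions.create(
            model="nvidia/nvidia-nemotron-nano-9b-v2",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def generate_all(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(p) for p in prompts))
```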
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| lepton/starterkits/synthetic-qa-data-gen-with-nemotron.ipynb | 3/5 | New comprehensive starter kit for synthetic QA data generation with endpoint deployment and async request handling |
Confidence score: 3/5
- This PR requires careful review due to logical issues in critical functions that could cause infinite loops or runtime errors
- Score reflects well-structured notebook design but deducted points for an endpoint waiting logic bug, fragile response parsing, and outdated exception handling patterns (a defensive parsing sketch follows below)
- Pay close attention to the `wait_for_endpoint` function logic and exception handling patterns in the new notebook
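On the fragile-parsing point, a defensive parser for the paired-response format could look like the sketch below; the `Response A:` / `Response B:` markers are an assumption based on the sequence diagram, and the notebook's actual template may differ:

```python
import re
from typing import Optional

# Assumed markers; adjust to match the notebook's RESPONSE_PROMPT_TEMPLATE.
PAIR_RE = re.compile(r"Response A:\s*(?P<a>.*?)\s*Response B:\s*(?P<b>.*)", re.DOTALL)

def parse_response_pair(text: str) -> Optional[dict]:
    # Return None rather than raising, so one malformed generation
    # doesn't abort an entire async batch.
    match = PAIR_RE.search(text)
    if match is None:
        return None
    return {
        "response_a": match.group("a").strip(),
        "response_b": match.group("b").strip(),
    }
```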
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant "Jupyter Notebook"
    participant "DGX Cloud Lepton"
    participant "vLLM Endpoint"
    participant "Nemotron Model"
    participant "OpenAI Client"
    participant "File System"
    User->>+"Jupyter Notebook": "Configure environment variables and credentials"
    "Jupyter Notebook"->>+"DGX Cloud Lepton": "Authenticate with LEPTON_KEY"
    "DGX Cloud Lepton"-->>-"Jupyter Notebook": "Authentication successful"
    "Jupyter Notebook"->>+"DGX Cloud Lepton": "Create endpoint with vLLM container"
    "DGX Cloud Lepton"->>+"vLLM Endpoint": "Deploy Nemotron model container"
    "vLLM Endpoint"->>+"Nemotron Model": "Load nvidia/nvidia-nemotron-nano-9b-v2"
    "Nemotron Model"-->>-"vLLM Endpoint": "Model ready"
    "vLLM Endpoint"-->>-"DGX Cloud Lepton": "Endpoint deployed"
    "DGX Cloud Lepton"-->>-"Jupyter Notebook": "Endpoint URL and status"
    "Jupyter Notebook"->>+"OpenAI Client": "Initialize client with endpoint URL"
    "OpenAI Client"-->>-"Jupyter Notebook": "Client ready"
    User->>+"Jupyter Notebook": "Generate limerick test request"
    "Jupyter Notebook"->>+"OpenAI Client": "Send test completion request"
    "OpenAI Client"->>+"vLLM Endpoint": "POST /v1/chat/completions"
    "vLLM Endpoint"->>+"Nemotron Model": "Generate limerick about GPU computing"
    "Nemotron Model"-->>-"vLLM Endpoint": "Stream response with reasoning"
    "vLLM Endpoint"-->>-"OpenAI Client": "Streamed completion chunks"
    "OpenAI Client"-->>-"Jupyter Notebook": "Display streamed output"
    User->>+"Jupyter Notebook": "Generate subtopics for main topic"
    "Jupyter Notebook"->>+"OpenAI Client": "Send subtopic generation request"
    "OpenAI Client"->>+"vLLM Endpoint": "POST with TOPIC_GENERATION_PROMPT_TEMPLATE"
    "vLLM Endpoint"->>+"Nemotron Model": "Generate 5 subtopics for Wales"
    "Nemotron Model"-->>-"vLLM Endpoint": "Return comma-separated subtopics"
    "vLLM Endpoint"-->>-"OpenAI Client": "Subtopics response"
    "OpenAI Client"-->>-"Jupyter Notebook": "Parse and store subtopic list"
    "Jupyter Notebook"->>+"OpenAI Client": "Generate questions for each subtopic (async batch)"
    loop "For each subtopic"
        "OpenAI Client"->>+"vLLM Endpoint": "POST with QUESTION_PROMPT_TEMPLATE"
        "vLLM Endpoint"->>+"Nemotron Model": "Generate 5 questions for subtopic"
        "Nemotron Model"-->>-"vLLM Endpoint": "Return newline-separated questions"
        "vLLM Endpoint"-->>-"OpenAI Client": "Questions response"
    end
    "OpenAI Client"-->>-"Jupyter Notebook": "Collect all questions into single list"
    "Jupyter Notebook"->>+"OpenAI Client": "Generate responses for questions (concurrent with semaphore)"
    loop "For each question (max 5 concurrent)"
        "OpenAI Client"->>+"vLLM Endpoint": "POST with RESPONSE_PROMPT_TEMPLATE"
        "vLLM Endpoint"->>+"Nemotron Model": "Generate Response A and Response B"
        "Nemotron Model"-->>-"vLLM Endpoint": "Return formatted responses"
        "vLLM Endpoint"-->>-"OpenAI Client": "Response pair"
    end
    "OpenAI Client"-->>-"Jupyter Notebook": "Parse and structure question-response pairs"
    "Jupyter Notebook"->>+"File System": "Save synthetic_data.jsonl"
    "File System"-->>-"Jupyter Notebook": "File saved successfully"
    User->>+"Jupyter Notebook": "Generate math problems"
    "Jupyter Notebook"->>+"OpenAI Client": "Send math problem generation request"
    "OpenAI Client"->>+"vLLM Endpoint": "POST with MATH_PROMPT_TEMPLATE"
    "vLLM Endpoint"->>+"Nemotron Model": "Generate 5 algebra problems with solutions"
    "Nemotron Model"-->>-"vLLM Endpoint": "Return tagged problem-solution pairs"
    "vLLM Endpoint"-->>-"OpenAI Client": "Math problems response"
    "OpenAI Client"-->>-"Jupyter Notebook": "Parse with regex and display results"
```
1 file reviewed, 1 comment
The flagged lines in `wait_for_endpoint` (shown here unescaped from the notebook source):

```python
if ready and endpoint_url:
    print("Endpoint deployed!")
    print(f"URL: {endpoint_url}")

print(f"Waiting for endpoint {endpoint_name} to be ready...")
time.sleep(interval)
```
logic: Infinite loop issue: the function prints the success message when the endpoint is ready but never returns `endpoint_url`, so execution falls through to the wait message and keeps polling forever
Suggested change:

```python
if ready and endpoint_url:
    print("Endpoint deployed!")
    print(f"URL: {endpoint_url}")
    return endpoint_url
print(f"Waiting for endpoint {endpoint_name} to be ready...")
time.sleep(interval)
```
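For context, the corrected loop in full might look like the sketch below, with a timeout guard added so the poll can never run unbounded; `get_status` is a hypothetical stand-in for whatever Lepton status call the notebook uses:

```python
import time

def wait_for_endpoint(get_status, endpoint_name: str,
                      interval: int = 15, timeout: int = 1800) -> str:
    # get_status is assumed to return (ready: bool, endpoint_url: str | None).
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ready, endpoint_url = get_status(endpoint_name)
        if ready and endpoint_url:
            print("Endpoint deployed!")
            print(f"URL: {endpoint_url}")
            return endpoint_url
        print(f"Waiting for endpoint {endpoint_name} to be ready...")
        time.sleep(interval)
    raise TimeoutError(f"Endpoint {endpoint_name} was not ready after {timeout}s")
```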