Add SDG starter kit #31
base: main
Conversation
Create a Starter Kit which deploys an LLM as an Endpoint for synthetic data generation and sends sample requests to the deployed model.

Signed-Off-By: Robert Clark <roclark@nvidia.com>
Greptile Overview
Greptile Summary
This PR adds a new Jupyter notebook starter kit (synthetic-qa-data-gen-with-nemotron.ipynb) to the lepton/starterkits directory. The notebook demonstrates end-to-end synthetic data generation using DGX Cloud Lepton Endpoints with the Nemotron-nano-9b-v2 model. It guides users through deploying an LLM endpoint, generating subtopics and questions about CUDA programming, creating paired responses, generating math problems, and saving the results to JSONL format for downstream training tasks like reward modeling or DPO. The notebook leverages AsyncOpenAI for concurrent API calls, implements semaphore-based rate limiting, and includes detailed prompt templates for each generation stage. It fits into the existing starterkits collection by providing a practical example of synthetic data workflows on DGX Cloud Lepton infrastructure, complementing other starter kits that focus on fine-tuning, evaluation, and model deployment patterns.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| lepton/starterkits/synthetic-qa-data-gen-with-nemotron.ipynb | 3/5 | New starter kit notebook demonstrating synthetic data generation pipeline with Nemotron-nano-9b-v2 via Lepton Endpoints, including async request handling and JSONL output |
Confidence score: 3/5
- This PR introduces pedagogical content with several implementation patterns that may cause runtime issues or silent data loss if users run it without modification.
- Score reflects five distinct concerns: (1) empty `SAVE_DIRECTORY` defaults to the current directory, risking data overwrite; (2) `wait_for_endpoint` may return a URL before the endpoint is fully ready due to loop logic; (3) `generate_response` creates a new semaphore on every call when `sem=None`, breaking concurrency control; (4) `zip(..., strict=False)` silently drops data if list lengths mismatch; (5) bare exception handlers in `generate_math` can hide critical failures and leave incomplete output.
- Pay close attention to `lepton/starterkits/synthetic-qa-data-gen-with-nemotron.ipynb`, particularly the configuration cell (lines 84-91), the `wait_for_endpoint` function (lines 138-157), the `generate_response` function (lines 521-532), the zip operation (lines 566-576), and the exception handling in `generate_math` (lines 709-712).
Sequence Diagram
sequenceDiagram
participant User
participant Notebook
participant LeptonCLI as "Lepton CLI"
participant Endpoint as "Model Endpoint"
participant OpenAIClient as "OpenAI Client"
participant LLM as "Nemotron LLM"
participant FileSystem as "File System"
User->>Notebook: "Configure environment variables"
User->>LeptonCLI: "lep login -c $LEPTON_KEY"
LeptonCLI-->>User: "Authenticated"
User->>LeptonCLI: "lep endpoint create"
LeptonCLI->>Endpoint: "Deploy model with vLLM"
Notebook->>Endpoint: "Wait for endpoint to be ready"
Endpoint-->>Notebook: "Endpoint URL"
User->>OpenAIClient: "Initialize client with endpoint URL"
OpenAIClient-->>User: "Client ready"
User->>OpenAIClient: "Generate subtopics from topic"
OpenAIClient->>LLM: "POST /v1/chat/completions (TOPIC_GENERATION_PROMPT)"
LLM-->>OpenAIClient: "Subtopics list"
OpenAIClient-->>Notebook: "Parse and store subtopics"
User->>Notebook: "Generate questions for each subtopic"
loop "For each subtopic (async)"
Notebook->>OpenAIClient: "generate_questions(subtopic)"
OpenAIClient->>LLM: "POST /v1/chat/completions (QUESTION_PROMPT)"
LLM-->>OpenAIClient: "Questions list"
OpenAIClient-->>Notebook: "Store questions"
end
User->>Notebook: "Generate responses for questions"
loop "For each question (async with semaphore)"
Notebook->>OpenAIClient: "generate_response(question)"
OpenAIClient->>LLM: "POST /v1/chat/completions (RESPONSE_PROMPT)"
LLM-->>OpenAIClient: "Response A & B"
OpenAIClient-->>Notebook: "Store response pair"
end
Notebook->>FileSystem: "Write synthetic_data.jsonl"
FileSystem-->>User: "Dataset saved"
opt "Generate math problems"
User->>Notebook: "Generate math problems"
Notebook->>OpenAIClient: "generate_math(topic)"
OpenAIClient->>LLM: "POST /v1/chat/completions (MATH_PROMPT)"
LLM-->>OpenAIClient: "Math problems and solutions"
OpenAIClient-->>Notebook: "Parse problem-solution pairs"
Notebook-->>User: "Display math problems"
end
1 file reviewed, 7 comments
| "\n", | ||
| "BASE_MODEL = \"nvidia/nvidia-nemotron-nano-9b-v2\" # Optionally use a different model for requests\n", | ||
| "ACCESS_TOKEN = \"my-access-token\" # Set the password for authenticating endpoint requests\n", | ||
| "SAVE_DIRECTORY = \"\" # Specify the absolute path to save the generated data. To save on shared storage, must be a mounted storage path\n", |
logic: An empty `SAVE_DIRECTORY` causes `os.path.join` to write into the current working directory. Consider a default like `"./output"` or add validation to ensure users specify a path. Should this fail early if empty, or is saving to the current working directory acceptable?
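A minimal sketch of one possible guard, shown for illustration only (the `./output` fallback and the `os.makedirs` call are assumptions, not part of the notebook):

```python
import os

SAVE_DIRECTORY = ""  # value from the notebook's configuration cell

# Hypothetical guard: fall back to an explicit ./output directory instead of
# silently writing into whatever the current working directory happens to be.
if not SAVE_DIRECTORY:
    SAVE_DIRECTORY = os.path.join(os.getcwd(), "output")
os.makedirs(SAVE_DIRECTORY, exist_ok=True)
print(f"Generated data will be written to {SAVE_DIRECTORY}")
```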
| "def wait_for_endpoint(endpoint_name: str, interval: int = 10) -> str:\n", | ||
| " command = [\"lep\", \"endpoint\", \"status\", \"-n\", endpoint_name, \"--detail\"]\n", | ||
| " while True:\n", | ||
| " result = subprocess.run(command, capture_output=True, text=True, check=True)\n", | ||
| " for line in result.stdout.split(\"\\n\"):\n", | ||
| " if line.startswith(\"State\"):\n", | ||
| " _, state = line.strip().rsplit(\" \", maxsplit=1)\n", | ||
| " if \"LeptonDeploymentState.Ready\" in state:\n", | ||
| " print(\"Endpoint deployed!\")\n", | ||
| " else:\n", | ||
| " break\n", | ||
| " url_match = re.search(r'https://[\\w\\d\\.\\-]+', line)\n", | ||
| " if url_match:\n", | ||
| " print(f\"URL: {url_match[0]}\")\n", | ||
| " return url_match[0]\n", | ||
| " print(f\"Waiting for endpoint {endpoint_name} to be ready...\")\n", | ||
| " time.sleep(interval)\n", |
logic: `wait_for_endpoint` runs `subprocess.run` with `check=True` but doesn't handle `CalledProcessError`. If the CLI command fails, this will crash instead of retrying or logging.
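One way to make this resilient, sketched under the assumption that simply retrying a failed `lep` call is acceptable (the wrapper name is hypothetical):

```python
import subprocess
import time

def run_status_command(command: list[str], interval: int = 10) -> str:
    """Hypothetical wrapper: retry a transient CLI failure instead of crashing."""
    while True:
        try:
            result = subprocess.run(command, capture_output=True, text=True, check=True)
            return result.stdout
        except subprocess.CalledProcessError as err:
            print(f"'lep endpoint status' failed (exit {err.returncode}): {err.stderr.strip()}")
            time.sleep(interval)
```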
| " url_match = re.search(r'https://[\\w\\d\\.\\-]+', line)\n", | ||
| " if url_match:\n", | ||
| " print(f\"URL: {url_match[0]}\")\n", | ||
| " return url_match[0]\n", |
logic: Returns from inside the `for line` loop. If `url_match` is found before the state check, the endpoint may not actually be ready yet.
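A possible restructuring, sketched below, confirms the Ready state across the whole status output before searching for the URL; the function name is hypothetical and the CLI output format is assumed to match the original code:

```python
import re
import subprocess
import time

def wait_for_endpoint_ready(endpoint_name: str, interval: int = 10) -> str:
    """Hypothetical restructuring: confirm the Ready state first, then look for the URL."""
    command = ["lep", "endpoint", "status", "-n", endpoint_name, "--detail"]
    while True:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        lines = result.stdout.split("\n")
        # Only search for a URL once the State line reports Ready.
        ready = any(
            line.startswith("State") and "LeptonDeploymentState.Ready" in line
            for line in lines
        )
        if ready:
            for line in lines:
                url_match = re.search(r"https://[\w\d\.\-]+", line)
                if url_match:
                    print(f"Endpoint deployed! URL: {url_match[0]}")
                    return url_match[0]
        print(f"Waiting for endpoint {endpoint_name} to be ready...")
        time.sleep(interval)
```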
| "outputs": [], | ||
| "source": [ | ||
| "responses = await generate_subtopics(client, topic=topic, n_subtopics=n_subtopics)\n", | ||
| "nonreasoning_answer = re.sub(r'.*</think>', \"\", responses.choices[0].message.content, flags=re.DOTALL).strip()" |
style: The regex assumes reasoning is enclosed in a closing `</think>` tag. If the model formats its output differently (or emits no reasoning), this will strip nothing or strip incorrectly.
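A more defensive variant, sketched here with a hypothetical helper name, only strips when the closing tag is actually present:

```python
def strip_reasoning(content: str) -> str:
    """Hypothetical helper: strip the reasoning block only when </think> is present."""
    if "</think>" in content:
        # Keep only the text after the last closing tag.
        return content.rsplit("</think>", 1)[-1].strip()
    return content.strip()
```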
| "source": [ | ||
| "async def generate_response(client, question, sem=None):\n", | ||
| " prompt = RESPONSE_PROMPT_TEMPLATE.format(question=question)\n", | ||
| " async with sem or asyncio.Semaphore(1): \n", |
logic: `async with sem or asyncio.Semaphore(1)` creates a new semaphore on each call if `sem=None`. This defeats concurrency control; pass a real semaphore or use a single shared default instead.
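A sketch of the shared-semaphore pattern, assuming `RESPONSE_PROMPT_TEMPLATE`, `BASE_MODEL`, and `client` are defined in earlier notebook cells; the exact request parameters are illustrative, not the notebook's:

```python
import asyncio

# Hypothetical rewrite: the caller owns one semaphore shared by every task, so a
# fresh Semaphore can no longer be created per call when sem is None.
async def generate_response(client, question, sem: asyncio.Semaphore):
    prompt = RESPONSE_PROMPT_TEMPLATE.format(question=question)  # template from the notebook
    async with sem:
        return await client.chat.completions.create(
            model=BASE_MODEL,  # notebook configuration value
            messages=[{"role": "user", "content": prompt}],
        )

# Usage sketch: one shared semaphore caps concurrency across all questions.
# sem = asyncio.Semaphore(8)
# responses = await asyncio.gather(*(generate_response(client, q, sem) for q in question_list))
```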
| "source": [ | ||
| "question_response_pair_list = []\n", | ||
| "\n", | ||
| "for question, response_set in zip(question_list_formatted, question_response_list, strict=False):\n", |
style: `strict=False` in `zip` silently truncates if the lists differ in length. This may hide data-loss bugs if generation failed for some questions.
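A small illustration of the difference, using made-up data rather than the notebook's lists:

```python
question_list_formatted = ["Q1", "Q2", "Q3"]            # illustrative data only
question_response_list = [("a1", "b1"), ("a2", "b2")]   # one response pair is missing

# strict=True (Python 3.10+) raises ValueError on the length mismatch instead of
# silently dropping "Q3" the way strict=False would.
pairs = list(zip(question_list_formatted, question_response_list, strict=True))
```

Running this raises `ValueError` at the mismatch rather than producing a shorter dataset with no warning.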
| " except Exception as e:\n", | ||
| " print(f\"Attempt {attempt+1}/{n_retries} failed: {e}\")\n", |
style: A broad `except Exception` silently swallows all errors, including programming bugs unrelated to transient API failures. Consider catching specific exceptions or re-raising after logging.
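One alternative, sketched with a hypothetical helper and assuming the requests go through the `openai` v1 client (so `openai.APIError` is the relevant base class):

```python
import asyncio
import openai

# Hypothetical retry helper: catch only API errors worth retrying and re-raise the
# last one once attempts are exhausted, rather than swallowing every Exception.
async def call_with_retries(make_request, n_retries: int = 3, backoff: float = 2.0):
    last_err = None
    for attempt in range(n_retries):
        try:
            return await make_request()
        except openai.APIError as err:
            last_err = err
            print(f"Attempt {attempt + 1}/{n_retries} failed: {err}")
            await asyncio.sleep(backoff * (attempt + 1))
    raise last_err
```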
Create a Starter Kit which deploys an LLM as an Endpoint for synthetic data generation and sends sample requests to the deployed model.
Created on behalf of Anna Ollerenshaw at NVIDIA.