2 changes: 2 additions & 0 deletions .gitignore
@@ -198,6 +198,8 @@ cython_debug/
# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
# refer to https://docs.cursor.com/context/ignore-files
.*/Outputs_TTS/
Outputs_TTS/
.cursorignore
.cursorindexingignore

67 changes: 67 additions & 0 deletions examples/TTSwithVerification/MULTIPROCESS_README.md
@@ -0,0 +1,67 @@
# Multi-Process vLLM Setup for Best-of-K Baseline

This directory contains scripts and code for running the best-of-K baseline with multi-process vLLM serving.

## Setup

### 1. Start vLLM with 4 processes (2 GPUs each)

```bash
bash start_vllm_multiprocess.sh
```

This launches 4 vLLM OpenAI-compatible API servers:
- **Process 1**: GPUs 0-1, Port 8000
- **Process 2**: GPUs 2-3, Port 8001
- **Process 3**: GPUs 4-5, Port 8002
- **Process 4**: GPUs 6-7, Port 8003

Each process uses `tensor-parallel-size 2` for distributed inference.
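The launch pattern can be sketched as below. This is a hypothetical outline of what `start_vllm_multiprocess.sh` does, not its actual contents; the flags shown (`--tensor-parallel-size`, `--gpu-memory-utilization`, `--max-model-len`, `--port`) are standard vLLM OpenAI-server options, but the real script's variable names and defaults may differ.

```shell
#!/usr/bin/env bash
# Sketch: one OpenAI-compatible vLLM server per GPU pair.
MODEL="Qwen/QwQ-32B"
PORT_BASE=8000
GPU_PAIRS=("0,1" "2,3" "4,5" "6,7")

for i in "${!GPU_PAIRS[@]}"; do
  CUDA_VISIBLE_DEVICES="${GPU_PAIRS[$i]}" \
  python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL" \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.4 \
    --max-model-len 8192 \
    --port $((PORT_BASE + i)) &
done
wait  # keep the script alive while the servers run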

### 2. Run the baseline

In a separate terminal:

```bash
# Test with 1 example
python bestofk_baseline.py --task game24 --num_examples 1 --k 4 --use_critic

# Run on maze dataset
python bestofk_baseline.py --task maze --num_examples 10 --k 4

# Run on spatialmap dataset
python bestofk_baseline.py --task spatialmap --num_examples 5 --k 4
```

Or use the test script:
```bash
bash run_multiprocess_test.sh game24 5
```

## Load Balancing

- Requests are distributed **round-robin** across the 4 vLLM instances
- Each generation request goes to the next port in the cycle (8000 → 8001 → 8002 → 8003 → 8000 ...)
- Critic evaluation requests use a separate, independent round-robin counter
- Together this keeps load evenly distributed across all 4 GPU pairs
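The round-robin scheme above can be sketched in a few lines. This is an illustrative router, not the code in `bestofk_baseline.py`; the class and method names are hypothetical.

```python
from itertools import count

PORTS = [8000, 8001, 8002, 8003]


class RoundRobinRouter:
    """Assign each request to the next vLLM port in turn.

    Generation and critic requests keep independent counters, so the
    two request streams interleave evenly across the four servers.
    """

    def __init__(self, ports=PORTS):
        self.ports = ports
        self._gen = count()     # generation request counter
        self._critic = count()  # critic request counter

    def next_gen_port(self):
        return self.ports[next(self._gen) % len(self.ports)]

    def next_critic_port(self):
        return self.ports[next(self._critic) % len(self.ports)]
```

A caller would fetch a port per request, e.g. `f"http://localhost:{router.next_gen_port()}/v1"`, so no single server becomes a hotspot.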

## Stopping vLLM

```bash
pkill -9 -f "vllm.entrypoints.openai.api_server"
```

## Configuration

Edit `start_vllm_multiprocess.sh` to change:
- `MODEL`: Model name (default: `Qwen/QwQ-32B`)
- `MAX_TOKENS`: Maximum sequence length (default: 8192)
- `GPU_MEMORY`: GPU memory utilization (default: 0.4)
- `TENSOR_PARALLEL`: Must be ≤ 2 for this 8-GPU setup

## Benefits

- **Better throughput**: 4 independent processes handle requests in parallel
- **Fault tolerance**: If one process crashes, others continue
- **GPU utilization**: Balanced load across all 8 GPUs (2 GPUs per process)
- **Reduced latency**: Each process has dedicated GPU resources
41 changes: 41 additions & 0 deletions examples/TTSwithVerification/README.md
@@ -156,6 +156,39 @@ The Z3 solver handles diagonal directions (`Northwest`, `Northeast`, `Southwest`

---

# Best-of-K Baseline

A simple best-of-K baseline that generates K independent reasoning traces per example and selects the best based on:
1. **Ground-truth matching** (default): Greedy selection of first correct answer among K samples
2. **Critic model evaluation** (optional): Use a separate critic LLM to evaluate correctness without access to ground truth

This baseline demonstrates that, with sufficient sampling, even simple chain-of-thought (CoT) prompting can achieve strong performance.
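The two selection strategies reduce to a small function. This is an illustrative sketch, not the implementation in `bestofk_baseline.py`; `best_of_k`, `is_correct`, and `critic_score` are hypothetical names.

```python
def best_of_k(samples, is_correct=None, critic_score=None):
    """Select one answer from K independent samples.

    samples:      list of candidate answer strings.
    is_correct:   predicate using the ground truth -> greedy pick of
                  the first correct sample (the default strategy).
    critic_score: scoring function (e.g. a critic LLM's judgment),
                  used when ground truth is unavailable.
    """
    if is_correct is not None:
        for sample in samples:
            if is_correct(sample):
                return sample
        return samples[0]  # no correct sample: fall back to the first
    if critic_score is not None:
        return max(samples, key=critic_score)
    return samples[0]
```

With `--use_critic`, selection corresponds to the `critic_score` branch; otherwise the ground-truth branch applies.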

## Usage

```bash
# Best-of-K with ground-truth evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4 --use_critic --critic_model Qwen/Qwen3-30B-A3B-Thinking-2507 --critic_port 8001
```

### Parameters

| Argument | Description | Default |
|----------|-------------|---------|
| `--task` | Task: `game24`, `maze`, or `spatialmap` | required |
| `--k` | Number of samples per example | `4` |
| `--use_critic` | Use critic model for evaluation instead of ground truth | `False` |
| `--critic_model` | Model to use for critic evaluation | MAIN_MODEL |
| `--critic_port` | vLLM server port for critic model | `8001` |
| `--num_examples`, `-n` | Number of examples to run | varies |
| `--main_model` | Model for generation | `Qwen/Qwen3-30B-A3B-Thinking-2507` |
| `--port` | vLLM server port for main model | `8000` |

---

## Example Scripts

Each script runs a full evaluation: loading a dataset, building structured prompts, running inference with step verification, and computing accuracy/token statistics.
@@ -169,6 +202,14 @@ python ./examples/TTSwithVerification/maze_stepverifier.py -n 1

# SpatialMap with step verification
python ./examples/TTSwithVerification/spatialmap_stepverifier.py -n 1

# Best-of-K baseline (standard CoT, no monitors)
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task maze -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task spatialmap -n 1 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4 --use_critic
```

### Common arguments