2 changes: 2 additions & 0 deletions .gitignore
@@ -198,6 +198,8 @@ cython_debug/
# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
# refer to https://docs.cursor.com/context/ignore-files
.*/Outputs_TTS/
Outputs_TTS/
.cursorignore
.cursorindexingignore

67 changes: 67 additions & 0 deletions examples/TTSwithVerification/MULTIPROCESS_README.md
@@ -0,0 +1,67 @@
# Multi-Process vLLM Setup for Best-of-K Baseline

This directory contains scripts and code for running the best-of-K baseline with multi-process vLLM serving.

## Setup

### 1. Start vLLM with 4 processes (2 GPUs each)

```bash
bash start_vllm_multiprocess.sh
```

This launches 4 vLLM OpenAI-compatible API servers:
- **Process 1**: GPUs 0-1, Port 8000
- **Process 2**: GPUs 2-3, Port 8001
- **Process 3**: GPUs 4-5, Port 8002
- **Process 4**: GPUs 6-7, Port 8003

Each process uses `tensor-parallel-size 2` for distributed inference.
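The launch pattern can be sketched as below. This is a hypothetical outline of what `start_vllm_multiprocess.sh` does, not its actual contents; the flags shown (`--tensor-parallel-size`, `--gpu-memory-utilization`, `--max-model-len`, `--port`) are standard vLLM OpenAI-server options, but the real script's variable names and defaults may differ.

```shell
#!/usr/bin/env bash
# Sketch: one OpenAI-compatible vLLM server per GPU pair.
MODEL="Qwen/QwQ-32B"
PORT_BASE=8000
GPU_PAIRS=("0,1" "2,3" "4,5" "6,7")

for i in "${!GPU_PAIRS[@]}"; do
  CUDA_VISIBLE_DEVICES="${GPU_PAIRS[$i]}" \
  python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL" \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.4 \
    --max-model-len 8192 \
    --port $((PORT_BASE + i)) &
done
wait  # keep the script alive while the servers run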

### 2. Run the baseline

In a separate terminal:

```bash
# Test with 1 example
python bestofk_baseline.py --task game24 --num_examples 1 --k 4 --use_critic

# Run on maze dataset
python bestofk_baseline.py --task maze --num_examples 10 --k 4

# Run on spatialmap dataset
python bestofk_baseline.py --task spatialmap --num_examples 5 --k 4
```

Or use the test script:
```bash
bash run_multiprocess_test.sh game24 5
```

## Load Balancing

- Requests are distributed **round-robin** across the 4 vLLM instances
- Each generation request goes to the next port in the cycle (8000 → 8001 → 8002 → 8003 → 8000 ...)
- Critic evaluation requests use a separate, independent round-robin counter
- Together this keeps load evenly distributed across all 4 GPU pairs
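The round-robin scheme above can be sketched in a few lines. This is an illustrative router, not the code in `bestofk_baseline.py`; the class and method names are hypothetical.

```python
from itertools import count

PORTS = [8000, 8001, 8002, 8003]


class RoundRobinRouter:
    """Assign each request to the next vLLM port in turn.

    Generation and critic requests keep independent counters, so the
    two request streams interleave evenly across the four servers.
    """

    def __init__(self, ports=PORTS):
        self.ports = ports
        self._gen = count()     # generation request counter
        self._critic = count()  # critic request counter

    def next_gen_port(self):
        return self.ports[next(self._gen) % len(self.ports)]

    def next_critic_port(self):
        return self.ports[next(self._critic) % len(self.ports)]
```

A caller would fetch a port per request, e.g. `f"http://localhost:{router.next_gen_port()}/v1"`, so no single server becomes a hotspot.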

## Stopping vLLM

```bash
pkill -9 -f "vllm.entrypoints.openai.api_server"
```

## Configuration

Edit `start_vllm_multiprocess.sh` to change:
- `MODEL`: Model name (default: `Qwen/QwQ-32B`)
- `MAX_TOKENS`: Maximum sequence length (default: 8192)
- `GPU_MEMORY`: GPU memory utilization (default: 0.4)
- `TENSOR_PARALLEL`: Must be ≤ 2 for this 8-GPU setup

## Benefits

- **Better throughput**: 4 independent processes handle requests in parallel
- **Fault tolerance**: If one process crashes, others continue
- **GPU utilization**: Balanced load across all 8 GPUs (2 GPUs per process)
- **Reduced latency**: Each process has dedicated GPU resources
41 changes: 41 additions & 0 deletions examples/TTSwithVerification/README.md
@@ -156,6 +156,39 @@ The Z3 solver handles diagonal directions (`Northwest`, `Northeast`, `Southwest`

---

# Best-of-K Baseline

A simple best-of-K baseline that generates K independent reasoning traces per example and selects the best based on:
1. **Ground-truth matching** (default): Greedy selection of first correct answer among K samples
2. **Critic model evaluation** (optional): Use a separate critic LLM to evaluate correctness without access to ground truth

This baseline demonstrates that, with sufficient sampling, even simple chain-of-thought (CoT) prompting can achieve strong performance.
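The two selection strategies reduce to a small function. This is an illustrative sketch, not the implementation in `bestofk_baseline.py`; `best_of_k`, `is_correct`, and `critic_score` are hypothetical names.

```python
def best_of_k(samples, is_correct=None, critic_score=None):
    """Select one answer from K independent samples.

    samples:      list of candidate answer strings.
    is_correct:   predicate using the ground truth -> greedy pick of
                  the first correct sample (the default strategy).
    critic_score: scoring function (e.g. a critic LLM's judgment),
                  used when ground truth is unavailable.
    """
    if is_correct is not None:
        for sample in samples:
            if is_correct(sample):
                return sample
        return samples[0]  # no correct sample: fall back to the first
    if critic_score is not None:
        return max(samples, key=critic_score)
    return samples[0]
```

With `--use_critic`, selection corresponds to the `critic_score` branch; otherwise the ground-truth branch applies.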

## Usage

```bash
# Best-of-K with ground-truth evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4 --use_critic --critic_model Qwen/Qwen3-30B-A3B-Thinking-2507 --critic_port 8001
```

### Parameters

| Argument | Description | Default |
|----------|-------------|---------|
| `--task` | Task: `game24`, `maze`, or `spatialmap` | required |
| `--k` | Number of samples per example | `4` |
| `--use_critic` | Use critic model for evaluation instead of ground truth | `False` |
| `--critic_model` | Model to use for critic evaluation | MAIN_MODEL |
| `--critic_port` | vLLM server port for critic model | `8001` |
| `--num_examples`, `-n` | Number of examples to run | varies |
| `--main_model` | Model for generation | `Qwen/Qwen3-30B-A3B-Thinking-2507` |
| `--port` | vLLM server port for main model | `8000` |

---

## Example Scripts

Each script runs a full evaluation: loading a dataset, building structured prompts, running inference with step verification, and computing accuracy/token statistics.
@@ -169,6 +202,14 @@ python ./examples/TTSwithVerification/maze_stepverifier.py -n 1

# SpatialMap with step verification
python ./examples/TTSwithVerification/spatialmap_stepverifier.py -n 1

# Best-of-K baseline (standard CoT, no monitors)
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task maze -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task spatialmap -n 1 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4 --use_critic
```

### Common arguments