Source code for Pareto Optimal Code Generation: an implementation for training and evaluating code verification systems built on transformer-based outcome reward models (ORMs) with staged verification. The codebase provides tools for:
- Training and evaluating code verification models
- Scoring code solutions using various methods (binary logit, classification, reward modeling)
- Processing and executing code across multiple benchmark datasets
- Comprehensive evaluation metrics and analysis
```
src/
├── evaluation/              # Evaluation suite and benchmarks
│   ├── configs.py           # Evaluation configurations
│   ├── evaluator.py         # Core evaluation logic
│   ├── execution.py         # Code execution handling
│   ├── filter_functions.py  # Solution filtering
│   └── prompts.py           # Prompt templates and handling
├── modeling.py              # Model architectures and configurations
├── preprocessing.py         # Data preparation and formatting
├── scoring.py               # Solution scoring implementations
├── metrics.py               # Evaluation metrics
├── utils.py                 # Utility functions
└── training/                # Training pipeline components
```
`modeling.py` provides:
- Model configuration and initialization
- Support for multiple transformer architectures (GPT-2, Pythia, Qwen, etc.)
- Precision and dropout handling
- Model loading and setup utilities
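A minimal sketch of the loading pattern, assuming a HuggingFace backbone; the function name and signature here are illustrative, not the exact API of `modeling.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_verifier(name: str = "EleutherAI/pythia-160m",
                  dtype: torch.dtype = torch.bfloat16,
                  dropout: float = 0.0):
    """Load a causal LM backbone for verification (illustrative helper).

    Precision and dropout are the two knobs highlighted above; everything
    else is left to HuggingFace defaults.
    """
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype)
    # Zero out (or rescale) dropout for deterministic scoring at eval time.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = dropout
    return model, tokenizer
```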
`scoring.py` implements multiple scoring methods for code verification:
- Binary logit scoring over sequences
- Classification-based scoring
- Reward model scoring
- Log probability scoring
- Single token scoring variants
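As a sketch of the single-token binary logit variant: compare the model's next-token logits for a pass choice versus a fail choice (e.g. `[Yes]` / `[No]`, as used in the custom-scoring example below). The function here is an illustration, not the repository's implementation, and it assumes each choice string maps to a single token:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def binary_logit_score(model, tokenizer, prompt: str,
                       pass_str: str = "[Yes]", fail_str: str = "[No]") -> float:
    """Score a (problem, solution) prompt by P(pass) vs. P(fail).

    Assumes each choice string is a single tokenizer token; multi-token
    choices would instead sum log-probs over the choice tokens.
    """
    pass_id = tokenizer.encode(pass_str, add_special_tokens=False)[0]
    fail_id = tokenizer.encode(fail_str, add_special_tokens=False)[0]
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]                 # next-token logits
    pair = torch.stack([logits[pass_id], logits[fail_id]])
    return F.softmax(pair, dim=0)[0].item()                # normalized P(pass)
```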
`preprocessing.py` handles data preparation:
- Formats problems and solutions for model input
- Customizable templates using Jinja2
- Support for in-context learning examples
- Configurable delimiters and formatting options
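The templates are plain Jinja2, so formatting can be previewed standalone. A small illustration (the template strings mirror the `PreprocessorConfig` fields in the usage example below):

```python
from jinja2 import Template

problem_tmpl = Template("# Question\n{{ problem }}")
program_tmpl = Template("# Solution\n{{ program }}")

# Render a (problem, solution) pair into a single verification prompt.
prompt = "\n\n".join([
    problem_tmpl.render(problem="Return the sum of a list of integers."),
    program_tmpl.render(program="def solve(xs):\n    return sum(xs)"),
])
print(prompt)
```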
The evaluation suite (`src/evaluation/`) is a comprehensive system supporting:
- Multiple benchmark datasets (HumanEval, MBPP, GSM8K, CodeContests)
- Code execution and testing
- Solution filtering and deduplication
- Metric computation and analysis
- Configurable evaluation pipelines
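Execution isolation follows the usual subprocess-with-timeout pattern. The sketch below is an assumption about the general approach, not the code in `execution.py`; a hardened setup would add containers or resource limits:

```python
import subprocess
import sys

def run_with_timeout(program: str, test: str, timeout: float = 30.0) -> bool:
    """Run a candidate program plus its tests in a fresh interpreter.

    Returns True iff the process exits cleanly within the timeout.
    A subprocess gives crash isolation, but it is NOT a full sandbox.
    """
    source = program + "\n\n" + test
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```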
`src/training/` provides pipeline components for:
- Model training and fine-tuning
- Data handling and batching
- Training configuration
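For orientation, an ORM is typically trained with binary pass/fail supervision from executed solutions. The step below is a hypothetical sketch of that objective (the `rewards` head and batch layout are assumptions, not the interfaces in `src/training/`):

```python
import torch.nn.functional as F

def orm_training_step(model, batch, optimizer):
    """One outcome-supervised step (hypothetical interfaces).

    batch["input_ids"]: tokenized (problem, solution) pairs
    batch["labels"]:    1.0 if the solution passed its tests, else 0.0
    The model is assumed to expose a scalar reward logit per sequence.
    """
    rewards = model(batch["input_ids"]).rewards.squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(rewards, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```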
Key features:

- Flexible Scoring Methods
  - Multiple scoring approaches for different verification strategies
  - Configurable scoring parameters and thresholds
  - Support for both sequence-level and token-level scoring
- Comprehensive Evaluation
  - Support for major code evaluation benchmarks
  - Safe code execution environment
  - Detailed metrics and analysis tools
  - Configurable evaluation suites
- Efficient Processing
  - Solution deduplication (sketched below)
  - Batched processing
  - Parallel execution support
  - Memory-efficient handling of large datasets
- Model Support
  - Compatible with HuggingFace Transformers
  - Support for multiple model architectures
  - Configurable precision and performance options
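The deduplication mentioned above can be done syntactically, e.g. by comparing normalized ASTs so that comment and whitespace variants collapse to one candidate. A self-contained sketch (not the repository's filter implementation):

```python
import ast

def dedup_solutions(solutions: list[str]) -> list[str]:
    """Deduplicate candidate programs by normalized AST dump.

    Programs that differ only in comments or whitespace collapse to a
    single entry; unparseable programs fall back to exact-string identity.
    """
    seen, unique = set(), []
    for src in solutions:
        try:
            key = ast.dump(ast.parse(src))
        except SyntaxError:
            key = src
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```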
The system is designed for ML researchers and engineers working on code verification. Key usage patterns include:
- Model Evaluation
```python
from src.evaluation import evaluate_suite, EvalSuite
from src.modeling import ModelConfig
from src.scoring import ScoringConfig

# Configure evaluation
eval_config = EvalSuite(
    tasks=["humaneval", "mbpp"],
    num_workers=4,
    timeout=30,
)

# Set up model and scoring
model_config = ModelConfig(name="pythia-160m")
scoring_config = ScoringConfig(scoring_method="binary_logit")

# Run evaluation
results = evaluate_suite(
    model_config=model_config,
    scoring_config=scoring_config,
    eval_suite=eval_config,
)
```
- Custom Scoring
```python
from src.scoring import load_scoring_method
from src.preprocessing import PreprocessorConfig, Preprocessor

# Configure preprocessing
preproc_config = PreprocessorConfig(
    problem="# Question\n{{ problem }}",
    program="# Solution\n{{ program }}",
    outcome="{{ outcome }}",
)

# Initialize scoring
scoring_fn = load_scoring_method(
    scoring_cfg=scoring_config,
    tokenizer=tokenizer,
    pass_choice_str="[Yes]",
    fail_choice_str="[No]",
)

# Score solutions
scores = scoring_fn(dataset, model, max_tokens_per_batch=1024)
```
- Evaluation Pipeline
```python
from src.evaluation import Evaluator
from src.evaluation.execution import process_and_execute_raw_preds

# Initialize evaluator
evaluator = Evaluator(
    seed=42,
    scoring_cfg=scoring_config,
    preprocessor_cfg=preproc_config,
    num_workers=4,
    max_tokens_per_batch=1024,
)

# Run evaluation
timings, scores = evaluator(
    dataset=dataset,
    model=model,
    tokenizer=tokenizer,
    suite=eval_config,
)
```
The system provides comprehensive evaluation metrics, including:
- Pass@k scores
- Ranking metrics
- Execution timing statistics
- Solution quality metrics
- Filtering effectiveness measures
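For reference, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021), 1 - C(n-c, k) / C(n, k) for n samples of which c pass; whether `metrics.py` uses exactly this form is an assumption:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples, c of which pass, with k drawn."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```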
Results are saved in a structured format for further analysis and comparison.