Adaptive Immune Receptor Repertoire Sequence Simulator
Generate realistic BCR & TCR repertoires with full ground-truth annotations in Python.
Benchmarking sequence aligners, studying somatic hypermutation, or training ML models on immune repertoires requires large, perfectly annotated datasets rather than noisy snippets of real sequencing data.
GenAIRR is a plug-and-play, fully extensible simulation engine that produces realistic immunoglobulin and TCR sequences while giving you complete ground-truth labels for every position, mutation, and gene segment.
| Category | Highlights |
|---|---|
| Realistic Simulation | Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling |
| Composable Pipelines | Chain together built-in & custom steps into simulation pipelines |
| Multi-Chain Support | Heavy chain, kappa/lambda light chains, and TCR-beta out of the box |
| Research-ready Output | Full ground-truth annotations, JSON/pandas export, deterministic seeds |
| Docs & Tutorials | Step-by-step guides, Jupyter notebooks, API reference |
# Python >= 3.9
pip install GenAIRR

from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F
result = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25))
print(result.sequence)

CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGGGACCCTGTCCCTCACCTGCGCTG...
Generate multiple sequences at once:
results = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25), n=100)

For complete control over the simulation, use the Pipeline API:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)
sim = pipeline.execute()
print(sim.get_dict())

{
    'sequence': 'CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCG...',
    'v_call': ['IGHVF3-G8*04'],
    'd_call': ['IGHD6-6*01'],
    'j_call': ['IGHJ4*02'],
    'productive': True,
    'mutation_rate': 0.0027,
    'mutations': {142: 'T>C'},
    'v_sequence_start': 0,
    'v_sequence_end': 293,
    'd_sequence_start': 298,
    'd_sequence_end': 316,
    'j_sequence_start': 323,
    'j_sequence_end': 367,
    # ... and more fields
}

Every output includes the full sequence, V/D/J gene calls, mutation positions, region boundaries, and quality metrics, ready for downstream analysis.
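As a rough illustration of how these ground-truth fields can be used downstream, the sketch below cuts a record into its V, NP1, D, NP2, and J parts and undoes the listed point mutations. It relies on two assumptions that are not confirmed here: coordinates are 0-based and end-exclusive (Python slice semantics), and a mutation string such as `'T>C'` reads as "germline base > observed base". The helper names `slice_segments` and `revert_mutations` and the toy record are hypothetical and are not part of GenAIRR.

```python
def slice_segments(record):
    """Cut the sequence into V, NP1, D, NP2, and J parts.

    Assumption: coordinates are 0-based and end-exclusive (Python slices).
    """
    seq = record["sequence"]
    v_start, v_end = record["v_sequence_start"], record["v_sequence_end"]
    d_start, d_end = record["d_sequence_start"], record["d_sequence_end"]
    j_start, j_end = record["j_sequence_start"], record["j_sequence_end"]
    return {
        "v": seq[v_start:v_end],
        "np1": seq[v_end:d_start],   # junction between V and D
        "d": seq[d_start:d_end],
        "np2": seq[d_end:j_start],   # junction between D and J
        "j": seq[j_start:j_end],
    }


def revert_mutations(record):
    """Undo point mutations.

    Assumption: 'X>Y' means germline base X was replaced by observed base Y.
    """
    bases = list(record["sequence"])
    for position, change in record["mutations"].items():
        germline_base, observed_base = change.split(">")
        if bases[position] != observed_base:
            raise ValueError(f"position {position} does not hold {observed_base!r}")
        bases[position] = germline_base
    return "".join(bases)


# Hand-made toy record with the same keys as the example output
# (not real GenAIRR output; the sequence is far shorter than a real one).
toy_record = {
    "sequence": "AAATAAAAAA" + "GG" + "CCCC" + "T" + "AGAGAG",
    "mutations": {3: "A>T"},
    "v_sequence_start": 0, "v_sequence_end": 10,
    "d_sequence_start": 12, "d_sequence_end": 16,
    "j_sequence_start": 17, "j_sequence_end": 23,
}

segments = slice_segments(toy_record)
unmutated = revert_mutations(toy_record)
print(segments)
print(unmutated)
```

Under the same slicing assumption, the example record above (V ending at 293, D starting at 298) would have a 5-nucleotide NP1 region. Check the API reference for the actual coordinate convention before relying on this.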
A production-ready pipeline that simulates sequences with biological corrections and sequencing artifacts:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        # Core: generate sequence with somatic hypermutation
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        # Correct ground-truth positions after trimming ambiguities
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        # Calculate final mutation rate
        steps.DistillMutationRate(),
        # Simulate sequencing artifacts
        steps.CorruptSequenceBeginning(),  # 5' end degradation
        steps.EnforceSequenceLength(),     # read-length limit
        steps.InsertNs(),                  # ambiguous base calls
        steps.ShortDValidation(),          # D-region QC
        steps.InsertIndels(),              # sequencing indels
    ]
)
result = pipeline.execute()

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, Uniform
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    # Minimum and maximum mutation rate of 0: a naive, unmutated sequence
    steps=[steps.SimulateSequence(Uniform(0, 0), productive=True)]
)
naive_seq = pipeline.execute()

from GenAIRR import Pipeline, steps, HUMAN_IGK_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,  # kappa light chain (no D segment)
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.DistillMutationRate(),
    ]
)

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(
            S5F(0.003, 0.25),
            productive=True,
            # Pin specific V, D, and J alleles taken from the config
            specific_v=HUMAN_IGH_OGRDB.v_alleles['IGHVF1-G1'][0],
            specific_d=HUMAN_IGH_OGRDB.d_alleles['IGHD1-1'][0],
            specific_j=HUMAN_IGH_OGRDB.j_alleles['IGHJ1'][0]
        )
    ]
)

import pandas as pd
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)
# Generate 1000 sequences as a DataFrame
df = pd.DataFrame([pipeline.execute().get_dict() for _ in range(1000)])
df.to_csv('simulated_repertoire.csv', index=False)

| Model | Description | When to use |
|---|---|---|
| S5F | Context-dependent somatic hypermutation based on empirical 5-mer frequencies | Realistic antibody maturation studies |
| Uniform | Uniform random mutations | Baselines, ablation experiments |
| Custom | Implement BaseMutationModel | Your own evolutionary scenarios |
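To make the difference between the first two rows concrete, here is a toy sketch of context-dependent versus uniform targeting. It is not GenAIRR's implementation and does not use the `BaseMutationModel` interface. The mutability weights in `TOY_MUTABILITY` are invented for illustration and are not real S5F values, and the helper names `position_weights` and `mutate` are hypothetical. Substitutions are drawn uniformly here, whereas a full 5-mer model would typically also use a context-specific substitution profile.

```python
import random

# Invented mutability weights for two 5-mers; NOT real S5F values.
TOY_MUTABILITY = {"AAGCT": 5.0, "TAGCA": 4.0}
DEFAULT_WEIGHT = 1.0


def position_weights(seq):
    """Weight each position by the 5-mer centred on it."""
    weights = []
    for i in range(len(seq)):
        if 2 <= i < len(seq) - 2:
            weights.append(TOY_MUTABILITY.get(seq[i - 2:i + 3], DEFAULT_WEIGHT))
        else:
            weights.append(DEFAULT_WEIGHT)  # edges lack a full 5-mer
    return weights


def mutate(seq, n_mutations, rng, context_aware=True):
    """Apply n point mutations; return the new sequence and {position: 'old>new'}."""
    if n_mutations > len(seq):
        raise ValueError("cannot mutate more positions than the sequence has")
    bases = list(seq)
    mutations = {}
    for _ in range(n_mutations):
        if context_aware:
            # Recompute weights because earlier mutations change the context.
            weights = position_weights("".join(bases))
        else:
            weights = [1.0] * len(bases)
        # Never hit the same position twice, so each entry maps germline > mutated.
        weights = [0.0 if i in mutations else w for i, w in enumerate(weights)]
        position = rng.choices(range(len(bases)), weights=weights, k=1)[0]
        old = bases[position]
        new = rng.choice([b for b in "ACGT" if b != old])
        bases[position] = new
        mutations[position] = f"{old}>{new}"
    return "".join(bases), mutations


germline = "CCAAGCTCCGGTAGCATT"
mutated, mutations = mutate(germline, 3, random.Random(42))
print(mutated, mutations)
```

Passing `context_aware=False` gives every position the same weight, which corresponds to the uniform baseline in the table.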
from GenAIRR import S5F, Uniform
# Realistic context-aware SHM
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
# Simple uniform mutations
uniform = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)

| Config | Chain | Source |
|---|---|---|
| HUMAN_IGH_OGRDB | Heavy chain (BCR) | OGRDB |
| HUMAN_IGH_EXTENDED | Heavy chain extended | OGRDB |
| HUMAN_IGK_OGRDB | Kappa light chain | OGRDB |
| HUMAN_IGL_OGRDB | Lambda light chain | OGRDB |
| HUMAN_TCRB_IMGT | TCR-beta | IMGT |
from GenAIRR import set_seed, get_seed, reset_seed
set_seed(42) # deterministic results
print(get_seed()) # check current seed
reset_seed() # back to random

- Getting Started: Overview and first pipeline
- Step-by-Step Tutorial: Build a pipeline from scratch
- API Reference: All classes, parameters, and defaults
- Migration Guide: Upgrading from older versions
- Biological Context: What biological processes are simulated
- Selection-aware mutation model
- Additional germline databases
- Sphinx auto-generated API docs from docstrings
See open issues. Feel something's missing? Open a feature request.
Contributions are welcome! Please read our contributing guide and check the good first issue label.
If GenAIRR helps your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G.
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.
https://doi.org/10.1093/bib/bbae556
PMID: 39489605 | PMCID: PMC11531861
Distributed under the GPL-3.0 License. See LICENSE for details.
GenAIRR is inspired by and builds upon work from the immunoinformatics community — especially AIRRship.