ssa-scRNA: Semi-Supervised Annotation of scRNA-seq Data

Overview

ssa-scRNA is a Python package for semi-supervised cell type annotation of scRNA-seq data. The pipeline combines weak-labeling strategies (QCQ, Otsu, graph-based, Dirichlet Process) with supervised label propagation (KNN, Random Forest, Centroid) and consensus voting to assign cell type labels from marker genes.

Intended Use

This package is designed for annotating scRNA-seq datasets when you have prior knowledge of marker genes for target cell types. It is not intended for unsupervised clustering or novel cell type discovery (yet!). Instead, it provides a robust framework for leveraging known biology to generate high-confidence annotations.

Key characteristics:

Combines multiple weak labeling strategies for robust predictions
Propagates labels from seed assignments to unlabeled cells
Evaluates method agreement using Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI)
Runs efficiently on standard hardware, taking advantage of parallel processing where possible

Requirements and Limitations:

Requires prior knowledge of marker genes for target cell types
Annotation accuracy depends directly on marker gene quality
Limited to cell types with defined markers (no automated discovery)
Works best with well-separated cell types in the data

Installation

The package requires Python ≥ 3.10. Core dependencies (scanpy, scikit-learn, pandas, numpy) are automatically installed.

Quick Start

Basic Usage

import scanpy as sc
import ssa_scrna as ssa

# Load your scRNA-seq data
adata = sc.read_h5ad("data/pbmc.h5ad")

# Define marker genes for cell types of interest
markers = {
    "CD8_T": ["CD8A", "CD8B", "GZMA"],
    "CD4_T": ["CD4", "IL7R", "TCF7"],
    "B_cells": ["CD19", "MS4A1", "CD79A"],
    "Dendritic": ["ITGAX", "CD1C", "FCER1A"],
    # ... more cell types
}

# Phase 1: Generate seed labels from multiple strategies (no consensus)
seed_strategies = {
    "seeds_qcq": ssa.strategies.QCQAdaptiveThresholding(markers=markers),
    "seeds_otsu": ssa.strategies.OtsuAdaptiveThresholding(markers=markers),
    "seeds_graph": ssa.strategies.GraphScorePropagation(markers=markers),
}
ssa.tl.label(adata, strategies=seed_strategies, n_jobs=4)

# Phase 2: Propagate labels independently from each seed source
# This allows different propagation methods to learn from different seed qualities
propagation_methods = ["knn", "rf"]  # KNN and Random Forest
all_propagated_labels = []

for seed_name in seed_strategies.keys():
    propagators = {
        f"prop_{method}_{seed_name.split('_')[1]}": (
            ssa.strategies.KNNPropagation(seed_key=seed_name)
            if method == "knn"
            else ssa.strategies.RandomForestPropagation(seed_key=seed_name)
        )
        for method in propagation_methods
    }
    ssa.tl.label(adata, strategies=propagators, n_jobs=2)
    all_propagated_labels.extend(propagators.keys())

# Phase 3: Final consensus across all propagated predictions
final_consensus = ssa.strategies.ConsensusVoting(
    keys=all_propagated_labels,
    majority_fraction=0.66
)
ssa.tl.label(adata, strategies=final_consensus, key_added="labels_final")

# Results are stored in adata.obs
print(adata.obs["labels_final"].value_counts())

Working with Individual Strategies

# Apply a single strategy
strategy = ssa.strategies.QCQAdaptiveThresholding(markers=markers, quota=50)
result = ssa.tl.label(adata, strategies=strategy, key_added="my_labels")

# Access the result
print(f"Assigned labels: {result['my_labels'].labels.value_counts()}")

# Check confidence scores if available
if "my_labels_max_confidence" in adata.obs:
    print(f"Mean confidence: {adata.obs['my_labels_max_confidence'].mean():.2f}")

Batch Processing with Async

import asyncio

async def label_multiple():
    strategies = [
        ssa.strategies.QCQAdaptiveThresholding(markers),
        ssa.strategies.OtsuAdaptiveThresholding(markers),
    ]
    
    results = await asyncio.gather(
        *[ssa.tl.label_async(adata, s) for s in strategies]
    )
    return results

# results = asyncio.run(label_multiple())

Example: PBMC3k Dataset

A complete end-to-end pipeline is provided in examples/pbmc3k/run.py:

# Run the PBMC3k example
uv run ssa-examples pbmc3k

This demonstrates:

QC filtering with data-driven thresholds
Feature preprocessing (HVGs, PCA, neighbors)
Phase 1 seeding with 4 independent strategies (no early consensus)
Phase 2 propagation from each seed independently (3 × 4 combinations)
Baseline clustering (Leiden at multiple resolutions)
Visualization (UMAP, t-SNE)
Quantitative ablation metrics (ARI, NMI)

Architecture

Strategy Pattern

All labeling methods inherit from BaseLabelingStrategy and implement:

class MyStrategy(BaseLabelingStrategy):
    def __init__(self, markers, **kwargs):
        self.markers = markers
    
    @property
    def name(self) -> str:
        return "my_strategy"
    
    def execute_on(self, adata: AnnData) -> LabelingResult:
        # Implement labeling logic
        labels = self.predict(adata)
        
        return LabelingResult(
            adata=adata,
            strategy=self,
            labels=labels,
            obs={"confidence": confidence_scores},  # Optional
            uns={"parameters": self.__dict__},      # Optional
        )

Data Flow

Raw scRNA-seq data
    ↓
QC Filtering (genes, UMI, mitochondrial content)
    ↓
Normalization (log1p)
    ↓
Preprocessing (as needed: HVG selection, PCA, neighbors, etc.)
    ↓
┌──────────────────────────────────────────┐
│   PHASE 1: Weak Labeling Strategies      │
│  (QCQ, Otsu, Graph, DPMM, etc.)          │
│  → Generate independent seed labels      │
└──────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────┐
│   PHASE 2: Independent Propagation       │
│  Each propagator (KNN, RF, Centroid)     │
│  learns from EACH seed independently     │
│  → Generates: prop_<method>_<seed>       │
└──────────────────────────────────────────┘
    ↓
Final Consensus Voting (across all Phase 2 outputs)
    ↓
Quantitative Evaluation (ARI, NMI)

Design rationale:

Phase 1 generates independent seeds from diverse methods
Phase 2 propagates from each seed independently, allowing diverse approaches to coexist
Final consensus aggregates all Phase 2 predictions for robust classification

Implementation

Phase 0: Marker Gene Specification

Define markers as a dictionary mapping cell type names to lists of genes:

markers = {
    "T_cells": ["CD3D", "CD3E", "CD3G"],
    "B_cells": ["CD19", "MS4A1", "CD79A"],
    "Monocytes": ["LYZ", "S100A8", "S100A9"],
    "NK_cells": ["GNLY", "NKG7", "GZMB"],
}

Guidelines:

Use 3-5 robust marker genes per cell type for reliability
Prefer genes with high expression in target cell type
Prefer genes with low expression in other cell types
Test markers on reference data before large-scale analysis

Phase 1: Initial Labeling

Use any (or many!) of the following strategies to generate independent seed labels. Each is propagated independently in Phase 2:

Strategy	Class	Description
QCQ Adaptive Thresholding	`QCQAdaptiveThresholding`	Quantile-based marker scoring with data-driven thresholds
Otsu Adaptive Thresholding	`OtsuAdaptiveThresholding`	Automatic threshold selection via Otsu's method
Graph Score Propagation	`GraphScorePropagation`	Network-based marker co-expression scoring
Dirichlet Process Labeling	`DirichletProcessLabeling`	Bayesian mixture model with automatic component selection and probabilistic confidence bounds

Common Parameters:

markers (dict): Cell type → marker gene list mapping
unknown_label (str, default "unknown"): Label for unlabeled cells

Phase 2: Label Propagation Strategies

Propagate seed labels to unlabeled cells using supervised learning:

Strategy	Class	Description
KNN Propagation	`KNNPropagation`	k-Nearest neighbor classification
Random Forest Propagation	`RandomForestPropagation`	Ensemble-based classification
Nearest Centroid Propagation	`NearestCentroidPropagation`	Centroid-based assignment

Common Parameters:

seed_key (str): Column in adata.obs containing seed labels
obsm_key (str, default "X_pca"): Feature representation for classification
unknown_label (str, default "unknown"): Label for unlabeled cells
keep_seeds (bool, default True): Preserve original seed labels

Consensus Strategy

Obtain a final consensus label by combining multiple strategies with majority voting:

Strategy	Class	Description
Consensus Voting	`ConsensusVoting`	Combine multiple predictions via majority voting

Parameters:

keys (list[str]): Column names to combine
majority_fraction (float, default 0.66): Fraction of votes required (0.51 to 1.0)
unknown_label (str, default "unknown"): Label for unlabeled cells

Example Usage:

# Combine propagation outputs from all seeds at the final stage
consensus = ssa.strategies.ConsensusVoting(
    keys=["prop_knn_qcq", "prop_knn_otsu", "prop_rf_qcq", "prop_rf_otsu"],
    majority_fraction=0.66  # Supermajority
)
ssa.tl.label(adata, strategies=consensus, key_added="labels_final")

Using `majority_fraction` for Consensus Voting

Controls how many independent strategies must agree to assign a label:

0.51: Simple majority (loose, more cells labeled)
0.66: Supermajority (balanced, recommended)
1.00: Unanimous (strict, fewer cells labeled, higher confidence)

Example:

# Strict consensus (all methods must agree)
consensus = ssa.strategies.ConsensusVoting(
    keys=["seeds_qcq", "seeds_otsu", "seeds_graph"],
    majority_fraction=1.0
)

# Loose consensus (2 out of 3)
consensus = ssa.strategies.ConsensusVoting(
    keys=["seeds_qcq", "seeds_otsu", "seeds_graph"],
    majority_fraction=0.51
)

Output Format

Labeling results are stored in adata.obs with the following convention:

├── {key}                         # Main labels ("Unknown" = unlabeled)
├── {key}_max_confidence          # Maximum voting score (if available)
├── {key}_is_confident            # Boolean confidence flag (optional)
└── {key}_params                  # Strategy parameters in adata.uns

Example output structure for Phase 1 seeding:

adata.obs columns:
├── seeds_qcq              # QCQ strategy labels
├── seeds_otsu             # Otsu strategy labels
├── seeds_graph            # Graph strategy labels
├── seeds_dpmm             # DPMM strategy labels
├── seeds_dpmm_max_confidence    # DPMM confidence score
├── seeds_dpmm_is_confident      # DPMM confidence boolean
# Note: seeds_consensus is no longer generated; seeds are propagated independently

Performance Considerations

QCQ and Otsu strategies run in seconds to minutes depending on data size
Graph-based methods may take longer due to network construction
DPMM is computationally intensive; consider subsampling for large datasets. Parallel processing is available for DPMM using the n_jobs parameter.
Multiple strategies can be run in parallel using label_async and asyncio.gather for efficient batch processing

Troubleshooting

"No labeled cells found in the seed column"

Phase 2 propagation requires at least some seeds from Phase 1
Check that Phase 1 resulted in labeled cells (not all "unknown")
Try looser marker gene definitions or majority_fraction=0.51

"Key 'X_pca' not found in adata.obsm"

Ensure PCA was computed before running propagation strategies
Run: sc.tl.pca(adata) and sc.pp.neighbors(adata)

Few cells labeled in Phase 1

Markers might be missing from your dataset
Check gene names match annotation (case-sensitive)
Markers might be too specific; try more permissive thresholds

Inconsistent labels between methods

This is expected! Different methods have different biases
Use consensus voting to increase confidence
Check majority_fraction parameter

FAQ

Q: Can I use gene expression data from different platforms (10x, Smart-seq)?
A: Yes! Ensure proper normalization before running (sc.pp.normalize_total + sc.pp.log1p)

Q: What if I have confidence scores for true cell types?
A: Filter your labeled data to high-confidence cells before using as seed labels

Q: Can I use this for novel cell type discovery?
A: Not at the moment. This package is designed for annotation with known markers, not unsupervised clustering. Future versions may include treating "unknown" cells as a separate cluster for discovery, using a hierarchical approach.

Q: How many marker genes do I need?
A: 3-5 robust markers per cell type is a good starting point. More markers reduce false positives.

Q: Can I combine predictions with a reference atlas?
A: Not built-in, but is possible using a custom strategy.

Authorship and Acknowledgments

This package was developed as part of the Master of Statistics (M.Stat.) project at the Indian Statistical Institute in partial fulfillment of curriculum requirements. See CONTRIBUTORS.md for detailed author and contributor information, including ORCIDs.

I am immensely grateful to my advisors, Prof. Raghunath Chatterjee and Dr. Jayant Jha, for their guidance and support throughout this project. I also thank Dr. Snehalika Lall for her valuable insights, experience and infrastructure support, without which this work would not have been possible. Finally, I acknowledge the open-source community and the developers of the libraries used in this project for their contributions to scientific software.

Licensing and Citation

BSD 3-Clause License - See LICENSE file for details.

Please cite this package appropriately if you use it in your research. A citation file will be provided upon publication.

For questions, issues, or feature requests, please open an issue on GitHub.

Last Updated: 14 February 2026

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
docs		docs
examples/pbmc3k		examples/pbmc3k
src		src
tests		tests
.gitignore		.gitignore
.python-version		.python-version
CONTRIBUTORS.md		CONTRIBUTORS.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

ssa-scRNA: Semi-Supervised Annotation of scRNA-seq Data

Overview

Intended Use

Installation

Quick Start

Basic Usage

Working with Individual Strategies

Batch Processing with Async

Example: PBMC3k Dataset

Architecture

Strategy Pattern

Data Flow

Implementation

Phase 0: Marker Gene Specification

Phase 1: Initial Labeling

Phase 2: Label Propagation Strategies

Consensus Strategy

Using majority_fraction for Consensus Voting

Output Format

Performance Considerations

Troubleshooting

"No labeled cells found in the seed column"

"Key 'X_pca' not found in adata.obsm"

Few cells labeled in Phase 1

Inconsistent labels between methods

FAQ

Authorship and Acknowledgments

Licensing and Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Using `majority_fraction` for Consensus Voting

Packages