This is the official codebase for TOMIC, a Transformer-based domain separation network (DSN) for identifying organ-specific metastasis-initiating cells (MICs) among primary tumor cells.
Metastasis-initiating cells (MICs) are a rare subpopulation of tumor cells capable of seeding and establishing secondary tumors in distant organs. Identifying organ-specific MICs is critical for understanding metastatic tropism and patient prognosis, and such cells represent potential targets for therapeutic intervention. However, precise identification of MICs remains challenging due to cellular heterogeneity within primary tumors and the complex interplay between tumor cells and the microenvironments of distinct metastatic organs.
TOMIC addresses this challenge by leveraging paired primary tumor and metastatic multi-organ cells, where metastatic tumor cells serve as the labeled source domain and primary tumor single cells serve as the unlabeled target domain. The model aligns the source and target domains through shared feature extraction while separating domain-specific features, enabling an organ-specific classifier trained on the source domain to accurately predict organ-specific MICs in the primary tumor cells.
Figure 1: Overview of TOMIC approach. a. Input data includes scRNA-seq gene expression profiles from distant metastatic cells across multiple organs and paired primary tumor cells. b. Ranked gene name-based tokenization strategy, in which genes are ordered by within-cell expression magnitude to form a deterministic token sequence. c. Architecture of the Transformer-based domain separation network. d. The transformer encoder used in the TOMIC approach.
- Domain Separation Network (DSN): Separates domain-shared and domain-private features to enable effective domain adaptation
- Multiple Model Architectures: Supports MLP, Patch Transformer, Expression Transformer, and Name Transformer models
- Ranked Gene Name-based Tokenization: Genes are ordered by within-cell expression magnitude to form a deterministic token sequence
- Comprehensive Evaluation: Evaluation on both synthetic datasets with gold-standard labels and real paired metastasis datasets with silver-standard labels
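The ranked gene name-based tokenization can be sketched in a few lines. This is a minimal illustration, not TOMIC's exact implementation; the function name and the choice of keeping the top `seq_len` genes are assumptions for the example:

```python
def rank_tokenize(gene_names, expression, seq_len):
    """Order genes by within-cell expression (descending) and keep the top
    seq_len gene names as the cell's token sequence. Ties break by original
    gene order, so the sequence is deterministic."""
    ranked = sorted(range(len(gene_names)), key=lambda i: (-expression[i], i))
    return [gene_names[i] for i in ranked[:seq_len]]

# Toy cell: 5 genes, keep the 3 most highly expressed as tokens
genes = ["TP53", "MYC", "KRAS", "EGFR", "BRCA1"]
expr = [0.2, 3.1, 0.0, 1.7, 0.9]
print(rank_tokenize(genes, expr, 3))  # ['MYC', 'EGFR', 'BRCA1']
```

Because the tokens are gene *names* rather than raw values, two cells with similar expression rankings produce similar token sequences regardless of absolute expression scale.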
TOMIC works with Python >= 3.12 and CUDA >= 12.8. Please make sure you have the correct version of Python and CUDA installed.
- Clone the repository:
```shell
git clone https://github.com/Foursheeps/TOMIC.git
cd TOMIC
```

- Create and activate the conda environment:

```shell
conda env create -f environment.yml
conda activate tomic
```

Python environment:
- Python 3.12 (tested with Python 3.12.11)
- CUDA 12.8 (tested with CUDA 12.8)
- Conda or Miniconda
- PyTorch 2.8.0
- PyTorch Lightning 2.5.5
- transformers 4.57.0
- scanpy 1.11.4
- anndata 0.12.2
- scikit-learn 1.7.2
- numpy 2.2.6
- pandas 2.3.3
- scipy 1.16.2
- flash-attn 2.8.3 (optional, recommended for faster training)
R environment (for bioinformatics analysis):
- R >= 3.6.1
- Required R packages: Seurat, dplyr, ggplot2, readr, purrr, future
To install R packages:
```r
install.packages(c("dplyr", "ggplot2", "readr", "purrr", "future"))
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Seurat")
```

We evaluate TOMIC on one synthetic dataset and four real-world single-cell RNA sequencing datasets covering different cancer types and metastatic patterns.
- SYNC (Synthetic): Contains 240,000 cells with 14,000 genes, simulating metastasis to four different organs (liver, lung, stomach, and peritoneum). Highly variable genes: 1,200.
| Dataset | Primary Organ | Metastatic Organ | Total Cells | Total Genes | Highly Variable Genes | Links |
|---|---|---|---|---|---|---|
| GSE249057 | Esophageal | Lung | 13,378 | 33,538 | 2,000 | NCBI GEO |
| GSE173958 | Pancreatic | Peritoneal, Liver, Lung | 29,734 | 31,054 | 1,200 | NCBI GEO |
| GSE123902 | Lung | Brain, Adrenal, Bone | 51,910 | 22,854 | 1,200 | NCBI GEO |
| GSE163558 | Gastric | Liver, Peritoneum, Ovary, Lymph node | 36,425 | 33,538 | 1,200 | NCBI GEO |
The synthetic dataset simulates multi-organ metastasis scenarios, while real-world datasets represent different cancer types with various metastatic patterns. Highly variable genes are selected for dimensionality reduction and feature extraction.
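Highly-variable-gene selection is typically done with scanpy's `highly_variable_genes` in a pipeline like this one; as a dependency-free illustration of the core idea, here is a minimal sketch that keeps the top-k genes by cross-cell variance (the function and data are hypothetical, and real HVG selection uses dispersion normalization rather than raw variance):

```python
from statistics import pvariance

def top_variable_genes(matrix, gene_names, k):
    """matrix: cells x genes (list of lists). Return the k gene names with
    the highest expression variance across cells -- a simplified stand-in
    for scanpy-style highly-variable-gene selection."""
    n_genes = len(gene_names)
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    order = sorted(range(n_genes), key=lambda g: (-variances[g], g))
    return [gene_names[g] for g in order[:k]]

cells = [
    [1.0, 5.0, 0.0],
    [1.1, 0.0, 0.1],
    [0.9, 9.0, 0.0],
]
print(top_variable_genes(cells, ["A", "B", "C"], 2))  # ['B', 'A']
```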
- Synthetic data generation scripts and data preprocessing pipeline: `process_data/create_syncdata.py`
- Processed datasets: available for download from Google Drive
Option 1: Download processed data (recommended)
Processed datasets are available from Google Drive. Simply download and extract the data to your desired location.
Option 2: Process raw data from scratch
If you want to process the data yourself:
1. Download raw data: raw datasets are available from NCBI GEO (links above).
2. Process raw data: use the scripts in `process_data/`:

   ```shell
   python process_data/create_syncdata.py
   python process_data/process_GSE173958.py
   python process_data/process_GSE123902.py
   python process_data/process_GSE163558.py
   ```
We provide a minimal example script (example_train.py) that demonstrates how to train TOMIC models with a simple Python script. This is the easiest way to get started.
Quick start:
1. Edit `example_train.py` to configure your paths:

   ```python
   data_path = Path("/path/to/your/data")
   output_dir = Path("/path/to/output")
   ```

2. Modify the model type and training parameters as needed:

   ```python
   model_type = "name"  # Choose: "mlp", "patch", "expr", or "name"
   ```

3. Run the script:

   ```shell
   python example_train.py
   ```
The script will:
- Load data configuration from `info_config.json`
- Train a TOMIC model using DSN (Domain Separation Network)
- Save checkpoints and logs automatically
- Test the trained model and display results
Features:
- Supports all model types: MLP, Patch, Expression, and Name Transformer
- Supports DSN training method and standard supervised learning
- Includes detailed comments and documentation
- Automatically saves test results to JSON file
For more details, see the comments in example_train.py.
We provide convenient one-click training scripts in the scripts/ directory that handle both data processing and training automatically. These scripts are pre-configured for specific datasets and can run multiple training methods (DSN, DANN, ADDA, and standard supervised learning) sequentially.
Available scripts:
- Real datasets:
  - `GSE173958_M1_1200.sh` - GSE173958 dataset with 1200 highly variable genes
  - `GSE123902_1200.sh` - GSE123902 dataset with 1200 highly variable genes
  - `GSE163558_1200.sh` - GSE163558 dataset with 1200 highly variable genes
  - `GSE249057_2000.sh` - GSE249057 dataset with 2000 highly variable genes
- Synthetic datasets:
  - `C120000G14000H1200S10C1.sh` - synthetic dataset configuration 1
  - `C120000G14000H1200S10C2.sh` - synthetic dataset configuration 2
Usage:
1. Edit the script to configure paths and parameters:
   - Set `PYTHON` to your Python interpreter
   - Set `DATA_PATH` to your data directory
   - Set `RAW_DATA_DIR` if processing raw data
   - Configure training parameters (batch size, learning rate, etc.)
   - Choose which training methods to run (`RUN_DSN`, `RUN_USUAL`)
2. Run the script:

   ```shell
   bash scripts/GSE173958_M1_1200.sh
   ```

These scripts will:
- Process raw data (if needed) or use existing processed data
- Train multiple model architectures (MLP, Patch, Expression, Name)
- Run DSN domain adaptation method and standard supervised learning
- Save checkpoints and logs automatically
The main training script for DSN models is `train_val_scripts/main_dsn.py`. You can use the provided shell script template:

```shell
# Edit run_dsn_template.sh to set your paths
bash train_val_scripts/run_dsn_template.sh
```

Or run directly with Python:
```shell
python train_val_scripts/main_dsn.py \
    --train_models "['mlp', 'patch', 'expr', 'name']" \
    --data_path /path/to/your/data \
    --default_root_dir /path/to/output \
    --run_training 1 \
    --run_testing 1 \
    --devices 2 \
    --train_batch_size 256 \
    --max_epochs 80 \
    --patience 10 \
    --patch_size 40 \
    --bingings "[None, 50]"
```

Model types:

- mlp: Multi-layer perceptron baseline
- patch: Patch-based Transformer model
- expr: Expression-based Transformer model
- name: Gene name-based Transformer model (ranked tokenization)
For standard supervised learning without domain adaptation:
- Standard supervised: `train_val_scripts/main_usual.py` / `train_val_scripts/run_usual_template.sh`
To test a trained model:
```shell
python train_val_scripts/main_dsn.py \
    --data_path /path/to/your/data \
    --default_root_dir /path/to/output \
    --run_training 0 \
    --run_testing 1 \
    --checkpoint_path /path/to/checkpoint.ckpt \
    --train_models "['name']"
```

After training and prediction, R scripts are provided in the `R/` directory for downstream bioinformatics analysis:
R/DEG_analysis.R performs differential expression analysis on predicted MICs:
- Evaluates prediction accuracy on source domain (metastatic cells)
- Performs DEG analysis for each predicted organ-specific MIC class
- Generates volcano plots for visualization
- Outputs DEG results for each class
Usage:
```r
# Edit paths in the script
source("R/DEG_analysis.R")
```

Requirements:
- R >= 3.6.1
- Seurat
- dplyr
- ggplot2
- readr
- purrr
R/GSE249057_integration.R performs single-cell data integration and analysis:
- Reads and integrates multiple timepoint samples
- Quality control and filtering
- Normalization and batch correction using Seurat integration
- Dimensionality reduction (PCA, UMAP)
- Clustering analysis
- Time-course analysis
Usage:
```r
# Edit paths in the script
source("R/GSE249057_integration.R")
```

Requirements:
- R >= 3.6.1
- Seurat
- ggplot2
- dplyr
- future
```
TOMIC/
├── tomic/                        # Main package
│   ├── dataset/                  # Data loading and preprocessing
│   │   ├── abc.py                # Base dataset classes
│   │   ├── dataconfig.py         # Data configuration
│   │   ├── dataset4da.py         # Domain adaptation datasets
│   │   ├── dataset4common.py     # Common datasets
│   │   └── preprocessing.py      # Data preprocessing utilities
│   ├── model/                    # Model architectures
│   │   ├── dsn/                  # Domain Separation Network models
│   │   ├── usual/                # Standard supervised learning models
│   │   └── encoder_decoder/      # Encoder-decoder architectures
│   ├── train/                    # Training scripts
│   │   ├── dsn/                  # DSN training configuration
│   │   └── usual/                # Standard training configuration
│   ├── utils.py                  # Utility functions for metrics computation
│   └── logger.py                 # Logging utilities
├── example_train.py              # Minimal example training script
├── train_val_scripts/            # Main training scripts
│   ├── main_dsn.py               # DSN training entry point
│   ├── main_usual.py             # Standard training entry point
│   └── run_*_template.sh         # Shell script templates
├── scripts/                      # One-click training scripts
│   ├── GSE173958_M1_1200.sh      # GSE173958 dataset training script
│   ├── GSE123902_1200.sh         # GSE123902 dataset training script
│   ├── GSE163558_1200.sh         # GSE163558 dataset training script
│   ├── GSE249057_2000.sh         # GSE249057 dataset training script
│   └── C*.sh                     # Synthetic dataset training scripts
├── process_data/                 # Data processing scripts
│   ├── process_GSE173958.py      # Process GSE173958 dataset
│   ├── process_GSE123902.py      # Process GSE123902 dataset
│   ├── process_GSE163558.py      # Process GSE163558 dataset
│   └── create_syncdata.py        # Create synthetic data
├── R/                            # Bioinformatics analysis scripts
│   ├── DEG_analysis.R            # Differential expression gene analysis
│   └── GSE249057_integration.R   # Single-cell data integration analysis
├── tests/                        # Unit tests
├── assests/                      # Figures and assets
├── environment.yml               # Conda environment configuration
└── pyproject.toml                # Project configuration
```
TOMIC implements a Domain Separation Network with the following components:
- Shared Encoder: Extracts domain-invariant features from both source and target domains
- Private Encoders: Extract domain-specific features for source and target domains separately
- Reconstructor: Reconstructs original input from combined shared and private features
- Classifier: Organ-specific classifier trained on source domain features
- Domain Discriminator: Distinguishes between source and target domains (for DANN loss)
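Following the original DSN formulation, these components are typically trained with a combined objective; the weights $\alpha$, $\beta$, $\gamma$ are hyperparameters, and the exact weighting used by TOMIC may differ:

$$
\mathcal{L} = \mathcal{L}_{\text{class}} + \alpha\,\mathcal{L}_{\text{recon}} + \beta\,\mathcal{L}_{\text{diff}} + \gamma\,\mathcal{L}_{\text{sim}}
$$

Here $\mathcal{L}_{\text{class}}$ is the organ classification loss on the labeled source domain, $\mathcal{L}_{\text{recon}}$ is the reconstruction loss from combined shared and private features, $\mathcal{L}_{\text{diff}}$ encourages orthogonality between shared and private representations, and $\mathcal{L}_{\text{sim}}$ (the adversarial DANN loss) pushes the shared features of the two domains to be indistinguishable.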
The model uses Transformer encoders with different tokenization strategies:
- Name-based: Genes ordered by expression magnitude
- Patch-based: Expression values divided into patches
- Expression-based: Direct expression value encoding
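As an illustration of the patch-based strategy, a per-cell expression vector can be split into fixed-size chunks, each of which becomes one token. The helper below is a hypothetical sketch (the actual patch embedding is learned inside the model); the patch size of 40 matches the `--patch_size 40` shown in the training command above:

```python
def to_patches(expression, patch_size):
    """Split a per-cell expression vector into fixed-size patches; each
    patch becomes one token fed to the Transformer encoder."""
    if len(expression) % patch_size != 0:
        raise ValueError("expression length must be divisible by patch_size")
    return [expression[i:i + patch_size]
            for i in range(0, len(expression), patch_size)]

# 1,200 highly variable genes with patch size 40 -> 30 tokens per cell
patches = to_patches([0.0] * 1200, patch_size=40)
print(len(patches), len(patches[0]))  # 30 40
```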
The data directory should contain an info_config.json file with the following structure:
```json
{
  "class_map": {
    "Liver": 0,
    "Lung": 1,
    "Met": 2
  },
  "seq_len": 1200,
  "num_classes": 3,
  "raw_data_path": "/path/to/raw_data"
}
```

The model reports the following metrics:
- Accuracy: Overall classification accuracy
- AUC: Area under the ROC curve (binary) or macro-averaged AUC (multi-class)
- F1 Score: Macro, micro, and weighted F1 scores
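In practice these metrics come from scikit-learn via the helpers in `tomic/utils.py`; as a dependency-free sketch of what accuracy and macro F1 compute, consider this toy example (the function names here are illustrative, not TOMIC's API):

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted organ label matches the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (absent classes count as 0),
    so rare organ classes weigh as much as common ones."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return sum(f1s) / n_classes

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy(y_true, y_pred), 3))   # 0.667
print(round(macro_f1(y_true, y_pred, 3), 3))  # 0.656
```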
To assess domain adaptation effectiveness across all datasets, we examine the convergence of the DANN (Domain Adversarial Neural Network) loss. The DANN loss converges towards the theoretical optimal value of $\ln 2 \approx 0.693$ on all four real-world datasets:
- GSE249057: Esophageal cancer with temporal progression
- GSE173958_M1: Pancreatic cancer with multi-organ metastasis
- GSE163558: Gastric cancer with metastasis to liver, peritoneum, ovary, and lymph nodes
- GSE123902: Lung cancer with metastasis to brain, adrenal, and bone
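The $\ln 2$ optimum follows from the binary cross-entropy of a maximally confused domain discriminator: when the shared features carry no domain information, the discriminator's best strategy is to output $p = 1/2$ for every cell, giving

$$
\mathcal{L}_{\text{DANN}} = -\tfrac{1}{2}\ln\tfrac{1}{2} - \tfrac{1}{2}\ln\tfrac{1}{2} = \ln 2 \approx 0.693.
$$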
The convergence to $\ln 2$ indicates that the domain discriminator can no longer distinguish source (metastatic) from target (primary) cells, i.e., the shared encoder has learned domain-invariant features.
Figure 2: DANN loss convergence curves for the Transformer[GeneName] model across different real-world datasets. The horizontal dashed line at $\ln 2 \approx 0.693$ marks the theoretical optimum.
We sincerely thank the authors of the following open-source projects:
- DSN - Domain Separation Network
- PyTorch and PyTorch Lightning - Deep learning framework
- scanpy and anndata - Single-cell data analysis
- transformers - Transformer models and utilities
- scikit-learn - Machine learning utilities
- Seurat - Single-cell RNA-seq analysis toolkit (R)
For questions, issues, or contributions, please:
- Open an issue on the GitHub repository
- Contact the repository owner: luoyang@stu.xidian.edu.cn
We welcome feedback, bug reports, and contributions!

