Fast cell lineage tree reconstruction and genotype calling for large single cell DNA sequencing data
Software accompaniment for:
Here is the conference version:
ScisTree2: An Improved Method for Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single Cell Data, presented at RECOMB 2025.
and preprint version:
This is an enhanced version of ScisTree:
If you find this work helpful, please consider citing our Genome Research paper.
Note: If you have copy-number data and want to infer a cell lineage tree from both SNVs and CNAs, please refer to ScisTreeCNA.
You can refer to our documentation for more details, or simply follow the instructions below.
To use ScisTree2, you will need the following tools and libraries installed:
Python and pip: Python 3.8 or higher.
*We have successfully tested it on Linux, macOS, and Windows.
1. Install from PyPI:
- Upgrade pip:
  python -m pip install --upgrade pip
- Install scistree2:
  pip install scistree2
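After installing, it can be handy to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the only thing taken from the instructions above is the distribution name `scistree2`.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "scistree2") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("scistree2 is not installed in this environment")
else:
    print(f"scistree2 {version} is installed")
```

If this reports that the package is missing, check that pip and python refer to the same environment, for example by installing with `python -m pip install scistree2`.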
You will need g++ and make if you are using Linux or macOS, or the Microsoft C++ Build Tools if you are on Windows, in order to compile the C++ code.
- Clone the repository:
  git clone https://github.com/yufengwudcs/ScisTree2.git
  cd ScisTree2
- Install the Python package (includes C++ backend compilation): You can install the scistree2 package using pip:
  pip install .
  This command will also automatically compile the C++ backend. Once built, the executable binary file can be found in scistree2/bin.
  *We recommend that users create a virtual environment using either conda or venv to comply with PEP 668.
- (Optional) Manual C++ backend build (for testing/development): If you want to build or test the C++ backend (scistree), you can navigate to the src directory and compile it:
  - Linux/macOS:
    cd src
    make
    # You can then test it directly, e.g., ./scistree example_input.txt
  - Windows:
    cd src
    nmake /f Makefile.win
    # You can then test it directly, e.g., scistree.exe example_input.txt
  This step is not required for the Python package installation if using pip install . as described above.
ScisTree2 offers both Python and C++ interfaces. We recommend using the Python version because it provides a wider variety of supported input formats and evaluation tools, and it is more easily integrated into the broader Python ecosystem.
A detailed tutorial on how to use ScisTree2 in Python is available as a Jupyter Notebook in the tutorials/ directory:
Or you can try it easily on Google Colab:
The tutorial covers:
- Getting started with ScisTree2.
- Running inference with probabilistic genotype matrices (CSV supported).
- Running inference with raw read data (CSV supported).
- Running inference with a VCF file.
- Visualizing trees.
- Evaluating results using various metrics.
- Bootstrapping for branch (clade) confidence estimates (added September 27, 2025).
The example data used in the tutorial can be found in the tutorials/data/ directory.
To run ScisTree2 directly from the console, please refer to step 3 in the installation guide above.
The executable is called scistree.
Check that ScisTree2 is ready to run by typing ./scistree; you should see some output describing the basic usage of ScisTree2.
Now type ./scistree example_input.txt, and you should see the following output:
*** SCISTREE ver. 2.2.3.0, August 14, 2025 ***
#cells: 5, #sites: 6
List of cell names: c1 c2 c3 c4 c5
Called genotypes output to file: example_input.txt.genos.imp
**** Maximum log-likelihood: -6.27126, number of changed genotypes: 2
Computed log-lielihood from changed genotypes: -6.27126
Constructed single cell phylogeny: (((c1,c3),(c2,c4)),c5)
Elapsed time = 0 seconds.
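To make the reported log-likelihood and the number of changed genotypes easier to interpret, here is a small sketch that scores the reported tree against the values in example_input.txt. It rests on assumptions that are not stated in this README: each site mutates once on a single branch, so the mutant cells at a site form one clade of the tree, and a genotype counts as "changed" when it differs from the call obtained by thresholding P(wild type) at 0.5. The helper names are hypothetical and this is not ScisTree2's own code.

```python
import math

# Values copied from example_input.txt: P(genotype = 0) per site (row) and cell (column).
CELLS = ["c1", "c2", "c3", "c4", "c5"]
PROB_WILD_TYPE = {
    "s1": [0.01, 0.6, 0.08, 0.8, 0.7],
    "s2": [0.8, 0.02, 0.7, 0.01, 0.3],
    "s3": [0.02, 0.8, 0.02, 0.8, 0.9],
    "s4": [0.9, 0.9, 0.8, 0.8, 0.02],
    "s5": [0.01, 0.8, 0.01, 0.8, 0.9],
    "s6": [0.05, 0.02, 0.7, 0.05, 0.9],
}
TREE = "(((c1,c3),(c2,c4)),c5)"  # phylogeny reported in the output above


def clades_of(newick):
    """Return the leaf set below every node of a Newick string without branch lengths."""
    clades, stack, name = [], [], ""
    for ch in newick.strip().rstrip(";"):
        if ch in "(),":
            if name:  # a leaf name has just ended
                clades.append(frozenset([name]))
                stack[-1].add(name)
                name = ""
            if ch == "(":
                stack.append(set())
            elif ch == ")":
                done = stack.pop()
                clades.append(frozenset(done))
                if stack:
                    stack[-1] |= done
        elif not ch.isspace():
            name += ch
    return clades


def best_placement(probs, clades):
    """Pick the clade of mutant cells that maximizes the site's log-likelihood."""
    def loglik(mutated):
        return sum(
            math.log(1.0 - p) if cell in mutated else math.log(p)
            for cell, p in zip(CELLS, probs)
        )
    return max(((loglik(c), c) for c in clades), key=lambda pair: pair[0])


def score_tree(newick):
    """Sum the best per-site log-likelihoods and count genotypes that flip."""
    clades = clades_of(newick)
    total, changed = 0.0, 0
    for probs in PROB_WILD_TYPE.values():
        site_loglik, mutated = best_placement(probs, clades)
        total += site_loglik
        # Assumed definition of "changed": differs from the 0.5-threshold call.
        changed += sum((p < 0.5) != (cell in mutated) for cell, p in zip(CELLS, probs))
    return total, changed


total, changed = score_tree(TREE)
print(f"log-likelihood: {total:.5f}, changed genotypes: {changed}")
```

On this example the sketch arrives at the same two numbers as the program output above: cell c5 at site s2 and cell c3 at site s6 are the two entries whose thresholded call does not fit any clade of the tree.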
Options:
- -e: Output a mutation tree (which may not be a binary tree) with branch labels from the called genotypes.
- -e0: Output a mutation tree without branch labels, which is useful for visualizing large trees.
- -q: Use NNI (Nearest Neighbor Interchange) for local tree search. NNI is faster but less accurate. By default, ScisTree2 uses SPR (Subtree Pruning and Regrafting) local search, which we have found to be very fast.
- -T <num-of-threads>: Specify the number of threads for multi-threading support.
- -s <num-of-iterations>: Set the maximum number of iterations to control the running time. A smaller number (e.g., 5) will reduce the running time but may also reduce accuracy. Default: 1,000 iterations.
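When calling the executable from a script, it may help to assemble the argument list programmatically. The sketch below only uses the flags listed above; the function name `build_scistree_command`, its parameter names, and the choice to place options before the input file are assumptions rather than documented behavior.

```python
import shlex
from typing import List, Optional


def build_scistree_command(
    input_path: str,
    executable: str = "./scistree",
    threads: Optional[int] = None,
    max_iterations: Optional[int] = None,
    use_nni: bool = False,
    mutation_tree: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list (assumption: options go before the input file)."""
    cmd = [executable]
    if mutation_tree == "labeled":
        cmd.append("-e")       # mutation tree with branch labels
    elif mutation_tree == "unlabeled":
        cmd.append("-e0")      # mutation tree without branch labels
    elif mutation_tree is not None:
        raise ValueError("mutation_tree must be 'labeled', 'unlabeled', or None")
    if use_nni:
        cmd.append("-q")       # NNI instead of the default SPR search
    if threads is not None:
        cmd += ["-T", str(threads)]
    if max_iterations is not None:
        cmd += ["-s", str(max_iterations)]
    cmd.append(input_path)
    return cmd


cmd = build_scistree_command("example_input.txt", threads=4, max_iterations=5)
print(shlex.join(cmd))
# To actually run it (requires the compiled binary in the working directory):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when file names contain spaces.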
You may also read the ScisTree2 User Manual, which is in PDF format and is distributed as part of ScisTree2.
First, you should understand some basics about ScisTree2. I recommend reading the user manual of the original ScisTree.
The first step in using ScisTree2 is to prepare the input. Here is the content of an example (example_input.txt):
c1 c2 c3 c4 c5
s1 0.01 0.6 0.08 0.8 0.7
s2 0.8 0.02 0.7 0.01 0.3
s3 0.02 0.8 0.02 0.8 0.9
s4 0.9 0.9 0.8 0.8 0.02
s5 0.01 0.8 0.01 0.8 0.9
s6 0.05 0.02 0.7 0.05 0.9
Explanations:
- You should specify the cell names in the first row, for example "c1 c2 c3 c4 c5". Do not use HAPLOID or HAPLOTYPES as cell names; these two words are reserved keywords in ScisTree2.
- Each following row starts with the row (site) identifier, followed by the probability of each of the five cells having genotype zero (wild type) at that site. For example, the second row says that at the first site, the first cell has probability 0.01 of being wild type, the second cell has probability 0.6 of being wild type, and so on.
Be careful: the rows are for the SNV sites and the columns are for the cells. Don't get this wrong.
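As a quick sanity check before running ScisTree2, the layout described above can be read and validated with a few lines of Python. This is a minimal sketch, not part of ScisTree2: the function names `read_probability_matrix` and `naive_calls` are hypothetical, and the 0.5 threshold used for the per-entry call is an assumption chosen for illustration.

```python
import io

EXAMPLE = """\
c1 c2 c3 c4 c5
s1 0.01 0.6 0.08 0.8 0.7
s2 0.8 0.02 0.7 0.01 0.3
s3 0.02 0.8 0.02 0.8 0.9
s4 0.9 0.9 0.8 0.8 0.02
s5 0.01 0.8 0.01 0.8 0.9
s6 0.05 0.02 0.7 0.05 0.9
"""

RESERVED = {"HAPLOID", "HAPLOTYPES"}  # keywords the README says to avoid as cell names


def read_probability_matrix(handle):
    """Parse a header of cell names followed by one row of P(wild type) per site."""
    lines = [line.split() for line in handle if line.strip()]
    cells, rows = lines[0], lines[1:]
    if RESERVED & set(cells):
        raise ValueError("HAPLOID and HAPLOTYPES cannot be used as cell names")
    sites, matrix = [], []
    for fields in rows:
        site, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(cells):
            raise ValueError(f"site {site}: expected {len(cells)} values, got {len(values)}")
        if any(not 0.0 <= v <= 1.0 for v in values):
            raise ValueError(f"site {site}: probabilities must lie in [0, 1]")
        sites.append(site)
        matrix.append(values)
    return cells, sites, matrix


def naive_calls(matrix, threshold=0.5):
    """Call 1 (mutant) where P(wild type) is below the threshold, else 0 (wild type)."""
    return [[int(p < threshold) for p in row] for row in matrix]


cells, sites, matrix = read_probability_matrix(io.StringIO(EXAMPLE))
for site, row in zip(sites, naive_calls(matrix)):
    print(site, *row)
```

The per-entry calls printed here ignore the tree entirely; they are only a baseline against which the genotypes called by ScisTree2 can be compared.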
ScisTree2 is essentially a faster and also somewhat more accurate ScisTree. Some features from the original ScisTree (version 1) are not supported in the current implementation of ScisTree2. These include: (i) ternary data input: ScisTree2 only supports binary data as of now; (ii) parameter imputation and doublet imputation. I haven't had a chance to upgrade these features. For the moment, ScisTree2 is dedicated to cell lineage tree inference.
The main change is about speed and accuracy. ScisTree2 is an order of magnitude faster than ScisTree. ScisTree2 supports multi-threading while ScisTree doesn't. More importantly, ScisTree2 implements faster and also possibly more accurate tree search algorithms. By default, ScisTree2 performs the subtree prune and regraft (SPR) local search, while ScisTree performs nearest neighbor interchange (NNI) search. The SPR local search is usually more accurate than the NNI search. Our tests show that ScisTree2 can infer a cell lineage tree from data with 10,000 cells (and, say, 10,000 single nucleotide variant, or SNV, sites) while being more accurate in both cell lineage tree inference and genotype calling.
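One way to see why SPR tends to find better trees than NNI, and why making it fast matters, is to compare how many candidate trees a single move can reach. The sketch below uses the standard counts for unrooted binary trees with n leaves; cell lineage trees are rooted, so the exact numbers differ slightly, and how ScisTree2 actually explores these neighborhoods is not described here.

```python
def nni_neighbors(n_leaves: int) -> int:
    """Number of NNI neighbors of an unrooted binary tree with n leaves: 2(n - 3)."""
    return 2 * (n_leaves - 3)


def spr_neighbors(n_leaves: int) -> int:
    """Number of SPR neighbors of an unrooted binary tree with n leaves: 2(n - 3)(2n - 7)."""
    return 2 * (n_leaves - 3) * (2 * n_leaves - 7)


for n in (5, 100, 10_000):
    print(f"n={n}: NNI={nni_neighbors(n):,}  SPR={spr_neighbors(n):,}")
```

The NNI neighborhood grows linearly with the number of cells while the SPR neighborhood grows quadratically, so each SPR step considers far more alternatives at a correspondingly higher cost per step.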
All simulated data, experimental data (HGSOC), and scripts used to reproduce the results in the ScisTree2 paper are released at Zenodo.
If you have questions or run into problems, please post an issue in the GitHub repository.
