yufengwudcs/ScisTree2

Fast cell lineage tree reconstruction and genotype calling for large single cell DNA sequencing data

Software accompaniment for:

"ScisTree2 enables large-scale inference of cell lineage trees and genotype calling using efficient local search", Haotian Zhang, Yiming Zhang, Teng Gao and Yufeng Wu, Genome Research, in press, 2025.

Here is the conference version:

ScisTree2: An Improved Method for Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single Cell Data, presented at RECOMB 2025.

and the preprint version:

Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single-Cell Data Using Efficient Local Search, bioRxiv, 2025.

This is an enhanced version of ScisTree:

Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach, Bioinformatics, 2020.

If you find this work helpful, please consider citing our Genome Research paper.

Note: If you have copy-number data and want to infer a cell lineage tree from both SNVs and CNAs, please refer to ScisTreeCNA.

Documentation

You can refer to our documentation for more details, or simply follow the instructions below.

Required Tools

To use ScisTree2, you will need the following tools and libraries installed:

  • Python (version 3.8 or higher) and pip.

*We have successfully tested it on Linux, macOS, and Windows.

Installation

1. Install from PyPI:

  1. Upgrade pip:
    python -m pip install --upgrade pip
  2. Install scistree2:
    pip install scistree2

2. Install from source:

You will need g++ and make if you are using Linux or macOS, or the Microsoft C++ Build Tools if you are on Windows, in order to compile the C++ code.

  1. Clone the repository:

    git clone https://github.com/yufengwudcs/ScisTree2.git
    cd ScisTree2
  2. Install the Python package (includes C++ backend compilation): You can install the scistree2 package using pip:

    pip install .

    This command will also automatically compile the C++ backend. Once built, the executable binary file can be found in scistree2/bin.

    *We recommend that users create a virtual environment using either conda or venv to comply with PEP 668.

  3. (Optional) Manual C++ backend build (for testing/development): If you want to build or test the C++ backend (scistree), you can navigate to the src directory and compile it:

    • Linux/macOS:
      cd src
      make
      # You can then test it directly, e.g., ./scistree example_input.txt
    • Windows:
      cd src
      nmake /f Makefile.win
      # You can then test it directly, e.g., scistree.exe example_input.txt

    This step is not required for the Python package installation if using pip install . as described above.

Tutorial

ScisTree2 offers both Python and C++ interfaces. We recommend using the Python version because it provides a wider variety of supported input formats and evaluation tools, and it is more easily integrated into the broader Python ecosystem.

Using ScisTree2 in Python

A detailed tutorial on how to use ScisTree2 in Python is available as a Jupyter Notebook in the tutorials/ directory.

You can also try it on Google Colab.

The tutorial covers:

  • Getting started with ScisTree2.
  • Running inference with probabilistic genotype matrices (CSV supported).
  • Running inference with raw read data (CSV supported).
  • Running inference with a VCF file.
  • Visualizing trees.
  • Evaluating results using various metrics.
  • Bootstrapping for branch (clade) confidence estimates (added September 27, 2025).

The example data used in the tutorial can be found in the tutorials/data/ directory.

Using ScisTree2 in C++

To run ScisTree2 directly from the console, please refer to step 3 in the installation guide above.

The executable is called scistree. To check that ScisTree2 is ready to run, type ./scistree. You should see a short description of the basic usage of ScisTree2.

Now type ./scistree example_input.txt. You should see the following output:

*** SCISTREE ver. 2.2.3.0, August 14, 2025 ***

#cells: 5, #sites: 6
List of cell names: c1 c2 c3 c4 c5 
Called genotypes output to file: example_input.txt.genos.imp
**** Maximum log-likelihood: -6.27126, number of changed genotypes: 2
Computed log-lielihood from changed genotypes: -6.27126
Constructed single cell phylogeny: (((c1,c3),(c2,c4)),c5)
Elapsed time = 0 seconds.
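
If you call the executable from a script, you may want to pull the key numbers out of this console text. The snippet below is a minimal sketch, not part of ScisTree2: the function name parse_scistree_output is hypothetical, and the regular expressions assume the output keeps exactly the line labels shown in the sample above.

```python
import re

# Console text copied verbatim from the sample run above.
SAMPLE_OUTPUT = """\
*** SCISTREE ver. 2.2.3.0, August 14, 2025 ***

#cells: 5, #sites: 6
List of cell names: c1 c2 c3 c4 c5
Called genotypes output to file: example_input.txt.genos.imp
**** Maximum log-likelihood: -6.27126, number of changed genotypes: 2
Computed log-lielihood from changed genotypes: -6.27126
Constructed single cell phylogeny: (((c1,c3),(c2,c4)),c5)
Elapsed time = 0 seconds.
"""


def parse_scistree_output(text):
    """Extract summary fields from the console text (hypothetical helper)."""
    patterns = {
        "n_cells": (r"#cells:\s*(\d+)", int),
        "n_sites": (r"#sites:\s*(\d+)", int),
        "log_likelihood": (r"Maximum log-likelihood:\s*(-?\d+(?:\.\d+)?)", float),
        "n_changed": (r"number of changed genotypes:\s*(\d+)", int),
        "genotype_file": (r"Called genotypes output to file:\s*(\S+)", str),
        "newick": (r"Constructed single cell phylogeny:\s*(\S+)", str),
    }
    result = {}
    for key, (pattern, cast) in patterns.items():
        match = re.search(pattern, text)
        # A field is None when its line is missing from the output.
        result[key] = cast(match.group(1)) if match else None
    return result


summary = parse_scistree_output(SAMPLE_OUTPUT)
print(summary["newick"], summary["log_likelihood"])
```

The tree string is in parenthesized (Newick-style) notation, so it can be passed on to any tree library that reads that notation.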

Options:

  • -e: Output a mutation tree (which may not be a binary tree) with branch labels from the called genotypes.
  • -e0: Output a mutation tree without branch labels, which is useful for visualizing large trees.
  • -q: Use NNI (Nearest Neighbor Interchange) for local tree search. NNI is faster but less accurate. By default, ScisTree2 uses SPR (Subtree Pruning and Regrafting) local search, which we have found to be very fast.
  • -T <num-of-threads>: Specify the number of threads for multi-threading support.
  • -s <num-of-iterations>: Set the maximum number of iterations to control the running time. A smaller number (e.g., 5) will reduce the running time but may also reduce accuracy. Default: 1,000 iterations.
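
The options above can be assembled from a Python script before calling the executable. This is a minimal sketch under stated assumptions: the helper names build_scistree_command and run_scistree are hypothetical, and placing the options before the input file name is an assumption; run ./scistree without arguments to confirm the exact usage.

```python
import subprocess


def build_scistree_command(input_path, executable="./scistree",
                           mutation_tree=False, use_nni=False,
                           threads=None, max_iterations=None):
    """Build an argument list using only the options listed above."""
    cmd = [executable]
    if mutation_tree:
        cmd.append("-e")                      # mutation tree with branch labels
    if use_nni:
        cmd.append("-q")                      # NNI instead of the default SPR search
    if threads is not None:
        cmd += ["-T", str(threads)]           # number of threads
    if max_iterations is not None:
        cmd += ["-s", str(max_iterations)]    # cap on search iterations
    cmd.append(input_path)                    # assumption: input file comes last
    return cmd


def run_scistree(cmd):
    """Run the command and return its console text; raises if it fails."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


cmd = build_scistree_command("example_input.txt", threads=4, max_iterations=5)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.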

You may also read the ScisTree2 User Manual, which is distributed with ScisTree2 in PDF format.

Data format of ScisTree2 in C++

First, you should understand some basics about ScisTree2. I recommend reading the user manual of the original ScisTree.

The first step in using ScisTree2 is to prepare the input. Here is the content of an example (example_input.txt):

c1 c2 c3 c4 c5
s1 0.01 0.6 0.08 0.8 0.7
s2 0.8 0.02 0.7 0.01 0.3
s3 0.02 0.8 0.02 0.8 0.9
s4 0.9 0.9 0.8 0.8 0.02
s5 0.01 0.8 0.01 0.8 0.9
s6 0.05 0.02 0.7 0.05 0.9

Explanations:

  • Specify the cell names in the first row, for example "c1 c2 c3 c4 c5". Do not use HAPLOID or HAPLOTYPES as cell names: these two words are reserved keywords in ScisTree2.

  • Each following row starts with the row (site) identifier, followed by the probability of each of the five cells being zero (wild type) at that site. For example, the second row says that, at the first site, the first cell has probability 0.01 of being wild type, the second cell has probability 0.6 of being wild type, and so on.

    Be careful: the rows are the SNV sites and the columns are the cells. Do not transpose them.
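
If your probabilities live in Python, the layout described above can be written out with a few lines of code. This is a minimal sketch, not part of ScisTree2: the function name format_scistree_input is hypothetical, the case-insensitive check for the reserved words is a conservative assumption, and values are printed with six significant digits.

```python
RESERVED_NAMES = {"HAPLOID", "HAPLOTYPES"}


def format_scistree_input(cell_names, site_names, prob_wild_type):
    """Return input text: one header row of cells, then one row per site.

    prob_wild_type[i][j] is the probability that cell j is wild type (0)
    at site i, so rows are sites and columns are cells.
    """
    for name in cell_names:
        if name.upper() in RESERVED_NAMES:
            raise ValueError(f"reserved keyword used as cell name: {name}")
    if len(site_names) != len(prob_wild_type):
        raise ValueError("need exactly one row of probabilities per site")
    lines = [" ".join(cell_names)]
    for site, row in zip(site_names, prob_wild_type):
        if len(row) != len(cell_names):
            raise ValueError(f"site {site}: expected one value per cell")
        if any(not 0.0 <= p <= 1.0 for p in row):
            raise ValueError(f"site {site}: probabilities must lie in [0, 1]")
        lines.append(" ".join([site] + [f"{p:g}" for p in row]))
    return "\n".join(lines) + "\n"


# First two sites of example_input.txt.
text = format_scistree_input(
    ["c1", "c2", "c3", "c4", "c5"],
    ["s1", "s2"],
    [[0.01, 0.6, 0.08, 0.8, 0.7],
     [0.8, 0.02, 0.7, 0.01, 0.3]],
)
print(text, end="")
```

Write the returned text to a file and pass that file name to scistree.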

ScisTree2 is essentially a faster and somewhat more accurate ScisTree. Some features of the original ScisTree (version 1) are not supported in the current implementation of ScisTree2. These include: (i) ternary data input (ScisTree2 only supports binary data for now); (ii) parameter imputation and doublet imputation. I have not yet had a chance to upgrade these features. For the moment, ScisTree2 is dedicated to cell lineage tree inference.

What is new about ScisTree2 over ScisTree?

The main changes concern speed and accuracy. ScisTree2 is an order of magnitude faster than ScisTree. ScisTree2 supports multi-threading, while ScisTree does not. More importantly, ScisTree2 implements faster and possibly more accurate tree search algorithms. By default, ScisTree2 performs subtree prune and regraft (SPR) local search, while ScisTree performs nearest neighbor interchange (NNI) search. SPR local search is usually more accurate than NNI search. Our tests show that ScisTree2 can infer a cell lineage tree from data with 10,000 cells (and, say, 10,000 single-nucleotide variant, or SNV, sites) while being more accurate in both cell lineage tree inference and genotype calling.

Data Availability

All simulated data, experimental data (HGSOC), and scripts used to reproduce the results in the ScisTree2 paper are released on Zenodo.

Contact

If you have questions or problems, please open an issue in the GitHub repository.
