From a6e55341b2f43f1ba6ad2a707482c27909c2485e Mon Sep 17 00:00:00 2001
From: Luiz Mata Lopez
Date: Tue, 26 Aug 2025 17:52:10 -0400
Subject: [PATCH 1/4] doc update

---
 README.md | 3 +
 data/simulation/README.md | 139 ++++++++++++++++++++++++++++++++++++
 data/simulation/Snakefile | 6 +-
 data/simulation/config.yaml | 2 +-
 4 files changed, 146 insertions(+), 4 deletions(-)
 create mode 100644 data/simulation/README.md

diff --git a/README.md b/README.md
index 4f4b343..346e342 100644
--- a/README.md
+++ b/README.md
@@ -36,6 +36,9 @@ mkdir data/sample
 python src/simulation_reads.py -n 25 -m 25 -p 5 -k 1 -s 0 -d 0.1 -a 0.001 -b 0.001 -o data/sample/overview
 ```
 
+For more in-depth coverage of the data simulator, please refer to [the simulation documentation](data/simulation/README.md).
+
+
 ### ConDoR
 
diff --git a/data/simulation/README.md b/data/simulation/README.md
new file mode 100644
index 0000000..411bb95
--- /dev/null
+++ b/data/simulation/README.md
@@ -0,0 +1,139 @@
+# Tumor phylogeny data simulation
+
+A standalone tool that simulates mutation matrices and phylogenetic trees of tumor cells, together with assignments of cells to copy number clusters, for evaluating ConDoR and other cell lineage tree inference methods.
+
+## Quick Start
+
+Usage:
+
+```bash
+python src/simulation_reads.py \
+[-h] [-n N] [-m M] [-p P] [-k K] [-o O] [-s S] [-d D] [--cov COV] [-a A] [-b B] [--ado ADO] \
+[--maxcn MAXCN] [--readthreshold READTHRESHOLD] [--vafthreshold VAFTHRESHOLD] [-l L] [-v]
+```
+
+Example:
+
+```bash
+#25 cells, 25 SNV mutations, 3 copy number clusters, k=5 example with missing data + error
+
+python simulation_reads.py \
+
+-n 25 -m 25 -p 3 -k 5 -s 1 -d 0.1 --cov 50 -a 0.001 -b 0.001 --ado 15 \
+
+--maxcn 8 --readthreshold 5 --vafthreshold 0.1 -l 0.8 -v
+
+```
+
+## Detailed Parameter Specification
+
+The table below describes each simulation parameter in more detail, along with its input type and default value.
+
+| cmd flag | Description | Type | Default |
+| --- | --- | --- | --- |
+| -n | Number of samples/cells | integer | 5 |
+| -m | Number of mutations | integer | 5 |
+| -p | Number of copy number clusters | integer | 1 |
+| -k | (k)-Dollo/number of allowed losses | integer | 0 |
+| -s | Seed/replicate number | integer | 0 |
+| -d | Missing data rate | float | 0.0 |
+| --cov | Mean coverage for read counts | integer | 50 |
+| -a | Mutation matrix false positive rate | float | NA |
+| -b | Mutation matrix false negative rate | float | NA |
+| -l | Mutation loss rate | float | 0.8 |
+| --ado | Allelic dropout precision | float | 15 |
+| --maxcn | Maximum allowed copy number | integer | 8 |
+| --readthreshold | Variant read count threshold for generating the mutation matrix | integer | 5 |
+| --vafthreshold | VAF threshold for generating the mutation matrix | float | 0.1 |
+
+## Output Specification
+
+**Data formats**
+
+The simulation produces data compatible with ConDoR, SCARLET, SPhyR, SCITE, SIFIT, and PhiSCS. It is also relatively straightforward to add custom output formats.
+
+**Character Matrices**
+
+* `character_matrix.csv` — matrix with errors and a copy number column; typically the method input
+
+* `character_matrix_without_noise.csv` — ground-truth matrix used for evaluation
+
+* `multi_state_character_matrix.csv` — k-Dollo matrix
+
+* `multi_state_tree_node_character_matrix.csv` — intermediate matrix prior to leaf assignment
+
+**Read Counts**
+
+* `read_count.csv` — total counts with missing data
+
+* `read_count_without_missing.csv` — total counts
+
+* `variant_count.csv` — variant counts with missing data, false positives, and false negatives
+
+* `variant_count_without_missing.csv` — variant counts with only false positives and false negatives
+
+**Trees**
+
+* `tree.dot` — mutation tree with cells, for viewing in GraphViz
+
+* `tree_edgelist.csv` — true tree edge list for evaluation
+
+* `tree.newick` — Newick string representing the multifurcating cell lineage tree
+
+**User-defined Formats**
+
+To create a custom output format for a method not listed above, modify the objects listed below within the main function of `simulation_reads.py`. An example is provided.
+
+* Dataframe object holding the character matrix (method input): `Acell_noisy`
+
+* Dataframe objects holding read counts (method input): `df_Rcount_missing` and `df_Vcount_missing`
+
+Example:
+
+```python
+# EXAMPLE: SCITE data format
+Acell_noisy_scite = Acell_noisy[:,:-1].T.copy().astype(int)
+Acell_noisy_scite[Acell_noisy_scite == -1] = 3
+np.savetxt(f"{prefix}_scite.txt", Acell_noisy_scite, delimiter=" ", fmt='%d')
+
+# Create and save your own output format here
+...
+```
+
+## Simulating Large Datasets with Snakemake
+
+The Snakefile and configuration file can be found in `data/simulation`. The current configuration file generates the datasets from the ConDoR study; it can be modified to add additional cases or replicates. A few changes must also be made by the user.
In the `config.yaml` file, change the following line to match your system’s Python executable.

```yaml
python2_exec: '/path/to/python'
```

After making the change above, run the pipeline with one core using the following command from within the `data/simulation` directory. For large configurations, it is advisable to use more than one CPU core.

```bash
snakemake --cores 1
```

This command will create the `ground_truth` simulation directory with the data. Within this directory, filename prefixes correspond to the parameter configuration.

## Simulation Strategy

**True cell lineage tree and mutation matrix**

Tree simulation is done using a random growing network where our node set consists of mutations and copy number deletions. There are $m+p$ total nodes where $m$ and $p$ are given by the `-m` and `-p` fields and all nodes are attached to an existing node in the tree with equal probability. Copy number deletions induce the loss of mutation nodes on the path from the deletion node to the root node, each with probability given by the mutation loss rate parameter `-l`. Each mutation is allowed to be lost at most $k$ times where $k$ is given by the `-k` field, and each copy number deletion node will change the copy number cluster column for all descendant nodes.

A matrix is constructed alongside the tree, we have a correspondence between nodes on the tree and rows in this intermediate matrix called `multi_state_tree_node_character_matrix`. A subset of these rows are then selected and relabeled $s_0, s_1, …, s_n$ , and binarized this is the true mutation matrix `character_matrix_without_noise`.

**Read counts and estimated mutation matrix**

Total read counts are Poisson distributed with mean coverage parameter given by the field `--cov`.
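As a rough illustration, total read counts as described above can be drawn as follows. This is a minimal numpy sketch, not the simulator's actual code; the variable names are assumptions.

```python
import numpy as np

# Hypothetical sketch: one Poisson draw per (cell, mutation) entry,
# with the mean given by the coverage parameter (--cov).
rng = np.random.default_rng(0)
n_cells, n_mutations, coverage = 25, 25, 50
total_counts = rng.poisson(lam=coverage, size=(n_cells, n_mutations))
```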
Variant read counts are simulated under the beta-binomial read count model, parameterized by the user-provided false positive rate `-a`, false negative rate `-b`, and allelic dropout precision `--ado`. Given the read counts, the variant allele frequency matrix is computed and mutations are called according to the value provided for the VAF threshold parameter `--vafthreshold`.

For more information on this simulation or the ConDoR simulated study, please refer to the main publication and supplement.

diff --git a/data/simulation/Snakefile b/data/simulation/Snakefile
index 2b7570a..d9d26f3 100644
--- a/data/simulation/Snakefile
+++ b/data/simulation/Snakefile
@@ -18,8 +18,8 @@ filtered_product = filter_combinator(itertools.product)
rule all:
    input:
        # simulation
-#expand('ground_truth/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_read_count.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']),
-#expand('ground_truth/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_character_matrix.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']),
+expand('ground_truth/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_read_count.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']),
+expand('ground_truth/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_character_matrix.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']),
        # condor
        #expand('condor/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_B.csv', filtered_product, seed=seeds,
ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']), #expand('condor_reads/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_B.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']), @@ -33,7 +33,7 @@ rule all: #expand('sifit/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_mlTree.newick', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']), #expand('sifit/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_character_matrix.tsv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k']), # evaluate - expand('results/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_{method}.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k'], method=config['eval_methods']), + #expand('results/n{ncells}_m{ncharacters}_p{nclusters}_k{k}_d{missing}_s{seed}_{method}.csv', filtered_product, seed=seeds, ncells=config['ncells'], ncharacters=config['ncharacters'], nclusters=config['nclusters'], missing=config['missing_rate'], k=config['k'], method=config['eval_methods']), rule simulate: output: diff --git a/data/simulation/config.yaml b/data/simulation/config.yaml index badc3a1..ee00103 100644 --- a/data/simulation/config.yaml +++ b/data/simulation/config.yaml @@ -36,7 +36,7 @@ estimated_fn: 0.16 nseeds: 5 # scarlet parameters -python2_exec: '/n/fs/ragr-data/users/palash/anaconda3/envs/scarlet/bin/python2' +python2_exec: '/path/to/python' scarlet_exec: 
'/n/fs/ragr-data/users/palash/scarlet/code/scarlet.py' scarlet_plot_exec: '/n/fs/ragr-data/users/palash/scarlet/code/plot_tree.py' From 9761228ada714fd850ceefb930900eda1658b079 Mon Sep 17 00:00:00 2001 From: Luiz Mata Lopez <70717884+LuizMata@users.noreply.github.com> Date: Tue, 26 Aug 2025 17:58:02 -0400 Subject: [PATCH 2/4] Update README.md --- data/simulation/README.md | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/data/simulation/README.md b/data/simulation/README.md index 411bb95..58ac332 100644 --- a/data/simulation/README.md +++ b/data/simulation/README.md @@ -15,16 +15,11 @@ python src/simulation_reads.py \ ``` -Example: +Example: 25 cells, 25 SNV mutations, 3 copy number clusters, k=5 example with missing data + error ```bash -#25 cells, 25 SNV mutations, 3 copy number clusters, k=5 example with missing data + error - -python simulation_reads.py \ - --n 25 -m 25 -p 3 -k 5 -s 1 -d 0.1 --cov 50 -a 0.001 -b 0.001 --ado 15 \ ---maxcn 8 --readthreshold 5 --vafthreshold 0.1 -l 0.8 -v +python simulation_reads.py -n 25 -m 25 -p 3 -k 5 -s 1 -d 0.1 --cov 50 -a 0.001 -b 0.001 --ado 15 --maxcn 8 --readthreshold 5 --vafthreshold 0.1 -l 0.8 -v ``` From c0c09be94be4f7754e3cce6cf5a1077d867ee636 Mon Sep 17 00:00:00 2001 From: Luiz Mata Lopez <70717884+LuizMata@users.noreply.github.com> Date: Tue, 26 Aug 2025 18:01:38 -0400 Subject: [PATCH 3/4] Update README.md --- data/simulation/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/data/simulation/README.md b/data/simulation/README.md index 58ac332..1f469e3 100644 --- a/data/simulation/README.md +++ b/data/simulation/README.md @@ -19,7 +19,8 @@ Example: 25 cells, 25 SNV mutations, 3 copy number clusters, k=5 example with mi ```bash -python simulation_reads.py -n 25 -m 25 -p 3 -k 5 -s 1 -d 0.1 --cov 50 -a 0.001 -b 0.001 --ado 15 --maxcn 8 --readthreshold 5 --vafthreshold 0.1 -l 0.8 -v +mkdir example +python simulation_reads.py -n 25 -m 25 -p 3 -k 5 -s 1 -d 0.1 --cov 50 -a 
0.001 -b 0.001 --ado 15 --maxcn 8 --readthreshold 5 --vafthreshold 0.1 -l 0.8 -o example/example
 ```
 

From 2ea2294bc466798910c70989abe49529d60f1a4f Mon Sep 17 00:00:00 2001
From: Luiz Mata Lopez <70717884+LuizMata@users.noreply.github.com>
Date: Tue, 26 Aug 2025 18:09:39 -0400
Subject: [PATCH 4/4] Update README.md

---
 data/simulation/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data/simulation/README.md b/data/simulation/README.md
index 1f469e3..887b16f 100644
--- a/data/simulation/README.md
+++ b/data/simulation/README.md
@@ -125,7 +125,7 @@ This command will create the `ground_truth` simulation directory with the data.
 
 Tree simulation is done using a random growing network where our node set consists of mutations and copy number deletions. There are $m+p$ total nodes where $m$ and $p$ are given by the `-m` and `-p` fields and all nodes are attached to an existing node in the tree with equal probability. Copy number deletions induce the loss of mutation nodes on the path from the deletion node to the root node, each with probability given by the mutation loss rate parameter `-l`. Each mutation is allowed to be lost at most $k$ times where $k$ is given by the `-k` field, and each copy number deletion node will change the copy number cluster column for all descendant nodes.
 
-A matrix is constructed alongside the tree, we have a correspondence between nodes on the tree and rows in this intermediate matrix called `multi_state_tree_node_character_matrix`. A subset of these rows are then selected and relabeled $s_0, s_1, …, s_n$ , and binarized this is the true mutation matrix `character_matrix_without_noise`.
+A matrix is constructed alongside the tree: nodes of the tree correspond to rows of this intermediate matrix, called `multi_state_tree_node_character_matrix`. A subset of these rows is then selected, relabeled $s_0, s_1, …, s_n$, and binarized; the result is the true mutation matrix `character_matrix_without_noise`.
**Read counts and estimated mutation matrix**
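The variant read count model and thresholded mutation calling described in the simulation strategy can be sketched roughly as follows. This is an illustrative numpy sketch, not the simulator's actual code; the function name `call_mutations`, the heterozygous expected VAF of 0.5, and the Beta mean/precision parameterization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def call_mutations(total, mutated, vaf=0.5, ado=15.0, fp=0.001, fn=0.001,
                   read_threshold=5, vaf_threshold=0.1):
    """Illustrative beta-binomial variant counts plus VAF-based calling.

    `total` is a matrix of total read counts and `mutated` a boolean
    ground-truth matrix; the exact parameterization used by
    simulation_reads.py may differ.
    """
    # Expected VAF per entry: roughly 0.5 (heterozygous) where the mutation
    # is present, the false positive rate where it is absent.
    mean_vaf = np.where(mutated, vaf * (1.0 - fn), fp)
    # Beta-binomial: draw a per-entry success probability from a Beta
    # distribution whose precision is the allelic dropout parameter (--ado),
    # then draw variant reads binomially.
    p = rng.beta(mean_vaf * ado, (1.0 - mean_vaf) * ado)
    variant = rng.binomial(total, p)
    # Call a mutation only where both the variant read count threshold
    # (--readthreshold) and the VAF threshold (--vafthreshold) are met.
    observed_vaf = np.divide(variant, total, out=np.zeros_like(p),
                             where=total > 0)
    calls = (variant >= read_threshold) & (observed_vaf >= vaf_threshold)
    return variant, calls
```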