PlasEval is a tool aimed at evaluating the accuracy and at comparing methods for the problem of plasmid binning.
Plasmid binning aims at detecting, from the draft assembly of a single isolate bacterial pathogen, groups of contigs assumed eachto originate from a plasmid present in the sequenced isolate. Recent methods for plasmid binning include PlasBin-Flow, MOB-recon and gplas.
The result of applying a plasmid binning method to a draft assembly is a set of unordered groups of contigs, called plasmid bins.
In the evaluation mode of PlasEval, given a set of predicted plasmid bins resulting from a plasmid binning tools and a ground truth set of plasmid bins, where each true plasmid bin contains all the contigs that belong to one of the plasmid present in the sequenced isolate, PlasEval computes three statistics, the precision, the recall and the F1-score of the predicted plasmid bins with respect to the ground truth.
For a group
The F1-score is the arithmetic mean of the precision and recall
In the comparison mode of PlasEval, given two sets of predicted plasmid bins (either resulting from two plasmid binning tools or from a plasmid binning tool and a ground truth), PlasEval computes three statistics, PlasEval computes a dissimilarity measure that indictaes how much both sets of predicted plasmid bins are in agreement. The full details of the dissimilarity measure computed by PlasEval are available in the preprint PlasEval: a framework for comparing and evaluating plasmid binning tools and we describe below its general principle. The dissimilariy value is the sum of 4 terms accounting respectively for
-
extra contigs: the contigs present in
$A$ but not in$B$ ; -
missing contigs: the contigs present in
$B$ but not in$A$ ; -
splits: splitting of the plasmid bins of
$A$ to obtain a set of intermediate bins defined as the intersection of$A$ and$B$ ; -
joins: joining the splitted plasmid bins into the plasmid bins of
$B$ .
Each of these components incur a cost, parameterized by a parameter
- each extra or missing contig
$c$ results in a cost$\ell(c)^\alpha$ , where$\ell(c)$ is the length of$c$ ; - each split of a plasmid bin
$P$ into two smaller bins$P',P''$ results in a cost$\min(L(P')^\alpha,L(P'')^\alpha)$ ; - each join of two intermediate plasmid bins
$P',P''$ into larger bin$P$ results in a cost$\min(L(P')^\alpha,L(P'')^\alpha)$ .
So with
In the case where some contigs are repeated, i.e. a contig appears in more than one plasmid bin of
Finally the dissimilarity obtained as described above is normalized into a value in
pip install -r requirements.txtThe main input in both modes of PlasEval are the two plasmid bins files.
A plasmid bins file is a TSV file that describes a set of plasmids bins in a forma where each row contains three pieces of information: the identifier of a plasmid bin, the identifier of a contig that belongs to this plasmid bin and the length of the contig.
The file should have a header row with column names plasmid, contig, contig_len.
An example is provided below that describes the set of plasmid bins where bin P1 contains contigs C1,C2 and bin P2 contains contigs C1, C3 and C4.
plasmid contig contig_len
P1 C1 2000
P2 C3 3000
P1 C2 2000
P2 C1 2000
P2 C4 2000If a contig appears in several copies in a plasmid bin, the evaluation mode only accounts for one copy of the contig. The comparison mode can account for multiple copies of a contig in the same bin.
In both evaluation and comparison mode, PlasEval takes an extra optional parameter min_len: every contig of length below the value min_len is discarded from both sets of considered plasmid bins. This parameter is useful in comparison mode in the case of plasmid bins sets that contain many short repeated contigs, which can result in the branch-and-bound algorithm taking a long time to complete.
The comparison mode uses two more parameters. Firstly, the value of p, although by default it takes value max_calls) is
-
The following command is used for the evaluation mode:
python plaseval.py eval --pred PREDICTED_BINS_TSV --gt GROUNDTRUTH_BINS_TSV --out_file OUT_FILE --log_file LOG_FILE (--min_len LEN_THRESHOLD)Where
predandgtare TSV files, with the set of predicted and ground truth plasmid bins respecitvely.out_fileis the path to the output file. The integer length thresholdmin_lencan be provided as an optional parameter. -
In addition to the two plasmid bin files, the comparison mode also takes the path to the log file as input. The following command is used for the comparison mode:
python plaseval.py comp --l LEFT_BINS_TSV --r RIGHT_BINS_TSV --out_file OUT_FILE --log_file LOG_FILE (--min_len LEN_THRESHOLD --p ALPHA --max_calls MAX_RECURSIVE_CALLS)
Where
LEFT_BINS_TSVandRIGHT_BINS_TSVare TSV files, each with one set of plasmid bins.out_fileis the path to the output file whilelog_fileis the path to the log file. The parametersmin_len,pandmax_callsare optional.
-
The output file is TSV file with the following columns:
Level: Precision and recall statistics are computed for individual bins (Individuallevel). The average precision, recall and F1 statistics are then computed for the overall sample (Overalllevel).Statistic: This columns list the type of statistic (Precision,RecallorF1).Bin: For theIndividuallevel, we compute the precision for each predicted plasmid bin and the recall for each ground truth plasmid bin. The identity of the bin (predicted or ground truth) is given in this column. Note that this column is empty forOverallsample statistics.Unwtd_Stat: Contig-level statistics for an individual bin or for the overall sample.Wtd_Stat: Basepair-level statistics for an individual bin or for the overall sample.Unwtd_Match: The best bin from the opposite side matched to the bin in question (from the 'Bin' column) according to contig-level statistics. ForPrecision, this column will have the ground truth bin that best matches the predicted bin from the 'Bin' column. ForRecall, this column will have the predicted bin that best matches the ground truth bin from the 'Bin' column. Note that this column is empty forOverallsample statistics.Wtd_Stat: The best bin from the opposite side matched to the bin in question (from the 'Bin' column) according to basepair-level statistics. ForPrecision, this column will have the ground truth bin that best matches the predicted bin from the 'Bin' column. ForRecall, this column will have the predicted bin that best matches the ground truth bin from the 'Bin' column. Note that this column is empty forOverallsample statistics.
-
The output file for the compare mode contains the following information:
Total_ctg_length: Cumulative length of contigs present in at least one of set of plasmid bins.Total_ctg_length_alpha: Cumulative dissimilarity cost of all contigs from (a).Cuts: Cost of cuts: cost of splitting bins from the first set of plasmid bins.Joins: Cost of joins: cost of splitting bins from the second set of plasmid bins.Extra_ctgs: Cumulative length of contigs present only in the first set.Missing_ctgs: Cumulative length of contigs present only in the second set.Dissimilarity: Dissimilarity score
The compare mode also provides a log file with some other details related to the comparison algorithm. These include the maximum number of matchings possible, the time taken to execute the method, the number of recursive function calls made during the comparison and finally the actual matching between contigs of both sets of plasmid bins that yields the dissimilarity score in the output file described above.
A few toy examples to demonstrate the use of PlasEval have been provided in the examples directory.
-
The following command evaluates predicted plasmid bins (
pred_bins_1.tsv) against the ground truth plasmid bins (gt_bins_1.tsv). Details of the evaluation output will be printed to the fileP1G1_eval.out.cd src/ python plaseval.py eval --pred ../examples/input/pred_bins_1.tsv --gt ../examples/input/gt_bins_1.tsv --out_file ../examples/output/P1G1_eval.out --log_file ../examples/output/P1G1_eval.log
-
The following command evaluates predicted plasmid bins (
pred_bins_1.tsv) against the other set of ground truth plasmid bins (gt_bins_2.tsv), only considering contigs above$1000$ bp. Details of the evaluation output will be printed to the fileP1G2_eval.out.cd src/ python plaseval.py eval --pred ../examples/input/pred_bins_1.tsv --gt ../examples/input/gt_bins_2.tsv --min_len 1000 --out_file ../examples/output/P1G2_eval.out
-
The following command will compare the sets of plasmid bins in
pred_bins_1.tsvandgt_bins_1.tsv. The details of the dissimilarity between the two sets will be output toP1G1_comp.outwhile the log fileP1G1_comp.logcontains the miscellaneous detials.cd src/ python plaseval.py comp --l ../examples/input/pred_bins_1.tsv --r ../examples/input/gt_bins_1.tsv --out_file ../examples/output/P1G1_comp.out --log_file ../examples/output/P1G1_comp.log -
The following command will compare the sets of plasmid bins in
pred_bins_1.tsvandpred_bins_2.tsv, neither of which contains the ground truth. The cost parameter$\alpha$ is set at$0.75$ . The details of the dissimilarity between the two sets will be output toP1P2_comp.outwhile the log fileP1P2_comp.logcontains the miscellaneous detials for the comparison.cd src/ python plaseval.py comp --l ../examples/input/pred_bins_1.tsv --r ../examples/input/pred_bins_2.tsv --p 0.25 --out_file ../examples/output/P1P2_0.75_comp.out --log_file ../examples/output/P1P2_0.75_comp.log