Binary file modified .DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions .github/workflows/tests.yml
@@ -37,6 +37,7 @@ jobs:
pip install polars==0.16.18
pip install pypandoc==1.7.2
pip install "tensorflow<2.16"
pip install imbalanced-learn
pip install -e .
- name: List installed packages
run: |
35 changes: 35 additions & 0 deletions docs/pvacseq/optional_downstream_analysis_tools.rst
@@ -61,6 +61,41 @@ epitopes are well-binding to. Lastly, the report will bin variants into tiers
that offer suggestions as to the suitability of variants for use in vaccines.
For a full definition of these tiers, see the pVACseq :ref:`output file documentation <aggregated>`.

Add ML Predictions
------------------

.. program-output:: pvacseq add_ml_predictions -h

This tool adds machine learning (ML)-based neoantigen prioritization predictions to existing pVACseq output files.
It uses a trained random forest model to predict whether neoantigen candidates should be evaluated as "Accept",
"Reject", or "Pending" based on a comprehensive set of features derived from binding affinity predictions,
expression data, and variant characteristics.

This tool requires that you have already generated both MHC Class I and Class II aggregated reports using
the ``generate_aggregated_report`` command or by running the pVACseq pipeline (``pvacseq run``). It takes as input
the Class I aggregated TSV, Class I all epitopes TSV, and Class II aggregated TSV files from a pVACseq run.
The tool merges these files, performs data cleaning and imputation, and applies the ML model to generate evaluation predictions for each variant.

The output file is named ``<sample_name>_predict_pvacview.tsv`` and contains all columns from the original
Class I aggregated file, with the existing ``Evaluation`` column populated by the model and one new column added:

.. list-table::

   * - ``Evaluation``
     - The ML-predicted evaluation status: "Accept", "Reject", or "Pending", based on the prediction probability score.
   * - ``ML Prediction (score)``
     - A formatted string combining the model-predicted evaluation with the prediction probability score (e.g.,
       "Accept (0.72)"). It shows "NA" for variants where the model could not make a prediction, for example because
       the candidate is missing from one of the Class I or Class II aggregated reports.

The ``--threshold_accept`` parameter controls the probability threshold for Accept predictions (default: 0.55).
Variants with prediction probabilities >= this threshold are evaluated as "Accept". The ``--threshold_reject`` parameter
controls the probability threshold for Reject predictions (default: 0.30). Variants with prediction probabilities <=
this threshold are evaluated as "Reject". Everything in between is set to "Pending" for manual review.
The ``--artifacts_path`` parameter allows you to specify a custom directory
containing ML model artifacts, though by default the tool uses the model artifacts included with the pvactools package.
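
The thresholding rule described above can be sketched in a few lines of Python. This is an illustration only, not the pvactools implementation: the function name ``classify`` and its argument names are hypothetical, while the default values and comparison operators follow the description above.

```python
def classify(probability, threshold_accept=0.55, threshold_reject=0.30):
    """Map a prediction probability to an evaluation label (illustrative sketch).

    `classify` is a hypothetical helper, not part of pvactools. The defaults
    mirror the documented --threshold_accept and --threshold_reject values.
    """
    if threshold_reject >= threshold_accept:
        raise ValueError("threshold_reject must be less than threshold_accept")
    if probability is None:
        # No prediction available: left for manual review.
        return "Pending"
    if probability >= threshold_accept:
        return "Accept"
    if probability <= threshold_reject:
        return "Reject"
    # Strictly between the two thresholds.
    return "Pending"


labels = [classify(p) for p in (0.72, 0.48, 0.15)]
```

Raising ``threshold_accept`` or lowering ``threshold_reject`` widens the "Pending" band, so more candidates are left for manual review.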

Calculate Reference Proteome Similarity
---------------------------------------

24 changes: 24 additions & 0 deletions docs/pvacseq/output_files.rst
@@ -55,6 +55,9 @@ created):
* - ``ui.R``, ``app.R``, ``server.R``, ``styling.R``, ``anchor_and_helper_functions.R``
- pVACview R Shiny application files. Not generated when running only with presentation and immunogenicity algorithms.
* - ``www`` (directory)
- Directory containing image files for pVACview. Not generated when running only with presentation and immunogenicity algorithms.
* - ``ml_predict/<sample_name>_predict_pvacview.tsv`` (optional)
- ML-based neoantigen evaluation predictions file. Generated when both MHC Class I and Class II predictions are run and the ``--run-ml-predictions`` flag is set.


@@ -387,6 +390,27 @@ included epitopes, selecting the best-scoring epitope, and which values are outp
* - ``Evaluation``
- Column to store the evaluation of each variant when evaluating the run in pVACview. Either ``Accept``, ``Reject``, or ``Review``.

.. _ml_prediction_output:

<sample_name>_predict_pvacview.tsv Report Columns
--------------------------------------------------

The ``<sample_name>_predict_pvacview.tsv`` file is generated when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
tool or when running pVACseq with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled.
This file contains all columns from the Class I aggregated file (``all_epitopes.aggregated.tsv``) with one additional ML prediction column added.

The file is written to the ``ml_predict`` subdirectory within the output directory.

.. list-table::
   :header-rows: 1

   * - Column Name
     - Description
   * - All columns from ``all_epitopes.aggregated.tsv``
     - All columns described in the :ref:`aggregated` section above are included in this file.
   * - ``Evaluation``
     - Populated with the ML-predicted evaluation status for each candidate. Values: ``Accept`` for variants with prediction probability >= threshold_accept (default: 0.55), ``Reject`` for variants with prediction probability <= threshold_reject (default: 0.30), and ``Pending`` for variants with prediction probability between threshold_reject and threshold_accept or when the ML model cannot make a prediction due to missing data.
   * - ``ML Prediction (score)``
     - ML-based evaluation with its probability score. Format: ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``). Candidates that are ``Pending`` in the ``Evaluation`` column are labeled ``Review`` here. Shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).
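
As a rough illustration of working with this file, the sketch below tallies the ``Evaluation`` column using only the Python standard library. The three-row excerpt, the ``Gene`` column, and the gene names are hypothetical stand-ins; a real ``<sample_name>_predict_pvacview.tsv`` carries every column of the aggregated report.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; a real file has many more columns and rows.
SAMPLE_TSV = (
    "Gene\tEvaluation\tML Prediction (score)\n"
    "GENE_A\tAccept\tAccept (0.72)\n"
    "GENE_B\tReject\tReject (0.15)\n"
    "GENE_C\tPending\tNA\n"
)


def count_evaluations(handle):
    """Count how many candidates fall under each Evaluation status."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Evaluation"] for row in reader)


# For a file on disk, pass open(path, newline="") instead of StringIO.
counts = count_evaluations(io.StringIO(SAMPLE_TSV))
```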

.. _pvacseq_best_peptide:

Best Peptide Criteria
72 changes: 72 additions & 0 deletions docs/pvacview/pvacseq_module/pvacseq_vignette.rst
@@ -403,6 +403,78 @@ These potentially problematic characteristics are also flagged by the red boxes
Since the candidate peptide has a match in the reference proteome, we will reject this candidate by clicking the
thumbs-down button.

ML-Based Neoantigen Evaluation Predictions
__________________________________________

When pVACseq is run with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled, or when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
tool, an aggregated report file with ML predictions (``<sample_name>_predict_pvacview.tsv``) is generated.
This file contains ML-based evaluation predictions that can help prioritize neoantigen candidates by presetting the evaluation status for each candidate.
It can be loaded into pVACview together with the Class I ``metrics.json`` file generated during the original run.
A copy of that ``metrics.json`` file is placed next to the ML prediction output file for convenience.

The ML prediction file includes all columns from the Class I aggregated file, with two differences:

**Evaluation Column**

The ``Evaluation`` column is pre-populated with ML-predicted evaluation status for each candidate:

- ``Accept``: Variants with prediction probability >= threshold_accept (default: 0.55). These candidates are predicted to be favorable for inclusion in a vaccine.
- ``Reject``: Variants with prediction probability <= threshold_reject (default: 0.30). These candidates are predicted to be unfavorable. Note that threshold_reject should be set to a value less than threshold_accept.
- ``Pending``: Variants with prediction probability between threshold_reject and threshold_accept, or for which the ML model cannot make a prediction due to missing data. These candidates require manual review.

**ML Prediction (score) Column**

The ``ML Prediction (score)`` column provides additional context by displaying the evaluation status along with the underlying prediction probability score.
The format is ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``).
Candidates that are ``Pending`` in the ``Evaluation`` column are labeled ``Review`` in this column, as a prompt for users to manually set their ``Evaluation`` status to ``Review``, ``Accept``, or ``Reject``.
This column shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).

The probability score represents the model's confidence that a candidate should be accepted for inclusion in a vaccine, with values closer to 1.0 indicating higher confidence in acceptance.
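
For downstream scripting, the formatted string can be split back into its label and score. The following is a minimal sketch that assumes the values look exactly like the examples above; ``parse_ml_prediction`` is a hypothetical helper, not a pvactools function.

```python
import re

# Matches strings such as 'Accept (0.72)'; assumes the documented format.
_PATTERN = re.compile(r"^(?P<label>\w+) \((?P<score>[0-9]*\.?[0-9]+)\)$")


def parse_ml_prediction(value):
    """Return (label, score) for a formatted prediction, or None for 'NA'."""
    value = value.strip().strip('"')
    if value == "NA":
        return None
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"Unrecognized ML prediction value: {value!r}")
    return match.group("label"), float(match.group("score"))


parsed = [parse_ml_prediction(v) for v in ("Accept (0.72)", "Review (0.48)", "NA")]
```

Returning ``None`` for ``"NA"`` keeps candidates without a prediction distinguishable from candidates with a low score.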


**Important Features Used by the ML Model**

The ML model integrates information from multiple sources to make its predictions. The five most important features are:

1. Allele expression
2. RNA VAF
3. RNA expression
4. NetMHCpan MT IC50 score
5. TSL

The model combines these features (and more not listed here) using a trained random forest algorithm that has learned patterns from expert-reviewed neoantigen candidates.
The predictions serve as a starting point for evaluation, but should be reviewed in conjunction with the detailed information available in pVACview,
including binding affinity plots, anchor position analysis, and reference proteome matches.

**pVACview ML Predictions Example**

To view predictions in pVACview, load the following files:

1. The ML prediction file (``<sample_name>_predict_pvacview.tsv``) in place of the Class I TSV file.
2. The Class I ``metrics.json`` file.
3. The Class II ``aggregated.tsv`` file.
4. A list of genes of interest (optional).

.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example.png
   :width: 1000px
   :align: right
   :alt: pVACview ML Predictions Example
   :figclass: align-left


In the pVACview interface shown above, the ML prediction file is loaded in place of the standard Class I TSV file, with all
other inputs as described. Candidate evaluation statuses are automatically pre-populated based on the ML predictions, as shown in the “Acpt,”
“Rej,” and “Rev” columns, with prediction scores displayed in the “ML Prediction (score)” column. Users may review and override these assignments
as needed.

In this example, MAU2 is classified in the Pass tier by pVACseq and predicted as Accept by the ML model, providing concordant support for its
selection. In contrast, TUBGCP6 is labeled as a PoorBinder by pVACseq but predicted as Accept by the ML model, likely due to favorable features
such as high expression and variant allele frequency (VAF), as well as potential Class II binding indicated in the Additional Data table (shown below). While
this candidate may be provisionally accepted, further evaluation is needed to confirm that all Class II selection criteria are met.

.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example2.png
   :width: 1000px
   :align: right
   :alt: pVACview ML Predictions Example TUBGCP6 Class II Additional Data
   :figclass: align-left

Export
______