griffithlab · jyao36 · Nov 3, 2025 · Nov 4, 2025 · Nov 19, 2025 · Nov 21, 2025
diff --git a/.DS_Store b/.DS_Store
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -37,6 +37,7 @@ jobs:
           pip install polars==0.16.18
           pip install pypandoc==1.7.2
           pip install "tensorflow<2.16"
+          pip install imbalanced-learn
           pip install -e .
       - name: List installed packages
         run: |

diff --git a/docs/images/screenshots/vignette/pvacview-ml-predictions-example.png b/docs/images/screenshots/vignette/pvacview-ml-predictions-example.png
diff --git a/docs/images/screenshots/vignette/pvacview-ml-predictions-example2.png b/docs/images/screenshots/vignette/pvacview-ml-predictions-example2.png
diff --git a/docs/pvacseq/optional_downstream_analysis_tools.rst b/docs/pvacseq/optional_downstream_analysis_tools.rst
@@ -61,6 +61,41 @@ epitopes are well-binding to. Lastly, the report will bin variants into tiers
 that offer suggestions as to the suitability of variants for use in vaccines.
 For a full definition of these tiers, see the pVACseq :ref:`output file documentation <aggregated>`.
 
+Add ML Predictions
+------------------
+
+.. program-output:: pvacseq add_ml_predictions -h
+
+This tool adds machine learning (ML)-based neoantigen prioritization predictions to existing pVACseq output files. 
+It uses a trained random forest model to predict whether neoantigen candidates should be evaluated as "Accept", 
+"Reject", or "Pending" based on a comprehensive set of features derived from binding affinity predictions, 
+expression data, and variant characteristics.
+
+This tool requires that you have already generated both MHC Class I and Class II aggregated reports using 
+the ``generate_aggregated_report`` command or by running the pVACseq pipeline (``pvacseq run``). It takes as input 
+the Class I aggregated TSV, Class I all epitopes TSV, and Class II aggregated TSV files from a pVACseq run. 
+The tool merges these files, performs data cleaning and imputation, and applies the ML model to generate evaluation predictions for each variant.
+
+The output file is named ``<sample_name>_predict_pvacview.tsv`` and contains all columns from the original
+Class I aggregated file, with the ``Evaluation`` column populated by the model and one additional column, ``ML Prediction (score)``:
+
+
+.. list-table::
+
+   * - ``Evaluation``
+     - The ML-predicted evaluation status: "Accept", "Reject", or "Pending", based on the prediction probability score.
+   * - ``ML Prediction (score)``
+     - A formatted string combining the model-predicted evaluation with the prediction probability score (e.g.,
+       "Accept (0.72)"). It shows "NA" for variants where the model could not make a prediction, for example because
+       the candidate is missing from one of the Class I or Class II aggregated reports.
+
+The ``--threshold_accept`` parameter controls the probability threshold for Accept predictions (default: 0.55). 
+Variants with prediction probabilities >= this threshold are evaluated as "Accept". The ``--threshold_reject`` parameter 
+controls the probability threshold for Reject predictions (default: 0.30). Variants with prediction probabilities <= 
+this threshold are evaluated as "Reject". Everything in between is set to "Pending" for manual review. 
+The ``--artifacts_path`` parameter allows you to specify a custom directory 
+containing ML model artifacts, though by default the tool uses the model artifacts included with the pvactools package.
+
 Calculate Reference Proteome Similarity
 ---------------------------------------
 

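As a quick sanity check on the output file described in the hunk above, the ``Evaluation`` column can be tallied with the Python standard library. This is a minimal sketch, not code from pvactools: the helper name ``count_evaluations``, the ``Gene`` column, the gene names, and the three sample rows are all hypothetical stand-ins for a real ``<sample_name>_predict_pvacview.tsv`` file.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row excerpt. A real file carries every column of the
# Class I aggregated report; only the columns used here are shown.
SAMPLE_TSV = (
    "Gene\tEvaluation\tML Prediction (score)\n"
    "GENE_A\tAccept\tAccept (0.72)\n"
    "GENE_B\tReject\tReject (0.15)\n"
    "GENE_C\tPending\tNA\n"
)


def count_evaluations(handle):
    """Count rows per Evaluation status in a tab-separated report."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Evaluation"] for row in reader)


counts = count_evaluations(io.StringIO(SAMPLE_TSV))
print(dict(counts))
```

For a file on disk, the same helper can be given an open file handle, for example ``open(path, newline="")``.
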
diff --git a/docs/pvacseq/output_files.rst b/docs/pvacseq/output_files.rst
@@ -55,6 +55,9 @@ created):
    * - ``ui.R``, ``app.R``, ``server.R``, ``styling.R``, ``anchor_and_helper_functions.R``
      - pVACview R Shiny application files. Not generated when running only with presentation and immunogenicity algorithms.
    * - ``www`` (directory)
-     - Directory containing image files for pVACview. Not generated when running only with presentation and immunogenicity algorithms only.
+     - Directory containing image files for pVACview. Not generated when running only with presentation and immunogenicity algorithms.
+   * - ``ml_predict/<sample_name>_predict_pvacview.tsv`` (optional)
+     - ML-based neoantigen evaluation predictions file. Generated when both MHC Class I and Class II predictions
+       are run and the ``--run-ml-predictions`` flag is set.
 
 
@@ -387,6 +390,27 @@ included epitopes, selecting the best-scoring epitope, and which values are outp
    * - ``Evaluation``
      - Column to store the evaluation of each variant when evaluating the run in pVACview. Either ``Accept``, ``Reject``, or ``Review``.
 
+.. _ml_prediction_output:
+
+<sample_name>_predict_pvacview.tsv Report Columns
+--------------------------------------------------
+
+The ``<sample_name>_predict_pvacview.tsv`` file is generated when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
+tool or when running pVACseq with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled.
+This file contains all columns from the Class I aggregated file (``all_epitopes.aggregated.tsv``) plus one additional ML prediction column, and is written to the ``ml_predict`` subdirectory within the output directory.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Column Name
+     - Description
+   * - All columns from ``all_epitopes.aggregated.tsv``
+     - All columns described in the :ref:`aggregated` section above are included in this file.
+   * - ``Evaluation``
+     - Populated with the ML-predicted evaluation status for each candidate. Values: ``Accept`` for variants with prediction probability >= ``threshold_accept`` (default: 0.55), ``Reject`` for variants with prediction probability <= ``threshold_reject`` (default: 0.30), and ``Pending`` for variants with a prediction probability between the two thresholds or for which the ML model cannot make a prediction due to missing data.
+   * - ``ML Prediction (score)``
+     - ML-predicted evaluation with its probability score. Format: ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``). Candidates whose ``Evaluation`` is ``Pending`` are labeled ``Review`` in this column. Shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).
+
 .. _pvacseq_best_peptide:
 
 Best Peptide Criteria

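The ``ML Prediction (score)`` format documented in the hunk above is simple enough to parse back into a label and a number. The sketch below assumes the values look exactly like the documented examples (a label, a space, and a decimal score in parentheses, or the literal ``NA``); the function name ``parse_ml_prediction`` is hypothetical and not part of pvactools.

```python
import re

# Labels taken from the documented examples: "Accept (0.72)",
# "Reject (0.15)", "Review (0.48)".
_SCORE_RE = re.compile(r"^(Accept|Reject|Review) \((\d+(?:\.\d+)?)\)$")


def parse_ml_prediction(value):
    """Split a score string into (label, probability); return None for "NA"."""
    value = value.strip().strip('"')
    if value == "NA":
        return None
    match = _SCORE_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized ML prediction value: {value!r}")
    return match.group(1), float(match.group(2))


print(parse_ml_prediction("Accept (0.72)"))
print(parse_ml_prediction("NA"))
```

Raising on unrecognized input is a deliberate choice for this sketch, so that a change in the real output format is noticed rather than silently skipped.
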
diff --git a/docs/pvacview/pvacseq_module/pvacseq_vignette.rst b/docs/pvacview/pvacseq_module/pvacseq_vignette.rst
@@ -403,6 +403,78 @@ These potentially problematic characteristics are also flagged by the red boxes
 Since the candidate peptide has a match in the reference proteome, we will reject this candidate by clicking the
 thumbs-down button.
 
+ML-Based Neoantigen Evaluation Predictions
+__________________________________________
+
+When pVACseq is run with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled, or when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
+tool, an aggregated report file with ML predictions (``<sample_name>_predict_pvacview.tsv``) is generated.
+This file contains ML-based evaluation predictions that can help prioritize neoantigen candidates by presetting the evaluation status for each candidate.
+It can be loaded into pVACview in combination with the Class I ``metrics.json`` file generated during the original run.
+For convenience, a copy of this ``metrics.json`` file is placed next to the ML prediction output file.
+
+The ML prediction file includes all columns from the Class I aggregated file, with the following two differences:
+
+**Evaluation Column**
+
+The ``Evaluation`` column is pre-populated with ML-predicted evaluation status for each candidate:
+
+- ``Accept``: Variants with prediction probability >= ``threshold_accept`` (default: 0.55). These candidates are predicted to be favorable for inclusion in a vaccine.
+- ``Reject``: Variants with prediction probability <= ``threshold_reject`` (default: 0.30). These candidates are predicted to be unfavorable. ``threshold_reject`` should be set to a value less than ``threshold_accept``.
+- ``Pending``: Variants with prediction probability between ``threshold_reject`` and ``threshold_accept``, or for which the ML model cannot make a prediction due to missing data. These candidates require manual review.
+
+**ML Prediction (score) Column**
+
+The ``ML Prediction (score)`` column provides additional context by displaying the evaluation status along with the underlying prediction probability score. 
+The format is ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``). 
+Candidates whose ``Evaluation`` is ``Pending`` keep the "Review" label in this column, as a prompt for users to manually set the "Evaluation" column to "Review", "Accept", or "Reject".
+This column shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).
+
+The probability score represents the model's confidence that a candidate should be accepted for inclusion in a vaccine, with values closer to 1.0 indicating higher confidence in acceptance.
+
+**Important Features Used by the ML Model**
+
+The ML model integrates information from multiple sources to make its predictions. The five most important features are:
+
+1. Allele Expression
+2. RNA VAF
+3. RNA Expression
+4. NetMHCpan MT IC50 Score
+5. TSL
+
+The model combines these features (and more not listed here) using a trained random forest algorithm that has learned patterns from expert-reviewed neoantigen candidates. 
+The predictions serve as a starting point for evaluation, but should be reviewed in conjunction with the detailed information available in pVACview, 
+including binding affinity plots, anchor position analysis, and reference proteome matches.
+
+**pVACview ML Predictions Example**
+
+To view the predictions in pVACview, load the following files:
+
+1. The ML prediction file (``<sample_name>_predict_pvacview.tsv``) in place of the Class I aggregated TSV file.
+2. The Class I ``metrics.json`` file.
+3. The Class II ``aggregated.tsv`` file.
+4. A list of genes of interest (optional).
+
+.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example.png
+    :width: 1000px
+    :align: right
+    :alt: pVACview ML Predictions Example
+    :figclass: align-left
+
+In the pVACview interface shown above, the ML prediction file is loaded in place of the standard Class I TSV file, with all 
+other inputs as described. Candidate evaluation statuses are automatically pre-populated based on the ML predictions, as shown in the “Acpt,” 
+“Rej,” and “Rev” columns, with prediction scores displayed in the “ML Prediction (score)” column. Users may review and override these assignments 
+as needed.
+
+In this example, MAU2 is classified in the Pass tier by pVACseq and predicted as Accept by the ML model, providing concordant support for its 
+selection. In contrast, TUBGCP6 is labeled as a PoorBinder by pVACseq but predicted as Accept by the ML model, likely due to favorable features 
+such as high expression and variant allele frequency (VAF), as well as potential Class II binding indicated in the Additional Data table (shown below). While 
+this candidate may be provisionally accepted, further evaluation is needed to confirm that all Class II selection criteria are met.
+
+.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example2.png
+    :width: 1000px
+    :align: right
+    :alt: pVACview ML Predictions Example TUBGCP6 Class II Additional Data
+    :figclass: align-left
 
 Export
 ______
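
The two-threshold rule documented in the hunks above can be sketched in a few lines of Python. This is an illustration of the documented behavior under stated assumptions, not the pvactools implementation: the function name ``evaluate_candidate`` is hypothetical, the two-decimal score formatting is inferred from the examples such as ``"Accept (0.72)"``, and ``None`` stands in for a candidate the model could not score.

```python
def evaluate_candidate(probability, threshold_accept=0.55, threshold_reject=0.30):
    """Return (Evaluation, ML Prediction (score)) for one candidate.

    Defaults mirror the documented --threshold_accept and
    --threshold_reject values.
    """
    if threshold_reject >= threshold_accept:
        raise ValueError("threshold_reject must be less than threshold_accept")
    if probability is None:
        # No prediction available: Evaluation stays Pending, score shows NA.
        return "Pending", "NA"
    if probability >= threshold_accept:
        evaluation, label = "Accept", "Accept"
    elif probability <= threshold_reject:
        evaluation, label = "Reject", "Reject"
    else:
        # Between the thresholds: Pending in Evaluation, Review in the score.
        evaluation, label = "Pending", "Review"
    return evaluation, f"{label} ({probability:.2f})"


for p in (0.72, 0.48, 0.15, None):
    print(p, evaluate_candidate(p))
```

Both comparisons are inclusive, matching the ``>=`` and ``<=`` wording in the documentation, so a score exactly at a threshold is classified rather than left pending.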