This repository contains the core analysis code for the TPHP project, which supports large-scale, spatially resolved human proteome profiling. The dataset quantifies >13,000 proteins across 2856 samples spanning 58 major tissue types (251 tissue subtypes) and 25 cancer types using DIA-MS.
The study is described in the bioRxiv (2025) preprint.
The code performs per-cancer, per-protein tumor versus paired non-tumor comparisons using linear mixed-effects regression, enabling systematic analysis of oncogenic proteome changes across tissues. Results are exported as RDS objects for downstream statistical analysis, visualization and integration with the public proteome database.
- tumor_dysregulation_analysis.R: main analysis script
- data/tumor_compare_data.parquet: input table (paired tumor/non-tumor)
- output/compare_report_output.rds: main result
- output/package_versions.txt: R and package versions used
The workflow is expected to run on all common desktop and server platforms that support R. It was tested on Windows 11 (x64).
- R (recommended R 4.0+)
- CRAN packages:
  - tidyverse
  - arrow
  - lme4
  - lmerTest
  - plyr

  The exact tested versions are recorded in output/package_versions.txt.
- CPU-only execution (no GPU required)
- Recommended RAM: ≥8 GB (larger datasets may require more)
- No non-standard hardware required
    git clone <REPO_URL>
    cd <REPO_DIR>

Install R for your operating system from CRAN.
Optionally, install the packages manually:

    install.packages(c("tidyverse", "arrow", "lme4", "lmerTest", "plyr"))

This project is script-based; no installation step is required beyond installing the dependencies.
From the repository root:
    Rscript tumor_dysregulation_analysis.R

Default input path: data/tumor_compare_data.parquet
After a successful run, the script writes:
- output/compare_report_output.rds: an R list DEA with:
  - DEA$Diff.report: full per-(cancer, protein) model results
  - DEA$Diff.report.filter: results filtered using Hedges' g >= 0.5 and p_adj_BH < 0.05
- output/package_versions.txt: R version and package versions used.
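As a minimal sketch of how a downstream script might load and inspect the result, the example below round-trips a small mock list through an RDS file. To stay self-contained it does not read the real output/compare_report_output.rds; the mock values and the `cancer_abbr` and `protein` columns are illustrative assumptions, while `g_adj`, `p_adj_BH` and the two list element names follow this README.

```r
# Mock object mimicking the documented structure of DEA (all values are made up).
report <- data.frame(
  cancer_abbr = c("A", "A", "B"),      # hypothetical cancer labels
  protein     = c("P1", "P2", "P1"),   # hypothetical protein IDs
  g_adj       = c(0.80, 0.10, 0.65),
  p_adj_BH    = c(0.010, 0.600, 0.200),
  stringsAsFactors = FALSE
)
DEA <- list(
  Diff.report        = report,
  Diff.report.filter = report[report$g_adj >= 0.5 & report$p_adj_BH < 0.05, ]
)

# Round-trip through an RDS file, as a downstream script would do with
# output/compare_report_output.rds.
rds_path <- tempfile(fileext = ".rds")
saveRDS(DEA, rds_path)
res <- readRDS(rds_path)

names(res)                     # element names of the list
nrow(res$Diff.report)          # all (cancer, protein) rows
nrow(res$Diff.report.filter)   # rows passing both thresholds
```

With the real output, replacing `rds_path` by the path to output/compare_report_output.rds should be the only change needed.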
Runtime scales primarily with the number of model fits, i.e., cancer types × proteins; the cost of each fit also grows with the number of samples in that cancer type.
As a rough guide, on a standard desktop CPU (8–16 threads, 16–32 GB RAM), throughput is ~25 model fits per second, corresponding to < 10 minutes per cancer type under typical settings.
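The arithmetic behind that estimate can be checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes one model fit per protein per cancer type, takes 13,000 proteins as a lower bound, and assumes cancer types are processed sequentially at the quoted rate.

```r
# Back-of-the-envelope runtime estimate from the figures quoted in this README.
n_proteins   <- 13000   # ">13,000 proteins" (lower bound)
fits_per_sec <- 25      # approximate throughput
n_cancers    <- 25      # cancer types in the dataset

sec_per_cancer <- n_proteins / fits_per_sec   # one fit per protein (assumption)
min_per_cancer <- sec_per_cancer / 60
hours_total    <- n_cancers * min_per_cancer / 60

round(min_per_cancer, 1)   # 8.7 minutes per cancer type
round(hours_total, 1)      # 3.6 hours if all cancer types run sequentially
```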
The script expects a Parquet table with metadata columns:
- patient_ID (string)
- sample_type (must include NT for non-tumor and T for tumor)
- cancer_abbr (string)
- cancer_subtype (string; may be constant within a subset)
- Gender (categorical)
- Age (integer-like)
- Dataset (categorical)
All remaining columns are treated as numeric features (proteins).
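Before running the analysis on a custom table, it may help to check that it matches this layout. The sketch below is not part of the analysis script: `check_input` is a hypothetical helper, and the toy data frame (including the protein column names `PROT_A` and `PROT_B`) is made up to stand in for the Parquet table.

```r
# Metadata columns listed above; everything else is treated as a protein column.
meta_cols <- c("patient_ID", "sample_type", "cancer_abbr", "cancer_subtype",
               "Gender", "Age", "Dataset")

# Toy stand-in for the Parquet table (values and protein names are made up).
toy <- data.frame(
  patient_ID     = rep(c("pt1", "pt2"), each = 2),
  sample_type    = rep(c("NT", "T"), times = 2),
  cancer_abbr    = "X",
  cancer_subtype = "X1",
  Gender         = rep(c("F", "M"), each = 2),
  Age            = rep(c(61L, 54L), each = 2),
  Dataset        = "D1",
  PROT_A         = c(10.2, 11.0, 9.8, 10.9),
  PROT_B         = c(7.1, 6.5, 7.4, 6.9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the protein column names or stops with a message.
check_input <- function(df) {
  missing_meta <- setdiff(meta_cols, names(df))
  if (length(missing_meta) > 0) {
    stop("Missing metadata columns: ", paste(missing_meta, collapse = ", "))
  }
  if (!all(c("NT", "T") %in% df$sample_type)) {
    stop("sample_type must include both 'NT' and 'T'")
  }
  protein_cols <- setdiff(names(df), meta_cols)
  non_numeric <- protein_cols[!vapply(df[protein_cols], is.numeric, logical(1))]
  if (length(non_numeric) > 0) {
    stop("Non-numeric protein columns: ", paste(non_numeric, collapse = ", "))
  }
  protein_cols
}

protein_cols <- check_input(toy)
protein_cols
```

For a real table, the data frame would typically come from `arrow::read_parquet("data/tumor_compare_data.parquet")` rather than being built by hand.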
For each cancer_abbr and each protein, the script fits a mixed-effects model:
value ~ 1 + sample_type + (optional covariates) + (1 | patient_ID)
Optional covariates among {cancer_subtype, Gender, Age_c, Dataset} are included only when estimable within the cancer/protein subset.
The reported tumor effect is the fixed-effect coefficient for sample_typeT.
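The model specification above can be illustrated on simulated data for a single cancer type and protein. This is a sketch under stated assumptions, not the script's actual code: the data are simulated, `Age_c` is assumed to be mean-centered `Age`, and "estimable" is simplified to "has at least two distinct values in the subset". The fit itself runs only if lmerTest is installed.

```r
set.seed(1)

# Simulated paired data for one cancer type and one protein (values are made up).
n_pat <- 8
d <- data.frame(
  patient_ID     = factor(rep(paste0("pt", seq_len(n_pat)), each = 2)),
  sample_type    = factor(rep(c("NT", "T"), times = n_pat), levels = c("NT", "T")),
  cancer_subtype = "X1",                              # constant, so dropped below
  Gender         = rep(c("F", "M"), each = n_pat),    # varies, so kept
  Age            = rep(c(48, 52, 55, 60, 63, 66, 70, 74), each = 2),
  Dataset        = "D1",                              # constant, so dropped below
  stringsAsFactors = FALSE
)
d$Age_c <- d$Age - mean(d$Age)   # assumed meaning of Age_c: mean-centered age
patient_shift <- rep(rnorm(n_pat, sd = 0.5), each = 2)   # per-patient baseline
noise <- rnorm(nrow(d), sd = 0.3)
d$value <- 10 + 1.0 * (d$sample_type == "T") + patient_shift + noise

# Simplified estimability rule (assumption): keep covariates with >= 2 distinct values.
candidates <- c("cancer_subtype", "Gender", "Age_c", "Dataset")
varies <- vapply(d[candidates], function(x) length(unique(na.omit(x))) > 1, logical(1))
keep <- candidates[varies]

# Assemble: value ~ 1 + sample_type + (kept covariates) + (1 | patient_ID)
rhs <- c("1", "sample_type", keep, "(1 | patient_ID)")
model_formula <- as.formula(paste("value ~", paste(rhs, collapse = " + ")))
model_formula

# Fit only if lmerTest is installed; it adds df and p values to lme4 fits.
if (requireNamespace("lmerTest", quietly = TRUE)) {
  fit <- lmerTest::lmer(model_formula, data = d)
  print(coef(summary(fit))["sample_typeT", ])   # tumor vs non-tumor fixed effect
}
```

Setting the factor levels to `c("NT", "T")` makes non-tumor the reference level, which is why the tumor coefficient is named `sample_typeT`.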
Each row in DEA$Diff.report corresponds to one (cancer, protein) fit and includes:
- effect, se, t, df (degrees of freedom)
- p, p_adj_BH (p value and Benjamini-Hochberg adjusted p value)
- sigma (residual SD)
- es_adj = effect / sigma (standardized effect)
- g_adj = J(df) * es_adj (small-sample adjusted standardized effect)
- formula (the exact model formula used)
- is_singular, re_var_patient (fit diagnostics)
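How the derived columns relate to one another can be sketched on a few made-up rows. The correction factor used here, J(df) = 1 - 3 / (4 df - 1), is the common approximation for Hedges' g and is an assumption; this README does not state which form of J(df) the script uses.

```r
# Small-sample correction factor (common approximation; assumed, not confirmed).
J <- function(df) 1 - 3 / (4 * df - 1)

# Made-up fit summaries for three hypothetical proteins.
res <- data.frame(
  effect = c(1.20, 0.15, -0.90),
  sigma  = c(0.80, 0.75, 1.50),
  df     = c(20, 20, 12),
  p      = c(0.001, 0.40, 0.03)
)

res$es_adj   <- res$effect / res$sigma          # standardized effect
res$g_adj    <- J(res$df) * res$es_adj          # small-sample adjusted effect
res$p_adj_BH <- p.adjust(res$p, method = "BH")  # Benjamini-Hochberg adjustment

# Filter as documented for DEA$Diff.report.filter.
res_filter <- res[res$g_adj >= 0.5 & res$p_adj_BH < 0.05, ]
res_filter
```

Applied literally, the `g_adj >= 0.5` threshold keeps only positive effects, so the third row (a significant decrease) is dropped here; whether the script compares the absolute value instead is not stated in this README.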