This research project was developed as a team effort during the third term of the Post-Degree Diploma in Data Analytics at Langara College.
This repository contains the implementation of a breast cancer classification and survival analysis pipeline using TCGA-BRCA RNA-seq data. The project explores gene expression patterns, applies differential expression analysis (DEA), performs dimensionality reduction, and benchmarks multiple machine learning models across classification tasks (Normal vs. Tumor, Early vs. Late stage, multi-stage classification, and survival prediction).
The main objectives of this project are:
- To examine differences in gene expression across breast cancer stages.
- To identify key biomarkers for early detection and prognosis.
- To classify tumor vs. normal samples and cancer stages using supervised and unsupervised learning techniques.
- To predict patient survival using Random Survival Forest models.
- Languages & Libraries: Python, scikit-learn (Random Forest, Logistic Regression), XGBoost, SHAP
- Feature Selection & Dimensionality Reduction: DESeq2 (R), PCA, LDA, NCA, PCoA
- Visualization: Matplotlib, Seaborn, 3D plots for PCA/LDA/PCoA
- Data Source: TCGA-BRCA mRNA-seq dataset (The Cancer Genome Atlas)
- Survival Modeling: Random Survival Forest
Data Preparation & Normalization
- Integration of RNA-seq counts and clinical metadata.
- Variance Stabilizing Transformation (VST) / log2 normalization.
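As a rough illustration of this step, the sketch below loads the counts and clinical metadata and applies a simple log2(count + 1) transform on the Python side; the file paths and column names are assumptions, and the VST itself is performed with DESeq2 in R.

```python
import pandas as pd
import numpy as np

# Illustrative file names; the actual files live under Data/
counts = pd.read_csv("Data/brca_counts.txt", sep="\t", index_col=0)      # genes x samples
clinical = pd.read_csv("Data/brca_clinical.txt", sep="\t", index_col=0)  # samples x metadata

# Simple log2(count + 1) normalization (the VST is done with DESeq2 on the R side)
log_counts = np.log2(counts + 1)

# Align the expression matrix (samples x genes) with the clinical metadata
# Column names here are assumptions for illustration only
expr = log_counts.T.join(clinical[["sample_type", "ajcc_pathologic_stage"]], how="inner")
```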
Exploratory Data Analysis (EDA)
- Patient demographics, stage distribution, and survival trends.
- Visualization of gene expression distributions and outliers.
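A minimal example of the stage-distribution plot, assuming the clinical table contains an `ajcc_pathologic_stage` column (both the file path and column name are assumptions):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed clinical metadata file, as in the previous sketch
clinical = pd.read_csv("Data/brca_clinical.txt", sep="\t", index_col=0)

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=clinical, x="ajcc_pathologic_stage", ax=ax)
ax.set_title("Samples per pathologic stage")
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```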
Differential Expression Analysis (DEA)
- DEG detection using DESeq2.
- Visualization: volcano plots, heatmaps, and clustering.
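DESeq2 itself runs in R; the sketch below shows a possible Python-side volcano plot, assuming the DESeq2 results have been exported to a CSV with the standard `log2FoldChange` and `padj` columns (the file path is illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed export of DESeq2 results from the R side
res = pd.read_csv("Output/deseq2_results.csv", index_col=0)
res = res.dropna(subset=["log2FoldChange", "padj"])

# Flag DEGs using common thresholds (padj < 0.05, |log2FC| > 1)
sig = (res["padj"] < 0.05) & (res["log2FoldChange"].abs() > 1)

plt.figure(figsize=(6, 5))
plt.scatter(res["log2FoldChange"], -np.log10(res["padj"]),
            s=5, c=np.where(sig, "red", "grey"))
plt.axhline(-np.log10(0.05), ls="--", lw=0.8)
plt.axvline(1, ls="--", lw=0.8)
plt.axvline(-1, ls="--", lw=0.8)
plt.xlabel("log2 fold change")
plt.ylabel("-log10 adjusted p-value")
plt.title("Volcano plot of DESeq2 results")
plt.show()
```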
Dimensionality Reduction
- PCA, PCoA (unsupervised).
- LDA, NCA (supervised).
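A minimal scikit-learn sketch of the unsupervised and supervised projections, using synthetic stand-in data in place of the DEG-filtered expression matrix (PCoA is omitted here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the DEG-filtered expression matrix and class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # samples x genes
y = rng.integers(0, 2, size=100)  # e.g., 0 = Normal, 1 = Tumor

X_scaled = StandardScaler().fit_transform(X)

# Unsupervised projection
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Supervised projections (LDA is limited to n_classes - 1 components)
n_lda = min(len(np.unique(y)) - 1, 3)
X_lda = LinearDiscriminantAnalysis(n_components=n_lda).fit_transform(X_scaled, y)
X_nca = NeighborhoodComponentsAnalysis(n_components=3, random_state=0).fit_transform(X_scaled, y)
```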
Model Building & Evaluation
- Benchmarked classifiers: Random Forest, XGBoost, Logistic Regression.
- Classification tasks:
- Normal vs. Tumor
- Early vs. Late stage
- Multi-stage (I–IV) classification
- GridSearchCV for hyperparameter tuning.
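As a sketch of the tuning setup, here is a GridSearchCV example with a Random Forest on synthetic stand-in features; the actual parameter grids and feature sets used live in the notebooks under Src/.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the reduced expression features (e.g., PCA/LDA components)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid; the grids in the notebooks may differ
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```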
Survival Analysis
- Random Survival Forest applied to DEG-filtered gene sets.
- Evaluation using C-index and survival curve predictions.
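A minimal sketch of this step, assuming the scikit-survival implementation of Random Survival Forest and synthetic stand-in data; the notebooks use the DEG-filtered genes and a proper train/test split.

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored

# Synthetic stand-ins for DEG-filtered expression features and follow-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 30))                     # samples x genes
time = rng.exponential(scale=1000, size=150)       # days to event / last follow-up
event = rng.integers(0, 2, size=150).astype(bool)  # True = death observed
y = Surv.from_arrays(event=event, time=time)

rsf = RandomSurvivalForest(n_estimators=200, min_samples_leaf=10, random_state=0)
rsf.fit(X, y)

# Concordance index (computed on training data here for brevity)
risk = rsf.predict(X)
cindex = concordance_index_censored(event, time, risk)[0]
print(f"C-index: {cindex:.3f}")

# Predicted survival curves for a few samples (step functions over time)
surv_fns = rsf.predict_survival_function(X[:3])
```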
Key Findings
- DEG filtering significantly improved classification performance compared to full gene sets.
- XGBoost consistently outperformed the other models across classification tasks.
- Supervised dimensionality reduction (LDA, NCA) yielded better class separation than unsupervised methods.
- Survival analysis achieved moderate predictive performance (C-index ≈ 0.57), with survival curves generated for exploratory use.
Model Benchmarking Summary
| Task | Best Model | Accuracy | AUC / C-index | Recall (Positive) | Key Insight |
|---|---|---|---|---|---|
| Normal vs. Tumor | NCA + XGBoost (DEG-filtered) | 99% | High AUC ROC / AUPRC | Strong across all metrics | Robust separation of tumor vs. normal samples |
| Early vs. Late Stage | PCA + Random Forest (DEG-filtered) | 71% | Balanced AUC | 0.58 recall (Late stage) | Prioritized recall to avoid missing Late-stage cases |
| Stage I–IV Classification | LDA + XGBoost (DEG-filtered) | Lower (~50–60%) | Weak AUC | Low recall | Struggled to distinguish intermediate stages; needs additional clinical features |
| Survival Prediction | Random Survival Forest (RSF) | — | C-index 0.571 | — | Modest discriminative ability; survival curves provided for exploratory use |
Repository Structure
├── Data/ # Dataset information (TXT file); see it for details on obtaining the data
├── Src/ # Jupyter notebooks for EDA, DEA, modeling, and survival analysis; R script for downloading the dataset
├── Documentation/ # Final project report and presentation
├── Output/ # Gene information
└── README.md # Project overview (this file)

Developed by:
- Javier Merino
- Meyliani Sanjaya
- Angeli De los Reyes
- Nay Zaw Lin