Welcome to our cutting-edge computational pipeline designed to accelerate Alzheimer's Disease (AD) research. This project integrates advanced bioinformatics and cheminformatics, creating a seamless workflow from raw single-cell RNA sequencing (scRNA-seq) data to predictive Quantitative Structure-Activity Relationship (QSAR) modeling.
Our mission is to democratize access to powerful predictive tools, lowering the barrier to entry for researchers in the neurodegenerative disease space. This repository provides a comprehensive toolkit for data integration, cellular analysis, and machine learning-based bioactivity prediction.
You can access and use the live application at: https://QSARify.com
This pipeline is organized into three core modules, each providing a distinct set of functionalities.
- Data Integration: Merges complex scRNA-seq datasets from multiple public studies (GSE138852, GSE157827, GSE175814, GSE163577) across various brain regions into a unified `Seurat` object.
  - Notably, the original studies encompassed over 500,000 cells, all of which were processed in our analysis. For ease of GitHub upload, a random subset of 25,000 cells (approximately 5,000 from each study) has been provided.
- Quality Control & Normalization: Implements a robust QC pipeline that filters low-quality cells (e.g., mitochondrial content > 10%), normalizes data using `SCTransform`, and removes artifacts with `DoubletFinder`.
- Cell Type Annotation: Automatically annotates cell clusters using the `SingleR` package with established reference datasets.
- Dimensional Reduction & Clustering: Leverages `PCA` for initial dimensionality reduction, `Harmony` for batch effect correction, and `UMAP` for visualization and clustering.
- Differential Abundance: Employs `MiloR` to identify statistically significant changes in cell population abundance between experimental conditions.
- Cell-Cell Communication: Uses `CellChat` to infer and analyze intercellular communication networks, identifying key ligand-receptor interactions and signaling pathways.
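The mitochondrial-content cutoff mentioned above can be illustrated with a small plain-Python sketch. This is not the project's Seurat code: the function names, the toy counts, and the `MT-` gene prefix (the usual convention for human mitochondrial genes) are assumptions made for illustration.

```python
from typing import Dict, List


def percent_mito(counts: Dict[str, int], mito_prefix: str = "MT-") -> float:
    """Return the percentage of a cell's counts that come from mitochondrial genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty cell has no mitochondrial signal to measure
    mito = sum(n for gene, n in counts.items() if gene.startswith(mito_prefix))
    return 100.0 * mito / total


def filter_cells(cells: Dict[str, Dict[str, int]],
                 max_percent_mito: float = 10.0) -> List[str]:
    """Keep barcodes whose mitochondrial percentage does not exceed the cutoff."""
    return [barcode for barcode, counts in cells.items()
            if percent_mito(counts) <= max_percent_mito]


# Toy counts (hypothetical): gene -> UMI count per cell.
cells = {
    "cell_A": {"MT-CO1": 5, "APOE": 60, "GFAP": 35},     # 5% mitochondrial
    "cell_B": {"MT-CO1": 30, "MT-ND1": 10, "APOE": 60},  # 40% mitochondrial
}
kept = filter_cells(cells)
print(kept)
```

Here `cell_B` is dropped because 40% of its counts are mitochondrial, which exceeds the 10% threshold.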
- Target Data Collection: Queries the ChEMBL database using UniProt IDs for key AD-related targets (`MAO-B`, `COX-2`, `VISFATIN`, `BACE1`, `AChE`). It retrieves bioactivity data (e.g., IC50 values) and calculates critical ADME properties (MW, LogP, HBD, HBA, Lipinski's Rule).
- Machine Learning Pipeline:
  - Feature Engineering: Converts SMILES strings into 2048-bit Morgan fingerprints and calculates key physicochemical descriptors.
  - Imbalance Handling: Utilizes the `SMOTETomek` technique to address class imbalance in the bioactivity data.
  - Model Training & Tuning: Trains multiple baseline models (Random Forest, Gradient Boosting, Neural Network) and hyperparameter-tunes the top performer (Random Forest) to achieve an AUC of ~0.82 on the test set.
- Model Evaluation: Generates comprehensive visualizations for model performance, including ROC curves, accuracy plots, and F1-score comparisons.
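The Lipinski check listed among the calculated properties can be sketched from the four descriptors named above (MW, LogP, HBD, HBA). The thresholds are the standard Rule of Five limits. The `Descriptors` class, the function names, the example values, and the choice to tolerate one violation are assumptions for illustration, and the notebook may apply the rule differently.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mw: float    # molecular weight in Da
    logp: float  # octanol-water partition coefficient
    hbd: int     # hydrogen-bond donors
    hba: int     # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five limits are exceeded."""
    return sum([d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10])


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: a compound passes with at most one violation."""
    return lipinski_violations(d) <= max_violations


# Hypothetical descriptor values, not taken from the project's data.
small = Descriptors(mw=320.4, logp=2.1, hbd=2, hba=5)
bulky = Descriptors(mw=712.9, logp=6.3, hbd=6, hba=12)
print(lipinski_violations(small), passes_lipinski(small))
print(lipinski_violations(bulky), passes_lipinski(bulky))
```

Setting `max_violations=0` gives the strict form of the rule.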
- RESTful Endpoints: A `Flask`-based API provides endpoints for health checks (`/health`), single predictions (`/predict`), and batch predictions (`/predict_batch`).
- User-Friendly Interface: An intuitive web UI for interacting with the QSAR model.
  - Single & Batch Predictions: Supports bioactivity prediction for one or multiple compounds via SMILES input or file upload (`.txt`, `.xls`, `.xlsx`).
  - Dynamic Target Selection: Allows users to predict against a specific target or all available targets.
  - History Tracking: A sortable and filterable history tab keeps a record of all prediction tasks.
  - About Page: Contains project details and team information.
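A client call to the `/predict` endpoint might look like the sketch below. The request schema is not documented here, so the JSON field names `smiles` and `target`, the `"all"` default, and the helper name are hypothetical; check `QSAR/app.py` for the actual contract. The sketch only builds the request and does not send it, so it runs without a server.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # local Flask address used elsewhere in this README


def build_predict_request(smiles: str, target: str = "all",
                          base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a JSON POST request for a single prediction."""
    payload = {"smiles": smiles, "target": target}  # hypothetical field names
    return Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("CCO")  # ethanol as a trivial SMILES example
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the server running, the request could be sent with `urllib.request.urlopen(req)` and the JSON response decoded with `json.load`.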
AlzheimerDisease_FromSingleCell/
├── SingleCell/                 # scRNA-seq preprocessing (R)
│   ├── Merge_Data.R
│   └── SingleCell_Main.R
├── MiloR/                      # Differential-abundance analysis (R)
│   └── MiloR_CellAbundance.R
├── CellChat/                   # Cell-cell communication analysis (R)
│   └── CellChat.R
├── QSAR/                       # QSAR modeling & web app (Python)
│   ├── figures/
│   │   └── roc_curves_comparison.png
│   ├── Data/
│   │   ├── chembl_results_P_27338_MAO-B_IC50_classified.csv
│   │   ├── chembl_results_P_35354_COX2_IC50_classified.csv
│   │   ├── chembl_results_P_43490_VISFATIN_IC50_classified.csv
│   │   ├── chembl_results_P_56817_BACE1_IC50_classified.csv
│   │   └── chembl_results_Q_04844_ACHE_IC50_classified.csv
│   ├── Model/
│   │   └── final_tuned_model.pkl
│   ├── templates/
│   │   └── index.html
│   ├── Target_Collection.ipynb
│   ├── Ligand_Final.ipynb
│   ├── app.py
│   └── requirements.txt
├── Data/                       # A large-scale analysis of over 500,000 cells was performed. A 25,000-cell subset (5,000 from each study) is provided on GitHub for convenience.
│   └── 25K_Sample.rds
└── README.md
| Script | Purpose | Key Libraries | Output |
|---|---|---|---|
| `Merge_Data.R` | Integrates raw scRNA-seq count matrices from multiple GSE studies. | `Seurat`, `batchelor`, `SingleCellExperiment` | A unified Seurat object containing all datasets. |
| `SingleCell_Main.R` | Performs QC, normalization, clustering, and cell type annotation. | `Seurat`, `harmony`, `DoubletFinder`, `SingleR` | A processed Seurat object with UMAPs and cell annotations. |
| `MiloR_CellAbundance.R` | Conducts differential abundance testing on cell neighborhoods. | `miloR`, `SingleCellExperiment`, `ggplot2` | Differential abundance statistics and visualizations. |
| `CellChat.R` | Infers and analyzes cell-cell communication pathways. | `CellChat`, `Seurat`, `dplyr` | Communication network data and plots (bubble plots, heatmaps). |
| `Target_Collection.ipynb` | Retrieves and preprocesses bioactivity & ADME data from ChEMBL. | `pandas`, `chembl_webresource_client`, `rdkit` | A cleaned DataFrame and exploratory data visualizations. |
| `Ligand_Final.ipynb` | Trains, tunes, and evaluates the QSAR machine learning model. | `scikit-learn`, `imbalanced-learn`, `rdkit`, `pandas` | A serialized model (`.pkl`) and performance plots. |
| `app.py` | Serves a Flask-based web API for on-demand bioactivity predictions. | `flask`, `flask-cors`, `joblib`, `rdkit` | JSON responses with predictions and confidence scores. |
| `index.html` | Provides the interactive front-end UI for the QSAR prediction tool. | HTML, CSS, JavaScript | An interactive web interface rendered in the browser. |
To set up the project environment, please follow these steps.
(Required for SingleCell, MiloR, and CellChat analysis)

    # Install core packages from CRAN
    install.packages(c("Seurat", "dplyr", "ggplot2", "patchwork", "harmony"))
    # Install Bioconductor packages (scater, scran, batchelor, and SingleR are hosted on Bioconductor, not CRAN)
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(c("SingleCellExperiment", "scater", "scran", "batchelor", "SingleR", "miloR", "glmGamPoi"))
    # DoubletFinder and CellChat are distributed through their GitHub repositories; follow their own install instructions

(Required for QSAR modeling and the Flask API)

    # Clone the repository
    git clone https://github.com/xhammady/AD-scRNA2QSAR.git
    cd AD-scRNA2QSAR
    # Install Python packages from requirements.txt
    pip install -r QSAR/requirements.txt

Note: Key Python libraries include `chembl-webresource-client`, `rdkit`, `scikit-learn`, `imbalanced-learn`, `pandas`, `flask`, and `flask-cors`.
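After installation, a quick check that the key Python libraries are importable can save debugging time later. The sketch below maps each pip package name to its import name and reports which ones are missing. The helper name `missing_packages` is an illustration and not part of the repository.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip package name -> module name used in `import` statements
REQUIRED: Dict[str, str] = {
    "chembl-webresource-client": "chembl_webresource_client",
    "rdkit": "rdkit",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "pandas": "pandas",
    "flask": "flask",
    "flask-cors": "flask_cors",
}


def missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All key Python libraries are installed.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy libraries.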
Follow this sequence to run the full analysis pipeline.
- 𧬠Merge Data: Execute
SingleCell/Merge_Data.Rto combine the raw count matrices. - π¬ Preprocess & Cluster: Run
SingleCell/SingleCell_Main.Rto perform QC, normalization, integration, and annotation. - π Analyze Differential Abundance: Use
MiloR/MiloR_CellAbundance.Rto compare cell populations. - π‘ Infer Communication: Run
CellChat/CellChat.Rto analyze signaling pathways.
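The four R steps above can be chained from a small driver script. The sketch below assumes that `Rscript` is on the `PATH`, that it is launched from the repository root, and that each script runs non-interactively, none of which is confirmed here. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List

# Script paths in the order given above.
R_STEPS: List[str] = [
    "SingleCell/Merge_Data.R",
    "SingleCell/SingleCell_Main.R",
    "MiloR/MiloR_CellAbundance.R",
    "CellChat/CellChat.R",
]


def build_commands(steps: List[str]) -> List[List[str]]:
    """Turn each script path into an `Rscript <path>` command."""
    return [["Rscript", path] for path in steps]


def run_pipeline(steps: List[str], dry_run: bool = True) -> None:
    """Print each command; when dry_run is False, also execute it in order."""
    for cmd in build_commands(steps):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failing step
            subprocess.run(cmd, check=True)


run_pipeline(R_STEPS)  # dry run: prints the four commands without executing them
```

Calling `run_pipeline(R_STEPS, dry_run=False)` would execute the scripts sequentially.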
- Collect Target Data: Open and run the `QSAR/Target_Collection.ipynb` notebook to query ChEMBL and generate the analysis dataset.
- Train ML Model: Open and run `QSAR/Ligand_Final.ipynb` to preprocess features, train the Random Forest model, and save the final `.pkl` file.
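The trained model's test-set performance is summarized in this README by ROC AUC. As a reminder of what that number means, the sketch below computes AUC from scratch as the probability that a randomly chosen active compound scores higher than a randomly chosen inactive one, with ties counted as half. The labels and scores are toy values, and the notebook presumably relies on scikit-learn's metrics rather than this function.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels (1 = active, 0 = inactive) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))  # 8 of 9 pairs are ordered correctly
```

The quadratic pair loop is fine for illustration; library implementations use a sort-based equivalent.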
- Start the Server: From the command line, run the Flask application:

      python QSAR/app.py

- Access the UI: Open your web browser and navigate to http://localhost:5000/ or visit the live application at https://QSARify.com. You can now:
  - Enter a SMILES string for a single compound prediction.
  - Upload a file (`.txt`, `.xls`, `.xlsx`) for batch predictions.
  - View and manage results in the History tab.
We welcome contributions to improve this project! Please fork the repository, create a new branch for your feature, and submit a pull request with a detailed description of your changes. Ensure you follow existing coding standards and include tests where applicable.
This project is licensed under the MIT License. See the LICENSE file for more details.
- Amr Mohamed ElHefnawy
- Ibrahim Abdelkarim Hammad
- Mira Moheb Attia
- Abdelrahman Wagih
- Reem Sharaf EL-Deen Hassan
- Lorance Gergis Labeeb
- Prof. Dr. Sameh Ibrahim Hassanien
- 𧬠Software & Libraries:
Seurat,CellChat,MiloR,RDKit,scikit-learn,imbalanced-learn,Flask,DoubletFinder,SingleR,pandas,numpy,matplotlib. - π§ͺ Databases: We gratefully acknowledge the ChEMBL database for providing essential bioactivity data.
- π Data Providers: This work would not be possible without the public datasets provided by Gene Expression Omnibus (GEO): GSE138852, GSE157827, GSE175814, GSE163577.
- For questions or issues, please open an issue on GitHub or contact the project maintainers by email.



