This project applies machine learning techniques to predict lung cancer presence based on patient health data. The goal is to build an interpretable, high-performing model that supports early detection and contributes to data-driven healthcare solutions.
- Domain: Medical diagnostics
- Modeling: Classification using supervised ML algorithms
- Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
- Accuracy Achieved: 91% (on test set)
| File / Folder | Description |
|---|---|
| `.gitignore` | Specifies intentionally untracked files to ignore |
| `LICENSE` | MIT license for open-source use |
| `README.md` | Project overview, instructions, and documentation |
| `requirements.txt` | Python dependencies for reproducibility |
| `Lung Cancer Dataset.csv` | Dataset containing patient health records and cancer diagnosis |
| `lung-cancer-prediction.ipynb` | Jupyter notebook with EDA, modeling, and evaluation |
| `thumbnail.png` | Visual preview for GitHub portfolio presentation |
The dataset used in this project was sourced from Kaggle:
Lung Cancer Dataset – Kaggle
It contains anonymized patient health records including age, smoking habits, and symptoms, used to train and evaluate machine learning models for lung cancer prediction.
All rights and usage terms belong to the original dataset creator. This project is for educational and research purposes only.
- Data Cleaning: Handled missing values and outliers with domain-aware strategies
- Feature Engineering: Converted categorical variables and scaled numerical features
- Model Comparison: Evaluated Logistic Regression, Random Forest, SVM, and KNN
- Performance Metrics: Accuracy, precision, recall, F1-score, and confusion matrix
- Visualization: Used Seaborn and Matplotlib for EDA and model insights
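The workflow listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses a small synthetic DataFrame with hypothetical column names (`age`, `gender`, `smoking`) rather than the Kaggle schema, and the hyperparameters are arbitrary defaults. It shows one common way to encode categorical columns, scale numeric ones, and compare the four model families on a held-out split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; column names are hypothetical, not the Kaggle schema.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(30, 85, size=n),
    "gender": rng.choice(["M", "F"], size=n),
    "smoking": rng.choice(["yes", "no"], size=n),
})
# Toy label loosely tied to age and smoking so the models have signal to learn.
risk = 0.04 * (df["age"] - 55) + 1.5 * (df["smoking"] == "yes") + rng.normal(0, 1, n)
y = (risk > 0.75).astype(int)

# Scale the numeric column and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "smoking"]),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# Fit each model inside the same preprocessing pipeline and score it on the test split.
results = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

Wrapping preprocessing and model in one `Pipeline` keeps the scaler and encoder fitted on training data only, which avoids leaking test-set statistics into the comparison.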
This project includes exploratory data analysis and model evaluation using Matplotlib and Seaborn. Key visualizations include:
- Missing Value Heatmap: Visualized null entries across features to guide data cleaning
- Feature Distributions: Plotted histograms and boxplots to understand variable spread and detect outliers
- Correlation Heatmap: Identified relationships between numerical features to inform feature selection
- Class Balance Plot: Displayed distribution of cancer vs. non-cancer cases
- Confusion Matrix: Evaluated classification performance
- ROC Curve: Assessed model discrimination ability
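As a rough illustration of the numbers behind the last two plots, the sketch below computes a confusion matrix and ROC curve points with scikit-learn. It assumes a synthetic dataset from `make_classification` and a Random Forest with arbitrary settings, so the values it prints say nothing about the project's real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the patient records.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# ROC curve: needs a continuous score, here the predicted probability of class 1.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC = {auc:.3f} over {len(thresholds)} thresholds")
```

For the plots themselves, `cm` can be passed to Seaborn's `heatmap`, and `fpr` against `tpr` can be drawn with Matplotlib's `plot`.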
git clone https://github.com/ArianJr/lung-cancer-prediction-ml.git
cd lung-cancer-prediction-ml
pip install -r requirements.txt

Open lung-cancer-prediction.ipynb in Jupyter Notebook or VS Code and run the cells sequentially.
- Best Model: Random Forest
- Test Accuracy: 91%
- Insights: Smoking, age, and chronic diseases were key predictors
- Add cross-validation and hyperparameter tuning
- Explore deep learning models (e.g., CNNs for imaging data)
- Integrate SHAP or LIME for model interpretability
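The first item in the list above could start from something like the following sketch. It is only one possible approach: the data is synthetic, and the parameter grid (`n_estimators`, `max_depth`) is an illustrative assumption rather than a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data standing in for the real feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated accuracy for an untuned baseline model.
baseline_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="accuracy",
)
print(f"Baseline CV accuracy: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")

# Small, illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X, y)
print(f"Best params: {search.best_params_}, CV accuracy: {search.best_score_:.3f}")
```

Reporting the mean and spread across folds gives a more stable estimate than a single test-set accuracy, which is useful context for a figure like the 91% above.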
This project is for educational and research purposes only. Predictive models in healthcare must be validated by medical professionals and regulatory bodies before deployment. All data used is anonymized and publicly available.
This project was made possible thanks to publicly available datasets and the contributions of the open-source Python ecosystem. Special thanks to the creators of the Lung Cancer Dataset on Kaggle, and to the developers and maintainers of libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn, which enabled efficient data analysis and model development.
The project is inspired by real-world applications of machine learning in healthcare and is intended for educational and research purposes.
This project is licensed under the MIT License. You are free to use, modify, and distribute this code with proper attribution.
Arian J.
Computer Engineering Student | Machine Learning & AI Explorer
📫 Email: arianjafar59@gmail.com
🔗 GitHub: ArianJr
If you found this project helpful or insightful, feel free to give it a ⭐ on GitHub. Thanks for checking it out!