Predicting lung cancer using patient health data and machine learning classifiers. Includes data cleaning, EDA, and model evaluation.

ArianJr/lung-cancer-prediction-ml

Lung Cancer Prediction Using Machine Learning

🔍 Project Overview

This project applies machine learning techniques to predict lung cancer presence based on patient health data. The goal is to build an interpretable, high-performing model that supports early detection and contributes to data-driven healthcare solutions.

  • Domain: Medical diagnostics
  • Modeling: Classification using supervised ML algorithms
  • Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
  • Accuracy Achieved: 91% on the test set (Random Forest)

📁 Repository Structure

File / Folder | Description
.gitignore | Specifies intentionally untracked files to ignore
LICENSE | MIT license for open-source use
README.md | Project overview, instructions, and documentation
requirements.txt | Python dependencies for reproducibility
Lung Cancer Dataset.csv | Dataset containing patient health records and cancer diagnosis
lung-cancer-prediction.ipynb | Jupyter notebook with EDA, modeling, and evaluation
thumbnail.png | Visual preview for GitHub portfolio presentation

📂 Dataset

The dataset used in this project was sourced from Kaggle:

Lung Cancer Dataset – Kaggle
It contains anonymized patient health records including age, smoking habits, and symptoms, used to train and evaluate machine learning models for lung cancer prediction.

All rights and usage terms belong to the original dataset creator. This project is for educational and research purposes only.


🧠 Key Features

  • Data Cleaning: Handled missing values and outliers with domain-aware strategies
  • Feature Engineering: Converted categorical variables and scaled numerical features
  • Model Comparison: Evaluated Logistic Regression, Random Forest, SVM, and KNN
  • Performance Metrics: Accuracy, precision, recall, F1-score, and confusion matrix
  • Visualization: Used Seaborn and Matplotlib for EDA and model insights
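
The cleaning, encoding, scaling, and model-comparison steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the column names (`age`, `gender`, `smoking`, `lung_cancer`) are hypothetical stand-ins rather than the Kaggle schema, and median imputation is an assumed cleaning strategy.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; column names are hypothetical, not the real dataset's.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(30, 85, size=n).astype(float),
    "gender": rng.choice(["M", "F"], size=n),
    "smoking": rng.choice(["yes", "no"], size=n),
})
# Label loosely tied to age and smoking so the models have some signal to learn.
risk = 0.04 * (df["age"] - 55) + 1.5 * (df["smoking"] == "yes") + rng.normal(0, 1, n)
df["lung_cancer"] = (risk > 0.75).astype(int)
# Knock out a few ages to mimic missing values that need cleaning.
df.loc[rng.choice(n, size=20, replace=False), "age"] = np.nan

X = df.drop(columns="lung_cancer")
y = df["lung_cancer"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Impute and scale the numeric column; one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "smoking"]),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model behind the same preprocessing and score it on the held-out split.
results = {}
for name, model in models.items():
    pipe = Pipeline([("preprocess", clone(preprocess)), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

Wrapping preprocessing and model in one `Pipeline` means the imputer and scaler are fitted on the training split only, which avoids leaking test-set statistics into the comparison.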

📊 Visualizations

This project includes exploratory data analysis and model evaluation using Matplotlib and Seaborn. Key visualizations include:

  • Missing Value Heatmap: Visualized null entries across features to guide data cleaning
  • Feature Distributions: Plotted histograms and boxplots to understand variable spread and detect outliers
  • Correlation Heatmap: Identified relationships between numerical features to inform feature selection
  • Class Balance Plot: Displayed distribution of cancer vs. non-cancer cases
  • Confusion Matrix: Evaluated classification performance
  • ROC Curve: Assessed model discrimination ability
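
As a small illustration of what the confusion matrix and ROC curve summarize, the sketch below computes both with scikit-learn on ten hand-made labels and scores. The numbers are invented for the example and are not results from this project; the 0.5 decision threshold is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
    roc_curve,
)

# Toy ground truth and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9, 0.65, 0.55])

# Turn scores into hard predictions at an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# The ROC curve sweeps the threshold; AUC summarizes it in a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

In a screening context, recall (how many true cancer cases are caught) often matters more than raw accuracy, which is why it is worth reading the matrix cell by cell. For plotting, `RocCurveDisplay.from_predictions(y_true, y_score)` in recent scikit-learn versions draws the same curve with Matplotlib.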

🚀 Getting Started

Installation

git clone https://github.com/ArianJr/lung-cancer-prediction-ml.git
cd lung-cancer-prediction-ml
pip install -r requirements.txt

Run the Notebook

Open lung-cancer-prediction.ipynb in Jupyter Notebook or VS Code and run the cells sequentially.


📊 Results Summary

  • Best Model: Random Forest
  • Test Accuracy: 91%
  • Insights: Smoking, age, and chronic diseases were key predictors

📌 Future Improvements

  • Add cross-validation and hyperparameter tuning
  • Explore deep learning models (e.g., CNNs, if imaging data is added; the current dataset is tabular)
  • Integrate SHAP or LIME for model interpretability
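
The first item above, cross-validation with hyperparameter tuning, could look something like the following. It is a sketch under stated assumptions: the data comes from `make_classification` rather than the project's CSV, and the parameter grid and F1 scoring are arbitrary illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary classification data standing in for the real dataset.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Small, illustrative grid; a real search would likely cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="f1",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, f"mean CV F1={best_score:.3f}")
```

Reporting the mean cross-validated score alongside a held-out test score gives a better sense of variance than a single train/test split.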

⚖️ Ethical Considerations

This project is for educational and research purposes only. Predictive models in healthcare must be validated by medical professionals and regulatory bodies before deployment. All data used is anonymized and publicly available.


🤝 Acknowledgments

This project was made possible thanks to publicly available datasets and the contributions of the open-source Python ecosystem. Special thanks to the creators of the Lung Cancer Dataset on Kaggle, and to the developers and maintainers of libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn, which enabled efficient data analysis and model development.

The project is inspired by real-world applications of machine learning in healthcare and is intended for educational and research purposes.


📄 License

This project is licensed under the MIT License. You are free to use, modify, and distribute this code, provided the copyright notice and license text are retained.


👤 Author

Arian J.
Computer Engineering Student | Machine Learning & AI Explorer
📫 Email: arianjafar59@gmail.com
🔗 GitHub: ArianJr


⭐ Support

If you found this project helpful or insightful, feel free to give it a ⭐ on GitHub. Thanks for checking it out!
