Lung Cancer Prediction Using Machine Learning

🔍 Project Overview

This project applies machine learning techniques to predict lung cancer presence based on patient health data. The goal is to build an interpretable, high-performing model that supports early detection and contributes to data-driven healthcare solutions.

Domain: Medical diagnostics
Modeling: Classification using supervised ML algorithms
Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
Accuracy Achieved: 91% (on test set)

📁 Repository Structure

File / Folder	Description
`.gitignore`	Specifies intentionally untracked files to ignore
`LICENSE`	MIT license for open-source use
`README.md`	Project overview, instructions, and documentation
`requirements.txt`	Python dependencies for reproducibility
`Lung Cancer Dataset.csv`	Dataset containing patient health records and cancer diagnosis
`lung_cancer_prediction_ml.ipynb`	Jupyter notebook with EDA, modeling, and evaluation
`thumbnail.png`	Visual preview for GitHub portfolio presentation

📂 Dataset

The dataset used in this project is the Lung Cancer Dataset from Kaggle.

It contains anonymized patient health records, including age, smoking habits, and symptoms, and is used here to train and evaluate machine learning models for lung cancer prediction.

All rights and usage terms belong to the original dataset creator. This project is for educational and research purposes only.

🧠 Key Features

Data Cleaning: Handled missing values and outliers with domain-aware strategies
Feature Engineering: Encoded categorical variables numerically and scaled numerical features
Model Comparison: Evaluated Logistic Regression, Random Forest, SVM, and KNN
Performance Metrics: Accuracy, precision, recall, F1-score, and confusion matrix
Visualization: Used Seaborn and Matplotlib for EDA and model insights
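
The model comparison step listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `compare_models` helper, the 80/20 split, the hyperparameters, and the synthetic data from `make_classification` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(X, y, seed=42):
    """Fit four classifiers on one stratified split; return test metrics per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Hypothetical settings; the notebook may use different ones.
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    results = {}
    for name, model in models.items():
        # Scaling sits inside the pipeline so it is fit on training data only.
        pipeline = make_pipeline(StandardScaler(), model)
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
        }
    return results


# Synthetic stand-in data (assumption): NOT the Kaggle dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
results = compare_models(X, y)
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, f1={scores['f1']:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which matters most for the distance-based models (SVM and KNN).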

📊 Visualizations

This project includes exploratory data analysis and model evaluation using Matplotlib and Seaborn. Key visualizations include:

Missing Value Heatmap: Visualized null entries across features to guide data cleaning
Feature Distributions: Plotted histograms and boxplots to understand variable spread and detect outliers
Correlation Heatmap: Identified relationships between numerical features to inform feature selection
Class Balance Plot: Displayed distribution of cancer vs. non-cancer cases
Confusion Matrix: Evaluated classification performance
ROC Curve: Assessed model discrimination ability
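
To make the link between the confusion matrix and the reported metrics concrete, here is a small dependency-free sketch. The labels are made up for illustration, and the helper names `confusion_counts` and `classification_metrics` are hypothetical, not taken from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels (assumption): 1 = cancer, 0 = no cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, fp, fn, tn)
print(tp, fp, fn, tn)
print(metrics)
```

In a screening context, recall is usually the metric to watch, since a false negative (a missed cancer case) is costlier than a false positive.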

🚀 Getting Started

Installation

git clone https://github.com/ArianJr/lung-cancer-prediction-ml.git
cd lung-cancer-prediction-ml
pip install -r requirements.txt

Run the Notebook

Open `lung_cancer_prediction_ml.ipynb` in Jupyter Notebook or VS Code and run the cells in order.

📊 Results Summary

Best Model: Random Forest
Test Accuracy: 91%
Insights: Smoking, age, and chronic diseases were key predictors

📌 Future Improvements

Add cross-validation and hyperparameter tuning
Explore deep learning models (e.g., CNNs for imaging data)
Integrate SHAP or LIME for model interpretability
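
As a starting point for the first item, cross-validation and hyperparameter tuning could look roughly like the sketch below. It assumes scikit-learn's `GridSearchCV` with a stratified 5-fold split; the parameter grid, the F1 scoring choice, and the synthetic data are illustrative assumptions, not results or settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption): NOT the Kaggle dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical, deliberately small grid to keep the example fast.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="f1",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated F1:", round(search.best_score_, 3))
```

The cross-validated score is averaged over five held-out folds, so it is a steadier estimate than accuracy from a single train/test split.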

⚖️ Ethical Considerations

This project is for educational and research purposes only. Predictive models in healthcare must be validated by medical professionals and regulatory bodies before deployment. All data used is anonymized and publicly available.

🤝 Acknowledgments

This project was made possible thanks to publicly available datasets and the contributions of the open-source Python ecosystem. Special thanks to the creators of the Lung Cancer Dataset on Kaggle, and to the developers and maintainers of libraries such as Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn, which enabled efficient data analysis and model development.

The project is inspired by real-world applications of machine learning in healthcare.

📄 License

This project is licensed under the MIT License. You are free to use, modify, and distribute this code with proper attribution.

👤 Author

Arian J.
Computer Engineering Student | Machine Learning & AI Explorer
📫 Email: arianjafar59@gmail.com
🔗 GitHub: ArianJr

⭐ Support

If you found this project helpful or insightful, feel free to give it a ⭐ on GitHub. Thanks for checking it out!
