This project builds a machine learning model that predicts the likelihood that a person will experience a stroke, based on their health and lifestyle. Such predictions can support early intervention, preventive care, and better resource allocation in healthcare.
If you're using Anaconda or Miniconda, follow these steps to create and activate a virtual environment:
You can name it stroke_env (or choose your own name):
conda create --name stroke_env python=3.10
conda activate stroke_env
pip install jupyter pandas numpy matplotlib seaborn scikit-learn xgboost
or
conda install jupyter pandas numpy matplotlib seaborn scikit-learn xgboost
The dataset contains information such as:
- Age
- Gender
- Hypertension
- Heart Disease
- Marital Status
- Work Type
- Residence Type
- Average Glucose Level
- BMI
- Smoking Status
- Stroke (Target)
The data is cleaned, preprocessed, and encoded for training and evaluation.
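As a rough illustration of the cleaning and encoding step, the sketch below fills missing BMI values with the median and one-hot encodes the text columns. The sample rows are made up, and the column names (`bmi`, `gender`, `smoking_status`, `stroke`, and so on) are assumptions about the CSV schema. The notebook may use a different imputation or encoding strategy.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; column names are assumptions, not the confirmed CSV schema.
raw = pd.DataFrame({
    "age": [67, 45, 80, 33, 59],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "hypertension": [0, 0, 1, 0, 1],
    "avg_glucose_level": [228.7, 95.1, 171.2, 88.0, 140.3],
    "bmi": [36.6, np.nan, 32.5, 24.1, np.nan],
    "smoking_status": ["formerly smoked", "never smoked", "smokes",
                       "Unknown", "never smoked"],
    "stroke": [1, 0, 1, 0, 0],
})

# Hypothetical list of text columns to encode.
CATEGORICAL = ["gender", "smoking_status"]


def clean_and_encode(df):
    """Impute missing BMI with the median, then one-hot encode text columns."""
    out = df.copy()
    out["bmi"] = out["bmi"].fillna(out["bmi"].median())
    # drop_first=True drops one level per column to avoid redundant indicators.
    return pd.get_dummies(out, columns=CATEGORICAL, drop_first=True, dtype=int)


clean = clean_and_encode(raw)
X = clean.drop(columns="stroke")  # features
y = clean["stroke"]               # target
```

Median imputation is only one option. With the real data, the imputer should be fitted on the training split alone so that test rows do not influence the fill value.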
- Data Cleaning
  Handling missing values, encoding categorical variables.
- Exploratory Data Analysis (EDA)
  Correlation analysis and data visualization to understand key predictors of stroke.
- Model Training
  - Logistic Regression
  - Random Forest
  - XGBoost
  - SVM
- Evaluation Metrics
  - Accuracy, Precision, Recall, F1-score
  - Confusion Matrix
- Model Optimization
  Hyperparameter and threshold tuning.
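The training, evaluation, and threshold-tuning steps listed above can be sketched as follows. This is a minimal example under stated assumptions, not the notebook's actual code. It uses synthetic imbalanced data from `make_classification` in place of the stroke dataset and leaves out XGBoost so that it depends only on scikit-learn. The threshold grid and the choice of F1 as the tuning objective are also assumptions, and no hyperparameter search is shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with roughly 10% positives; NOT the stroke dataset.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=42)

# Hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
}


def evaluate(y_true, y_pred):
    """Return the metrics named in this README for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


# Fit every model and score it on the test set at the default decision rule.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = evaluate(y_test, model.predict(X_test))


def best_threshold(y_true, proba, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the probability cutoff from the grid that maximizes F1."""
    scores = [f1_score(y_true, (proba >= t).astype(int), zero_division=0)
              for t in grid]
    return float(grid[int(np.argmax(scores))])


# Tune the cutoff on validation data, then report once on the test set.
log_reg = models["Logistic Regression"]
threshold = best_threshold(y_val, log_reg.predict_proba(X_val)[:, 1])
tuned_pred = (log_reg.predict_proba(X_test)[:, 1] >= threshold).astype(int)
tuned = evaluate(y_test, tuned_pred)
```

Choosing the cutoff on a separate validation split keeps the test metrics honest. For a screening task like stroke prediction, recall may matter more than F1, in which case the objective inside `best_threshold` could be swapped accordingly.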
- ML_Python_exam.ipynb – Main notebook with code and results
- stroke_data.csv – Input dataset
- README.md – Project overview