This project was developed for participation in the Kaggle Playground Series - Season 5, Episode 7 machine learning competition.
The main objective is to predict whether an individual is an Introvert or Extrovert based on their behavioral, social, and activity-related traits.
The dataset contains survey-based features reflecting user behavior, including:
- Time_spent_Alone
- Stage_fear
- Social_event_attendance
- Going_outside
- Drained_after_socializing
- Friends_circle_size
- Post_frequency
- Personality (Target: Extrovert or Introvert)
The project mainly consists of:
- main.py: A Python file containing all stages from data loading to submission, including EDA, feature engineering, model selection, and prediction.
- preprocessing.py: A reusable module that includes the data transformation pipeline for both train and test datasets. It ensures consistency between model training and final predictions.
The pipeline includes the following steps:
- Data Loading: Train and test datasets are loaded from Kaggle inputs.
- Exploratory Data Analysis (EDA): Summary statistics, missing value inspection, and visualization of feature distributions.
- Missing Value Imputation:
  - KNNImputer is used for numeric columns.
  - SimpleImputer with mode strategy is used for categorical variables.
- Outlier Handling:
  - Outliers are capped using the IQR method (adjusted thresholds).
- Feature Engineering:
  - New features such as `NEW_Alone_Level`, `NEW_Social_Score`, and categorical bins were created to better capture personality traits.
- Encoding:
  - Binary features were label encoded.
  - Multiclass categorical variables were one-hot encoded using `sklearn.preprocessing.OneHotEncoder` with `handle_unknown='ignore'`.
- Model Comparison:
  - Several classifiers (CatBoost, XGBoost, LightGBM, SVC, RandomForest, etc.) were compared.
  - CatBoost (GPU) performed the best on the validation set.
- Hyperparameter Optimization:
  - Optuna was used to tune CatBoost hyperparameters with cross-validation, using `f1_macro` as the objective.
- Final Prediction and Submission:
  - The final model was trained on the full training dataset using the best parameters.
  - Predictions were made on the test dataset and saved as `submission.csv`.
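The imputation, outlier capping, and encoding steps listed above can be sketched roughly as follows. This is a minimal illustration on a made-up six-row frame, not the contents of preprocessing.py: the helper `cap_outliers_iqr`, the 1.5 IQR factor, the 0.25/0.75 quantiles, the `n_neighbors=2` setting, the bin edges, and the `Alone_bin` column are all assumptions chosen for the example (the project itself mentions "adjusted thresholds" without stating them).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder


def cap_outliers_iqr(df, cols, factor=1.5):
    """Clip each column to [Q1 - factor*IQR, Q3 + factor*IQR] (hypothetical helper)."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - factor * iqr, q3 + factor * iqr)
    return out


# Made-up rows that reuse three of the dataset's column names.
toy = pd.DataFrame({
    "Time_spent_Alone": [1.0, 2.0, np.nan, 3.0, 2.0, 40.0],
    "Friends_circle_size": [10.0, 8.0, 9.0, np.nan, 7.0, 2.0],
    "Stage_fear": ["No", "Yes", np.nan, "No", "No", "Yes"],
})
num_cols = ["Time_spent_Alone", "Friends_circle_size"]

# 1) Imputation: KNN for numeric columns, most frequent value for categoricals.
toy[num_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[num_cols])
mode_imputer = SimpleImputer(strategy="most_frequent")
toy["Stage_fear"] = mode_imputer.fit_transform(toy[["Stage_fear"]]).ravel()

# 2) Outlier handling: cap extreme values instead of dropping rows.
toy = cap_outliers_iqr(toy, num_cols)

# 3) Feature engineering: a hypothetical binned feature (bin edges are assumptions).
toy["Alone_bin"] = pd.cut(
    toy["Time_spent_Alone"],
    bins=[-np.inf, 1.75, 3.5, np.inf],
    labels=["low", "mid", "high"],
).astype(str)

# 4) Encoding: map the binary column to 0/1, one-hot encode the multiclass one.
toy["Stage_fear"] = toy["Stage_fear"].map({"No": 0, "Yes": 1})
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(toy[["Alone_bin"]]).toarray()

# A category never seen during fit is encoded as an all-zero row, not an error.
unseen = encoder.transform(pd.DataFrame({"Alone_bin": ["very_high"]})).toarray()
```

Fitting the imputers and the encoder on the training data and reusing the fitted objects on the test data is what keeps the two transformations consistent; `handle_unknown='ignore'` covers the case where the test set contains a category that the training set did not.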
To run this project on your local machine or cloud:
- Clone the Repository:

      git clone https://github.com/BahriDogru/Personality_Type_Classification.git
      cd predicting_personality_type

- Install Dependencies: Using the provided environment.yaml file:

      conda env create -f environment.yaml
      conda activate personality_prediction_env

- Prepare the Dataset: Download `train.csv` and `test.csv` from the competition page and place them in a `dataset/` folder:

      .
      ├── dataset/
      │   ├── train.csv
      │   └── test.csv
      ├── main.ipynb
      ├── environment.yaml
      ├── .gitignore
      ├── preprocessing.py
      └── README.md

- Run the Script:

      python main.py
After model comparison and tuning, the CatBoostClassifier (GPU) model provided the best results.
Private Leaderboard Score: 0.974089 (F1 Score)
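For readers who want to reproduce the model comparison idea without GPU libraries, the following is a hedged sketch of comparing several classifiers by cross-validated macro F1. It is not the project's comparison code: the data is synthetic (`make_classification`), the candidates are scikit-learn stand-ins rather than CatBoost, XGBoost, or LightGBM, and the 5-fold stratified split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data (not the competition data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Stand-in candidates; any estimator with fit/predict could be added here.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# The same folds are used for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
    for name, model in candidates.items()
}

best_name = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean f1_macro = {score:.4f}")
print("Best candidate:", best_name)
```

The mean `f1_macro` computed this way is also the kind of value a tuning objective can return, so the same scoring setup can be reused when searching hyperparameters for the selected model.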