Personality Type Prediction Project (Kaggle Playground Series - S5E7)

Project Overview

This project was developed for participation in the Kaggle Playground Series - Season 5, Episode 7 machine learning competition.
The main objective is to predict whether an individual is an Introvert or Extrovert based on their behavioral, social, and activity-related traits.

Dataset

The dataset contains survey-based features reflecting user behavior, including:

Time_spent_Alone
Stage_fear
Social_event_attendance
Going_outside
Drained_after_socializing
Friends_circle_size
Post_frequency
Personality (Target: Extrovert or Introvert)

Project Structure

The project mainly consists of:

main.py: A Python script that covers every stage from data loading to submission, including EDA, feature engineering, model selection, and prediction.
preprocessing.py: A reusable module that includes the data transformation pipeline for both train and test datasets. It ensures consistency between model training and final predictions.
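
As a rough illustration of how a module like preprocessing.py can keep training and prediction consistent, the sketch below fits every transformer on the training frame only and then reuses the fitted objects on the test frame. The function names fit_transformers and apply_transformers, the column groups, the toy data, and n_neighbors=2 are assumptions made for this example, not the repository's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column groups; the real lists in preprocessing.py may differ.
NUM_COLS = ["Time_spent_Alone", "Friends_circle_size"]
CAT_COLS = ["Stage_fear"]


def fit_transformers(train: pd.DataFrame) -> dict:
    """Fit imputers and the encoder on the training data only."""
    num_imputer = KNNImputer(n_neighbors=2).fit(train[NUM_COLS])
    cat_imputer = SimpleImputer(strategy="most_frequent").fit(train[CAT_COLS])
    # The encoder learns its categories from the imputed training values.
    encoder = OneHotEncoder(handle_unknown="ignore").fit(
        cat_imputer.transform(train[CAT_COLS])
    )
    return {"num": num_imputer, "cat": cat_imputer, "enc": encoder}


def apply_transformers(df: pd.DataFrame, fitted: dict) -> np.ndarray:
    """Apply already-fitted transformers; nothing is re-fitted here."""
    num = fitted["num"].transform(df[NUM_COLS])
    cat = fitted["cat"].transform(df[CAT_COLS])
    onehot = fitted["enc"].transform(cat).toarray()
    return np.hstack([num, onehot])


# Toy frames invented for illustration.
train = pd.DataFrame({
    "Time_spent_Alone": [1.0, 8.0, np.nan, 3.0],
    "Friends_circle_size": [10.0, 2.0, 4.0, np.nan],
    "Stage_fear": pd.Series(["No", "Yes", np.nan, "No"], dtype=object),
})
test = pd.DataFrame({
    "Time_spent_Alone": [np.nan, 5.0],
    "Friends_circle_size": [6.0, 3.0],
    "Stage_fear": pd.Series(["Maybe", np.nan], dtype=object),
})

fitted = fit_transformers(train)
X_train = apply_transformers(train, fitted)
X_test = apply_transformers(test, fitted)
```

Because the encoder is fitted with handle_unknown="ignore", a category that appears only in the test data (here the made-up value "Maybe") is encoded as all zeros instead of raising an error.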

Workflow

The pipeline includes the following steps:

Data Loading: Train and test datasets are loaded from Kaggle inputs.
Exploratory Data Analysis (EDA): Summary statistics, missing value inspection, and visualization of feature distributions.
Missing Value Imputation:
- KNNImputer is used for numeric columns.
- SimpleImputer with the most_frequent (mode) strategy is used for categorical variables.
Outlier Handling:
- Outliers are capped using the IQR method (adjusted thresholds).
Feature Engineering:
- New features like NEW_Alone_Level, NEW_Social_Score, and categorical bins were created to better capture personality traits.
Encoding:
- Binary features were label encoded.
- Multiclass categorical variables were one-hot encoded using sklearn.preprocessing.OneHotEncoder with handle_unknown='ignore'.
Model Comparison:
- Several classifiers (CatBoost, XGBoost, LightGBM, SVC, RandomForest, etc.) were compared.
- CatBoost (GPU) performed the best on the validation set.
Hyperparameter Optimization:
- Optuna was used to tune CatBoost hyperparameters with cross-validation using f1_macro as the objective.
Final Prediction and Submission:
- The final model was trained on the full training dataset using the best parameters.
- Predictions were made on the test dataset and saved as submission.csv.
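
The outlier capping and f1_macro cross-validation steps above can be sketched roughly as follows. The 1.5 multiplier and the 25th/75th percentiles are the textbook IQR defaults, not the adjusted thresholds used in the project, and RandomForestClassifier (one of the compared models) stands in for the tuned CatBoost model so the example needs only scikit-learn. The synthetic data and the helper names iqr_bounds and cap_outliers are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def iqr_bounds(s: pd.Series, q_low: float = 0.25, q_high: float = 0.75,
               k: float = 1.5) -> tuple:
    """Return (lower, upper) capping limits from the interquartile range."""
    q1, q3 = s.quantile(q_low), s.quantile(q_high)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(s: pd.Series, low: float, high: float) -> pd.Series:
    """Clip values outside [low, high] to the nearest limit."""
    return s.clip(lower=low, upper=high)


# Synthetic stand-in data: y = 1 loosely plays the role of "Introvert".
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "Time_spent_Alone": rng.normal(3 + 4 * y, 1.0),
    "Friends_circle_size": rng.normal(10 - 5 * y, 2.0),
})
X.loc[0, "Time_spent_Alone"] = 99.0  # injected outlier

# Compute limits per column, then cap.
bounds = {col: iqr_bounds(X[col]) for col in X.columns}
for col, (low, high) in bounds.items():
    X[col] = cap_outliers(X[col], low, high)

# Cross-validated macro F1, the metric named in the tuning step.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
mean_f1 = scores.mean()
```

In a tuning loop, a mean score of this kind is what an Optuna objective would typically return for each trial. On real data the capping limits would normally be computed on the training set and reused on the test set.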

Setup and Running

To run this project on your local machine or cloud:

Clone the Repository:

git clone https://github.com/BahriDogru/Personality_Type_Classification.git
cd Personality_Type_Classification

Install Dependencies:

Using the provided environment.yaml file:

conda env create -f environment.yaml
conda activate personality_prediction_env

Prepare the Dataset:

Download train.csv and test.csv from the competition page and place them in a dataset/ folder:

.
├── dataset/
│   ├── train.csv
│   └── test.csv
├── main.py
├── environment.yaml
├── .gitignore
├── preprocessing.py
└── README.md

Run the Script:
```
python main.py
```

Results

After model comparison and tuning, the CatBoostClassifier (GPU) model provided the best results.
📈 Private Leaderboard Score: 0.974089 (F1 Score)
