This project aims to build a machine learning model to detect fraudulent credit card transactions. It uses a Random Forest classifier trained on a dataset of credit card transactions.
Directory structure:

```
├── README.md
├── requirements.txt
├── data/
│   └── creditcard.csv
├── models/
│   └── random_forest_model.pkl
├── notebooks/
│   ├── 01_data_preprocessing.ipynb
│   └── 02_model_training.ipynb
└── outputs/
    ├── evaluation_report.txt
    └── roc_curve.png
```
- **Data Preprocessing** (`notebooks/01_data_preprocessing.ipynb`):
  - Loads the raw dataset (`data/creditcard.csv`).
  - Performs basic data exploration (info, description, null checks).
  - Visualizes the class distribution and feature correlations.
  - Applies `StandardScaler` to the `Amount` and `Time` features.
  - Drops the original `Time` and `Amount` columns.
  - Saves the processed data to `data/processed_data.csv`.
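The preprocessing steps above can be sketched as follows. A tiny synthetic frame stands in for `data/creditcard.csv` so the snippet is self-contained; in the notebook the frame would come from `pd.read_csv("data/creditcard.csv")`, with columns `V1`–`V28`, `Time`, `Amount`, and `Class`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for pd.read_csv("data/creditcard.csv") — illustrative rows only.
df = pd.DataFrame({
    "Time":   [0.0, 1.0, 2.0, 3.0],
    "V1":     [0.1, -0.2, 0.3, -0.4],
    "Amount": [100.0, 5.0, 250.0, 12.0],
    "Class":  [0, 0, 1, 0],
})

# Scale 'Amount' and 'Time' into new columns, then drop the originals.
scaler = StandardScaler()
df["scaled_amount"] = scaler.fit_transform(df[["Amount"]]).ravel()
df["scaled_time"] = scaler.fit_transform(df[["Time"]]).ravel()
df = df.drop(columns=["Time", "Amount"])

print(list(df.columns))  # scaled columns replace 'Time'/'Amount'
```

After scaling, each new column has zero mean and unit variance, which keeps the raw monetary scale of `Amount` from dominating distance-sensitive diagnostics.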
- **Model Training** (`notebooks/02_model_training.ipynb`):
  - Loads the processed data (`data/processed_data.csv`).
  - Splits the data into training and testing sets.
  - Trains a `RandomForestClassifier` model.
  - Evaluates the model using a classification report, confusion matrix, and ROC AUC score.
  - Saves the trained model to `models/random_forest_model.pkl`.
  - Saves the evaluation results to `outputs/evaluation_report.txt`.
  - Generates and saves the ROC curve plot to `outputs/roc_curve.png`.
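A minimal sketch of the training and evaluation flow. Synthetic imbalanced data from `make_classification` stands in for `data/processed_data.csv`, and the hyperparameters shown are illustrative assumptions, not the notebook's exact settings.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in for the processed fraud data (~5% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95], random_state=42)

# Stratified split so the rare class appears in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(f"ROC AUC: {auc:.3f}")

# The notebook persists the fitted model (to models/random_forest_model.pkl).
joblib.dump(model, "random_forest_model.pkl")
```

Stratifying the split and reporting ROC AUC both matter here because accuracy alone is misleading on heavily imbalanced fraud data.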
- Clone the repository:

  ```bash
  git clone <your-repository-url>
  cd <repository-directory>
  ```

- Create a virtual environment (recommended):

  Linux/macOS:

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

  Windows:

  ```bash
  python -m venv venv
  venv\Scripts\activate
  ```

- Install dependencies:

  Ensure the `requirements.txt` file lists all necessary packages. Based on the notebooks, you will likely need `pandas`, `numpy`, `matplotlib`, `seaborn`, and `scikit-learn`. Install them with pip:

  ```bash
  pip install -r requirements.txt
  ```
- Add data: place the raw dataset file (`creditcard.csv`) into the `data/` directory. (Note: this dataset is commonly available on Kaggle.)
- Ensure the `creditcard.csv` file is in the `data/` directory.
- Run the data preprocessing notebook: execute the cells in `notebooks/01_data_preprocessing.ipynb`. This will generate `data/processed_data.csv`.
- Run the model training notebook: execute the cells in `notebooks/02_model_training.ipynb`. This will train the model, save it to `models/`, and generate the evaluation files in `outputs/`.
The model performance metrics (precision, recall, F1-score, confusion matrix, and ROC AUC score) are written to `outputs/evaluation_report.txt`. The ROC curve visualization is saved as `outputs/roc_curve.png`.
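To reuse the trained model outside the notebooks (e.g. for scoring new transactions), it can be re-loaded with `joblib`, assuming it was serialized with `joblib.dump` as described above. A toy model stands in for `models/random_forest_model.pkl` so the round trip is self-contained:

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real fitted model saved by the training notebook.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
trained = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
joblib.dump(trained, "random_forest_model.pkl")

# Later (e.g. in a scoring script), reload and score new rows. The input
# must use the same feature columns, in the same order, as during training.
model = joblib.load("random_forest_model.pkl")
probs = model.predict_proba(X[:3])[:, 1]  # estimated fraud probability per row
print(probs.shape)
```

Because pickled models are tied to the scikit-learn version that produced them, load the file with a matching (or compatible) version of the library.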