
stack_pop

A collection of four fine-tuned transformer models (BERT, RoBERTa, ALBERT, DistilBERT) trained on 60,000 Stack Overflow questions from Kaggle. This repository provides preprocessed data, model training and evaluation notebooks, and guidance for reproducibility and further research.


📌 Overview

stack_pop aims to provide robust, fine-tuned transformer models for question understanding and related NLP tasks in the programming Q&A domain. The models are trained and evaluated on a large, real-world dataset of Stack Overflow questions, making them suitable for academic research and practical applications.


📋 Contents

  • df_preprocessed: Preprocessed version of the original Stack Overflow dataset, ready for model training and faster retraining.
  • finetuning.ipynb: Jupyter notebook for preprocessing, model loading, fine-tuning, and saving models to the Hugging Face Hub.
  • model_eval.ipynb: Jupyter notebook for evaluating all fine-tuned models and comparing their performance.

🧠 Models

The following transformer models are fine-tuned and included:

  • BERT-base
  • RoBERTa
  • ALBERT
  • DistilBERT

All models are fine-tuned on the same dataset for consistent benchmarking and comparison.
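
Once published, the checkpoints can be loaded for inference with the transformers Auto classes. A minimal loading sketch, assuming the models were uploaded to the Hugging Face Hub as sequence-classification checkpoints; the repo IDs below are placeholders, not the actual published names:

    # Minimal loading sketch; repo IDs are hypothetical placeholders.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_IDS = {
        "bert": "your-username/stack-pop-bert-base",         # hypothetical repo ID
        "roberta": "your-username/stack-pop-roberta",        # hypothetical repo ID
        "albert": "your-username/stack-pop-albert",          # hypothetical repo ID
        "distilbert": "your-username/stack-pop-distilbert",  # hypothetical repo ID
    }

    def load(name: str):
        """Return (tokenizer, model) for one of the four fine-tuned checkpoints."""
        repo_id = MODEL_IDS[name]
        tokenizer = AutoTokenizer.from_pretrained(repo_id)
        model = AutoModelForSequenceClassification.from_pretrained(repo_id)
        return tokenizer, model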


📚 Dataset

  • Source: 60,000 Stack Overflow questions from Kaggle
  • Preprocessing: Cleaning, normalization, and formatting performed in finetuning.ipynb (an illustrative sketch follows this list)
  • Saved File: Preprocessed data is stored as df_preprocessed for reproducibility and faster retraining
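
The authoritative preprocessing steps live in finetuning.ipynb; the sketch below only illustrates the kind of cleaning and normalization involved, and assumes Title/Body columns and a CSV filename that may not match the actual Kaggle schema:

    # Illustrative only -- the real steps are in finetuning.ipynb.
    import re
    import pandas as pd

    def clean_text(text: str) -> str:
        text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags from question bodies
        text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
        return text.lower()                        # normalize case

    df = pd.read_csv("stack_overflow_questions.csv")  # hypothetical filename
    df["text"] = (df["Title"] + " " + df["Body"]).map(clean_text)
    df.to_pickle("df_preprocessed")                   # reused for faster retraining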

🛠️ Getting Started

Requirements:

  • Python 3.8+
  • Jupyter Notebook
  • PyTorch
  • transformers, datasets, pandas, scikit-learn, and other standard ML/NLP libraries

Setup:

  1. Clone the repository.
  2. Install dependencies:
     pip install -r requirements.txt
  3. Open finetuning.ipynb to preprocess data and fine-tune models (a minimal training sketch follows these steps).
  4. Use model_eval.ipynb to evaluate and compare model performance.
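
The training sketch referenced in step 3, for one of the four models: it follows the standard transformers Trainer flow, and the hyperparameters, checkpoint name, and label column are illustrative assumptions, not values taken from the notebook:

    # Minimal Trainer-based sketch; assumes df_preprocessed holds "text"
    # and "label" columns (an assumption, not the notebook's actual schema).
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    df = pd.read_pickle("df_preprocessed")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=256)

    ds = Dataset.from_pandas(df).map(tokenize, batched=True)
    ds = ds.train_test_split(test_size=0.1)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=df["label"].nunique())

    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=16)
    # Passing the tokenizer lets Trainer pad batches dynamically by default.
    Trainer(model=model, args=args, tokenizer=tokenizer,
            train_dataset=ds["train"], eval_dataset=ds["test"]).train()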

⚙️ Usage

  • Retraining: Use df_preprocessed as your starting dataset for further fine-tuning or experimentation.
  • Model Evaluation: Run model_eval.ipynb to assess accuracy, F1-score, and other relevant metrics (a metrics sketch follows this list).
  • Hugging Face Integration: Models can be uploaded and shared via Hugging Face Model Hub.
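
As noted above, the full comparison lives in model_eval.ipynb; the small sketch below shows the kind of per-model metrics it reports, where labels and preds are assumed to be a model's predictions on the held-out split:

    # Per-model metric report; `labels` and `preds` are assumed inputs.
    from sklearn.metrics import accuracy_score, f1_score

    def report(name, labels, preds):
        acc = accuracy_score(labels, preds)
        f1 = f1_score(labels, preds, average="macro")  # macro-F1 across classes
        print(f"{name}: accuracy={acc:.4f}, macro-F1={f1:.4f}")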

📂 File Structure

File/Folder        Description
df_preprocessed    Preprocessed Stack Overflow dataset (post-cleaning)
finetuning.ipynb   Data preprocessing, model loading, fine-tuning, and saving
model_eval.ipynb   Evaluation and comparison of all fine-tuned models

📊 Results

  • Comparative evaluation of all four models is available in model_eval.ipynb, including accuracy, F1-score, and other relevant metrics.
