This repository demonstrates a complete Natural Language Processing (NLP) pipeline for text classification, progressing from a traditional machine learning baseline to a deep learning model built with TensorFlow and Keras.
The project is designed to be educational, extensible, and presentation-ready, making it suitable for learning, interviews, and portfolio showcasing.
Given a piece of text, the task is to classify it into one of several predefined categories using supervised learning techniques.
- Dataset: 20 Newsgroups (via scikit-learn)
- Selected Categories:
- rec.autos
- sci.med
- comp.graphics
- sci.space
- Total Samples: ~3,940 documents
- Preprocessing:
- Removed headers, footers, and quoted text
- Used cleaned raw text for modeling
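The dataset setup above can be sketched with scikit-learn's built-in loader. The helper name `load_cleaned_newsgroups` is illustrative (not taken from the notebooks); the `remove=("headers", "footers", "quotes")` option is the standard scikit-learn way to strip the metadata described above.

```python
from sklearn.datasets import fetch_20newsgroups

# The four categories used throughout this project.
CATEGORIES = ["rec.autos", "sci.med", "comp.graphics", "sci.space"]

def load_cleaned_newsgroups(subset="train"):
    """Fetch the selected categories with headers, footers, and
    quoted replies stripped, leaving cleaned raw text for modeling."""
    return fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),
        random_state=42,
    )
```

Calling `load_cleaned_newsgroups()` downloads the corpus on first use and returns a bunch with `.data` (list of strings) and `.target` (integer labels).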
This project is implemented in two stages to clearly compare traditional NLP methods with deep learning approaches.
📓 Notebook: 01_tfidf_logistic_regression.ipynb
- Text Representation: TF-IDF Vectorization
- Max features: 5000
- English stopwords removed
- Model: Logistic Regression
- Train–Test Split: 80% / 20%
- Evaluation Metrics: Accuracy, Precision, Recall, F1-score
- Test Accuracy: ~88%
- Strong, interpretable baseline with minimal computation
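The baseline above can be sketched as a two-step scikit-learn pipeline. The toy corpus below stands in for the cleaned 20 Newsgroups text (the real notebook trains on thousands of documents), but the vectorizer and model settings mirror the configuration listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus: two classes (autos vs. space).
texts = [
    "the car engine needs new oil and brakes",
    "my truck transmission is making a grinding noise",
    "tire pressure and wheel alignment for the sedan",
    "the driver replaced the spark plugs yesterday",
    "new headlights installed on the old coupe",
    "the spacecraft entered orbit around the moon",
    "nasa launched a new satellite into space",
    "astronauts conducted a spacewalk outside the station",
    "the rocket booster separated after launch",
    "telescope observations of a distant galaxy",
]
labels = [0] * 5 + [1] * 5

# TF-IDF features (5000 max, English stopwords removed) feeding
# a Logistic Regression classifier, as in the notebook.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
```

On the real corpus this setup reaches roughly 88% test accuracy with negligible training cost.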
📓 Notebook: 02_tensorflow_text_classification.ipynb
This notebook replaces manual feature engineering with learned word embeddings and a neural network architecture.
- TextVectorization layer
- Embedding layer
- Global Average Pooling
- Dense hidden layer (ReLU)
- Softmax output layer
- Framework: TensorFlow & Keras
- Loss Function: Sparse Categorical Crossentropy
- Optimizer: Adam
- Epochs: 10
- Batch Size: 32
- Validation Split: 20%
- Validation Accuracy: ~81–85%
- Test Accuracy: ~81%
- Demonstrates semantic learning and scalability
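The architecture above can be sketched in Keras as follows. The vocabulary size, sequence length, and embedding dimension are illustrative assumptions (the notebook's exact values may differ); the layer stack, loss, and optimizer match the configuration listed above.

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # assumption: cap on vocabulary size
SEQ_LEN = 200        # assumption: padded/truncated sequence length
EMBED_DIM = 64       # assumption: embedding dimension
NUM_CLASSES = 4      # the four selected newsgroup categories

# TextVectorization learns the vocabulary from raw text.
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=SEQ_LEN
)
vectorize.adapt(["example document text", "another example document"])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize,                                        # raw text -> token ids
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM), # learned embeddings
    tf.keras.layers.GlobalAveragePooling1D(),         # average over tokens
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Training then uses: model.fit(texts, labels, epochs=10,
#                               batch_size=32, validation_split=0.2)
```

Because the model accepts raw strings directly, the same vectorization is applied at training and inference time with no separate preprocessing step.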
| Model | Feature Type | Test Accuracy |
|---|---|---|
| TF-IDF + Logistic Regression | Sparse statistical features | ~88% |
| TensorFlow Neural Network | Learned embeddings | ~81% |
- TF-IDF with linear models remains a strong NLP baseline
- Deep learning models learn semantic representations automatically
- Neural networks scale better for large and complex datasets
- The project highlights trade-offs between interpretability and flexibility
- CNN-based text classifiers
- LSTM / Bi-LSTM architectures
- Transformer models (BERT)
- Hyperparameter tuning
- Confusion matrix & detailed error analysis
- Python 3.x
- scikit-learn
- TensorFlow
- NumPy
- Jupyter Notebook / Google Colab
This project is licensed under the MIT License.