This repository demonstrates a complete Natural Language Processing (NLP) pipeline for text classification, progressing from a traditional machine learning baseline to a deep learning model built with TensorFlow and Keras.
The project is designed to be educational, extensible, and presentation-ready, making it suitable for learning, interviews, and portfolio showcasing.
Given a piece of text, the task is to classify it into one of several predefined categories using supervised learning techniques.
- Dataset: 20 Newsgroups (via scikit-learn)
- Selected Categories:
- rec.autos
- sci.med
- comp.graphics
- sci.space
- Total Samples: ~3,940 documents
- Preprocessing:
- Removed headers, footers, and quoted text
- Used cleaned raw text for modeling
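The dataset setup above can be sketched with scikit-learn's built-in loader. The helper name `load_cleaned_newsgroups` is illustrative (not taken from the notebooks); the `remove=("headers", "footers", "quotes")` option is the standard scikit-learn way to strip the metadata described above.

```python
from sklearn.datasets import fetch_20newsgroups

# The four categories used throughout this project.
CATEGORIES = ["rec.autos", "sci.med", "comp.graphics", "sci.space"]

def load_cleaned_newsgroups(subset="train"):
    """Fetch the selected categories with headers, footers, and
    quoted replies stripped, leaving cleaned raw text for modeling."""
    return fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),
        random_state=42,
    )
```

Calling `load_cleaned_newsgroups()` downloads the corpus on first use and returns a bunch with `.data` (list of strings) and `.target` (integer labels).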
This project is implemented in two stages to clearly compare traditional NLP methods with deep learning approaches.
📓 Notebook: 01_tfidf_logistic_regression.ipynb
- Text Representation: TF-IDF Vectorization
- Max features: 5000
- English stopwords removed
- Model: Logistic Regression
- Train–Test Split: 80% / 20%
- Evaluation Metrics: Accuracy, Precision, Recall, F1-score
- Test Accuracy: ~88%
- Strong, interpretable baseline with minimal computation
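The baseline above can be sketched as a two-step scikit-learn pipeline. The toy corpus below stands in for the cleaned 20 Newsgroups text (the real notebook trains on thousands of documents), but the vectorizer and model settings mirror the configuration listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus: two classes (autos vs. space).
texts = [
    "the car engine needs new oil and brakes",
    "my truck transmission is making a grinding noise",
    "tire pressure and wheel alignment for the sedan",
    "the driver replaced the spark plugs yesterday",
    "new headlights installed on the old coupe",
    "the spacecraft entered orbit around the moon",
    "nasa launched a new satellite into space",
    "astronauts conducted a spacewalk outside the station",
    "the rocket booster separated after launch",
    "telescope observations of a distant galaxy",
]
labels = [0] * 5 + [1] * 5

# TF-IDF features (5000 max, English stopwords removed) feeding
# a Logistic Regression classifier, as in the notebook.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
```

On the real corpus this setup reaches roughly 88% test accuracy with negligible training cost.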
📓 Notebook: 02_tensorflow_text_classification.ipynb
This notebook replaces manual feature engineering with learned word embeddings and a neural network architecture.
- TextVectorization layer
- Embedding layer
- Global Average Pooling
- Dense hidden layer (ReLU)
- Softmax output layer
- Framework: TensorFlow & Keras
- Loss Function: Sparse Categorical Crossentropy
- Optimizer: Adam
- Epochs: 10
- Batch Size: 32
- Validation Split: 20%
- Validation Accuracy: ~81–85%
- Test Accuracy: ~81%
- Demonstrates semantic learning and scalability
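The architecture above can be sketched in Keras as follows. The vocabulary size, sequence length, and embedding dimension are illustrative assumptions (the notebook's exact values may differ); the layer stack, loss, and optimizer match the configuration listed above.

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # assumption: cap on vocabulary size
SEQ_LEN = 200        # assumption: padded/truncated sequence length
EMBED_DIM = 64       # assumption: embedding dimension
NUM_CLASSES = 4      # the four selected newsgroup categories

# TextVectorization learns the vocabulary from raw text.
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=SEQ_LEN
)
vectorize.adapt(["example document text", "another example document"])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize,                                        # raw text -> token ids
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM), # learned embeddings
    tf.keras.layers.GlobalAveragePooling1D(),         # average over tokens
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Training then uses: model.fit(texts, labels, epochs=10,
#                               batch_size=32, validation_split=0.2)
```

Because the model accepts raw strings directly, the same vectorization is applied at training and inference time with no separate preprocessing step.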
| Model | Feature Type | Test Accuracy |
|---|---|---|
| TF-IDF + Logistic Regression | Sparse statistical features | ~88% |
| TensorFlow Neural Network | Learned embeddings | ~81% |
- TF-IDF with linear models remains a strong NLP baseline
- Deep learning models learn semantic representations automatically
- Neural networks scale better for large and complex datasets
- The project highlights trade-offs between interpretability and flexibility
- CNN-based text classifiers
- LSTM / Bi-LSTM architectures
- Transformer models (BERT)
- Hyperparameter tuning
- Confusion matrix & detailed error analysis
- Python 3.x
- scikit-learn
- TensorFlow
- NumPy
- Jupyter Notebook / Google Colab
This project is licensed under the MIT License.