A beginner-friendly Machine Learning project that classifies SMS messages as Spam or Ham (Not Spam) using Python, Scikit-learn, and TF-IDF vectorization. This project includes text preprocessing, model training, evaluation, and vocabulary visualization using WordClouds.
- Classifies SMS messages as spam or not spam
- Uses TF-IDF to convert text to numeric features
- Built with the Naive Bayes (MultinomialNB) algorithm
- Reports accuracy score, confusion matrix, and classification report
- WordCloud visualizations for spam and ham messages
- Live predictions on custom sample messages
| Category | Tools Used |
|---|---|
| Language | Python |
| Libraries | Pandas, Scikit-learn, Matplotlib, WordCloud |
| ML Algorithm | Naive Bayes (MultinomialNB) |
| Feature Extraction | TF-IDF Vectorizer |
| IDE | Jupyter Notebook |
- Name: SMS Spam Collection Dataset
- Source: Kaggle Datasets
- File: spam.csv (included in this repo)
This dataset contains 5,572 SMS messages labeled as either "spam" or "ham".
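
A minimal loading sketch follows. It assumes the Kaggle copy of the file, where the label and text columns are named `v1` and `v2` and the file is Latin-1 encoded; check these against your copy of spam.csv. The helper name `load_sms` is hypothetical, and a tiny in-memory sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd


def load_sms(source):
    """Return a DataFrame with 'label' (0 = ham, 1 = spam) and 'text' columns.

    Assumes the label/text columns are named 'v1'/'v2' (an assumption about
    the Kaggle file layout, not confirmed by this README).
    """
    df = pd.read_csv(source, encoding="latin-1")
    # Keep only the two useful columns and give them readable names.
    df = df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
    df = df.dropna()
    # Convert string labels to integers: ham -> 0, spam -> 1.
    df["label"] = df["label"].map({"ham": 0, "spam": 1})
    return df


# Tiny in-memory stand-in for spam.csv (hypothetical rows).
sample_csv = io.BytesIO(
    b"v1,v2,Unnamed: 2\n"
    b"ham,Are we still meeting at 6 PM?,\n"
    b"spam,Free entry to win a prize,\n"
)
df = load_sms(sample_csv)
print(df)
```

With the real file, the call would be `load_sms("spam.csv")`.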
- Load Dataset – Reads and cleans the SMS spam dataset.
- Preprocess Data – Converts labels to integers (ham → 0, spam → 1).
- WordClouds – Generates WordClouds for both spam and ham.
- TF-IDF Vectorization – Converts text into numeric vectors.
- Train-Test Split – 80% training, 20% testing.
- Train Model – Uses Naive Bayes classifier.
- Evaluate Model – Prints accuracy, confusion matrix, and report.
- Live Prediction – Tests the model with your own text inputs.
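
The modeling steps above (vectorize, split, train, evaluate, predict) can be sketched roughly as follows. This is not the notebook's exact code: the toy messages, the `random_state` value, the stratified split, and the helper `predict_message` are assumptions made so the example is small and repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data (hypothetical); the project itself uses spam.csv.
texts = [
    "Win a free prize now, click the link",
    "Free cash offer, claim your reward today",
    "You have won a free phone, call now",
    "Urgent! Claim your free vacation prize",
    "Congratulations, you won cash, click to claim",
    "Are we still meeting at 6 PM?",
    "Can you pick up milk on the way home?",
    "See you at lunch tomorrow",
    "I will call you after the meeting",
    "Thanks for dinner, it was great",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # ham -> 0, spam -> 1

# TF-IDF vectorization: text -> numeric vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train-test split: 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

# Train the Naive Bayes classifier.
model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluate: accuracy, confusion matrix, classification report.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(
    y_test, y_pred, labels=[0, 1], target_names=["ham", "spam"], zero_division=0
))


def predict_message(message):
    """Classify one message with the fitted vectorizer and model."""
    features = vectorizer.transform([message])
    return "Spam" if model.predict(features)[0] == 1 else "Not Spam"


print(predict_message("Congratulations! You've won a free iPhone"))
```

The sketch follows the step order listed above, vectorizing before the split. Fitting the vectorizer on the training split only is a common variation that keeps test-set vocabulary statistics out of training.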
sample = ["Congratulations! You've won a free iPhone. Click the below link to claim"]
# Output: Spam
sample = ["Hey, are we still meeting at 6 PM?"]
# Output: Not Spam

This project is licensed under the MIT License. Feel free to use, modify, and distribute it for personal and commercial purposes.
Contributions, issues, and feature requests are welcome! Feel free to fork this repo and submit a pull request.
Created with ❤️ by Sushmitha Shettigar

