Predict one or more movie genres from a film’s plot summary using sentence/document embeddings (BERT, SBERT, Word2Vec, TF-IDF) and a flexible neural classifier. This repo implements a clean, reproducible NLP → multi-label classification pipeline in Python.
- Task: Multi-label text classification (a movie can have several genres).
- Pipeline: text cleanup → embedding (BERT / SBERT / Word2Vec) → feed-forward neural network → thresholding.
- Models (as implemented):
  - BERT with CLS + MLP (`BertPretrained embeddings → FlexibleNeuralNetwork`)
  - SBERT + MLP (`Sentence-BERT embeddings → FlexibleNeuralNetwork`)
  - Google News Word2Vec + MLP (`Word2Vec 300-d embeddings → FlexibleNeuralNetwork`)
- Evaluation: Micro/Macro-F1, Precision/Recall, Jaccard, Hamming accuracy; global decision-threshold tuning.
- Reproducibility: Seeded runs, `requirements.txt`, scriptable CLI; metrics saved under `analysis/model_comparison/`.
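
The global decision-threshold step can be sketched as follows. This is a minimal, pure-Python illustration and not the repository's code: the function names `micro_f1` and `best_global_threshold`, the threshold grid, and the toy data are all assumptions, and it assumes micro-averaged F1 is the quantity being maximized.

```python
def micro_f1(y_true, probs, threshold):
    """Micro-averaged F1 after binarizing probabilities with one global threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, probs):
        for t, p in zip(true_row, prob_row):
            pred = 1 if p >= threshold else 0
            if pred and t:
                tp += 1
            elif pred and not t:
                fp += 1
            elif t and not pred:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_global_threshold(y_true, probs, grid=None):
    """Sweep a grid of thresholds and keep the first one with the highest micro-F1."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99 (assumed grid)
    best_t, best_score = grid[0], -1.0
    for t in grid:
        score = micro_f1(y_true, probs, t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy data: 3 films, 3 genres (rows are films, columns are genres).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
probs = [[0.80, 0.10, 0.35], [0.20, 0.60, 0.05], [0.45, 0.30, 0.25]]

threshold, score = best_global_threshold(y_true, probs)
print(f"best threshold={threshold:.2f}, micro-F1={score:.3f}")
```

A single threshold shared by all genres keeps the tuning to one parameter. To avoid optimistic estimates, such a threshold is usually selected on a validation split and then applied unchanged to the test set.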
Top run (test set, best global threshold):
| Model | F1 (best-threshold) | Threshold | Precision | Recall | Jaccard | Hamming acc. |
|---|---|---|---|---|---|---|
| SBERT with NN | 0.660 | 0.271 | 0.642 | 0.676 | 0.491 | 0.905 |
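
For readers unfamiliar with the columns, the sketch below shows one common way to compute them from binary label matrices. It assumes micro-averaging (counts pooled over all films and genres) and defines Hamming accuracy as the fraction of individual label decisions that are correct; the repository may average differently, and `multilabel_metrics` is a hypothetical helper name.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1, Jaccard, and Hamming accuracy."""
    tp = fp = fn = tn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        # Share of all (film, genre) decisions that match the ground truth.
        "hamming_accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy data: 2 films, 4 genres.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]

metrics = multilabel_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because true negatives count toward Hamming accuracy but not toward F1 or Jaccard, Hamming accuracy tends to be much higher than the other scores when each film carries only a few of the possible genres.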