Benchmark three generations of recommenders on a sequence next‑item task:
- Biased Matrix Factorization (classical baseline)
- Tabular Deep Models (DeepFM and/or DCNv2)
- Sequence‑Aware Transformers (SASRec and/or PinnerFormer‑style encoder)
Goal: Quantify the lift each stage provides on top‑N ranking quality, calibrated across identical data splits and negative‑sampling schemes.
Dataset: MovieLens 1M
- Ratings: 1,000,209
- Users: 6,040
- Movies: 3,900
Signals:
- Explicit star ratings (1–5)
- Timestamps
- User demographics
- Movie genres
Task:
Sequence next‑item prediction — given a user’s chronological prefix, predict the next movie they rate ≥ 4 stars.
- Positives: Held‑out last ≥4‑star interaction per user
- Negatives: 99 random unseen movies per test case (uniform sampling)
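To make the evaluation protocol concrete, here is a minimal sketch of building one test case's 100-item candidate set; the helper name and signature are illustrative, not part of the codebase:

```python
import random

def build_test_candidates(positive, seen, all_items, n_neg=99, rng=None):
    """Return the held-out positive plus n_neg uniformly sampled unseen movies."""
    rng = rng or random.Random(0)
    pool = list(set(all_items) - set(seen) - {positive})  # exclude the user's history
    return [positive] + rng.sample(pool, n_neg)
```

The model then ranks these 100 candidates; Hit@10 asks whether the positive lands in the top 10.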
- Load and clean `ratings`, `users`, `movies`; drop invalid MovieIDs.
- Filter to ratings ≥ 4 (positive signal).
- Sort each user's ratings chronologically → sequences `s = [m₁, ..., m_T]`.
- Split per user:
  - Train = first 80%
  - Validation = next 10%
  - Test = final 10%
- Construct training sequences; for each prefix `s[:t]` (t ≥ 2), take (see the sketch after this list):
  - Positive = `s[t]`
  - K negatives = `sample_unseen(user, K)` (default K = 5)
- Pad and mask sequences for transformer batches (max length L = 50).
- Build feature tables for tabular models:
  - User: age bucket, occupation
  - Movie: multi‑hot genres
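For reference, a minimal sketch of the prefix-expansion and padding steps above. The helper names (`make_training_examples`, `pad_left`) and the pad ID 0 are illustrative assumptions, not the repo's actual utilities; the inline uniform draw stands in for `sample_unseen`:

```python
import random

def make_training_examples(seq, seen, all_items, k=5, rng=None):
    """Expand one user's train sequence into (prefix, positive, negatives) triples."""
    rng = rng or random.Random(0)
    pool = list(set(all_items) - set(seen))  # items the user never interacted with
    examples = []
    for t in range(2, len(seq)):             # each prefix s[:t] (t >= 2) predicts s[t]
        examples.append((seq[:t], seq[t], rng.sample(pool, k)))
    return examples

def pad_left(prefix, max_len=50, pad_id=0):
    """Left-pad (or truncate) a prefix to max_len; the mask flags real positions."""
    prefix = list(prefix)[-max_len:]
    pad = max_len - len(prefix)
    return [pad_id] * pad + prefix, [0] * pad + [1] * len(prefix)
```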
| Tier | Core Idea | Loss | Init | Notes |
|---|---|---|---|---|
| MF | Biased latent factors | BPR (1+K) | Xavier | Fast baseline; used to warm-start deep models |
| DeepFM/DCNv2 | Sparse-dense cross layers | BCE | Random | Ingests demographics + genres |
| SASRec | Transformer encoder | CE over 1+K | Pre-trained MF | Self-attention (L=50, 2 layers) |
| PinnerFormer-lite | SASRec + conv-mixing | CE | SASRec weights | (stretch goal) Adds local context with depthwise convolutions |
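To make the loss column concrete: the MF tier's BPR objective pushes the positive's score above each of the K sampled negatives. A minimal PyTorch sketch (function name is illustrative):

```python
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    """BPR over 1 positive and K negatives.

    pos_scores: [batch]     score of the held-out positive
    neg_scores: [batch, K]  scores of the K sampled negatives
    """
    diff = pos_scores.unsqueeze(1) - neg_scores  # [batch, K] pairwise margins
    return -F.logsigmoid(diff).mean()            # maximize P(positive ranked above negative)
```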
- For each user’s test prefix, score the 100-item candidate set.
- Compute and average user-level metrics:
- Primary: Hit@10, NDCG@10
- Secondary: MRR, MAP
- [stretch] Analyze cold‑start (<10 train ratings) vs power users (≥50).
- [stretch] Statistical significance:
- Bootstrap 1,000× NDCG@10
- Report 95% CI and paired t‑test
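A sketch of the user-level metrics and the bootstrap CI described above (NumPy only; `rank` is the 1-based position of the positive among the 100 scored candidates; function names are illustrative):

```python
import numpy as np

def hit_at_k(rank, k=10):
    return float(rank <= k)

def ndcg_at_k(rank, k=10):
    # One relevant item, so IDCG = 1 and NDCG = 1/log2(rank + 1) inside the cutoff.
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

def bootstrap_ci(per_user_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI over resampled means of a user-level metric (e.g., NDCG@10)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_user_scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```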
| Axis | Variation | Hypothesis |
|---|---|---|
| Negative sample K | 1, 5, 20, 50 | Higher K → sharper rank signal, but memory-intensive |
| Sequence length L | 20, 50, 100 | Longer histories benefit transformers more than MF |
| Positive threshold | Rating ≥5 vs ≥4 | Stricter positive reduces data, may lower recall |
| Feature sets | +demographics, +genres, both, none | DCNv2 expected to gain most from feature richness |
| Embedding warm-start | MF → deep vs random | Warm-start helps convergence |
| Model depth | SASRec: 2 vs 4 layers | Little gain beyond 2 layers for small dataset |
This repository provides end-to-end code for preprocessing the MovieLens dataset, training models, and evaluating their performance. It’s structured to let you plug in new models easily, while reusing existing preprocessing and evaluation pipelines.
- Install Conda (if you don't have it).
- Create the environment and install dependencies:

  ```bash
  conda env create -f environment.yml
  conda activate recommender-env
  ```

- Ensure Python 3.10+ is active.
- Download the MovieLens raw files (`movies.dat`, `ratings.dat`, `users.dat`) and place them in the `data/` folder.
- Verify file integrity (MD5 checksums or file sizes) if needed.
The `preprocess.py` script handles:

- Loading raw MovieLens `.dat` files into pandas DataFrames
- Filtering out low-rating interactions (e.g., keeping only ratings ≥ 4)
- Sorting and splitting each user's interaction history into train/val/test
- Returning clean dictionaries and sequences per user
- Negative sampling during evaluation-set construction
```python
from preprocess import load_movielens, clean_and_filter, get_user_sequences, split_sequences, build_datasets

ratings, users, movies = load_movielens("data/")
ratings, users, movies = clean_and_filter(ratings, users, movies, rating_threshold=4)
user_seqs = get_user_sequences(ratings)
splits = split_sequences(user_seqs, train_ratio=0.8, val_ratio=0.1)
all_items = set(movies.MovieID)
train_data, val_data, test_data = build_datasets(splits, all_items, candidate_size=100)
```

Refer to the function docstrings for parameter details.
Contains the matrix factorization baseline:

- `MatrixFactorization`: simple embedding-based MF with dot-product scoring (no side features)
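For orientation, a sketch of what such a class might look like; the actual implementation in `models/` may differ (bias terms follow the "biased latent factors" idea in the tier table):

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """Dot-product MF with user, item, and global bias terms (no side features)."""

    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))
        nn.init.xavier_uniform_(self.user_emb.weight)  # Xavier init, as in the tier table
        nn.init.xavier_uniform_(self.item_emb.weight)

    def forward(self, user_ids, item_ids):
        dot = (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(-1)
        return (dot + self.user_bias(user_ids).squeeze(-1)
                    + self.item_bias(item_ids).squeeze(-1) + self.global_bias)
```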
- Create a new file in `models/`, e.g. `my_model.py`.
- Define a class with `.forward(user_ids, item_ids)` or `.forward(batch)` (see the skeleton below).
- Ensure it accepts the same input format as the other models.
- Import and instantiate it in your experiment script or notebook.
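For example, a hypothetical `models/my_model.py` skeleton satisfying that interface (all names are placeholders):

```python
# models/my_model.py
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Placeholder model exposing the shared (user_ids, item_ids) -> scores interface."""

    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.scorer(x).squeeze(-1)  # one ranking score per (user, item) pair
```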
This module provides reusable evaluation tools for recommender systems using implicit feedback. It supports Hit@K, NDCG@K, MRR, and MAP, and evaluates models using a parameterized negative sampling routine.
- Standard ranking metrics: Hit@K, NDCG@K, MRR, MAP
- Compatible with any PyTorch model: accepts model(user_tensor, item_tensor) interface
- Supports custom negative samplers (see the sketch below); by default it samples negatives from items the user hasn’t seen
- Designed for sequence-aware recommendation, but works with general recommendation models
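As a sketch, a custom sampler might look like the following; the exact callable signature `evaluate_ranking_model` expects is defined in its docstring, and the `(seen_items, global_items, n)` convention here is an assumption:

```python
import random

def uniform_unseen_sampler(seen_items, global_items, n, rng=None):
    """Draw n uniform negatives from items the user hasn't seen (default-style behavior)."""
    rng = rng or random.Random(0)
    pool = list(set(global_items) - set(seen_items))  # never sample from the user's history
    return rng.sample(pool, n)
```

The evaluator's signature: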
```python
evaluate_ranking_model(
    model,                 # PyTorch model
    user_splits,           # dict[user_id] = (train_seq, val_seq, test_seq)
    global_items,          # set of all item IDs
    device,                # torch.device('cuda') or 'cpu'
    *,
    candidate_size=100,    # 1 positive + (candidate_size - 1) negatives
    k=10,                  # metric cutoff
    negative_sampler=...,  # function to generate negatives
) -> dict
```

Usage:

```python
from evaluation import evaluate_ranking_model
import torch

metrics = evaluate_ranking_model(
model=my_model,
user_splits=my_user_split_dict,
global_items=set_of_all_items,
device=torch.device("cuda"),
candidate_size=100,
k=10
)
print(metrics)
```

The notebook shows:
- Loading and preprocessing data
- Instantiating the baseline MF model
- Training loop with BCE loss
- Evaluating metrics on validation/test
- Plotting learning curves
I recommend you create new notebooks (e.g., prefixed with your initials) to run experiments for the report.
- Implement your model in `models/your_model.py`.
- In a new notebook or script:
  - Import `preprocess_movielens`, `SequentialDataset`, `evaluate_model`.
  - Create data loaders.
  - Instantiate your model, optimizer, and loss.
  - Train with a loop similar to the baseline notebook (see the sketch after this list).
  - Call `evaluate_model` after each epoch.
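A minimal sketch of such a loop, assuming a DataLoader that yields `(user_ids, item_ids, labels)` batches and using the `evaluate_ranking_model` signature documented above (swap in `evaluate_model` if that wrapper differs):

```python
import torch
import torch.nn as nn
from evaluation import evaluate_ranking_model

def train(model, train_loader, user_splits, global_items, device, epochs=10, lr=1e-3):
    """BCE training over 1-positive/K-negative batches, with ranking eval each epoch."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # labels: 1 = positive, 0 = sampled negative
    for epoch in range(epochs):
        model.train()
        for user_ids, item_ids, labels in train_loader:
            user_ids, item_ids, labels = (t.to(device) for t in (user_ids, item_ids, labels))
            loss = criterion(model(user_ids, item_ids), labels.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():  # ranking metrics on the held-out splits
            metrics = evaluate_ranking_model(model, user_splits, global_items,
                                             device, candidate_size=100, k=10)
        print(f"epoch {epoch}: {metrics}")
```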
This structure lets you focus on model innovation without rewriting data or eval code.
- Follow PEP 8 style in new modules.
- Write Google-style docstrings for new functions.
- Add unit tests under a `tests/` folder (future).