This project implements a translation model using the Transformer architecture, based on the groundbreaking paper "Attention Is All You Need" (Vaswani et al., 2017). The implementation focuses on English-to-French translation while offering an easy-to-understand PyTorch implementation of the architecture.
The Transformer architecture revolutionized natural language processing by eliminating the need for recurrent or convolutional neural networks, instead relying entirely on attention mechanisms to capture relationships between words. This implementation showcases three key innovations:
- Multi-Head Self-Attention: Allowing the model to simultaneously attend to information from different representation subspaces
- Encoder-Decoder Architecture: Processing the input sequence and generating the output sequence using stacked attention layers
- Positional Encoding: Incorporating sequence order information without recurrence
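As an illustration of the last point, here is a minimal sketch of the sinusoidal positional encoding described in the paper. The class name and tensor shapes are illustrative and may differ from what `model.py` actually uses.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position information to token embeddings (illustrative sketch)."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add encodings for the first seq_len positions
        return x + self.pe[: x.size(1)]
```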
```bash
git clone https://github.com/yourusername/transformer-translation.git
cd transformer-translation
pip install -r requirements.txt
```
- Tokenization (`tokenisers.py`):
  - Word-level tokenization with special tokens (PAD, UNK, START, END)
  - Vocabulary creation with frequency-based filtering
  - Text encoding and decoding utilities
- Transformer Architecture (`model.py`):
  - Multi-head attention implementation with separate query, key, and value projections (see the sketch after this list)
  - Encoder and decoder stacks with residual connections
  - Position-wise feed-forward networks
  - Positional encoding implementation
- Training Pipeline (`train.py`):
  - Custom dataset class for handling parallel text data
  - Training loop with learning rate scheduling
  - Validation and model checkpointing
  - Generation utilities for inference
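For readers who want a feel for the attention component before opening `model.py`, below is a minimal, self-contained sketch of multi-head self-attention with separate query, key, and value projections. Names and shapes are illustrative and are not guaranteed to match the repository code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention computed in parallel over several heads (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape
        # Project and reshape to (batch, heads, seq_len, d_head)
        q = self.q_proj(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention scores
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(F.softmax(scores, dim=-1))
        # Concatenate heads and project back to d_model
        out = (attn @ v).transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(out)
```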
We have also provided a notebook, Transformer_Translation.ipynb, which explains how the model and the training procedure work.
The default model configuration includes:
- 6 encoder layers
- 3 pre-cross-attention decoder layers
- 3 cross-attention decoder layers
- 8 attention heads
- 256 embedding dimensions
- Dropout rate of 0.1
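As a rough sketch of how these defaults might be expressed with the `TransformerConfig` class used in the inference example further down, here is a hypothetical configuration object; the actual field names in `model.py` may differ.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Hypothetical field names mirroring the defaults listed above
    n_encoder_layers: int = 6
    n_pre_cross_decoder_layers: int = 3
    n_cross_decoder_layers: int = 3
    n_heads: int = 8
    d_model: int = 256
    dropout: float = 0.1

config = TransformerConfig()  # default configuration
```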
To train the model:

```bash
python train.py
```

The training script includes:
- Dynamic learning rate adjustment
- Gradient clipping
- Model checkpointing
- Validation monitoring
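The following is a minimal sketch of how such a loop can combine a learning-rate scheduler with gradient clipping; the actual `train.py` may be organised differently, and the model call signature and data loader used here are placeholders.

```python
import torch

def train_one_epoch(model, loader, optimizer, scheduler, criterion, device, clip_norm=1.0):
    """One training epoch with gradient clipping and per-step learning-rate scheduling (sketch)."""
    model.train()
    total_loss = 0.0
    for src_ids, tgt_ids in loader:
        src_ids, tgt_ids = src_ids.to(device), tgt_ids.to(device)
        optimizer.zero_grad()
        # Teacher forcing: predict token t+1 from target tokens up to t (assumed model signature)
        logits = model(src_ids, tgt_ids[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # gradient clipping
        optimizer.step()
        scheduler.step()                                               # dynamic learning rate adjustment
        total_loss += loss.item()
    return total_loss / len(loader)
```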
The model expects parallel text data in CSV format with columns for source (English) and target (French) sentences. The data should be preprocessed to:
- Convert text to lowercase
- Add appropriate spacing around punctuation
- Remove special characters
- Normalize whitespace
A suitable example dataset can be found at https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench.
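Here is a minimal sketch of the kind of preprocessing described above; the regular expressions are illustrative and not taken from the repository.

```python
import re

def preprocess(text: str) -> str:
    """Lowercase, pad punctuation with spaces, drop special characters, normalize whitespace."""
    text = text.lower()
    text = re.sub(r"([.!?,;:])", r" \1 ", text)                           # add spacing around punctuation
    text = re.sub(r"[^a-zàâçéèêëîïôûùüÿœæ0-9.!?,;:'\s-]", " ", text)      # remove special characters
    text = re.sub(r"\s+", " ", text).strip()                              # normalize whitespace
    return text

print(preprocess("Hello, world!  How are you?"))  # -> "hello , world ! how are you ?"
```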
To translate text using a trained model:

```python
import torch
from model import Transformer, TransformerConfig

# Load model and tokenizers
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = TransformerConfig()  # use the same settings the checkpoint was trained with
model = Transformer(config).to(device)
model.load_state_dict(torch.load('models/best_model.pt')['model_state_dict'])
model.eval()

# Generate translation
# encoded_input: tensor of source token ids produced by the tokenizer
translated_ids = model.generate(
    src_ids=encoded_input,
    max_new_tokens=128,
    temperature=1.0,
    top_k=50
)
```

The implementation includes several optimizations:
- Parallel computation in multi-head attention
- Efficient batch processing of sequences
- Memory-efficient attention masking
- Gradient clipping for stable training
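As an example of the masking point, here is a sketch of how padding and causal masks are often built and combined for the decoder; the mask shapes assumed here (broadcast over attention heads) may differ from the repository's convention.

```python
import torch

def build_masks(src_ids, tgt_ids, pad_id=0):
    """Build a padding mask for the source and a combined causal+padding mask for the target (sketch)."""
    # Source padding mask: (batch, 1, 1, src_len), broadcast over heads and query positions
    src_mask = (src_ids != pad_id).unsqueeze(1).unsqueeze(2)
    # Causal mask: (tgt_len, tgt_len), lower-triangular so position t only attends to positions <= t
    tgt_len = tgt_ids.size(1)
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt_ids.device))
    # Target mask combines padding and causality via broadcasting
    tgt_mask = (tgt_ids != pad_id).unsqueeze(1).unsqueeze(2) & causal
    return src_mask, tgt_mask
```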
- Python 3.7+
- PyTorch 1.7+
- NumPy
- tqdm
Additional dependencies can be found in requirements.txt.
If you use this implementation in your research, please cite:
@article{vaswani2017attention,
title={Attention is all you need},
author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia},
journal={Advances in neural information processing systems},
volume={30},
year={2017}
}