BayaanX is a Neural Machine Translation (NMT) system that translates Arabic text into English using a Transformer-based sequence-to-sequence model. It is trained on the Helsinki-NLP/tatoeba_mt dataset and includes a preprocessing pipeline, training and inference scripts, and a Streamlit-based web interface for translation.
✅ Preprocessing pipeline to generate training and evaluation data
✅ Custom Vocabulary Builder with tokenization
✅ Transformer-based Encoder-Decoder model
✅ Training script with batch processing and multi-head attention
✅ Inference script for sentence translation
✅ Interactive Web App for real-time translation
```bash
git clone https://github.com/AbsarRaashid3/NMT-Arabic-English.git
cd NMT-Arabic-English
```

Make sure you have Python 3.8+ installed, then install the required packages:

```bash
pip install -r requirements.txt
```
Generate the training and test pairs (the tatoeba_mt dataset ships only validation and test splits, so the validation split serves as training data):

```bash
python src/preprocess.py --output_file data/train_pairs.txt --split validation
python src/preprocess.py --output_file data/test_pairs.txt --split test
```
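For orientation, the preprocessing step amounts to loading the Arabic-English portion of tatoeba_mt and writing tab-separated sentence pairs. The sketch below is an assumption about how `src/preprocess.py` works; the `"ara-eng"` config name and the `sourceString`/`targetString` field names are taken from the Hugging Face dataset card, not from the repo's code:

```python
# Hedged sketch of the preprocessing step (not the repo's exact code):
# load the Arabic-English config of Helsinki-NLP/tatoeba_mt and dump
# one tab-separated "arabic<TAB>english" pair per line.
from datasets import load_dataset

dataset = load_dataset("Helsinki-NLP/tatoeba_mt", "ara-eng", split="validation")

with open("data/train_pairs.txt", "w", encoding="utf-8") as f:
    for example in dataset:
        # Each example holds one Arabic sentence and its English translation.
        f.write(f"{example['sourceString']}\t{example['targetString']}\n")
```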
Build the source and target vocabularies from the training pairs:

```bash
python src/vocab.py --pairs_file data/train_pairs.txt --src_vocab_file src/src_vocab.pkl --tgt_vocab_file src/tgt_vocab.pkl
```
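The vocabulary builder tokenizes both sides of the pairs file and pickles word-to-index mappings. A minimal sketch, assuming whitespace tokenization and the special-token names below (the repo's `src/vocab.py` may use a different layout):

```python
# Hedged sketch of the vocabulary builder: count whitespace tokens on
# each side of the pairs file and pickle word -> index dictionaries.
import pickle
from collections import Counter

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # assumed special tokens

def build_vocab(sentences, min_freq=1):
    counts = Counter(tok for s in sentences for tok in s.split())
    itos = SPECIALS + [w for w, c in counts.most_common() if c >= min_freq]
    return {w: i for i, w in enumerate(itos)}

src_sents, tgt_sents = [], []
with open("data/train_pairs.txt", encoding="utf-8") as f:
    for line in f:
        src, tgt = line.rstrip("\n").split("\t")
        src_sents.append(src)
        tgt_sents.append(tgt)

with open("src/src_vocab.pkl", "wb") as f:
    pickle.dump(build_vocab(src_sents), f)
with open("src/tgt_vocab.pkl", "wb") as f:
    pickle.dump(build_vocab(tgt_sents), f)
```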
Train the model on the generated pairs:

```bash
python src/train.py --pairs_file data/train_pairs.txt --src_vocab_file src/src_vocab.pkl --tgt_vocab_file src/tgt_vocab.pkl --epochs 50 --batch_size 32
```
To translate an Arabic sentence:

```bash
python src/infer.py --model_checkpoint transformer_nmt.pt --src_vocab_file src/src_vocab.pkl --tgt_vocab_file src/tgt_vocab.pkl --input_sentence "يا له من مغامر !"
```
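Inference decodes one target token at a time. A minimal greedy-decoding sketch, assuming a model with a `model(src, tgt)` interface that returns per-position logits and the `<sos>`/`<eos>` token names from the vocabulary sketch above (`src/infer.py` may differ, e.g. by using beam search):

```python
# Hedged sketch of greedy decoding: repeatedly pick the most likely next
# token until <eos> or a length cap. The model interface is assumed.
import torch

def greedy_translate(model, src_ids, tgt_vocab, max_len=50, device="cpu"):
    inv_tgt = {i: w for w, i in tgt_vocab.items()}
    src = torch.tensor([src_ids], device=device)              # (1, src_len)
    ys = torch.tensor([[tgt_vocab["<sos>"]]], device=device)  # running target
    model.eval()
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, ys)                # (1, tgt_len, vocab_size)
            next_id = logits[0, -1].argmax().item()  # most likely next token
            if next_id == tgt_vocab["<eos>"]:
                break
            ys = torch.cat([ys, torch.tensor([[next_id]], device=device)], dim=1)
    return " ".join(inv_tgt[i] for i in ys[0, 1:].tolist())
```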
Launch the Streamlit web interface for translation:

```bash
streamlit run src/app.py
```
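At its core the app is a text box wired to the inference code. A minimal sketch of what `src/app.py` might look like; the `translate` helper imported here is an assumed wrapper around the repo's model and vocabulary loading, not a confirmed function:

```python
# Hedged sketch of the Streamlit app: a text input, a button, and the
# translated output. `translate` is an assumed helper, not the repo's API.
import streamlit as st
from infer import translate  # assumed wrapper around model + vocab loading

st.title("BayaanX: Arabic → English Translation")

arabic_text = st.text_input("Enter an Arabic sentence:")
if st.button("Translate") and arabic_text:
    st.write(translate(arabic_text))
```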
The translation model is based on the Transformer architecture, using multi-head attention and positional encoding. It includes:

- Encoder: multi-layer Transformer encoder
- Decoder: Transformer decoder with attention over the encoder's outputs
- Token embeddings and positional encoding
- Beam search / greedy search for inference

A sketch of this architecture follows the list below.
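A minimal sketch of the encoder-decoder built on `torch.nn.Transformer`, sized to the configuration listed below (2 layers, 256 hidden units). The learned positional embedding and the 512-token length cap are assumptions; the repo may use sinusoidal encodings or different module names:

```python
# Hedged sketch of the model: token embeddings + positional encoding
# feeding a 2-layer Transformer encoder-decoder, then a vocab projection.
import torch
import torch.nn as nn

class TransformerNMT(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size,
                 d_model=256, nhead=8, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab_size, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.pos = nn.Embedding(512, d_model)  # learned positions (assumed)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt):
        pos_s = self.pos(torch.arange(src.size(1), device=src.device))
        pos_t = self.pos(torch.arange(tgt.size(1), device=tgt.device))
        # Causal mask: each target position attends only to earlier ones.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(src.device)
        h = self.transformer(self.src_emb(src) + pos_s,
                             self.tgt_emb(tgt) + pos_t,
                             tgt_mask=mask)
        return self.out(h)  # (batch, tgt_len, tgt_vocab_size)
```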
- Dataset: Helsinki-NLP/tatoeba_mt (Arabic-English)
- Model size: 2-layer Transformer with 256 hidden units
- Optimizer: Adam (lr = 1e-4)
- Loss function: cross-entropy
- Batch size: 32
- Epochs: 50
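Putting the configuration together, the training loop uses teacher forcing with cross-entropy over shifted targets. A minimal sketch under the settings above; the `DataLoader` construction and the padding index are assumptions, and `src/train.py` may differ in its details:

```python
# Hedged sketch of the training loop: Adam (lr=1e-4), cross-entropy
# ignoring padding, teacher forcing on shifted target sequences.
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed index of <pad> in both vocabularies

def train(model, loader, epochs=50, lr=1e-4, device="cpu"):
    model.to(device).train()
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    for epoch in range(epochs):
        total = 0.0
        for src, tgt in loader:  # padded batches, e.g. shape (32, seq_len)
            src, tgt = src.to(device), tgt.to(device)
            # Teacher forcing: feed tgt[:, :-1], predict tgt[:, 1:].
            logits = model(src, tgt[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            optim.zero_grad()
            loss.backward()
            optim.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: loss {total / len(loader):.4f}")
```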
- Hugging Face Datasets – for providing the Tatoeba Arabic-English dataset
- PyTorch – for the deep learning framework
- Streamlit – for the interactive UI


