This repository contains the implementation and resources for our research on tokenization techniques for code-mixed Indian languages. We evaluate three tokenization approaches (BPE, WordPiece, and SentencePiece) across three code-mixed language pairs (Tanglish, Kanglish, and Hinglish).
Code-mixed languages, characterized by the blending of two or more languages in a single utterance, pose significant challenges in natural language processing (NLP). Tokenization plays a crucial role in handling such linguistically complex data effectively.
In this study, we comprehensively evaluate three tokenization techniques—WordPiece, SentencePiece, and Byte Pair Encoding (BPE)—across three distinct code-mixed languages: Tanglish, Kanglish, and Hinglish. By leveraging a dataset collected from YouTube video transcriptions and transliterated using AI4Bharat's Bhashini, we systematically determine the optimal tokenizer for each language based on loss function analysis. The results provide nuanced insights into the efficiency of different tokenization strategies for diverse linguistic structures in code-mixed text, contributing to the advancement of multilingual NLP research.
- Python 3.8+
- PyTorch 1.8+
- Transformers 4.10+
- Pandas
- NumPy
- Matplotlib
- tqdm
Install dependencies:
```shell
pip install -r requirements.txt
```
```
├── Co-BERT.ipynb    # Main notebook containing all experiments and analysis
├── Dataset/         # Dataset folder containing three code-mixed datasets
│   ├── hinglish.txt # Hindi-English code-mixed data
│   ├── kanglish.txt # Kannada-English code-mixed data
│   └── tanglish.txt # Tamil-English code-mixed data
├── README.md        # Project documentation
└── LICENSE          # MIT License
```
- Clone this repository:
  ```shell
  git clone https://github.com/madhans476/Co-BERT.git
  cd Co-BERT
  ```
- Open and run the Jupyter notebook:
  ```shell
  jupyter notebook Co-BERT.ipynb
  ```
- Follow the notebook to:
- Preprocess and explore code-mixed datasets
- Implement different tokenization techniques
- Train and evaluate models
- Compare tokenization performance
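As one illustration of the techniques the notebook implements, the core BPE procedure (iteratively merging the most frequent symbol pair) can be sketched in plain Python. This is a minimal, self-contained version for illustration only, not the notebook's actual implementation:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-tokenized corpus.

    Minimal sketch of the BPE algorithm; real tokenizer libraries
    add normalization, special tokens, and byte-level fallbacks.
    """
    # Each word becomes a tuple of characters plus an end-of-word marker,
    # weighted by its frequency in the corpus.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the current vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```

WordPiece and SentencePiece follow the same subword idea but score candidate merges differently (likelihood-based rather than raw frequency), which is why the three tokenizers segment the same code-mixed text differently.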
Our comparative analysis of tokenization techniques across the code-mixed languages reveals distinct performance patterns. The table below reports the final loss for each tokenizer–language pair (lower is better):
| Tokenizer | Tanglish | Kanglish | Hinglish |
|---|---|---|---|
| BPE | 7.20 | 7.32 | 7.18 |
| WordPiece | 7.43 | 7.42 | 7.06 |
| SentencePiece | 7.73 | 7.73 | 7.56 |
Key findings:
- WordPiece demonstrates superior performance for Hinglish with the lowest loss of 7.06
- BPE achieves the best results for Tanglish with a loss of 7.20
- BPE also achieves the best result for Kanglish (7.32 vs. WordPiece's 7.42)
- SentencePiece consistently shows higher loss values across all language pairs
Co-BERT adapts the BERT architecture with the following modifications:
- 6 transformer encoder layers (vs. original BERT's 12)
- Hidden size of 768 with 12 attention heads
- Vocabulary size of 2000
- Maximum sequence length of 128
- Masked Language Modeling and Next Sentence Prediction pre-training tasks
- AdamW optimizer with learning rate 5e-5
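Under the assumptions listed above, the Co-BERT setup can be sketched with Hugging Face Transformers. Parameter names follow the library's `BertConfig`; the notebook's exact hyperparameters and training loop may differ:

```python
import torch
from transformers import BertConfig, BertForPreTraining

# Sketch of the Co-BERT configuration described above.
config = BertConfig(
    vocab_size=2000,              # small subword vocabulary
    hidden_size=768,
    num_hidden_layers=6,          # half of BERT-base's 12 layers
    num_attention_heads=12,
    max_position_embeddings=128,  # maximum sequence length
)

# BertForPreTraining bundles the MLM and NSP heads used in pre-training.
model = BertForPreTraining(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```

Because only the vocabulary size and layer count change, the same configuration can be reused for all three language pairs, with only the tokenizer swapped out per experiment.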

Our research leveraged YouTube video transcripts as the primary source of linguistic data. The processing pipeline consisted of four stages:
- Collection of native language transcripts (Tamil, Hindi, Kannada)
- Transliteration using AI4Bharat's Bhashini
- Manual correction of transliteration errors
- Preprocessing for model training
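The final preprocessing stage might look like the following. This is a hypothetical sketch for romanized (transliterated) text; the notebook's actual cleaning rules may differ:

```python
import re

def clean_transcript(text):
    """Minimal cleaning sketch for transliterated transcript lines.

    Hypothetical steps: lowercase, strip URLs, keep only romanized
    letters, and collapse whitespace.
    """
    text = text.lower()
    text = re.sub(r"http\S+", "", text)    # drop URLs left in transcripts
    text = re.sub(r"[^a-z\s]", " ", text)  # keep romanized letters only
    return re.sub(r"\s+", " ", text).strip()
```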
This project is licensed under the MIT License - see the LICENSE file for details.