A package for determining the matrix language in bilingual sentences. It implements the algorithms presented in the paper "Methods of Automatic Matrix Language Determination for Code-Switched Speech". It currently supports English/Mandarin code-switching; create a feature request if you want the system extended to other languages.
The main functionality can be easily installed into your Python environment using pip:
pip install ml-determination
To predict the matrix language, import the library and one of the matrix language determination classes for text:
>>> from ml_determination.predict_matrix_language import MatrixLanguageDeterminerWordMajority
>>> ml = MatrixLanguageDeterminerWordMajority(L1='ZH', L2='EN')
>>> ml.determine_ML('然后 那些 air supply 的 然后 michael learns to rock 的 啊 certain 的 啦')
'EN'
The package includes several implementations of methods for matrix language determination:
- Word majority from Bullock et al. 2018: MatrixLanguageDeterminerWordMajority
- First part of the Morpheme Order Principle from Myers-Scotton 2002, called the singleton principle in Iakovenko 2024: MatrixLanguageDeterminerP11
- Second part of the Morpheme Order Principle as in Iakovenko 2024: MatrixLanguageDeterminerP12
- System Morpheme Principle from Myers-Scotton 2002: MatrixLanguageDeterminerP2
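To make the simplest of these methods concrete, here is a minimal sketch of the word majority idea: count the tokens belonging to each language and return the language with more tokens. This is not the package's implementation. The function names `is_cjk` and `word_majority`, the script-based language identification (any character in the CJK Unified Ideographs block counts as Mandarin), whitespace tokenization, and returning `None` on a tie are all assumptions made for illustration. The package may tokenize or identify languages differently, so its output can differ from this sketch.

```python
from collections import Counter


def is_cjk(token: str) -> bool:
    """Return True if any character is in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def word_majority(sentence: str, l1: str = "ZH", l2: str = "EN"):
    """Return the label of the language with more tokens, or None on a tie.

    Hypothetical helper: tokens containing CJK characters are counted as l1,
    every other whitespace-separated token is counted as l2.
    """
    counts = Counter(l1 if is_cjk(tok) else l2 for tok in sentence.split())
    if counts[l1] == counts[l2]:
        return None  # assumption: a tie is left undecided
    return l1 if counts[l1] > counts[l2] else l2


# Three Mandarin tokens against one English token
print(word_majority("我 想 去 shopping"))
# Four English tokens against one Mandarin token
print(word_majority("i like 火锅 very much"))
```

Word majority only looks at token counts, whereas the other principles in the list above rely on morpheme order or system morphemes, which is why they can disagree with it on the same sentence.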
MatrixLanguageDeterminerP12 requires trained language models in order to rescore code-switched sentences. To download the trained models used in the experiments of Iakovenko 2024, clone the following repository:
cd /your/model/folder
git clone https://huggingface.co/dinoyay/ml-determination-lms
Then you can determine the matrix language using P1.2:
>>> from ml_determination.predict_matrix_language import MatrixLanguageDeterminerP12
>>> config = {
...     'EN': {
...         'data_path': '/your/model/folder/ml-determination-lms/en/',
...         'model_path': '/your/model/folder/ml-determination-lms/en/model.pt'
...     },
...     'ZH': {
...         'data_path': '/your/model/folder/ml-determination-lms/zh/',
...         'model_path': '/your/model/folder/ml-determination-lms/zh/model.pt'
...     }
... }
>>> ml = MatrixLanguageDeterminerP12(L1='ZH', L2='EN', config=config, alpha=1.2765)
>>> ml.determine_ML('然后 那些 air supply 的 然后 michael learns to rock 的 啊 certain 的 啦')
'ZH'
If you use ml_determination in your projects, please cite the original EMNLP paper as follows:
@inproceedings{iakovenko-hain-2024-methods,
    title = "Methods of Automatic Matrix Language Determination for Code-Switched Speech",
    author = "Iakovenko, Olga and Hain, Thomas",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.330/",
    doi = "10.18653/v1/2024.emnlp-main.330",
    pages = "5791--5800"
}
This code was created with the support of the Engineering and Physical Sciences Research Council.