LeeLanguageLab/URIELPlus-ProxyLM

Using ProxyLM Regressor with URIEL+

By Mason Shipton, David Anugraha, York Hay Ng


About ProxyLM

A framework for LM performance prediction

Abstract

Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating computational costs associated with model capacity and data for fine-tuning. Our paper introduces ProxyLM, a scalable framework for predicting LM performance using proxy models in multilingual tasks. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging proxy models, ProxyLM significantly reduces computational overhead on task evaluations, achieving up to a 37.08x speedup compared to traditional methods, even with our smallest proxy models. Additionally, our methodology showcases adaptability to previously unseen languages in pre-trained LMs, outperforming the state-of-the-art performance by 1.89x as measured by root-mean-square error (RMSE). This framework streamlines model selection, enabling efficient deployment and iterative LM enhancements without extensive computational resources.

If you are interested in more information, check out the full paper.

If you use this code for your research, please cite the following work:

@inproceedings{anugraha-etal-2025-proxylm,
    title = "{P}roxy{LM}: Predicting Language Model Performance on Multilingual Tasks via Proxy Models",
    author = "Anugraha, David  and
      Winata, Genta Indra  and
      Li, Chenyue  and
      Irawan, Patrick Amadeus  and
      Lee, En-Shiun Annie",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.106/",
    pages = "1981--2011",
    ISBN = "979-8-89176-195-7",
    abstract = "Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating computational costs associated with model capacity and data for fine-tuning. Our paper presents ProxyLM, a scalable task- and language-agnostic framework designed to predict the performance of LMs using proxy models. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging these proxy models, ProxyLM significantly reduces computational overhead in task evaluations, achieving up to a 37.08x speedup over traditional methods, even with our smallest proxy models. Our results across multiple multilingual NLP tasks and various robustness tests demonstrate that ProxyLM not only adapts well to previously unseen languages in pre-trained LMs, but also generalizes effectively across different datasets, outperforming the state-of-the-art by at least 1.78x in terms of root-mean-square error (RMSE)."
}

If you have any questions, you can open a GitHub Issue or send the authors an email.

About URIEL+

A knowledge base for natural language processing

Abstract

URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.

If you are interested in more information, check out the full paper.

If you use this code for your research, please cite the following work:

@inproceedings{khan-etal-2025-uriel,
    title = "{URIEL}+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base",
    author = {Khan, Aditya  and
      Shipton, Mason  and
      Anugraha, David  and
      Duan, Kaiyao  and
      Hoang, Phuong H.  and
      Khiu, Eric  and
      Do{\u{g}}ru{\"o}z, A. Seza  and
      Lee, En-Shiun Annie},
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.463/",
    pages = "6937--6952",
    abstract = "URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies."
}

If you have any questions, you can open a GitHub Issue or send the authors an email.

Check out ExploRIEL, the online UI for URIEL+: https://uriel-leelab.streamlit.app/

Environment

Requires Python 3.10 or later.

All dependencies are listed in the requirements/ folder.

Running ProxyLM Regressor

1. Distance Calculation

Run the following script to calculate URIEL+ distances:

python distances/calculate_distances.py

This will create two CSV files containing distances for the MT560 and NUSA language datasets.

Output files will be saved to the distances/ folder.
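
The exact filenames and columns that calculate_distances.py produces are not documented here, but the general shape of a distances CSV can be sketched as follows. Everything in this example (column names, language codes, values) is hypothetical and only illustrates the one-row-per-language-pair layout:

```python
import csv
import io

# Hypothetical layout for a distances CSV: one row per language pair,
# one column per URIEL+ distance type. The real files' columns may differ.
sample = io.StringIO()
writer = csv.DictWriter(
    sample,
    fieldnames=["source_lang", "target_lang", "genetic", "geographic", "syntactic"],
)
writer.writeheader()
writer.writerow({"source_lang": "eng", "target_lang": "ind",
                 "genetic": 1.0, "geographic": 0.81, "syntactic": 0.62})

# Reading the file back: csv returns every field as a string.
sample.seek(0)
rows = list(csv.DictReader(sample))
print(rows[0]["syntactic"])  # prints "0.62"
```

Remember to convert distance fields to float before feeding them to a regressor, since the csv module does not infer numeric types.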


2. Updating Experiment CSVs

After calculating distances, run:

python distances/replace_distances.py

This updates the experiment CSV files for MT560 and NUSA with URIEL+ distances.

Updated experiment CSVs will be saved to src/proxy_regressor/csv_datasets/.

📄 Note: To add a new distance type, follow the same format used for morphological distance in distances/replace_distances.py.
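
The core of this update step is a join: each experiment row is matched to its language pair and given the newly computed distance as an extra column. The sketch below is a simplified, in-memory illustration of that idea; the key and column names (`source_lang`, `inventory`, etc.) are hypothetical and not the repo's actual schema:

```python
# Newly computed distances, keyed by (source, target) language pair.
# "inventory" stands in for whatever new distance type you add.
new_distances = {
    ("eng", "ind"): 0.45,
    ("eng", "jav"): 0.57,
}

# Rows from an experiment CSV (normally loaded with the csv module).
experiment_rows = [
    {"source_lang": "eng", "target_lang": "ind", "genetic": 1.0},
    {"source_lang": "eng", "target_lang": "jav", "genetic": 1.0},
]

# Attach the new distance column to each row by language pair.
for row in experiment_rows:
    key = (row["source_lang"], row["target_lang"])
    row["inventory"] = new_distances[key]

print(experiment_rows[0]["inventory"])  # prints 0.45
```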


3. Changing Language Features

If you add or remove language features (e.g., introducing a new feature type), open src/proxy_regressor/utils.py and update the LANG_FEATURES list to include or exclude the appropriate language features.
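
Conceptually, the edit is just keeping a Python list in sync with the distance columns in your CSVs. The entries below are placeholders, not the real contents of LANG_FEATURES in utils.py:

```python
# Placeholder stand-in for the LANG_FEATURES list in src/proxy_regressor/utils.py.
# The real entries differ; the point is that any distance type added to (or
# dropped from) the experiment CSVs must be mirrored here so the regressor
# builds the matching feature columns.
LANG_FEATURES = [
    "genetic",
    "geographic",
    "syntactic",
    "morphological",
]

LANG_FEATURES.append("inventory")   # adding a new distance type
LANG_FEATURES.remove("geographic")  # or dropping one no longer used

print(LANG_FEATURES)
```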

4. Running Experiments

MT560 Experiments
  • Random Sampling (M2M100):

    python -m src.proxy_regressor.main -em random -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_mt560_m2m100.json -d mt560 -m m2m100
  • Random Sampling (NLLB):

    python -m src.proxy_regressor.main -em random -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_mt560_nllb.json -d mt560 -m nllb
  • Leave-One-Language-Out (LOLO) (M2M100):

    python -m src.proxy_regressor.main -em lolo -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_mt560_m2m100.json -d mt560 -m m2m100 -l all
  • Leave-One-Language-Out (LOLO) (NLLB):

    python -m src.proxy_regressor.main -em lolo -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_mt560_nllb.json -d mt560 -m nllb -l all
  • Seen/Unseen (M2M100):

    python -m src.proxy_regressor.main -em seen_unseen -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_mt560_m2m100.json -d mt560 -m m2m100

    After running the Seen/Unseen (M2M100) command, run:

    python unseen.py

    This outputs a text file with more readable results, along with the average standard error. NOTE: For Seen/Unseen (M2M100) experiments, use the average of test_source_rmse and test_target_rmse as the test_rmse.


NUSA Experiments
  • Random Sampling (M2M100):

    python -m src.proxy_regressor.main -em random -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_nusa_m2m100.json -d nusa -m m2m100
  • Random Sampling (NLLB):

    python -m src.proxy_regressor.main -em random -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_nusa_nllb.json -d nusa -m nllb
  • Leave-One-Language-Out (LOLO) (M2M100):

    python -m src.proxy_regressor.main -em lolo -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_nusa_m2m100.json -d nusa -m m2m100 -l all
  • Leave-One-Language-Out (LOLO) (NLLB):

    python -m src.proxy_regressor.main -em lolo -r xgb -rj src/proxy_regressor/regressor_configs/xgb_config_nusa_nllb.json -d nusa -m nllb -l all

📄 Note: After each experiment finishes, results are automatically saved to a .csv file. Extract the test RMSE and test SE from the CSV (you may need to average them across individual languages). Lower values indicate better performance, as RMSE measures error.
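
Averaging per-language results into a single test RMSE and test SE can be sketched as below. The column names and values are hypothetical stand-ins for rows read from the output CSV:

```python
import statistics

# Hypothetical per-language rows as they might appear in the results CSV;
# the real file's column names and languages may differ.
results = [
    {"lang": "ind", "test_rmse": 4.2, "test_se": 0.31},
    {"lang": "jav", "test_rmse": 5.8, "test_se": 0.45},
    {"lang": "sun", "test_rmse": 5.2, "test_se": 0.38},
]

avg_rmse = statistics.mean(r["test_rmse"] for r in results)
avg_se = statistics.mean(r["test_se"] for r in results)
print(f"average test RMSE: {avg_rmse:.2f}, average test SE: {avg_se:.2f}")
```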

Optional

5. Determining Statistical Significance

You can test the statistical significance of differences between results obtained with URIEL, URIEL+, or other URIEL versions.

Steps:

  1. Open test.py and update the parameters at line 19 to point to the correct experiment.

  2. Run:

    python test.py

    This will save the Y_test results from the experiment to a text file.

  3. Y_pred results from the experiment are saved in a file named {dataset_name}_{model_name}_Y_pred_results.txt.
    Copy both the Y_test and Y_pred values into statistical.py under the correct experiment section.

  4. Run:

    python statistical.py

    This will output the p-value measuring the statistical significance between the different URIEL results.
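
The exact test implemented in statistical.py is not documented here, but one standard way to compare two systems evaluated on the same test set is a paired sign-flip permutation test on their per-example squared errors. The sketch below illustrates that idea on made-up Y_test/Y_pred values; it is an assumed alternative, not necessarily the repo's actual test:

```python
import random

# Made-up predictions from two feature sets (e.g. URIEL vs. URIEL+)
# on the same six test examples.
y_test = [10.0, 12.5, 8.0, 15.0, 9.5, 11.0]
y_pred_uriel = [11.0, 14.0, 9.5, 13.0, 8.0, 12.5]
y_pred_urielplus = [10.5, 12.0, 8.5, 14.5, 9.0, 11.5]

def sq_errors(y_true, y_pred):
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Paired differences of squared errors; their sum is the test statistic.
diffs = [a - b for a, b in zip(sq_errors(y_test, y_pred_uriel),
                               sq_errors(y_test, y_pred_urielplus))]
observed = sum(diffs)

# Under the null hypothesis the sign of each paired difference is arbitrary,
# so randomly flip signs and count statistics at least as extreme.
random.seed(0)
n_perm = 10_000
extreme = sum(
    1 for _ in range(n_perm)
    if abs(sum(d * random.choice((-1, 1)) for d in diffs)) >= abs(observed)
)
p_value = (extreme + 1) / (n_perm + 1)  # two-sided, with add-one smoothing
print(f"p-value: {p_value:.3f}")
```

A small p-value (conventionally below 0.05) indicates the gap between the two URIEL variants is unlikely to be due to chance on this test set.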
