Shopping MMLU

This is the repository for the Gaudi version of 'Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models', which was accepted to the NeurIPS 2024 Datasets and Benchmarks Track and used for the Amazon KDD Cup 2024. Shopping MMLU is a massive multi-task benchmark for LLMs on online shopping, covering four major shopping skills: shopping concept understanding, shopping knowledge reasoning, user behavior alignment, and multi-lingual abilities.

You can find more detailed information about the dataset at the following links:

  • The paper and supplementary materials here.
  • The workshop for our KDD Cup challenge and the winning solutions here.
  • The HuggingFace leaderboard here.

Repo Organization

.
├── data:                You will need to create this folder and put the evaluation data in it.
├── skill_wise_eval:     This folder contains code for evaluating a skill as a whole.
├── task_wise_eval:      This folder contains code for evaluating a single task.
└── README.md

Data

Where to download?

The zip file data.zip contains all of the data in Shopping MMLU. Create a new folder named data and unzip the archive into it.

Data formats

We have five different types of tasks: multiple choice, retrieval, ranking, named entity recognition, and generation.

Files for multiple choice questions are organized in .csv format with three columns.

  • question: The text of the question.
  • choices: The four possible choices for the question.
  • answer: The answer index (0, 1, 2, or 3), indicating that the correct answer is choices[answer].

Files for other types of tasks are organized in .json format with two fields, input_field and target_field.
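
To make the formats concrete, here is a minimal sketch of reading one file of each type. The file paths, and the assumptions that the choices column is a stringified Python list and that the .json files hold a list of examples, are illustrative guesses rather than guarantees about the shipped data.

# Hedged sketch: reading the two file formats described above.
# Paths are hypothetical; point them at whatever files data.zip actually unpacks.
import ast
import json
import pandas as pd

# Multiple choice: .csv with question / choices / answer columns.
mc = pd.read_csv("data/multiple_choice/asin_compatibility.csv")  # hypothetical path
row = mc.iloc[0]
choices = ast.literal_eval(row["choices"])  # assumption: choices stored as a stringified list
print(row["question"])
print("correct answer:", choices[int(row["answer"])])

# Other task types: .json with input_field and target_field.
with open("data/generation/example_task.json") as f:  # hypothetical path
    examples = json.load(f)  # assumption: a list of {"input_field": ..., "target_field": ...}
print(examples[0]["input_field"], "->", examples[0]["target_field"])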

Running evaluations

Setup

First, let's set up the Docker container for running on the Gaudi chips. This requires the correct version of the Synapse drivers (Habana is the name of the team that created both the Gaudi chips and the Synapse drivers). We recommend always working inside a Docker container for this. The command below creates a new container with Synapse version 1.21.1 and a matching PyTorch 2.6.0 build for Gaudi, with the ~/ShoppingMMLU directory mounted as /shopping and all of the host machine's Gaudi chips visible. You can't mix and match PyTorch versions: PyTorch is modified to run well on Gaudi chips, so the Synapse and PyTorch versions have to be in sync. Update -v ~/ShoppingMMLU:/shopping to point to wherever you cloned this repo, mounting it as /shopping.

docker run -d -it --runtime=habana --name shopping -v ~/ShoppingMMLU:/shopping -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host --net=host -e HF_HOME=/data/huggingface vault.habana.ai/gaudi-docker/1.21.1/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest /bin/bash

docker exec -it shopping /bin/bash

Note that, unlike other vendors' environments, the HABANA_VISIBLE_DEVICES setting only takes effect at container creation time; prepending HABANA_VISIBLE_DEVICES to a Python command will have no effect. If you want to use fewer than "all" of the cards, replace that value with, for example, HABANA_VISIBLE_DEVICES=0,1,2,3. You can use all eight cards, two sets of four cards (either 0,1,2,3 or 4,5,6,7), four sets of two cards (0,1 and 2,3, etc.), or eight separate cards.

Then inside your docker container you will want to go to /shopping and install the requirements:

cd /shopping
pip install -r requirements.txt

Then you will need to set up the data for testing:

mkdir data
unzip data.zip -d data/

Note that the requirements.txt file pins the correct library versions for PyTorch 2.6.0. If you are using a different version of the Docker container, you might need to adjust the versions in requirements.txt.

Synapse versions are generally backwards compatible across a reasonable range: the host OS Synapse version can be several versions ahead of the Docker image and everything will still work. Be wary of the reverse, though; a Docker image that is a Synapse version ahead of the host OS will often lead to trouble.

Evaluation on a Single Task

Suppose you want to evaluate the meta-llama/Meta-Llama-3-8B model on the multiple_choice task asin_compatibility. You can do so with the following steps:

cd task_wise_eval/
python3 hf_multi_choice.py --test_subject asin_compatibility --model_name llama3-8b
# The 'model_name' argument should be set according to 'utils.py'.
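
The model_name values are short aliases that utils.py resolves to HuggingFace checkpoints. The mapping below is a hypothetical illustration built from the models mentioned elsewhere in this README, not the actual contents of utils.py:

# Hypothetical illustration of an alias -> checkpoint mapping; see utils.py for the real one.
MODEL_NAME_MAP = {
    "llama3-8b": "meta-llama/Meta-Llama-3-8B",          # single-task and FP8 examples
    "llama3-70b": "meta-llama/Llama-3.1-70B-Instruct",  # larger-model example
}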

Other tasks and task types follow similar processes. There is also a docker_compose.yml file that will set this up and run it automatically (docker compose -f docker_compose.yml up). You will need to add a HuggingFace token with access to the Llama3-8B model so the model can be downloaded, and update the volume path as necessary.

FP8 Evaluation

Suppose you want to evaluate the meta-llama/Meta-Llama-3-8B model on the multiple_choice task asin_compatibility with FP8. You can do so with the following steps:

cd task_wise_eval/
python3 hf_multi_choice.py --test_subject asin_compatibility --model_name llama3-8b --quant_config quant_config/maxabs_measure.json
python3 hf_multi_choice.py --test_subject asin_compatibility --model_name llama3-8b --quant_config quant_config/maxabs_quant.json
# The 'model_name' argument should be set according to 'utils.py'.

Running with the maxabs_measure.json file instruments the model so that it captures the measurements needed for FP8 quantization and saves them to the ./inc_output/measure directory. Running again with the maxabs_quant.json file then applies the FP8 quantization recorded in the previous run and evaluates the tasks' accuracy as normal. At the moment hf_multi_choice.py is the only task with the --quant_config flag added, but it is easy to add to the other tasks if desired. There is also a docker_compose_fp8.yml file that will set this up and run it automatically. You will need to add a HuggingFace token with access to the Llama3-8B model so the model can be downloaded, and update the volume path as necessary.

Larger Models

Suppose you want to evaluate the meta-llama/Llama-3.1-70B-Instruct model, which is too large to evaluate on a single Gaudi3 card. To do that, use the following steps:

cd task_wise_eval/
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0
deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 hf_multi_choice.py --test_subject asin_compatibility --model_name llama3-70b --deepspeed
# The 'model_name' argument should be set according to 'utils.py'.

The DeepSpeed launcher will use all 8 Gaudi cards on the local node, communicating on port 29500, and call the hf_multi_choice.py script with all of the flags that follow. The --deepspeed flag is there for convenience. At the moment hf_multi_choice.py is the only task with the --deepspeed flag added, but it is easy to add to the other tasks if desired. There is also a docker-compose_multi.yml file that will set this up and run it automatically. You will need to add a HuggingFace token with access to the Llama3-70B model so the model can be downloaded, and update the volume path as necessary.
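
Because the DeepSpeed launcher starts one Python process per card, all running identical code, one-time work such as downloading the model or printing results is typically guarded behind the local rank (see item 3 under Code Conversion below). The sketch that follows shows the general pattern; the environment variable and barrier usage are assumptions about how the repo does it, not a copy of its code.

# Hedged sketch: run one-time I/O in a single process when launched by deepspeed.
import os

import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # typically set by the deepspeed launcher

if local_rank == 0:
    # Only the first process downloads models, prints results, writes files, etc.
    print("rank 0: performing one-time setup")

if dist.is_available() and dist.is_initialized():
    dist.barrier()  # the other ranks wait here until rank 0 has finished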

Evaluation on a Skill as a whole

Suppose you want to evaluate the Llama3-8B model on the skill skill1_concept. You can do so with the following steps:

cd skill_wise_eval/
python3 hf_skill_inference.py --model_name llama3-8b --filename skill1_concept --output_filename <your_filename>
# After inference, the output file will be saved at `skill_inference_results/skill1_concept/llama3-8b_<your_filename>.json`.
python3 skill_evaluation.py --data_filename skill1_concept --output_filename llama3-8b_<your_filename>
# After evaluation, the metrics will be saved at `skill_metrics/skill1_concept/llama3-8b_<your_filename>_metrics.json`.

Other skills follow the same process. There are FP8 and larger-model adaptations of this code as well, along with docker-compose files that automate running them.
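
The resulting metrics file is plain JSON, so you can inspect it directly. A small sketch, where the filename stands in for whatever --output_filename you used above:

# Hedged sketch: print the metrics produced by skill_evaluation.py.
# "myrun" is a placeholder for your own --output_filename choice.
import json

with open("skill_metrics/skill1_concept/llama3-8b_myrun_metrics.json") as f:
    metrics = json.load(f)

print(json.dumps(metrics, indent=2))  # the exact structure depends on the skill's tasks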

Dependencies

Our evaluation code is based on HuggingFace transformers with the following dependencies.

transformers==4.49.0
pandas==2.0.3
evaluate==0.4.1
sentence_transformers==3.2.0
rouge_score==0.1.2
accelerate==0.34.2
neural-compressor[pt]==3.3.1
sacrebleu==2.4.1
sacrebleu[jp]

Code Conversion

To convert code from regular PyTorch to run on Gaudi chips, you need to make the following changes.

  1. The GPU Migration Toolkit (documented here: GPU Migration Toolkit) makes it easy to adapt existing code to run on Gaudi. In this case, the first commit adapted all of the code to at least run on Gaudi chips.

  2. To add support for FP8 quantization, you will need the Intel Neural Compressor. You can see the changes necessary to add support to a single file in this commit. Three functions are needed: prepare, finalize_quantization, and convert. prepare() wraps a model so that it tracks what quantization is necessary, finalize_quantization() saves the results of running that wrapped model on some sample data to a quantization specification file, and convert() applies the saved quantization file (measured by prepare() and written out by finalize_quantization()) to an actual model to get actual results. A hedged sketch of this flow appears after this list.

  3. Finally, supporting models too large to fit onto a single card requires the DeepSpeed tool in order to run across multiple cards; it is also useful if you want to speed up computation by using the compute power of multiple cards. As you can see from the main commit, there is a lot of boilerplate code, most of it unrelated to the actual multi-card running. The problem is that the DeepSpeed architecture runs a separate Python process, all executing identical code, for each Gaudi card we are trying to run on, and we don't want to run some things (most importantly, I/O tasks like downloading models and printing out results) in every process. So anything that isn't running on the 0th logical card has to be suppressed from doing those tasks. That requires some ugly code; for example, we can't rely on the standard HuggingFace library call to download a model, and instead have to pull that code into our own so we can ensure it is done exactly once.
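
As a rough illustration of item 2, here is a minimal sketch of the measure-then-quantize flow. The function names follow the description above; the import path, the fallback name, and the way the quantization config is passed are assumptions about the Intel Neural Compressor API and may not match the repository's code.

# Hedged sketch of the two-pass FP8 flow; names and signatures are assumptions.
try:
    from neural_compressor.torch.quantization import convert, finalize_quantization, prepare
except ImportError:
    # Some Intel Neural Compressor releases name the final step finalize_calibration instead.
    from neural_compressor.torch.quantization import convert, prepare
    from neural_compressor.torch.quantization import finalize_calibration as finalize_quantization

def run_fp8_flow(model, calibration_batches, quant_config):
    """Sketch only: quant_config stands in for the JSON files under quant_config/;
    the real API builds a config object from that file rather than taking the path directly."""
    if "measure" in quant_config:
        # Measurement pass (maxabs_measure.json): record the statistics FP8 needs.
        model = prepare(model, quant_config)
        for batch in calibration_batches:
            model(**batch)
        finalize_quantization(model)  # writes measurements under ./inc_output/measure
    else:
        # Quantization pass (maxabs_quant.json): apply the saved measurements and run in FP8.
        model = convert(model, quant_config)
    return model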

There are various other commits cleaning up bugs or adding convenience features like requirements.txt or the Docker files. Do be sure to check the full gaudi_main branch to get all of those bug fixes and conveniences, but these three commits each demonstrate the core of the work needed to add that feature.

Reference

@article{jin2024shopping,
  title={Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models},
  author={Jin, Yilun and Li, Zheng and Zhang, Chenwei and Cao, Tianyu and Gao, Yifan and Jayarao, Pratik and Li, Mao and Liu, Xin and Sarkhel, Ritesh and Tang, Xianfeng and others},
  journal={arXiv preprint arXiv:2410.20745},
  year={2024}
}
