Thanks for releasing this wonderful work.
In evaluate_from_local.py, the extract_xx functions appear to have typos (see lines 101, 111, and 119).
Since the MMLU-Pro dataset has questions with answers A-P, the pattern should be something like A-P instead of A-J.
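To illustrate what I mean, here is a minimal sketch. The helper name `extract_answer_sketch` and the "answer is (X)" pattern shape are my own assumptions for illustration and are not copied from evaluate_from_local.py. It also assumes the options really do go up to P.

```python
import re


def extract_answer_sketch(text, last_letter="P"):
    """Hypothetical helper (not the repository's code).

    Pulls the option letter out of a response such as
    "The answer is (K)." The upper bound of the character class is a
    parameter so the effect of A-J versus A-P is easy to compare.
    """
    # Assumed pattern shape: "answer is", optional "(", one letter, optional ")".
    pattern = rf"answer is \(?([A-{last_letter}])\)?"
    match = re.search(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    response = "The answer is (K)."
    print(extract_answer_sketch(response, last_letter="J"))  # range too narrow
    print(extract_answer_sketch(response, last_letter="P"))  # wider range
```

With the narrower A-J range, a response that picks any option after J is not matched at all, so it would presumably be scored as a failed extraction rather than as a wrong or right answer.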