Conversation

@rka97
Contributor

@rka97 rka97 commented Jan 6, 2026

No description provided.

@rka97 rka97 requested a review from a team as a code owner January 6, 2026 19:21
@github-actions

github-actions bot commented Jan 6, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@priyakasimbeg
Contributor

I get this error when I try to run:

I0108 17:24:24.785126 140345170527424 reader.py:262] Creating a tf.data.Dataset reading 1024 files located in folders: /data/imagenet/jax/imagenet2012/5.1.0.
I0108 17:24:24.878405 140345170527424 logging_logger.py:49] Constructing tf.data.Dataset imagenet2012 for split train, from /data/imagenet/jax/imagenet2012/5.1.0
I0108 17:24:30.342226 140345170527424 submission_runner.py:251] Initializing model.
W0108 17:24:32.651534 140345170527424 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
W0108 17:24:32.651533 140330536600768 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
W0108 17:24:32.651534 140278790313152 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
W0108 17:24:32.651537 140295636341952 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
I0108 17:24:32.651707 140345170527424 submission_runner.py:293] Initializing optimizer.
I0108 17:24:32.652543 140345170527424 submission_runner.py:298] Initializing metrics bundle.
I0108 17:24:32.652706 140345170527424 submission_runner.py:322] Initializing checkpoint and logger.
I0108 17:24:32.653086 140345170527424 submission_runner.py:345] Saving meta data to /experiment_runs/algoperf_pytorch_lm_workload_tf32_resnet_fix/study_0/imagenet_vit_pytorch/trial_1/meta_data_0.json.
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
I0108 17:24:32.763983 140295636341952 logger_utils.py:236] Unable to record git information. Continuing without it.
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
I0108 17:24:32.769814 140345170527424 logger_utils.py:236] Unable to record git information. Continuing without it.
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
I0108 17:24:32.776495 140330536600768 logger_utils.py:236] Unable to record git information. Continuing without it.
I0108 17:24:32.776825 140278790313152 logger_utils.py:236] Unable to record git information. Continuing without it.
I0108 17:24:32.872176 140345170527424 submission_runner.py:349] Saving flags to /experiment_runs/algoperf_pytorch_lm_workload_tf32_resnet_fix/study_0/imagenet_vit_pytorch/trial_1/flags_0.json.
I0108 17:24:32.894984 140345170527424 submission_runner.py:361] Starting training loop.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank2]:     app.run(main)
[rank2]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank2]:     _run_main(main, args)
[rank2]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank2]:     sys.exit(main(argv))
[rank2]:              ^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank2]:     score = score_submission_on_workload(
[rank2]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank2]:     timing, metrics = train_once(
[rank2]:                       ^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank2]:     optimizer_state, model_params, model_state = update_params(
[rank2]:                                                  ^^^^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank2]:     loss_dict = workload.loss_fn(
[rank2]:                 ^^^^^^^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank2]:     per_example_losses = F.cross_entropy(
[rank2]:                          ^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank2]:     return torch._C._nn.cross_entropy_loss(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: RuntimeError: Expected floating point type for target with class probabilities, got Long
[rank3]: Traceback (most recent call last):
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank3]:     app.run(main)
[rank3]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank3]:     _run_main(main, args)
[rank3]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank3]:     sys.exit(main(argv))
[rank3]:              ^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank3]:     score = score_submission_on_workload(
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank3]:     timing, metrics = train_once(
[rank3]:                       ^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank3]:     optimizer_state, model_params, model_state = update_params(
[rank3]:                                                  ^^^^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank3]:     loss_dict = workload.loss_fn(
[rank3]:                 ^^^^^^^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank3]:     per_example_losses = F.cross_entropy(
[rank3]:                          ^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank3]:     return torch._C._nn.cross_entropy_loss(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: Expected floating point type for target with class probabilities, got Long
[rank1]: Traceback (most recent call last):
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank1]:     app.run(main)
[rank1]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank1]:     _run_main(main, args)
[rank1]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank1]:     sys.exit(main(argv))
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank1]:     score = score_submission_on_workload(
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank1]:     timing, metrics = train_once(
[rank1]:                       ^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank1]:     optimizer_state, model_params, model_state = update_params(
[rank1]:                                                  ^^^^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank1]:     loss_dict = workload.loss_fn(
[rank1]:                 ^^^^^^^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank1]:     per_example_losses = F.cross_entropy(
[rank1]:                          ^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank1]:     return torch._C._nn.cross_entropy_loss(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected floating point type for target with class probabilities, got Long
[rank0]: Traceback (most recent call last):
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank0]:     app.run(main)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank0]:     _run_main(main, args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank0]:     sys.exit(main(argv))
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank0]:     score = score_submission_on_workload(
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank0]:     timing, metrics = train_once(
[rank0]:                       ^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank0]:     optimizer_state, model_params, model_state = update_params(
[rank0]:                                                  ^^^^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank0]:     loss_dict = workload.loss_fn(
[rank0]:                 ^^^^^^^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank0]:     per_example_losses = F.cross_entropy(
[rank0]:                          ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank0]:     return torch._C._nn.cross_entropy_loss(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected floating point type for target with class probabilities, got Long

Run command to reproduce:

torchrun --redirects 1:0,2:0,3:0 --standalone --nnodes=1 --nproc_per_node=4 submission_runner.py --framework=pytorch --workload=imagenet_vit --submission_path=algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py --data_dir=/data/imagenet/jax --experiment_dir=/experiment_runs --experiment_name=algoperf_pytorch_lm_workload_tf32_resnet_fix/study_0 --overwrite=true --save_checkpoints=false --rng_seed=1103287440 --max_global_steps=18666 --imagenet_v2_data_dir=/data/imagenet/jax --torch_compile=true --tuning_ruleset=external --tuning_search_space=algorithms/baselines/external_tuning/tuning_search_space.json --num_tuning_trials=5 --hparam_end_index=1
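For reference, here is a minimal sketch of one way this `RuntimeError` can arise and be avoided. It assumes the labels reach `loss_fn` as integer one-hot vectors with the same shape as the logits; that is a guess, not something confirmed against the workload code. When the target has the same shape as the input, `F.cross_entropy` treats it as class probabilities and requires a floating point dtype. The helper name `cross_entropy_per_example` and the tensor sizes are hypothetical and only for illustration.

```python
import torch
import torch.nn.functional as F


def cross_entropy_per_example(logits, targets):
    """Per-example cross entropy that tolerates integer one-hot targets.

    Hypothetical helper for illustration; not the repository's loss_fn.
    """
    if targets.shape == logits.shape and not targets.is_floating_point():
        # A target with the same shape as the logits is interpreted as class
        # probabilities, which must be floating point, so cast it.
        targets = targets.to(logits.dtype)
    return F.cross_entropy(logits, targets, reduction="none")


# Assumed toy sizes: 4 examples, 10 classes.
logits = torch.randn(4, 10)
class_ids = torch.tensor([1, 0, 3, 9])
one_hot_long = F.one_hot(class_ids, num_classes=10)  # integer (int64) dtype

# Passing the integer one-hot target directly triggers the error from the log.
try:
    F.cross_entropy(logits, one_hot_long, reduction="none")
except RuntimeError as err:
    print(f"reproduced: {err}")

# Both target formats go through the helper and give the same losses.
losses_from_one_hot = cross_entropy_per_example(logits, one_hot_long)
losses_from_ids = cross_entropy_per_example(logits, class_ids)
```

If the labels are in fact plain class indices in the real pipeline, the cast is never applied and the call is unchanged, so the check is cheap to keep while debugging.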

@priyakasimbeg
Contributor

@rka97 can you include a copy of the `pip freeze` output from your environment?

@rka97 rka97 closed this Jan 12, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Jan 12, 2026
@rka97
Contributor Author

rka97 commented Jan 12, 2026

I'm opening another PR. It turns out that unifying the pipeline doesn't work as well for ViT, but we can still fix the slow PyTorch workload issue with separate data loaders.

3 participants