Unify resnet dataloader pipeline #896

rka97 · 2026-01-06T19:21:45Z

No description provided.

github-actions · 2026-01-06T19:21:53Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

priyakasimbeg · 2026-01-08T17:28:31Z

I get this error when I try to run:

I0108 17:24:24.785126 140345170527424 reader.py:262] Creating a tf.data.Dataset reading 1024 files located in folders: /data/imagenet/jax/imagenet2012/5.1.0.
I0108 17:24:24.878405 140345170527424 logging_logger.py:49] Constructing tf.data.Dataset imagenet2012 for split train, from /data/imagenet/jax/imagenet2012/5.1.0
I0108 17:24:30.342226 140345170527424 submission_runner.py:251] Initializing model.
W0108 17:24:32.651534 140345170527424 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
W0108 17:24:32.651533 140330536600768 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
W0108 17:24:32.651534 140278790313152 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
W0108 17:24:32.651537 140295636341952 submission_runner.py:272] These workloads cannot be fully compiled under current PyTorch version. Proceeding without `torch.compile`.
I0108 17:24:32.651707 140345170527424 submission_runner.py:293] Initializing optimizer.
I0108 17:24:32.652543 140345170527424 submission_runner.py:298] Initializing metrics bundle.
I0108 17:24:32.652706 140345170527424 submission_runner.py:322] Initializing checkpoint and logger.
I0108 17:24:32.653086 140345170527424 submission_runner.py:345] Saving meta data to /experiment_runs/algoperf_pytorch_lm_workload_tf32_resnet_fix/study_0/imagenet_vit_pytorch/trial_1/meta_data_0.json.
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
I0108 17:24:32.763983 140295636341952 logger_utils.py:236] Unable to record git information. Continuing without it.
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
I0108 17:24:32.769814 140345170527424 logger_utils.py:236] Unable to record git information. Continuing without it.
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
fatal: detected dubious ownership in repository at '/algorithmic-efficiency'
To add an exception for this directory, call:

        git config --global --add safe.directory /algorithmic-efficiency
I0108 17:24:32.776495 140330536600768 logger_utils.py:236] Unable to record git information. Continuing without it.
I0108 17:24:32.776825 140278790313152 logger_utils.py:236] Unable to record git information. Continuing without it.
I0108 17:24:32.872176 140345170527424 submission_runner.py:349] Saving flags to /experiment_runs/algoperf_pytorch_lm_workload_tf32_resnet_fix/study_0/imagenet_vit_pytorch/trial_1/flags_0.json.
I0108 17:24:32.894984 140345170527424 submission_runner.py:361] Starting training loop.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank2]:     app.run(main)
[rank2]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank2]:     _run_main(main, args)
[rank2]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank2]:     sys.exit(main(argv))
[rank2]:              ^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank2]:     score = score_submission_on_workload(
[rank2]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank2]:     timing, metrics = train_once(
[rank2]:                       ^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank2]:     optimizer_state, model_params, model_state = update_params(
[rank2]:                                                  ^^^^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank2]:     loss_dict = workload.loss_fn(
[rank2]:                 ^^^^^^^^^^^^^^^^^
[rank2]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank2]:     per_example_losses = F.cross_entropy(
[rank2]:                          ^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank2]:     return torch._C._nn.cross_entropy_loss(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: RuntimeError: Expected floating point type for target with class probabilities, got Long
[rank3]: Traceback (most recent call last):
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank3]:     app.run(main)
[rank3]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank3]:     _run_main(main, args)
[rank3]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank3]:     sys.exit(main(argv))
[rank3]:              ^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank3]:     score = score_submission_on_workload(
[rank3]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank3]:     timing, metrics = train_once(
[rank3]:                       ^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank3]:     optimizer_state, model_params, model_state = update_params(
[rank3]:                                                  ^^^^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank3]:     loss_dict = workload.loss_fn(
[rank3]:                 ^^^^^^^^^^^^^^^^^
[rank3]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank3]:     per_example_losses = F.cross_entropy(
[rank3]:                          ^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank3]:     return torch._C._nn.cross_entropy_loss(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: RuntimeError: Expected floating point type for target with class probabilities, got Long
[rank1]: Traceback (most recent call last):
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank1]:     app.run(main)
[rank1]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank1]:     _run_main(main, args)
[rank1]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank1]:     sys.exit(main(argv))
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank1]:     score = score_submission_on_workload(
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank1]:     timing, metrics = train_once(
[rank1]:                       ^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank1]:     optimizer_state, model_params, model_state = update_params(
[rank1]:                                                  ^^^^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank1]:     loss_dict = workload.loss_fn(
[rank1]:                 ^^^^^^^^^^^^^^^^^
[rank1]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank1]:     per_example_losses = F.cross_entropy(
[rank1]:                          ^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank1]:     return torch._C._nn.cross_entropy_loss(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected floating point type for target with class probabilities, got Long
[rank0]: Traceback (most recent call last):
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 889, in <module>
[rank0]:     app.run(main)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank0]:     _run_main(main, args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank0]:     sys.exit(main(argv))
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 854, in main
[rank0]:     score = score_submission_on_workload(
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 717, in score_submission_on_workload
[rank0]:     timing, metrics = train_once(
[rank0]:                       ^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/submission_runner.py", line 389, in train_once
[rank0]:     optimizer_state, model_params, model_state = update_params(
[rank0]:                                                  ^^^^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py", line 279, in update_params
[rank0]:     loss_dict = workload.loss_fn(
[rank0]:                 ^^^^^^^^^^^^^^^^^
[rank0]:   File "/algorithmic-efficiency/algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py", line 216, in loss_fn
[rank0]:     per_example_losses = F.cross_entropy(
[rank0]:                          ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 3458, in cross_entropy
[rank0]:     return torch._C._nn.cross_entropy_loss(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected floating point type for target with class probabilities, got Long

Run command to reproduce:

torchrun --redirects 1:0,2:0,3:0 --standalone --nnodes=1 --nproc_per_node=4 submission_runner.py --framework=pytorch --workload=imagenet_vit --submission_path=algorithms/baselines/external_tuning/pytorch_nadamw_full_budget.py --data_dir=/data/imagenet/jax --experiment_dir=/experiment_runs --experiment_name=algoperf_pytorch_lm_workload_tf32_resnet_fix/study_0 --overwrite=true --save_checkpoints=false --rng_seed=1103287440 --max_global_steps=18666 --imagenet_v2_data_dir=/data/imagenet/jax --torch_compile=true --tuning_ruleset=external --tuning_search_space=algorithms/baselines/external_tuning/tuning_search_space.json --num_tuning_trials=5 --hparam_end_index=1
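
For readers unfamiliar with this `RuntimeError`: `F.cross_entropy` treats a target whose shape matches the logits as per-class probabilities and requires it to be floating point. The sketch below reproduces the message in isolation and shows two possible conversions. It assumes the labels reach `loss_fn` as integer one-hot vectors, which the log above does not confirm; the tensor names and sizes are hypothetical and not taken from the workload code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 examples, 10 classes (not the real ImageNet shapes).
logits = torch.randn(4, 10)
class_ids = torch.tensor([1, 0, 3, 9])

# F.one_hot returns an int64 (Long) tensor with the same shape as the logits.
one_hot_long = F.one_hot(class_ids, num_classes=10)

# Same shape as logits + integer dtype -> interpreted as class probabilities
# with the wrong dtype, which raises the RuntimeError seen in the traceback.
raised = False
try:
    F.cross_entropy(logits, one_hot_long, reduction="none")
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Option A (assumption: labels are hard one-hot): recover class indices.
loss_from_ids = F.cross_entropy(
    logits, one_hot_long.argmax(dim=-1), reduction="none"
)

# Option B: keep the (N, C) layout but cast to float so the target is read
# as class probabilities (also works for soft labels such as mixup).
loss_from_probs = F.cross_entropy(
    logits, one_hot_long.float(), reduction="none"
)
```

For hard one-hot labels the two options give the same per-example losses, so the choice mostly depends on whether the pipeline ever emits soft labels.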

priyakasimbeg · 2026-01-10T00:09:45Z

@rka97 can you include a copy of the pip freeze of your environment?

rka97 · 2026-01-12T04:22:26Z

I'm opening another PR. It turns out that unifying the pipeline doesn't work as well for ViT, but we can still fix the slow PyTorch workload issue with separate data loaders.

rka97 added 4 commits December 12, 2025 01:47

ImageNet caching for faster dataset access PyTorch

2f865a1

Refactor ImageNet input pipeline and augmentations

c2f443b

add debug scripts

3608624

Move datasets/ to algoperf/datasets (otherwise it gives off an error because we now import the datasets library for the lm workload)

cbeb594

rka97 requested a review from a team as a code owner January 6, 2026 19:21

Merge branch 'a100' into a100_debug_resnet

b52e9b1

in docker startup script set imangenet data dir to jax sub dir for shared input pipeline

1693b4f

rka97 closed this Jan 12, 2026

github-actions bot locked and limited conversation to collaborators Jan 12, 2026
