
Stage 1 training loss is NaN #34

@likeatingcake

[2025-08-27 10:12:22,229] [INFO] [loss_scaler.py:184:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4, reducing to 2
Steps: 0%|▏ | 16/6000 [00:55<4:54:46, 2.96s/it, lr=0, noise_step_loss=nan]
[2025-08-27 10:12:25,030] [INFO] [loss_scaler.py:184:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
Steps: 0%|▏ | 17/6000 [00:58<4:50:05, 2.91s/it, lr=0, noise_step_loss=nan]
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nvme0/yueyc/FaithDiff/./train_SDXL_stage_1.py", line 999, in <module>
[rank0]: main()
[rank0]: File "/mnt/nvme0/yueyc/FaithDiff/./train_SDXL_stage_1.py", line 928, in main
[rank0]: accelerator.backward(loss)
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/accelerate/accelerator.py", line 2726, in backward
[rank0]: self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 281, in backward
[rank0]: self.engine.step()
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2470, in step
[rank0]: self._take_model_step(lr_kwargs)
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2365, in _take_model_step
[rank0]: self.optimizer.step()
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1968, in step
[rank0]: self._update_scale(self.overflow)
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2212, in _update_scale
[rank0]: self.loss_scaler.update_scale(has_overflow)
[rank0]: File "/nvme/yueyc/.conda/envs/faithdiff/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 176, in update_scale
[rank0]: raise Exception(
[rank0]: Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Steps: 0%|▏ | 17/6000 [01:00<5:55:36, 3.57s/it, lr=0, noise_step_loss=nan]
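
For anyone reading the log above, the sequence it shows (scale 4 reduced to 2, then 2 reduced to 1, then the "already at minimum" exception) can be modelled with a small sketch. This is a simplified illustration, not DeepSpeed's actual code: the class `ToyDynamicLossScaler`, the helpers `has_overflow` and `run`, and the default values are all hypothetical names and assumptions chosen to match the numbers in the log.

```python
import math


class ToyDynamicLossScaler:
    """Hypothetical, simplified stand-in for a dynamic fp16 loss scaler."""

    def __init__(self, init_scale=4.0, min_scale=1.0, scale_factor=2.0):
        # Defaults are assumptions picked to mirror the log (4 -> 2 -> 1).
        self.scale = init_scale
        self.min_scale = min_scale
        self.scale_factor = scale_factor

    def update_scale(self, overflow):
        """Shrink the scale after an overflow; fail once the floor is reached."""
        if not overflow:
            return
        if self.scale <= self.min_scale:
            # Mirrors the message in the traceback; the real check may differ.
            raise RuntimeError(
                "Current loss scale already at minimum - "
                "cannot decrease scale anymore."
            )
        self.scale = max(self.scale / self.scale_factor, self.min_scale)


def has_overflow(loss):
    """Treat a NaN or infinite loss value as an overflow."""
    return math.isnan(loss) or math.isinf(loss)


def run(losses, scaler):
    """Feed a list of loss values to the scaler and record the scale after each."""
    history = []
    for loss in losses:
        scaler.update_scale(has_overflow(loss))
        history.append(scaler.scale)
    return history


if __name__ == "__main__":
    scaler = ToyDynamicLossScaler(init_scale=4.0)
    print(run([float("nan"), float("nan")], scaler))
    try:
        scaler.update_scale(True)  # third consecutive overflow
    except RuntimeError as err:
        print("stopped:", err)
```

In this toy model, every overflowing step halves the scale, so a loss that is NaN on every step exhausts the scale within a few iterations. Since the progress bar already reports `noise_step_loss=nan`, the NaN may originate in the forward pass itself, in which case lowering the scale would not be expected to help.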
