Conversation

@henrytsui000 (Member)

Stabilize training for v9-t and fix several pre-existing bugs.

Copilot AI review requested due to automatic review settings December 30, 2025 07:32

Copilot AI left a comment

Pull request overview

This PR prepares the codebase for the v1.0 release by stabilizing training for v9-t models and fixing several bugs. The changes focus on implementing gradient accumulation with warmup, fixing a widespread typo, and enhancing distributed-training support.

Key changes include:

  • Implementation of a new GradientAccumulation callback to dynamically adjust gradient accumulation during training warmup (see the short illustration after this list)
  • Updates to the EMA callback to properly account for gradient accumulation steps
  • Correction of the "quite" → "quiet" typo throughout configuration and documentation
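
As a rough illustration of the first two items, with hypothetical numbers that are not taken from the PR:

```python
# Hypothetical example: to reach an "equivalent" batch of 128 on 2 GPUs with a
# per-GPU batch of 16, the trainer accumulates 4 micro-batches before each
# optimizer step; per the second bullet, the EMA callback is updated to account
# for these accumulation steps as well.
equivalent_batch_size = 128
batch_size, world_size = 16, 2
accumulate_grad_batches = max(1, round(equivalent_batch_size / (batch_size * world_size)))
assert accumulate_grad_batches == 4
```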

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary of changes per file:

| File | Description |
| --- | --- |
| `yolo/utils/model_utils.py` | Added `GradientAccumulation` callback class for dynamic gradient accumulation during warmup; updated the EMA callback to work with gradient accumulation by tracking batch steps |
| `yolo/utils/logging_utils.py` | Integrated `GradientAccumulation` into setup; corrected "quite" typo; changed batch progress description; removed `clear_live()` call |
| `yolo/lazy.py` | Added configurable accelerator and device settings; enabled `sync_batchnorm` for distributed training; corrected "quite" typo |
| `yolo/config/task/train.yaml` | Added `equivalent_batch_size` parameter to support gradient accumulation configuration |
| `yolo/config/general.yaml` | Changed device configuration from fixed `0` to `auto` and added an accelerator setting |
| `yolo/config/config.py` | Added `equivalent_batch_size` field to `DataConfig` and `accelerator` field to `Config` (sketched below) |
| `tests/test_utils/test_bounding_box_utils.py` | Removed `.to(device)` call from model creation in a test (the model does not need to be on the device for this test) |
| `tests/conftest.py` | Removed `equivalent_batch_size` from the test configuration to prevent `GradientAccumulation` callback activation during tests |
| `docs/HOWTO.md` | Corrected "quite" → "quiet" in an example command |
| `README.md` | Corrected "quite" → "quiet" in documentation; removed WIP and PWC badges |
| `.github/workflows/deploy.yaml` | Corrected "quite" → "quiet" in a workflow command |
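
A rough sketch of what the new fields in `yolo/config/config.py` might look like; the field names come from the summary above, but the defaults, types, and surrounding fields are assumptions:

```python
# Hedged sketch only: the actual dataclasses in yolo/config/config.py contain
# many more fields and may use different defaults.
from dataclasses import dataclass


@dataclass
class DataConfig:
    batch_size: int = 16
    # Target "global" batch size reached via gradient accumulation during warmup.
    equivalent_batch_size: int = 64


@dataclass
class Config:
    # Passed through to the Lightning Trainer (e.g. "auto", "gpu", "cpu").
    accelerator: str = "auto"
```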

Comment on lines +79 to +88
```python
class GradientAccumulation(Callback):
    def __init__(self, data_cfg: DataConfig, scheduler_cfg: SchedulerConfig):
        super().__init__()
        self.equivalent_batch_size = data_cfg.equivalent_batch_size
        self.actual_batch_size = data_cfg.batch_size
        self.warmup_epochs = getattr(scheduler_cfg.warmup, "epochs", 0)
        self.current_batch = 0
        self.max_accumulation = 1
        self.warmup_batches = 0
        logger.info(":arrows_counterclockwise: Enable Gradient Accumulation")
```

Copilot AI Dec 30, 2025

The GradientAccumulation class lacks a docstring explaining its purpose, parameters, and behavior. Consider adding documentation to describe how this callback manages gradient accumulation during training warmup.
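
For illustration, a hedged sketch of what such a docstring could say, based only on the behaviour visible in this diff (wording is not from the PR):

```python
from lightning.pytorch.callbacks import Callback  # assumed import path; the repo may import it differently


class GradientAccumulation(Callback):
    """Ramp up gradient accumulation over the scheduler's warmup epochs.

    The target accumulation factor is equivalent_batch_size divided by the
    effective batch size (batch_size * world_size). During warmup the factor
    is linearly interpolated from 1 to that target; afterwards
    trainer.accumulate_grad_batches stays at the target value.
    """
```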

Comment on lines +79 to +108
```python
class GradientAccumulation(Callback):
    def __init__(self, data_cfg: DataConfig, scheduler_cfg: SchedulerConfig):
        super().__init__()
        self.equivalent_batch_size = data_cfg.equivalent_batch_size
        self.actual_batch_size = data_cfg.batch_size
        self.warmup_epochs = getattr(scheduler_cfg.warmup, "epochs", 0)
        self.current_batch = 0
        self.max_accumulation = 1
        self.warmup_batches = 0
        logger.info(":arrows_counterclockwise: Enable Gradient Accumulation")

    def setup(self, trainer: "Trainer", pl_module: "LightningModule", stage: str) -> None:
        effective_batch_size = self.actual_batch_size * trainer.world_size
        self.max_accumulation = max(1, round(self.equivalent_batch_size / effective_batch_size))
        batches_per_epoch = int(len(pl_module.train_loader) / trainer.world_size)
        self.warmup_batches = int(self.warmup_epochs * batches_per_epoch)

    def on_train_epoch_start(self, trainer: "Trainer", pl_module: "LightningModule") -> None:
        self.current_batch = trainer.global_step

    def on_train_batch_start(self, trainer: "Trainer", pl_module: "LightningModule", *args, **kwargs) -> None:
        if self.current_batch < self.warmup_batches:
            current_accumulation = round(lerp(1, self.max_accumulation, self.current_batch, self.warmup_batches))
        else:
            current_accumulation = self.max_accumulation
        trainer.accumulate_grad_batches = current_accumulation

    def on_train_batch_end(self, trainer: "Trainer", pl_module: "LightningModule", *args, **kwargs) -> None:
        self.current_batch += 1
```

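The warmup branch above relies on a `lerp` helper that is outside this diff. Below is a minimal sketch consistent with the call site `lerp(1, self.max_accumulation, self.current_batch, self.warmup_batches)`, assuming a `(start, end, step, total_steps)` signature; the helper actually defined in the repo may differ.

```python
def lerp(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from start to end as step goes from 0 to total_steps."""
    if total_steps <= 0:
        return float(end)
    ratio = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * ratio
```

Under this assumption, rounding the interpolated value gives a monotone 1 → max_accumulation ramp over the warmup batches, after which the accumulation stays at the maximum.
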

Copilot AI Dec 30, 2025

The new GradientAccumulation callback class lacks test coverage. Consider adding tests to verify the gradient accumulation logic, warmup behavior, and integration with the trainer's accumulate_grad_batches setting.
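
As a starting point, a hedged sketch of such a test, driving the callback with simple stand-in objects; the real `DataConfig`/`SchedulerConfig` types and the repo's existing fixtures may require different wiring, and the expected values assume a linear `lerp` as sketched above:

```python
from types import SimpleNamespace

from yolo.utils.model_utils import GradientAccumulation


def test_gradient_accumulation_warmup_ramp():
    data_cfg = SimpleNamespace(equivalent_batch_size=64, batch_size=8)
    scheduler_cfg = SimpleNamespace(warmup=SimpleNamespace(epochs=1))
    callback = GradientAccumulation(data_cfg, scheduler_cfg)

    trainer = SimpleNamespace(world_size=1, global_step=0, accumulate_grad_batches=1)
    pl_module = SimpleNamespace(train_loader=range(100))  # pretend 100 batches per epoch

    callback.setup(trainer, pl_module, stage="fit")
    assert callback.max_accumulation == 8   # 64 / (8 * 1)
    assert callback.warmup_batches == 100   # 1 warmup epoch * 100 batches

    callback.on_train_epoch_start(trainer, pl_module)
    seen = []
    for _ in range(150):  # run past the warmup boundary
        callback.on_train_batch_start(trainer, pl_module)
        seen.append(trainer.accumulate_grad_batches)
        callback.on_train_batch_end(trainer, pl_module)

    assert seen[0] == 1            # warmup starts without accumulation
    assert seen[-1] == 8           # full factor is reached after warmup
    assert seen == sorted(seen)    # the ramp never decreases
```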

@henrytsui000 (Member, Author)

@copilot open a new pull request to apply changes based on this feedback

Copilot AI commented Dec 30, 2025

@henrytsui000 I've opened a new pull request, #226, to work on those changes. Once the pull request is ready, I'll request review from you.

henrytsui000 merged commit c4cb5f6 into main on Dec 30, 2025
9 checks passed