Conversation
The paper has been accepted by ICLR 2024. The key idea is to split the backward computation into two parts: one that computes the gradient for the input and another that computes the gradient for the weights.
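The split described in the paper can be sketched for a single linear layer. This is a minimal illustration, not the paper's implementation; the function names `backward_input` and `backward_weight` are placeholders I chose:

```python
import numpy as np

# Forward: y = x @ W for a linear layer (no bias).
# The backward pass factors into two independent computations:
#   B: grad_x = grad_y @ W.T   (input gradient -> needed by the previous stage)
#   W: grad_W = x.T @ grad_y   (weight gradient -> can be deferred)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations from the previous stage
W = rng.standard_normal((8, 16))       # layer weights
grad_y = rng.standard_normal((4, 16))  # gradient arriving from the next stage

def backward_input(grad_y, W):
    # "B" part: on the critical path of pipeline parallelism,
    # since the previous stage is waiting for grad_x.
    return grad_y @ W.T

def backward_weight(x, grad_y):
    # "W" part: independent of grad_x, so it can be scheduled later
    # to fill pipeline bubbles.
    return x.T @ grad_y

grad_x = backward_input(grad_y, W)
grad_W = backward_weight(x, grad_y)

assert grad_x.shape == x.shape
assert grad_W.shape == W.shape
```

Because the two parts share no data dependency beyond the cached activations and `grad_y`, a scheduler is free to run other work between them.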
May I ask what led you to commit to this repository over the original one? Just curious about your thoughts! @QPHutu
Thanks for the reply. There are two main reasons.
Thanks for the PR! Before merging we'd like to understand the impact a bit better. Did you verify how model parallel training of the models currently supported here (such as llama2) is affected by your change, in terms of speed and stability, and that model behavior is unchanged? It would also be nice to hear feedback from the Nvidia/Megatron-LM team if you get a chance.


The change is a quick implementation that replaces 1F1B with ZB-H1, the schedule proposed in Zero Bubble Pipeline Parallelism, which reduces the bubbles in pipeline parallelism.
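To make the contrast concrete, here is a toy sketch of the per-stage op streams. This is not the Megatron-LM code; the `zb_h1` deferral rule below is a simplified illustration of the idea (split each backward into B and W, and defer some W ops to fill bubbles), under assumptions I chose:

```python
def one_f_one_b(num_microbatches, num_stages, stage):
    """Classic 1F1B stream for one stage: warmup forwards,
    alternating F/B in steady state, then cooldown backwards."""
    warmup = num_stages - stage - 1
    ops = [("F", i) for i in range(warmup)]
    steady = num_microbatches - warmup
    for i in range(steady):
        ops.append(("F", warmup + i))
        ops.append(("B", i))
    for i in range(steady, num_microbatches):
        ops.append(("B", i))
    return ops

def zb_h1(num_microbatches, num_stages, stage):
    """ZB-H1-style stream: each B (input gradient) is decoupled from its
    W (weight gradient); a few W ops are deferred toward the tail, where
    1F1B would otherwise sit idle in the cooldown bubble."""
    deferred = num_stages - stage - 1  # earlier stages defer more W ops
    ops, pending_w = [], []
    for kind, mb in one_f_one_b(num_microbatches, num_stages, stage):
        if kind == "B":
            ops.append(("B", mb))
            pending_w.append(("W", mb))
            if len(pending_w) > deferred:
                ops.append(pending_w.pop(0))
        else:
            ops.append((kind, mb))
    ops.extend(pending_w)
    return ops
```

Printing `zb_h1(8, 4, 0)` shows the deferred `("W", mb)` ops landing at the end of the stream, which is where the cooldown bubble of 1F1B would be.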