@TheBlackHacker TheBlackHacker commented Nov 25, 2023

Hello, I know a lot of people want to train the OpenChat model, but accessing modern GPUs like A100s or H100s seems difficult. So I tried using ZeRO-3 to train on a cheaper GPU system: 8x A10G with 24 GB of VRAM each.

What I added in this pull request:

ochat/training_deepspeed/deepspeed_config_zero3.json

- Changed the ZeRO-2 strategy to ZeRO-3.
- Offload to CPU (if you want to use NVMe, edit the config file); a sketch of such a config is shown after this list.
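A minimal sketch of what a ZeRO-3 + CPU-offload config could look like, written here as a Python dict (DeepSpeed's `deepspeed.initialize` accepts a dict in place of the JSON file). The concrete values (bf16, gradient clipping, batch size) are assumptions for illustration, not necessarily this PR's actual settings:

```python
# Sketch of a ZeRO-3 config with CPU offload; values are illustrative.
ds_config = {
    "zero_optimization": {
        "stage": 3,                      # ZeRO-3: shard params, grads, optimizer state
        "offload_optimizer": {
            "device": "cpu",             # use "nvme" (plus "nvme_path") for NVMe offload
            "pin_memory": True,
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        # lets save_16bit_model() consolidate the sharded weights at save time
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},           # 16-bit (bfloat16) training
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 8,
}
```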

ochat/training_deepspeed/train_zero3.py

- Changed the optimizer from torch.optim.AdamW to deepspeed.ops.adam.DeepSpeedCPUAdam.
- Save the model (in 16-bit) on all ranks; both changes are sketched below.
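A minimal sketch of these two changes, assuming the `ds_config` dict above; the toy `torch.nn.Linear` model, random batch, and learning rate are placeholders for the real OpenChat model and data:

```python
import torch
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(1024, 1024)  # toy stand-in for the OpenChat model

# DeepSpeedCPUAdam performs the Adam step on the CPU, which is what the
# offload_optimizer section expects; torch.optim.AdamW would keep its state
# on the GPU and defeat the offload.
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=2e-5)  # lr is illustrative

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,  # the ZeRO-3 dict sketched above
)

# One toy training step with random data, just to show the engine calls.
x = torch.randn(8, 1024, device=model_engine.device, dtype=torch.bfloat16)
loss = model_engine(x).float().pow(2).mean()
model_engine.backward(loss)
model_engine.step()

# Every rank must reach this call: ZeRO-3 shards the weights, so gathering
# them is a collective operation; rank 0 then writes the consolidated
# 16-bit checkpoint (needs stage3_gather_16bit_weights_on_model_save: true).
model_engine.save_16bit_model("output_dir", "pytorch_model.bin")
```

Launched with the `deepspeed` launcher (e.g. `deepspeed train_zero3.py`), every rank executes the save call, which is the point of the "save on all ranks" change.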

README.md

- Added instructions for training with ZeRO-3.

@Gaivoronsky Gaivoronsky left a comment

Great!
