I created a ray cluster with Helm and tried running the scripts below for multi-node training, but it got stuck at the first step of training (see the image below). Could you please help me with this?
train_1.5b_grpo_tool server.sh
train_1.5b_grpo.sh
ray-values-verl.yaml