Conversation
Tested, 100% the same as PPO.
Same up to 1e-5 on individual values. Because of differencing, this can end up in much larger differences (1e0) occasionally, after normalization.
Left it without bootstrapping on the last step; it seems to make no difference. Bootstrapping on time-outs made a big difference, though.
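For reference, a minimal sketch of what bootstrapping on time-outs amounts to; the buffer names (`rewards`, `values`, `time_outs`) are illustrative, not the exact ones in this PR.

import torch

def bootstrap_timeouts(rewards, values, time_outs, gamma=0.99):
    """On a time-out (episode truncated by the horizon rather than by a failure),
    add the discounted value estimate back into the reward, so the agent is not
    penalized for a truncation it cannot influence. True terminations are untouched."""
    return rewards + gamma * values * time_outs

# toy usage: only env 1 timed out
rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.tensor([10.0, 10.0, 10.0])
time_outs = torch.tensor([0.0, 1.0, 0.0])
print(bootstrap_timeouts(rewards, values, time_outs))  # tensor([ 1.0000, 10.9000,  1.0000])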
Tensor dict
…d history length. Implemented for dof_pos_target, dof_pos, dof_vel.
in preparation for domain randomization of gains
…f_pos_history` (used for smoothness reward). Some tuning. Change stance/swing reward to use the smoothed square wave.
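A minimal sketch of one way such a rolling history buffer can be kept (the roll-and-overwrite layout and the sizes are assumptions; the PR stores histories for dof_pos_target, dof_pos, and dof_vel).

import torch

num_envs, num_dof, history_length = 4096, 12, 3
dof_pos_history = torch.zeros(num_envs, history_length * num_dof)

def push_history(history, new_sample, num_dof):
    """Shift the stored samples back one slot and write the newest one in front."""
    history = torch.roll(history, shifts=num_dof, dims=1)
    history[:, :num_dof] = new_sample
    return history

dof_pos = torch.randn(num_envs, num_dof)
dof_pos_history = push_history(dof_pos_history, dof_pos, num_dof)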
| class runner: | ||
| policy_class_name = "ActorCritic" | ||
| algorithm_class_name = "PPO" | ||
| algorithm_class_name = "PPO2" |
PPO2 splits up the critic update and the actor update, and is otherwise pretty much the same. This is also using the new tensorDict storage. You don't really need to look under the hood unless you want to.
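A rough sketch of the split described above, under the assumption that the critic takes its own regression step on the returns before the actor takes a standard clipped-PPO step; all names below are illustrative, not the actual PPO2 class.

import torch
from torch.distributions import Normal

def split_update(actor, critic, actor_opt, critic_opt, batch, clip_param=0.2):
    # * critic update: plain value regression onto the returns, with its own optimizer step
    value_loss = (critic(batch["obs"]).squeeze(-1) - batch["returns"]).pow(2).mean()
    critic_opt.zero_grad()
    value_loss.backward()
    critic_opt.step()

    # * actor update: standard clipped surrogate on the stored advantages
    dist = Normal(actor(batch["obs"]), 0.5)
    log_prob = dist.log_prob(batch["actions"]).sum(-1)
    ratio = (log_prob - batch["old_log_prob"]).exp()
    surrogate = torch.min(
        ratio * batch["advantages"],
        ratio.clamp(1.0 - clip_param, 1.0 + clip_param) * batch["advantages"],
    )
    actor_loss = -surrogate.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return value_loss.item(), actor_loss.item()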
gym/envs/base/legged_robot.py
Outdated
| self._resample_commands(env_ids) | ||
| # * reset buffers | ||
| self.dof_pos_history[env_ids] = 0.0 | ||
| self.dof_pos_history[env_ids] = ( |
Changed to reset to the current state rather than 0, so the first 3 time-steps don't produce bogus time-differences.
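Roughly what the change amounts to; the `repeat` layout of the history buffer is an assumption here, only the reset-to-current-state idea is from the diff.

import torch

num_envs, num_dof, history_length = 8, 12, 3
dof_pos = torch.randn(num_envs, num_dof)
dof_pos_history = torch.zeros(num_envs, history_length * num_dof)
env_ids = torch.tensor([0, 3])

# * before: zeroing the history makes the first finite differences look like
# * a huge jump from 0 to the current joint position
dof_pos_history[env_ids] = 0.0

# * after: seed every history slot with the current joint positions so the
# * first few time-differences (and the smoothness reward) start at zero
dof_pos_history[env_ids] = dof_pos[env_ids].repeat(1, history_length)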
| ) | ||
| self.p_gains = torch.zeros( | ||
| self.num_actuators, dtype=torch.float, device=self.device | ||
| self.num_envs, self.num_actuators, dtype=torch.float, device=self.device |
this is in prep for randomizing gains
| self.num_actuators, | ||
| dtype=torch.float, | ||
| device=self.device, | ||
| self.num_envs, self.num_actuators, dtype=torch.float, device=self.device |
| self.num_envs, 1, dtype=torch.float, device=self.device | ||
| ) | ||
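Following up on the per-env gain buffers above, a hypothetical sketch of the gain randomization they enable; the scaling ranges and helper below are not part of this PR.

import torch

def randomize_gains(p_gains, d_gains, p_range=(0.8, 1.2), d_range=(0.8, 1.2)):
    """Scale nominal PD gains by a per-env, per-actuator factor. This only works
    because the buffers are now shaped (num_envs, num_actuators) instead of a
    single (num_actuators,) row shared by all environments."""
    p_scale = torch.empty_like(p_gains).uniform_(*p_range)
    d_scale = torch.empty_like(d_gains).uniform_(*d_range)
    return p_gains * p_scale, d_gains * d_scale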
|
|
||
| # # * get the body_name to body_index dict |
wtf is this? Pretty sure this was already there, not sure why it is popping up in the diff
|
|
||
| def _init_buffers(self): | ||
| super()._init_buffers() | ||
| self.oscillators = torch.zeros(self.num_envs, 2, device=self.device) |
I have an oscillator for each leg. They are initialized with a pi phase-shift, so it's the same, but we don't have to shift manually.
So there's a sin and cos for each leg?
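For reference, an illustrative sketch of the phase-shifted initialization and of feeding the phases to the policy as sin/cos pairs; the actual initialization and observation code is not in this diff.

import torch

num_envs, num_legs = 4096, 2
# * one phase per leg: left leg starts at 0, right leg at pi, so the
# * alternating gait comes out of the initialization rather than a manual shift
oscillators = torch.zeros(num_envs, num_legs)
oscillators[:, 1] = torch.pi

# * one common way to expose the phases to the policy: a sin and a cos per leg
oscillator_obs = torch.cat((oscillators.sin(), oscillators.cos()), dim=-1)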
| return rew_vel + rew_pos - rew_base_vel * self._switch("stand") | ||
|
|
||
| def _reward_stance(self): | ||
| # phase = torch.maximum( |
| tracking_sigma = 0.5 | ||
|
|
||
| # a smooth switch based on |cmd| (commanded velocity). | ||
| switch_scale = 0.5 |
See the switch function. Used this in Jenny's project to smoothly turn on/off rewards that we only want on or off when standing still or moving. switch_scale controls how quickly it moves from 0 to 1 (plot the function to see), and switch_threshold is when it starts to transition, based on |command|.
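A hedged sketch of one common shape for such a switch (a tanh ramp); the exact form of `_switch` is not shown in this diff, so treat this as illustrative only.

import torch

def smooth_switch(commands, switch_threshold=0.1, switch_scale=0.5):
    """Goes smoothly from 0 (standing, |command| below the threshold) to 1 (moving).
    switch_scale sets how fast the transition happens, switch_threshold sets
    where it begins, based on the commanded velocity magnitude."""
    cmd_norm = torch.linalg.norm(commands, dim=-1)
    return 0.5 * (1.0 + torch.tanh((cmd_norm - switch_threshold) / switch_scale))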
| critic_hidden_dims = [256, 256, 128] | ||
| # * can be elu, relu, selu, crelu, lrelu, tanh, sigmoid | ||
| activation = "elu" | ||
| activation = "tanh" |
Did a sweep the other day; tanh had high rewards 8 times, compared to 4 for elu... very hand-wavy.
elijahstangerjones
left a comment
Looks good, thanks for porting this, Steve.
| base_lin_vel = 1.5 | ||
| commands = 1 | ||
| base_height = BASE_HEIGHT_REF | ||
| dof_pos = [ |
Is this useful if we have the autonormalization? My current policy just depends on the autonorm, and it's kinda nice not to have to worry about the scaling when sim2sim-ing.
Try it. My "feeling" from Jenny's project is: absolutely yes. But it might also be that it's very important for the action scale (which was shared with this), but not for observations. That would make sense to me.
| """ | ||
| Returns the normalized version of the input. | ||
| """ | ||
| def forward(self, input): |
Have we tested how small these variances can get? I had some issues with sim2sim when the normalization had very small variances.
For safety in sim2sim I had clamped the variance to be no smaller than 0.1 (as per usual I was debugging 90 things at once, so maybe this wasn't that important, but it still seems like a good idea for safety).
I have not. I'll put it on the stack
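A sketch of the safeguard suggested above; the 0.1 floor comes from the comment, while the function and buffer names are illustrative rather than the current implementation.

import torch

def normalize(x, running_mean, running_var, min_var=0.1, eps=1e-8):
    """Returns the normalized version of the input, with the running variance
    clamped from below so a near-zero variance cannot blow up the normalized
    values (e.g. when transferring the policy to another simulator)."""
    var = running_var.clamp(min=min_var)
    return (x - running_mean) / torch.sqrt(var + eps)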
|
|
||
| class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg): | ||
| seed = -1 | ||
| runner_class_name = "OnPolicyRunner" |
Is this the right runner now? What's my_runner?
On this branch it makes no difference; on a different one I added the PBRS into my runner, but left the base on "vanilla". Basically, "My Runner" is where you play with whatever you want.
| class env(LeggedRobotCfg.env): | ||
| num_envs = 4096 | ||
| num_observations = 49 + 3 * 18 # 121 | ||
| num_actuators = 18 |
I think the learnt urdf only has 10 joints right now, so this should crash? But we should start using the arms now, so we should turn those joints back on in the urdf.
diff --git a/gym/envs/mit_humanoid/mit_humanoid_config.py b/gym/envs/mit_humanoid/mit_humanoid_config.py
index 54813f7..368f2c3 100644
--- a/gym/envs/mit_humanoid/mit_humanoid_config.py
+++ b/gym/envs/mit_humanoid/mit_humanoid_config.py
@@ -142,7 +142,7 @@ class MITHumanoidCfg(LeggedRobotCfg):
     class asset(LeggedRobotCfg.asset):
         file = (
             "{LEGGED_GYM_ROOT_DIR}/resources/robots/"
-            + "mit_humanoid/urdf/humanoid_R_sf.urdf"
+            + "mit_humanoid/urdf/humanoid_F_sf_learnt.urdf"
         )
         # foot_collisionbox_names = ["foot"]
         foot_name = "foot"
@@ -289,12 +289,12 @@ class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg):
         use_clipped_value_loss = True
         clip_param = 0.2
         entropy_coef = 0.01
-        num_learning_epochs = 5
+        num_learning_epochs = 4
         # * mini batch size = num_envs*nsteps / nminibatches
         num_mini_batches = 4
-        learning_rate = 5.0e-5
+        learning_rate = 1.0e-6
         schedule = "adaptive"  # could be adaptive, fixed
-        gamma = 0.999
+        gamma = 0.99
         lam = 0.95
         desired_kl = 0.01
         max_grad_norm = 1.0
diff --git a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
index 5e76f34..98a0773 100644
--- a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
+++ b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
@@ -570,7 +570,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="15_left_shoulder_pitch"
-         type="fixed">
+         type="revolute">
     <origin xyz="0.01346 0.17608 0.24657"
             rpy="0 0 0" />
@@ -615,7 +615,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="16_left_shoulder_abad"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 .05760 0"
             rpy="0.0 0 0" />
@@ -668,7 +668,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="17_left_shoulder_yaw"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.10250"
             rpy="0.0 0 0" />
@@ -719,7 +719,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="18_left_elbow"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.15750"
             rpy="0 0 0.0" />
@@ -798,7 +798,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="11_right_shoulder_pitch"
-         type="fixed">
+         type="revolute">
     <origin xyz="0.01346 -0.17608 0.24657"
             rpy="0 0 0" />
@@ -843,7 +843,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="12_right_shoulder_abad"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 -.05760 0"
             rpy="0.0 0 0" />
@@ -896,7 +896,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="13_right_shoulder_yaw"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.10250"
             rpy="0.0 0 0" />
@@ -947,7 +947,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="14_right_elbow"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.15750"
             rpy="0 0 0.0" />
…erically unstable for humanoid
put jacobian into MIT_humanoid base environment, rather than import; importing breaks when running --original_config
Only for lander: needs to overload _check_terminations_and_timeouts to not actually reset, so that resets can be handled by the runner alone (without breaking backwards compatibility). Also keeps a list of rewards for finer reward integration, and can do traj stats.
Adjust some reward weights of action_rate (mini-cheetah) to work decently.
…ffer. Doesn't seem to make a difference in speed.
…n the runner. Seems to make a tiny difference in speed. More importantly, it removes the `eval` hack, which means we can profile individual reward functions. THIS BREAKS PBRS
Run
See file mit_humanoid.py.

Checklist before requesting a review
ruff format . (manually)