Conversation
Tested, 100% the same as PPO.
Same up to 1e-5 on individual values. Because of differencing, this can end up in much larger differences (1e0) occasionally, after normalization.
Left it without bootstrapping on the last step; it seems to make no difference. Bootstrapping on time-outs made a big difference, though.
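For reference, a minimal sketch of what bootstrapping on time-outs amounts to; the buffer names (`rewards`, `values`, `time_outs`) are illustrative, not the exact ones in this PR.

import torch

def bootstrap_timeouts(rewards, values, time_outs, gamma=0.99):
    """On a time-out (episode truncated by the horizon rather than by a failure),
    add the discounted value estimate back into the reward, so the agent is not
    penalized for a truncation it cannot influence. True terminations are untouched."""
    return rewards + gamma * values * time_outs

# toy usage: only env 1 timed out
rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.tensor([10.0, 10.0, 10.0])
time_outs = torch.tensor([0.0, 1.0, 0.0])
print(bootstrap_timeouts(rewards, values, time_outs))  # tensor([ 1.0000, 10.9000,  1.0000])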
Tensor dict
…d history length. Implemented for dof_pos_target, dof_pos, dof_vel.
in preparation for domain randomization of gains
…f_pos_history` (used for smoothness reward). Some tuning. Change stance/swing reward to use the smoothed square wave.
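A minimal sketch of one way such a rolling history buffer can be kept (the roll-and-overwrite layout and the sizes are assumptions; the PR stores histories for dof_pos_target, dof_pos, and dof_vel).

import torch

num_envs, num_dof, history_length = 4096, 12, 3
dof_pos_history = torch.zeros(num_envs, history_length * num_dof)

def push_history(history, new_sample, num_dof):
    """Shift the stored samples back one slot and write the newest one in front."""
    history = torch.roll(history, shifts=num_dof, dims=1)
    history[:, :num_dof] = new_sample
    return history

dof_pos = torch.randn(num_envs, num_dof)
dof_pos_history = push_history(dof_pos_history, dof_pos, num_dof)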
| class runner: | ||
| policy_class_name = "ActorCritic" | ||
| algorithm_class_name = "PPO" | ||
| algorithm_class_name = "PPO2" |
PPO2 splits up the critic update and the actor update, and is otherwise pretty much the same. This is also using the new tensorDict storage. You don't really need to look under the hood unless you want to.
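A rough sketch of the split described above, under the assumption that the critic takes its own regression step on the returns before the actor takes a standard clipped-PPO step; all names below are illustrative, not the actual PPO2 class.

import torch
from torch.distributions import Normal

def split_update(actor, critic, actor_opt, critic_opt, batch, clip_param=0.2):
    # * critic update: plain value regression onto the returns, with its own optimizer step
    value_loss = (critic(batch["obs"]).squeeze(-1) - batch["returns"]).pow(2).mean()
    critic_opt.zero_grad()
    value_loss.backward()
    critic_opt.step()

    # * actor update: standard clipped surrogate on the stored advantages
    dist = Normal(actor(batch["obs"]), 0.5)
    log_prob = dist.log_prob(batch["actions"]).sum(-1)
    ratio = (log_prob - batch["old_log_prob"]).exp()
    surrogate = torch.min(
        ratio * batch["advantages"],
        ratio.clamp(1.0 - clip_param, 1.0 + clip_param) * batch["advantages"],
    )
    actor_loss = -surrogate.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return value_loss.item(), actor_loss.item()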
gym/envs/base/legged_robot.py
Outdated
| self._resample_commands(env_ids) | ||
| # * reset buffers | ||
| self.dof_pos_history[env_ids] = 0.0 | ||
| self.dof_pos_history[env_ids] = ( |
Changed to reset to the current state rather than 0, so the first 3 time-steps don't produce bogus time-differences.
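Roughly what the change amounts to; the `repeat` layout of the history buffer is an assumption here, only the reset-to-current-state idea is from the diff.

import torch

num_envs, num_dof, history_length = 8, 12, 3
dof_pos = torch.randn(num_envs, num_dof)
dof_pos_history = torch.zeros(num_envs, history_length * num_dof)
env_ids = torch.tensor([0, 3])

# * before: zeroing the history makes the first finite differences look like
# * a huge jump from 0 to the current joint position
dof_pos_history[env_ids] = 0.0

# * after: seed every history slot with the current joint positions so the
# * first few time-differences (and the smoothness reward) start at zero
dof_pos_history[env_ids] = dof_pos[env_ids].repeat(1, history_length)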
| ) | ||
| self.p_gains = torch.zeros( | ||
| self.num_actuators, dtype=torch.float, device=self.device | ||
| self.num_envs, self.num_actuators, dtype=torch.float, device=self.device |
this is in prep for randomizing gains
| self.num_actuators, | ||
| dtype=torch.float, | ||
| device=self.device, | ||
| self.num_envs, self.num_actuators, dtype=torch.float, device=self.device |
| self.num_envs, 1, dtype=torch.float, device=self.device | ||
| ) | ||
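Following up on the per-env gain buffers above, a hypothetical sketch of the gain randomization they enable; the scaling ranges and helper below are not part of this PR.

import torch

def randomize_gains(p_gains, d_gains, p_range=(0.8, 1.2), d_range=(0.8, 1.2)):
    """Scale nominal PD gains by a per-env, per-actuator factor. This only works
    because the buffers are now shaped (num_envs, num_actuators) instead of a
    single (num_actuators,) row shared by all environments."""
    p_scale = torch.empty_like(p_gains).uniform_(*p_range)
    d_scale = torch.empty_like(d_gains).uniform_(*d_range)
    return p_gains * p_scale, d_gains * d_scale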
|
|
||
| # # * get the body_name to body_index dict |
wtf is this? Pretty sure this was already there, not sure why it is popping up in the diff
|
|
||
| def _init_buffers(self): | ||
| super()._init_buffers() | ||
| self.oscillators = torch.zeros(self.num_envs, 2, device=self.device) |
I have an oscillator for each leg. They are initialized with a pi phase-shift, so it's the same, but we don't have to shift manually.
So there's a sin and cos for each leg?
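For reference, an illustrative sketch of the phase-shifted initialization and of feeding the phases to the policy as sin/cos pairs; the actual initialization and observation code is not in this diff.

import torch

num_envs, num_legs = 4096, 2
# * one phase per leg: left leg starts at 0, right leg at pi, so the
# * alternating gait comes out of the initialization rather than a manual shift
oscillators = torch.zeros(num_envs, num_legs)
oscillators[:, 1] = torch.pi

# * one common way to expose the phases to the policy: a sin and a cos per leg
oscillator_obs = torch.cat((oscillators.sin(), oscillators.cos()), dim=-1)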
| return rew_vel + rew_pos - rew_base_vel * self._switch("stand") | ||
|
|
||
| def _reward_stance(self): | ||
| # phase = torch.maximum( |
| tracking_sigma = 0.5 | ||
|
|
||
| # a smooth switch based on |cmd| (commanded velocity). | ||
| switch_scale = 0.5 |
See the switch function. Used this in Jenny's project to smoothly turn on/off rewards that we only want on or off when standing still or moving. switch_scale controls how quickly it moves from 0 to 1 (plot the function to see), and switch_threshold is when it starts to transition, based on |command|.
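A hedged sketch of one common shape for such a switch (a tanh ramp); the exact form of `_switch` is not shown in this diff, so treat this as illustrative only.

import torch

def smooth_switch(commands, switch_threshold=0.1, switch_scale=0.5):
    """Goes smoothly from 0 (standing, |command| below the threshold) to 1 (moving).
    switch_scale sets how fast the transition happens, switch_threshold sets
    where it begins, based on the commanded velocity magnitude."""
    cmd_norm = torch.linalg.norm(commands, dim=-1)
    return 0.5 * (1.0 + torch.tanh((cmd_norm - switch_threshold) / switch_scale))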
| critic_hidden_dims = [256, 256, 128] | ||
| # * can be elu, relu, selu, crelu, lrelu, tanh, sigmoid | ||
| activation = "elu" | ||
| activation = "tanh" |
Did a sweep the other day; tanh had high rewards 8 times, compared to 4 for elu... very hand-wavy.
elijahstangerjones
left a comment
Looks good, thanks for porting this, Steve.
| base_lin_vel = 1.5 | ||
| commands = 1 | ||
| base_height = BASE_HEIGHT_REF | ||
| dof_pos = [ |
Is this useful if we have the autonormalization? My current policy just depends on the autonorm, and it's kinda nice not to have to worry about the scaling when sim2sim-ing.
Try it. My "feeling" from Jenny's project is: absolutely yes. But it might also be that it's very important for the action scale (which was shared with this), but not for observations. That would make sense to me.
| """ | ||
| Returns the normalized version of the input. | ||
| """ | ||
| def forward(self, input): |
Have we tested how small these variances can get? I had some issues with sim2sim when the normalization had very small variances.
For safety in sim2sim I had clamped the variance to be no smaller than 0.1 (as per usual I was debugging 90 things at once, so maybe this wasn't that important, but it still seems like a good idea for safety).
I have not. I'll put it on the stack
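A sketch of the safeguard suggested above; the 0.1 floor comes from the comment, while the function and buffer names are illustrative rather than the current implementation.

import torch

def normalize(x, running_mean, running_var, min_var=0.1, eps=1e-8):
    """Returns the normalized version of the input, with the running variance
    clamped from below so a near-zero variance cannot blow up the normalized
    values (e.g. when transferring the policy to another simulator)."""
    var = running_var.clamp(min=min_var)
    return (x - running_mean) / torch.sqrt(var + eps)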
|
|
||
| class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg): | ||
| seed = -1 | ||
| runner_class_name = "OnPolicyRunner" |
Is this the right runner now? What's my_runner?
On this branch it makes no difference; on a different one I added the PBRS into my runner, but left the base on "vanilla". Basically, "My Runner" is where you play with whatever you want.
| class env(LeggedRobotCfg.env): | ||
| num_envs = 4096 | ||
| num_observations = 49 + 3 * 18 # 121 | ||
| num_actuators = 18 |
I think the learnt urdf only has 10 joints right now, so this should crash? But we should start using the arms now, so we should turn those joints back on in the urdf.
diff --git a/gym/envs/mit_humanoid/mit_humanoid_config.py b/gym/envs/mit_humanoid/mit_humanoid_config.py
index 54813f7..368f2c3 100644
--- a/gym/envs/mit_humanoid/mit_humanoid_config.py
+++ b/gym/envs/mit_humanoid/mit_humanoid_config.py
@@ -142,7 +142,7 @@ class MITHumanoidCfg(LeggedRobotCfg):
     class asset(LeggedRobotCfg.asset):
         file = (
             "{LEGGED_GYM_ROOT_DIR}/resources/robots/"
-            + "mit_humanoid/urdf/humanoid_R_sf.urdf"
+            + "mit_humanoid/urdf/humanoid_F_sf_learnt.urdf"
         )
         # foot_collisionbox_names = ["foot"]
         foot_name = "foot"
@@ -289,12 +289,12 @@ class MITHumanoidRunnerCfg(LeggedRobotRunnerCfg):
         use_clipped_value_loss = True
         clip_param = 0.2
         entropy_coef = 0.01
-        num_learning_epochs = 5
+        num_learning_epochs = 4
         # * mini batch size = num_envs*nsteps / nminibatches
         num_mini_batches = 4
-        learning_rate = 5.0e-5
+        learning_rate = 1.0e-6
         schedule = "adaptive"  # could be adaptive, fixed
-        gamma = 0.999
+        gamma = 0.99
         lam = 0.95
         desired_kl = 0.01
         max_grad_norm = 1.0
diff --git a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
index 5e76f34..98a0773 100644
--- a/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
+++ b/resources/robots/mit_humanoid/urdf/humanoid_F_sf_learnt.urdf
@@ -570,7 +570,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="15_left_shoulder_pitch"
-         type="fixed">
+         type="revolute">
     <origin xyz="0.01346 0.17608 0.24657"
             rpy="0 0 0" />
@@ -615,7 +615,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="16_left_shoulder_abad"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 .05760 0"
             rpy="0.0 0 0" />
@@ -668,7 +668,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="17_left_shoulder_yaw"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.10250"
             rpy="0.0 0 0" />
@@ -719,7 +719,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="18_left_elbow"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.15750"
             rpy="0 0 0.0" />
@@ -798,7 +798,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="11_right_shoulder_pitch"
-         type="fixed">
+         type="revolute">
     <origin xyz="0.01346 -0.17608 0.24657"
             rpy="0 0 0" />
@@ -843,7 +843,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="12_right_shoulder_abad"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 -.05760 0"
             rpy="0.0 0 0" />
@@ -896,7 +896,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="13_right_shoulder_yaw"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.10250"
             rpy="0.0 0 0" />
@@ -947,7 +947,7 @@ Simple Foot: foot approximated as single box-contact -->
   </link>
   <joint name="14_right_elbow"
-         type="fixed">
+         type="revolute">
     <origin xyz="0 0 -.15750"
             rpy="0 0 0.0" />
…erically unstable for humanoid
put jacobian into MIT_humanoid base environment, rather than import; importing breaks when running --original_config
Only for lander: needs to overload _check_terminations_and_timeouts to not actually reset, so that resets can be handled by the runner alone (without breaking backwards compatibility). Also keeps a list of rewards for finer reward integration, and can do traj stats.
Adjust some reward weights of action_rate (mini-cheetah) to work decently.
…ffer. Doesn't seem to make a difference in speed.
…n the runner. Seems to make a tiny difference in speed. More importantly, it removes the `eval` hack, which means we can profile individual reward functions. THIS BREAKS PBRS
Run
See file mit_humanoid.py.

Checklist before requesting a review
ruff format . (manually)