sparse constraints: count non-zeros for rownnz and rowadr by thowell · Pull Request #1202 · google-deepmind/mujoco_warp

thowell · 2026-03-04T13:02:26Z

for sparse constraints, count the number of non-zeros for rownnz and rowadr.

introduces the array efc_nnz (size nworld) in the make_constraint scope. for each world, this is the running count of non-zeros in efc_J / efc_J_colind.
when sparse, each constraint computes its number of non-zeros rownnz, then calls atomic_add on efc_nnz with rownnz to get rowadr (note: multi-row constraints like equality connect can make a single call with nrow * rownnz, where nrow=3).
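
a minimal sketch of the allocation pattern described above, in plain python rather than warp. `atomic_add` here is a sequential stand-in that returns the previous counter value (as the `rowadr_base = wp.atomic_add(...)` line in the diff implies), and `allocate_rows` is a hypothetical helper name, not a function from the pr.

```python
def atomic_add(counter, index, value):
    """Sequential stand-in for an atomic add: bump the counter, return the old value."""
    old = counter[index]
    counter[index] += value
    return old


def allocate_rows(efc_nnz, worldid, rownnz, nrow=1):
    """Reserve nrow rows of rownnz non-zeros each; return each row's start address."""
    # one call reserves the whole block for a multi-row constraint
    base = atomic_add(efc_nnz, worldid, nrow * rownnz)
    return [base + i * rownnz for i in range(nrow)]


# two worlds, both start with zero non-zeros
efc_nnz = [0, 0]

# a single-row constraint with 6 non-zeros in world 0
contact_rows = allocate_rows(efc_nnz, worldid=0, rownnz=6)

# an equality connect constraint: 3 rows with 4 non-zeros each in world 0
connect_rows = allocate_rows(efc_nnz, worldid=0, rownnz=4, nrow=3)

print(contact_rows, connect_rows, efc_nnz)
```

the rows of the multi-row constraint end up contiguous because they share one reservation; world 1 is untouched since each world has its own counter.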

note: one potential side effect of this pattern is that the constraint memory in efc_J / efc_J_colind may not be sequential by constraint efc id, since there are 2 separate atomic_add operations that may not be synced. all of the sparse matrix operations will still be correct, but if one inspects the memory, the rows might not appear in efc id order.
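
to illustrate why the memory order does not affect correctness, here is a small plain-python sketch (not code from the pr) of a row-wise sparse product that only ever reaches a row through rowadr and rownnz. the toy matrix and the helper name `sparse_row_dot` are assumptions for illustration.

```python
def sparse_row_dot(efc_J, colind, rowadr, rownnz, row, vec):
    """Dot product of one sparse row with a dense vector."""
    start = rowadr[row]
    return sum(efc_J[start + k] * vec[colind[start + k]] for k in range(rownnz[row]))


# dense reference: J = [[1, 0, 2],
#                       [0, 3, 0]]
# row 1 happens to be stored before row 0 in memory
efc_J = [3.0, 1.0, 2.0]
colind = [1, 0, 2]
rowadr = [1, 0]  # row 0 starts at address 1, row 1 at address 0
rownnz = [2, 1]

vec = [1.0, 10.0, 100.0]
product = [sparse_row_dot(efc_J, colind, rowadr, rownnz, r, vec) for r in range(2)]
print(product)
```

the result matches the dense product J @ vec even though the rows are not stored in efc id order.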

in a follow-up pr, we can add a parameter (something like nefcJmax / nefcJnnz) that determines the memory allocation for efc_J / efc_J_colind. overflow will be reported if rowadr + rownnz >= {allocated number of non-zeros}.
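
a sketch of the overflow check proposed above, in plain python. `nnz_capacity` is a hypothetical name for the allocated number of non-zeros (the description only suggests something like nefcJmax / nefcJnnz), `reserve_row` is a hypothetical helper, and the condition is copied as written, so an exactly full buffer is also flagged.

```python
def reserve_row(efc_nnz, worldid, rownnz, nnz_capacity):
    """Reserve rownnz entries for one row and report whether the buffer overflowed."""
    rowadr = efc_nnz[worldid]  # sequential stand-in for atomic_add's return value
    efc_nnz[worldid] += rownnz
    # condition as stated in the pr description
    overflow = rowadr + rownnz >= nnz_capacity
    return rowadr, overflow


efc_nnz = [0]
first = reserve_row(efc_nnz, 0, rownnz=4, nnz_capacity=10)
second = reserve_row(efc_nnz, 0, rownnz=4, nnz_capacity=10)
third = reserve_row(efc_nnz, 0, rownnz=4, nnz_capacity=10)
print(first, second, third)
```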

this is an alternative to #936. note: the tradeoff with the changes proposed in this pr is that counting non-zeros + atomic operations + efc_nnz is expected to be computationally more expensive, but potentially enables more memory savings than the approach proposed in #936.

erikfrey · 2026-03-04T17:08:16Z

Very interesting! @thowell any impressions so far of performance tradeoff of this approach?

thowell · 2026-03-05T15:35:32Z

tl;dr
comparing humanoid / three humanoids across:

this pr
main: dense / sparse
#936 (set maximum number of dofs for efc_J): sparse, with / without setting nefcdof < nv

humanoid

the existing dense path is best overall [sps: 3,627,216] (not unexpected since this path has been optimized)
for sparse options, this pr has the best performance [sps: 2,782,191]

three humanoids

the overall best is this pr [sps: 621,672], compared to main [sparse sps: 475,032, dense sps: 492,304]
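
for a quick sense of scale, the relative throughput can be computed from the steps-per-second figures quoted in this comment. this is only arithmetic on the reported numbers; the dictionary keys are labels chosen for this sketch.

```python
# steps per second as reported in this thread
sps = {
    "humanoid": {"this_pr_sparse": 2_782_191, "main_sparse": 2_133_377, "main_dense": 3_627_216},
    "three_humanoids": {"this_pr_sparse": 621_672, "main_sparse": 475_032, "main_dense": 492_304},
}


def speedup(model, baseline):
    """Throughput of this pr (sparse) relative to a baseline run on the same model."""
    return sps[model]["this_pr_sparse"] / sps[model][baseline]


for model in sps:
    for baseline in ("main_sparse", "main_dense"):
        print(f"{model}: this pr vs {baseline}: {speedup(model, baseline):.2f}x")
```

with these numbers, this pr is roughly 1.30x (humanoid) and 1.31x (three humanoids) faster than main's sparse path, about 0.77x of main's dense path on humanoid, and about 1.26x of it on three humanoids.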

humanoid

mjwarp-testspeed benchmarks/humanoid/humanoid.xml --nworld=8192 --nconmax=24 --njmax=64 -o "opt.jacobian="sparse"" --event_trace

this pr with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.27 s
Total simulation time: 2.94 s
Total steps per second: 2,782,191
Total realtime factor: 13,910.95 x
Total time per step: 359.43 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 357.24
  forward: 354.48
    fwd_position: 71.17
      kinematics: 10.05
      com_pos: 5.67
      camlight: 1.81
      flex: 0.17
      crb: 8.41
      tendon_armature: 0.18
      collision: 10.13
        nxn_broadphase: 3.79
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.27
      make_constraint: 30.74
      transmission: 2.24
    sensor_pos: 0.18
    fwd_velocity: 25.10
      com_vel: 5.03
      passive: 1.28
      rne: 7.03
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 1.89
    fwd_acceleration: 36.98
      xfrc_accumulate: 1.65
    solve: 216.97
      mul_m: 2.68
    sensor_acc: 0.18
  euler: 2.21

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.27 s
Total simulation time: 3.84 s
Total steps per second: 2,133,377
Total realtime factor: 10,666.89 x
Total time per step: 468.74 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 466.49
  forward: 463.80
    fwd_position: 106.10
      kinematics: 10.00
      com_pos: 5.66
      camlight: 1.79
      flex: 0.17
      crb: 8.41
      tendon_armature: 0.17
      collision: 10.13
        nxn_broadphase: 3.78
        convex_narrowphase: 0.18
        primitive_narrowphase: 5.27
      make_constraint: 65.65
      transmission: 2.37
    sensor_pos: 0.18
    fwd_velocity: 26.14
      com_vel: 6.12
      passive: 1.29
      rne: 6.98
      tendon_bias: 0.17
    sensor_vel: 0.17
    fwd_actuation: 1.89
    fwd_acceleration: 36.79
      xfrc_accumulate: 1.62
    solve: 290.49
      mul_m: 3.31
    sensor_acc: 0.17
  euler: 2.14

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=False

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.29 s
Total simulation time: 2.26 s
Total steps per second: 3,627,216
Total realtime factor: 18,136.08 x
Total time per step: 275.69 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 273.74
  forward: 271.02
    fwd_position: 79.28
      kinematics: 9.84
      com_pos: 5.66
      camlight: 1.80
      flex: 0.17
      crb: 8.36
      tendon_armature: 0.17
      collision: 10.09
        nxn_broadphase: 3.79
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.23
      make_constraint: 39.08
      transmission: 2.33
    sensor_pos: 0.18
    fwd_velocity: 25.76
      com_vel: 5.82
      passive: 1.29
      rne: 6.96
      tendon_bias: 0.17
    sensor_vel: 0.18
    fwd_actuation: 1.97
    fwd_acceleration: 37.02
      xfrc_accumulate: 1.64
    solve: 124.65
      mul_m: 2.95
    sensor_acc: 0.17
  euler: 2.18

#936 9354443 with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.27 s
Total simulation time: 3.74 s
Total steps per second: 2,192,628
Total realtime factor: 10,963.14 x
Total time per step: 456.07 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 453.85
  forward: 451.16
    fwd_position: 103.45
      kinematics: 10.02
      com_pos: 5.67
      camlight: 1.80
      flex: 0.17
      crb: 8.44
      tendon_armature: 0.17
      collision: 10.16
        nxn_broadphase: 3.80
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.29
      make_constraint: 62.91
      transmission: 2.35
    sensor_pos: 0.17
    fwd_velocity: 26.18
      com_vel: 6.11
      passive: 1.29
      rne: 6.99
      tendon_bias: 0.17
    sensor_vel: 0.18
    fwd_actuation: 1.89
    fwd_acceleration: 36.85
      xfrc_accumulate: 1.62
    solve: 280.42
      mul_m: 3.32
    sensor_acc: 0.17
  euler: 2.14

with nefcdof=16

mjwarp-testspeed benchmarks/humanoid/humanoid.xml --nworld=8192 --nconmax=24 --njmax=64 -o "opt.jacobian="sparse"" --event_trace --nefcdof=16

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.26 s
Total simulation time: 2.95 s
Total steps per second: 2,777,934
Total realtime factor: 13,889.67 x
Total time per step: 359.98 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 357.98
  forward: 355.30
    fwd_position: 75.61
      kinematics: 9.88
      com_pos: 5.67
      camlight: 1.74
      flex: 0.17
      crb: 8.38
      tendon_armature: 0.17
      collision: 10.07
        nxn_broadphase: 3.77
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.23
      make_constraint: 35.47
      transmission: 2.29
    sensor_pos: 0.17
    fwd_velocity: 26.15
      com_vel: 6.08
      passive: 1.28
      rne: 7.05
      tendon_bias: 0.18
    sensor_vel: 0.18
    fwd_actuation: 1.87
    fwd_acceleration: 36.97
      xfrc_accumulate: 1.63
    solve: 212.35
      mul_m: 3.18
    sensor_acc: 0.17
  euler: 2.14

three humanoids

mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 -o "opt.jacobian="sparse"" --event_trace

this pr with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.44 s
Total simulation time: 13.18 s
Total steps per second: 621,672
Total realtime factor: 3,108.36 x
Total time per step: 1608.57 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 1604.77
  forward: 1508.12
    fwd_position: 213.60
      kinematics: 23.51
      com_pos: 14.16
      camlight: 3.42
      flex: 0.18
      crb: 23.52
      tendon_armature: 0.17
      collision: 29.64
        nxn_broadphase: 17.90
        convex_narrowphase: 0.17
        primitive_narrowphase: 10.65
      make_constraint: 81.39
      transmission: 35.77
    sensor_pos: 0.17
    fwd_velocity: 68.31
      com_vel: 15.22
      passive: 2.09
      rne: 16.07
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 32.46
    fwd_acceleration: 92.88
      xfrc_accumulate: 5.72
    solve: 1098.47
      mul_m: 6.85
    sensor_acc: 0.17
  euler: 96.06

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.45 s
Total simulation time: 17.25 s
Total steps per second: 475,032
Total realtime factor: 2,375.16 x
Total time per step: 2105.12 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 2101.36
  forward: 2006.16
    fwd_position: 355.22
      kinematics: 23.25
      com_pos: 14.02
      camlight: 3.39
      flex: 0.17
      crb: 23.26
      tendon_armature: 0.18
      collision: 29.34
        nxn_broadphase: 17.76
        convex_narrowphase: 0.18
        primitive_narrowphase: 10.51
      make_constraint: 221.51
      transmission: 38.29
    sensor_pos: 0.17
    fwd_velocity: 67.61
      com_vel: 15.06
      passive: 2.06
      rne: 15.89
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 32.45
    fwd_acceleration: 91.54
      xfrc_accumulate: 5.66
    solve: 1456.94
      mul_m: 6.35
    sensor_acc: 0.18
  euler: 94.62

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=False

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.46 s
Total simulation time: 16.64 s
Total steps per second: 492,304
Total realtime factor: 2,461.52 x
Total time per step: 2031.26 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 2027.35
  forward: 1930.33
    fwd_position: 375.14
      kinematics: 23.07
      com_pos: 14.40
      camlight: 3.46
      flex: 0.18
      crb: 23.67
      tendon_armature: 0.17
      collision: 29.92
        nxn_broadphase: 18.03
        convex_narrowphase: 0.18
        primitive_narrowphase: 10.80
      make_constraint: 242.96
      transmission: 35.49
    sensor_pos: 0.18
    fwd_velocity: 68.67
      com_vel: 15.39
      passive: 2.11
      rne: 16.18
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 32.56
    fwd_acceleration: 93.59
      xfrc_accumulate: 5.78
    solve: 1357.94
      mul_m: 6.38
    sensor_acc: 0.17
  euler: 96.44

#936 9354443 with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.44 s
Total simulation time: 16.91 s
Total steps per second: 484,403
Total realtime factor: 2,422.01 x
Total time per step: 2064.40 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 2060.42
  forward: 1964.98
    fwd_position: 344.50
      kinematics: 23.30
      com_pos: 14.05
      camlight: 3.40
      flex: 0.18
      crb: 23.31
      tendon_armature: 0.17
      collision: 29.41
        nxn_broadphase: 17.79
        convex_narrowphase: 0.17
        primitive_narrowphase: 10.54
      make_constraint: 210.73
      transmission: 38.12
    sensor_pos: 0.17
    fwd_velocity: 67.76
      com_vel: 15.10
      passive: 2.07
      rne: 15.93
      tendon_bias: 0.18
    sensor_vel: 0.18
    fwd_actuation: 32.45
    fwd_acceleration: 91.81
      xfrc_accumulate: 5.67
    solve: 1426.04
      mul_m: 6.36
    sensor_acc: 0.17
  euler: 94.85

with nefcdof=32

mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 -o "opt.jacobian="sparse"" --event_trace --nefcdof=32

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.44 s
Total simulation time: 14.61 s
Total steps per second: 560,799
Total realtime factor: 2,803.99 x
Total time per step: 1783.17 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 1779.71
  forward: 1681.05
    fwd_position: 293.63
      kinematics: 23.61
      com_pos: 14.43
      camlight: 3.55
      flex: 0.18
      crb: 23.93
      tendon_armature: 0.18
      collision: 30.30
        nxn_broadphase: 18.30
        convex_narrowphase: 0.18
        primitive_narrowphase: 10.90
      make_constraint: 158.61
      transmission: 37.05
    sensor_pos: 0.17
    fwd_velocity: 69.70
      com_vel: 15.47
      passive: 2.14
      rne: 16.55
      tendon_bias: 0.18
    sensor_vel: 0.18
    fwd_actuation: 32.58
    fwd_acceleration: 95.20
      xfrc_accumulate: 5.82
    solve: 1187.49
      mul_m: 6.46
    sensor_acc: 0.18
  euler: 98.07

adenzler-nvidia · 2026-03-06T10:10:12Z

wow, these numbers look massive! Amazing.

adenzler-nvidia

Looks good, numbers are very convincing. Maybe it's possible to pre-calculate a lot of these numbers even? Might not move the needle much in terms of performance, but we should check

adenzler-nvidia · 2026-03-06T10:11:53Z

mujoco_warp/_src/constraint.py

+        da1 = dof_parentid[da1]
+      if da2 == da:
+        da2 = dof_parentid[da2]
+      rownnz += 1


these numbers could even be pre-calculated, and don't need to be a runtime exploration?

thowell · 2026-03-06

yes, the number of non-zeros could be pre-computed. added todos to _equality_connect and _equality_weld.
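
a rough sketch of what such a pre-computation could look like, in plain python. the loop shape is reconstructed from the diff fragment above (walk both dof chains toward the root, counting each distinct dof once), so treat it as an assumption rather than the pr's code. `precompute_eq_rownnz`, `eq_bodies`, and the toy model are hypothetical, and the sketch assumes every listed body has at least one dof and that `dof_parentid[i] < i`.

```python
def count_row_nnz(dof_parentid, da1, da2):
    """Count distinct dofs on the two chains from da1 and da2 up to the root."""
    rownnz = 0
    while da1 >= 0 or da2 >= 0:
        da = max(da1, da2)
        # advance whichever chain(s) currently sit on the largest dof
        if da1 == da:
            da1 = dof_parentid[da1]
        if da2 == da:
            da2 = dof_parentid[da2]
        rownnz += 1
    return rownnz


def precompute_eq_rownnz(dof_parentid, body_dofadr, body_dofnum, eq_bodies):
    """Per-constraint non-zero counts that depend only on the model's tree."""
    table = []
    for body1, body2 in eq_bodies:
        # last dof of each body, as in the diff
        da1 = body_dofadr[body1] + body_dofnum[body1] - 1
        da2 = body_dofadr[body2] + body_dofnum[body2] - 1
        table.append(count_row_nnz(dof_parentid, da1, da2))
    return table


# toy tree (hypothetical): dof 0 is the root, with branches 0-1-2 and 0-3-4
dof_parentid = [-1, 0, 1, 0, 3]
body_dofadr = [0, 1, 3]
body_dofnum = [1, 2, 2]
eq_bodies = [(1, 2), (0, 1)]

eq_rownnz = precompute_eq_rownnz(dof_parentid, body_dofadr, body_dofnum, eq_bodies)
print(eq_rownnz)
```

since the counts depend only on the kinematic tree and the constraint's bodies, a table like this could be built once at model creation and read at runtime instead of walking the chains in the kernel.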

erikfrey

Fantastic improvement! Just two nits

erikfrey · 2026-03-11T22:30:51Z

mujoco_warp/_src/constraint.py

    da2 = int(body_dofadr[body2] + body_dofnum[body2] - 1)

+    # count non-zeros
+    da1_save = da1


nit: just to reduce cognitive overhead a bit, why don't we flip this to avoid mental bookkeeping:

pda1, pda2 = da1, da2

then iterate over pda1 and pda2 for counting

(nit applies here and elsewhere)
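
a small sketch of the suggested pattern, in plain python: the counting pass walks copies `pda1` / `pda2`, so `da1` / `da2` are still intact for a later pass. the fill pass and its column order are assumptions for illustration, not taken from the pr, and the sketch assumes `dof_parentid[i] < i`.

```python
def count_then_fill(dof_parentid, da1, da2):
    """Count non-zeros using copies, then walk the original indices to list columns."""
    # counting pass: iterate over copies, leaving da1 / da2 untouched
    pda1, pda2 = da1, da2
    rownnz = 0
    while pda1 >= 0 or pda2 >= 0:
        da = max(pda1, pda2)
        if pda1 == da:
            pda1 = dof_parentid[pda1]
        if pda2 == da:
            pda2 = dof_parentid[pda2]
        rownnz += 1

    # fill pass (assumed): da1 / da2 still hold their starting values
    colind = [0] * rownnz
    k = 0
    while da1 >= 0 or da2 >= 0:
        da = max(da1, da2)
        colind[k] = da
        k += 1
        if da1 == da:
            da1 = dof_parentid[da1]
        if da2 == da:
            da2 = dof_parentid[da2]
    return rownnz, colind


# toy tree (hypothetical): dof 0 is the root, with branches 0-1-2 and 0-3-4
dof_parentid = [-1, 0, 1, 0, 3]
rownnz, colind = count_then_fill(dof_parentid, da1=2, da2=4)
print(rownnz, colind)
```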

erikfrey · 2026-03-11T23:06:27Z

mujoco_warp/_src/constraint.py

+      rownnz += 1
+
+    # get rowadr
+    rowadr_base = wp.atomic_add(efc_nnz_out, worldid, 3 * rownnz)


nit: for brevity, maybe just do:

rowadr = wp.atomic_add(efc_nnz_out, worldid, 3 * rownnz)
efc_J_rowadr_out[worldid, efcid + 0] = rowadr
efc_J_rowadr_out[worldid, efcid + 1] = rowadr + rownnz
efc_J_rowadr_out[worldid, efcid + 2] = rowadr + 2 * rownnz

(nit applies here and elsewhere)

sparse constraints: count non-zeros for rownnz and rowadr

a6b7e1a

thowell requested a review from erikfrey March 4, 2026 13:02

thowell linked an issue Mar 4, 2026 that may be closed by this pull request

JacobianType.SPARSE #88

Open

4 tasks

thowell added this to the Sparsity milestone Mar 4, 2026

thowell marked this pull request as ready for review March 5, 2026 15:36

thowell requested a review from adenzler-nvidia March 5, 2026 15:36

adenzler-nvidia reviewed Mar 6, 2026


add todos

92e022b

erikfrey approved these changes Mar 11, 2026

thowell wants to merge 2 commits into google-deepmind:main from thowell:constraint_count_nonzeros
