sparse constraints: count non-zeros for rownnz and rowadr #1202

Open
thowell wants to merge 2 commits into google-deepmind:main from thowell:constraint_count_nonzeros

thowell (Collaborator) commented Mar 4, 2026

for sparse constraints, count the number of non-zeros for rownnz and rowadr.

  • introduces the array efc_nnz (size nworld) in the make_constraint scope. for each world, this is the running count of non-zeros for efc_J/efc_J_colind
  • when sparse, a constraint computes the number of non-zeros rownnz, then calls atomic_add with efc_nnz and rownnz to get rowadr (note: multi-row constraints like equality connect can call with nrow * rownnz where nrow=3).
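The reservation step in the second bullet can be sketched in plain Python. This is a serial stand-in, not the kernel code from the PR: `atomic_add` here is a hypothetical helper that mimics a fetch-and-add by returning the value held before the addition, and the constraint list is made up for illustration.

```python
def atomic_add(counter, index, value):
  """Serial stand-in for a fetch-and-add: returns the value before the add."""
  old = counter[index]
  counter[index] += value
  return old


nworld = 2
efc_nnz = [0] * nworld  # running non-zero count per world

# hypothetical constraints as (worldid, nrow, rownnz)
constraints = [(0, 1, 5), (0, 3, 4), (1, 1, 7), (0, 1, 2)]

rowadr = []
for worldid, nrow, rownnz in constraints:
  # one reservation covers every row of a multi-row constraint
  base = atomic_add(efc_nnz, worldid, nrow * rownnz)
  # row r of the constraint starts r * rownnz entries after the base
  rowadr.append([base + r * rownnz for r in range(nrow)])

print(rowadr)   # [[0], [5, 9, 13], [0], [17]]
print(efc_nnz)  # [19, 7]
```

Each world keeps its own counter, so reservations in one world never shift addresses in another.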

note: one potential side effect of this pattern is that the constraint memory in efc_J / efc_J_colind may not be sequential by constraint efc id, since there are 2 separate atomic_add operations (one for the efc id, one for rowadr) that are not synchronized with each other. all of the sparse matrix operations will still be correct, but if one inspects the memory directly, the rows might not appear in efc id order.
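A small sketch of why the storage order does not affect results: any operation that reaches a row through `rowadr` and `rownnz` sees the same entries no matter where the row sits in memory. The function name and the data below are illustrative assumptions, not code from the PR.

```python
def sparse_mul_vec(rowadr, rownnz, colind, values, vec):
  """Computes J @ vec from row-addressed sparse storage."""
  out = []
  for adr, nnz in zip(rowadr, rownnz):
    out.append(sum(values[adr + k] * vec[colind[adr + k]] for k in range(nnz)))
  return out


vec = [1.0, 2.0, 3.0, 4.0]
rownnz = [2, 1, 2]

# rows stored in efc id order: row 0, row 1, row 2
seq = sparse_mul_vec([0, 2, 3], rownnz, [0, 3, 1, 1, 2], [1.0, 2.0, 3.0, 4.0, 5.0], vec)

# same rows, but row 2 reserved its slots first, then row 0, then row 1
shuffled = sparse_mul_vec([2, 4, 0], rownnz, [1, 2, 0, 3, 1], [4.0, 5.0, 1.0, 2.0, 3.0], vec)

print(seq)       # [9.0, 6.0, 23.0]
print(shuffled)  # [9.0, 6.0, 23.0]
```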

in a follow-up pr, we can add a parameter (something like nefcJmax / nefcJnnz) that determines the memory allocation for efc_J / efc_J_colind. overflow will be reported if rowadr + rownnz >= {allocated number of non-zeros}.
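A minimal sketch of how such an overflow check might look, assuming a hypothetical allocation size `nnz_max` (the parameter does not exist yet; the names suggested above are only candidates). The comparison is taken as written in the description.

```python
def reserve_rows(efc_nnz, worldid, nrow, rownnz, nnz_max):
  """Reserves nrow * rownnz entries and returns (rowadr, overflow)."""
  rowadr = efc_nnz[worldid]  # value before the add, as a fetch-and-add would return
  efc_nnz[worldid] += nrow * rownnz
  # condition as written in the description above
  overflow = rowadr + nrow * rownnz >= nnz_max
  return rowadr, overflow


efc_nnz = [0]
nnz_max = 16  # hypothetical number of allocated non-zeros per world

first = reserve_rows(efc_nnz, 0, 1, 6, nnz_max)
second = reserve_rows(efc_nnz, 0, 3, 4, nnz_max)
print(first)   # (0, False)
print(second)  # (6, True)
```

In this sketch the counter still advances on overflow, so a caller would need to skip the write for any constraint flagged as overflowing.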

this is an alternative to #936. note: the tradeoff is that counting non-zeros, the extra atomic operations, and maintaining efc_nnz are expected to be computationally more expensive, but this pr potentially enables more memory savings than the approach proposed in #936.
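To make the memory side of the tradeoff concrete, here is a rough entry count for a single world. It rests on two assumptions that this thread does not confirm: that the alternative reserves a fixed `nefcdof` slots per row (inferred only from the `--nefcdof` flag in the benchmark commands), and a made-up list of per-row non-zero counts. `njmax`, `nv`, and `nefcdof` are the humanoid benchmark values.

```python
njmax, nv, nefcdof = 64, 27, 16  # values from the humanoid benchmark runs

# hypothetical per-row non-zero counts for one world (illustrative only)
rownnz = [6, 6, 6, 12, 12, 3, 3, 9]

dense_entries = njmax * nv             # full njmax x nv Jacobian
fixed_width_entries = njmax * nefcdof  # assumed layout: every row reserves nefcdof slots
counted_entries = sum(rownnz)          # entries actually used when rows reserve exact counts

print(dense_entries, fixed_width_entries, counted_entries)  # 1728 1024 57
```

The counted figure is the number of entries in use, so it is a lower bound on whatever allocation size a follow-up parameter would need, not an allocation size itself.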

thowell requested a review from erikfrey March 4, 2026 13:02
thowell linked an issue Mar 4, 2026 that may be closed by this pull request (4 tasks)
thowell added this to the Sparsity milestone Mar 4, 2026

erikfrey (Collaborator) commented Mar 4, 2026

Very interesting! @thowell any impressions so far of the performance tradeoff of this approach?

thowell (Collaborator, Author) commented Mar 5, 2026

tl;dr
comparing humanoid/three humanoids:

humanoid

  • the existing dense path is best overall [sps: 3,627,216] (not unexpected since this path has been optimized)
  • for sparse options, this pr has the best performance [sps: 2,782,191]

three humanoids

  • the overall best is this pr [sps: 621,672], compared to main [sparse sps: 475,032, dense sps: 492,304]
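For a quick read of the relative gains, the ratios can be computed directly from the steps-per-second figures reported in the summaries in this comment. The dictionary keys are labels chosen here for illustration.

```python
# steps per second, copied from the benchmark summaries in this comment
sps = {
  "humanoid": {"this_pr_sparse": 2_782_191, "main_sparse": 2_133_377, "main_dense": 3_627_216},
  "three_humanoids": {"this_pr_sparse": 621_672, "main_sparse": 475_032, "main_dense": 492_304},
}

speedup = {
  model: {
    "vs_main_sparse": r["this_pr_sparse"] / r["main_sparse"],
    "vs_main_dense": r["this_pr_sparse"] / r["main_dense"],
  }
  for model, r in sps.items()
}

for model, s in speedup.items():
  print(f"{model}: {s['vs_main_sparse']:.2f}x vs main sparse, {s['vs_main_dense']:.2f}x vs main dense")
```

In both models the sparse path in this pr runs about 1.3x faster than the sparse path on main; only on three humanoids does it also beat the dense path.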

humanoid

mjwarp-testspeed benchmarks/humanoid/humanoid.xml --nworld=8192 --nconmax=24 --njmax=64 -o "opt.jacobian="sparse"" --event_trace

this pr with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.27 s
Total simulation time: 2.94 s
Total steps per second: 2,782,191
Total realtime factor: 13,910.95 x
Total time per step: 359.43 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 357.24
  forward: 354.48
    fwd_position: 71.17
      kinematics: 10.05
      com_pos: 5.67
      camlight: 1.81
      flex: 0.17
      crb: 8.41
      tendon_armature: 0.18
      collision: 10.13
        nxn_broadphase: 3.79
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.27
      make_constraint: 30.74
      transmission: 2.24
    sensor_pos: 0.18
    fwd_velocity: 25.10
      com_vel: 5.03
      passive: 1.28
      rne: 7.03
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 1.89
    fwd_acceleration: 36.98
      xfrc_accumulate: 1.65
    solve: 216.97
      mul_m: 2.68
    sensor_acc: 0.18
  euler: 2.21

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.27 s
Total simulation time: 3.84 s
Total steps per second: 2,133,377
Total realtime factor: 10,666.89 x
Total time per step: 468.74 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 466.49
  forward: 463.80
    fwd_position: 106.10
      kinematics: 10.00
      com_pos: 5.66
      camlight: 1.79
      flex: 0.17
      crb: 8.41
      tendon_armature: 0.17
      collision: 10.13
        nxn_broadphase: 3.78
        convex_narrowphase: 0.18
        primitive_narrowphase: 5.27
      make_constraint: 65.65
      transmission: 2.37
    sensor_pos: 0.18
    fwd_velocity: 26.14
      com_vel: 6.12
      passive: 1.29
      rne: 6.98
      tendon_bias: 0.17
    sensor_vel: 0.17
    fwd_actuation: 1.89
    fwd_acceleration: 36.79
      xfrc_accumulate: 1.62
    solve: 290.49
      mul_m: 3.31
    sensor_acc: 0.17
  euler: 2.14

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=False

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.29 s
Total simulation time: 2.26 s
Total steps per second: 3,627,216
Total realtime factor: 18,136.08 x
Total time per step: 275.69 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 273.74
  forward: 271.02
    fwd_position: 79.28
      kinematics: 9.84
      com_pos: 5.66
      camlight: 1.80
      flex: 0.17
      crb: 8.36
      tendon_armature: 0.17
      collision: 10.09
        nxn_broadphase: 3.79
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.23
      make_constraint: 39.08
      transmission: 2.33
    sensor_pos: 0.18
    fwd_velocity: 25.76
      com_vel: 5.82
      passive: 1.29
      rne: 6.96
      tendon_bias: 0.17
    sensor_vel: 0.18
    fwd_actuation: 1.97
    fwd_acceleration: 37.02
      xfrc_accumulate: 1.64
    solve: 124.65
      mul_m: 2.95
    sensor_acc: 0.17
  euler: 2.18

#936 9354443 with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.27 s
Total simulation time: 3.74 s
Total steps per second: 2,192,628
Total realtime factor: 10,963.14 x
Total time per step: 456.07 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 453.85
  forward: 451.16
    fwd_position: 103.45
      kinematics: 10.02
      com_pos: 5.67
      camlight: 1.80
      flex: 0.17
      crb: 8.44
      tendon_armature: 0.17
      collision: 10.16
        nxn_broadphase: 3.80
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.29
      make_constraint: 62.91
      transmission: 2.35
    sensor_pos: 0.17
    fwd_velocity: 26.18
      com_vel: 6.11
      passive: 1.29
      rne: 6.99
      tendon_bias: 0.17
    sensor_vel: 0.18
    fwd_actuation: 1.89
    fwd_acceleration: 36.85
      xfrc_accumulate: 1.62
    solve: 280.42
      mul_m: 3.32
    sensor_acc: 0.17
  euler: 2.14

with nefcdof=16

mjwarp-testspeed benchmarks/humanoid/humanoid.xml --nworld=8192 --nconmax=24 --njmax=64 -o "opt.jacobian="sparse"" --event_trace --nefcdof=16
Loading model from: benchmarks/humanoid/humanoid.xml...

Model
  nq: 28 nv: 27 nu: 21 nbody: 17 ngeom: 20
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 196608 njmax: 64
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.26 s
Total simulation time: 2.95 s
Total steps per second: 2,777,934
Total realtime factor: 13,889.67 x
Total time per step: 359.98 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 357.98
  forward: 355.30
    fwd_position: 75.61
      kinematics: 9.88
      com_pos: 5.67
      camlight: 1.74
      flex: 0.17
      crb: 8.38
      tendon_armature: 0.17
      collision: 10.07
        nxn_broadphase: 3.77
        convex_narrowphase: 0.17
        primitive_narrowphase: 5.23
      make_constraint: 35.47
      transmission: 2.29
    sensor_pos: 0.17
    fwd_velocity: 26.15
      com_vel: 6.08
      passive: 1.28
      rne: 7.05
      tendon_bias: 0.18
    sensor_vel: 0.18
    fwd_actuation: 1.87
    fwd_acceleration: 36.97
      xfrc_accumulate: 1.63
    solve: 212.35
      mul_m: 3.18
    sensor_acc: 0.17
  euler: 2.14

three humanoids

mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 -o "opt.jacobian="sparse"" --event_trace

this pr with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.44 s
Total simulation time: 13.18 s
Total steps per second: 621,672
Total realtime factor: 3,108.36 x
Total time per step: 1608.57 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 1604.77
  forward: 1508.12
    fwd_position: 213.60
      kinematics: 23.51
      com_pos: 14.16
      camlight: 3.42
      flex: 0.18
      crb: 23.52
      tendon_armature: 0.17
      collision: 29.64
        nxn_broadphase: 17.90
        convex_narrowphase: 0.17
        primitive_narrowphase: 10.65
      make_constraint: 81.39
      transmission: 35.77
    sensor_pos: 0.17
    fwd_velocity: 68.31
      com_vel: 15.22
      passive: 2.09
      rne: 16.07
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 32.46
    fwd_acceleration: 92.88
      xfrc_accumulate: 5.72
    solve: 1098.47
      mul_m: 6.85
    sensor_acc: 0.17
  euler: 96.06

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.45 s
Total simulation time: 17.25 s
Total steps per second: 475,032
Total realtime factor: 2,375.16 x
Total time per step: 2105.12 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 2101.36
  forward: 2006.16
    fwd_position: 355.22
      kinematics: 23.25
      com_pos: 14.02
      camlight: 3.39
      flex: 0.17
      crb: 23.26
      tendon_armature: 0.18
      collision: 29.34
        nxn_broadphase: 17.76
        convex_narrowphase: 0.18
        primitive_narrowphase: 10.51
      make_constraint: 221.51
      transmission: 38.29
    sensor_pos: 0.17
    fwd_velocity: 67.61
      com_vel: 15.06
      passive: 2.06
      rne: 15.89
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 32.45
    fwd_acceleration: 91.54
      xfrc_accumulate: 5.66
    solve: 1456.94
      mul_m: 6.35
    sensor_acc: 0.18
  euler: 94.62

5e8c0e2 (main) with SPARSE_CONSTRAINT_JACOBIAN=False

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.46 s
Total simulation time: 16.64 s
Total steps per second: 492,304
Total realtime factor: 2,461.52 x
Total time per step: 2031.26 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 2027.35
  forward: 1930.33
    fwd_position: 375.14
      kinematics: 23.07
      com_pos: 14.40
      camlight: 3.46
      flex: 0.18
      crb: 23.67
      tendon_armature: 0.17
      collision: 29.92
        nxn_broadphase: 18.03
        convex_narrowphase: 0.18
        primitive_narrowphase: 10.80
      make_constraint: 242.96
      transmission: 35.49
    sensor_pos: 0.18
    fwd_velocity: 68.67
      com_vel: 15.39
      passive: 2.11
      rne: 16.18
      tendon_bias: 0.18
    sensor_vel: 0.17
    fwd_actuation: 32.56
    fwd_acceleration: 93.59
      xfrc_accumulate: 5.78
    solve: 1357.94
      mul_m: 6.38
    sensor_acc: 0.17
  euler: 96.44

#936 9354443 with SPARSE_CONSTRAINT_JACOBIAN=True

Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.44 s
Total simulation time: 16.91 s
Total steps per second: 484,403
Total realtime factor: 2,422.01 x
Total time per step: 2064.40 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 2060.42
  forward: 1964.98
    fwd_position: 344.50
      kinematics: 23.30
      com_pos: 14.05
      camlight: 3.40
      flex: 0.18
      crb: 23.31
      tendon_armature: 0.17
      collision: 29.41
        nxn_broadphase: 17.79
        convex_narrowphase: 0.17
        primitive_narrowphase: 10.54
      make_constraint: 210.73
      transmission: 38.12
    sensor_pos: 0.17
    fwd_velocity: 67.76
      com_vel: 15.10
      passive: 2.07
      rne: 15.93
      tendon_bias: 0.18
    sensor_vel: 0.18
    fwd_actuation: 32.45
    fwd_acceleration: 91.81
      xfrc_accumulate: 5.67
    solve: 1426.04
      mul_m: 6.36
    sensor_acc: 0.17
  euler: 94.85

with nefcdof=32

mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 -o "opt.jacobian="sparse"" --event_trace --nefcdof=32
Loading model from: benchmarks/humanoid/three_humanoids.xml...

Model
  nq: 84 nv: 81 nu: 63 nbody: 49 ngeom: 58
Option
  integrator: EULER
  cone: PYRAMIDAL
  solver: NEWTON iterations: 100 ls_iterations: 50
  is_sparse: True
  ls_parallel: False
  broadphase: NXN broadphase_filter: PLANE|SPHERE|OBB
Data
  nworld: 8192 naconmax: 819200 njmax: 192
Rolling out 1000 steps at dt = 0.005...

Summary for 8192 parallel rollouts

Total JIT time: 0.44 s
Total simulation time: 14.61 s
Total steps per second: 560,799
Total realtime factor: 2,803.99 x
Total time per step: 1783.17 ns
Total converged worlds: 8192 / 8192

Event trace:

step: 1779.71
  forward: 1681.05
    fwd_position: 293.63
      kinematics: 23.61
      com_pos: 14.43
      camlight: 3.55
      flex: 0.18
      crb: 23.93
      tendon_armature: 0.18
      collision: 30.30
        nxn_broadphase: 18.30
        convex_narrowphase: 0.18
        primitive_narrowphase: 10.90
      make_constraint: 158.61
      transmission: 37.05
    sensor_pos: 0.17
    fwd_velocity: 69.70
      com_vel: 15.47
      passive: 2.14
      rne: 16.55
      tendon_bias: 0.18
    sensor_vel: 0.18
    fwd_actuation: 32.58
    fwd_acceleration: 95.20
      xfrc_accumulate: 5.82
    solve: 1187.49
      mul_m: 6.46
    sensor_acc: 0.18
  euler: 98.07

thowell marked this pull request as ready for review March 5, 2026 15:36
thowell requested a review from adenzler-nvidia March 5, 2026 15:36

adenzler-nvidia (Collaborator) commented

wow, these numbers look massive! Amazing.

adenzler-nvidia (Collaborator) left a comment

Looks good, the benchmark numbers are very convincing. Maybe it's even possible to pre-calculate a lot of these non-zero counts? It might not move the needle much in terms of performance, but we should check.

    da1 = dof_parentid[da1]
  if da2 == da:
    da2 = dof_parentid[da2]
  rownnz += 1

Collaborator

these counts could even be pre-calculated, so they don't need to be explored at runtime?

Collaborator Author

yes, the number of non-zeros could be pre-computed. added todos to _equality_connect and _equality_weld.
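One way the pre-computation might look, as a sketch only: the count depends on the kinematic tree and the two bodies, not on the simulation state, so it could be evaluated once per constraint when the model is built. The helper names, the example tree, and the use of -1 for "no dof" are assumptions made for illustration.

```python
def dof_chain(da, dof_parentid):
  """Returns the set of dofs from da up to the root of the kinematic tree."""
  chain = set()
  while da >= 0:
    chain.add(da)
    da = dof_parentid[da]
  return chain


def precompute_rownnz(pairs, dof_parentid):
  """One count per (last dof of body 1, last dof of body 2) pair, computed once."""
  return [len(dof_chain(da1, dof_parentid) | dof_chain(da2, dof_parentid)) for da1, da2 in pairs]


# hypothetical tree with two branches: 0 -> 1 -> 2 and 0 -> 3 -> 4 (-1 marks no parent)
dof_parentid = [-1, 0, 1, 0, 3]
pairs = [(2, 4), (2, 1), (4, -1)]

rownnz = precompute_rownnz(pairs, dof_parentid)
print(rownnz)  # [5, 3, 3]
```

The set union gives the same count as walking both chains at runtime, since a dof shared by both bodies contributes a single column.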

erikfrey (Collaborator) left a comment

Fantastic improvement! Just two nits

da2 = int(body_dofadr[body2] + body_dofnum[body2] - 1)

# count non-zeros
da1_save = da1

Collaborator

nit: just to reduce cognitive overhead a bit, why don't we flip this to avoid mental bookkeeping:

pda1, pda2 = da1, da2

then iterate over pda1 and pda2 for counting

(nit applies here and elsewhere)
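A sketch of the suggested pattern in plain Python. The loop condition and the `max` step are assumptions, since the quoted diff shows only part of the loop body; the walk assumes every parent dof has a smaller index than its child.

```python
def count_nonzeros(da1, da2, dof_parentid):
  """Counts dofs in the union of two ancestor chains with a merge walk."""
  pda1, pda2 = da1, da2  # walk the copies; da1 and da2 stay intact for the fill pass
  rownnz = 0
  while pda1 >= 0 or pda2 >= 0:
    da = max(pda1, pda2)
    if pda1 == da:
      pda1 = dof_parentid[pda1]
    if pda2 == da:
      pda2 = dof_parentid[pda2]
    rownnz += 1
  return rownnz


# hypothetical tree with two branches: 0 -> 1 -> 2 and 0 -> 3 -> 4 (-1 marks no parent)
dof_parentid = [-1, 0, 1, 0, 3]

print(count_nonzeros(2, 4, dof_parentid))   # 5
print(count_nonzeros(2, 1, dof_parentid))   # 3
print(count_nonzeros(2, -1, dof_parentid))  # 3
```

Because the counting pass only touches `pda1` and `pda2`, no save-and-restore of the original indices is needed before the Jacobian entries are written.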

rownnz += 1

# get rowadr
rowadr_base = wp.atomic_add(efc_nnz_out, worldid, 3 * rownnz)

Collaborator

nit: for brevity, maybe just do:

rowadr = wp.atomic_add(efc_nnz_out, worldid, 3 * rownnz)
efc_J_rowadr_out[worldid, efcid + 0] = rowadr
efc_J_rowadr_out[worldid, efcid + 1] = rowadr + rownnz
efc_J_rowadr_out[worldid, efcid + 2] = rowadr + 2 * rownnz

(nit applies here and elsewhere)

Development

Successfully merging this pull request may close these issues.

JacobianType.SPARSE

3 participants