
@7radians

Hi all,

Not sure if others have run into this issue on Archer2 or elsewhere, but in case this fix is useful:

Context

My collaborator's ASE MD runs with MACE 0.3.13 / 0.3.14 failed when using a ROCm PyTorch build on Archer2. TorchScript enforces strict schema checks, so the scripted model rejected unknown kwargs and omitted optional outputs, causing runtime errors and deadlocks.

This PR

Improves robustness of the ASE MACE calculator to handle these scenarios:

  1. Dynamic kwarg gating
    Inspect model.forward at runtime and pass compute_edge_forces / compute_atomic_stresses only if the forward actually accepts them, eliminating unknown-kwarg errors with TorchScripted models on ROCm builds.

  2. Safe output access
    Replace out["..."] with out.get("...") plus None checks for atomic_stresses and atomic_virials, preventing KeyErrors or hangs when those keys are absent.

  3. Empty-list stacking guard
    Before aggregating per-model tensors with torch.stack(), verify that the corresponding list is non-empty, avoiding deadlocks. A minimal sketch of all three guards follows this list.
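
For illustration, here is a minimal, self-contained sketch of all three guards. This is not the PR's actual code: ToyModel, batch, and per_model_stresses are placeholders, and the real changes live in mace/calculators/mace.py.

```python
import inspect
from typing import Dict

import torch


class ToyModel(torch.nn.Module):
    """Stand-in for a MACE model whose forward lacks the newer flags."""

    def forward(
        self, data: Dict[str, torch.Tensor], compute_stress: bool = False
    ) -> Dict[str, torch.Tensor]:
        # Note: no "atomic_stresses" key in the output dict.
        return {"energy": data["positions"].sum()}


def forward_kwarg_names(model) -> set:
    """Names of the arguments model.forward accepts. TorchScripted models
    expose a compiled schema instead of a Python signature."""
    if isinstance(model, torch.jit.ScriptModule):
        return {arg.name for arg in model.forward.schema.arguments}
    return set(inspect.signature(model.forward).parameters)


model = torch.jit.script(ToyModel())
batch = {"positions": torch.ones(4, 3)}

# 1) Dynamic kwarg gating: pass the newer flags only if this model's
#    forward declares them (a model compiled with an older MACE won't).
kwargs = {}
allowed = forward_kwarg_names(model)
for flag in ("compute_edge_forces", "compute_atomic_stresses"):
    if flag in allowed:
        kwargs[flag] = True
out = model(batch, **kwargs)

# 2) Safe output access: .get() returns None for absent keys
#    instead of raising KeyError.
atomic_stresses = out.get("atomic_stresses")
if atomic_stresses is not None:
    print("per-atom stresses available")

# 3) Empty-list stacking guard: torch.stack([]) raises
#    "stack expects a non-empty TensorList", so check first.
per_model_stresses = []
if per_model_stresses:
    stacked = torch.stack(per_model_stresses)
```

The same helper falls back to inspect.signature for eager models, so compiled and uncompiled models can be called through the same path.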


All changes are in mace/calculators/mace.py.

Tested on CUDA, ROCm, and CPUs on the following machines:

  • Archer2
  • LUMI
  • Kelvin2

The overheads should be (and appear to be) negligible.

@ilyes319 (Contributor) commented Jul 2, 2025

Hey @7radians, thank you for that; this seems very weird indeed. Can you tell me what error appeared?

@7radians (Author) commented Jul 2, 2025

@ilyes319 here are the errors my collaborator hit:
1)
RuntimeError: Unknown keyword argument 'compute_edge_forces' for operator 'forward'. Schema: forward(torch.mace.modules.models.___torch_mangle_161.ScaleShiftMACE self, Dict(str, Tensor) data, bool training=False, bool compute_force=True, bool compute_virials=False, bool compute_stress=False, bool compute_displacement=False, bool compute_hessian=False) -> Dict(str, Tensor?)
2)
File ".../mace/calculators/mace.py", line 360, in calculate
    if out["atomic_stresses"] is not None:
KeyError: 'atomic_stresses'
3) Once the above two were fixed, there was a deadlock, which I traced to empty lists being passed to torch.stack().
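
For context, the first failure is straightforward to reproduce with any TorchScripted module, since a scripted forward validates keyword arguments against its compiled schema. A toy example (not MACE code):

```python
import torch


class Toy(torch.nn.Module):
    def forward(self, x: torch.Tensor, compute_force: bool = True) -> torch.Tensor:
        return x * 2.0 if compute_force else x


scripted = torch.jit.script(Toy())
scripted(torch.ones(3), compute_force=False)  # fine: kwarg is in the schema
try:
    scripted(torch.ones(3), compute_edge_forces=True)  # unknown to the schema
except RuntimeError as err:
    print(err)  # "Unknown keyword argument 'compute_edge_forces' ..."
```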

@ilyes319 (Contributor) commented Jul 2, 2025

Mmm, were you using a model that was compiled beforehand on an older version of mace?

@7radians (Author) commented Jul 2, 2025

The model was compiled with mace 0.3.13 and failed with both 0.3.13 and 0.3.14, giving the same errors.

Commit: Develop patch
@7radians (Author) commented Jul 4, 2025

Another commit tagged along from main and caused errors with the heads; here is the clean fix based on the develop branch. Hopefully that's less hassle, @ilyes319.
