Fix validation: use relative error tolerance instead of absolute by dbsanfte · Pull Request #154 · eth-cscs/COSMA

dbsanfte · 2025-10-07T10:11:04Z

Problem

The validation code in cosma_utils.hpp was using absolute error tolerance (< 1e-8) to validate matrix multiplication results. This causes false negatives for large matrix multiplications where result values have magnitude ~10^4 or greater.

For example, with 32×896×896 float32 matrices:

- Result values: ~27,000 magnitude
- Actual errors: ~0.02 (relative error ~7e-7, well within float32 precision)
- Absolute tolerance: 1e-8
- Result: 93.8% of elements reported as "errors", although the computation was actually correct

This is the root cause of issue #153, which appeared to be a K-split correctness bug.
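To see why a 1e-8 absolute tolerance cannot work at this magnitude, it helps to look at the spacing of float32 values near 27,000. The sketch below is illustrative only: `ulp_at` and `abs_tolerance_unreachable` are hypothetical helper names, not part of COSMA. Adjacent floats near 27,000 are 2^-9 ≈ 0.00195 apart, so the ~0.02 error quoted above is roughly ten rounding steps, while 1e-8 is far smaller than a single step.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not COSMA code): distance from x to the next
// representable float above it, i.e. one unit in the last place (ULP).
inline float ulp_at(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// True if the absolute tolerance is finer than the float32 grid at x.
// In that case two distinct neighbouring floats near x can never pass
// an absolute-error check with this tolerance.
inline bool abs_tolerance_unreachable(float x, double abs_tol) {
    return static_cast<double>(ulp_at(x)) > abs_tol;
}
```

For instance, `ulp_at(27000.0f)` evaluates to 0.001953125, so `abs_tolerance_unreachable(27000.0f, 1e-8)` is true.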

Solution

Switch from absolute error to relative error validation:

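// Before: absolute error against a fixed epsilon
isOK = isOK && (std::abs(globC[i] - globCcheck[i]) < epsilon);

// After: error relative to the larger of the two magnitudes
double abs_error = std::abs(globC[i] - globCcheck[i]);
double scale = std::max(std::abs(globC[i]), std::abs(globCcheck[i]));
// Near zero, fall back to the absolute error to avoid dividing by ~0
double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
// 4-byte scalars (float32) get a looser tolerance than epsilon (1e-8)
double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
isOK = isOK && (rel_error < tolerance);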

Key improvements:

- Use relative error for numerical values with magnitude > 1e-10
- Use appropriate tolerances for data type:
  - Float32: 1e-5 (accounts for ~7 digits of precision)
  - Float64: 1e-8 (accounts for ~15 digits of precision)
- Fall back to absolute error for values near zero
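The three points above can be sketched as a standalone comparison function. This is a minimal illustration for real scalars only, not the code in cosma_utils.hpp: the names `default_tolerance` and `nearly_equal` and the `zero_threshold` parameter are assumptions introduced here. It selects the tolerance with `std::is_same` rather than `sizeof`, since `sizeof(std::complex<float>)` is 8 and a size-based check would hand single-precision complex values the double-precision tolerance if the validation is ever instantiated for them.

```cpp
#include <algorithm>
#include <cmath>
#include <type_traits>

// Hypothetical helper (not COSMA code): tolerance chosen from the scalar type.
// 1e-5 for float32, 1e-8 otherwise, matching the values in the list above.
template <typename Scalar>
constexpr double default_tolerance() {
    return std::is_same<Scalar, float>::value ? 1e-5 : 1e-8;
}

// Relative comparison with an absolute fallback near zero.
// Assumes Scalar is float or double; complex types are out of scope here.
template <typename Scalar>
bool nearly_equal(Scalar computed, Scalar expected,
                  double tolerance = default_tolerance<Scalar>(),
                  double zero_threshold = 1e-10) {
    const double c = static_cast<double>(computed);
    const double e = static_cast<double>(expected);
    const double abs_error = std::abs(c - e);
    const double scale = std::max(std::abs(c), std::abs(e));
    // Dividing by a scale close to zero would blow up, so use abs_error there.
    const double rel_error = (scale > zero_threshold) ? abs_error / scale : abs_error;
    return rel_error < tolerance;
}
```

With these definitions, `nearly_equal(27000.0f, 27000.02f)` is true (relative error around 7e-7), whereas the same pair fails an absolute check against 1e-8.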

Testing

Verified fix resolves the false negatives:

# Before fix: 93.8% of elements flagged as errors (spurious failure)
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float
# Result is NOT OK
# Total errors: 26912 out of 28672 elements (93.8616%)

# After fix: PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float  
# Result is OK
# Result is CORRECT!

Additional validation:

✅ 32×10000×896 float32: now passes (was 93.6% false errors)
✅ 32×896×896 float64: passes with stricter 1e-8 tolerance
✅ 32×32×32 float64: regression test still passes
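The "Total errors" line in the output above is a count of mismatching elements over the total (26912 / 28672 ≈ 93.86%, where 28672 = 32 × 896 elements of C). As a rough sketch of how such a tally behaves under the two criteria, the snippet below counts mismatches with a pluggable predicate. `MismatchReport`, `tally`, `abs_close` and `rel_close` are hypothetical names for illustration and do not reflect how cosma_miniapp is implemented.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical report type (not COSMA code).
struct MismatchReport {
    std::size_t mismatches;
    std::size_t total;
    double percent() const {
        return total == 0 ? 0.0
                          : 100.0 * static_cast<double>(mismatches) / static_cast<double>(total);
    }
};

// Counts elements for which close_enough(computed[i], reference[i]) is false.
template <typename Pred>
MismatchReport tally(const std::vector<float>& computed,
                     const std::vector<float>& reference,
                     Pred close_enough) {
    const std::size_t n = std::min(computed.size(), reference.size());
    std::size_t bad = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!close_enough(computed[i], reference[i])) ++bad;
    }
    return {bad, n};
}

// Old criterion: absolute error below 1e-8.
inline bool abs_close(float a, float b) {
    return std::abs(static_cast<double>(a) - static_cast<double>(b)) < 1e-8;
}

// New criterion for float32: relative error below 1e-5, absolute near zero.
inline bool rel_close(float a, float b) {
    const double x = static_cast<double>(a), y = static_cast<double>(b);
    const double abs_error = std::abs(x - y);
    const double scale = std::max(std::abs(x), std::abs(y));
    return ((scale > 1e-10) ? abs_error / scale : abs_error) < 1e-5;
}
```

Counting every mismatch, rather than stopping at the first one, is what makes a figure such as 93.86% visible; a value that high on an otherwise plausible result is a hint that the criterion, not the computation, is at fault.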

Impact

This fixes validation for:

- Large matrix dimensions (where results have large magnitude)
- Float32 precision (which was essentially unusable before)
- K-split and other distributed strategies (which were flagged incorrectly)

The actual COSMA algorithm was computing correct results all along - only the validation was broken.

Related Issues

Closes #153

simonpintarelli · 2025-10-07T10:59:11Z

cscs-ci run GH200

simonpintarelli · 2025-10-16T12:16:51Z

Thanks for the PR @dbsanfte!

I agree that using a relative tolerance and adjusting it for fp32 is the correct approach.

The PR contains additional commits, e.g. adding a mutex for global_coords in the Mapper class. It's not clear to me why these are required. As far as I can judge (I'm not very familiar with this code), they are not needed for the standard API COSMA provides. If these commits didn't slip in by accident, could you please open a separate PR for them?

dbsanfte · 2025-10-16T12:18:17Z

Oh, those did slip in by accident, sorry. They're not strictly required for this PR.

The validation was using absolute error tolerance (1e-8) which fails for large matrix multiplication results (magnitude ~1e4). This caused false negatives where COSMA computed correct results but failed validation.

Changes:
- Switch from absolute error to relative error for validation
- Use 1e-5 tolerance for float32 (appropriate for single precision)
- Use 1e-8 tolerance for float64 (appropriate for double precision)
- Handle small values near zero with absolute error fallback

This fixes issue eth-cscs#153 where K-split strategy was incorrectly reported as producing 93.6% errors when actual relative errors were < 1e-6.

Tested with:
- 32x896x896 float32: now passes (was 93.8% false errors)
- 32x10000x896 float32: now passes (was 93.6% false errors)
- 32x32x32 float64: still passes (regression test)

dbsanfte · 2025-10-19T11:34:17Z

Fixed to only include the relevant change.

dbsanfte mentioned this pull request Oct 7, 2025

Validation tolerance for mathematical correctness should use relative error in most circumstances #153

Open

dbsanfte force-pushed the fix/k-split-coordinate-mapping branch from 13ed177 to ac569da on October 19, 2025 10:33

simonpintarelli mentioned this pull request Oct 20, 2025

Use relative tolerance for tests #156

Merged

dbsanfte wants to merge 1 commit into eth-cscs:master from dbsanfte:fix/k-split-coordinate-mapping
