Fix validation: use relative error tolerance instead of absolute#154
Open
dbsanfte wants to merge 1 commit intoeth-cscs:masterfrom
Open
Fix validation: use relative error tolerance instead of absolute#154dbsanfte wants to merge 1 commit intoeth-cscs:masterfrom
dbsanfte wants to merge 1 commit intoeth-cscs:masterfrom
Conversation
Member
|
cscs-ci run GH200 |
Member
|
Thanks for the PR @dbsanfte! I agree it's the correct way to use the relative tolerance and adjust for fp32. The PR contains additional commits, e.g. adding a mutex for |
Contributor
Author
|
Oh those did slip in by accident, sorry. They're not strictly required for this pr. |
The validation was using absolute error tolerance (1e-8) which fails for large matrix multiplication results (magnitude ~1e4). This caused false negatives where COSMA computed correct results but failed validation. Changes: - Switch from absolute error to relative error for validation - Use 1e-5 tolerance for float32 (appropriate for single precision) - Use 1e-8 tolerance for float64 (appropriate for double precision) - Handle small values near zero with absolute error fallback This fixes issue eth-cscs#153 where K-split strategy was incorrectly reported as producing 93.6% errors when actual relative errors were < 1e-6. Tested with: - 32x896x896 float32: now passes (was 93.8% false errors) - 32x10000x896 float32: now passes (was 93.6% false errors) - 32x32x32 float64: still passes (regression test)
13ed177 to
ac569da
Compare
Contributor
Author
|
Fixed to only include the relevant change. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The validation code in
cosma_utils.hppwas using absolute error tolerance (< 1e-8) to validate matrix multiplication results. This causes false negatives for large matrix multiplications where result values have magnitude ~10^4 or greater.For example, with 32×896×896 float32 matrices:
This is the root cause of issue #153 which appeared to be a K-split correctness bug.
Solution
Switch from absolute error to relative error validation:
Key improvements:
1e-5(accounts for ~7 digits of precision)1e-8(accounts for ~15 digits of precision)Testing
Verified fix resolves the false negatives:
Additional validation:
Impact
This fixes validation for:
The actual COSMA algorithm was computing correct results all along - only the validation was broken.
Related Issues
Closes #153