feat:Enhance the alarm kernel with recovered status notification capability for alarm rules #13539
Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the `AlarmRecoveryRecord` class was accidentally missed, which caused the e2e test failure. @wankai123 @wu-sheng |
|
Take your time. |
| Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration); | ||
| AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime); |
I have concerns about the way you are doing this. Querying the status one by one for each item in a list usually results in poor performance.
You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.
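As a rough illustration of the batching suggested above, here is a minimal sketch. It assumes a storage abstraction that can return recovery times for many alarm UUIDs in one call; `RecoveryStore`, `Alarm`, and `attachRecoveryTimes` are hypothetical names, not classes from this PR.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these types are the PR's real classes.
final class RecoveryBatchLookup {
    /** Assumed storage abstraction: one query returns recovery times for many alarms. */
    interface RecoveryStore {
        Map<String, Long> findRecoveryTimes(Collection<String> alarmUuids);
    }

    /** Minimal stand-in for an alarm record, holding only what the sketch needs. */
    static final class Alarm {
        final String uuid;
        final long startTime;
        Long recoveryTime; // stays null while the alarm has not recovered

        Alarm(String uuid, long startTime) {
            this.uuid = uuid;
            this.startTime = startTime;
        }
    }

    /** Collects all UUIDs first, then resolves recovery times with a single store call. */
    static void attachRecoveryTimes(List<Alarm> alarms, RecoveryStore store) {
        if (alarms.isEmpty()) {
            return;
        }
        List<String> uuids = new ArrayList<>(alarms.size());
        for (Alarm alarm : alarms) {
            uuids.add(alarm.uuid);
        }
        Map<String, Long> recovered = store.findRecoveryTimes(uuids); // one round trip
        for (Alarm alarm : alarms) {
            alarm.recoveryTime = recovered.get(alarm.uuid); // null if not recovered
        }
    }
}
```

The point of the shape is that the number of storage queries stays constant as the alarm list grows, instead of one query per alarm record.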
Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.
It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?
|
Please fix CI. |
It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion? |
|
Are you setting the recovery quickly enough? They ran for over one hour and were cancelled due to the preset timeout. |
It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the job did not continue executing. For example, [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)](https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs) took just 10 minutes to detect recovery. |
|
Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change. |
Thank you for the helpful feedback. |
|
They were not cancelled this time, but failed. |
I have added the unit tests to cover all the different cases you mentioned. The tests now verify the status changes of the `AlarmStateMachine` after each match and misMatch. |
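To make the idea of checking the state after each match and mismatch concrete, here is a deliberately simplified sketch. The three states and their transitions are assumptions made for illustration; the PR's real `AlarmStateMachine` may define different states (for example silence or observation periods), and `SimpleAlarmStateMachine` is a hypothetical name.

```java
// Hypothetical, simplified model; not the PR's actual AlarmStateMachine.
final class SimpleAlarmStateMachine {
    enum State { NORMAL, FIRING, RECOVERED }

    private State state = State.NORMAL;

    State current() {
        return state;
    }

    /** The rule condition matched in this evaluation round. */
    State onMatch() {
        state = State.FIRING;
        return state;
    }

    /** The rule condition did not match in this evaluation round. */
    State onMismatch() {
        switch (state) {
            case FIRING:
                state = State.RECOVERED; // first mismatch after firing: recovery is reported
                break;
            case RECOVERED:
                state = State.NORMAL; // recovery already reported, settle back to normal
                break;
            default:
                break; // NORMAL stays NORMAL
        }
        return state;
    }
}
```

A unit test in this style drives a sequence such as match, match, mismatch, mismatch and asserts the returned state after every call, which is the kind of coverage the comment above describes.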
Done. Please review when you have time. Thanks. |
Pull Request Overview
Copilot reviewed 132 out of 132 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/wechat/WechatHookCallbackTest.java:1
- The test is passing the wrong list to `doAlarmRecovery`. It should pass `alarmRecoveryMessages` instead of `alarmMessages`.
| this.alarmRulesWatcher = alarmRulesWatcher; | ||
| this.alarmSettingMap = new HashMap<>(); | ||
| this.alarmServiceStubMap = new HashMap<>(); | ||
| this.grpcClientMap = new HashMap<>(); |
Copilot
AI
Nov 12, 2025
The field alarmSettingMap is not initialized in the constructor before being used. It should be initialized as this.alarmSettingMap = new HashMap<>(); before the conditional block that uses alarmSettingMap.
| this.grpcClientMap = new HashMap<>(); | |
| this.grpcClientMap = new HashMap<>(); | |
| this.alarmSettingMap = new HashMap<>(); |
| if (log.isTraceEnabled()) { | ||
| log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}", | ||
| ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState); | ||
| } |
Copilot
AI
Nov 12, 2025
Duplicate if (log.isTraceEnabled()) check on lines 498 and 499. Remove the inner duplicate check.
| if (log.isTraceEnabled()) { | |
| log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}", | |
| ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState); | |
| } | |
| log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}", | |
| ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState); |
| "ruleName": "service_resp_time_rule", | ||
| "alarmMessage": "alarmMessage xxxx", | ||
| "startTime": 1560524171000, | ||
| "recoveryTime": 15596606810000, |
Copilot
AI
Nov 12, 2025
The example recovery timestamp 15596606810000 appears to be in the future (approximately year 2464). This should be a realistic timestamp that comes after the startTime value of 1560524171000.
| "recoveryTime": 15596606810000, | |
| "recoveryTime": 1560524271000, |
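The two example timestamps can be sanity-checked with `java.time`. This is only a small sketch for verifying the values quoted in the comment above; `EpochMillisCheck` and its methods are hypothetical helper names.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper for checking example timestamps in documentation.
final class EpochMillisCheck {
    /** Returns the UTC calendar year of an epoch-millisecond timestamp. */
    static int utcYear(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    /** Returns the elapsed time between two epoch-millisecond timestamps. */
    static Duration between(long startMillis, long endMillis) {
        return Duration.between(Instant.ofEpochMilli(startMillis), Instant.ofEpochMilli(endMillis));
    }
}
```

With these helpers, `utcYear(1560524171000L)` is 2019 and `utcYear(15596606810000L)` is 2464, while `between(1560524171000L, 1560524271000L)` is 100 seconds, so the suggested value is a plausible recovery time shortly after `startTime`.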
| if (log.isTraceEnabled()) { | ||
| log.trace("RuleName:{} AlarmEntity {} {} {} expired", ruleName, alarmEntity.getName(), | ||
| alarmEntity.getId0(), alarmEntity.getId1()); | ||
| } |
Copilot
AI
Nov 12, 2025
[nitpick] The expired entities are being logged but then removed from the window. The removal happens after the forEach completes. Consider adding a return statement after logging to skip further processing of expired entities in the same iteration.
| } | |
| } | |
| return; |
| action: http | ||
| interval: 3s | ||
| times: 10 | ||
| times: -1 |
Why change so many irrelevant e2e files, and set it to -1?
The reason for updating these e2e files is that in previous versions, the e2e test did not stop the HTTP action when expected. It continued running until the entire test ended. This issue has been fixed in Pull Requests #132 and #134. Therefore, to maintain the original behavior, we need to set this value to -1 in the e2e.yaml configuration.
We can explore different ways to implement this if you have suggestions. Thanks. @wankai123
@kezhenxu94 could you clarify here? For keeping default behaviors consistently, it should mean nothing changed, right?
We need to modify the times in all places in this PR, as I suggested in apache/skywalking-infra-e2e#134 (comment)
It was a bug that times didn’t take effect in e2e. In this repo, if setting times to 10 was intended, meaning the trigger should stop after 10 times, then the tests should pass without changes. Otherwise, setting times to 10 was wrong and should be updated to -1, as this PR did.
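The semantics being discussed can be sketched as a small counter. This is an illustration written in Java for this thread only, not the skywalking-infra-e2e implementation; `TriggerBudget` is a hypothetical name, and treating every negative value (not just -1) as unbounded is an assumption.

```java
// Hypothetical model of the trigger's `times` setting; not the real e2e code.
final class TriggerBudget {
    private final int times; // assumed: negative means no limit, otherwise a maximum count
    private long fired;

    TriggerBudget(int times) {
        this.times = times;
    }

    /** Returns true and counts the call if the trigger may fire once more. */
    boolean tryFire() {
        if (times >= 0 && fired >= times) {
            return false; // budget used up, the trigger stops
        }
        fired++;
        return true;
    }

    long fired() {
        return fired;
    }
}
```

Under this reading, `times: 10` stops after the tenth request, while `times: -1` keeps firing until the test ends, which matches the behavior the cases relied on before the fix.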
@youjie23 how many tests failed in this PR if you don’t change the times to -1?
There are a total of 36 failed GHA cases, but some are related to the ES container startup, so we cannot determine the exact number.
We can roll back this change for all cases except the alarm-related ones. Then, we can rerun the GHA with the latest code, and I will adjust the configurations for the failed cases accordingly.
Does that sound acceptable to everyone? @wu-sheng @kezhenxu94 @wankai123
I am fine with keeping the endless retry, since that is how they behaved before.
We could try to change that to limited numbers in another PR.
I am fine with keeping the endless retry, since that is how they behaved before. We could try to change that to limited numbers in another PR.
Noted. I've updated the references to the UI module in the latest commit. Once this PR is completed, I will create a new PR to follow up on the matter you mentioned. Thanks.
|
The others LGTM |
|
@youjie23 apache/skywalking-booster-ui#505 the UI has been merged, please sync the UI commit into this PR. Thanks. |
Updated. Thanks. |
| * KubernetesCoordinator: make self instance return real pod IP address instead of `127.0.0.1`. | ||
| * Enhance the alarm kernel with recovered status notification capability | ||
|
|
||
| #### UI |
| #### UI | |
| #### UI | |
| * Fix the missing icon in new native trace view. |
According to apache/skywalking-booster-ui@3092725...6eaf7fe, this submodule update includes two commits.
|
|
||
| ## Alarm state transition | ||
| The overall alarm state transition after the introduction of alarm restoration detection and notification since version 10.4.0 is as follows: | ||
| ```mermaid |
Pull Request Overview
Copilot reviewed 133 out of 133 changed files in this pull request and generated no new comments.
|
Most others seem good to me. Please fix docs, and we could merge this now. |
Updated. Please review when you have time. Thanks. |

Submodule PR:
skywalking-booster-ui#505
skywalking-query-protocol#153
If this is non-trivial feature, paste the links/URLs to the design doc.
Update the documentation to include this new feature.
Tests(including UT, IT, E2E) are added to verify the new feature.
If it's UI related, attach the screenshots below.
If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes [Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules. #13492.
Update the `CHANGES` log.