@youjie23
Contributor

@youjie23 youjie23 commented Oct 11, 2025

  • Add alarm recovery detection with a recovery-observation-period (default 0).
  • Store the alarm recovery record with the same UUID as the related alarm record.
  • Notify hooks using a recovery-text-template or recovery-urls; the notification includes the recoveryTime.

Submodule PR:

@wu-sheng wu-sheng added backend OAP backend related. feature New feature labels Oct 11, 2025
@youjie23
Contributor Author

youjie23 commented Oct 11, 2025

Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the AlarmRecoveryRecord class was accidentally dropped, which caused the e2e test failure. @wankai123 @wu-sheng
I will fix it immediately and re-run the tests.

@wu-sheng
Member

Take your time.

Comment on lines 96 to 97
Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration);
AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime);
Member

I have concerns about the way you are doing this. Querying the status one by one while iterating over a list usually results in bad performance.

Member

You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.
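
The suggestion above amounts to replacing one recovery lookup per alarm record with a single batched lookup keyed by UUID. A minimal sketch of that pattern follows. `AlarmRecord`, `RecoveryRecord`, `RecoveryStore`, and `findByUuids` are hypothetical stand-ins for illustration, not the entities or DAO used in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BatchRecoveryLookup {
    // Hypothetical, simplified alarm record.
    static final class AlarmRecord {
        final String uuid;
        final long startTime;
        AlarmRecord(String uuid, long startTime) { this.uuid = uuid; this.startTime = startTime; }
    }

    // Hypothetical recovery record sharing the UUID of its alarm record.
    static final class RecoveryRecord {
        final String uuid;
        final long recoveryTime;
        RecoveryRecord(String uuid, long recoveryTime) { this.uuid = uuid; this.recoveryTime = recoveryTime; }
    }

    // Hypothetical storage abstraction: one call returns all recoveries for the given UUIDs.
    interface RecoveryStore {
        List<RecoveryRecord> findByUuids(List<String> uuids);
    }

    // Returns uuid -> recoveryTime using a single storage call instead of one call per alarm.
    static Map<String, Long> recoveryTimesByUuid(List<AlarmRecord> alarms, RecoveryStore store) {
        List<String> uuids = new ArrayList<>();
        for (AlarmRecord alarm : alarms) {
            uuids.add(alarm.uuid);
        }
        Map<String, Long> result = new HashMap<>();
        if (uuids.isEmpty()) {
            return result; // nothing to look up, skip the storage call
        }
        for (RecoveryRecord recovery : store.findByUuids(uuids)) {
            result.put(recovery.uuid, recovery.recoveryTime);
        }
        return result;
    }
}
```

Alarms that have not recovered are simply absent from the returned map, so the caller can treat a missing key as "no recovery time yet".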

Contributor Author

Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.

Contributor Author

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?

@wu-sheng
Member

Please fix CI.

@youjie23
Contributor Author

> Please fix CI.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion?

@wu-sheng
Member

Are you setting the recovery quickly enough? They ran for over one hour and were cancelled due to the preset timeout.

@youjie23
Contributor Author

youjie23 commented Oct 15, 2025

> Are you setting the recovery quickly enough? They ran for over one hour and were cancelled due to the preset timeout.

It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the test did not continue to execute. One example is [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)](https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs), which took just 10 minutes to detect recovery.
And it’s not just the alarm case that gets stuck. Other verified cases also did not continue to execute, such as E2E test (Log FluentBit ES 8.8.1, test/e2e-v2/cases/log/fluent-bit/e2e.yaml, ES_VERSION=8.8.1).

@wu-sheng
Member

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

@youjie23
Contributor Author

> Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

Thank you for the helpful feedback.
Fixed in skywalking-infra-e2e #133.

@wu-sheng
Member

They were not cancelled this time, but failed.
Please take a look.

@wu-sheng wu-sheng added this to the 10.4.0 milestone Nov 10, 2025
@youjie23
Contributor Author

youjie23 commented Nov 12, 2025

> Let's add the different cases in the UT and check if the alarm window status changes as expected:
>
>   1. silencePeriod and recoveryObservationPeriod are not set.
>   2. Only set silencePeriod.
>   3. Only set recoveryObservationPeriod.
>   4. silencePeriod > recoveryObservationPeriod.
>   5. recoveryObservationPeriod > silencePeriod.
>
> The status changes should include the AlarmStateMachine current status after each match or misMatch.

I have added unit tests to cover all the cases you mentioned. The tests now verify the status changes of the AlarmStateMachine after each match and misMatch.
The changes are in RunningRuleTest. Please review when you have time. Thanks.
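
For readers following the five cases, the sketch below shows one way such a check can be structured: drive a small state machine with match and mismatch events and read the state after each step. Everything here is an assumption made for illustration. The state names, the `AlarmStateSketch` class, and the exact countdown rules are hypothetical and simplified, and they are not the `AlarmStateMachine` or `RunningRuleTest` code from this PR.

```java
final class AlarmStateSketch {
    // Hypothetical states; the real state machine may name and split them differently.
    enum State { NORMAL, FIRING, SILENCED, OBSERVING_RECOVERY, RECOVERED }

    private final int silencePeriod;
    private final int recoveryObservationPeriod;
    private State state = State.NORMAL;
    private int silenceRemaining;
    private int mismatches;

    AlarmStateSketch(int silencePeriod, int recoveryObservationPeriod) {
        this.silencePeriod = silencePeriod;
        this.recoveryObservationPeriod = recoveryObservationPeriod;
    }

    // Expression evaluated true.
    State onMatch() {
        mismatches = 0; // any match restarts the recovery observation
        boolean alreadyAlarming = state == State.FIRING
            || state == State.SILENCED
            || state == State.OBSERVING_RECOVERY;
        if (alreadyAlarming && silenceRemaining > 0) {
            silenceRemaining--; // still inside the silence period
            state = State.SILENCED;
        } else {
            silenceRemaining = silencePeriod; // fire and start a new silence period
            state = State.FIRING;
        }
        return state;
    }

    // Expression evaluated false.
    State onMismatch() {
        switch (state) {
            case NORMAL:
                break;
            case RECOVERED:
                state = State.NORMAL;
                break;
            default:
                // Assumed rule: recover once mismatches exceed the observation period.
                mismatches++;
                state = mismatches > recoveryObservationPeriod
                    ? State.RECOVERED
                    : State.OBSERVING_RECOVERY;
        }
        return state;
    }
}
```

Under these assumed rules, a machine created with `new AlarmStateSketch(2, 1)` goes FIRING, SILENCED, SILENCED, FIRING on four consecutive matches, then OBSERVING_RECOVERY, RECOVERED, NORMAL on three consecutive mismatches. With both periods at 0, every match is FIRING and the first mismatch is RECOVERED.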

@youjie23
Contributor Author

> Also, since #13570 is going to be merged, this new status should be reflected in the query APIs.

Done. Please review when you have time. Thanks.

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 132 out of 132 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/wechat/WechatHookCallbackTest.java:1

  • The test is passing the wrong list to doAlarmRecovery. It should pass alarmRecoveryMessages instead of alarmMessages.

this.alarmRulesWatcher = alarmRulesWatcher;
this.alarmSettingMap = new HashMap<>();
this.alarmServiceStubMap = new HashMap<>();
this.grpcClientMap = new HashMap<>();
Copilot AI Nov 12, 2025

The field alarmSettingMap is not initialized in the constructor before being used. It should be initialized as this.alarmSettingMap = new HashMap<>(); before the conditional block that uses alarmSettingMap.

Suggested change
- this.grpcClientMap = new HashMap<>();
+ this.grpcClientMap = new HashMap<>();
+ this.alarmSettingMap = new HashMap<>();

Comment on lines 499 to 502
if (log.isTraceEnabled()) {
log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
}
Copilot AI Nov 12, 2025

Duplicate if (log.isTraceEnabled()) check on lines 498 and 499. Remove the inner duplicate check.

Suggested change
- if (log.isTraceEnabled()) {
- log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
- ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
- }
+ log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
+ ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);

"ruleName": "service_resp_time_rule",
"alarmMessage": "alarmMessage xxxx",
"startTime": 1560524171000,
"recoveryTime": 15596606810000,
Copilot AI Nov 12, 2025

The example recovery timestamp 15596606810000 appears to be in the future (approximately year 2464). This should be a realistic timestamp that comes after the startTime value of 1560524171000.
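
The year estimate in this comment can be checked by decoding the epoch-millisecond values with `java.time`. The short sketch below is only an illustration: `EpochMillisCheck` and `utcYear` are hypothetical names, and the three values are the ones quoted in this review comment and its suggested change.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class EpochMillisCheck {
    // Returns the UTC calendar year of an epoch-millisecond timestamp.
    static int utcYear(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        long startTime = 1560524171000L;         // startTime from the example
        long recoveryAsWritten = 15596606810000L; // recoveryTime as written in the docs
        long recoverySuggested = 1560524271000L;  // recoveryTime from the suggested change

        System.out.println(Instant.ofEpochMilli(startTime)); // 2019-06-14T14:56:11Z
        System.out.println(utcYear(recoveryAsWritten));       // 2464
        // The suggested value is 100 seconds after startTime.
        System.out.println((recoverySuggested - startTime) / 1000); // 100
    }
}
```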

Suggested change
"recoveryTime": 15596606810000,
"recoveryTime": 1560524271000,

if (log.isTraceEnabled()) {
log.trace("RuleName:{} AlarmEntity {} {} {} expired", ruleName, alarmEntity.getName(),
alarmEntity.getId0(), alarmEntity.getId1());
}
Copilot AI Nov 12, 2025

[nitpick] The expired entities are being logged but then removed from the window. The removal happens after the forEach completes. Consider adding a return statement after logging to skip further processing of expired entities in the same iteration.

Suggested change
- }
+ }
+ return;

action: http
interval: 3s
- times: 10
+ times: -1
Member

Why change so many irrelevant e2e files, and set it to -1?

Contributor Author

@youjie23 youjie23 Nov 13, 2025

The reason for updating these e2e files is that in previous versions, the e2e test did not stop the HTTP action when expected; it continued running until the entire test ended. This issue has been fixed in pull requests #132 and #134. Therefore, to maintain the original behavior, we need to set this value to -1 in the e2e.yaml configuration.

We can explore different ways to implement this if you have suggestions. Thanks. @wankai123

Member

@kezhenxu94 could you clarify here? To keep the default behavior consistent, nothing should need to change, right?

Member

We need to modify the times in all places in this PR, as I suggested in apache/skywalking-infra-e2e#134 (comment).

It was a bug that times didn’t take effect in e2e. In this repo, if setting times to 10 was intended, meaning the trigger should stop after 10 times, then the tests should pass without changes. Otherwise, setting times to 10 was wrong and should be updated to -1, as this PR did.
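
The distinction discussed here can be made concrete with a small model: a bounded `times` stops the trigger after that many runs, while `-1` keeps it running until the test ends. The sketch below only illustrates that interpretation. `TriggerTimes`, `shouldRunAgain`, and `simulate` are hypothetical names, and this is not code from skywalking-infra-e2e.

```java
final class TriggerTimes {
    // Assumed sentinel, matching the -1 used in the e2e.yaml files of this PR.
    static final int UNLIMITED = -1;

    // True if another run is allowed after `completedRuns` runs have finished.
    static boolean shouldRunAgain(int times, int completedRuns) {
        return times == UNLIMITED || completedRuns < times;
    }

    // Counts how many runs happen before either the limit or the end of the test is reached.
    static int simulate(int times, int runsBeforeTestEnds) {
        int runs = 0;
        while (runs < runsBeforeTestEnds && shouldRunAgain(times, runs)) {
            runs++;
        }
        return runs;
    }
}
```

In this model, a test that lasts long enough for 25 trigger intervals sees 10 runs with `times: 10` and 25 runs with `times: -1`, which is why a case that relied on the old never-stopping behavior can start failing once the limit is enforced.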

Member

@youjie23 how many tests failed in this PR if you don’t change the times to -1?

Contributor Author

@youjie23 youjie23 Nov 14, 2025

There are a total of 36 failed GHA cases, but some are related to the ES container startup, so we cannot determine the exact number.
We can roll back this change for all cases except the alarm-related ones. Then, we can rerun the GHA with the latest code, and I will adjust the configurations for the failed cases accordingly.
Does that sound acceptable to everyone? @wu-sheng @kezhenxu94 @wankai123

Member

I am fine with keeping the endless retry, since that is how they have been running.
We could try to change that to limited numbers in another PR.

Contributor Author

> I am fine with keeping the endless retry, since that is how they have been running. We could try to change that to limited numbers in another PR.

Noted. I've updated the references to the UI module in the latest commit. Once this PR is completed, I will create a new PR to follow up on the matter you mentioned. Thanks.

@wankai123
Member

The others LGTM

@wankai123
Member

@youjie23 apache/skywalking-booster-ui#505 the UI has been merged, please sync the UI commit into this PR. Thanks.

@youjie23
Contributor Author

youjie23 commented Nov 16, 2025

> @youjie23 apache/skywalking-booster-ui#505 the UI has been merged, please sync the UI commit into this PR. Thanks.

Updated. Thanks.

* KubernetesCoordinator: make self instance return real pod IP address instead of `127.0.0.1`.
* Enhance the alarm kernel with recovered status notification capability

#### UI
Member

Suggested change
- #### UI
+ #### UI
+ * Fix the missing icon in new native trace view.

According to apache/skywalking-booster-ui@3092725...6eaf7fe, this submodule update includes two commits.


## Alarm state transition
The overall alarm state transition after the introduction of alarm restoration detection and notification since version 10.4.0 is as follows:
```mermaid
Member

Recommended English terms for the triggering events:

  • Expression evaluated true
  • Expression true, but in silence period
  • Expression evaluated false
  • Expression false, in recovery window

Also, as you can see on the preview page, it is hard to tell which transition these two labels belong to.

Image

@wu-sheng wu-sheng requested a review from Copilot November 17, 2025 03:36
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 133 out of 133 changed files in this pull request and generated no new comments.


@wu-sheng
Member

Most others seem good to me. Please fix docs, and we could merge this now.

@youjie23
Contributor Author

> Most others seem good to me. Please fix docs, and we could merge this now.

Updated. Please review when you have time. Thanks.

@wu-sheng wu-sheng requested a review from wankai123 November 17, 2025 06:32
@wankai123 wankai123 merged commit cfbb00d into apache:master Nov 17, 2025
351 of 353 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules.

4 participants