feat:Enhance the alarm kernel with recovered status notification capability for alarm rules #13539

youjie23 · 2025-10-11T10:03:50Z

Add alarm recovery detection with a configurable recovery-observation-period (default 0).
Store the alarm recovery record with the same UUID as the related alarm record.
Notify hooks using a recovery-text-template or recovery-urls; the notification includes the recoveryTime.

Submodule PR:

skywalking-booster-ui#505
skywalking-query-protocol#153
If this is non-trivial feature, paste the links/URLs to the design doc.
Update the documentation to include this new feature.
Tests(including UT, IT, E2E) are added to verify the new feature.
If it's UI related, attach the screenshots below.
If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes [Feature] Enhance the alarm kernel with recovered status notification capability for alarm rules. #13492.
Update the CHANGES log.

youjie23 · 2025-10-11T10:43:09Z

Apologies for the oversight. While merging the latest master code, the @BanyanDB.Group annotation in the AlarmRecoveryRecord class was accidentally dropped, which caused the e2e test failure. @wankai123 @wu-sheng
I will fix it immediately and re-run the tests.

wu-sheng · 2025-10-11T10:44:01Z

Take your time.

wu-sheng · 2025-10-11T13:50:46Z

...a/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBAlarmQueryDAO.java

+            Long recoveryTime = getAlarmRecoveryTime(alarmRecord.getUuid(), duration);
+            AlarmMessage alarmMessage = buildAlarmMessage(alarmRecord, recoveryTime);


I have concerns about the way you are doing this. Querying the status one by one for each item in a list usually results in bad performance.

You should at least get the alarm list first. Then use the UUID list to retrieve the recovery list.
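The two-step approach suggested above (fetch the alarm list, then resolve all recovery times with a single lookup keyed by UUID) can be sketched as follows. This is only an illustration under stated assumptions: `Alarm`, `AlarmView`, `attachRecoveryTimes`, and the `recoveryQuery` function are hypothetical names and are not the classes or DAO methods used in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class BatchRecoveryLookup {
    /** Hypothetical, simplified alarm record (not the real AlarmRecord). */
    static final class Alarm {
        final String uuid;
        final long startTime;

        Alarm(String uuid, long startTime) {
            this.uuid = uuid;
            this.startTime = startTime;
        }
    }

    /** Hypothetical result row: an alarm plus its recovery time, or null if not recovered. */
    static final class AlarmView {
        final String uuid;
        final long startTime;
        final Long recoveryTime;

        AlarmView(String uuid, long startTime, Long recoveryTime) {
            this.uuid = uuid;
            this.startTime = startTime;
            this.recoveryTime = recoveryTime;
        }
    }

    /**
     * Collects every alarm UUID first, then calls the (assumed) batch query once,
     * instead of issuing one recovery query per alarm.
     */
    static List<AlarmView> attachRecoveryTimes(
            List<Alarm> alarms,
            Function<Collection<String>, Map<String, Long>> recoveryQuery) {
        List<String> uuids = new ArrayList<>(alarms.size());
        for (Alarm alarm : alarms) {
            uuids.add(alarm.uuid);
        }
        Map<String, Long> recoveryByUuid = new HashMap<>();
        if (!uuids.isEmpty()) {
            recoveryByUuid.putAll(recoveryQuery.apply(uuids));
        }
        List<AlarmView> views = new ArrayList<>(alarms.size());
        for (Alarm alarm : alarms) {
            views.add(new AlarmView(alarm.uuid, alarm.startTime, recoveryByUuid.get(alarm.uuid)));
        }
        return views;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the recovery storage; only alarm "a" has recovered.
        Map<String, Long> store = new HashMap<>();
        store.put("a", 150L);
        List<Alarm> alarms = Arrays.asList(new Alarm("a", 100L), new Alarm("b", 200L));
        List<AlarmView> views = attachRecoveryTimes(alarms, uuids -> {
            Map<String, Long> found = new HashMap<>();
            for (String uuid : uuids) {
                if (store.containsKey(uuid)) {
                    found.put(uuid, store.get(uuid));
                }
            }
            return found;
        });
        for (AlarmView view : views) {
            System.out.println(view.uuid + " recoveryTime=" + view.recoveryTime);
        }
    }
}
```

With this shape, the number of storage round trips stays at two regardless of how many alarms are on the page.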

Thank you for the helpful feedback. I've pushed new commits to address the points you raised. Please take another look when you have a moment, and let me know if anything else needs adjustment.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I'm not entirely sure if this is an issue on my end. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Is there anything I need to do on my side to allow them to run to completion?

wu-sheng · 2025-10-15T06:57:57Z

Please fix CI.

youjie23 · 2025-10-15T07:21:39Z

Please fix CI.

It appears that the e2e test job on the GitHub Actions workflow was blocked and then got canceled. I've sampled all the alarm e2e tests and some other tests that were not marked as completed; they all seemed to have passed verification. Could you please spare a moment to guide me on what I need to do to get them to run to completion?

wu-sheng · 2025-10-15T10:42:08Z

Are you setting the recovery quickly enough? They have been running for over one hour and are cancelled due to the preset timeout.

youjie23 · 2025-10-15T10:52:57Z

Are you setting the recovery quickly enough? They have been running for over one hour and are cancelled due to the preset timeout.

It seems unrelated to the test cases. I observed that some test cases had been verified successfully before the 18-minute mark, but the job did not continue executing, for example [E2E test (Alarm ES, test/e2e-v2/cases/alarm/es/e2e.yaml)](https://github.com/apache/skywalking/actions/runs/18516094658/job/52781047577#logs), which took just 10 minutes to detect recovery.
And it is not just the alarm case that gets stuck. Other verified cases also did not continue executing, for example E2E test (Log FluentBit ES 8.8.1, test/e2e-v2/cases/log/fluent-bit/e2e.yaml, ES_VERSION=8.8.1).

wu-sheng · 2025-10-15T13:19:02Z

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

youjie23 · 2025-10-18T16:37:59Z

Another PR just passed all the tests and merged. I assume if there is anything wrong, it is in this change.

Thank you for the helpful feedback.
Fixed in the skywalking-infra-e2e #133

wu-sheng · 2025-10-19T05:54:03Z

They are not cancelled this time, but failed.
Please take a look.

youjie23 · 2025-11-12T10:16:05Z

Let's add the different cases in the UT and check if the alarm window status changes as expected:

silencePeriod and recoveryObservationPeriod are not set.

Only set silencePeriod.

Only set recoveryObservationPeriod.

silencePeriod > recoveryObservationPeriod.

recoveryObservationPeriod > silencePeriod.

The status changes should include the AlarmStateMachine current status after each match or misMatch

I have added unit tests to cover all the different cases you mentioned. The tests now verify the status changes of the AlarmStateMachine after each match and misMatch.
The changes are in RunningRuleTest. Please review when you have time. Thanks.
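To make the interplay of the two periods in the cases above easier to follow, here is a minimal, hypothetical state machine sketch. It is not the `AlarmStateMachine` from this PR: the state names, the `Event` enum, and the counting rules (periods measured in consecutive checks, silence countdown ticking only on matches) are all assumptions chosen for illustration.

```java
public class AlarmStateSketch {
    /** Hypothetical states; the real AlarmStateMachine may use different ones. */
    enum State { NORMAL, FIRING, OBSERVING_RECOVERY }

    /** What a single check would notify, if anything. */
    enum Event { NONE, ALARM, RECOVERY }

    private final int silencePeriod;
    private final int recoveryObservationPeriod;
    private State state = State.NORMAL;
    private int silenceCountdown;
    private int recoveryCountdown;

    AlarmStateSketch(int silencePeriod, int recoveryObservationPeriod) {
        this.silencePeriod = silencePeriod;
        this.recoveryObservationPeriod = recoveryObservationPeriod;
    }

    State state() {
        return state;
    }

    /** The rule expression evaluated to true for this check. */
    Event onMatch() {
        if (state == State.NORMAL) {
            state = State.FIRING;
            silenceCountdown = silencePeriod;
            return Event.ALARM;
        }
        // A match while observing recovery cancels the pending recovery.
        state = State.FIRING;
        if (silenceCountdown > 0) {
            silenceCountdown--; // still silenced: no repeated notification
            return Event.NONE;
        }
        silenceCountdown = silencePeriod;
        return Event.ALARM;
    }

    /** The rule expression evaluated to false for this check. */
    Event onMismatch() {
        if (state == State.NORMAL) {
            return Event.NONE;
        }
        if (state == State.FIRING) {
            state = State.OBSERVING_RECOVERY;
            recoveryCountdown = recoveryObservationPeriod;
        } else {
            recoveryCountdown--;
        }
        if (recoveryCountdown <= 0) {
            state = State.NORMAL; // observation window elapsed: report recovery
            return Event.RECOVERY;
        }
        return Event.NONE;
    }

    public static void main(String[] args) {
        AlarmStateSketch machine = new AlarmStateSketch(0, 2);
        System.out.println(machine.onMatch() + " -> " + machine.state());
        System.out.println(machine.onMismatch() + " -> " + machine.state());
        System.out.println(machine.onMismatch() + " -> " + machine.state());
        System.out.println(machine.onMismatch() + " -> " + machine.state());
    }
}
```

In this sketch, a recovery observation period of 0 reports recovery on the first mismatch, which mirrors the default of 0 described in the PR summary.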

youjie23 · 2025-11-12T10:17:24Z

Also, with #13570 about to be merged, this new status should be reflected in the query APIs.

Done. Please review when you have time. Thanks.

Copilot

Pull Request Overview

Copilot reviewed 132 out of 132 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

oap-server/server-alarm-plugin/src/test/java/org/apache/skywalking/oap/server/core/alarm/provider/wechat/WechatHookCallbackTest.java:1

The test is passing the wrong list to doAlarmRecovery. It should pass alarmRecoveryMessages instead of alarmMessages.

Copilot · 2025-11-12T13:09:08Z

...in/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/grpc/GRPCCallback.java

        this.alarmRulesWatcher = alarmRulesWatcher;
-        this.alarmSettingMap = new HashMap<>();
        this.alarmServiceStubMap = new HashMap<>();
        this.grpcClientMap = new HashMap<>();


The field alarmSettingMap is not initialized in the constructor before being used. It should be initialized as this.alarmSettingMap = new HashMap<>(); before the conditional block that uses alarmSettingMap.

Suggested change

-        this.grpcClientMap = new HashMap<>();
+        this.grpcClientMap = new HashMap<>();
+        this.alarmSettingMap = new HashMap<>();

Copilot · 2025-11-12T13:09:09Z

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

+                    if (log.isTraceEnabled()) {
+                        log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
+                                ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
+                    }


Duplicate if (log.isTraceEnabled()) check on lines 498 and 499. Remove the inner duplicate check.

Suggested change

-                    if (log.isTraceEnabled()) {
-                        log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
-                                ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);
-                    }
+                    log.trace("RuleName:{} AlarmEntity {} {} {} onMatch silenceCountdown:{} currentState:{}",
+                            ruleName, entity.getName(), entity.getId0(), entity.getId1(), silenceCountdown, currentState);

Copilot · 2025-11-12T13:09:09Z

docs/en/setup/backend/backend-alarm.md

+  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000,
+  "recoveryTime": 15596606810000, 


The example recovery timestamp 15596606810000 appears to be in the future (approximately year 2464). This should be a realistic timestamp that comes after the startTime value of 1560524171000.
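The year estimate in this comment can be checked by converting the epoch-millisecond values to UTC dates. The following is a small standalone sketch; the class name `RecoveryTimestampCheck` and the helper `isPlausibleRecovery` are made up for illustration and are not part of the PR.

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class RecoveryTimestampCheck {
    /** Returns the UTC calendar year of an epoch-millisecond timestamp. */
    static int yearOf(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getYear();
    }

    /** A recovery time is plausible if it is not before the start time and not after "now". */
    static boolean isPlausibleRecovery(long startMillis, long recoveryMillis, long nowMillis) {
        return recoveryMillis >= startMillis && recoveryMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long startTime = 1560524171000L;         // startTime from the doc example
        long originalRecovery = 15596606810000L; // value flagged in the review
        long suggestedRecovery = 1560524271000L; // value proposed in the suggestion
        long now = System.currentTimeMillis();

        System.out.println("startTime year: " + yearOf(startTime));
        System.out.println("original recoveryTime year: " + yearOf(originalRecovery));
        System.out.println("original plausible: " + isPlausibleRecovery(startTime, originalRecovery, now));
        System.out.println("suggested plausible: " + isPlausibleRecovery(startTime, suggestedRecovery, now));
    }
}
```

The original value has one digit more than the start time, which is what pushes it several centuries into the future.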

Suggested change

-  "recoveryTime": 15596606810000,
+  "recoveryTime": 1560524271000,

Copilot · 2025-11-12T13:09:10Z

...m-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java

+                if (log.isTraceEnabled()) {
+                    log.trace("RuleName:{} AlarmEntity {} {} {} expired", ruleName, alarmEntity.getName(),
+                            alarmEntity.getId0(), alarmEntity.getId1());
+                }


[nitpick] The expired entities are being logged but then removed from the window. The removal happens after the forEach completes. Consider adding a return statement after logging to skip further processing of expired entities in the same iteration.

Suggested change

-                }
+                }
+                return;

wankai123 · 2025-11-13T00:43:23Z

test/e2e-v2/cases/apisix/otel-collector/e2e.yaml

  action: http
  interval: 3s
-  times: 10
+  times: -1


Why change so many irrelevant e2e files, and set it to -1?

The reason for updating these e2e files is that in previous versions, the e2e test did not stop the HTTP action when expected; it continued running until the entire test ended. This issue has been fixed in Pull Requests #132 and #134. Therefore, to maintain the original behavior, we need to set this value to -1 in the e2e.yaml configuration.

We can explore different ways to implement this if you have suggestions. Thanks. @wankai123

@kezhenxu94 could you clarify here? For keeping default behaviors consistently, it should mean nothing changed, right?

We need to modify the times in all places in this PR, as I suggested in apache/skywalking-infra-e2e#134 (comment).

It was a bug that times didn't take effect in e2e. In this repo, if setting times to 10 was intended, meaning the trigger should stop after 10 times, then the tests should pass without changes; otherwise, setting times to 10 was wrong and should be updated to -1, as this PR did.

@youjie23 how many tests failed in this PR if you don't change the times to -1?

There are a total of 36 failed GHA cases, but some are related to the ES container startup, so we cannot determine the exact number.
We can roll back this change for all cases except the alarm-related ones. Then, we can rerun the GHA with the latest code, and I will adjust the configurations for the failed cases accordingly.
Does that sound acceptable to everyone? @wu-sheng @kezhenxu94 @wankai123

I am fine with keeping the endless retries as they were.
We could try to change them to limited numbers in another PR.

I am fine with keeping the endless retries as they were. We could try to change them to limited numbers in another PR.

Noted. I've updated the references to the UI module in the latest commit. Once this PR is completed, I will create a new PR to follow up on the matter you mentioned. Thanks.

wankai123 · 2025-11-13T01:21:17Z

The others LGTM

wankai123 · 2025-11-14T02:21:52Z

@youjie23 apache/skywalking-booster-ui#505 the UI has been merged, please sync the UI commit into this PR. Thanks.

youjie23 · 2025-11-16T16:42:56Z

@youjie23 apache/skywalking-booster-ui#505 the UI has been merged, please sync the UI commit into this PR. Thanks.

Updated. Thanks.

wu-sheng · 2025-11-16T23:41:36Z

docs/en/changes/changes.md

 * KubernetesCoordinator: make self instance return real pod IP address instead of `127.0.0.1`.
+* Enhance the alarm kernel with recovered status notification capability

 #### UI


Suggested change

- #### UI
+ #### UI
+* Fix the missing icon in new native trace view.

According to apache/skywalking-booster-ui@3092725...6eaf7fe, this submodule update includes two commits.

wu-sheng · 2025-11-16T23:49:16Z

docs/en/setup/backend/backend-alarm.md

+
+## Alarm state transition
+The overall alarm state transition after the introduction of alarm restoration detection and notification since version 10.4.0 is as follows:
+```mermaid


Recommended English terms for the triggering events:

Expression evaluated true

Expression true, but in silence period

Expression evaluated false

Expression false, in recovery window

Also, on the preview page, it is hard to tell which transition these two labels belong to.

Copilot

Pull Request Overview

Copilot reviewed 133 out of 133 changed files in this pull request and generated no new comments.

wu-sheng · 2025-11-17T06:06:16Z

Most others seem good to me. Please fix docs, and we could merge this now.

youjie23 · 2025-11-17T06:20:10Z

Most others seem good to me. Please fix docs, and we could merge this now.

Updated. Please review when you have time. Thanks.

youjie23 added 6 commits October 11, 2025 11:06

enhance the alarm kernel with recovered status notification capability …

7481ca8

…apache#13492

enhance the alarm kernel with recovered status notification capability …

4b54c18

…apache#13492

enhance the alarm kernel with recovered status notification capability …

638668f

…apache#13492

enhance the alarm kernel with recovered status notification capability …

0acfbe5

…apache#13492

enhance the alarm kernel with recovered status notification capability …

92cfeed

…apache#13492

enhance the alarm kernel with recovered status notification capability …

edc2722

…apache#13492

wu-sheng requested review from wankai123 and wu-sheng October 11, 2025 10:10

wu-sheng added the backend (OAP backend related) and feature (New feature) labels Oct 11, 2025

youjie23 added 2 commits October 11, 2025 20:10

enhance the alarm kernel with recovered status notification capability …

a7edf5c

…apache#13492

enhance the alarm kernel with recovered status notification capability …

f140f6e

…apache#13492

wu-sheng reviewed Oct 11, 2025

youjie23 added 2 commits October 15, 2025 10:39

enhance the alarm kernel with recovered status notification capability …

cf0570b

…apache#13492

Merge branch 'master' into master

a53f9c2

wu-sheng and others added 4 commits October 15, 2025 21:19

Merge branch 'master' into master

d4ad7c0

enhance the alarm kernel with recovered status notification capability …

5829a48

…apache#13492

Merge branch 'master' of github.com:youjie23/skywalking

9b10401

Merge branch 'master' into master

602262d

youjie23 mentioned this pull request Oct 23, 2025

feat: allow times to be <= 0 to simulate endless trigger apache/skywalking-infra-e2e#134

Merged

youjie23 closed this Oct 25, 2025

wu-sheng added this to the 10.4.0 milestone Nov 10, 2025

youjie23 added 2 commits November 12, 2025 18:06

enhance the alarm kernel with recovered status notification capability …

ca113a5

…apache#13492

Merge branch 'master' into master

f65414b

wu-sheng requested review from Copilot and wankai123 November 12, 2025 13:02

Copilot AI reviewed Nov 12, 2025

fix Copilot review and CI fail

4c1e2c6

wankai123 reviewed Nov 13, 2025

Merge branch 'master' into master

06a96e8

Merge branch 'master' into master

37cc68a

youjie23 added 2 commits November 17, 2025 00:39

Sync UI

3b8e9c5

Merge branch 'master' of github.com:youjie23/skywalking

eb77ce3

wu-sheng reviewed Nov 16, 2025

wu-sheng requested a review from Copilot November 17, 2025 03:36

Copilot AI reviewed Nov 17, 2025

docs:update changes.md and backend-alarm.md

dad19b8

wu-sheng approved these changes Nov 17, 2025

wu-sheng requested a review from wankai123 November 17, 2025 06:32

wankai123 approved these changes Nov 17, 2025

wankai123 merged commit cfbb00d into apache:master Nov 17, 2025
351 of 353 checks passed

youjie23 mentioned this pull request Nov 17, 2025

Revert: changes to e2e test trigger times (introduced in #13539) #13580

Closed
