Alarms list updates for the EU by Chr1st0ph3rTurn3r · Pull Request #916 · 128technology/docs

Chr1st0ph3rTurn3r · 2026-01-27T21:59:03Z

No description provided.

migolnikov · 2026-01-28T07:50:40Z

docs/events-alarm-descriptions.md

+
+| Category | Description | Threshold / Cause | Troubleshooting |
+|----------|-------------------|-------------|-----|
+| System | System memory exceeded: `{threshold}%` | Threshold is 90%. The alarm triggers above 90% memory usage, and resets/clears when memory usage drops below 80%. | A process is consuming excessive memory. Locate the system processes consuming large amounts of system memory by running `show stats process memory rss` from the PCLI. | 


This command does not show a complete list of processes. It is useful, but running `top` would be just as useful, and perhaps more so. I do not have a great suggestion for how to word it.
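For readers of the row above, the raise/clear behaviour it describes (raise above 90%, clear below 80%) is a hysteresis band. The sketch below only illustrates that wording. The class name, method name, and sample values are hypothetical, and it is not taken from the SSR implementation.

```python
from dataclasses import dataclass


@dataclass
class MemoryAlarm:
    """Hypothetical hysteresis alarm: raise above raise_pct, clear below clear_pct."""

    raise_pct: float = 90.0  # from the table row: triggers above 90% usage
    clear_pct: float = 80.0  # from the table row: clears below 80% usage
    active: bool = False

    def update(self, usage_pct: float) -> bool:
        """Feed one memory-usage sample and return the resulting alarm state."""
        if not self.active and usage_pct > self.raise_pct:
            self.active = True
        elif self.active and usage_pct < self.clear_pct:
            self.active = False
        return self.active


# Made-up samples: the alarm stays raised between 80% and 90% once triggered.
alarm = MemoryAlarm()
states = [alarm.update(sample) for sample in (85, 91, 88, 82, 79, 85)]
print(states)
```

The band between the two thresholds is what keeps the alarm from flapping when usage hovers around 90%, which may be worth stating explicitly in the Threshold / Cause column.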

migolnikov · 2026-01-28T07:53:49Z

docs/events-alarm-descriptions.md

+| System | Node `{node}` with version `{nodeVersion}` does not match the HA peer with version `{peerVersion}`|||
+| System | Unable to communicate with Chassis Manager | SSR-4xx series only ||
+| System | The following chassis sensor(s) are approaching critical temperatures: `{sensors}` | SSR-4xx series only | System performance has been throttled to mitigate heat. The system will shut down if the temperature continues to rise. |
+| System | Node `{node}` went offline | Issued when an HA node goes offline. The HA peer node has shut down or stopped running. | Verify that the HA peer node is powered on and running. If the node is running verify that the SSR service is running without error by issuing the command `systemctl status 128T`. If the system appears to be running correctly, check connectivity between the systems by issuing the PCLI command `show system connectivity` on both nodes. |


Lines 16 and 17 are duplicates of each other. Is this on purpose?
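A check like the following could flag repeated alarm descriptions before review. It is a minimal sketch, assuming each table row is one markdown line and that the first two cells are Category and Description. The function names are hypothetical and the sample rows are abbreviated from the diff.

```python
from collections import defaultdict


def split_row(line: str) -> list[str]:
    """Split one markdown table row (optionally prefixed with a diff '+') into cells."""
    body = line.strip().lstrip("+").strip().strip("|")
    return [cell.strip() for cell in body.split("|")]


def find_repeated_alarms(rows: list[str]) -> dict[tuple[str, str], list[int]]:
    """Map (category, description) to the 1-based row numbers where it repeats."""
    seen: dict[tuple[str, str], list[int]] = defaultdict(list)
    for number, line in enumerate(rows, start=1):
        cells = split_row(line)
        if len(cells) >= 2:
            seen[(cells[0], cells[1])].append(number)
    return {key: numbers for key, numbers in seen.items() if len(numbers) > 1}


# Abbreviated sample rows; the third and fourth cells are shortened for illustration.
rows = [
    "| System | Node `{node}` went offline | The HA peer node has shut down. | Verify power. |",
    "| System | Node `{node}` went offline | Connectivity between HA nodes is down. | Check links. |",
    "| System | Unable to communicate with Chassis Manager | SSR-4xx series only ||",
]
repeats = find_repeated_alarms(rows)
print(repeats)
```

Rows that share a description but differ in cause, as these two do, would still be reported, so the output is a prompt for a human decision rather than an error list.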

migolnikov · 2026-01-28T07:55:46Z

docs/events-alarm-descriptions.md

+| System | The following chassis sensor(s) are approaching critical temperatures: `{sensors}` | SSR-4xx series only | System performance has been throttled to mitigate heat. The system will shut down if the temperature continues to rise. |
+| System | Node `{node}` went offline | Issued when an HA node goes offline. The HA peer node has shut down or stopped running. | Verify that the HA peer node is powered on and running. If the node is running verify that the SSR service is running without error by issuing the command `systemctl status 128T`. If the system appears to be running correctly, check connectivity between the systems by issuing the PCLI command `show system connectivity` on both nodes. |
+| System | Node `{node}` went offline | Issued when an HA node goes offline. Connectivity between HA nodes is down. | HA node connectivity can be evaluated with the PCLI command `show system connectivity`. If the state to the peer node is not `connected` check the inter node tunnel status by running the PCLI command `show system connectivity internal`. All tunnels to the peer node should report “connected”. If connectivity is down, verify links between the systems. If they are up, please contact Juniper support. |
+| System | `{node}`: Internal Synchronization database is disconnected | | Verify connectivity between hardware nodes is healthy. Check for additional related alarms. If connectivity is present, please contact customer support for additional assistance.|


I believe the intent is to say "HA" nodes, not "hardware" nodes.

migolnikov · 2026-01-28T07:58:30Z

docs/events-alarm-descriptions.md

+| Peer | Peer `{peer}` metadata-key is invalid: `{state}` - `{detail}` ||
+| Peer | The following certificates are expired: `{value}` ||
+| Peer | The following certificates are revoked: `{value}` ||
+| Platform | Security Rekey failed for: `<node-name(s)>` | Issued when a conductor fails to distribute newly created security keys during rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem persists, please contact Juniper customer support. |


In some places we say "Contact customer support for assistance."; in others, "please contact Juniper customer support." I would suggest making the wording consistent and removing the reference to Juniper.
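To see how many variants are in play, the phrasings can be collected with a regular expression. This is a rough sketch: the pattern and function name are hypothetical, and the sample sentences are the three wordings that appear in the diff hunks in this thread.

```python
import re

# Hypothetical pattern: "contact", then optional "Juniper" and/or "customer", then "support".
SUPPORT_PHRASE = re.compile(
    r"contact\s+(?:juniper\s+)?(?:customer\s+)?support", re.IGNORECASE
)


def support_phrasings(text: str) -> list[str]:
    """Return the distinct support phrasings found in text, lower-cased and sorted."""
    found = {" ".join(m.group(0).lower().split()) for m in SUPPORT_PHRASE.finditer(text)}
    return sorted(found)


sample = (
    "If they are up, please contact Juniper support. "
    "If the problem persists, please contact Juniper customer support. "
    "Contact customer support for assistance."
)
variants = support_phrasings(sample)
print(variants)
```

More than one entry in the result means the document is not yet consistent.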

migolnikov · 2026-01-28T07:59:05Z

docs/events-alarm-descriptions.md

+| Peer | The following certificates are expired: `{value}` ||
+| Peer | The following certificates are revoked: `{value}` ||
+| Platform | Security Rekey failed for: `<node-name(s)>` | Issued when a conductor fails to distribute newly created security keys during rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem persists, please contact Juniper customer support. |
+| Platform | Configure security key distribution failed for: `{configKeyError}` | | TBD |


migolnikov · 2026-01-28T08:19:08Z

docs/events-alarm-descriptions.md

+| System | Unable to handle configuration request from Conductor |||
+| System | Received correct configuration update from conductor but unable to commit | | |
+| System | Received configuration from conductor but unable to commit in local override | A local override has been definied which conflicts a configuration received from conductor. | Review your local override configuration. |
+| System | Failed to parse configuration from conductor. | ? | Verify whether the conductor and routers are running a compatible software release. If not, upgrade where appropriate. If they are and the issue persists, please contact customer support. |


Remove the question marks.
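Placeholder cells such as `?` and `TBD` can be located mechanically. The sketch below assumes one table row per line and treats only those two strings as placeholders. The names are hypothetical and the sample rows are abbreviated from the diff.

```python
PLACEHOLDERS = {"?", "TBD"}  # assumed set of placeholder markers


def placeholder_cells(rows: list[str]) -> list[tuple[int, int, str]]:
    """Return (row number, column number, marker) for each placeholder cell, 1-based."""
    hits = []
    for row_number, line in enumerate(rows, start=1):
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        for column, cell in enumerate(cells, start=1):
            if cell in PLACEHOLDERS:
                hits.append((row_number, column, cell))
    return hits


# Abbreviated sample rows; the last cell of the second row is shortened.
rows = [
    "| Platform | Configure security key distribution failed for: `{configKeyError}` | | TBD |",
    "| System | Failed to parse configuration from conductor. | ? | Verify software compatibility. |",
]
hits = placeholder_cells(rows)
print(hits)
```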

migolnikov · 2026-01-28T08:21:58Z

docs/events-alarm-descriptions.md

+| Peer | Peer `{peer}` path is down | A single path is marked down by BFD. The source of the alarm includes the Node/interface/IP/VLAN. **Path health has degraded sufficiently and is impacting performance.** | Using the GUI, click the Home icon and select the appropriate view for the current environment. Examine the graph for any anomalies at the time of the alarm. If the loss is 5% or higher the path has degraded. |
+| Peer | The following certificates are expiring in less than 7 days: `{value}` | A valid certificate must be obtained from a Certificate Authority before valid secure communication can take place. | Verify certificate key exchange values. The `security metadata-key regenerate` command can be issued to force the active node to immediately regenerate the metadata key. |
+| Peer | Peer `{name}` path MTU is unresolvable | Maximum Transmit Unit for packet size is unable to be determined. | Set the MTU for the device-interface statically. |
+| Platform | Application script on interface `{name}` exited unexpectedly || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. |


Consider rewording "This should not happen if there is no software or hardware defect and the systems were properly sized."

migolnikov · 2026-01-28T08:24:06Z

docs/events-alarm-descriptions.md

+| Platform | Application script on interface `{name}` exited unexpectedly || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. |
+| Platform | Application script on interface `{name}` was restarted after unexpected exit |||
+| Platform | External modification of `{resolvConfPath}` detected |||
+| Platform | Data plane CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance.  |


I do not agree with "The SSR tracks CPU resources. The alarm may indicate one or more of the following: Insufficient CPU, Hardware issue, Software issue".

I think this should talk about making sure enough cores are allocated to the datapath to support the amount of forwarding traffic this router is seeing.

migolnikov · 2026-01-28T08:25:01Z

docs/events-alarm-descriptions.md

+| Platform | Application script on interface `{name}` was restarted after unexpected exit |||
+| Platform | External modification of `{resolvConfPath}` detected |||
+| Platform | Data plane CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance.  |
+| Platform | Traffic Engineering CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance.  |


Same as above.

migolnikov · 2026-01-28T08:27:20Z

docs/events-alarm-descriptions.md

+| GIID | DHCP address for interface [`<interface name>`] has not been resolved. | Issued when DHCP address for interface is unresolved. Interface configured to obtain address dynamically using DHCP but was not able to acquire one in time. | Ensure the interface is operationally up. <br/>Ensure the interface is connected to a network with a DHCP server and the server will accept the node’s request for DHCP address.<br/>Collect the DHCP statistics to check for any failures.<br/>Collect packet traces on the DHCP interface to investigate any protocol level failures. |
+| Redundancy | Process `{process}` on node `{node}` is now active | handleProcessLeaderChange ||
+| Redundancy | Node `{node}` is now active for shared interface | handleInterfaceLeaderChange ||
+| Service | `{node}`: SessionProc CPU usage alarm || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. |


Consider rewording "This should not happen if there is no software or hardware defect and the systems were properly sized."


Chr1st0ph3rTurn3r added 2 commits January 26, 2026 16:18

integrating updated list of alarms with existing information

e099899

first draft of new alarms doc for review

68a2332

Chr1st0ph3rTurn3r requested a review from migolnikov January 27, 2026 21:59

Chr1st0ph3rTurn3r requested a review from MichaelBaj as a code owner January 27, 2026 21:59

migolnikov reviewed Jan 28, 2026


Chr1st0ph3rTurn3r and others added 4 commits January 28, 2026 08:19

Merge branch 'master' into alarms-list-updates

9117292

made several text edits, rearranged some information for clarity, still need more info to fill out empty fields.

eb26047

interim commit

c3d29ef

entering more info

294947e

Chr1st0ph3rTurn3r wants to merge 6 commits into master from alarms-list-updates


2 participants
