Alarms list updates for the EU #916

Open
Chr1st0ph3rTurn3r wants to merge 6 commits into master from alarms-list-updates

Conversation

@Chr1st0ph3rTurn3r
Contributor

No description provided.


| Category | Description | Threshold / Cause | Troubleshooting |
|----------|-------------------|-------------|-----|
| System | System memory exceeded: `{threshold}%` | Threshold is 90%. The alarm triggers above 90% memory usage, and resets/clears when memory usage drops below 80%. | A process is consuming excessive memory. Locate the system processes consuming large amounts of system memory by running `show stats process memory rss` from the PCLI. |

This command does not show a complete list of processes. It is useful, but running `top` would be just as useful (and perhaps more so). I do not have a great suggestion for how to word it.
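
The memory alarm row above describes a hysteresis: the alarm is raised above 90% and only clears once usage drops below 80%, so usage hovering near the threshold does not make it flap. A minimal sketch of that behavior, assuming the two thresholds from the table; the class name, field names, and sample values are hypothetical and are not taken from the SSR implementation:

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlarm:
    """Illustrative two-threshold alarm; not the product's actual code."""

    raise_above: float = 90.0  # from the table: triggers above 90%
    clear_below: float = 80.0  # from the table: clears below 80%
    active: bool = False

    def update(self, usage_percent: float) -> bool:
        """Feed one memory-usage sample and return the alarm state."""
        if not self.active and usage_percent > self.raise_above:
            self.active = True
        elif self.active and usage_percent < self.clear_below:
            self.active = False
        return self.active


alarm = HysteresisAlarm()
# Hypothetical usage samples, in percent.
states = [alarm.update(u) for u in (85, 91, 88, 80, 79, 85)]
print(states)
```

Between the two thresholds the alarm keeps its previous state, which is why a sample of 88% leaves it raised after 91% but a later 85% leaves it cleared.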

| System | Node `{node}` with version `{nodeVersion}` does not match the HA peer with version `{peerVersion}`|||
| System | Unable to communicate with Chassis Manager | SSR-4xx series only ||
| System | The following chassis sensor(s) are approaching critical temperatures: `{sensors}` | SSR-4xx series only | System performance has been throttled to mitigate heat. The system will shut down if the temperature continues to rise. |
| System | Node `{node}` went offline | Issued when an HA node goes offline. The HA peer node has shut down or stopped running. | Verify that the HA peer node is powered on and running. If the node is running, verify that the SSR service is running without error by issuing the command `systemctl status 128T`. If the system appears to be running correctly, check connectivity between the systems by issuing the PCLI command `show system connectivity` on both nodes. |

lines 16 and 17 are duplicates of each other. Is this on purpose?

| System | Node `{node}` went offline | Issued when an HA node goes offline. Connectivity between HA nodes is down. | HA node connectivity can be evaluated with the PCLI command `show system connectivity`. If the state of the peer node is not `connected`, check the inter-node tunnel status by running the PCLI command `show system connectivity internal`. All tunnels to the peer node should report “connected”. If connectivity is down, verify links between the systems. If they are up, please contact Juniper support. |
| System | `{node}`: Internal Synchronization database is disconnected | | Verify connectivity between hardware nodes is healthy. Check for additional related alarms. If connectivity is present, please contact customer support for additional assistance.|

I believe the intent is to say "ha" nodes, not "hardware" nodes

| Peer | Peer `{peer}` metadata-key is invalid: `{state}` - `{detail}` |||
| Peer | The following certificates are expired: `{value}` |||
| Peer | The following certificates are revoked: `{value}` |||
| Platform | Security Rekey failed for: `<node-name(s)>` | Issued when a conductor fails to distribute newly created security keys during the rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem persists, please contact Juniper customer support. |

In some places we say "Contact customer support for assistance.", in others, "please contact Juniper customer support.". I would suggest consistency and removal of reference to Juniper.

| Platform | Configure security key distribution failed for: `{configKeyError}` | | TBD |

remove TBD

| System | Unable to handle configuration request from Conductor |||
| System | Received correct configuration update from conductor but unable to commit | | |
| System | Received configuration from conductor but unable to commit in local override | A local override has been defined which conflicts with a configuration received from conductor. | Review your local override configuration. |
| System | Failed to parse configuration from conductor. | ? | Verify whether the conductor and routers are running a compatible software release. If not, upgrade where appropriate. If they are and the issue persists, please contact customer support. |

remove question marks

| Peer | Peer `{peer}` path is down | A single path is marked down by BFD. The source of the alarm includes the Node/interface/IP/VLAN. **Path health has degraded sufficiently and is impacting performance.** | Using the GUI, click the Home icon and select the appropriate view for the current environment. Examine the graph for any anomalies at the time of the alarm. If the loss is 5% or higher, the path has degraded. |
| Peer | The following certificates are expiring in less than 7 days: `{value}` | A valid certificate must be obtained from a Certificate Authority before valid secure communication can take place. | Verify certificate key exchange values. The `security metadata-key regenerate` command can be issued to force the active node to immediately regenerate the metadata key. |
| Peer | Peer `{name}` path MTU is unresolvable | The Maximum Transmission Unit (MTU) for the path cannot be determined. | Set the MTU for the device-interface statically. |
| Platform | Application script on interface `{name}` exited unexpectedly || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. |

consider rewording "This should not happen if there is no software or hardware defect and the systems were properly sized."

| Platform | Application script on interface `{name}` was restarted after unexpected exit |||
| Platform | External modification of `{resolvConfPath}` detected |||
| Platform | Data plane CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance. |

I do not agree with "The SSR tracks CPU resources. The alarm may indicate one or more of the following:

  • Insufficient CPU
  • Hardware issue
  • Software issue"

I think this should be talking about making sure enough cores are allocated to datapath to support the amount of forwarding traffic this router is seeing

| Platform | Traffic Engineering CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance. |

same as above

| GIID | DHCP address for interface [`<interface name>`] has not been resolved. | Issued when the DHCP address for an interface is unresolved. The interface is configured to obtain an address dynamically using DHCP but was not able to acquire one in time. | Ensure the interface is operationally up. <br/>Ensure the interface is connected to a network with a DHCP server and the server will accept the node’s request for a DHCP address.<br/>Collect the DHCP statistics to check for any failures.<br/>Collect packet traces on the DHCP interface to investigate any protocol-level failures. |
| Redundancy | Process `{process}` on node `{node}` is now active | handleProcessLeaderChange ||
| Redundancy | Node `{node}` is now active for shared interface | handleInterfaceLeaderChange ||
| Service | `{node}`: SessionProc CPU usage alarm || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. |

consider rewording "This should not happen if there is no software or hardware defect and the systems were properly sized."
