docs/events-alarm-descriptions.md
Outdated
| | Category | Description | Threshold / Cause | Troubleshooting | | ||
| |----------|-------------------|-------------|-----| | ||
| | System | System memory exceeded: `{threshold}%` | Threshold is 90%. The alarm triggers above 90% memory usage, and resets/clears when memory usage drops below 80%. | A process is consuming excessive memory. Locate the system processes consuming large amounts of system memory by running `show stats process memory rss` from the PCLI. | |
This command does not list every process. It is useful, but running `top` would be just as useful, perhaps more so. I do not have a great suggestion for how to word it.
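For reference while rewording this row: the raise/clear behavior it describes (alarm above 90%, clear below 80%) is a hysteresis band, so the alarm does not flap when usage hovers near the threshold. A minimal Python sketch, assuming hypothetical names (`HysteresisAlarm`, `update`) that are not part of the SSR software; only the 90% and 80% values come from the row above:

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlarm:
    """Illustrative model only; names and structure are assumptions."""

    raise_above: float = 90.0  # alarm triggers strictly above this usage (%)
    clear_below: float = 80.0  # alarm clears strictly below this usage (%)
    active: bool = False

    def update(self, usage_percent: float) -> bool:
        """Feed one memory-usage sample and return the alarm state."""
        if not self.active and usage_percent > self.raise_above:
            self.active = True
        elif self.active and usage_percent < self.clear_below:
            self.active = False
        return self.active


alarm = HysteresisAlarm()
# 85% is below the trigger, 91% raises the alarm, 85% keeps it raised
# (still inside the band), and 79% clears it.
states = [alarm.update(u) for u in (85.0, 91.0, 85.0, 79.0)]
print(states)  # [False, True, True, False]
```

If the doc wording keeps both numbers, it may help to state explicitly that usage between 80% and 90% leaves the alarm in whatever state it was already in.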
docs/events-alarm-descriptions.md
Outdated
| | System | Node `{node}` with version `{nodeVersion}` does not match the HA peer with version `{peerVersion}`||| | ||
| | System | Unable to communicate with Chassis Manager | SSR-4xx series only || | ||
| | System | The following chassis sensor(s) are approaching critical temperatures: `{sensors}` | SSR-4xx series only | System performance has been throttled to mitigate heat. The system will shut down if the temperature continues to rise. | | ||
| | System | Node `{node}` went offline | Issued when an HA node goes offline. The HA peer node has shut down or stopped running. | Verify that the HA peer node is powered on and running. If the node is running verify that the SSR service is running without error by issuing the command `systemctl status 128T`. If the system appears to be running correctly, check connectivity between the systems by issuing the PCLI command `show system connectivity` on both nodes. | |
Lines 16 and 17 are duplicates of each other. Is this on purpose?
docs/events-alarm-descriptions.md
Outdated
| | System | The following chassis sensor(s) are approaching critical temperatures: `{sensors}` | SSR-4xx series only | System performance has been throttled to mitigate heat. The system will shut down if the temperature continues to rise. | | ||
| | System | Node `{node}` went offline | Issued when an HA node goes offline. The HA peer node has shut down or stopped running. | Verify that the HA peer node is powered on and running. If the node is running verify that the SSR service is running without error by issuing the command `systemctl status 128T`. If the system appears to be running correctly, check connectivity between the systems by issuing the PCLI command `show system connectivity` on both nodes. | | ||
| | System | Node `{node}` went offline | Issued when an HA node goes offline. Connectivity between HA nodes is down. | HA node connectivity can be evaluated with the PCLI command `show system connectivity`. If the state to the peer node is not `connected` check the inter node tunnel status by running the PCLI command `show system connectivity internal`. All tunnels to the peer node should report “connected”. If connectivity is down, verify links between the systems. If they are up, please contact Juniper support. | | ||
| | System | `{node}`: Internal Synchronization database is disconnected | | Verify connectivity between hardware nodes is healthy. Check for additional related alarms. If connectivity is present, please contact customer support for additional assistance.| |
I believe the intent is to say "HA" nodes, not "hardware" nodes.
docs/events-alarm-descriptions.md
Outdated
| | Peer | Peer `{peer}` metadata-key is invalid: `{state}` - `{detail}` || | ||
| | Peer | The following certificates are expired: `{value}` || | ||
| | Peer | The following certificates are revoked: `{value}` || | ||
| | Platform | Security Rekey failed for: `<node-name(s)>` | Issued when a conductor fails to distribute newly created security keys during rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem persists, please contact Juniper customer support. | |
In some places we say "Contact customer support for assistance."; in others, "please contact Juniper customer support." I would suggest making these consistent and removing the reference to Juniper.
docs/events-alarm-descriptions.md
Outdated
| | Peer | The following certificates are expired: `{value}` || | ||
| | Peer | The following certificates are revoked: `{value}` || | ||
| | Platform | Security Rekey failed for: `<node-name(s)>` | Issued when a conductor fails to distribute newly created security keys during rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem persists, please contact Juniper customer support. | | ||
| | Platform | Configure security key distribution failed for: `{configKeyError}` | | TBD | |
docs/events-alarm-descriptions.md
Outdated
| | System | Unable to handle configuration request from Conductor ||| | ||
| | System | Received correct configuration update from conductor but unable to commit | | | | ||
| | System | Received configuration from conductor but unable to commit in local override | A local override has been defined which conflicts with a configuration received from conductor. | Review your local override configuration. | | ||
| | System | Failed to parse configuration from conductor. | ? | Verify whether the conductor and routers are running a compatible software release. If not, upgrade where appropriate. If they are and the issue persists, please contact customer support. | |
docs/events-alarm-descriptions.md
Outdated
| | Peer | Peer `{peer}` path is down | A single path is marked down by BFD. The source of the alarm includes the Node/interface/IP/VLAN. **Path health has degraded sufficiently and is impacting performance.** | Using the GUI, click the Home icon and select the appropriate view for the current environment. Examine the graph for any anomalies at the time of the alarm. If the loss is 5% or higher the path has degraded. | | ||
| | Peer | The following certificates are expiring in less than 7 days: `{value}` | A valid certificate must be obtained from a Certificate Authority before valid secure communication can take place. | Verify certificate key exchange values. The `security metadata-key regenerate` command can be issued to force the active node to immediately regenerate the metadata key. | | ||
| | Peer | Peer `{name}` path MTU is unresolvable | Maximum Transmit Unit for packet size is unable to be determined. | Set the MTU for the device-interface statically. | | ||
| | Platform | Application script on interface `{name}` exited unexpectedly || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. | |
Consider rewording "This should not happen if there is no software or hardware defect and the systems were properly sized."
docs/events-alarm-descriptions.md
Outdated
| | Platform | Application script on interface `{name}` exited unexpectedly || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. | | ||
| | Platform | Application script on interface `{name}` was restarted after unexpected exit ||| | ||
| | Platform | External modification of `{resolvConfPath}` detected ||| | ||
| | Platform | Data plane CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance. | |
I do not agree with "The SSR tracks CPU resources. The alarm may indicate one or more of the following:
- Insufficient CPU
- Hardware issue
- Software issue"
I think this should instead talk about making sure enough cores are allocated to the datapath to support the amount of forwarding traffic this router is seeing.
docs/events-alarm-descriptions.md
Outdated
| | Platform | Application script on interface `{name}` was restarted after unexpected exit ||| | ||
| | Platform | External modification of `{resolvConfPath}` detected ||| | ||
| | Platform | Data plane CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance. | | ||
| | Platform | Traffic Engineering CPU exceeded: `{threshold}`% | The threshold is 85% | The SSR tracks CPU resources. The alarm may indicate one or more of the following: <br/>- Insufficient CPU <br/>- Hardware issue<br/>- Software issue<br/> Contact customer support for assistance. | |
docs/events-alarm-descriptions.md
Outdated
| | GIID | DHCP address for interface [`<interface name>`] has not been resolved. | Issued when DHCP address for interface is unresolved. Interface configured to obtain address dynamically using DHCP but was not able to acquire one in time. | Ensure the interface is operationally up. <br/>Ensure the interface is connected to a network with a DHCP server and the server will accept the node’s request for DHCP address.<br/>Collect the DHCP statistics to check for any failures.<br/>Collect packet traces on the DHCP interface to investigate any protocol level failures. | | ||
| | Redundancy | Process `{process}` on node `{node}` is now active | handleProcessLeaderChange || | ||
| | Redundancy | Node `{node}` is now active for shared interface | handleInterfaceLeaderChange || | ||
| | Service | `{node}`: SessionProc CPU usage alarm || This should not happen if there is no software or hardware defect and the systems were properly sized. Please contact customer support. | |
Consider rewording "This should not happen if there is no software or hardware defect and the systems were properly sized."
…ll need more info to fill out empty fields.
No description provided.