7.1.3-r2 documentation #890

Open
Chr1st0ph3rTurn3r wants to merge 77 commits into master from 7.1.0-r2-documentation

Conversation

@Chr1st0ph3rTurn3r

No description provided.


## Download Failover Resiliency

SSR images can be downloaded from a variety of sources, depending on software access mode (eg. internet-only, prefer-conductor, conductor-only, offline-mode): the HA peer, both conductor nodes, artifactory, and the mist proxy to artifactory (cloud deployments only).

Should Mist be capitalized?


To improve resiliency to network connectivity issues, the SSR queries available versions from all sources before beginning the download. It compiles a list of sources where the requested version is available and begins the download. If more than 50% of requests to a source fail within a window of 10 requests, the SSR marks that source unavailable and moves on to the next source. The following priority order is used for sources:

In my mind the size of the window is more of an implementation detail and may be subject to change based on tuning. We may want to be less specific about that in case we decide to adjust it in the future. But this may be fine too. Not sure how likely we are to need to adjust it
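
For reference while reviewing, the failover rule in the paragraph under discussion can be sketched in a few lines of Python. This is a minimal illustration, not the SSR implementation: the `SourceHealth` class, the `pick_source` helper, and the source names are hypothetical, and the window size and failure ratio are parameters precisely because they may be tuned later. It also assumes a source is judged only once its window is full, which the text does not state.

```python
from collections import deque


class SourceHealth:
    """Tracks recent request outcomes for one download source (hypothetical)."""

    def __init__(self, name, window=10, max_failure_ratio=0.5):
        self.name = name
        self.max_failure_ratio = max_failure_ratio
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        """Record one request outcome: True for success, False for failure."""
        self.outcomes.append(ok)

    def is_available(self):
        # Assumption: do not judge a source until the window is full.
        if len(self.outcomes) < self.outcomes.maxlen:
            return True
        failures = sum(1 for ok in self.outcomes if not ok)
        # "More than 50%" means exactly 50% still counts as available.
        return failures / len(self.outcomes) <= self.max_failure_ratio


def pick_source(sources):
    """Return the first available source in priority order, or None."""
    for source in sources:
        if source.is_available():
            return source
    return None


# Usage: the list order stands in for the priority order.
peer = SourceHealth("peer")
conductor = SourceHealth("conductor")
for outcome in [True] * 4 + [False] * 6:
    peer.record(outcome)
chosen = pick_source([peer, conductor])
print(chosen.name if chosen else "no source available")
```

Keeping the window and ratio as constructor arguments means the documentation could describe the behavior without committing to specific numbers.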


In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.

When the timeout is enabled, the SSR waits for a configurable amount of time (default is 10800s) for the download to complete. When the timeout value is reached, the download is marked as **Failed** and the retry delay begins.

This is not quite accurate. The retry delay will begin once we have marked all download sources as unavailable, as described in the failover resilience section. If enabled, once this timeout is hit, the download will be entirely stopped and marked as a failure. Or in other words, the retries happen inside of this timeout, not after it.


### Sequenced HA Download

The SSR supports sequenced downloading; one node of an HA pair downloads an image from the remote repository, and the other node waits for it to complete. Once that download is complete, the second node downloads it from the first. When targeting an HA router, the download is sequenced by default. To disable this sequencing, use `request system software download simultaneous disable`.

One note about "the second node downloads it from the first": the peer is the first place that an HA router will attempt to download from, so in most cases this would be true. However, if for whatever reason the connection to the peer went down, the router would move on and continue downloading from the conductor or remote sources. Not sure if that needs to be clarified or not.

Does the download happen over the HA sync connection or the HA fabric?

I believe it's the HA sync connection


## Configuration

Three components: Onboarding conductor, router, Operational conductor.

This is a customer-specific topology. We shouldn't limit this doc to just this use case. The doc should only talk about the router and conductor.


The next step in the process is to generate an onboarding token from conductor Web interface, command line, or using APIs. The generated tokens are signed by the conductor’s private key so that they cannot be altered once generated. The SSR supports two modes; Authority Wide and Router Specific tokens. These are mutually exclusive and are defined in the configuration.

#### Authority-Wide Tokens

This concept is removed from the FS and should be deleted from the doc. We will only support per router tokens.

@Chr1st0ph3rTurn3r requested review from BenMatase and agrawalkaushik and removed request for plessard128 November 24, 2025 18:13

@BenMatase left a comment

Feels like there is some duplicate information in the SCO doc.


### Prerequisites

- The `secure-conductor-onboarding mode` must be enabled

This makes it sound like the mode exists only at the authority level. We don't have a mode at the authority level at this time.


To provide a secure and mutually authenticated onboarding mechanism, the following information must be configured.

- Pre-shared key: The onboarding pre-shared key is a 48-character alpha-numeric string, configured at the authority or the router level. This key is mandatory for the SCO process.

Not at the authority level for now


- Conductor Public certificate: A public-private key certificate.
- Conductor CA certificate: Optionally, you can configure a public certificate signed by a preferred CA signing authority.

Not optional


After the user generates an onboarding token, enter the token and other onboarding details in the onboarding UI or using CLI commands. There are two methods to onboard a router:

- Using the Command line: `secure-conductor-onboarding-token` command and `onboarding-config.json`.

- Using the Command line: `create secure-conductor-onboarding token` command and `onboarding-config.json`.

4. The router connects to the conductor over port 930 using the SSH keys exchanged in previous steps.
5. The router is prepped and initialized by the conductor. During this process, the system goes through the reboot cycle.

Once the secure SSH tunnels are established, the SCO workflow concludes. All future communication between the router and conductor will occur on standard SSR to conductor ports such as 930, 4505, 4506, etc.

If SCO happens, the router won't use 4505/4506 from that point on. Everything is over 930.


`configure authority router secure-conductor-onboarding pre-shared-secret`

The pre-shared secret is a 48-character alpha-numeric string. When enabled, any empty PSK will auto generate a random 48-byte alphanumeric string using the FIPS-approved, highly secure DRBG function from OpenSSL. Once generated, the key does not automatically change. It can be updated by the user if necessary.

This is not complete yet
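
To illustrate the key format only, a 48-character alphanumeric secret can be produced as in the Python sketch below. It uses the standard-library `secrets` module as a stand-in; it is not the OpenSSL DRBG path that the paragraph describes, and `generate_psk` is a hypothetical helper name, not an SSR command or API.

```python
import secrets
import string

# Alphanumeric alphabet: ASCII upper-case, lower-case, and digits (62 symbols).
ALPHABET = string.ascii_letters + string.digits


def generate_psk(length=48):
    """Return a random alphanumeric string of the given length (hypothetical helper)."""
    # secrets.choice draws from the operating system's CSPRNG.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


psk = generate_psk()
print(len(psk))
```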


### Token Contents

The next step in the process is to generate an onboarding token from the conductor Web interface, command line, or using APIs. The generated tokens are signed by the conductor’s private key so that they cannot be altered once generated. The SSR supports two modes; Authority-wide and Router-specific tokens. These are mutually exclusive and are defined in the configuration.

I think this doc needs to be scrubbed of "authority wide" tokens for now


The following parameters are required, and are configured at the Router level.

`configure authority router secure-conductor-onboarding mode`

This might not match the func spec exactly, but the router-level path is at `configure authority router system secure-conductor-onboarding`. This applies to the other paths in the doc.

### Auto-resume Download on WAN Failures

In the event that all sources have reached the threshold of consecutive failures and a download attempt has failed, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.
In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off. Use the `software-update download enable-timeout` command to enable the retry feature.

The enable-timeout field is separate from retries. The only thing it enables is the timeout described in the next paragraph, and retries will happen regardless of whether the timeout is enabled

When the timeout is enabled, the SSR waits for a configurable amount of time (default is 10800s) for the download to complete. When the timeout value is reached, the download is marked as **Failed** and the retry delay begins.
When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

Is it worth noting that the timeout is enabled by default?

The retry delay time is the longest time to wait between retry attempts. For example, the initial retry delay starts at 30 seconds. With each failure the delay is increased exponentially. However, when that calculated value reaches the maximum retry delay time, successive wait times for additional attempts do not exceed the maximium retry delay time. The default is 3600 seconds. A maximum number of times to retry can also be configured.

The retry timeout can be disabled. If it is disabled, the download will retry indefinitely.
If the retry timeout is disabled, the download will retry indefinitely

As mentioned above, the timeout is a separate mechanism from the retries, so I wouldn't necessarily describe it as a retry timeout. And the download would only retry indefinitely if both the timeout is disabled and the attempts is configured to 0.
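
The interaction described in this comment (retries run inside the overall timeout, and retrying is unbounded only when the timeout is disabled and the attempt limit is 0) can be sketched as a small decision function. This is an illustration under stated assumptions, not the SSR code: `may_retry` and its parameter names are hypothetical, and the default attempt limit shown here is a placeholder rather than a documented default.

```python
def may_retry(elapsed_s, attempts_made, timeout_enabled=True,
              timeout_s=10800, max_attempts=0):
    """Return True if another retry is allowed (hypothetical helper).

    max_attempts == 0 is taken to mean "no limit on attempts",
    following the review comment above.
    """
    # The overall timeout stops the download regardless of retry settings.
    if timeout_enabled and elapsed_s >= timeout_s:
        return False
    # A non-zero attempt limit stops retries even when the timeout is disabled.
    if max_attempts != 0 and attempts_made >= max_attempts:
        return False
    return True


# Timeout enabled: retries stop once 10800 seconds have elapsed.
print(may_retry(elapsed_s=10800, attempts_made=3))
# Timeout disabled and no attempt limit: retries continue indefinitely.
print(may_retry(elapsed_s=10**9, attempts_made=10**6, timeout_enabled=False))
```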


### Sequenced HA Download

The SSR supports sequenced downloading; one node of an HA pair downloads an image from the remote repository, and the other node waits for it to complete. Once that download is complete, the second node downloads it from the first. When targeting an HA router, the download is sequenced by default. To disable this sequencing, use `request system software download simultaneous disable`.

Update: I ended up making the download unsequenced by default. I may change that in the future, but in the beta we're giving Swift, it will be unsequenced.
To do a sequenced download, you would use `request system software download router RouterName version SSR-X.Y.Z sequenced`.


After the user generates an onboarding token, enter the token and other onboarding details in the onboarding UI or using CLI commands. There are two methods to onboard a router:

- Using the Command line: `create secure-conductor-onboarding-token` command and `onboarding-config.json`.

The command still needs to be fixed

Contributor Author

What is wrong with it? I copied your command from the earlier review. Am I missing something?


To enable this feature on the conductor, verify the following:

- The `secure conductor onboarding mode` should not be disabled (see above).

This line should be removed. The conductor/whole authority doesn't have a mode


The CA certificate is read from disk at the location given in `secure-conductor-onboarding ca-certificate`.

## Token Management

This section is a dup of the Token Creation section and can be removed

In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.

When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".
The timeout is enabled by default (`software-update download enable-timeout true`). The SSR waits for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

This is accurate, but something I hadn't thought of when reviewing before is that the retry configuration in the paragraph below is probably more significant than the timeout configuration, so I might swap the two paragraphs.

The retry delay time is the longest time to wait between retry attempts. For example, the initial retry delay starts at 30 seconds. With each failure the delay is increased exponentially. However, when that calculated value reaches the maximum retry delay time, successive wait times for additional attempts do not exceed the maximium retry delay time. The default is 3600 seconds. A maximum number of times to retry can also be configured.

typo in maximium
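
As a sanity check on the retry delay paragraph, the capped exponential backoff it describes can be sketched in Python. The 30-second initial delay and 3600-second maximum come from the text; the growth factor is not stated there, so doubling is an assumption, and `retry_delays` is a hypothetical name.

```python
from itertools import islice


def retry_delays(initial=30, maximum=3600, factor=2):
    """Yield successive wait times in seconds, never exceeding `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        # Grow the delay, but clamp it so later waits stay at the cap.
        delay = min(delay * factor, maximum)


# First nine waits under the assumed doubling factor.
print(list(islice(retry_delays(), 9)))
# [30, 60, 120, 240, 480, 960, 1920, 3600, 3600]
```

Under these assumptions the delay reaches the cap on the eighth attempt, which may be a useful concrete example for the paragraph.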


If the retry timeout is disabled, the download will retry indefinitely

Use the command `configure authority router system software-update download enable-timeout [enabled]` to enable auto-resume. The command parameters are listed below:

The enable-timeout field doesn't really enable auto-resume. It's just a way you can tune the behavior to meet your needs. Maybe something along the lines of this would be more accurate?

Use the command `configure authority router system software-update download` to adjust the download retry behavior. The command parameters are listed below:

maximum-retry-delay

@MichaelBaj

- Add PMTU documentation
- Add documentation for SHA384 and SHA512

| `key-exchange-algorithm` | The algorithm to use for exchanging keys between peers. Algorithm types include: `diffie-hellman`, `ml-kem`, or `diffie-hellman-ml-kem`. |
| `ml-kem` | Use the `ml-kem-key-size` parameter to define the key size to use. Possible values in order of increasing security strength and decreasing performance are 512, 768 or 1024. |
| `diffie-hellman` | Use the diffie-hellman-key-size parameter to define the key size to use. Possible values in order of increasing security strength and decreasing performance are 1024, 2048 or 4096. |
| `diffie-hellman-ml-kem` | Use this parameter if you require hybrid mode cryptography. This employs both methods of encryption for greater security. Be aware that there is a performance impact with this selection. The above values are used and set individually in the configuration. |

Has the performance impact been measured yet?


A Trusted Platform Module (TPM) is a secure cryptoprocessor that stores cryptographic keys. It serves as a secure storage mechanism for essential security artifacts such as digital certificates.

## TPM-Based Certificates

We should include some language for the vTPM here as well. I believe @haberkornsam had some info for the public cloud docs. He's out sick so will review with him next week.

I think the blurb on vTPM below is good. We could maybe include something about the Endorsement Key and Attestation Keys, and how it's required for the vTPM to be initialized with an Endorsement seed and we will generate an EK and AK. But that is also getting pretty technical and not SSR specific.
