Improve Docker downgrade robustness and version reporting#72

Merged
dragonfire1119 merged 2 commits into master from fix-install-hangs
Dec 6, 2025

Conversation

@dragonfire1119
Contributor

This pull request updates both the main fix script and its test script for CasaOS Docker version management, focusing on improving reliability and visibility of Docker and containerd installation and status. The main enhancements include more robust package installation logic, expanded handling of the docker-ce-rootless-extras package, improved process management for containerd, and better reporting of package and binary versions for debugging.

Reliability and Installation Improvements

  • Added retry logic for Docker and containerd package installation, with verification steps to ensure the correct versions are installed even if apt-get returns warnings; a trap now always restores the service auto-start policy.
  • Expanded all package management commands (apt-mark hold/unhold, apt-get remove/install) to include docker-ce-rootless-extras for consistency and completeness.

Containerd Process and Version Handling

  • Improved containerd restart logic: now forcefully kills lingering containerd and containerd-shim processes, waits longer for readiness, and verifies the running binary version after restart.
  • Both scripts now report and verify the installed containerd package and binary versions, with explicit checks for expected versions in the test script.

Debugging and Status Reporting

  • Docker status and journal logs are now appended to /tmp/docker-install.log for easier troubleshooting.
  • The test script shows containerd package and binary versions alongside Docker details in status outputs.

User Messaging and Version Bump

  • Script version updated from 2025.12.0 to 2025.12.1 in all user-facing messages.
  • Added clearer step-by-step messages for systemd and containerd management.

These changes together make the Docker downgrade and upgrade processes more robust, transparent, and easier to debug, especially regarding containerd management and version accuracy.

Enhances the downgrade_docker function to retry apt-get installs, verify package installation after errors, and provide more detailed status and version reporting for containerd. Also updates package hold/unhold logic to include docker-ce-rootless-extras and improves logging in both run.sh and test-script.sh for better diagnostics and validation.
@coderabbitai

coderabbitai bot commented Dec 6, 2025

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved Docker and containerd installation reliability with retry logic and enhanced validation checks.
    • Strengthened post-installation verification with explicit version confirmation and diagnostic messaging.
    • Enhanced containerd restart handling and process cleanup for more robust system state management.
  • Chores

    • Version updated to 2025.12.1.


Walkthrough

Updated scripts to bump the version to 2025.12.1 and overhaul Docker downgrade/install flows: added multi-attempt install retries, trap-based policy restoration, expanded docker-ce-rootless-extras handling, enhanced containerd management and verification, and richer step-labeled diagnostics.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Docker Installation Flow Overhaul: casaos-fix-docker-api-version/run.sh | Version banner updated to 2025.12.1. Reworked downgrade/install flow with a multi-attempt install loop, per-attempt diagnostics, explicit success checks for docker-ce and containerd.io, trap to always restore policy-rc.d, expanded package hold/unhold and apt-remove to include docker-ce-rootless-extras, step-labeled sequence (Step 7.x), enhanced containerd lifecycle handling (force-stop lingering processes, restart, longer readiness waits), and extended final reporting including containerd versions. |
| Test Script Containerd Integration: casaos-fix-docker-api-version/test-script.sh | Added discovery and display of containerd package/binary versions, integrated containerd verification (expected 1.7.28) into upgrade/install/test flows, included docker-ce-rootless-extras in unhold/install sequences, and propagated containerd status outputs across test/status/fix verification steps. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant InstallScript as run.sh
    participant APT as apt/dpkg
    participant System as systemd/processes
    participant Docker as dockerd
    participant Containerd as containerd

    User->>InstallScript: invoke install/downgrade
    InstallScript->>APT: unhold/remove packages (includes docker-ce-rootless-extras)
    InstallScript->>APT: attempt install (loop: max_install_attempts)
    alt install attempt fails
        APT-->>InstallScript: failure + diagnostics
        InstallScript->>InstallScript: record attempt, log diagnostics
        InstallScript->>APT: retry (if attempts remain)
    else install succeeds
        APT-->>InstallScript: installed docker-ce, containerd.io
        InstallScript->>System: restore policy-rc.d (trap ensures restoration)
        InstallScript->>System: restart/reload daemon
        System->>Containerd: stop lingering processes / start containerd
        Containerd-->>InstallScript: readiness (wait & verify version)
        InstallScript->>Docker: start/reload dockerd
        Docker-->>InstallScript: readiness (verify docker version)
        InstallScript->>User: final report (docker + containerd versions)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to: retry loop boundaries and exit conditions in run.sh.
  • Trap handler and guarantee of policy-rc.d restoration on all early exits.
  • Containerd force-stop logic and waiting/verification timings.
  • Consistency of added step labels and user-facing diagnostic messages.
  • Synchronization between run.sh and test-script.sh for package names and version checks.

Poem

🐰 I hopped into scripts with a careful thump,
Tucked retries and traps in a cozy bump,
Containerd twirled, dockerd gave a cheer,
Versions aligned—2025.12.1 is here!
Hopping off now with a carrot and a drum 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: improving Docker downgrade robustness and adding version reporting for Docker and containerd components. |
| Description check | ✅ Passed | The description is directly related to the changeset, detailing specific improvements in installation logic, containerd handling, debugging, and version management. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0b9a1f6 and 85c8264.

📒 Files selected for processing (2)
  • casaos-fix-docker-api-version/run.sh (9 hunks)
  • casaos-fix-docker-api-version/test-script.sh (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
casaos-fix-docker-api-version/run.sh (1)
casaos-fix-docker-api-version/test-script.sh (2)
  • apply_docker_api_override (199-266)
  • remove_docker_api_override (269-299)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test-fix-script
  • GitHub Check: test-fix-script
🔇 Additional comments (10)
casaos-fix-docker-api-version/test-script.sh (1)

475-487: ✓ Containerd version verification logic is well-aligned with PR objectives.

Lines 475–487 correctly validate that the containerd package and binary match the expected version 1.7.28, consistent with the PR's goal to verify containerd versions alongside Docker 28.x. The regex patterns handle both the v-prefixed (v1.7.28) and unprefixed (1.7.28) binary output formats appropriately.
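As a rough illustration of matching a version with or without a leading "v", here is a minimal sketch. The helper name matches_version and the sample strings are hypothetical and are not taken from test-script.sh; the dots are escaped so they only match literal dots.

```bash
#!/usr/bin/env bash
# matches_version OUTPUT EXPECTED
# Succeeds if OUTPUT contains EXPECTED as a whole version token,
# optionally prefixed with "v". Hypothetical helper for illustration.
matches_version() {
  local output=$1 expected=$2
  local escaped
  # Escape dots so "1.7.28" cannot match something like "1x7y28".
  escaped=$(printf '%s' "$expected" | sed 's/\./\\./g')
  # Require a non-version character (or line boundary) on both sides.
  printf '%s\n' "$output" | grep -Eq "(^|[^0-9.])v?${escaped}([^0-9.]|\$)"
}

# Illustrative inputs only; real tool output may be formatted differently.
if matches_version "containerd example v1.7.28 abc123" "1.7.28"; then
  echo "v-prefixed form accepted"
fi
if matches_version "1.7.28-1" "1.7.28"; then
  echo "unprefixed form accepted"
fi
if ! matches_version "1.7.280" "1.7.28"; then
  echo "longer version rejected"
fi
```

The boundary groups are a design choice: they keep 1.7.280 or 11.7.28 from being accepted as 1.7.28, which a bare substring match would allow.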

casaos-fix-docker-api-version/run.sh (9)

1195-1200: ✓ Trap-based policy-rc.d restoration is a robust improvement.

Line 1199 correctly sets trap "allow_service_autostart" RETURN to ensure the policy file is restored even if the downgrade_docker function returns early (via any return statement). This pattern guarantees cleanup and prevents leftover policy-rc.d blocking service auto-start on future invocations.
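The RETURN-trap pattern can be sketched in isolation as follows. This assumes bash; block_service_autostart, the temp file standing in for policy-rc.d, and downgrade_docker_sketch are hypothetical stand-ins, and only the allow_service_autostart name comes from the review text.

```bash
#!/usr/bin/env bash
# A temp file plays the role of the policy-rc.d file (assumption for the sketch).
POLICY_FILE="$(mktemp)"
rm -f "$POLICY_FILE"

block_service_autostart() { echo "exit 101" > "$POLICY_FILE"; }
allow_service_autostart() { rm -f "$POLICY_FILE"; }

downgrade_docker_sketch() {
  block_service_autostart
  # Runs when this function returns, whichever return statement is taken.
  trap "allow_service_autostart" RETURN

  # Simulate an early failure path.
  if [ "${1:-}" = "fail" ]; then
    return 1
  fi
  return 0
}

sketch_status=0
downgrade_docker_sketch fail || sketch_status=$?
# Clear the trap in case it is still set in the calling shell.
trap - RETURN

if [ ! -e "$POLICY_FILE" ]; then
  echo "policy file restored after early return (status ${sketch_status})"
fi
```

A RETURN trap covers return paths only. If a code path ends the script with exit, an EXIT trap would be needed to get the same cleanup.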


1201-1250: ✓ Multi-attempt install with validation logic handles apt warnings gracefully.

The retry loop (lines 1201–1250) is well-designed:

  • Permits up to 2 install attempts before giving up
  • After each failed apt-get, explicitly validates that docker-ce and containerd.io packages are installed with correct versions using dpkg -l checks (lines 1224–1243)
  • Treats packages as successfully installed if both dpkg queries pass, even if apt-get returned an error
  • This handles the real-world scenario where apt may exit non-zero due to non-critical post-install script issues but the packages are installed

This aligns well with the PR objective to "improve installation reliability including verification steps to ensure correct versions are installed even if apt-get returns warnings."
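The retry-then-verify idea above can be sketched with stubs, so it runs without apt or dpkg. fake_install and packages_installed are hypothetical placeholders for the apt-get call and the dpkg -l checks; only max_install_attempts is a name taken from the walkthrough.

```bash
#!/usr/bin/env bash
# Stubs so the sketch is runnable anywhere (assumptions, not the PR's code).
attempts_made=0
fake_install() {             # stands in for: timeout 600 apt-get install ...
  attempts_made=$((attempts_made + 1))
  return 1                   # always reports an error, like a noisy post-install hook
}
packages_installed() {       # stands in for the dpkg -l version checks
  [ "$attempts_made" -ge 2 ] # packages "appear" after the second attempt
}

install_with_retry() {
  local max_install_attempts=2
  local attempt
  for ((attempt = 1; attempt <= max_install_attempts; attempt++)); do
    if fake_install; then
      return 0
    fi
    # The installer reported an error; check the package state instead.
    if packages_installed; then
      echo "attempt ${attempt}: installer failed, but packages verified"
      return 0
    fi
    echo "attempt ${attempt}: install failed, packages missing"
  done
  return 1
}

retry_status=0
install_with_retry || retry_status=$?
echo "final status: ${retry_status}"
```

The key design choice is that the loop's success condition is the verified package state, not the installer's exit code.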


1329-1374: ✓ Aggressive containerd restart with process termination and polling is justified.

Lines 1329–1374 implement a forceful but necessary containerd restart sequence:

  • Lines 1336–1340: Kill any lingering containerd processes with pkill -9 to ensure the old binary is not still running
  • Lines 1343–1346: Stop containerd-shim processes (container support processes)
  • Lines 1355–1363: Use a 15-iteration polling loop (up to 15 seconds) to wait for containerd readiness instead of a fixed sleep
  • This ensures the new containerd binary is loaded and ready before Docker starts

The aggressive termination is appropriate here because systemctl stop may not force-unload a busy binary. The polling loop is an improvement over fixed sleep times.


1290-1291: ✓ docker-ce-rootless-extras properly included in apt-mark hold.

Line 1291 correctly adds docker-ce-rootless-extras to the hold list alongside docker-ce, docker-ce-cli, containerd.io, and other packages. This ensures consistency with the unhold command at line 1113 and the removal command at line 1127.
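One way to keep the hold, unhold, and remove lists from drifting apart is a single shared array. The sketch below is a dry run that only prints commands; the array name DOCKER_PACKAGES and the dry_run wrapper are hypothetical, and the package list contains only the names mentioned in this review (the real scripts list more).

```bash
#!/usr/bin/env bash
# Package names taken from the review text; the real list is longer.
DOCKER_PACKAGES=(docker-ce docker-ce-cli containerd.io docker-ce-rootless-extras)

# Dry-run wrapper: prints the command instead of executing it.
dry_run() { printf '%s\n' "$*"; }

unhold_cmd=$(dry_run apt-mark unhold "${DOCKER_PACKAGES[@]}")
remove_cmd=$(dry_run apt-get remove -y "${DOCKER_PACKAGES[@]}")
hold_cmd=$(dry_run apt-mark hold "${DOCKER_PACKAGES[@]}")

printf '%s\n' "$unhold_cmd" "$remove_cmd" "$hold_cmd"
```

With one array, adding a package such as docker-ce-rootless-extras is a single-line change that reaches every operation.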


2164-2174: ✓ Final version reporting additions provide comprehensive diagnostic output.

Lines 2164–2174 introduce explicit reporting of containerd package version and containerd binary version at the completion of the script:

local containerd_pkg_version
containerd_pkg_version=$(dpkg -l containerd.io 2>/dev/null | awk 'NR>5 {print $3; exit}')
local containerd_bin_version
containerd_bin_version=$(timeout 5 containerd --version 2>/dev/null | head -n1)
...
echo "Containerd Package Version: ${containerd_pkg_version:-unknown}"
echo "Containerd Binary Version: ${containerd_bin_version:-Unable to get containerd binary version}"

This aligns directly with the PR objective to "report and verify containerd package and binary versions" and improves troubleshooting visibility. The fallback messages handle cases where containerd is unavailable.
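To show what the awk filter extracts, the sketch below feeds it a canned sample shaped like dpkg -l output (five header lines, then one package row). The sample text and its version string are illustrative, not captured from a real system.

```bash
#!/usr/bin/env bash
# Illustrative sample in the layout of 'dpkg -l containerd.io'; version is made up.
sample_dpkg_output() {
  cat <<'EOF'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-================================
ii  containerd.io  1.7.28-1     amd64        An open and reliable container runtime
EOF
}

# Skip the five header lines and print the third column of the first package row.
containerd_pkg_version=$(sample_dpkg_output | awk 'NR>5 {print $3; exit}')
echo "Containerd Package Version: ${containerd_pkg_version:-unknown}"

# With no input (package not installed), the :- fallback supplies "unknown".
empty_version=$(printf '' | awk 'NR>5 {print $3; exit}')
echo "Containerd Package Version: ${empty_version:-unknown}"
```

An alternative that avoids counting header lines is dpkg-query -W -f='${Version}' containerd.io, which prints only the version field.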


1313-1318: ✓ Step labeling (Step 7.0, 7.1, 7.2, etc.) improves clarity in systemd/containerd management.

Lines 1313–1318 add explicit step labels ("Step 7.0: Reloading systemd daemon..." and "✓ Systemd daemon reloaded") and include a sleep 2 after daemon-reload to allow systemd to fully process the changes. This defensive pause is intended to avoid races where Docker/containerd services might attempt to start before systemd has fully reloaded units.


1415-1419: ✓ Logging to /tmp/docker-install.log improves troubleshooting.

Lines 1415 and 1418 append Docker status and journalctl logs to /tmp/docker-install.log, which is also written to during the apt-get install phase (line 1212). Centralizing diagnostic output in a single file makes it easier for users to diagnose failures without hunting through journal logs manually.


1207-1212: ✓ Apt-get output capture to /tmp/docker-install.log aids diagnosis.

Line 1212 uses tee /tmp/docker-install.log to simultaneously display apt-get output to the terminal and save it to a log file. This is valuable for post-mortem analysis if installation fails. Combined with lines 1415 and 1418, this creates a comprehensive log of the installation process.
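The difference between tee and tee -a in this logging scheme can be sketched with placeholder output. A temp file stands in for /tmp/docker-install.log, and the echoed strings are placeholders for the real apt-get, status, and journal output.

```bash
#!/usr/bin/env bash
log_file="$(mktemp)"

# First phase: plain tee truncates the file and starts a fresh log.
echo "apt-get output (placeholder)" 2>&1 | tee "$log_file" >/dev/null

# Later phases: tee -a appends, so earlier output is preserved.
echo "docker status (placeholder)" 2>&1 | tee -a "$log_file" >/dev/null
echo "journal excerpt (placeholder)" 2>&1 | tee -a "$log_file" >/dev/null

# Count lines to confirm all three phases are in one file.
line_count=$(grep -c '' "$log_file")
echo "log has ${line_count} lines"
```

Because plain tee truncates, it belongs only at the point where a new log should begin; every later writer needs -a.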


1369-1374: The review comment is incorrect: the code already separates declaration from assignment and does not violate SC2155.

The code at lines 1369-1374 declares containerd_version on a separate line before assignment, which already follows best practices and does not trigger ShellCheck SC2155. SC2155 specifically warns about local var=$(command) on the same line; this code avoids that pattern entirely.

The suggested fix is also unnecessary. With a separate declaration, var=$(command) || var="unknown" does run the fallback when the command fails, because a plain assignment returns the exit status of its command substitution; it differs from the current pattern, $(command || echo "unknown"), mainly in discarding any partial output the failing command printed. The current pattern already provides the fallback inside the command substitution, which is the intended behavior. No changes are needed.

Likely an incorrect or invalid review comment.
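For reference, the behavior behind SC2155 can be demonstrated in a few lines of bash. The function names are hypothetical and false stands in for any failing command.

```bash
#!/usr/bin/env bash
combined() {
  # SC2155 pattern: $? is the status of 'local' itself, which is 0.
  local v=$(false)
  echo $?
}

separate() {
  local v
  # Plain assignment: $? is the status of the command substitution.
  v=$(false)
  echo $?
}

fallback() {
  local v
  # The fallback fires because the assignment reports the failure.
  v=$(false) || v="unknown"
  echo "$v"
}

combined_status=$(combined)
separate_status=$(separate)
fallback_value=$(fallback)
echo "combined=${combined_status} separate=${separate_status} fallback=${fallback_value}"
```

This is why separating declaration from assignment matters only when the exit status is actually inspected afterwards.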



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
casaos-fix-docker-api-version/run.sh (1)

1329-1350: Containerd force-kill and restart logic is thorough but has a subtle race condition.

Lines 1336–1340 use pkill -9 containerd to forcefully terminate lingering processes before restarting. However, there's a potential race:

  1. Line 1332: systemctl stop containerd is issued
  2. Line 1336-1340: Check for lingering processes and force-kill them
  3. Line 1350: systemctl start containerd is issued

If systemctl is still in the process of stopping the service (Step 1), the force-kill (Step 2) may interfere. A safer approach:

# After systemctl stop, wait a bit longer and ensure processes are dead
for i in {1..5}; do
  if ! pgrep -x containerd >/dev/null 2>&1; then
    break
  fi
  sleep 1
done

# Then force-kill if still running
if pgrep -x containerd >/dev/null 2>&1; then
  pkill -9 containerd
  sleep 2
fi

Currently, the code checks for processes immediately after the 2-second sleep, which may be adequate. Adding a small loop between the sleep and force-kill would be more robust:

  sleep 2
+
+  # Wait for process to exit gracefully
+  for i in {1..3}; do
+    if ! pgrep -x containerd >/dev/null 2>&1; then
+      break
+    fi
+    sleep 1
+  done
  
  # Kill any lingering containerd processes to ensure old binary is not running
  if pgrep -x containerd >/dev/null 2>&1; then

This is a minor refinement; the current code works, and the extra wait only adds a safety margin.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between bf8775e and 0b9a1f6.

📒 Files selected for processing (2)
  • casaos-fix-docker-api-version/run.sh (9 hunks)
  • casaos-fix-docker-api-version/test-script.sh (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
casaos-fix-docker-api-version/run.sh (1)
casaos-fix-docker-api-version/test-script.sh (2)
  • apply_docker_api_override (197-264)
  • remove_docker_api_override (267-297)
🪛 Shellcheck (0.11.0)
casaos-fix-docker-api-version/test-script.sh

[warning] 100-100: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 101-101: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 428-428: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 429-429: Declare and assign separately to avoid masking return values.

(SC2155)

casaos-fix-docker-api-version/run.sh

[warning] 1371-1371: Declare and assign separately to avoid masking return values.

(SC2155)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test-fix-script
  • GitHub Check: test-fix-script
🔇 Additional comments (12)
casaos-fix-docker-api-version/test-script.sh (2)

310-310: Good: docker-ce-rootless-extras now consistently included in all package operations.

The test script now mirrors the main script's expanded package management. This ensures all Docker-related packages are handled uniformly across upgrade, installation, and removal operations. The consistency is important for reproducible test scenarios.

Also applies to: 344-345, 573-573, 618-619


472-483: Containerd version checks are thorough but rely on exact version matching.

Lines 472–483 validate that containerd package and binary versions match the expected 1.7.28. The validation uses grep -q to check for version strings, which is reasonable, but be aware:

  • If the awk command on line 101 or 429 returns an empty string (e.g., if containerd --version fails or is not installed), the fallback to ${containerd_bin:-unknown} provides safe handling.
  • The checks use an anchored prefix match (grep -q "^1.7.28"), which is appropriate for this use case, although the unescaped dots match any character; "^1\.7\.28" would be stricter.
casaos-fix-docker-api-version/run.sh (10)

1198-1200: Excellent: Trap mechanism ensures service auto-start policy is always restored.

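Line 1199 uses trap "allow_service_autostart" RETURN to guarantee cleanup even if the function returns early due to errors. A RETURN trap runs when the function returns, so any path that ends the script with exit would need an EXIT trap instead. This is a robust pattern that prevents leaving the system in an inconsistent state (policy-rc.d preventing service restarts).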


1205-1250: Retry logic with per-attempt package verification is well-designed.

The new installation loop (lines 1205–1250) handles non-critical apt-get failures gracefully:

  • Attempts installation up to 2 times (configurable via max_install_attempts)
  • After each attempt, explicitly verifies if the required packages are installed with correct versions (lines 1220–1243)
  • Treats successful package installation as success even if apt-get returns an error (common when post-install scripts have non-fatal warnings)
  • Provides clear messaging about what was attempted and why

This is a significant robustness improvement.


1252-1275: Potential issue: PIPESTATUS check may not reflect apt-get exit code if tee was used.

At line 1254, the code checks ${PIPESTATUS[0]} to detect a timeout (exit code 124). For the pipeline at line 1212, ${PIPESTATUS[0]} holds the exit status of the timeout/apt-get stage and ${PIPESTATUS[1]} holds tee's, but bash overwrites the array after every subsequent command. By line 1254, after the package checks at lines 1220–1243 have run, it no longer describes the apt-get pipeline. In addition, without set -o pipefail the if condition at line 1212 sees tee's exit status (usually 0 if the file write succeeds), not apt-get's.

To reliably detect timeouts, capture the status immediately after the pipeline:

- if timeout 600 $SUDO apt-get install ... 2>&1 | tee /tmp/docker-install.log; then
+ timeout 600 $SUDO apt-get install ... 2>&1 | tee /tmp/docker-install.log
+ # Capture at once; any later command overwrites PIPESTATUS
+ apt_exit_code=${PIPESTATUS[0]}
+ if [ "$apt_exit_code" -eq 0 ]; then
+   ...
+ fi
+ if [ "$apt_exit_code" -eq 124 ]; then

Or, with pipefail enabled around the pipeline so that $? reflects the failing stage:

  local apt_exit_code=0
  # Assumes pipefail is otherwise off in this script
  set -o pipefail
  timeout 600 $SUDO apt-get install ... 2>&1 | tee /tmp/docker-install.log || apt_exit_code=$?
  set +o pipefail

However, since the code checks apt_success via package verification (lines 1220–1243) before checking PIPESTATUS, the practical impact is low: the verification step catches the case where packages were not installed. The timeout check at line 1254 only runs if neither package was installed, which is sound defensive coding, but it may not see exit code 124 as intended.

Consider capturing the exit code explicitly or relying solely on the package verification checks (lines 1220–1243), which are more robust.
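A small, self-contained sketch of capturing a pipeline stage's exit status in bash follows. Here sh -c '...; exit 124' is a stand-in for a timed-out install command, and a temp file stands in for the log; neither is taken from run.sh.

```bash
#!/usr/bin/env bash
log_file="$(mktemp)"

# The left-hand command prints something and exits 124, like a timed-out 'timeout'.
sh -c 'echo "installing..."; exit 124' 2>&1 | tee "$log_file" >/dev/null
# Copy the whole array at once, before any other command overwrites it.
statuses=("${PIPESTATUS[@]}")

first_stage_status=${statuses[0]}   # exit status of the left-hand command
tee_status=${statuses[1]}           # exit status of tee

if [ "$first_stage_status" -eq 124 ]; then
  echo "install timed out (tee exited ${tee_status})"
fi
rm -f "$log_file"
```

Copying the array in a single assignment is the important step: even a [ ... ] test between the pipeline and the read would replace its contents.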


1355-1373: Containerd polling loop with 15 retries is robust but wait time is short (15 seconds total).

Lines 1355–1373 poll for containerd readiness with a 1-second interval for up to 15 attempts (15 seconds total). This is a good pattern—active polling is preferable to fixed sleeps. However:

  • 15 seconds may be tight on slow systems or in containers with resource contention
  • The loop uses systemctl is-active --quiet as the only readiness check, which is sufficient
  • If containerd fails to start, the warning at line 1366 allows the script to continue, which is acceptable since Docker will also be started and will detect containerd issues

No changes needed, but be aware that the timeout is on the short side. If you see flaky failures in slow environments, increase the loop limit or interval.
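The polling pattern can be factored into a small helper with a configurable limit and interval. wait_until and fake_is_active are hypothetical names; in the real script the check would be something like systemctl is-active --quiet containerd with a 1-second interval.

```bash
#!/usr/bin/env bash
# wait_until CHECK_CMD MAX_TRIES INTERVAL
# Polls CHECK_CMD until it succeeds or MAX_TRIES is reached.
wait_until() {
  local check_cmd=$1 max_tries=$2 interval=$3
  local i
  for ((i = 1; i <= max_tries; i++)); do
    if "$check_cmd"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Stub readiness check: reports "active" on the third poll.
polls=0
fake_is_active() {
  polls=$((polls + 1))
  [ "$polls" -ge 3 ]
}

wait_status=0
# Interval 0 keeps the sketch fast; a real wait would use 1 or more seconds.
wait_until fake_is_active 15 0 || wait_status=$?
echo "ready after ${polls} polls (status ${wait_status})"
```

Passing the limit and interval as arguments makes it easy to raise the timeout for slow environments without touching the loop.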


2163-2173: Final containerd version reporting is valuable for diagnostics, but has SC2155 issues.

Lines 2163–2173 add final reporting of containerd package and binary versions at script completion:

local containerd_pkg_version
containerd_pkg_version=$(dpkg -l containerd.io 2>/dev/null | awk 'NR>5 {print $3; exit}')
local containerd_bin_version
containerd_bin_version=$(timeout 5 containerd --version 2>/dev/null | head -n1)

Good: The declarations are separated from assignments (fixing the SC2155 issue correctly). The reporting provides clear diagnostics.

However, the pattern is not yet applied everywhere: ShellCheck still reports SC2155 at run.sh line 1371 and at test-script.sh lines 100, 101, 428, and 429. Separating declaration from assignment at those sites as well would keep the scripts consistent.


1113-1113: apt-mark unhold command expanded to include docker-ce-rootless-extras.

Line 1113 now includes docker-ce-rootless-extras in the apt-mark unhold command. This is consistent with the hold/unhold pattern throughout the script and ensures all Docker-related packages are managed uniformly.


1127-1127: apt-get remove now includes docker-ce-rootless-extras for complete cleanup.

Line 1127 removes all Docker packages including docker-ce-rootless-extras. This is important for a clean downgrade—leaving behind the rootless-extras package could cause confusion or conflicts.

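Potential edge case: if docker-ce-rootless-extras is not installed on older systems, apt-get remove still succeeds as long as the package name is known to the configured repositories; it only fails with "Unable to locate package" when apt does not know the name at all. No issue expected here.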


1291-1291: apt-mark hold now includes docker-ce-rootless-extras.

Line 1291 holds all Docker packages including docker-ce-rootless-extras to prevent automatic upgrades. Consistent with the unhold operations.


62-62: Version bumps are consistent across all banner/usage locations.

All four locations (lines 62, 1714, 1722, 1730) correctly update the version string from 2025.12.0 to 2025.12.1. Consistency check passed.

Also applies to: 1714-1714, 1722-1722, 1730-1730


1413-1417: Docker status and journal logs are appended to /tmp/docker-install.log for troubleshooting.

Lines 1413–1417 append Docker status and journal logs to /tmp/docker-install.log (using tee -a), which complements the earlier apt-get output logging at line 1212. This is excellent for diagnostics when installation fails.

Refactored variable assignments in run.sh and test-script.sh to use separate declaration and assignment for better readability and shell compatibility. Updated the containerd version check in test-script.sh to accept both '1.7.28' and 'v1.7.28' formats.
@dragonfire1119 dragonfire1119 merged commit d241a43 into master Dec 6, 2025
3 checks passed
@dragonfire1119 dragonfire1119 deleted the fix-install-hangs branch December 6, 2025 01:37