Skip to content

feat(kubernetes): add agent execution mode, executor-agent, sidecar refactor, GKE Sandbox, image & deps upgrades#33

Open
gafda wants to merge 9 commits intoaron-muon:mainfrom
nosportugal:feat-taints
Open

feat(kubernetes): add agent execution mode, executor-agent, sidecar refactor, GKE Sandbox, image & deps upgrades#33
gafda wants to merge 9 commits intoaron-muon:mainfrom
nosportugal:feat-taints

Conversation

@gafda
Copy link

@gafda gafda commented Feb 18, 2026

This pull request updates the development and runtime environments for multiple languages, modernizes package dependencies, and enhances configuration flexibility, particularly for Redis and Kubernetes. The changes include updates to Dockerfiles for newer base images and language versions, significant dependency version bumps and additions, and expanded sample configuration for advanced deployment scenarios.

Key changes:

1. Dependency and Environment Updates

  • Dockerfiles for C/C++, D, Fortran, and R: Switched to trixie-debian13-dev base images to ensure compilers and development libraries are available at runtime, improving compatibility and security. (docker/c-cpp.Dockerfile, docker/d.Dockerfile, docker/fortran.Dockerfile, docker/r.Dockerfile) [1] [2] [3] [4] [5]
  • Go Environment: Upgraded Go from 1.25 to 1.26, updated the Go Dockerfile stages, and refreshed Go module dependencies with new and reorganized packages. (docker/go.Dockerfile, docker/requirements/go.mod) [1] [2] [3] [4]
  • PHP: Bumped PHP version from 8.4.17 to 8.5.3 for the latest features and security patches. (docker/php.Dockerfile)
  • Python: Modernized and reorganized Python package requirements across core, analysis, documents, utilities, and visualization, with many version bumps and new packages for better functionality and compatibility. (docker/requirements/python-core.txt, docker/requirements/python-analysis.txt, docker/requirements/python-documents.txt, docker/requirements/python-utilities.txt, docker/requirements/python-visualization.txt) [1] [2] [3] [4] [5]
  • Node.js: Expanded and reorganized the list of global packages for a more comprehensive JavaScript environment. (docker/requirements/nodejs.txt)
  • Rust: Updated Rust version from 1.92 to 1.93 and refreshed dependencies in Cargo.toml for new features and bug fixes. (docker/rust.Dockerfile, docker/requirements/rust-Cargo.toml) [1] [2] [3]

2. Configuration Enhancements

  • Redis Configuration: The .env.example file now documents advanced Redis deployment options, including cluster and sentinel modes, TLS/SSL settings, and key prefixing, making it easier to configure Redis in complex environments. [1] [2]
  • Kubernetes Execution: Added detailed Kubernetes execution configuration options to .env.example, including support for agent and nsenter modes, sidecar image selection, image pull policies, and GKE Sandbox compatibility notes.
Mode Description
agent (default) Executor agent runs inside the main container. No nsenter, no extra capabilities. Compatible with gVisor/GKE Sandbox.
nsenter (legacy) Sidecar uses nsenter to enter the main container namespace. Requires SYS_PTRACE, SYS_ADMIN, SYS_CHROOT capabilities.

3. Docker Image Naming and CI/CD

  • Sidecar Image Naming: Updated GitHub Actions workflow to use the new sidecar image name kubecoderun-sidecar-agent instead of kubecoderun-sidecar, aligning with the agent-based execution model. (.github/workflows/docker-publish.yml) [1] [2]

4. Python Runtime Optimization

  • Python Dockerfile: Cleaned up and reorganized installed runtime libraries, focusing on core utilities, image processing, XML/HTML processing, cryptography, and font support. Removed unnecessary tools from the final image for a leaner runtime. (docker/python.Dockerfile) [1] [2]

These changes collectively modernize the development environments, improve documentation and configuration for advanced deployments, and ensure compatibility with newer language and library versions.

- Add configuration options for GKE Sandbox in KubernetesConfig
- Update pod/job manifest creation to support:
  * runtimeClassName for gVisor runtime
  * sandbox.gke.io/runtime annotation
  * nodeSelector for sandbox-enabled nodes
  * tolerations for GKE sandbox and custom taints
- Add GKE Sandbox settings to Helm values.yaml and configmap
- Update KubernetesManager, PodSpec, and PoolConfig models
- Parse JSON configuration for node selectors and tolerations
- Enable easy activation/deactivation via configuration flags

GKE Sandbox provides additional kernel isolation using gVisor for
untrusted workloads. When enabled, execution pods will:
- Run with gVisor runtime (runtimeClassName: gvisor)
- Be scheduled on sandbox-enabled nodes
- Tolerate GKE sandbox taints automatically
- Support custom node pool taints for dedicated execution nodes

Configuration example in values.yaml:
  execution:
    gkeSandbox:
      enabled: true
      runtimeClassName: gvisor
      nodeSelector: {}
      customTolerations:
        - key: pool
          operator: Equal
          value: sandbox
          effect: NoSchedule
* Introduce  to support agent and nsenter modes.
* Implement agent mode with a lightweight executor agent running in the main container.
* Add  for configuring the executor agent's HTTP server port.
* Enhance security by dropping all capabilities in agent mode and ensuring no privilege escalation.
* Support image pull secrets for private registries via .
* Update documentation to reflect new execution modes and security configurations.
* Modify Helm chart to include image pull secrets configuration.
* Change default sidecar image from  to .
* Update environment variable names for executor port from  to .
* Add documentation for building sidecar images and configuring Helm charts for execution modes.
* Introduce GKE Sandbox support with configuration details and limitations.
* Update related code and tests to reflect changes in image names and environment variables.
- Updated Dockerfiles for C/C++, D, Fortran, R, and Sidecar to use the trixie-dev variant.
- Ensures compilers and development libraries are available at runtime.
* Updated base images to trixie-debian13-dev for C/C++, D, Fortran, R, and Rust.
* Upgraded PHP version to 8.5.3.
* Enhanced Node.js and Python requirements with new packages and versions.
* Improved Rust dependencies for better compatibility and performance.
* Updated Go version in executor-agent to 1.26.
* Introduced K8S_IMAGE_PULL_POLICY and K8S_IMAGE_PULL_SECRETS in configuration.
* Updated relevant classes and methods to handle new fields.
* Enhanced validation for execution mode and sidecar image consistency.
* Added unit tests to ensure correct handling of image pull settings.
- Add REDIS_MODE (standalone/cluster/sentinel) to RedisConfig
- Add TLS/SSL configuration (REDIS_TLS_ENABLED, certs, CA, insecure)
- Add Redis Cluster support (REDIS_CLUSTER_NODES) via RedisCluster client
- Add Redis Sentinel support (REDIS_SENTINEL_NODES/MASTER/PASSWORD)
- Update RedisPool to support all three modes with TLS
- Migrate FileService to use shared RedisPool instead of standalone client
- Update Settings class with all new Redis fields
- Update .env.example with new Redis configuration options
- Update docs/CONFIGURATION.md with Cluster, Sentinel, and TLS sections
- Update docs/SECURITY.md with TLS configuration reference
- Update Helm values.yaml, configmap.yaml, and _helpers.tpl
- Default remains standalone Redis for full backward compatibility

feat: add optional Redis key prefix support (REDIS_KEY_PREFIX)

- Add key_prefix field to RedisConfig and Settings
- Add make_key() helper to RedisPool for centralized key prefixing
- Update all services to use prefixed keys: session, state, file,
  health, api_key_manager, detailed_metrics, metrics
- Update .env.example, docs, and Helm chart with new setting
@gafda gafda marked this pull request as ready for review February 23, 2026 10:16
@gafda gafda requested a review from aron-muon as a code owner February 23, 2026 10:16
Copilot AI review requested due to automatic review settings February 23, 2026 10:16
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces a significant architectural enhancement by adding an agent-based execution mode as the default, alongside comprehensive Redis deployment mode support and extensive dependency upgrades. The changes modernize the security model by eliminating the need for Linux capabilities in the default execution path, enable GKE Sandbox (gVisor) compatibility, and provide flexibility for Redis clustering and high-availability deployments.

Changes:

  • Introduced agent execution mode (default) that eliminates nsenter, Linux capabilities, and privilege escalation requirements, with nsenter mode retained for backward compatibility
  • Added comprehensive Redis deployment modes (standalone, cluster, sentinel) with TLS/SSL support and optional key prefixing for multi-tenant deployments
  • Implemented GKE Sandbox (gVisor) support with runtime class, node selectors, and tolerations for kernel-level isolation
  • Upgraded language runtimes and dependencies: Go 1.25→1.26, PHP 8.4.17→8.5.3, Rust 1.92→1.93, Python packages modernized
  • Refactored sidecar to multi-target Docker build producing both agent and nsenter variants from a single Dockerfile

Reviewed changes

Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/services/kubernetes/client.py Enhanced pod manifest creation with agent/nsenter mode support, GKE Sandbox configuration, and image pull secrets
docker/sidecar/main.py Added execute_via_agent() function and mode routing logic for dual execution mode support
docker/sidecar/executor-agent/main.go New Go HTTP server for agent mode execution without nsenter or capabilities
docker/sidecar/Dockerfile Refactored to multi-target build: sidecar-agent (default) and sidecar-nsenter (legacy)
src/core/pool.py Complete rewrite supporting Redis standalone/cluster/sentinel modes with TLS and key prefixing
src/config/redis.py New configuration model for Redis deployment modes, TLS, and advanced features
src/services/*.py Updated all Redis-using services (state, session, metrics, api_key_manager) to use key prefixing
src/main.py Added validation for execution mode/sidecar image consistency and image pull secrets parsing
tests/unit/test_kubernetes_client.py Comprehensive tests for agent/nsenter modes and GKE Sandbox configuration
scripts/build-images.sh Enhanced to support Docker multi-target builds with --target flag
.github/workflows/docker-publish.yml Updated sidecar image name to kubecoderun-sidecar-agent
helm-deployments/kubecoderun/values.yaml Added Redis mode configuration, GKE Sandbox settings, and execution mode options
docs/SECURITY.md, docs/CONFIGURATION.md, docs/ARCHITECTURE.md Extensive documentation updates for new execution modes and Redis features
Comments suppressed due to low confidence (1)

.github/workflows/docker-publish.yml:153

  • The CI/CD workflow only builds the kubecoderun-sidecar-agent image but not the kubecoderun-sidecar-nsenter variant. While the build script supports both targets and the Dockerfile defines both, the nsenter sidecar won't be available in the registry for users who need legacy nsenter mode. Consider adding a separate job to build the nsenter variant, or document that users must build it locally if needed.
  sidecar:
    needs: changes
    if: |
      needs.changes.outputs.is_cross_repo_pr != 'true' &&
      (needs.changes.outputs.sidecar == 'true' || needs.changes.outputs.force_all == 'true')
    uses: ./.github/workflows/docker-build-reusable.yml
    secrets: inherit
    with:
      image_name: kubecoderun-sidecar-agent
      dockerfile: docker/sidecar/Dockerfile
      context: docker/sidecar
      image_tag: ${{ needs.changes.outputs.image_tag }}
      is_release: ${{ needs.changes.outputs.is_release == 'true' }}
      version: ${{ needs.changes.outputs.version }}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

* Validate GKE Sandbox compatibility with nsenter execution mode
* Log warnings when incompatible execution modes are used
* Update Dockerfile capabilities for nsenter
* Enhance RedisPool type hints for better type checking
* Add unit tests for GKE Sandbox and nsenter mode interactions
…me by default

Three issues prevented connecting to GCP Memorystore Redis with TLS:

1. _validate_redis_connection() used redis.from_url() without passing
   ssl_ca_certs / ssl_cert_reqs, so certificate verification fell back
   to the system CA bundle which doesn't include managed-service CAs.

2. get_tls_kwargs() set ssl_check_hostname=True when tls_insecure=False.
   Managed Redis services (GCP Memorystore, AWS ElastiCache) and Redis
   Cluster node discovery return IPs that don't match certificate CN/SAN,
   causing CERTIFICATE_VERIFY_FAILED.  Hostname checking is now off by
   default (matching redis-py) and controlled by the new
   REDIS_TLS_CHECK_HOSTNAME setting.

3. REDIS_HOST could contain a URL scheme (rediss://host) which was
   passed through to ClusterNode or URL construction.  A field
   validator now strips accidental schemes from the host value.
Copilot AI review requested due to automatic review settings February 26, 2026 18:13
@gafda
Copy link
Author

gafda commented Feb 26, 2026

@aron-muon

@aron-muon
Copy link
Owner

@aron-muon

Hello, let me take a look here. Nice to see that Nos engineers are using Kubecoderun - I am a Nos customer myself.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 50 out of 51 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +466 to +470
pod_annotations = annotations or {}
if gke_sandbox_enabled:
# GKE Sandbox annotation for gVisor runtime
pod_annotations["sandbox.gke.io/runtime"] = "gvisor"

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pod_annotations is assigned as annotations or {} and then mutated when GKE Sandbox is enabled. If the caller passes a dict, this mutates the caller-owned object and can leak the sandbox annotation to other manifests. Consider copying the dict (e.g., shallow copy) before adding/modifying keys.

Copilot uses AI. Check for mistakes.
Comment on lines +428 to +437
if custom_tolerations:
# Add custom node pool taints (e.g., pool=sandbox)
for tol in custom_tolerations:
tolerations.append(
client.V1Toleration(
key=tol.get("key"),
operator=tol.get("operator", "Equal"),
value=tol.get("value"),
effect=tol.get("effect", "NoSchedule"),
)
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom_tolerations entries are converted into V1Toleration objects using tol.get("key"). If a dict is missing key (or it’s empty), the manifest will contain an invalid toleration and the pod/job creation will fail at runtime. Consider validating required fields (at least key) and raising/logging a clear error before building the manifest.

Copilot uses AI. Check for mistakes.
Comment on lines +664 to +677
# Parse JSON strings for node selector and tolerations
sandbox_node_selector = None
if self.gke_sandbox_node_selector:
try:
sandbox_node_selector = json.loads(self.gke_sandbox_node_selector)
except json.JSONDecodeError:
pass

custom_tolerations = None
if self.gke_sandbox_custom_tolerations:
try:
custom_tolerations = json.loads(self.gke_sandbox_custom_tolerations)
except json.JSONDecodeError:
pass
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid JSON in GKE_SANDBOX_NODE_SELECTOR / GKE_SANDBOX_CUSTOM_TOLERATIONS is silently ignored (JSONDecodeError is swallowed). This makes misconfiguration hard to detect and can lead to unexpected scheduling behavior. Consider logging a warning or raising a validation error when these JSON values cannot be parsed.

Copilot uses AI. Check for mistakes.
Comment on lines +340 to +377
# GKE Sandbox (gVisor) Configuration
# Provides additional kernel isolation for untrusted workloads using gVisor
# See: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods
gkeSandbox:
# Enable GKE Sandbox for execution pods
enabled: true

# Runtime class name (default: gvisor for GKE)
runtimeClassName: "gvisor"

# Node selector for sandbox-enabled nodes
# GKE automatically adds sandbox.gke.io/runtime=gvisor to sandbox nodes
# Add additional selectors here if needed (e.g., for specific node pools)
nodeSelector:
sandbox.gke.io/runtime: gvisor
# Example:
# cloud.google.com/gke-nodepool: sandbox-pool

# Custom tolerations for node pool taints
# GKE automatically adds toleration for sandbox.gke.io/runtime=gvisor
# Use this for additional custom taints (e.g., dedicated sandbox node pools)
customTolerations:
# Example:
# - key: pool
# operator: Equal
# value: sandbox
# effect: NoSchedule
# Custom node pool taint
- key: pool
operator: Equal
value: sandbox
effect: NoSchedule
# GKE Sandbox standard taint
- key: sandbox.gke.io/runtime
operator: Equal
value: gvisor
effect: NoSchedule

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GKE Sandbox is enabled by default here (enabled: true) and nodeSelector/customTolerations are populated with sandbox-specific values. This will make a default Helm install schedule execution pods only onto gVisor nodes and can leave pods Pending on clusters without a sandbox node pool. Consider defaulting gkeSandbox.enabled to false and leaving nodeSelector/customTolerations empty unless explicitly configured.

Copilot uses AI. Check for mistakes.
Comment on lines +60 to +66
# Image registry configuration
# Format: {image_registry}-{language}:{image_tag}
# e.g., aronmuon/kubecoderun-python:latest
image_registry: str = "aronmuon/kubecoderun"
image_tag: str = "latest"
image_pull_policy: str = "IfNotPresent"

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KubernetesConfig.image_pull_policy defaults to "IfNotPresent", but Settings (K8S_IMAGE_PULL_POLICY) and the docs describe a default of "Always". Having different defaults depending on which config path is used is error-prone. Consider aligning these defaults (either change this dataclass default or update Settings/docs to match).

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +103
if err != nil || !strings.HasPrefix(absDir, "/mnt/data") {
writeJSON(w, http.StatusBadRequest, ExecuteResponse{
ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
})
return
}
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The working directory check uses strings.HasPrefix(absDir, "/mnt/data"), which also allows paths like "/mnt/data2". This weakens the sandbox boundary. Consider validating with a path-aware check (e.g., allow exactly "/mnt/data" or paths under "/mnt/data/" using filepath.Rel/Clean) before setting cmd.Dir.

Suggested change
if err != nil || !strings.HasPrefix(absDir, "/mnt/data") {
writeJSON(w, http.StatusBadRequest, ExecuteResponse{
ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
})
return
}
if err != nil {
writeJSON(w, http.StatusBadRequest, ExecuteResponse{
ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
})
return
}
absDir = filepath.Clean(absDir)
sandboxRoot := "/mnt/data"
if absDir != sandboxRoot && !strings.HasPrefix(absDir, sandboxRoot+string(os.PathSeparator)) {
writeJSON(w, http.StatusBadRequest, ExecuteResponse{
ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
})
return
}

Copilot uses AI. Check for mistakes.
Comment on lines +119 to +122
env := os.Environ()
for k, v := range req.Env {
env = append(env, fmt.Sprintf("%s=%s", k, v))
}
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Env overrides are merged by appending new KEY=VALUE pairs to os.Environ(). Many runtimes resolve duplicate environment keys by taking the first occurrence, so the intended overrides (e.g., network isolation GOPROXY=off) may not actually apply. Consider replacing existing keys when merging overrides instead of appending duplicates.

Suggested change
env := os.Environ()
for k, v := range req.Env {
env = append(env, fmt.Sprintf("%s=%s", k, v))
}
env := os.Environ()
// First, replace existing keys in the inherited environment with overrides.
for i, e := range env {
key, _, ok := strings.Cut(e, "=")
if !ok {
continue
}
if v, found := req.Env[key]; found {
env[i] = fmt.Sprintf("%s=%s", key, v)
delete(req.Env, key)
}
}
// Then append any remaining override keys that did not exist in the base environment.
for k, v := range req.Env {
env = append(env, fmt.Sprintf("%s=%s", k, v))
}

Copilot uses AI. Check for mistakes.
Comment on lines +211 to +214
| `REDIS_TLS_CHECK_HOSTNAME` | `false` | Verify server hostname against certificate CN/SAN |

> When `REDIS_TLS_ENABLED=true` the generated URL uses the `rediss://` scheme automatically.

Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REDIS_TLS_CHECK_HOSTNAME is documented and defaulted to false, meaning TLS connections to Redis will not verify the server hostname even when REDIS_TLS_ENABLED=true. Without hostname verification, an attacker with network access and any valid CA-signed certificate can perform a man-in-the-middle attack and intercept or modify Redis traffic. Consider enabling hostname verification by default (setting REDIS_TLS_CHECK_HOSTNAME=true for production) and requiring explicit opt-out only for environments that truly cannot support it, with strong warnings in both code and docs.

Suggested change
| `REDIS_TLS_CHECK_HOSTNAME` | `false` | Verify server hostname against certificate CN/SAN |
> When `REDIS_TLS_ENABLED=true` the generated URL uses the `rediss://` scheme automatically.
| `REDIS_TLS_CHECK_HOSTNAME` | `true` | Verify server hostname against certificate CN/SAN |
> When `REDIS_TLS_ENABLED=true` the generated URL uses the `rediss://` scheme automatically.
> For production, you should keep `REDIS_TLS_CHECK_HOSTNAME=true`; only set it to `false` in exceptional legacy environments that cannot support proper certificates, as this disables protection against man-in-the-middle attacks.

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +70
# Hostname verification is off by default because managed Redis services
# and Redis Cluster mode expose node IPs that don't match cert CN/SAN.
# REDIS_TLS_CHECK_HOSTNAME=false
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example Redis TLS configuration keeps REDIS_TLS_CHECK_HOSTNAME disabled by default (false), matching the code’s default of skipping hostname verification for TLS connections. This weakens TLS server authentication and allows a network attacker with any valid CA-signed certificate to MITM Redis traffic and access or alter sensitive data. For production, consider enabling hostname verification by default (setting REDIS_TLS_CHECK_HOSTNAME=true or omitting it when the code default is secure) and clearly marking any hostname-skipping mode as a short-term, debug-only override.

Suggested change
# Hostname verification is off by default because managed Redis services
# and Redis Cluster mode expose node IPs that don't match cert CN/SAN.
# REDIS_TLS_CHECK_HOSTNAME=false
# Hostname verification is enabled by default for strong TLS security.
# Set REDIS_TLS_CHECK_HOSTNAME=false only as a short-term, debug-only override
# when dealing with misconfigured certificates, and never in production.
REDIS_TLS_CHECK_HOSTNAME=true

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants