feat(kubernetes): add agent execution mode, executor-agent, sidecar refactor, GKE Sandbox, image & deps upgrades #33
Conversation
- Add configuration options for GKE Sandbox in KubernetesConfig
- Update pod/job manifest creation to support:
* runtimeClassName for gVisor runtime
* sandbox.gke.io/runtime annotation
* nodeSelector for sandbox-enabled nodes
* tolerations for GKE sandbox and custom taints
- Add GKE Sandbox settings to Helm values.yaml and configmap
- Update KubernetesManager, PodSpec, and PoolConfig models
- Parse JSON configuration for node selectors and tolerations
- Enable easy activation/deactivation via configuration flags
GKE Sandbox provides additional kernel isolation using gVisor for
untrusted workloads. When enabled, execution pods will:
- Run with gVisor runtime (runtimeClassName: gvisor)
- Be scheduled on sandbox-enabled nodes
- Tolerate GKE sandbox taints automatically
- Support custom node pool taints for dedicated execution nodes
Configuration example in values.yaml:

```yaml
execution:
  gkeSandbox:
    enabled: true
    runtimeClassName: gvisor
    nodeSelector: {}
    customTolerations:
      - key: pool
        operator: Equal
        value: sandbox
        effect: NoSchedule
```
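To make the mapping from configuration to manifest concrete, here is a minimal Python sketch of how these settings could translate into pod-spec fragments. The helper name `build_sandbox_fields` is hypothetical (not from the PR); the field names follow the standard Kubernetes pod schema.

```python
def build_sandbox_fields(enabled, runtime_class="gvisor",
                         node_selector=None, tolerations=None):
    """Sketch: return the (annotations, spec-fragment) GKE Sandbox mode adds."""
    if not enabled:
        # Sandbox disabled: contribute nothing to the manifest.
        return {}, {}
    annotations = {"sandbox.gke.io/runtime": runtime_class}
    spec = {
        "runtimeClassName": runtime_class,
        "nodeSelector": node_selector or {},
        # Custom tolerations first, then the standard GKE Sandbox taint.
        "tolerations": (tolerations or []) + [{
            "key": "sandbox.gke.io/runtime",
            "operator": "Equal",
            "value": runtime_class,
            "effect": "NoSchedule",
        }],
    }
    return annotations, spec

annotations, spec = build_sandbox_fields(True)
```

With `enabled` false the helper returns empty fragments, matching the "easy activation/deactivation via configuration flags" goal above.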
- Introduce to support agent and nsenter modes.
- Implement agent mode with a lightweight executor agent running in the main container.
- Add for configuring the executor agent's HTTP server port.
- Enhance security by dropping all capabilities in agent mode and ensuring no privilege escalation.
- Support image pull secrets for private registries via .
- Update documentation to reflect new execution modes and security configurations.
- Modify Helm chart to include image pull secrets configuration.

- Change default sidecar image from to .
- Update environment variable names for executor port from to .
- Add documentation for building sidecar images and configuring Helm charts for execution modes.
- Introduce GKE Sandbox support with configuration details and limitations.
- Update related code and tests to reflect changes in image names and environment variables.

- Updated Dockerfiles for C/C++, D, Fortran, R, and Sidecar to use the trixie-dev variant.
- Ensures compilers and development libraries are available at runtime.

- Updated base images to trixie-debian13-dev for C/C++, D, Fortran, R, and Rust.
- Upgraded PHP version to 8.5.3.
- Enhanced Node.js and Python requirements with new packages and versions.
- Improved Rust dependencies for better compatibility and performance.
- Updated Go version in executor-agent to 1.26.

- Introduced K8S_IMAGE_PULL_POLICY and K8S_IMAGE_PULL_SECRETS in configuration.
- Updated relevant classes and methods to handle new fields.
- Enhanced validation for execution mode and sidecar image consistency.
- Added unit tests to ensure correct handling of image pull settings.

- Add REDIS_MODE (standalone/cluster/sentinel) to RedisConfig
- Add TLS/SSL configuration (REDIS_TLS_ENABLED, certs, CA, insecure)
- Add Redis Cluster support (REDIS_CLUSTER_NODES) via RedisCluster client
- Add Redis Sentinel support (REDIS_SENTINEL_NODES/MASTER/PASSWORD)
- Update RedisPool to support all three modes with TLS
- Migrate FileService to use shared RedisPool instead of standalone client
- Update Settings class with all new Redis fields
- Update .env.example with new Redis configuration options
- Update docs/CONFIGURATION.md with Cluster, Sentinel, and TLS sections
- Update docs/SECURITY.md with TLS configuration reference
- Update Helm values.yaml, configmap.yaml, and _helpers.tpl
- Default remains standalone Redis for full backward compatibility

feat: add optional Redis key prefix support (REDIS_KEY_PREFIX)
- Add key_prefix field to RedisConfig and Settings
- Add make_key() helper to RedisPool for centralized key prefixing
- Update all services to use prefixed keys: session, state, file, health, api_key_manager, detailed_metrics, metrics
- Update .env.example, docs, and Helm chart with new setting
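The centralized key prefixing described above can be sketched roughly as follows. The `make_key()` name comes from the commit message; the exact separator and constructor signature are assumptions for illustration.

```python
class RedisPool:
    """Sketch of the key-prefixing behavior described in the commit above."""

    def __init__(self, key_prefix: str = ""):
        self.key_prefix = key_prefix

    def make_key(self, key: str) -> str:
        # With no prefix configured, keys pass through unchanged,
        # preserving full backward compatibility with existing data.
        if not self.key_prefix:
            return key
        return f"{self.key_prefix}:{key}"

pool = RedisPool(key_prefix="tenant-a")
pool.make_key("session:123")  # "tenant-a:session:123"
```

Routing every service's key construction through one helper like this keeps multi-tenant prefixing in a single place instead of scattering string concatenation across session, state, file, and metrics services.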
Pull request overview
This pull request introduces a significant architectural enhancement by adding an agent-based execution mode as the default, alongside comprehensive Redis deployment mode support and extensive dependency upgrades. The changes modernize the security model by eliminating the need for Linux capabilities in the default execution path, enable GKE Sandbox (gVisor) compatibility, and provide flexibility for Redis clustering and high-availability deployments.
Changes:
- Introduced agent execution mode (default) that eliminates nsenter, Linux capabilities, and privilege escalation requirements, with nsenter mode retained for backward compatibility
- Added comprehensive Redis deployment modes (standalone, cluster, sentinel) with TLS/SSL support and optional key prefixing for multi-tenant deployments
- Implemented GKE Sandbox (gVisor) support with runtime class, node selectors, and tolerations for kernel-level isolation
- Upgraded language runtimes and dependencies: Go 1.25→1.26, PHP 8.4.17→8.5.3, Rust 1.92→1.93, Python packages modernized
- Refactored sidecar to multi-target Docker build producing both agent and nsenter variants from a single Dockerfile
Reviewed changes
Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/services/kubernetes/client.py` | Enhanced pod manifest creation with agent/nsenter mode support, GKE Sandbox configuration, and image pull secrets |
| `docker/sidecar/main.py` | Added `execute_via_agent()` function and mode routing logic for dual execution mode support |
| `docker/sidecar/executor-agent/main.go` | New Go HTTP server for agent mode execution without nsenter or capabilities |
| `docker/sidecar/Dockerfile` | Refactored to multi-target build: sidecar-agent (default) and sidecar-nsenter (legacy) |
| `src/core/pool.py` | Complete rewrite supporting Redis standalone/cluster/sentinel modes with TLS and key prefixing |
| `src/config/redis.py` | New configuration model for Redis deployment modes, TLS, and advanced features |
| `src/services/*.py` | Updated all Redis-using services (state, session, metrics, api_key_manager) to use key prefixing |
| `src/main.py` | Added validation for execution mode/sidecar image consistency and image pull secrets parsing |
| `tests/unit/test_kubernetes_client.py` | Comprehensive tests for agent/nsenter modes and GKE Sandbox configuration |
| `scripts/build-images.sh` | Enhanced to support Docker multi-target builds with `--target` flag |
| `.github/workflows/docker-publish.yml` | Updated sidecar image name to `kubecoderun-sidecar-agent` |
| `helm-deployments/kubecoderun/values.yaml` | Added Redis mode configuration, GKE Sandbox settings, and execution mode options |
| `docs/SECURITY.md`, `docs/CONFIGURATION.md`, `docs/ARCHITECTURE.md` | Extensive documentation updates for new execution modes and Redis features |
Comments suppressed due to low confidence (1)
.github/workflows/docker-publish.yml:153
- The CI/CD workflow only builds the `kubecoderun-sidecar-agent` image but not the `kubecoderun-sidecar-nsenter` variant. While the build script supports both targets and the Dockerfile defines both, the nsenter sidecar won't be available in the registry for users who need legacy nsenter mode. Consider adding a separate job to build the nsenter variant, or document that users must build it locally if needed.
```yaml
sidecar:
  needs: changes
  if: |
    needs.changes.outputs.is_cross_repo_pr != 'true' &&
    (needs.changes.outputs.sidecar == 'true' || needs.changes.outputs.force_all == 'true')
  uses: ./.github/workflows/docker-build-reusable.yml
  secrets: inherit
  with:
    image_name: kubecoderun-sidecar-agent
    dockerfile: docker/sidecar/Dockerfile
    context: docker/sidecar
    image_tag: ${{ needs.changes.outputs.image_tag }}
    is_release: ${{ needs.changes.outputs.is_release == 'true' }}
    version: ${{ needs.changes.outputs.version }}
```
- Validate GKE Sandbox compatibility with nsenter execution mode
- Log warnings when incompatible execution modes are used
- Update Dockerfile capabilities for nsenter
- Enhance RedisPool type hints for better type checking
- Add unit tests for GKE Sandbox and nsenter mode interactions
…me by default

Three issues prevented connecting to GCP Memorystore Redis with TLS:

1. _validate_redis_connection() used redis.from_url() without passing ssl_ca_certs / ssl_cert_reqs, so certificate verification fell back to the system CA bundle, which doesn't include managed-service CAs.
2. get_tls_kwargs() set ssl_check_hostname=True when tls_insecure=False. Managed Redis services (GCP Memorystore, AWS ElastiCache) and Redis Cluster node discovery return IPs that don't match certificate CN/SAN, causing CERTIFICATE_VERIFY_FAILED. Hostname checking is now off by default (matching redis-py) and controlled by the new REDIS_TLS_CHECK_HOSTNAME setting.
3. REDIS_HOST could contain a URL scheme (rediss://host) which was passed through to ClusterNode or URL construction. A field validator now strips accidental schemes from the host value.
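The third fix, stripping an accidental URL scheme from REDIS_HOST, can be sketched as a small normalization step. The commit implements it as a Pydantic field validator; a plain function with an assumed name, `normalize_redis_host`, is shown here for clarity.

```python
def normalize_redis_host(host: str) -> str:
    # Strip accidental URL schemes (redis://, rediss://) from the host value,
    # since downstream code builds its own URLs / ClusterNode entries from
    # the bare hostname and a scheme would corrupt them.
    for scheme in ("rediss://", "redis://"):
        if host.startswith(scheme):
            return host[len(scheme):]
    return host

normalize_redis_host("rediss://10.0.0.5")  # "10.0.0.5"
```

Hosts without a scheme pass through untouched, so existing configurations are unaffected.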
Hello, let me take a look here. Nice to see that Nos engineers are using Kubecoderun - I am a Nos customer myself.
Pull request overview
Copilot reviewed 50 out of 51 changed files in this pull request and generated 9 comments.
```python
pod_annotations = annotations or {}
if gke_sandbox_enabled:
    # GKE Sandbox annotation for gVisor runtime
    pod_annotations["sandbox.gke.io/runtime"] = "gvisor"
```
`pod_annotations` is assigned as `annotations or {}` and then mutated when GKE Sandbox is enabled. If the caller passes a dict, this mutates the caller-owned object and can leak the sandbox annotation to other manifests. Consider copying the dict (e.g., a shallow copy) before adding or modifying keys.
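A minimal sketch of the suggested fix, shallow-copying the caller's dict before mutating it (parameter names mirror the snippet under review; the wrapper function itself is illustrative):

```python
def build_pod_annotations(annotations=None, gke_sandbox_enabled=False):
    # Copy so the caller-owned dict is never mutated and the sandbox
    # annotation cannot leak into manifests built from the same dict.
    pod_annotations = dict(annotations) if annotations else {}
    if gke_sandbox_enabled:
        pod_annotations["sandbox.gke.io/runtime"] = "gvisor"
    return pod_annotations

caller_annotations = {"team": "ml"}
result = build_pod_annotations(caller_annotations, gke_sandbox_enabled=True)
# caller_annotations stays {"team": "ml"}; only result gains the annotation
```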
```python
if custom_tolerations:
    # Add custom node pool taints (e.g., pool=sandbox)
    for tol in custom_tolerations:
        tolerations.append(
            client.V1Toleration(
                key=tol.get("key"),
                operator=tol.get("operator", "Equal"),
                value=tol.get("value"),
                effect=tol.get("effect", "NoSchedule"),
            )
        )
```
`custom_tolerations` entries are converted into `V1Toleration` objects using `tol.get("key")`. If a dict is missing `key` (or it's empty), the manifest will contain an invalid toleration and the pod/job creation will fail at runtime. Consider validating required fields (at least `key`) and raising/logging a clear error before building the manifest.
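One way to implement the suggested validation, as a sketch: fail fast on a missing `key` instead of emitting an invalid toleration. The helper name `validate_toleration` and the plain-dict return shape are illustrative assumptions.

```python
def validate_toleration(tol: dict) -> dict:
    # Raise a clear error early rather than producing an invalid manifest
    # that only fails later at pod/job creation time.
    if not tol.get("key"):
        raise ValueError(f"custom toleration missing required 'key': {tol!r}")
    return {
        "key": tol["key"],
        "operator": tol.get("operator", "Equal"),
        "value": tol.get("value"),
        "effect": tol.get("effect", "NoSchedule"),
    }
```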
```python
# Parse JSON strings for node selector and tolerations
sandbox_node_selector = None
if self.gke_sandbox_node_selector:
    try:
        sandbox_node_selector = json.loads(self.gke_sandbox_node_selector)
    except json.JSONDecodeError:
        pass

custom_tolerations = None
if self.gke_sandbox_custom_tolerations:
    try:
        custom_tolerations = json.loads(self.gke_sandbox_custom_tolerations)
    except json.JSONDecodeError:
        pass
```
Invalid JSON in `GKE_SANDBOX_NODE_SELECTOR` / `GKE_SANDBOX_CUSTOM_TOLERATIONS` is silently ignored (`JSONDecodeError` is swallowed). This makes misconfiguration hard to detect and can lead to unexpected scheduling behavior. Consider logging a warning or raising a validation error when these JSON values cannot be parsed.
```yaml
# GKE Sandbox (gVisor) Configuration
# Provides additional kernel isolation for untrusted workloads using gVisor
# See: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods
gkeSandbox:
  # Enable GKE Sandbox for execution pods
  enabled: true

  # Runtime class name (default: gvisor for GKE)
  runtimeClassName: "gvisor"

  # Node selector for sandbox-enabled nodes
  # GKE automatically adds sandbox.gke.io/runtime=gvisor to sandbox nodes
  # Add additional selectors here if needed (e.g., for specific node pools)
  nodeSelector:
    sandbox.gke.io/runtime: gvisor
    # Example:
    # cloud.google.com/gke-nodepool: sandbox-pool

  # Custom tolerations for node pool taints
  # GKE automatically adds toleration for sandbox.gke.io/runtime=gvisor
  # Use this for additional custom taints (e.g., dedicated sandbox node pools)
  customTolerations:
    # Example:
    # - key: pool
    #   operator: Equal
    #   value: sandbox
    #   effect: NoSchedule
    # Custom node pool taint
    - key: pool
      operator: Equal
      value: sandbox
      effect: NoSchedule
    # GKE Sandbox standard taint
    - key: sandbox.gke.io/runtime
      operator: Equal
      value: gvisor
      effect: NoSchedule
```
GKE Sandbox is enabled by default here (`enabled: true`) and `nodeSelector`/`customTolerations` are populated with sandbox-specific values. This will make a default Helm install schedule execution pods only onto gVisor nodes and can leave pods Pending on clusters without a sandbox node pool. Consider defaulting `gkeSandbox.enabled` to false and leaving `nodeSelector`/`customTolerations` empty unless explicitly configured.
```python
# Image registry configuration
# Format: {image_registry}-{language}:{image_tag}
# e.g., aronmuon/kubecoderun-python:latest
image_registry: str = "aronmuon/kubecoderun"
image_tag: str = "latest"
image_pull_policy: str = "IfNotPresent"
```
`KubernetesConfig.image_pull_policy` defaults to `"IfNotPresent"`, but Settings (`K8S_IMAGE_PULL_POLICY`) and the docs describe a default of `"Always"`. Having different defaults depending on which config path is used is error-prone. Consider aligning these defaults (either change this dataclass default or update Settings/docs to match).
```go
if err != nil || !strings.HasPrefix(absDir, "/mnt/data") {
	writeJSON(w, http.StatusBadRequest, ExecuteResponse{
		ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
	})
	return
}
```
The working directory check uses `strings.HasPrefix(absDir, "/mnt/data")`, which also allows paths like `"/mnt/data2"`. This weakens the sandbox boundary. Consider validating with a path-aware check (e.g., allow exactly `"/mnt/data"` or paths under `"/mnt/data/"` using `filepath.Rel`/`Clean`) before setting `cmd.Dir`.
Suggested change:

```go
if err != nil {
	writeJSON(w, http.StatusBadRequest, ExecuteResponse{
		ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
	})
	return
}
absDir = filepath.Clean(absDir)
sandboxRoot := "/mnt/data"
if absDir != sandboxRoot && !strings.HasPrefix(absDir, sandboxRoot+string(os.PathSeparator)) {
	writeJSON(w, http.StatusBadRequest, ExecuteResponse{
		ExitCode: 1, Stderr: fmt.Sprintf("Invalid working directory: must be within /mnt/data, got %q", workingDir),
	})
	return
}
```
```go
env := os.Environ()
for k, v := range req.Env {
	env = append(env, fmt.Sprintf("%s=%s", k, v))
}
```
Env overrides are merged by appending new `KEY=VALUE` pairs to `os.Environ()`. Many runtimes resolve duplicate environment keys by taking the first occurrence, so the intended overrides (e.g., network isolation `GOPROXY=off`) may not actually apply. Consider replacing existing keys when merging overrides instead of appending duplicates.
Suggested change:

```go
env := os.Environ()
// First, replace existing keys in the inherited environment with overrides.
for i, e := range env {
	key, _, ok := strings.Cut(e, "=")
	if !ok {
		continue
	}
	if v, found := req.Env[key]; found {
		env[i] = fmt.Sprintf("%s=%s", key, v)
		delete(req.Env, key)
	}
}
// Then append any remaining override keys that did not exist in the base environment.
for k, v := range req.Env {
	env = append(env, fmt.Sprintf("%s=%s", k, v))
}
```
| `REDIS_TLS_CHECK_HOSTNAME` | `false` | Verify server hostname against certificate CN/SAN |

> When `REDIS_TLS_ENABLED=true` the generated URL uses the `rediss://` scheme automatically.
`REDIS_TLS_CHECK_HOSTNAME` is documented and defaulted to `false`, meaning TLS connections to Redis will not verify the server hostname even when `REDIS_TLS_ENABLED=true`. Without hostname verification, an attacker with network access and any valid CA-signed certificate can perform a man-in-the-middle attack and intercept or modify Redis traffic. Consider enabling hostname verification by default (setting `REDIS_TLS_CHECK_HOSTNAME=true` for production) and requiring explicit opt-out only for environments that truly cannot support it, with strong warnings in both code and docs.
Suggested change:

| `REDIS_TLS_CHECK_HOSTNAME` | `true` | Verify server hostname against certificate CN/SAN |

> When `REDIS_TLS_ENABLED=true` the generated URL uses the `rediss://` scheme automatically.
> For production, you should keep `REDIS_TLS_CHECK_HOSTNAME=true`; only set it to `false` in exceptional legacy environments that cannot support proper certificates, as this disables protection against man-in-the-middle attacks.
```shell
# Hostname verification is off by default because managed Redis services
# and Redis Cluster mode expose node IPs that don't match cert CN/SAN.
# REDIS_TLS_CHECK_HOSTNAME=false
```
The example Redis TLS configuration keeps REDIS_TLS_CHECK_HOSTNAME disabled by default (false), matching the code’s default of skipping hostname verification for TLS connections. This weakens TLS server authentication and allows a network attacker with any valid CA-signed certificate to MITM Redis traffic and access or alter sensitive data. For production, consider enabling hostname verification by default (setting REDIS_TLS_CHECK_HOSTNAME=true or omitting it when the code default is secure) and clearly marking any hostname-skipping mode as a short-term, debug-only override.
Suggested change:

```shell
# Hostname verification is enabled by default for strong TLS security.
# Set REDIS_TLS_CHECK_HOSTNAME=false only as a short-term, debug-only override
# when dealing with misconfigured certificates, and never in production.
REDIS_TLS_CHECK_HOSTNAME=true
```
This pull request updates the development and runtime environments for multiple languages, modernizes package dependencies, and enhances configuration flexibility, particularly for Redis and Kubernetes. The changes include updates to Dockerfiles for newer base images and language versions, significant dependency version bumps and additions, and expanded sample configuration for advanced deployment scenarios.
Key changes:
1. Dependency and Environment Updates
- Updated the C/C++, D, Fortran, and R Dockerfiles to `trixie-debian13-dev` base images to ensure compilers and development libraries are available at runtime, improving compatibility and security. (`docker/c-cpp.Dockerfile`, `docker/d.Dockerfile`, `docker/fortran.Dockerfile`, `docker/r.Dockerfile`) [1] [2] [3] [4] [5]
- Updated Go to 1.26 in the executor agent. (`docker/go.Dockerfile`, `docker/requirements/go.mod`) [1] [2] [3] [4]
- Upgraded PHP to 8.5.3. (`docker/php.Dockerfile`)
- Modernized Python package requirements with new packages and versions. (`docker/requirements/python-core.txt`, `docker/requirements/python-analysis.txt`, `docker/requirements/python-documents.txt`, `docker/requirements/python-utilities.txt`, `docker/requirements/python-visualization.txt`) [1] [2] [3] [4] [5]
- Updated Node.js package requirements. (`docker/requirements/nodejs.txt`)
- Updated the Rust `Cargo.toml` for new features and bug fixes. (`docker/rust.Dockerfile`, `docker/requirements/rust-Cargo.toml`) [1] [2] [3]

2. Configuration Enhancements
- The `.env.example` file now documents advanced Redis deployment options, including cluster and sentinel modes, TLS/SSL settings, and key prefixing, making it easier to configure Redis in complex environments. [1] [2]
- Kubernetes execution settings are documented in `.env.example`, including support for agent and nsenter modes, sidecar image selection, image pull policies, and GKE Sandbox compatibility notes. The two execution modes are:
  - `agent` (default)
  - `nsenter` (legacy)

3. Docker Image Naming and CI/CD
- The sidecar image is now built and published as `kubecoderun-sidecar-agent` instead of `kubecoderun-sidecar`, aligning with the agent-based execution model. (`.github/workflows/docker-publish.yml`) [1] [2]

4. Python Runtime Optimization
- (`docker/python.Dockerfile`) [1] [2]