This repository contains the project for my talk: Zero Downtime Deployment with Cloud Native tools


Zero-Downtime AI Chat (Go + HTMX + Tailwind + Ollama)

A compact, production-style demo that feels like a real LLM chat app and doubles as a zero-downtime deployment showcase.

  • Go backend with clean layers (API, chat controller, model manager, sessions, logging)
  • HTMX + Tailwind UI (server-rendered; no SPA framework)
  • Ollama integration (local or Kubernetes sidecar) for open-source models
  • Markdown rendering (safe, pretty, code blocks styled)
  • Rolling updates via K8s/Minikube with a live version pill that flips after each deploy

Repo: github.com/varsilias/zero-downtime


✨ Highlights (the clever bits)

  • Zero-downtime "version pill"
    A tiny HTMX fragment that refreshes every 2 minutes and updates out-of-band without reloading the page. Perfect visual for rolling updates.
  • Resilient Ollama integration
    App detects Ollama at runtime; if unreachable, it falls back to an in-memory engine so the demo keeps working.
  • Safe, server-rendered Markdown
    goldmark + bluemonday for beautiful, sanitized Markdown (incl. code blocks) in chat; see the sketch after this list.
  • Production-style build
    Multi-stage Docker build compiles Tailwind assets, embeds build info via -ldflags, runs on Distroless non-root.
  • Operational polish
    Request ID middleware, access logs, panic recovery, and a clean Makefile that builds, loads into Minikube, sets the image on the Deployment, and waits for rollout.
  • Sidecar pattern for Ollama
    Keeps models on a PVC, probes configured, optional auto-pull on start.

🧱 Architecture

                 ┌────────────────────────────┐
                 │        Frontend            │
                 │  HTMX + Tailwind           │
                 │  - Chat UI                 │
                 │  - Model Selector          │
                 │  - Sessions Sidebar        │
                 └───────────┬────────────────┘
                             │ (REST + HTMX fragments)
                             ▼
                 ┌────────────────────────────┐
                 │         Backend (Go)       │
                 │  - API Gateway             │
                 │  - Chat Controller         │
                 │  - Model Manager           │
                 │  - Session Store (mem)     │
                 │  - Logging & Metrics       │
                 └───────────┬────────────────┘
                             │
          ┌──────────────────┴─────────────────────┐
          ▼                                        ▼
 ┌────────────────────────────┐            ┌────────────────────────────┐
 │       Ollama Runtime       │            │       Platform (K8s)       │
 │     (Local or Sidecar)     │            │  - Deployment (app+ollama) │
 │  - /api/version            │            │    RollingUpdate (0/1)     │
 │  - /api/tags               │            │  - Service (80→8080)       │
 │  - /api/generate           │            │  - PVC (ollama-models)     │
 │  - Models on PVC (K8s)     │            │  - Ingress (optional)      │
 └────────────────────────────┘            └────────────────────────────┘

✅ Features

  • Chat UI that feels like a real LLM product:
    • Sidebar with recent sessions + New chat
    • Chat bubbles with Markdown (headings, lists, code fences, inline code)
    • Sticky top bar & composer, independent scroll areas (sidebar & messages), auto-scroll to last message
    • Immediate user echo; assistant bubble after compute
    • Per-response latency and timestamp
  • Model dropdown sourced from Ollama /api/tags
  • Admin endpoint to pull models (optional)
  • Version pill that auto-refreshes every 120s without htmx loops
  • Health checks (/healthz) and /version API with build metadata (version/commit/built_at)
  • Clean logging via Go slog and middleware (Request ID, access log, recoverer); a rough sketch of that stack follows this list
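
A minimal sketch of a request ID + access log + recovery middleware using slog; the function name and log fields are assumptions for illustration, not the repo's exact implementation.

package middleware

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"time"
)

// Wrap adds a request ID header, one structured access-log line per request,
// and panic recovery around a handler.
func Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := make([]byte, 8)
		rand.Read(id)
		reqID := hex.EncodeToString(id)
		w.Header().Set("X-Request-Id", reqID)

		start := time.Now()
		defer func() {
			if p := recover(); p != nil {
				slog.Error("panic recovered", "request_id", reqID, "panic", p)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
			slog.Info("request", "request_id", reqID,
				"method", r.Method, "path", r.URL.Path,
				"duration", time.Since(start))
		}()

		next.ServeHTTP(w, r)
	})
}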

🛠 Prerequisites (local dev)

You don't need all of these at once, but this list avoids "missing tool" surprises:

  • Go 1.22+
  • Node.js 20+ and npm (for Tailwind build)
  • Air (optional, hot reload): go install github.com/air-verse/air@latest
  • Docker (to build images)
  • Minikube + kubectl (to demo rolling updates locally)
  • make (for the Makefile targets)
  • Ollama (optional, to use a local daemon): https://ollama.com
  • jq (optional, nicer curls)

Apple Silicon ✅. For CPU inference, use small models and keep num_ctx modest.


πŸš€ Run locally (no Docker)

  1. Install Go & UI deps
go mod download
npm ci
  2. Build Tailwind CSS
# Development:
npm run tw:build
# Production (minified):
npm run tw:prod
  3. Start the server
# Plain:
go run .

# With hot reload (Air) & build info flags via script:
air -c .air.toml
  4. Open
  • App: http://localhost:8080
  • Build info: http://localhost:8080/version
  • Health: http://localhost:8080/healthz

Optional: Use your local Ollama

# In another terminal:
ollama serve &
ollama pull gemma3:270m

# Run app without startup wait (dev-friendly):
OLLAMA_WAIT=false OLLAMA_BASE_URL=http://localhost:11434 go run .

Pick a model from the dropdown and chat.


🐳 Docker build & run

# Build with ldflags baked in (Makefile also sets these)
docker build -t varsilias/zero-downtime:dev .
docker run -p 8080:8080 varsilias/zero-downtime:dev

Dockerfile highlights

  • Node stage builds Tailwind → web/static/dist/app.css
  • Go stage builds with -ldflags & -trimpath (CGO_ENABLED=0); the -ldflags pattern is sketched below
  • Distroless non-root runtime; templates/assets copied into /app/web
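
For reference, a hedged sketch of the -ldflags pattern used to embed build metadata; the variable names (version, commit, builtAt) are assumptions and may differ from the repo's.

package main

import (
	"encoding/json"
	"net/http"
)

// Overridden at build time, e.g.:
//   go build -trimpath -ldflags "-X main.version=v1.2.3 -X main.commit=abc123 -X main.builtAt=2024-01-01T00:00:00Z"
var (
	version = "dev"
	commit  = "none"
	builtAt = "unknown"
)

// versionHandler backs GET /version with the metadata baked in above.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{
		"version":  version,
		"commit":   commit,
		"built_at": builtAt,
	})
}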

☸️ Kubernetes (Minikube) – zero-downtime demo

First rollout downloads models (PVC-backed cache). Subsequent rollouts are fast.

  1. Start Minikube with headroom
minikube start --memory=12000 --cpus=4
  2. Apply PVC for Ollama models
kubectl apply -f k8s/pvc.yaml
  3. Build + load + rollout
make release
# does: docker build → minikube image load → apply svc/deploy → set image → rollout status → print URL
  4. Open the printed URL. Watch the version pill.
  5. Do another release (and watch zero-downtime flip)
VERSION=demo-$(date +%s) make release 

Ollama sidecar notes

  • Sidecar image: ollama/ollama listening on :11434
  • Models persisted in PVC ollama-models
  • App talks to http://127.0.0.1:11434 inside the Pod
  • Optional env on sidecar: OLLAMA_PULL_MODELS="gemma3:270m smollm:135m" to auto-pull at start
  • On CPU, serialize requests: OLLAMA_NUM_PARALLEL=1

🔧 Environment variables

Var                   Default                  Purpose
ADDR                  :8080                    HTTP bind address
LOG_LEVEL             info                     One of debug, info, warn, error
LOG_JSON              true                     JSON logs (set false for pretty text)
OLLAMA_BASE_URL       http://localhost:11434   Ollama API base (http://127.0.0.1:11434 for the sidecar)
OLLAMA_WAIT           true                     Wait for Ollama/models on startup; set false for local dev
OLLAMA_WAIT_TIMEOUT   180s                     Max time to wait before continuing anyway
OLLAMA_WAIT_INTERVAL  2s                       Poll frequency during the startup wait
OLLAMA_WAIT_MODELS    "gemma3:270m smollm:135m deepseek-r1:1.5b"   Space-separated list of models to wait for
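
As an illustration of how the OLLAMA_WAIT* settings above interact, here is a minimal sketch of a startup wait loop (function and variable names are assumptions): poll /api/version at the configured interval, give up after the timeout, and let the app continue with its in-memory fallback.

package main

import (
	"log/slog"
	"net/http"
	"time"
)

// waitForOllama polls baseURL/api/version until it answers or the timeout
// elapses. Returning false means the caller should fall back to the
// in-memory engine rather than failing startup.
func waitForOllama(baseURL string, timeout, interval time.Duration) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := client.Get(baseURL + "/api/version")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return true
			}
		}
		slog.Debug("ollama not ready, retrying", "interval", interval)
		time.Sleep(interval)
	}
	return false
}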

🔌 API (quick reference)

  • POST /api/chat → chat with the selected model (Go example below)
{ "model":"gemma3:270m", "message":"Hello!" }
  • GET /api/models → ["gemma3:270m","smollm:135m","deepseek-r1:1.5b", ...]
  • GET /api/history/:session_id → chat transcript (in-memory)
  • POST /admin/models/pull → { "name": "gemma3:270m" } (optional admin)
  • GET /version → { "version": "...", "commit": "...", "built_at": "..." }
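
A quick way to exercise POST /api/chat from Go; a minimal client sketch that only assumes the request body shown above (the response is printed raw because its shape isn't documented here).

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]string{
		"model":   "gemma3:270m",
		"message": "Hello!",
	})
	resp, err := http.Post("http://localhost:8080/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(out)) // raw response body (fragment or JSON, depending on the handler)
}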

UI endpoints

  • GET / – chat UI

  • POST /ui/chat – HTMX post (returns user + assistant bubbles)

  • POST /ui/session/new – creates a new session (via HX-Redirect)

  • GET /ui/version-pill – HTMX fragment for the version pill (polled by a non-swapped element every 120s); a handler sketch follows
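
A hedged sketch of what the /ui/version-pill handler can look like, including the Cache-Control: no-store header the troubleshooting section calls out; the markup, classes, and names are illustrative assumptions.

package ui

import (
	"fmt"
	"net/http"
)

// VersionPill returns a tiny HTML fragment that a hidden poller swaps into
// #version-pill. Cache-Control: no-store keeps every poll fresh.
func VersionPill(version string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", "no-store")
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprintf(w, `<span id="version-pill" class="rounded-full px-2 py-1 text-xs">%s</span>`, version)
	}
}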


Why is this project awesome?

Production-flavored design, real operational concerns addressed, clean Go layering, an elegant UI without a SPA framework, and a compelling rolling-update strategy.

  • Zero-Downtime in 60 seconds: run make release twice and point at the version pill. No 5xx, pods roll gracefully, traffic keeps flowing.

  • Sidecar runtime pattern: clean in-pod service-to-service without extra networking; PVC-backed model cache; probes; resource requests/limits; startup wait with timeout.

  • Safe, server-rendered UI: a modern-feeling app without a SPA, built from HTMX requests and partial swaps, with Markdown rendered on the server and sanitized with bluemonday.

  • Thoughtful DX: Air hot reload wired with -ldflags; Tailwind build baked into Docker; Distroless runtime; request IDs; access logs; panic recovery.

  • Pragmatic fallbacks: if Ollama isn't reachable, the app still works, preserving the demo and UX.


Limitations (By Design)

  • In-memory sessions only (no DB yet): the sidebar lists current-run chats; a restart loses history.
  • No token streaming yet (responses are returned whole; no SSE/WebSocket).
  • No auth/tenancy: endpoints are open; fine for demos, not for production.
  • Basic backpressure: no rate limiting; rely on an ingress/gateway if needed.
  • CPU inference defaults: small models are recommended; larger models need GPU/tuning.
  • Model pulls can be long: use longer timeouts or a background pull job for big models.

These choices keep the demo lean and make the zero-downtime story the star.


Roadmap (If This Weren't a Demo)

  • Persistence: Postgres for sessions/users/model metadata; S3 for attachments.
  • Streaming responses: SSE/WebSocket with token-by-token updates + cancellation.
  • Observability: Prometheus metrics (latency, tokens/sec), OpenTelemetry traces, Grafana dashboards.
  • Auth & RBAC: login, API keys, roles; protect admin endpoints.
  • Background workers: queued model pulls, warm-ups, summarization jobs.
  • GPU & autoscaling: HPA on latency/QPS; dedicated GPU node pool; model-aware scheduling.
  • Config & GitOps: ConfigMaps for model lists, ArgoCD, image updater; canary rollouts.
  • Testing: e2e smoke (health, version, chat flow), load testing (vegeta/k6), chaos (pod kill) to prove resilience.

Troubleshooting

CSS 404 / Wrong MIME

  • Ensure web/static/dist/app.css exists (npm run tw:prod).
  • Static server must point at web/static; link /static/app.css.

HTMX Polling Loops

  • Don't use hx-trigger="load, every …" with hx-swap="outerHTML" on the same element.
  • Use a hidden poller targeting #version-pill, or OOB swap.

Nil Map Panic (MemoryStore)

  • Initialize via NewMemoryStore() or guard with ensure() before writes (see the sketch below).
  • Pass the store by pointer across the app.
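
A minimal sketch of the constructor-plus-guard pattern described above; the field names and value type are simplified assumptions.

package store

import "sync"

type MemoryStore struct {
	mu       sync.Mutex
	sessions map[string][]string // session ID -> messages (simplified)
}

// NewMemoryStore returns a store with its map already initialized,
// avoiding nil-map panics on first write.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{sessions: make(map[string][]string)}
}

// ensure guards against a zero-value store being used by mistake.
func (s *MemoryStore) ensure() {
	if s.sessions == nil {
		s.sessions = make(map[string][]string)
	}
}

func (s *MemoryStore) Append(sessionID, msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ensure()
	s.sessions[sessionID] = append(s.sessions[sessionID], msg)
}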

Model Pulls Never Finish

  • Use a no-timeout HTTP client or a long context for /api/pull (sketch below).
  • Consider pulling via sidecar postStart or a background job.
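
A sketch of a pull call that won't be cut off by a default client timeout, assuming Ollama's /api/pull endpoint and the {"name": ...} request shape used by the admin endpoint above.

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"io"
	"net/http"
)

// pullModel asks the Ollama sidecar to pull a model. Large models can take
// many minutes, so the HTTP client has no timeout and the context sets the
// only upper bound, e.g. context.WithTimeout(ctx, 30*time.Minute).
func pullModel(ctx context.Context, baseURL, name string) error {
	body, _ := json.Marshal(map[string]string{"name": name})
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/api/pull", bytes.NewReader(body))
	if err != nil {
		return err
	}
	client := &http.Client{} // no Timeout: the context governs cancellation
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain streamed progress lines
	return err
}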

500 on /api/generate

  • If first call after model load, retry with small backoff.
  • Keep options light on CPU: num_ctx:512, num_predict:128, num_thread:4, num_gpu_layers:0 (see the sketch below).
  • Increase server WriteTimeout (≥ 3–5 min); serialize requests: OLLAMA_NUM_PARALLEL=1.
  • Check sidecar resources (CPU/mem) and Minikube size.
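
To make the "keep options light" advice concrete, a hedged sketch of a non-streaming /api/generate request with conservative CPU options; the helper name is illustrative, and only options I'm confident Ollama accepts are included.

package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// generateOnCPU sends a whole-response (non-streaming) generate request with
// small context and prediction budgets, matching the CPU advice above.
func generateOnCPU(baseURL, model, prompt string) (*http.Response, error) {
	payload := map[string]any{
		"model":  model,
		"prompt": prompt,
		"stream": false,
		"options": map[string]any{
			"num_ctx":     512,
			"num_predict": 128,
			"num_thread":  4,
		},
	}
	body, _ := json.Marshal(payload)
	return http.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
}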

Version Pill Doesn't Update

  • Ensure polling element is not swapped itself.
  • Endpoint /ui/version-pill must send Cache-Control: no-store.
  • Poll interval set to every 120s.

Kubernetes Rollout Stalls

  • Confirm readinessProbe paths and ports.
  • maxUnavailable: 0 requires enough capacity for maxSurge: 1.
  • Check PVC events if sidecar waits on model cache.
