A compact, production-style demo that feels like a real LLM chat app and doubles as a zero-downtime deployment showcase.
- Go backend with clean layers (API, chat controller, model manager, sessions, logging)
- HTMX + Tailwind UI (server-rendered; no SPA framework)
- Ollama integration (local or Kubernetes sidecar) for open-source models
- Markdown rendering (safe, pretty, code blocks styled)
- Rolling updates via K8s/Minikube with a live version pill that flips after each deploy
Repo: github.com/varsilias/zero-downtime
- Zero-downtime "version pill": a tiny HTMX fragment that refreshes every 2 minutes and updates out-of-band without reloading the page. Perfect visual for rolling updates.
- Resilient Ollama integration: the app detects Ollama at runtime; if unreachable, it falls back to an in-memory engine so the demo keeps working.
- Safe, server-rendered Markdown: goldmark + bluemonday for beautiful, sanitized Markdown (incl. code blocks) in chat (sketched below).
- Production-style build: multi-stage Docker build compiles Tailwind assets, embeds build info via -ldflags, runs on Distroless non-root.
- Operational polish: Request ID middleware, access logs, panic recovery, and a clean Makefile that builds, loads into Minikube, sets the image on the Deployment, and waits for rollout.
- Sidecar pattern for Ollama: keeps models on a PVC, probes configured, optional auto-pull on start.
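The Markdown path is worth spelling out. A minimal sketch of the goldmark + bluemonday combination, assuming a helper like `renderMarkdown` (the name and package layout are illustrative, not the repo's actual code):

```go
package web

import (
	"bytes"
	"html/template"

	"github.com/microcosm-cc/bluemonday"
	"github.com/yuin/goldmark"
	"github.com/yuin/goldmark/extension"
)

var (
	md     = goldmark.New(goldmark.WithExtensions(extension.GFM)) // tables, fenced code blocks, etc.
	policy = bluemonday.UGCPolicy()                               // strips scripts/handlers, keeps formatting
)

// renderMarkdown converts untrusted model output to sanitized HTML
// that is safe to inject into an HTMX fragment.
func renderMarkdown(src string) (template.HTML, error) {
	var buf bytes.Buffer
	if err := md.Convert([]byte(src), &buf); err != nil {
		return "", err
	}
	return template.HTML(policy.SanitizeBytes(buf.Bytes())), nil
}
```

Sanitizing after conversion is the important part: the model's output is untrusted input, so whatever HTML goldmark produces goes through bluemonday before it reaches the browser.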

    ┌─────────────────────────────┐
    │          Frontend           │
    │       HTMX + Tailwind       │
    │  - Chat UI                  │
    │  - Model Selector           │
    │  - Sessions Sidebar         │
    └─────────────┬───────────────┘
                  │  (REST + HTMX fragments)
                  ▼
    ┌─────────────────────────────┐
    │        Backend (Go)         │
    │  - API Gateway              │
    │  - Chat Controller          │
    │  - Model Manager            │
    │  - Session Store (mem)      │
    │  - Logging & Metrics        │
    └─────────────┬───────────────┘
                  │
          ┌───────┴──────────────────────────────────┐
          ▼                                           ▼
    ┌─────────────────────────────┐   ┌─────────────────────────────┐
    │       Ollama Runtime        │   │       Platform (K8s)        │
    │     (Local or Sidecar)      │   │  - Deployment (app+ollama)  │
    │  - /api/version             │   │    RollingUpdate (0/1)      │
    │  - /api/tags                │   │  - Service (80→8080)        │
    │  - /api/generate            │   │  - PVC (ollama-models)      │
    │  - Models on PVC (K8s)      │   │  - Ingress (optional)       │
    └─────────────────────────────┘   └─────────────────────────────┘
- Chat UI that feels like a real LLM product:
  - Sidebar with recent sessions + New chat
  - Chat bubbles with Markdown (headings, lists, code fences, inline code)
  - Sticky top bar & composer, independent scroll areas (sidebar & messages), auto-scroll to last message
  - Immediate user echo; assistant bubble after compute
  - Per-response latency and timestamp
- Model dropdown sourced from Ollama `/api/tags`
- Admin endpoint to pull models (optional)
- Version pill that auto-refreshes every 120s without htmx loops
- Health checks (`/healthz`) and a `/version` API with build metadata (version/commit/built_at)
- Clean logging via Go `slog` and middleware (Request ID, access log, recoverer); a sketch follows this list
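The middleware bullet above can be pictured roughly like this; a hedged sketch using `log/slog`, where `WithRequestLogging` and the header name are illustrative choices rather than the repo's exact implementation:

```go
package middleware

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"time"
)

// newRequestID returns a short random hex ID; good enough for log correlation.
func newRequestID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// WithRequestLogging tags each request with an ID, writes an access log line,
// and recovers from panics so one bad request can't take the pod down.
func WithRequestLogging(log *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := newRequestID()
		start := time.Now()
		w.Header().Set("X-Request-ID", id)

		defer func() {
			if rec := recover(); rec != nil {
				log.Error("panic recovered", "request_id", id, "panic", rec)
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()

		next.ServeHTTP(w, r)
		log.Info("access",
			"request_id", id,
			"method", r.Method,
			"path", r.URL.Path,
			"duration_ms", time.Since(start).Milliseconds())
	})
}
```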
You don't need all of these at once, but this list avoids "missing tool" surprises:
- Go 1.22+
- Node.js 20+ and npm (for Tailwind build)
- Air (optional, hot reload): `go install github.com/air-verse/air@latest`
- Docker (to build images)
- Minikube + kubectl (to demo rolling updates locally)
- make (for the Makefile targets)
- Ollama (optional, to use a local daemon): https://ollama.com
- jq (optional, nicer curls)
Apple Silicon is supported. For CPU inference, use small models and keep `num_ctx` modest.
- Install Go & UI deps: `go mod download`, then `npm ci`
- Build Tailwind CSS: `npm run tw:build` (development) or `npm run tw:prod` (production, minified)
- Start the server: `go run .` (plain) or `air -c .air.toml` (hot reload via Air, with build info flags set via script)
- Open:
  - App: http://localhost:8080
  - Build info: http://localhost:8080/version
  - Health: http://localhost:8080/healthz
Optional: Use your local Ollama

    # In another terminal:
    ollama serve &
    ollama pull gemma3:270m

    # Run app without startup wait (dev-friendly):
    OLLAMA_WAIT=false OLLAMA_BASE_URL=http://localhost:11434 go run .

Pick a model from the dropdown and chat.
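For context, the startup behavior behind `OLLAMA_WAIT` can be sketched as a simple reachability probe against Ollama's `/api/version`; the function name and fallback wiring are assumptions, not the repo's code:

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// ollamaReachable reports whether an Ollama daemon answers at baseURL
// (e.g. http://localhost:11434). A short timeout keeps startup snappy
// when the wait is disabled.
func ollamaReachable(ctx context.Context, baseURL string) bool {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/version", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```

If the probe fails, the app would swap in the in-memory fallback engine described earlier instead of erroring out.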
    # Build with ldflags baked in (Makefile also sets these)
    docker build -t varsilias/zero-downtime:dev .
    docker run -p 8080:8080 varsilias/zero-downtime:dev

Dockerfile highlights:
- Node stage builds Tailwind → web/static/dist/app.css
- Go stage builds with -ldflags & -trimpath (CGO_ENABLED=0); see the sketch below for how the build info lands in `/version`
- Distroless non-root runtime; templates/assets copied into /app/web
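The `-ldflags` metadata typically surfaces as package-level variables; a sketch with assumed variable names (`version`, `commit`, `builtAt`), matching the `/version` payload described in the API section:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Overridden at build time, e.g.:
//   go build -trimpath -ldflags "-X main.version=v1.2.3 -X main.commit=abc123 -X main.builtAt=2024-01-01T00:00:00Z"
var (
	version = "dev"
	commit  = "none"
	builtAt = "unknown"
)

// versionHandler serves the build metadata shown in the version pill.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{
		"version":  version,
		"commit":   commit,
		"built_at": builtAt,
	})
}
```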
First rollout downloads models (PVC-backed cache). Subsequent rollouts are fast.
- Start Minikube with headroom: `minikube start --memory=12000 --cpus=4`
- Apply PVC for Ollama models: `kubectl apply -f k8s/pvc.yaml`
- Build + load + rollout: `make release` (does: docker build → minikube image load → apply svc/deploy → set image → rollout status → print URL)
- Open the printed URL. Watch the version pill.
- Do another release (and watch the zero-downtime flip): `VERSION=demo-$(date +%s) make release`

Ollama sidecar notes
- Sidecar image: ollama/ollama listening on :11434
- Models persisted in PVC ollama-models
- App talks to http://127.0.0.1:11434 inside the Pod
- Optional env on sidecar: OLLAMA_PULL_MODELS="gemma3:270m smollm:135m" to auto-pull at start
- On CPU, serialize requests: OLLAMA_NUM_PARALLEL=1
| Var | Default | Purpose |
|---|---|---|
| `ADDR` | `:8080` | HTTP bind address |
| `LOG_LEVEL` | `info` | `debug` \| `info` \| `warn` \| `error` |
| `LOG_JSON` | `true` | JSON logs (set `false` for pretty text) |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API base (or `http://127.0.0.1:11434` for sidecar) |
| `OLLAMA_WAIT` | `true` | On startup, wait for Ollama/models. Set `false` for local dev. |
| `OLLAMA_WAIT_TIMEOUT` | `180s` | Max time to wait before continuing anyway |
| `OLLAMA_WAIT_INTERVAL` | `2s` | Poll frequency during startup wait |
| `OLLAMA_WAIT_MODELS` | `"gemma3:270m smollm:135m deepseek-r1:1.5b"` | Space-separated list of models to wait for |
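A plausible way to load a subset of the table above; the `Config` struct and helper names are assumptions for illustration, not the repo's actual config package:

```go
package config

import (
	"os"
	"time"
)

// Config mirrors some of the environment variables in the table above.
type Config struct {
	Addr           string
	OllamaBaseURL  string
	OllamaWait     bool
	OllamaTimeout  time.Duration
	OllamaInterval time.Duration
}

// getenv returns the environment value or the documented default.
func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// Load reads settings with the same defaults documented above.
func Load() Config {
	timeout, err := time.ParseDuration(getenv("OLLAMA_WAIT_TIMEOUT", "180s"))
	if err != nil {
		timeout = 180 * time.Second
	}
	interval, err := time.ParseDuration(getenv("OLLAMA_WAIT_INTERVAL", "2s"))
	if err != nil {
		interval = 2 * time.Second
	}
	return Config{
		Addr:           getenv("ADDR", ":8080"),
		OllamaBaseURL:  getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
		OllamaWait:     getenv("OLLAMA_WAIT", "true") == "true",
		OllamaTimeout:  timeout,
		OllamaInterval: interval,
	}
}
```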
- `POST /api/chat` → chat with the selected model; body: `{ "model": "gemma3:270m", "message": "Hello!" }`
- `GET /api/models` → `["gemma3:270m", "smollm:135m", "deepseek-r1:1.5b", ...]`
- `GET /api/history/:session_id` → chat transcript (in-memory)
- `POST /admin/models/pull` → body: `{ "name": "gemma3:270m" }` (optional admin)
- `GET /version` → `{ "version": "...", "commit": "...", "built_at": "..." }`
UI endpoints

- `GET /` → chat UI
- `POST /ui/chat` → HTMX post (returns user + assistant bubbles)
- `POST /ui/session/new` → creates a new session (via HX-Redirect)
- `GET /ui/version-pill` → HTMX fragment for the version pill (polled by a non-swapped element every 120s)
Production-flavored design, real operational concerns addressed, clean Go layering, elegant UI without SPA, and a compelling rolling-update strategy.
- Zero-downtime in 60 seconds: run `make release` twice and point at the version pill. No 5xx, pods roll gracefully, traffic keeps flowing.
- Sidecar runtime pattern: clean in-pod service-to-service without extra networking; PVC-backed model cache; probes; resource requests/limits; startup wait with timeout.
- Safe, server-rendered UI: a modern-feeling app without a SPA; HTMX requests, partial swaps, Markdown rendered on the server, sanitized with bluemonday.
- Thoughtful DX: Air hot reload wired with -ldflags; Tailwind build baked into Docker; Distroless runtime; request IDs; access logs; panic recovery.
- Pragmatic fallbacks: if Ollama isn't reachable, the app still works, preserving the demo and UX.
- In-memory sessions only (no DB yet): the sidebar lists current-run chats; a restart loses history.
- No token streaming yet (responses are returned whole; no SSE/WebSocket).
- No auth/tenancy: endpoints are open; fine for demos, not for production.
- Basic backpressure: no rate limiting; rely on ingress/gateway if needed.
- CPU inference defaults: small models recommended; larger models need GPU/tuning.
- Model pulls can be long: use longer timeouts or a background pull job for big models.
These choices keep the demo lean and make the zero-downtime story the star.
- Persistence: Postgres for sessions/users/model metadata; S3 for attachments.
- Streaming responses: SSE/WebSocket with token-by-token updates + cancelation.
- Observability: Prometheus metrics (latency, tokens/sec), OpenTelemetry traces, Grafana dashboards.
- Auth & RBAC: login, API keys, roles; protect admin endpoints.
- Background workers: queued model pulls, warm-ups, summarization jobs.
- GPU & autoscaling: HPA on latency/QPS; dedicated GPU node pool; model-aware scheduling.
- Config & GitOps: ConfigMaps for model lists, ArgoCD, image updater; canary rollouts.
- Testing: e2e smoke (health, version, chat flow), load testing (vegeta/k6), chaos (pod kill) to prove resilience.
- Ensure `web/static/dist/app.css` exists (`npm run tw:prod`).
- The static file server must point at `web/static`; link `/static/app.css`.
- Don't use `hx-trigger="load, every …"` with `hx-swap="outerHTML"` on the same element.
- Use a hidden poller targeting `#version-pill`, or an out-of-band (OOB) swap.
- Initialize via `NewMemoryStore()` or guard with `ensure()` before writes; see the sketch after these bullets.
- Pass the store by pointer across the app.
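A minimal sketch of such a guarded store, reusing the `NewMemoryStore()` / `ensure()` names from the bullets above; the fields and `Message` type are assumptions:

```go
package session

import "sync"

// Message is a placeholder for whatever the chat transcript stores.
type Message struct {
	Role, Content string
}

// MemoryStore keeps per-session transcripts for the lifetime of the process.
type MemoryStore struct {
	mu       sync.Mutex
	sessions map[string][]Message
}

// NewMemoryStore returns a store with its map already initialized,
// so callers can't hit a nil-map write.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{sessions: make(map[string][]Message)}
}

// ensure guards against a zero-value store being used before initialization.
func (s *MemoryStore) ensure() {
	if s.sessions == nil {
		s.sessions = make(map[string][]Message)
	}
}

// Append records a message; always share *MemoryStore by pointer so every
// handler writes to the same map.
func (s *MemoryStore) Append(sessionID string, m Message) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ensure()
	s.sessions[sessionID] = append(s.sessions[sessionID], m)
}
```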
- Use a no-timeout HTTP client or a long-lived context for `/api/pull`.
- Consider pulling via a sidecar `postStart` hook or a background job.
- If it is the first call after a model load, retry with a small backoff.
- Keep `options` light on CPU: `num_ctx: 512`, `num_predict: 128`, `num_thread: 4`, `num_gpu_layers: 0` (see the sketch below).
- Increase the server `WriteTimeout` (≥ 3–5 min); serialize requests: `OLLAMA_NUM_PARALLEL=1`.
- Check sidecar resources (CPU/mem) and the Minikube size.
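For reference, a non-streaming Ollama `/api/generate` call with CPU-friendly options might look like this; a sketch only, since the app's actual request payload and option set may differ:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// generateOnce sends a single non-streaming generate request to Ollama with
// conservative options; the generous client timeout covers slow first tokens on CPU.
func generateOnce(baseURL, model, prompt string) (*http.Response, error) {
	payload, err := json.Marshal(map[string]any{
		"model":  model,
		"prompt": prompt,
		"stream": false,
		"options": map[string]any{
			"num_ctx":     512, // small context keeps CPU inference usable
			"num_predict": 128,
			"num_thread":  4,
		},
	})
	if err != nil {
		return nil, err
	}
	client := &http.Client{Timeout: 5 * time.Minute}
	return client.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(payload))
}
```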
- Ensure the polling element is not itself swapped.
- The `/ui/version-pill` endpoint must send `Cache-Control: no-store` (sketched below).
- The poll interval is set to `every 120s`.
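A sketch of a pill handler satisfying those constraints; the markup and Tailwind classes are illustrative, not the repo's template:

```go
package ui

import (
	"fmt"
	"net/http"
)

// versionPillHandler returns the tiny HTMX fragment that shows the running
// build. Cache-Control: no-store ensures the 120s poll sees each new rollout.
func versionPillHandler(version string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", "no-store")
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprintf(w, `<span id="version-pill" class="rounded-full px-2 py-1 text-xs">%s</span>`, version)
	}
}
```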
- Confirm `readinessProbe` paths and ports.
- `maxUnavailable: 0` requires enough capacity for `maxSurge: 1`.
- Check PVC events if the sidecar waits on the model cache.