Fix Docker health checks & improve nginx stability #43
Conversation
- Fix health check resolution: use `127.0.0.1` instead of `localhost` in Alpine
- Add health checks for api-gateway and redis-broker services
- Reorganize docker-compose.yml into architectural layers (infra, backend, gateway, proxy)
- Improve health check template: retries 5, start_period 30s
- Add depends_on with service_healthy for critical dependencies
- Refactor nginx configuration with comprehensive documentation
- Remove HTTP/2 (WebSocket incompatibility) and DNS resolver (502 source)
- Add upstream declarations with keepalive for backend stability
- Add explicit timeouts (10s/60s/3600s) on all proxy locations
- Ensure cookie transmission for authentication (proxy_pass_header Set-Cookie)
- Document all architectural decisions and technical choices
- Add envs Makefile target to create .env files from examples
- Add .gitignore entries for TLS certificates
- Create https-internal-guide.md for future HTTPS end-to-end implementation
- Add scripts/generate-internal-certs.sh for internal CA generation (sketched below)

Resolves intermittent 502 errors, cookie transmission issues, and unhealthy container states. Architecture now follows clear layering: redis → users → auth → gateway → nginx
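The CA script itself is not part of this diff view; the following is only a hypothetical sketch of what an internal-CA generation script could look like (file names, key sizes and CN values are assumptions, not the actual content of scripts/generate-internal-certs.sh):

```sh
#!/bin/sh
# Hypothetical sketch of an internal CA generation script (not the PR's actual
# scripts/generate-internal-certs.sh). All paths and subjects are illustrative.
set -e

# 1. Internal CA: private key + self-signed root certificate
openssl genrsa -out ca.key 4096
openssl req -x509 -new -key ca.key -sha256 -days 365 \
    -subj "/CN=internal-ca" -out ca.crt

# 2. Key + CSR for one internal service, signed by the CA
openssl genrsa -out api-gateway.key 2048
openssl req -new -key api-gateway.key -subj "/CN=api-gateway" -out api-gateway.csr
openssl x509 -req -in api-gateway.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
    -sha256 -days 365 -out api-gateway.crt
```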
Pull request overview
This PR addresses critical stability issues in the Docker and nginx infrastructure by fixing health checks, improving nginx configuration, and enhancing cookie/WebSocket handling. The changes focus on resolving intermittent 502 Bad Gateway errors and unhealthy container states.
Key changes:
- Health checks: Switched from `localhost` to `127.0.0.1` for Alpine Linux compatibility, increased retries (3→5) and start_period (10s→30s) (sketched below)
- Nginx stability: Introduced upstream blocks with keepalive pools, removed the problematic DNS resolver, disabled HTTP/2 for WebSocket compatibility, added explicit timeouts
- Architecture: Reorganized docker-compose into clear layers (Infrastructure → Backend → Gateway → Proxy) with proper dependency chains
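As an illustration, a minimal sketch of the standardized health check shape in docker-compose.yml (the interval/timeout values and the /health endpoint are assumptions; the PR itself only specifies 127.0.0.1, retries: 5 and start_period: 30s):

```yaml
services:
  api-gateway:
    healthcheck:
      # 127.0.0.1 instead of localhost: localhost resolution in Alpine images
      # can be inconsistent, making the check flaky.
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://127.0.0.1:3000/health"]
      interval: 10s        # assumed
      timeout: 5s          # assumed
      retries: 5           # was 3
      start_period: 30s    # was 10s

  redis-broker:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      retries: 5
      start_period: 30s
```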
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| srcs/nginx/nginx.conf | Added upstream blocks (api_gateway, user_service) with keepalive pools; enhanced comments explaining performance settings |
| srcs/nginx/conf.d/default.conf | Complete rewrite with explicit timeouts, cookie transmission directives, removed DNS resolver, disabled HTTP/2, hardcoded https in X-Forwarded-Proto |
| srcs/nginx/README.md | Comprehensive new documentation (359 lines) covering architecture, configuration decisions, debugging, security, and monitoring |
| srcs/docker-compose.yml | Restructured services into layers, standardized health checks using 127.0.0.1, updated retries/timeouts, added redis-broker and api-gateway health checks |
| Makefile | New envs target to generate .env files from examples and synchronize JWT_SECRET across all services using openssl |
```makefile
envs:
	@echo "Creating .env files from examples…"
	@secret=$$(openssl rand -hex 32); \
	for f in srcs/.env*.example; do \
		envfile=$${f%.example}; \
		if [ ! -f "$$envfile" ]; then \
			cp "$$f" "$$envfile"; \
			echo "Copied $$f → $$envfile"; \
		else \
			echo "$$envfile already exists"; \
		fi; \
		if grep -q '^JWT_SECRET=' "$$envfile" 2>/dev/null; then \
			sed -i.bak "s|^JWT_SECRET=.*|JWT_SECRET=$$secret|" "$$envfile"; \
			rm -f "$$envfile.bak"; \
			echo "Set JWT_SECRET in $$envfile"; \
		fi; \
	done; \
	echo "JWT_SECRET applied to all files"
```
Copilot AI · Jan 8, 2026
The JWT_SECRET is generated once and applied to all .env files. While this ensures consistency across services (which is correct for JWT validation), the implementation has a security consideration:
The secret is stored in a shell variable and echoed to output. In shared/monitored environments, this could expose the secret in process listings or logs. Additionally, all .env files receive the same secret, which is stored in plaintext.
Consider:
- Suppressing the echo of "JWT_SECRET applied to all files" or being more specific about which files (without revealing the secret)
- Adding a comment in the code explaining why all services need the same JWT_SECRET (for token validation across the auth/gateway boundary)
- Ensuring the .env files are properly added to .gitignore
This is acceptable for a school project, but in production you'd use a secrets management system (HashiCorp Vault, AWS Secrets Manager, etc.).
```nginx
upstream api_gateway {
    server api-gateway:3000;
    keepalive 32;  # Pool of 32 keepalive connections to this backend
}

upstream user_service {
    server user-service:3002;
    keepalive 16;  # Pool of 16 keepalive connections to this backend
}
```
Copilot AI · Jan 8, 2026
The upstream configuration declares keepalive pools (32 for api_gateway, 16 for user_service), but there's a missing configuration to fully enable HTTP keepalive to the backend.
According to the Nginx documentation, when using keepalive with upstream blocks, you should also consider the backend's keepalive timeout. More importantly, to fully utilize the keepalive connections, the backend services should support HTTP/1.1 keepalive connections.
While proxy_http_version 1.1 is set in the location blocks (which is correct), you might want to document in the README that the backend services (api-gateway:3000 and user-service:3002) must have keepalive enabled and should have keepalive timeouts >= nginx's timeout to avoid connection churn.
Additionally, the keepalive pool size of 32/16 should be tuned based on expected concurrent connections. For a school project with potentially few concurrent users, these values are reasonable, but in production you'd monitor and adjust based on actual traffic patterns.
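As a hedged illustration of that tuning advice, the upstream-side keepalive parameters could be made explicit (the values are assumptions; keepalive_timeout and keepalive_requests are accepted inside upstream blocks in nginx 1.15.3 and later):

```nginx
upstream api_gateway {
    server api-gateway:3000;
    keepalive 32;             # idle connections kept open per worker
    keepalive_timeout 60s;    # keep this <= the backend's own keep-alive timeout
    keepalive_requests 1000;  # recycle a connection after this many requests
}
```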
```nginx
location /public/ {
    # Rewrite: /public/file.jpg → /file.jpg on the backend side
    rewrite ^/public/(.*)$ /$1 break;
    proxy_pass http://api_gateway;
    proxy_http_version 1.1;

    # Standard headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Origin https://$host;

    # Security
    proxy_set_header x-user-name "";
    proxy_set_header x-user-id "";
    proxy_set_header Upgrade "";

    # Cookie transmission
    proxy_pass_header Set-Cookie;
    proxy_cookie_path / /;

    # Standard timeouts
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;
}
```
Copilot AI · Jan 8, 2026
Missing Connection "" header directive for locations using keepalive upstreams.
Similar to the /api/ location, this location proxies to api_gateway which has a keepalive pool configured. To properly utilize the keepalive connections, you should add:
```nginx
proxy_set_header Connection "";
```
This is especially important here since you're explicitly clearing the Upgrade header. Without clearing Connection as well, the default "close" behavior might be used, defeating the purpose of the upstream keepalive pool.
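A minimal sketch of the suggested pattern, with the rest of the location abbreviated:

```nginx
location /public/ {
    rewrite ^/public/(.*)$ /$1 break;
    proxy_pass http://api_gateway;
    proxy_http_version 1.1;          # required for upstream keepalive
    proxy_set_header Connection "";  # do not forward "close"; reuse pooled connections
    proxy_set_header Upgrade "";     # this location does not handle WebSocket upgrades
    # ... remaining headers, cookies and timeouts as in the PR
}
```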
```nginx
tcp_nopush on;   # Send headers in a single packet
tcp_nodelay on;  # Disable Nagle's algorithm (low latency)
```
Copilot AI · Jan 8, 2026
There's a potential configuration conflict between tcp_nopush on (line 32) and tcp_nodelay on (line 33).
The reasoning:
- `tcp_nopush` enables TCP_CORK (send headers in one packet) - optimizes for throughput
- `tcp_nodelay` disables Nagle's algorithm - optimizes for latency

These two settings have opposite goals and can conflict. According to nginx documentation, nginx handles this by:
- Using `tcp_nopush` during file sending (with sendfile)
- Using `tcp_nodelay` for keepalive connections

While nginx manages this intelligently, it's worth understanding the trade-off:
- For static file serving: `tcp_nopush` is beneficial
- For real-time proxying (especially WebSocket): `tcp_nodelay` is critical
Your configuration is acceptable and follows common nginx practices. The combination works because nginx applies them at different stages of the connection lifecycle. However, for a pedagogical project, consider documenting this nuance in comments to show understanding of the trade-offs.
Reference: nginx HTTP core module documentation explains how these interact.
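One possible way to document that nuance directly in the config, as suggested above (a sketch; `sendfile on` is assumed here since `tcp_nopush` only takes effect together with sendfile):

```nginx
sendfile    on;   # kernel-level file transmission for static responses (assumed present)
tcp_nopush  on;   # with sendfile: cork output so headers and file start share one packet (throughput)
tcp_nodelay on;   # on keepalive/proxied connections: disable Nagle's algorithm (latency)
# nginx applies tcp_nopush while sending files and tcp_nodelay on keepalive
# connections, so the two directives act at different stages rather than conflicting.
```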
```makefile
			sed -i.bak "s|^JWT_SECRET=.*|JWT_SECRET=$$secret|" "$$envfile"; \
			rm -f "$$envfile.bak"; \
			echo "Set JWT_SECRET in $$envfile"; \
		fi; \
	done; \
```
Copilot AI · Jan 8, 2026
The sed -i.bak command creates backup files with .bak extension, and while these are removed with rm -f "$$envfile.bak", this approach has a portability issue.
On macOS (BSD sed), -i requires an extension argument (-i.bak), but on GNU/Linux sed, the backup extension is optional and the syntax differs. The current implementation works on macOS but may behave differently on Linux systems.
Consider using a more portable approach:
- Use `sed -i'' -e` for GNU sed compatibility, or
- Test for the OS type and use appropriate syntax, or
- Use a temporary file with `mv` for cross-platform compatibility

Example portable approach (suggested change):

```makefile
			sed "s|^JWT_SECRET=.*|JWT_SECRET=$$secret|" "$$envfile" > "$$envfile.tmp" && mv "$$envfile.tmp" "$$envfile"; \
			echo "Set JWT_SECRET in $$envfile"; \
		fi; \
	done; \
```
- The dynamic resolver can re-evaluate IPs every 5s
- If a container restarts, Nginx keeps the old IP
- Result: random 502s until Nginx is reloaded
Copilot AI · Jan 8, 2026
In the documentation you mention that the commented resolver was "source de 502 intermittents" (source of intermittent 502 errors). However, the explanation in the README (lines 236-239) about why the resolver causes issues could be more technically accurate.
The actual issue with dynamic DNS resolution in Docker is:
- When using `resolver` with `proxy_pass` to a variable (like `$backend`), nginx re-resolves on every request or based on the `valid` parameter
- With `proxy_pass http://service-name:port` (without resolver), nginx resolves once at config load/reload
- With `proxy_pass http://upstream_name` (using an upstream block), nginx resolves at startup and uses the upstream's internal connection pooling
The current explanation suggests nginx "garde l'ancienne IP" (keeps the old IP) which is slightly misleading - with upstream blocks and no resolver, nginx doesn't re-resolve at all, which is actually the desired behavior for stable service discovery in Docker's internal DNS.
The decision to remove the resolver is correct, but the technical explanation could be refined for educational purposes.
Suggested change:
- With `resolver` + `proxy_pass` to a **variable**, Nginx re-resolves DNS on every request (or per `valid=5s`)
- In a Docker network, the resolution can return **no IP**, or a temporarily unreachable one, while a container restarts
- Result: intermittent 502s; by removing the resolver and using named `upstream` blocks, Nginx resolves once at (re)load and stays stable
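To make the distinction concrete, a hedged sketch of the two patterns (the two location blocks are alternatives, not meant to coexist; whether the old config actually used a variable with proxy_pass is not shown in this diff, and 127.0.0.11 is Docker's embedded DNS resolver):

```nginx
# Pattern removed by this PR: resolver + proxy_pass to a variable.
# DNS is re-resolved per request (or per valid=5s), which can briefly return
# no IP or an unreachable one while a container restarts.
resolver 127.0.0.11 valid=5s;
location /api/ {
    set $backend http://api-gateway:3000;
    proxy_pass $backend;
}

# Pattern kept by this PR: named upstream, resolved once at (re)load, stable.
upstream api_gateway {
    server api-gateway:3000;
    keepalive 32;
}
location /api/ {
    proxy_pass http://api_gateway;
}
```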
```nginx
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
```
Copilot AI · Jan 8, 2026
The hardcoded X-Forwarded-Proto https value on line 54 (and elsewhere) assumes all traffic is HTTPS. While this is true for your current setup where all external traffic comes through the HTTPS server block, using the variable $scheme would be more flexible and semantically correct.
However, there's a subtle consideration here: since you have an HTTP server block (lines 7-13 in default.conf) that redirects to HTTPS, and the proxy locations are only in the HTTPS server block, the hardcoded https value is technically correct and even preferable. It ensures the backend always knows the original protocol was HTTPS, even if there are future internal HTTP connections.
This is actually a good practice for your architecture - just ensure it's documented that these proxy locations should only exist in the HTTPS server block, not in any HTTP block.
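For reference, the two options side by side (a sketch; which one to prefer depends on whether the location can ever be reached over plain HTTP):

```nginx
# Flexible: reflects the scheme of the incoming connection
proxy_set_header X-Forwarded-Proto $scheme;

# Hardcoded (as in this PR): correct as long as these proxy locations
# only exist in the HTTPS server block
proxy_set_header X-Forwarded-Proto https;
```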
```nginx
location ^~ /users/doc/ {
    proxy_pass http://user_service/doc/;
    proxy_http_version 1.1;

    # Standard headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;

    # Security: strip headers forged by the client
    proxy_set_header x-user-name "";
    proxy_set_header x-user-id "";
    proxy_set_header Upgrade "";

    proxy_pass_header Set-Cookie;
    proxy_cookie_path / /;

    # Explicit timeouts
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;
}
```
Copilot AI · Jan 8, 2026
Similarly, this location should also include proxy_set_header Connection ""; for proper keepalive usage with the user_service upstream.
While you're setting Upgrade "" on line 59, explicitly setting Connection to empty ensures the keepalive pool (configured as 16 connections in nginx.conf) is properly utilized.
```nginx
location /api/ {
    proxy_pass http://api_gateway;
    proxy_http_version 1.1;

    # Standard headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Origin https://$host;

    # Security: strip headers forged by the client
    proxy_set_header x-user-name "";
    proxy_set_header x-user-id "";
    proxy_set_header Upgrade "";

    # Cookie transmission
    proxy_pass_header Set-Cookie;
    proxy_cookie_path / /;

    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;

    proxy_pass_request_body on;
    proxy_pass_request_headers on;
}
```
Copilot AI · Jan 8, 2026
When using upstream blocks with keepalive connections, you need to clear the Connection header to prevent it from being passed to the backend. Without this, the keepalive pool won't be used effectively.
According to nginx documentation on the proxy module, when using keepalive with upstreams, you should add:
```nginx
proxy_set_header Connection "";
```
The Upgrade header is already cleared here (line 85), but the Connection header should be managed explicitly as well: setting it to an empty value when no WebSocket upgrade is needed ensures the keepalive pool configured in the upstream is actually used.
codastream left a comment
I wasn't able to clone and check stability on the points Copilot raised, in particular the missing Connection: keep-alive header (it might be worth adding it).
For the build: checked on a Codespace, which should be an Ubuntu environment: the sed syntax seems to work there.
The PR brings quite a lot while staying well scoped.
We will probably touch this config file again, in particular to route the docs under a single rule.
```nginx
}

# Swagger UI - Users service documentation
location ^~ /users/doc/ {
```
In a future documentation ticket we could add a declaration
`location ^~ /public/doc`
that would reroute, for example, /doc/users to /users/doc
Cool, especially the debugging part.
We can add the Copilot suggestions:
- upstream services should have a timeout greater than nginx's (keepalive_timeout 65)

Would someone like to write an English summary of the main Nginx configuration concepts from the wiki? In particular the optimization part (the tcp_nopush / tcp_nodelay distinction, how the resolver works).
🎯 Objective
Fix the intermittent service startup problems (502 Bad Gateway, unhealthy containers) and improve the overall stability of the reverse proxy architecture.
Problems solved
Health Checks
- Containers stuck `unhealthy` indefinitely
- `localhost` does not resolve correctly in Alpine Linux

Nginx
Solutions implemented
1. Docker Compose Health Checks
Changes:
- `127.0.0.1` instead of `localhost`
- Health checks added for `api-gateway` (HTTP) and `redis-broker` (`redis-cli ping`)
- `retries: 5`, `start_period: 30s`

Optimized dependencies (see the sketch below):
- `game-service` waits for `redis-broker` to be healthy
- `auth-service` waits for `user-service` to be healthy
- `api-gateway` waits for `auth-service` to be healthy
- `nginx-proxy` waits for `api-gateway` and `user-service` to be healthy
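A minimal sketch of that dependency chain in docker-compose.yml (only the depends_on wiring is shown; all other service settings are omitted):

```yaml
services:
  game-service:
    depends_on:
      redis-broker:
        condition: service_healthy
  auth-service:
    depends_on:
      user-service:
        condition: service_healthy
  api-gateway:
    depends_on:
      auth-service:
        condition: service_healthy
  nginx-proxy:
    depends_on:
      api-gateway:
        condition: service_healthy
      user-service:
        condition: service_healthy
```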
2. Nginx Configuration
Complete documentation: `srcs/nginx/README.md`
Stability fixes:
- `proxy_pass_header Set-Cookie; proxy_cookie_path / /;`
- Client-forged headers stripped (`x-user-name`, `x-user-id`)

Location structure: