Fix Docker health checks & improve nginx stability #43
Conversation
- Fix health check resolution: use `127.0.0.1` instead of `localhost` in Alpine
- Add health checks for api-gateway and redis-broker services
- Reorganize docker-compose.yml into architectural layers (infra, backend, gateway, proxy)
- Improve health check template: retries 5, start_period 30s
- Add depends_on with service_healthy for critical dependencies
- Refactor nginx configuration with comprehensive documentation
- Remove HTTP/2 (WebSocket incompatibility) and DNS resolver (502 source)
- Add upstream declarations with keepalive for backend stability
- Add explicit timeouts (10s/60s/3600s) on all proxy locations
- Ensure cookie transmission for authentication (proxy_pass_header Set-Cookie)
- Document all architectural decisions and technical choices
- Add envs Makefile target to create .env files from examples
- Add .gitignore entries for TLS certificates
- Create https-internal-guide.md for future HTTPS end-to-end implementation
- Add scripts/generate-internal-certs.sh for internal CA generation (sketched below)

Resolves intermittent 502 errors, cookie transmission issues, and unhealthy container states. Architecture now follows clear layering: redis → users → auth → gateway → nginx
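The CA script itself is not part of this diff view; the following is only a hypothetical sketch of what an internal-CA generation script could look like (file names, key sizes and CN values are assumptions, not the actual content of scripts/generate-internal-certs.sh):

```sh
#!/bin/sh
# Hypothetical sketch of an internal CA generation script (not the PR's actual
# scripts/generate-internal-certs.sh). All paths and subjects are illustrative.
set -e

# 1. Internal CA: private key + self-signed root certificate
openssl genrsa -out ca.key 4096
openssl req -x509 -new -key ca.key -sha256 -days 365 \
    -subj "/CN=internal-ca" -out ca.crt

# 2. Key + CSR for one internal service, signed by the CA
openssl genrsa -out api-gateway.key 2048
openssl req -new -key api-gateway.key -subj "/CN=api-gateway" -out api-gateway.csr
openssl x509 -req -in api-gateway.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
    -sha256 -days 365 -out api-gateway.crt
```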
Pull request overview
This PR addresses critical stability issues in the Docker and nginx infrastructure by fixing health checks, improving nginx configuration, and enhancing cookie/WebSocket handling. The changes focus on resolving intermittent 502 Bad Gateway errors and unhealthy container states.
Key changes:
- Health checks: Switched from `localhost` to `127.0.0.1` for Alpine Linux compatibility, increased retries (3→5) and start_period (10s→30s) (sketched below)
- Nginx stability: Introduced upstream blocks with keepalive pools, removed the problematic DNS resolver, disabled HTTP/2 for WebSocket compatibility, added explicit timeouts
- Architecture: Reorganized docker-compose into clear layers (Infrastructure → Backend → Gateway → Proxy) with proper dependency chains
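As an illustration, a minimal sketch of the standardized health check shape in docker-compose.yml (the interval/timeout values and the /health endpoint are assumptions; the PR itself only specifies 127.0.0.1, retries: 5 and start_period: 30s):

```yaml
services:
  api-gateway:
    healthcheck:
      # 127.0.0.1 instead of localhost: localhost resolution in Alpine images
      # can be inconsistent, making the check flaky.
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://127.0.0.1:3000/health"]
      interval: 10s        # assumed
      timeout: 5s          # assumed
      retries: 5           # was 3
      start_period: 30s    # was 10s

  redis-broker:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      retries: 5
      start_period: 30s
```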
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| srcs/nginx/nginx.conf | Added upstream blocks (api_gateway, user_service) with keepalive pools; enhanced comments explaining performance settings |
| srcs/nginx/conf.d/default.conf | Complete rewrite with explicit timeouts, cookie transmission directives, removed DNS resolver, disabled HTTP/2, hardcoded https in X-Forwarded-Proto |
| srcs/nginx/README.md | Comprehensive new documentation (359 lines) covering architecture, configuration decisions, debugging, security, and monitoring |
| srcs/docker-compose.yml | Restructured services into layers, standardized health checks using 127.0.0.1, updated retries/timeouts, added redis-broker and api-gateway health checks |
| Makefile | New envs target to generate .env files from examples and synchronize JWT_SECRET across all services using openssl |
```makefile
envs:
	@echo "Creating .env files from examples…"
	@secret=$$(openssl rand -hex 32); \
	for f in srcs/.env*.example; do \
		envfile=$${f%.example}; \
		if [ ! -f "$$envfile" ]; then \
			cp "$$f" "$$envfile"; \
			echo "Copied $$f → $$envfile"; \
		else \
			echo "$$envfile already exists"; \
		fi; \
		if grep -q '^JWT_SECRET=' "$$envfile" 2>/dev/null; then \
			sed -i.bak "s|^JWT_SECRET=.*|JWT_SECRET=$$secret|" "$$envfile"; \
			rm -f "$$envfile.bak"; \
			echo "Set JWT_SECRET in $$envfile"; \
		fi; \
	done; \
	echo "JWT_SECRET applied to all files"
```
Copilot AI · Jan 8, 2026
The JWT_SECRET is generated once and applied to all .env files. While this ensures consistency across services (which is correct for JWT validation), the implementation has a security consideration:
The secret is stored in a shell variable and echoed to output. In shared/monitored environments, this could expose the secret in process listings or logs. Additionally, all .env files receive the same secret, which is stored in plaintext.
Consider:
- Suppressing the echo of "JWT_SECRET applied to all files" or being more specific about which files (without revealing the secret)
- Adding a comment in the code explaining why all services need the same JWT_SECRET (for token validation across the auth/gateway boundary)
- Ensuring the .env files are properly added to .gitignore
This is acceptable for a school project, but in production you'd use a secrets management system (HashiCorp Vault, AWS Secrets Manager, etc.).
```nginx
upstream api_gateway {
    server api-gateway:3000;
    keepalive 32;  # Pool of 32 keepalive connections to this backend
}

upstream user_service {
    server user-service:3002;
    keepalive 16;  # Pool of 16 keepalive connections to this backend
}
```
Copilot AI · Jan 8, 2026
The upstream configuration declares keepalive pools (32 for api_gateway, 16 for user_service), but there's a missing configuration to fully enable HTTP keepalive to the backend.
According to the Nginx documentation, when using keepalive with upstream blocks, you should also consider the backend's keepalive timeout. More importantly, to fully utilize the keepalive connections, the backend services should support HTTP/1.1 keepalive connections.
While proxy_http_version 1.1 is set in the location blocks (which is correct), you might want to document in the README that the backend services (api-gateway:3000 and user-service:3002) must have keepalive enabled and should have keepalive timeouts >= nginx's timeout to avoid connection churn.
Additionally, the keepalive pool size of 32/16 should be tuned based on expected concurrent connections. For a school project with potentially few concurrent users, these values are reasonable, but in production you'd monitor and adjust based on actual traffic patterns.
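As a hedged illustration of that tuning advice, the upstream-side keepalive parameters could be made explicit (the values are assumptions; keepalive_timeout and keepalive_requests are accepted inside upstream blocks in nginx 1.15.3 and later):

```nginx
upstream api_gateway {
    server api-gateway:3000;
    keepalive 32;             # idle connections kept open per worker
    keepalive_timeout 60s;    # keep this <= the backend's own keep-alive timeout
    keepalive_requests 1000;  # recycle a connection after this many requests
}
```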
```nginx
location /public/ {
    # Rewrite: /public/file.jpg → /file.jpg on the backend side
    rewrite ^/public/(.*)$ /$1 break;
    proxy_pass http://api_gateway;
    proxy_http_version 1.1;

    # Standard headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Origin https://$host;

    # Security
    proxy_set_header x-user-name "";
    proxy_set_header x-user-id "";
    proxy_set_header Upgrade "";

    # Cookie transmission
    proxy_pass_header Set-Cookie;
    proxy_cookie_path / /;

    # Standard timeouts
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;
}
```
Copilot AI · Jan 8, 2026
Missing Connection "" header directive for locations using keepalive upstreams.
Similar to the /api/ location, this location proxies to api_gateway which has a keepalive pool configured. To properly utilize the keepalive connections, you should add:
```nginx
proxy_set_header Connection "";
```
This is especially important here since you're explicitly clearing the Upgrade header. Without clearing Connection as well, the default "close" behavior might be used, defeating the purpose of the upstream keepalive pool.
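A minimal sketch of the suggested pattern, with the rest of the location abbreviated:

```nginx
location /public/ {
    rewrite ^/public/(.*)$ /$1 break;
    proxy_pass http://api_gateway;
    proxy_http_version 1.1;          # required for upstream keepalive
    proxy_set_header Connection "";  # do not forward "close"; reuse pooled connections
    proxy_set_header Upgrade "";     # this location does not handle WebSocket upgrades
    # ... remaining headers, cookies and timeouts as in the PR
}
```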
```nginx
tcp_nopush on;   # Send headers in a single packet
tcp_nodelay on;  # Disable Nagle's algorithm (low latency)
```
Copilot AI · Jan 8, 2026
There's a potential configuration conflict between tcp_nopush on (line 32) and tcp_nodelay on (line 33).
The reasoning:
- `tcp_nopush` enables TCP_CORK (send headers in one packet) - optimizes for throughput
- `tcp_nodelay` disables Nagle's algorithm - optimizes for latency

These two settings have opposite goals and can conflict. According to nginx documentation, nginx handles this by:
- Using `tcp_nopush` during file sending (with sendfile)
- Using `tcp_nodelay` for keepalive connections

While nginx manages this intelligently, it's worth understanding the trade-off:
- For static file serving: `tcp_nopush` is beneficial
- For real-time proxying (especially WebSocket): `tcp_nodelay` is critical
Your configuration is acceptable and follows common nginx practices. The combination works because nginx applies them at different stages of the connection lifecycle. However, for a pedagogical project, consider documenting this nuance in comments to show understanding of the trade-offs.
Reference: nginx HTTP core module documentation explains how these interact.
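One possible way to document that nuance directly in the config, as suggested above (a sketch; `sendfile on` is assumed here since `tcp_nopush` only takes effect together with sendfile):

```nginx
sendfile    on;   # kernel-level file transmission for static responses (assumed present)
tcp_nopush  on;   # with sendfile: cork output so headers and file start share one packet (throughput)
tcp_nodelay on;   # on keepalive/proxied connections: disable Nagle's algorithm (latency)
# nginx applies tcp_nopush while sending files and tcp_nodelay on keepalive
# connections, so the two directives act at different stages rather than conflicting.
```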
```makefile
			sed -i.bak "s|^JWT_SECRET=.*|JWT_SECRET=$$secret|" "$$envfile"; \
			rm -f "$$envfile.bak"; \
			echo "Set JWT_SECRET in $$envfile"; \
		fi; \
	done; \
```
Copilot AI · Jan 8, 2026
The sed -i.bak command creates backup files with .bak extension, and while these are removed with rm -f "$$envfile.bak", this approach has a portability issue.
On macOS (BSD sed), -i requires an extension argument (-i.bak), but on GNU/Linux sed, the backup extension is optional and the syntax differs. The current implementation works on macOS but may behave differently on Linux systems.
Consider using a more portable approach:
- Use `sed -i'' -e` for GNU sed compatibility, or
- Test for the OS type and use appropriate syntax, or
- Use a temporary file with `mv` for cross-platform compatibility

Example portable approach (suggested change):

```makefile
			sed "s|^JWT_SECRET=.*|JWT_SECRET=$$secret|" "$$envfile" > "$$envfile.tmp" && mv "$$envfile.tmp" "$$envfile"; \
			echo "Set JWT_SECRET in $$envfile"; \
		fi; \
	done; \
```
- The dynamic resolver can re-evaluate IPs every 5s
- If a container restarts, Nginx keeps the old IP
- Result: random 502s until Nginx is reloaded
Copilot AI · Jan 8, 2026
In the documentation you mention that the commented resolver was "source de 502 intermittents" (source of intermittent 502 errors). However, the explanation in the README (lines 236-239) about why the resolver causes issues could be more technically accurate.
The actual issue with dynamic DNS resolution in Docker is:
- When using `resolver` with `proxy_pass` to a variable (like `$backend`), nginx re-resolves on every request or based on the `valid` parameter
- With `proxy_pass http://service-name:port` (without resolver), nginx resolves once at config load/reload
- With `proxy_pass http://upstream_name` (using an upstream block), nginx resolves at startup and uses the upstream's internal connection pooling
The current explanation suggests nginx "garde l'ancienne IP" (keeps the old IP) which is slightly misleading - with upstream blocks and no resolver, nginx doesn't re-resolve at all, which is actually the desired behavior for stable service discovery in Docker's internal DNS.
The decision to remove the resolver is correct, but the technical explanation could be refined for educational purposes.
Suggested change:
- With `resolver` + `proxy_pass` to a **variable**, Nginx re-resolves DNS on every request (or per `valid=5s`)
- In a Docker network, the resolution can return **no IP**, or a temporarily unreachable one, while a container restarts
- Result: intermittent 502s; by removing the resolver and using named `upstream` blocks, Nginx resolves once at (re)load and stays stable
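To make the distinction concrete, a hedged sketch of the two patterns (the two location blocks are alternatives, not meant to coexist; whether the old config actually used a variable with proxy_pass is not shown in this diff, and 127.0.0.11 is Docker's embedded DNS resolver):

```nginx
# Pattern removed by this PR: resolver + proxy_pass to a variable.
# DNS is re-resolved per request (or per valid=5s), which can briefly return
# no IP or an unreachable one while a container restarts.
resolver 127.0.0.11 valid=5s;
location /api/ {
    set $backend http://api-gateway:3000;
    proxy_pass $backend;
}

# Pattern kept by this PR: named upstream, resolved once at (re)load, stable.
upstream api_gateway {
    server api-gateway:3000;
    keepalive 32;
}
location /api/ {
    proxy_pass http://api_gateway;
}
```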
```nginx
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
```
Copilot AI · Jan 8, 2026
The hardcoded X-Forwarded-Proto https value on line 54 (and elsewhere) assumes all traffic is HTTPS. While this is true for your current setup where all external traffic comes through the HTTPS server block, using the variable $scheme would be more flexible and semantically correct.
However, there's a subtle consideration here: since you have an HTTP server block (lines 7-13 in default.conf) that redirects to HTTPS, and the proxy locations are only in the HTTPS server block, the hardcoded https value is technically correct and even preferable. It ensures the backend always knows the original protocol was HTTPS, even if there are future internal HTTP connections.
This is actually a good practice for your architecture - just ensure it's documented that these proxy locations should only exist in the HTTPS server block, not in any HTTP block.
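For reference, the two options side by side (a sketch; which one to prefer depends on whether the location can ever be reached over plain HTTP):

```nginx
# Flexible: reflects the scheme of the incoming connection
proxy_set_header X-Forwarded-Proto $scheme;

# Hardcoded (as in this PR): correct as long as these proxy locations
# only exist in the HTTPS server block
proxy_set_header X-Forwarded-Proto https;
```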
```nginx
location ^~ /users/doc/ {
    proxy_pass http://user_service/doc/;
    proxy_http_version 1.1;

    # Standard headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;

    # Security: strip headers forged by the client
    proxy_set_header x-user-name "";
    proxy_set_header x-user-id "";
    proxy_set_header Upgrade "";

    proxy_pass_header Set-Cookie;
    proxy_cookie_path / /;

    # Explicit timeouts
    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;
}
```
Copilot AI · Jan 8, 2026
Similarly, this location should also include proxy_set_header Connection ""; for proper keepalive usage with the user_service upstream.
While you're setting Upgrade "" on line 59, explicitly setting Connection to empty ensures the keepalive pool (configured as 16 connections in nginx.conf) is properly utilized.
```nginx
location /api/ {
    proxy_pass http://api_gateway;
    proxy_http_version 1.1;

    # Standard headers
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header Origin https://$host;

    # Security: strip headers forged by the client
    proxy_set_header x-user-name "";
    proxy_set_header x-user-id "";
    proxy_set_header Upgrade "";

    # Cookie transmission
    proxy_pass_header Set-Cookie;
    proxy_cookie_path / /;

    proxy_connect_timeout 10s;
    proxy_send_timeout 60s;
    proxy_read_timeout 60s;

    proxy_pass_request_body on;
    proxy_pass_request_headers on;
}
```
Copilot AI · Jan 8, 2026
When using upstream blocks with keepalive connections, you need to clear the Connection header to prevent it from being passed to the backend. Without this, the keepalive pool won't be used effectively.
According to nginx documentation on the proxy module, when using keepalive with upstreams, you should add:
```nginx
proxy_set_header Connection "";
```
The Upgrade header is already cleared here (line 85), but the Connection header should be managed explicitly as well: setting it to an empty value when no WebSocket upgrade is needed ensures the keepalive pool configured in the upstream is actually used.
codastream left a comment
I wasn't able to clone and check stability on the points Copilot raised, in particular the missing Connection: keep-alive header (it might be worth adding it).
For the build: checked on a Codespace, which should be an Ubuntu environment: the sed syntax seems to work there.
The PR brings quite a lot while staying well scoped.
We will probably touch this config file again, in particular to route the docs under a single rule.
```nginx
}

# Swagger UI - Users service documentation
location ^~ /users/doc/ {
```
In a future documentation ticket we could add a declaration
`location ^~ /public/doc`
that would reroute, for example, /doc/users to /users/doc
Cool, especially the debugging part.
We can add the Copilot suggestions:
- upstream services should have a timeout greater than nginx's (keepalive_timeout 65)

Would someone like to write an English summary of the main Nginx configuration concepts from the wiki? In particular the optimization part (the tcp_nopush / tcp_nodelay distinction, how the resolver works).
🎯 Objective
Fix the intermittent service startup problems (502 Bad Gateway, unhealthy containers) and improve the overall stability of the reverse proxy architecture.
Problems solved
Health Checks
- Containers stuck `unhealthy` indefinitely
- `localhost` does not resolve correctly in Alpine Linux

Nginx
Solutions implemented
1. Docker Compose Health Checks
Changes:
- `127.0.0.1` instead of `localhost`
- Health checks added for `api-gateway` (HTTP) and `redis-broker` (`redis-cli ping`)
- `retries: 5`, `start_period: 30s`

Optimized dependencies (see the sketch below):
- `game-service` waits for `redis-broker` to be healthy
- `auth-service` waits for `user-service` to be healthy
- `api-gateway` waits for `auth-service` to be healthy
- `nginx-proxy` waits for `api-gateway` and `user-service` to be healthy
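A minimal sketch of that dependency chain in docker-compose.yml (only the depends_on wiring is shown; all other service settings are omitted):

```yaml
services:
  game-service:
    depends_on:
      redis-broker:
        condition: service_healthy
  auth-service:
    depends_on:
      user-service:
        condition: service_healthy
  api-gateway:
    depends_on:
      auth-service:
        condition: service_healthy
  nginx-proxy:
    depends_on:
      api-gateway:
        condition: service_healthy
      user-service:
        condition: service_healthy
```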
2. Nginx Configuration
Complete documentation: `srcs/nginx/README.md`
Stability fixes:
- `proxy_pass_header Set-Cookie; proxy_cookie_path / /;`
- Client-forged headers stripped (`x-user-name`, `x-user-id`)

Location structure: