Feature: Implement Server Mesh Architecture for Massive-Scale Multi-Agent Systems
Author: System Architecture
Date: October 8, 2025
Status: Proposal
1. The Problem: Single Server Bottleneck
Our current single-server architecture cannot scale to our vision of 10,000 simultaneous players in a single game instance. The primary bottleneck is egress network bandwidth.
A single server must broadcast the consolidated game state to every client, every frame. As the player count increases, the required egress bandwidth becomes physically impossible for a single network interface to handle.
Bandwidth Analysis (1,000 Players @ 60Hz):
- Ingress (Client → Server):
  - 1,000 clients × 60 actions/sec × ~100 bytes/action = 6 MB/sec (48 Mbps)
  - This is manageable.
- Egress (Server → Clients):
  - 60 reports/sec × 1,000 recipients × ~50 KB/report = 3 GB/sec (24 Gbps)
  - This is unsustainable. The largest AWS network-optimized instances top out around 50-100 Gbps, and scaling to 10,000 players would require 240 Gbps, which no single instance can provide.
The hard limit is physical: no single network interface can broadcast the full game state to that many clients at the required frequency.
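For reference, a short back-of-the-envelope script reproduces these figures. The 60 Hz rate, ~100-byte action, and ~50 KB consolidated report are this proposal's estimates, not measured values:

```go
// Sanity check of the single-server bandwidth analysis above.
package main

import "fmt"

func main() {
	const (
		hz          = 60     // updates per second
		actionBytes = 100    // approx. bytes per client action
		reportBytes = 50_000 // approx. bytes per consolidated state report
	)
	for _, players := range []int{1_000, 10_000} {
		ingress := float64(players*hz*actionBytes) / 1e6 // MB/s
		egress := float64(hz*players*reportBytes) / 1e9  // GB/s
		fmt.Printf("%6d players: ingress %.1f MB/s, egress %.1f GB/s (%.0f Gbps)\n",
			players, ingress, egress, egress*8)
	}
}
```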
2. The Solution: Server Mesh Architecture
We propose distributing the broadcast responsibility across a peer-to-peer mesh of server instances.
Core Concept:
Instead of one server broadcasting to 10,000 clients, we use 100 servers, each responsible for only 100 "local" clients.
Message Flow:
- Client → Local Server: A client sends its actions only to its assigned local server.
- Server → Peer Mesh: Each server broadcasts a small batch of its local clients' actions to all other peer servers.
- Server → Local Clients: Each server receives batches from all its peers, consolidates them into a full world state update, and broadcasts it only to its local clients.
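To make the three hops concrete, here is a minimal sketch in Go. The type and method names (`MeshServer`, `ActionBatch`, and so on) are illustrative only, not an existing API, and transport details (WebSockets, UDP, serialization) are deliberately omitted:

```go
// Minimal sketch of the three-hop routing described above.
package mesh

// Action is one client input for a given simulation frame.
type Action struct {
	ClientID string
	Frame    uint64
	Payload  []byte // ~100 bytes in the bandwidth analysis above
}

// ActionBatch is what a server forwards to its peers each frame:
// only the actions of its own local clients.
type ActionBatch struct {
	ServerID string
	Frame    uint64
	Actions  []Action
}

type MeshServer struct {
	id           string
	peers        []chan<- ActionBatch // stand-ins for peer connections
	localClients []chan<- []byte      // stand-ins for client connections
	pending      []Action             // local actions for the current frame
}

// Hop 1: a client sends an action only to its assigned local server.
func (s *MeshServer) HandleClientAction(a Action) {
	s.pending = append(s.pending, a)
}

// Hop 2: at the end of each frame, broadcast the small local batch to all peers.
func (s *MeshServer) FlushToPeers(frame uint64) {
	batch := ActionBatch{ServerID: s.id, Frame: frame, Actions: s.pending}
	for _, p := range s.peers {
		p <- batch
	}
	s.pending = nil
}

// Hop 3: consolidate local + peer batches into one world-state update and
// broadcast it only to local clients (consolidation is application-specific).
func (s *MeshServer) BroadcastState(batches []ActionBatch, consolidate func([]ActionBatch) []byte) {
	state := consolidate(batches)
	for _, c := range s.localClients {
		c <- state
	}
}
```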
Impact on Bandwidth (1,000 players, 10 servers):
- Each server now handles egress for only 100 clients + 9 peers.
- Total Egress per Server: ~300 MB/sec (2.4 Gbps).
- This is a 10x reduction in bandwidth per server, making it easily manageable with commodity instances (e.g., t3.large).
This architecture provides nearly constant bandwidth requirements per server, allowing for linear scalability.
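A small calculation, under the same assumptions as Section 1 (60 Hz, ~50 KB consolidated report, ~100 bytes per action) and an even split of clients across servers, illustrates why per-server egress stays roughly flat as the mesh grows. The helper name `perServerEgressMBps` is hypothetical:

```go
// Per-server egress under the mesh: full-state reports to local clients
// plus small action batches to every peer, every frame.
package main

import "fmt"

func perServerEgressMBps(players, servers int) float64 {
	const hz, reportBytes, actionBytes = 60, 50_000, 100
	localClients := players / servers
	// Hop 3: consolidated state report to each local client, every frame.
	toClients := hz * localClients * reportBytes
	// Hop 2: batch of local actions to each of the other peers, every frame.
	toPeers := hz * (servers - 1) * localClients * actionBytes
	return float64(toClients+toPeers) / 1e6
}

func main() {
	fmt.Printf("1,000 players,   1 server:   %.0f MB/s\n", perServerEgressMBps(1_000, 1))
	fmt.Printf("1,000 players,  10 servers:  %.0f MB/s\n", perServerEgressMBps(1_000, 10))
	// Assumes the ~50 KB report figure still holds at 10,000 players.
	fmt.Printf("10,000 players, 100 servers: %.0f MB/s\n", perServerEgressMBps(10_000, 100))
}
```

The peer-to-peer batches add only a few MB/sec, because they carry raw actions rather than the consolidated state.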
3. Benefits Beyond Scale
- Geographic Distribution: Deploy servers in multiple regions for lower global player latency within a single, unified game instance.
- Elastic Scaling: Automatically add or remove servers based on player load, optimizing for cost and performance.
- Simplified Development: The mesh can be run locally for testing without AWS costs.
- Protocol Transparency: No client-side changes are required. The client connects to a load balancer and is unaware of the mesh backend.
4. Implementation Plan & Tasks
We propose a phased implementation to de-risk the project and deliver value incrementally.
Phase 1: Proof of Concept (1 week)
- Stand up a 3-server mesh on localhost.
- Implement manual peer configuration (e.g., static IP list).
- Implement basic client-to-server, server-to-peer, and peer-to-client message routing.
- Verify bandwidth reduction with 30 test clients (10 per server).
- Success Metric: All 30 clients maintain a synchronized game state with < 10 MB/s bandwidth per server process.
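As a sketch of what the Phase 1 manual peer configuration could look like (the server IDs and localhost ports below are placeholders, not a final format):

```go
// Hypothetical static peer configuration for the Phase 1 localhost mesh:
// three server processes on fixed ports, each given the other two as peers.
package mesh

import "fmt"

type PeerConfig struct {
	ServerID   string
	ListenAddr string   // where local clients connect
	PeerAddrs  []string // static list of the other mesh servers
}

// LocalMesh returns a manual three-server configuration for localhost testing.
func LocalMesh() []PeerConfig {
	addrs := []string{"127.0.0.1:7001", "127.0.0.1:7002", "127.0.0.1:7003"}
	cfgs := make([]PeerConfig, len(addrs))
	for i, addr := range addrs {
		peers := make([]string, 0, len(addrs)-1)
		for j, p := range addrs {
			if j != i {
				peers = append(peers, p)
			}
		}
		cfgs[i] = PeerConfig{
			ServerID:   fmt.Sprintf("server-%d", i+1),
			ListenAddr: addr,
			PeerAddrs:  peers,
		}
	}
	return cfgs
}
```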
Phase 2: Production MVP (2 weeks)
- Create Terraform configuration to deploy a 10-server mesh on AWS.
- Use environment variables for static peer discovery.
- Implement basic health checks and monitoring.
- Configure a client-facing load balancer to distribute connections.
- Stress test with 1,000 concurrent bots.
- Success Metric: A 1,000-player match remains stable with < 5 Gbps bandwidth per server instance.
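Static peer discovery via environment variables could look roughly like the following, reusing the hypothetical `PeerConfig` type from the Phase 1 sketch. The variable names (`MESH_SERVER_ID`, `MESH_LISTEN_ADDR`, `MESH_PEERS`) are assumptions, with Terraform expected to inject them per instance:

```go
// Sketch of env-var-based static peer discovery for Phase 2.
package mesh

import (
	"fmt"
	"os"
	"strings"
)

// ConfigFromEnv builds a server's mesh configuration from its environment.
func ConfigFromEnv() (PeerConfig, error) {
	id := os.Getenv("MESH_SERVER_ID")
	listen := os.Getenv("MESH_LISTEN_ADDR")
	peers := os.Getenv("MESH_PEERS") // comma-separated "host:port" list
	if id == "" || listen == "" || peers == "" {
		return PeerConfig{}, fmt.Errorf("MESH_SERVER_ID, MESH_LISTEN_ADDR and MESH_PEERS must be set")
	}
	return PeerConfig{
		ServerID:   id,
		ListenAddr: listen,
		PeerAddrs:  strings.Split(peers, ","),
	}, nil
}
```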
Phase 3: Elastic Scaling & Fault Tolerance (2 weeks)
- Implement dynamic server spawning and termination based on player load.
- Implement client rebalancing logic to handle scale-up/down events.
- Add graceful server draining procedures.
- Integrate Prometheus metrics and build Grafana dashboards for mesh visibility.
- Implement basic server failure detection (heartbeats) and client redistribution.
- Success Metric: The cluster can automatically scale from 100 to 2,000 players and back down without manual intervention or significant disruption.
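One possible shape for the Phase 3 heartbeat-based failure detection; the interval and timeout handling are illustrative, and the callback stands in for the client-redistribution logic described above:

```go
// Minimal sketch of heartbeat-based peer failure detection.
package mesh

import (
	"sync"
	"time"
)

type HeartbeatMonitor struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // peer ID -> time of last heartbeat
	timeout  time.Duration
	onDead   func(peerID string) // e.g. trigger client redistribution
}

func NewHeartbeatMonitor(timeout time.Duration, onDead func(string)) *HeartbeatMonitor {
	return &HeartbeatMonitor{
		lastSeen: make(map[string]time.Time),
		timeout:  timeout,
		onDead:   onDead,
	}
}

// Record is called whenever a heartbeat message arrives from a peer.
func (m *HeartbeatMonitor) Record(peerID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastSeen[peerID] = time.Now()
}

// Run periodically flags peers whose heartbeats have stopped.
func (m *HeartbeatMonitor) Run(interval time.Duration) {
	for range time.Tick(interval) {
		m.mu.Lock()
		for id, t := range m.lastSeen {
			if time.Since(t) > m.timeout {
				delete(m.lastSeen, id)
				m.onDead(id)
			}
		}
		m.mu.Unlock()
	}
}
```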
5. Risks and Mitigation
- Risk 1: Increased Latency
  - Concern: The extra server-to-server hop could add latency.
  - Mitigation: Intra-datacenter latency is ~1-2ms. The total added latency (~2-4ms) is negligible for a 60 FPS (16.67ms/frame) game. Deploy all peers in the same AWS Availability Zone.
- Risk 2: Network Partitions
  - Concern: The mesh could "split" if server-to-server communication fails.
  - Mitigation: This is extremely rare within a single AWS AZ. We will accept this risk for the MVP and plan for cross-AZ partition detection and split-brain resolution in a later phase.
- Risk 3: Message Ordering
  - Concern: Actions from different servers could arrive at peers in different orders.
  - Mitigation: This is already solved by our deterministic simulation and use of frame numbers in the game state. All servers will process the same consolidated set of actions for a given frame (see the sketch after this list).
- Risk 4: Complexity
  - Concern: This architecture is more complex than a single server.
  - Mitigation: This complexity is unavoidable for our target scale. We will contain it within the core framework, abstracting it from game developers, and provide robust documentation, reference implementations, and monitoring tools.
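A sketch of how frame numbers keep ordering deterministic (Risk 3): batches are buffered per frame, and only once every peer has reported is the combined action set sorted into a canonical order and handed to the simulation. It reuses the hypothetical `Action`/`ActionBatch` types from the routing sketch in Section 2:

```go
// Collects per-frame action batches and releases them in a canonical order.
package mesh

import "sort"

type FrameCollector struct {
	expectedPeers int
	byFrame       map[uint64][]ActionBatch
}

func NewFrameCollector(expectedPeers int) *FrameCollector {
	return &FrameCollector{expectedPeers: expectedPeers, byFrame: make(map[uint64][]ActionBatch)}
}

// Add stores a batch; once all peers (plus the local batch) have reported for
// a frame, it returns the actions in a canonical order so every server feeds
// the identical sequence into its deterministic simulation.
func (c *FrameCollector) Add(b ActionBatch) ([]Action, bool) {
	c.byFrame[b.Frame] = append(c.byFrame[b.Frame], b)
	batches := c.byFrame[b.Frame]
	if len(batches) < c.expectedPeers+1 {
		return nil, false
	}
	delete(c.byFrame, b.Frame)

	// Canonical order: by originating server, then by client within a batch.
	sort.Slice(batches, func(i, j int) bool { return batches[i].ServerID < batches[j].ServerID })
	var actions []Action
	for _, batch := range batches {
		sort.Slice(batch.Actions, func(i, j int) bool { return batch.Actions[i].ClientID < batch.Actions[j].ClientID })
		actions = append(actions, batch.Actions...)
	}
	return actions, true
}
```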
6. Alternatives Considered & Rejected
- Vertical Scaling: Rejected due to physical bandwidth limits and cost.
- Client-Side Prediction: Rejected because it breaks our "provably fair" requirement for deterministic simulations.
- Server Sharding (Separate Games): Rejected as it defeats the core vision of a single, massive-scale game instance.
The proposed mesh architecture is the only viable path to achieving the project's vision.
Call to Action
The engineering team should review this proposal. The immediate goal is to approve and execute Phase 1: Proof of Concept.