Exploring what a real-time ML inference and data drift monitoring solution could look like within MLOps.
- Clients send requests to the Model Server for predictions.
- Requests hit a load balancer (NGINX) which routes them to one of the Model Server replicas.
- The Model Server processes the request and returns a prediction.
- The Model Server also sends the feature data to the Metric Server for drift monitoring. This happens in a background task that runs asynchronously, so it does not block the prediction response. The request goes through a load balancer and is routed to one of the Metric Server replicas (see the sketch after this list).
- The Metric Server loads a reference dataset at startup, which is used to compare incoming feature data against.
- Incoming feature data is buffered in a rolling window.
- Once the buffer is full:
- A KS test is run per feature.
- P-values and drift flags are recorded.
- Metrics are exposed to Prometheus.
- Grafana visualizes:
- Number of features drifting
- Feature-level p-values & drift flags
- Last drift timestamp
- Historical drift trends
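
As a concrete illustration of the forwarding step above, here is a minimal sketch of the Model Server side of this flow, assuming FastAPI and httpx. The endpoint paths, the port, the `forward_features` helper, and the stub `predict` function are all illustrative assumptions, not the repo's actual code:

```python
import httpx
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical address of the Metric Server load balancer (assumption).
METRIC_SERVER_URL = "http://metric-server:8001/ingest-features"


class PredictionRequest(BaseModel):
    features: list[float]


def predict(features: list[float]) -> float:
    return sum(features)  # Stand-in for the real model loaded at startup


def forward_features(features: list[float]) -> None:
    # Fire-and-forget: drift monitoring must never block or fail the prediction path.
    try:
        httpx.post(METRIC_SERVER_URL, json={"features": features}, timeout=2.0)
    except httpx.HTTPError:
        pass  # Losing a monitoring sample is acceptable; a failed prediction is not.


@app.post("/get-prediction")
def get_prediction(request: PredictionRequest, background_tasks: BackgroundTasks):
    prediction = predict(request.features)
    # FastAPI runs the task after the response has been sent, keeping latency low.
    background_tasks.add_task(forward_features, request.features)
    return {"prediction": prediction}
```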
Model Servers are stateless and can be scaled horizontally. Metric Servers are stateful: in this design, each Metric Server instance maintains its own buffer (state) and computes drift from the data it happens to receive. Prometheus scrapes all Metric Server replicas and aggregates their outputs into a global view of drift across all features.
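
A minimal sketch of what one Metric Server replica's drift check could look like, assuming FastAPI, NumPy, SciPy, and prometheus_client. The endpoint name, `WINDOW_SIZE`, the significance threshold, and the `reference.npy` file are illustrative assumptions:

```python
from collections import deque

import numpy as np
from fastapi import FastAPI
from prometheus_client import Gauge, make_asgi_app
from pydantic import BaseModel
from scipy.stats import ks_2samp

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Exposes metrics for Prometheus to scrape

WINDOW_SIZE = 500          # Size of the rolling buffer (assumption)
P_VALUE_THRESHOLD = 0.05   # KS-test significance level (assumption)

# Reference dataset loaded once at startup, shape (n_samples, n_features).
reference = np.load("reference.npy")
buffer: deque = deque(maxlen=WINDOW_SIZE)  # Rolling window of incoming rows

p_value_gauge = Gauge("feature_p_value", "KS-test p-value per feature", ["feature"])
drift_gauge = Gauge("feature_drift", "1 if the feature is drifting, else 0", ["feature"])


class FeaturePayload(BaseModel):
    features: list[float]


@app.post("/ingest-features")
def ingest_features(payload: FeaturePayload):
    buffer.append(payload.features)
    if len(buffer) == WINDOW_SIZE:  # Only test once the window is full
        window = np.asarray(buffer)
        for i in range(window.shape[1]):
            # Two-sample KS test: current window vs. the reference distribution
            _, p_value = ks_2samp(window[:, i], reference[:, i])
            p_value_gauge.labels(feature=str(i)).set(p_value)
            drift_gauge.labels(feature=str(i)).set(int(p_value < P_VALUE_THRESHOLD))
    return {"buffered": len(buffer)}
```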
Use uv to manage the Python environment:

```bash
uv venv
source .venv/bin/activate
uv sync
```

Use Docker Compose to spin up the Model Server, Metric Server, Prometheus, and Grafana:

```bash
docker compose up --build
```

To scale the Model Server to handle more requests, you can use:

```bash
docker compose up --build --scale model-server=10 --scale metric-server=10
```

This starts 10 instances of the Model Server and 10 instances of the Metric Server, allowing them to handle more concurrent requests.
Requests to the prediction API are sent to the API Gateway (NGINX), which load-balances across the Model Server replicas.
See nginx.conf.
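
For orientation, a hypothetical sketch of what an nginx.conf for this setup might contain; the repo's actual config, internal port, and routing may differ. With Compose's `--scale`, one common pattern is to let Docker's embedded DNS spread requests across the replicas:

```nginx
events {}

http {
    server {
        listen 8002;

        location /get-prediction {
            # Docker's embedded DNS (127.0.0.11) resolves the service name to the
            # replica IPs; re-resolving via a variable spreads requests across them.
            resolver 127.0.0.11 valid=5s;
            set $model_upstream http://model-server:8000;
            proxy_pass $model_upstream;
        }
    }
}
```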
To simulate a live data stream:

- Without Drift (Normal Scenario):

```bash
uv run run.py --drift false
```

- With Drift (Simulated Drift Scenario):

```bash
uv run run.py --drift true
```

Access the Grafana dashboard at http://localhost:4000/.
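
If you want a mental model of what the stream simulator does, here is a hypothetical sketch (not the repo's actual run.py): it samples features from a baseline distribution, shifts the mean when drift is requested, and posts each row to the prediction endpoint.

```python
import argparse
import time

import httpx
import numpy as np

API_URL = "http://localhost:8002/get-prediction"  # Matches the gateway port above
N_FEATURES = 5  # Illustrative feature count (assumption)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--drift", choices=["true", "false"], default="false")
    args = parser.parse_args()

    rng = np.random.default_rng()
    # Shifting the mean should make the KS test flag drift once the window fills.
    mean = 2.0 if args.drift == "true" else 0.0

    while True:
        features = rng.normal(loc=mean, scale=1.0, size=N_FEATURES).tolist()
        httpx.post(API_URL, json={"features": features}, timeout=5.0)
        time.sleep(0.1)  # Roughly 10 requests per second


if __name__ == "__main__":
    main()
```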
To run load testing with Locust, follow these steps:

Run from the project root:

```bash
locust
```

Then open your browser and navigate to http://localhost:8089 to access the Locust web interface. From there, you can start load tests by specifying the target URL and the number of users to simulate.

The target URL is http://localhost:8002/get-prediction.
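
If you need a starting point for your own scenarios, a minimal locustfile could look like the sketch below; the payload shape is an assumption and should match the real API schema:

```python
from locust import HttpUser, between, task


class PredictionUser(HttpUser):
    # Point Locust at the gateway, e.g. `locust --host http://localhost:8002`
    wait_time = between(0.1, 0.5)  # Pause between requests per simulated user

    @task
    def get_prediction(self):
        # Hypothetical payload; align the feature vector with the model's input.
        self.client.post("/get-prediction", json={"features": [0.1, 0.2, 0.3, 0.4, 0.5]})
```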
Currently working on scaling the metrics endpoint. The model endpoint is simpler because it's stateless, but the Metric Server is stateful, so I'm considering different ways it can be scaled.


