
Process pool for multiple loaded LLMs, and a queuing system from the FastAPI/uvicorn workers #17

@uogbuji

Description


In supporting concurrent requests, we won't at first assume concurrent inference capability at the model-weights level, which means access to each loaded LLM must be serialized (mutexed). We'll want control over LLM supervision anyway, in which case we might as well support multiple LLM types hosted at once.

This will probably require some sort of config, e.g. TOML. For example, to mount two instances of Meta Llama and one of Mistral Nemo:

```toml
[llm]
# Nickname to HF or local path
llama3-8B-8bit-1 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
llama3-8B-8bit-2 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
mistral-nemo-8bit = "mlx-community/Mistral-Nemo-Instruct-2407-8bit"
```

Each instance would be resident in memory, so memory is the natural limit on pool size. The client would request by model type, and the model supervisor would pick the first available instance.
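The first-available behavior described above can be sketched with one `asyncio.Queue` of idle instances per model type; a request for a busy model type simply awaits until an instance is released, which also gives us the queuing semantics for the FastAPI/uvicorn workers. `ModelSupervisor` and the nicknames below are hypothetical, not part of any existing code:

```python
# Minimal sketch, assuming instance nicknames stand in for loaded models.
import asyncio

class ModelSupervisor:
    def __init__(self, instances: dict[str, list[str]]):
        # One queue of idle instance nicknames per model type
        self._idle: dict[str, asyncio.Queue] = {}
        for model, nicknames in instances.items():
            q: asyncio.Queue = asyncio.Queue()
            for nick in nicknames:
                q.put_nowait(nick)
            self._idle[model] = q

    async def acquire(self, model: str) -> str:
        # Awaits (i.e. queues the request) until an instance is free
        return await self._idle[model].get()

    def release(self, model: str, nickname: str) -> None:
        self._idle[model].put_nowait(nickname)

async def demo():
    sup = ModelSupervisor({"llama3-8B-8bit": ["inst-1", "inst-2"]})
    a = await sup.acquire("llama3-8B-8bit")
    b = await sup.acquire("llama3-8B-8bit")
    sup.release("llama3-8B-8bit", a)  # a third waiter would now unblock
    return a, b

first, second = asyncio.run(demo())
```

In a FastAPI handler this would be wrapped in try/finally so an instance is always released even when inference raises.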

Of interest, to understand the expected deployment scenarios:
