
Process pool for multiple loaded LLMs, and a queuing system from the FastAPI/uvicorn workers #17

@uogbuji

Description


In supporting concurrent requests, we won't at first assume concurrent inference capability at the model-weights level, which means access to each loaded LLM must be serialized (mutexed). We'll want control over LLM supervision anyway, in which case we might as well support multiple LLM types hosted at once.

This will probably require some sort of config, e.g. TOML. For example, to mount two instances of Meta Llama and one of Mistral Nemo:

```toml
[llm]
# Nickname to HF or local path
llama3-8B-8bit-1 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
llama3-8B-8bit-2 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
mistral-nemo-8bit = "mlx-community/Mistral-Nemo-Instruct-2407-8bit"
```

Each instance would be resident in memory, so memory is the natural limit on pool size. The client would request by model type, and the model supervisor would pick the first available instance.
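The first-available behavior described above can be sketched with one `asyncio.Queue` of idle instances per model type; a request for a busy model type simply awaits until an instance is released, which also gives us the queuing semantics for the FastAPI/uvicorn workers. `ModelSupervisor` and the nicknames below are hypothetical, not part of any existing code:

```python
# Minimal sketch, assuming instance nicknames stand in for loaded models.
import asyncio

class ModelSupervisor:
    def __init__(self, instances: dict[str, list[str]]):
        # One queue of idle instance nicknames per model type
        self._idle: dict[str, asyncio.Queue] = {}
        for model, nicknames in instances.items():
            q: asyncio.Queue = asyncio.Queue()
            for nick in nicknames:
                q.put_nowait(nick)
            self._idle[model] = q

    async def acquire(self, model: str) -> str:
        # Awaits (i.e. queues the request) until an instance is free
        return await self._idle[model].get()

    def release(self, model: str, nickname: str) -> None:
        self._idle[model].put_nowait(nickname)

async def demo():
    sup = ModelSupervisor({"llama3-8B-8bit": ["inst-1", "inst-2"]})
    a = await sup.acquire("llama3-8B-8bit")
    b = await sup.acquire("llama3-8B-8bit")
    sup.release("llama3-8B-8bit", a)  # a third waiter would now unblock
    return a, b

first, second = asyncio.run(demo())
```

In a FastAPI handler this would be wrapped in try/finally so an instance is always released even when inference raises.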

Of interest, to understand the expected deployment scenarios:
