This project is an evolution of SmolChat that runs Retrieval-Augmented Generation (RAG) techniques locally to improve LLM performance on specific subjects. It is ideal for situations where a specialized model is needed but unavailable and fine-tuning is not feasible, as it supplies generic models with relevant in-context information.
- Support for RAG added to LLM responses
- Support for reranking added to LLM responses
- Customization of the system message disabled

## On-Device Inference of SLMs in Android
- Provide a usable user interface to interact with SLMs (small language models) locally, on-device
- Allow users to add/remove SLMs (GGUF models) and modify their system prompts or inference parameters (temperature, min-p)
- Allow users to create specific downstream tasks quickly and use SLMs to generate responses
- Simple, easy-to-understand, extensible codebase
- Clone the repository along with its llama.cpp submodule:

  ```bash
  git clone https://github.com/TIC-13/SmolRag.git
  cd SmolRag
  git submodule update --init --recursive
  ```
- Android Studio starts building the project automatically. If not, select Build > Rebuild Project to start a project build.
- After a successful project build, connect an Android device to your system. Once connected, the device's name should be visible in the top menu bar of Android Studio.
- The application uses llama.cpp to load and execute GGUF models. As llama.cpp is written in pure C/C++, it is easy to compile for Android-based targets using the NDK (see the Gradle sketch after this list).
- The `smollm` module uses a `llm_inference.cpp` class, which interacts with llama.cpp's C-style API to execute the GGUF model, and a JNI binding, `smollm.cpp`. Check the C++ source files here. On the Kotlin side, the `SmolLM` class provides the required methods to interact with the JNI (C++ side) bindings; a Kotlin sketch of this boundary follows the list.
- The `app` module contains the application logic and UI code. Whenever a new chat is opened, the app instantiates the `SmolLM` class and provides it the model file path, which is stored in the `LLMModel` entity in ObjectBox. Next, the app adds messages with the roles `user` and `system` to the chat by retrieving them from the database and using `LLMInference::addChatMessage`.
- For tasks, the messages are not persisted; we inform `LLMInference` by passing `_storeChats=false` to `LLMInference::loadModel`.
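
For context, here is a minimal sketch of how the NDK build could be wired into the module's `build.gradle.kts`. The CMake path and ABI list below are assumptions for illustration, not the project's actual configuration:

```kotlin
// Module-level build.gradle.kts (sketch, not the project's actual file).
android {
    defaultConfig {
        ndk {
            // Assumed ABI list: 64-bit targets where llama.cpp performs well.
            abiFilters += listOf("arm64-v8a", "x86_64")
        }
    }
    externalNativeBuild {
        cmake {
            // Assumed location: a CMakeLists.txt that add_subdirectory()'s the
            // llama.cpp submodule and compiles the JNI binding (smollm.cpp)
            // into a shared library loadable from Kotlin.
            path = file("src/main/cpp/CMakeLists.txt")
        }
    }
}
```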
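
The Kotlin/JNI boundary could then look roughly like the sketch below. `SmolLM`, `loadModel`, `addChatMessage`, and the `storeChats` flag come from the description above; the native method names, signatures, and parameters such as `minP` and `temperature` are assumptions for illustration:

```kotlin
// Sketch of a Kotlin wrapper over the JNI bindings in smollm.cpp.
// Native names and signatures are assumed, not taken from the project.
class SmolLM {
    companion object {
        init {
            // Load the shared library built from smollm.cpp / llm_inference.cpp.
            System.loadLibrary("smollm")
        }
    }

    private var nativeHandle: Long = 0L

    // Mirrors LLMInference::loadModel; storeChats=false is used for tasks,
    // whose messages are never persisted.
    fun loadModel(modelPath: String, minP: Float, temperature: Float, storeChats: Boolean) {
        nativeHandle = loadModelNative(modelPath, minP, temperature, storeChats)
    }

    // Mirrors LLMInference::addChatMessage with roles such as "user" or "system".
    fun addChatMessage(message: String, role: String) =
        addChatMessageNative(nativeHandle, message, role)

    fun getResponse(query: String): String = getResponseNative(nativeHandle, query)

    fun close() = closeNative(nativeHandle)

    private external fun loadModelNative(
        modelPath: String, minP: Float, temperature: Float, storeChats: Boolean
    ): Long
    private external fun addChatMessageNative(handle: Long, message: String, role: String)
    private external fun getResponseNative(handle: Long, query: String): String
    private external fun closeNative(handle: Long)
}
```

Opening a chat then reduces to calling `loadModel(...)` with `storeChats = true` and replaying the persisted messages through `addChatMessage`, while tasks pass `storeChats = false` so nothing is written back to the database.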
- ggerganov/llama.cpp is a pure C/C++ framework for executing machine-learning models on multiple execution backends. It provides a primitive C-style API to interact with LLMs converted to the GGUF format native to ggml/llama.cpp. The app uses JNI bindings to interact with a small class, `smollm.cpp`, which uses llama.cpp to load and execute GGUF models.
- ObjectBox is an on-device, high-performance NoSQL database with bindings available in multiple languages. The app uses ObjectBox to store the model, chat, and message metadata (see the entity sketch after this list).
- noties/Markwon is a Markdown rendering library for Android. The app uses Markwon and Prism4j (for code syntax highlighting) to render Markdown responses from the SLMs; a short usage sketch follows.
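
To make the storage layer concrete, here is a hedged sketch of what an ObjectBox entity such as `LLMModel` could look like in Kotlin. Only the entity name and the stored file path are taken from the text above; every other field is an assumption:

```kotlin
import io.objectbox.annotation.Entity
import io.objectbox.annotation.Id

// Sketch of a model-metadata entity; the project's real LLMModel
// entity may declare different fields.
@Entity
data class LLMModel(
    @Id var id: Long = 0,           // primary key assigned by ObjectBox
    var name: String = "",          // display name of the GGUF model (assumed field)
    var path: String = "",          // file path handed to SmolLM on chat open
    var temperature: Float = 0.8f,  // per-model inference parameters (assumed)
    var minP: Float = 0.05f
)

// Typical ObjectBox usage: obtain a Box from the BoxStore, then put/get.
// val box = store.boxFor(LLMModel::class.java)
// val id = box.put(LLMModel(name = "SmolLM2", path = "/data/.../model.gguf"))
```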
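
Rendering with Markwon takes only a few lines; `renderResponse` below is a hypothetical helper, while `Markwon.create` and `setMarkdown` are the library's standard entry points:

```kotlin
import android.widget.TextView
import io.noties.markwon.Markwon

// Render an SLM's Markdown response into a TextView. Syntax highlighting
// via Prism4j would be added as a plugin through Markwon.builder(...).
fun renderResponse(textView: TextView, markdown: String) {
    val markwon = Markwon.create(textView.context)
    markwon.setMarkdown(textView, markdown)
}
```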







