An embedded virtual assistant implemented on the ESP32 as an assignment for the Embedded Systems Design and Analysis course at the University of Guilan, Department of Computer Engineering, presented in Fall 2024.
Elva is an embedded virtual assistant powered by a language model (basically a talking robot!). This project aims to develop a small embedded system capable of listening, processing, and responding to human speech, functioning as a smart virtual assistant. Various techniques and technologies were used to build this system, including C and Python programming, large language models (LLMs), speech-to-text (STT) and text-to-speech (TTS) models, as well as IoT and microcontroller integration.
The system consists of two main parts: the platform and the node. The platform processes the user's input and generates a response through several stages, while the node records the user's voice, sends it to the platform, receives the generated output, and plays it through the speaker. A typical interaction proceeds as follows (a rough platform-side sketch follows the list):
- The node records the user's voice.
- The node saves the recorded voice as a `.wav` file on the SD card.
- The node sends the `input.wav` file to the platform over the local network.
- The platform receives and saves the `input.wav` file.
- The platform converts the `input.wav` file to text using the Distil-Whisper STT model.
- The platform processes the transcribed text using the LLaMA 3.2:1B LLM and generates a response.
- The platform converts the response text to an `output.mp3` file using the Edge TTS model.
- The platform sends the `output.mp3` file back to the node over the local network.
- The node receives and saves the `output.mp3` file on the SD card.
- The node plays the `output.mp3` file through the speaker using the I2S stereo decoder.
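To make the hand-off concrete, here is a minimal sketch of what the platform side of this exchange could look like in Python, assuming the node uploads `input.wav` over plain HTTP and reads `output.mp3` back from the response. Flask, the `/assist` route, and the port are illustrative assumptions rather than the project's actual code; the STT → LLM → TTS chain is left as a placeholder and sketched component by component in the next section.

```python
# Hypothetical platform-side endpoint (framework, route, and port are assumptions).
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/assist", methods=["POST"])
def assist():
    # Save the recording uploaded by the node.
    with open("input.wav", "wb") as f:
        f.write(request.get_data())

    # STT -> LLM -> TTS happens here (see the component sketches below).
    run_pipeline("input.wav", "output.mp3")

    # Return the synthesized reply so the node can save and play it.
    return send_file("output.mp3", mimetype="audio/mpeg")

def run_pipeline(wav_path: str, mp3_path: str) -> None:
    """Placeholder for the STT -> LLM -> TTS chain sketched below."""
    raise NotImplementedError

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```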
The platform consists of three main components: the large language model (LLM), the speech-to-text (STT) model, and the text-to-speech (TTS) model.
The LLM responsible for generating responses is the LLaMA 3.2:1B model, developed by Meta.
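The README names the model but not how it is served; as one hedged possibility, the snippet below assumes it runs locally behind Ollama under the tag `llama3.2:1b`. The system prompt is purely illustrative.

```python
# Sketch of the response-generation step, assuming a local Ollama server (an assumption).
import ollama

def generate_reply(user_text: str) -> str:
    # Single-turn chat; a fuller assistant would keep conversation history.
    response = ollama.chat(
        model="llama3.2:1b",  # assumed model tag
        messages=[
            {"role": "system", "content": "You are Elva, a friendly embedded voice assistant. Keep answers short."},
            {"role": "user", "content": user_text},
        ],
    )
    return response["message"]["content"]
```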
The STT model responsible for converting the user's speech to text is the Distil-Whisper model.
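A minimal way to run Distil-Whisper is through the Hugging Face `transformers` ASR pipeline; the specific checkpoint below (`distil-whisper/distil-small.en`) is an assumption, since the README does not say which model size the project uses.

```python
# Sketch of the transcription step using a Distil-Whisper checkpoint (size assumed).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",
)

def transcribe(wav_path: str) -> str:
    # The pipeline decodes the audio file and returns {"text": "..."}.
    return asr(wav_path)["text"]
```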
The TTS model responsible for converting generated responses to audio files is the Edge TTS model.
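Edge TTS is available through the `edge-tts` Python package, which exposes an async API; the voice name below is an assumption and can be swapped for any voice the package lists.

```python
# Sketch of the synthesis step using the edge-tts package (voice name assumed).
import asyncio
import edge_tts

async def synthesize(text: str, mp3_path: str) -> None:
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(mp3_path)

# Example usage:
# asyncio.run(synthesize("Hi, I'm Elva!", "output.mp3"))
```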
The node consists of four main components: the microcontroller (MCU), the I2S microphone, the I2S stereo decoder, and the SD card.
The microcontroller responsible for handling input, output, and data transmissions is the ESP32 WROOM-32U.
The microphone module used to capture the user's voice is the I2S MEMS INMP441.
The stereo decoder module used to decode the output.mp3 file for the speaker is the Adafruit I2S Stereo Decoder - UDA1334A.
The SD card module used for storing input and output files is a 6-pin micro SD card reader.
In this video, I ask Elva to introduce herself and tell me about her capabilities. You'll notice some delay in her responses; this is because I don't have a powerful GPU.
Video.Elva.mp4
Right now she's just a bunch of wires. One day, I might build a proper case and body for her :D
- 📷 Integrate a Camera: Capture images and send them to the LLM for visual perception, enabling Elva to "see."
- 😄 Add an LCD Display: Use an LCD to show Elva’s emotions, expressions, or interactive feedback visually.
- 🖨️ 3D-Print a Custom Case: Design a physical case to enhance appearance, portability, and protection.
- 🧠 Fine-Tune LLM for Specialized Tasks: Customize Elva’s language model for tasks like database querying, RAG, or domain-specific interactions.
- 🚀 Upgrade to a More Powerful LLM: Run more advanced models like Gemma once I have access to a better GPU.





