a virtual talkbox using computer vision to track your mouth movements and shape sound with formant filters. basically: you play notes (midi or qwerty keyboard) and your mouth controls how they sound!
with a real talkbox, you play a synth through a tube into your mouth and shape the sound with vowel movements. this does the same thing with a webcam instead of a tube: it uses mediapipe to track your mouth, estimates approximate formant frequencies (F1, F2, F3) from your mouth shape, and filters a sawtooth wave in real time. the result sounds like you're "singing" (kinda) the notes you play!
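under the hood, the mouth tracking boils down to a couple of distances between facemesh landmarks. here's a rough sketch of that step — the landmark indices (13/14 for the inner lips, 61/291 for the mouth corners) and the normalization are assumptions for illustration, not necessarily what main.py does:

```python
import math

def mouth_features(landmarks):
    """landmarks: dict of facemesh index -> (x, y) in normalized image coords."""
    upper = landmarks[13]   # inner upper lip (assumed facemesh index)
    lower = landmarks[14]   # inner lower lip
    left = landmarks[61]    # left mouth corner
    right = landmarks[291]  # right mouth corner

    width = math.dist(left, right)
    opening = math.dist(upper, lower)

    # normalize jaw opening by mouth width so distance from the camera matters less
    jaw_open = min(opening / width, 1.0) if width else 0.0
    return jaw_open, width
```

normalizing by mouth width is one simple way to make the features roughly scale-invariant; the real code may normalize differently (e.g. by face height).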
macos:

```
# install system dependencies (required for pyo audio library)
brew install portaudio portmidi liblo libsndfile

# create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# install python dependencies
pip install -r requirements.txt

# install pyo from github (the PyPI version has build issues)
C_INCLUDE_PATH="/opt/homebrew/include" LIBRARY_PATH="/opt/homebrew/lib" pip install git+https://github.com/belangeo/pyo.git

# run it
python main.py
```

linux:

```
# install system dependencies (debian/ubuntu)
sudo apt-get install portaudio19-dev libportmidi-dev liblo-dev libsndfile1-dev

# create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# install python dependencies
pip install -r requirements.txt

# install pyo
pip install pyo

# run it
python main.py
```

windows:

```
# create and activate virtual environment
python -m venv venv
venv\Scripts\activate

# install python dependencies
pip install -r requirements.txt

# install pyo (pre-built wheels available on Windows)
pip install pyo

# run it
python main.py
```

when you start it up, you'll get a menu to choose your input:
- midi keyboard - if you have one plugged in
- computer keyboard - if you don't
the app will remember your choice for next time.
if you're using qwerty keyboard mode, it's set up like a piano:
black keys: w e t y u o p
white keys: a s d f g h j k l ; '
extra controls:
- z/x - change octave
- c - toggle vibrato
- arrow up/down - pitch bend
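the piano layout above can be sketched as a lookup table from key to semitone offset. this is a guess at the usual ableton-style mapping implied by the rows listed; the app's actual table may differ slightly:

```python
# semitone offsets from C, following the white/black key rows described above
# (assumed mapping for illustration)
KEY_TO_SEMITONE = {
    'a': 0, 'w': 1, 's': 2, 'e': 3, 'd': 4, 'f': 5, 't': 6,
    'g': 7, 'y': 8, 'h': 9, 'u': 10, 'j': 11, 'k': 12, 'o': 13,
    'l': 14, 'p': 15, ';': 16, "'": 17,
}

def key_to_midi(key, octave=4):
    """Convert a qwerty key to a midi note number; midi 60 = middle C = octave 4."""
    semitone = KEY_TO_SEMITONE.get(key)
    if semitone is None:
        return None  # not a note key (e.g. z/x/c are control keys)
    return 12 * (octave + 1) + semitone
```

the z/x octave controls would then just increment or decrement the `octave` argument.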
- press a key to play a note
- move your mouth while the note is playing
- experiment with different vowel shapes:
  - "ah" = open mouth
  - "ee" = wide smile
  - "oo" = rounded lips
sound only plays when a note is pressed AND your face is detected
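that gating rule is simple enough to write down. a minimal sketch — the function name and velocity handling are made up for illustration:

```python
def output_gain(note_on: bool, face_detected: bool, velocity: float = 1.0) -> float:
    # sound only when a note is held AND the webcam currently sees a face
    return velocity if (note_on and face_detected) else 0.0
```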
- face tracking: mediapipe facemesh for mouth landmark detection
- formant mapping:
  - F1 (270–730 Hz): controlled by jaw opening
  - F2 (870–2290 Hz): controlled by lip width
  - F3 (1650–3000 Hz): combination of both
- audio: pyo synthesizer with supersaw oscillator + formant bandpass filters
- threading: separate threads for video, audio, and input
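at its simplest, the formant mapping above is interpolation from normalized mouth features into the listed frequency ranges. a sketch assuming linear curves and an even F1/F2 blend for F3 — the real code may shape these differently:

```python
def lerp(lo: float, hi: float, t: float) -> float:
    """Linear interpolation with t clamped to [0, 1]."""
    return lo + (hi - lo) * max(0.0, min(1.0, t))

def formants(jaw_open: float, lip_width: float):
    """Map normalized mouth features (0..1) onto the formant ranges listed above."""
    f1 = lerp(270.0, 730.0, jaw_open)     # jaw opening drives F1
    f2 = lerp(870.0, 2290.0, lip_width)   # lip width drives F2
    f3 = lerp(1650.0, 3000.0, 0.5 * (jaw_open + lip_width))  # blend for F3
    return f1, f2, f3
```

these three frequencies would then set the center frequencies of the bandpass filters applied to the sawtooth.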
creates a config.json file where you can tweak:
- audio buffer size (lower = less latency, higher = more stability)
- camera device id
- formant frequency ranges
- debug display settings
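for reference, a sketch of how creating and loading that config might look. the key names here are illustrative guesses; check the config.json the app actually writes:

```python
import json
from pathlib import Path

# assumed default values and key names, for illustration only
DEFAULT_CONFIG = {
    "buffer_size": 256,   # lower = less latency, higher = more stability
    "camera_id": 0,
    "formant_ranges": {
        "f1": [270, 730],
        "f2": [870, 2290],
        "f3": [1650, 3000],
    },
    "show_debug": True,
}

def load_config(path="config.json"):
    """Read config.json, writing the defaults first if it doesn't exist yet."""
    p = Path(path)
    if not p.exists():
        p.write_text(json.dumps(DEFAULT_CONFIG, indent=2))
    return json.loads(p.read_text())
```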
uses mediapipe, pyo, opencv, mido, and pynput