Visually impaired users struggle with modern GUIs because screen readers often fail to interpret icons, menus, and visual cues. This creates a dependency on external help for everyday tasks such as browsing, emailing, or navigating files.
Build a lightweight Python-based assistant that:
- Activates via hotkey
- Captures a screenshot + voice command
- Uses AI/LLMs + Vision Models to interpret intent
- Gives spoken instructions for navigation
- Hotkey Activation – listen globally for a trigger (e.g., Ctrl+Alt+S); a minimal pipeline sketch follows this list
- Screen Capture – grab a screenshot of the current screen
- Voice Input – record audio, send to Google Gemini API (STT)
- AI Processing
  - Gemini Vision → analyze screenshot
  - LLM (via Gemini) → interpret intent
  - Generate action steps
- Voice Output – convert instructions to speech with TTS
- Cursor location feedback (x, y coordinates)
- Basic error handling ("Sorry, I didn’t understand")
- Support for one extra app (Word or email client)
- Ask the assistant a question or give a command — natural language input; assistant replies (text + optional TTS).
- Change assistant persona / name / voice — tell it to use a different persona or voice and it will adopt that for speech and responses.
- Play sound effects — request short sound effects when the assistant replies.
- Text-to-speech (TTS) playback — have any assistant reply spoken aloud.
- Speak in a character/persona — prompt it to speak using a specified persona.
- Short confirmations — brief spoken acknowledgements for background tasks.
- Open desktop applications by name — ask it to open an app (e.g., browser, notepad, Word); it will either launch the executable or simulate OS search+enter.
- Switch/focus windows — ask to focus a particular application/window.
- Lock workstation / shutdown commands — the script exposes OS control actions that can be invoked on request.
- Perform a web search — give a query; it will run a search, gather top results, speak a summary, and open the first result if asked.
- Research a topic — request a deeper research summary; it crawls top pages, aggregates text, then synthesizes a summary.
- Open specific URLs or web pages — ask to open a URL or open a search result.
- YouTube search & play — ask to search YouTube and open/play a selected result.
- Generate written content — ask for essays, emails, blog posts, code examples, explanations, etc.; it uses the LLM to produce content.
- Write into an application — request it to create a document and type/paste it into Notepad/Word or other targeted app.
- Format typing behavior — ask it to type (keystrokes/clipboard paste) with small formatting adjustments (line breaks, headings).
- Compose email content — provide recipient/topic and it will draft subject/body.
- Send email via GUI automation — it can open Gmail in browser and simulate typing/sending (or prefill compose page and copy body to clipboard as fallback).
- Create reminders — natural language reminders with datetime parsing (e.g., “remind me tomorrow at 9am to …”).
- List, update, mark tasks — add tasks, change status, list pending/completed tasks.
- Background reminder delivery — scheduled reminders are checked and spoken when due.
- Ask the assistant to “learn” a new skill — supply a command and it will generate a structured automation (a skill) and save it for reuse (see the sketch after the note below).
- Execute learned skills — run saved multi-step skills (open apps, type, click, browse, etc.).
- Apply a previously learned action to new inputs — reuse learned recipes on similar future commands.
- Request code generation / run generated code — ask for code examples or to run generated Python snippets (the script can execute code strings as part of a skill).
- Generate complex project scaffolding — request project/code generation and background refinement (scaffolding + files).
Note: these capabilities are powerful — they let the assistant run code or OS commands; they are listed here as actions the user can request.
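
A minimal sketch of how a learned skill might be stored and replayed, assuming `pyautogui` for GUI automation; the step schema (`open_app`, `wait`, `type_text`, `hotkey`) and the `skills.json` file are illustrative, not the project's actual format:

```python
import json
import subprocess
import time

import pyautogui

# Hypothetical skill: an LLM-generated list of primitive steps saved for reuse.
SKILL = {
    "name": "open_notepad_and_greet",
    "steps": [
        {"action": "open_app", "target": "notepad"},
        {"action": "wait", "seconds": 1.5},
        {"action": "type_text", "text": "Hello from the assistant!\n"},
        {"action": "hotkey", "keys": ["ctrl", "s"]},
    ],
}


def run_skill(skill: dict) -> None:
    """Replay a saved skill step by step using GUI automation primitives."""
    for step in skill["steps"]:
        action = step["action"]
        if action == "open_app":
            subprocess.Popen(step["target"])  # launch by executable name
        elif action == "wait":
            time.sleep(step["seconds"])
        elif action == "type_text":
            pyautogui.typewrite(step["text"], interval=0.02)
        elif action == "hotkey":
            pyautogui.hotkey(*step["keys"])
        else:
            raise ValueError(f"Unknown skill step: {action}")


def save_skill(skill: dict, path: str = "skills.json") -> None:
    """Persist the skill so it can be reused on similar future commands."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            skills = json.load(f)
    except FileNotFoundError:
        skills = {}
    skills[skill["name"]] = skill
    with open(path, "w", encoding="utf-8") as f:
        json.dump(skills, f, indent=2)


if __name__ == "__main__":
    save_skill(SKILL)
    run_skill(SKILL)
```

Keeping skills as inspectable step data (rather than raw generated code) makes it easier to review what will run before replaying it.
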
- Locate and click on-screen images / UI elements — tell it to click a UI element using a reference image (see the sketch at the end of this list).
- Take screenshots — capture screen regions and save them to disk.
- Guide user to capture assets — open example pages and help the user create assets/images via timed screenshots.
- Copy/paste content to clipboard — generate text and place it on clipboard for manual use.
- Type long text automatically — paste or type long generated text into focused applications.
- Save and load assistant memory — persistent memory store for tasks, reminders, learned skills, and preferences.
- View recent interactions / logs — read back recent command history or interaction log.
- Export or persist skills and activity — skills and user activity logs are stored to disk for reuse/audit.
- Scrape web page text — visit pages and extract body text to summarize or include in responses.
- Collect multiple search result pages — follow and aggregate text from several top results.
- Find similar past commands — ask it to recall similar past actions and reapply them.
- Retry last failed action — instruct it to retry the most recent failed command.
- Run background jobs — long-running background tasks like research, project generation, or reminder monitoring.
- Play/pause audio and control playback — pause, resume, or stop TTS/audio output.
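
A minimal sketch of the screen-automation primitives behind the image-click, region-screenshot, and clipboard-typing features, assuming `pyautogui` and `pyperclip`; the asset path and region values are examples, and image matching with `confidence=` also requires OpenCV:

```python
import pyautogui
import pyperclip  # clipboard access (pip install pyperclip)


def click_ui_element(image_path: str) -> bool:
    """Locate a UI element by reference image and click its center."""
    try:
        location = pyautogui.locateCenterOnScreen(image_path, confidence=0.8)
    except pyautogui.ImageNotFoundException:
        return False
    if location is None:  # older PyAutoGUI versions return None instead of raising
        return False
    pyautogui.click(location)
    return True


def capture_region(path: str, region: tuple) -> None:
    """Save a screenshot of a screen region (left, top, width, height) to disk."""
    pyautogui.screenshot(path, region=region)


def paste_long_text(text: str) -> None:
    """Put generated text on the clipboard and paste it into the focused app."""
    pyperclip.copy(text)
    pyautogui.hotkey("ctrl", "v")


if __name__ == "__main__":
    if not click_ui_element("assets/save_button.png"):
        print("Could not find the reference image on screen.")
    capture_region("assets/top_left.png", (0, 0, 400, 300))
    paste_long_text("Long generated text goes here...")
```
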