You know what would be great? A lightweight wrapper model for voice that can use heavier ones in the background.
That much is easy, but what if you could also speak to and interrupt the main voice model and keep giving it instructions? Like talking to customer support, but instead of being put on hold you can ask several questions and get live updates.
It's actually a nice idea - an always-on micro AI agent with voice-to-text capabilities that listens and acts on your behalf.
Actually, I'm experimenting with this kind of stuff and trying to find a nice UX to make Ottex a voice command center - to trigger AI agents like Claude, open code to work on something, execute simple commands, etc.
I speak daily in both English and Russian and have been using Gemini 3 Flash as my main transcription model for a few months. I haven't seen any model that provides better overall quality in terms of understanding, custom dictionary support, instruction following, and formatting. It's the best STT model in my experience. Gemini 3 Flash has somewhat uncomfortable latency though, and Flash Lite is much better in this regard.
Gemini 3.1 Flash-Lite is our most cost-efficient Gemini model, optimized for low latency use cases for high-volume, cost-sensitive LLM traffic.
It provides a significant quality increase over the Gemini 2.0 and 2.5 Flash-Lite models, matching Gemini 2.5 Flash performance across key capability areas:
Improved response quality: Aims to match 2.5 Flash performance and align with target Flash-Lite use cases.
Improved instruction following: Targeted improvements to serve as a reliable migration path for complex chatbot and instruction-heavy workflows.
Improved audio input: Better audio-input quality for tasks like Automated Speech Recognition (ASR).
Expanded thinking support: You can control how much reasoning the model performs by choosing from minimal, low, medium, or high thinking levels. This feature lets you balance response quality and speed for your specific use case.
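The thinking levels above map naturally to a per-request setting. Here is a minimal sketch of choosing one when building a request; the payload field names (`generationConfig`, `thinkingConfig`, `thinkingLevel`) are my assumption about the REST API shape, so verify them against the official Gemini API docs.

```python
# Hypothetical sketch: choosing a thinking level per request.
# Field names ("thinkingConfig", "thinkingLevel") are a guess at the
# Gemini REST shape -- check the official API docs before relying on them.

def build_request(prompt: str, thinking_level: str = "minimal") -> dict:
    """Build a generateContent-style payload with a chosen thinking level."""
    levels = {"minimal", "low", "medium", "high"}
    if thinking_level not in levels:
        raise ValueError(f"thinking_level must be one of {sorted(levels)}")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {"thinkingConfig": {"thinkingLevel": thinking_level}},
    }

# Low-latency transcription cleanup wants minimal thinking; a tricky
# formatting instruction might justify "medium" or "high".
payload = build_request("Transcribe and punctuate this.", "minimal")
```

The point of the knob is exactly the quality/speed trade-off the announcement describes: dictation-style traffic stays fast at `minimal`, while instruction-heavy requests can buy more reasoning.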
---
Already available in Google AI Studio and OpenRouter
Try ottex with Gemini 3 Flash as the transcription model. I'm bilingual as well and frequently switch between languages - Gemini handles this perfectly, even when I speak two languages in one transcription.
You can try ottex for this use case - it has both context capture (app screenshots) and native LLM support, meaning it can send the audio AND a screenshot directly to Gemini 3 Flash to produce a bespoke result.
I'm building in the same space, working on https://ottex.ai - it's a free STT app with local models and BYOK support (OpenRouter, Groq, Mistral, and more).
The top feature is the per-app custom settings - you can pick different models and instructions for different apps and websites.
- I use the Parakeet fast model when working with Claude Code (VS Code app).
- And I use a smart one when I draft notes in Obsidian. I have a prompt to clean up my rambling and format the result with proper Markdown, very convenient.
One more cool thing: it lets me use LLMs with audio input modalities directly (not just as text post-processing) - e.g. it sends the audio to Gemini and prompts it to transcribe, format, etc., in one run. I find it a bit slow for Claude Code work, but it's the absolute best model in terms of accuracy, understanding, and formatting. It's the only model I trust to understand what I meant and produce the correct result, even when I use multiple languages, tech terms, etc.
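The "one run" idea above can be sketched as a single request that carries both the raw audio and the formatting instructions, instead of STT followed by a second text-cleanup call. This is a hedged illustration: the `inlineData`/`mimeType` field names follow my understanding of the Gemini REST API, and the prompt text is made up.

```python
import base64

# One request = audio + instructions, no separate post-processing pass.
# Field names ("inlineData", "mimeType") are assumptions about the
# Gemini REST payload shape; verify against the official docs.

PROMPT = (
    "Transcribe this audio. Clean up filler words and format the "
    "result as Markdown with headings and bullets where natural."
)

def one_shot_payload(audio_bytes: bytes, mime_type: str = "audio/wav") -> dict:
    """Bundle instructions and base64-encoded audio into one request body."""
    return {
        "contents": [{
            "parts": [
                {"text": PROMPT},
                {"inlineData": {
                    "mimeType": mime_type,
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                }},
            ]
        }]
    }

payload = one_shot_payload(b"\x00\x01fake-audio-bytes")
```

Because the model sees the audio itself, it can apply the formatting instructions to what you meant, not to an already-lossy intermediate transcript.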
Interesting, but I quickly uninstalled it after (1) it asked for permission to record keystrokes across all applications and (2) it registered the global keyboard shortcut Option+Space without asking me.
Hey, I would really appreciate it if you tried https://ottex.ai
I'm working on a Wispr/Spokenly competitor. It's free without any paywalled features, supports local models and a bunch of API providers including Mistral.
For local models, ottex has Parakeet V3, Whisper, GLM-ASR nano, and Qwen3-ASR (no Voxtral yet, though - looking into it).
btw, you can try the new Voxtral model via API (the model name to pick is `voxtral-mini-latest:transcribe`). I personally switched to it as my main default fast model - it's really good.
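For reference, here is a rough sketch of what a Voxtral transcription call could look like. The `:transcribe` suffix above is how the model appears in ottex's picker; for the raw API I'm assuming a dedicated audio-transcriptions endpoint and a plain `voxtral-mini-latest` model id - double-check Mistral's docs before using this.

```python
import os

# Assumed endpoint and model id for Mistral's transcription API --
# these are guesses to illustrate the shape, not confirmed values.
API_URL = "https://api.mistral.ai/v1/audio/transcriptions"

def transcription_request(audio_path: str) -> dict:
    """Describe the multipart POST any HTTP client could send."""
    return {
        "url": API_URL,
        "headers": {
            "Authorization": f"Bearer {os.environ.get('MISTRAL_API_KEY', '')}",
        },
        "data": {"model": "voxtral-mini-latest"},
        "files": {"file": audio_path},  # multipart file field
    }

req = transcription_request("note.wav")
```

A dedicated transcription endpoint (as opposed to chat with audio attached) is what keeps latency in the ~1.4 s range mentioned below.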
We benchmarked it for real-life voice-to-text use cases:
Key takeaways:
- 1.8x faster than Gemini 3 Flash on average
- ~1.4 sec transcription time for short to medium recordings
- ~$0.50/mo for heavy users (10h+ transcription)
- Close to SOTA audio understanding and formatting instruction-following
- Multilingual: one model, 100+ languages
Gemini is slowly making $15/month voice apps obsolete.