The Problem
Most live AI listening agents are built around meeting transcripts. The ones that do offer a conversational interface either have an awkward avatar or a UX that feels like an afterthought. I wanted to build the experience I was actually looking for: a local, voice-first AI companion with a real sense of presence. The specific use cases I had in mind were interview prep with a mode that pushes on the clarity and structure of answers, a speech coach that surfaces filler words and ums in real time, a mentor personality that cuts against the sycophantic tendency of most LLMs and gives direct feedback on goals and thinking, a colleague for brainstorming and working through ideas out loud, and a therapist mode for reflective conversation. Five distinct tools, one interface. I also had a vision for a unique illustrated avatar style, but Three.js character work is harder than it looks, so that part is parked for now. You can already upload your own avatar if you have one.
What I built
A Tauri desktop app that pairs any LLM with a 3D avatar that reacts to the conversation: mouth sync, eye blinks, emotion-driven expressions, head movement. Voice goes in via Whisper, text goes to the model, and the response drives both TTS and avatar state. Ships with five built-in personalities: Interviewer (structures and challenges your responses), Speech Coach (flags filler words and pacing), Mentor (direct feedback, no flattery), Colleague (open brainstorming partner), and Therapist (reflective, patient listening). Works with Anthropic, OpenAI, or any local model via Ollama.
What's interesting
Full multi-modal pipeline
Voice → Whisper (local or OpenAI) → LLM → emotion detection → avatar state update → TTS playback. Each stage is swappable: local Kokoro for TTS instead of OpenAI, Ollama instead of Claude, any OpenAI-compatible endpoint. The pipeline doesn't assume a specific provider at any stage.
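A minimal sketch of what that stage-agnostic wiring can look like; the interface and function names are illustrative assumptions, not the app's actual API:

```ts
// Illustrative stage interfaces -- hypothetical names, not the app's real API.
interface SttEngine {
  transcribe(audio: Blob): Promise<string>;
}

interface LlmClient {
  // Yields chunks so downstream stages can react mid-stream.
  stream(prompt: string): AsyncIterable<string>;
}

interface TtsEngine {
  speak(text: string): Promise<void>;
}

// One conversational turn. The pipeline depends only on the interfaces,
// so each stage can be swapped: local or OpenAI Whisper for STT, Claude
// or Ollama for the LLM, Kokoro or OpenAI for TTS.
async function runTurn(stt: SttEngine, llm: LlmClient, tts: TtsEngine, audio: Blob) {
  const transcript = await stt.transcribe(audio);
  let response = "";
  for await (const chunk of llm.stream(transcript)) {
    response += chunk; // emotion detection and avatar updates would hook in here
  }
  await tts.speak(response);
}
```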
Four avatar renderer backends
Built-in procedural 3D (Three.js morph targets), Rive vector animation (a state machine with talking and emotion inputs), VRM skeleton control (@pixiv/three-vrm), and an in-progress Ready Player Me integration. Users can upload their own .riv or .vrm files, which are stored in IndexedDB and resolved via blob URL. Renderers are swapped at runtime without a restart.
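The swap works because the backends can share one interface, with uploaded files resolved from IndexedDB to blob URLs the loaders can consume. A rough sketch; the interface, store, and method names are assumptions for illustration:

```ts
// Hypothetical common interface the four backends might implement.
interface AvatarRenderer {
  mount(container: HTMLElement): Promise<void>;
  setTalking(talking: boolean): void;
  setEmotion(emotion: string): void;
  dispose(): void;
}

// Runtime swap: tear down the old backend, mount the new one in its place.
async function swapRenderer(
  container: HTMLElement,
  current: AvatarRenderer | null,
  next: AvatarRenderer,
): Promise<AvatarRenderer> {
  current?.dispose();
  await next.mount(container);
  return next;
}

// Resolve a user-uploaded .riv/.vrm blob from IndexedDB to a blob URL,
// so the Rive/VRM loaders can treat the local file like a remote asset.
// The "avatars" store name is made up for this sketch.
function resolveAvatarUrl(db: IDBDatabase, key: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = db.transaction("avatars").objectStore("avatars").get(key);
    req.onsuccess = () => resolve(URL.createObjectURL(req.result as Blob));
    req.onerror = () => reject(req.error);
  });
}
```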
Streaming parser for extended thinking
Extended thinking arrives as think tags interleaved with the response stream, sometimes split across chunk boundaries. The parser maintains state across chunks to extract reasoning content correctly and display it separately, without buffering the full response first.
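The trick is holding back any suffix that could be the start of a tag until the next chunk arrives. A minimal sketch, assuming <think>/</think> delimiters and a two-channel output; the class and method names are illustrative:

```ts
// Chunk-boundary-safe parser for <think>...</think> tags in a token stream.
type Parsed = { reasoning: string; visible: string };

class ThinkTagParser {
  private inThink = false;
  private pending = ""; // possible partial tag held over from the last chunk

  feed(chunk: string): Parsed {
    let buf = this.pending + chunk;
    this.pending = "";
    const out: Parsed = { reasoning: "", visible: "" };

    while (buf.length > 0) {
      const tag = this.inThink ? "</think>" : "<think>";
      const idx = buf.indexOf(tag);
      if (idx !== -1) {
        // Everything before the tag belongs to the current channel.
        this.emit(out, buf.slice(0, idx));
        this.inThink = !this.inThink;
        buf = buf.slice(idx + tag.length);
        continue;
      }
      // No full tag: emit what's safe, keep any suffix that could be
      // the start of a tag split across the chunk boundary.
      const keep = this.partialSuffix(buf, tag);
      this.emit(out, buf.slice(0, buf.length - keep));
      this.pending = buf.slice(buf.length - keep);
      break;
    }
    return out;
  }

  private emit(out: Parsed, text: string) {
    if (this.inThink) out.reasoning += text;
    else out.visible += text;
  }

  // Length of the longest suffix of `buf` that is a prefix of `tag`.
  private partialSuffix(buf: string, tag: string): number {
    const max = Math.min(buf.length, tag.length - 1);
    for (let n = max; n > 0; n--) {
      if (buf.endsWith(tag.slice(0, n))) return n;
    }
    return 0;
  }
}
```

Each feed() call routes text to either the reasoning or the visible channel, so the UI can render both incrementally without waiting for the stream to finish.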
Conversation state machine
idle → listening → processing → speaking, with Escape to interrupt at any phase. Silence detection auto-stops the STT after a configurable quiet period. Mic permission errors surface a retry prompt. The state machine means the avatar always reflects what's actually happening, not just what was last said.
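As a transition table the machine is tiny; the event names below are illustrative assumptions, not the actual implementation:

```ts
// Conversation states and allowed transitions.
type ConvState = "idle" | "listening" | "processing" | "speaking";
type ConvEvent = "micStart" | "silence" | "responseReady" | "playbackDone" | "interrupt";

const transitions: Record<ConvState, Partial<Record<ConvEvent, ConvState>>> = {
  idle:       { micStart: "listening" },
  listening:  { silence: "processing", interrupt: "idle" },    // silence auto-stops STT
  processing: { responseReady: "speaking", interrupt: "idle" }, // Escape cancels the request
  speaking:   { playbackDone: "idle", interrupt: "idle" },      // Escape cuts off TTS
};

function step(state: ConvState, event: ConvEvent): ConvState {
  // Events that don't apply in the current state are ignored, so the
  // avatar's displayed state can never drift from the actual phase.
  return transitions[state][event] ?? state;
}
```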