An AI VTuber that chats via a local LLM, speaks through an external TTS server, and animates a VRM avatar in real time via WebSocket.
- Local LLM chat (Ollama by default)
- Low-latency, chunked TTS playback (external server, e.g., GPT-SoVITS)
- VRM animation signals (`tts_start`, `tts_end`) over WebSocket
- Optional push-to-talk voice input (faster-whisper)
- YAML-driven configuration (providers, audio, TTS, personality)
- Create and activate a virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
- Install dependencies
pip install -r requirements.txt
- Start your TTS server (separate project)
  - Must expose `/tts` on `http://127.0.0.1:9880`
  - For streaming WAV, return the WAV header first, then raw PCM chunks (see the header sketch after this list)
  - Prefer an absolute `ref_audio_path` on Windows
- Run Miko
python miko.py
- (Optional) Use the Setup UI
python setup.py
- Save audio/TTS/personality settings, then click Start to launch VRM loader and Miko.
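For the streaming-WAV contract above, the server emits one WAV header up front and then raw PCM chunks. A minimal way to build such a header is sketched below (illustrative only; the mono/16-bit/32 kHz parameters are assumptions, adjust them to your model's output format):

```python
# wav_header_sketch.py - build a standalone WAV header for a streamed PCM response.
# Illustrative only; channels/width/rate are assumptions for a typical TTS output.
import io
import wave

def wave_header_chunk(channels=1, sample_width=2, sample_rate=32000) -> bytes:
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"")            # header only; data size stays 0, which streaming decoders tolerate
    buf.seek(0)
    return buf.read()

# A streaming /tts handler would yield wave_header_chunk() once, then yield raw PCM chunks.
```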
This repo includes a VRM viewer/loader (in `vrmloader/`) and Miko in one place.
- Use the Setup UI to launch both processes with your saved config:
  - It starts `vrmloader/vrmloader.exe` (the VRM viewer) first
  - Then it launches `miko.py`, which connects and emits `tts_start`/`tts_end`
  - WebSocket: `ws://localhost:{vrm_websocket_port}` (default `8765`)
- Manual launch (alternative):
  - Start the viewer: `./vrmloader/vrmloader.exe`
  - In another terminal, run: `python miko.py`
  - Ensure the viewer is connected and listening for VRM WebSocket messages
Note: The TTS server (e.g., GPT-SoVITS) is external and must be started separately.
- `miko.py` – main app (LLM chat + TTS streaming + VRM signals)
- `setup.py` – PyQt6 Setup UI (devices, ASR, personality, launch)
- `miko_config.yaml` – main configuration (providers, audio, TTS, ASR, personality)
- `audio_config.json` – persisted output device
- `modules/asr.py` – ASR manager (used by `miko.py`)
- `modules/audio.py` – audio playback thread
- `vrmloader/` – example VRM resources and the `vrmloader.exe` viewer
Note: `miko.py` currently inlines most logic and only imports `ASRManager` from `modules.asr`. The Setup UI reads audio utilities from `modules/audio_utils` and stores `modules/miko_personality.json` for compatibility.
Miko reads its settings from YAML and falls back gracefully when keys are missing (a loading sketch follows the list below).
- Providers & Model
  - `provider`: active provider key (e.g., `ollama`)
  - `providers.{provider}.model`: model name used by `ollama.chat`
  - Fallback: `ollama_config.selected_model`
- TTS
  - `tts_config.server_url`: e.g., `http://127.0.0.1:9880` (falls back to top-level `tts_server_url`)
  - `tts_config.text_lang`, `prompt_lang`, `ref_audio_path`, `prompt_text`, `media_type`
  - `tts_config.streaming_mode`, `parallel_infer` (the GET flow forces `streaming_mode=true` for compatibility)
  - Missing fields inherit from `sovits_config` when present
- Audio Devices
  - `audio_devices.device_index`: output device (or `null` for the system default)
  - `audio_devices.asr_enabled`, `asr_model`, `asr_device`, `push_to_talk_key`, `input_device_id`
- ASR Fallback (if not using `audio_devices`)
  - `asr_config.enabled`, `model`, `device`, `push_to_talk_key`, `input_device_id`
- VRM
  - `vrm_websocket_port` (default `8765`)
- Personality
  - `personality.name`, `system_prompt`, `greeting`, `farewell`
  - The current build forces a fixed welcome TTS string for stability during testing
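For orientation, the fallback order described above can be pictured with a small loader sketch (not part of the repo; it assumes PyYAML and only the key names listed in this section):

```python
# config_sketch.py - illustrates the YAML fallback order described above (not the actual loader).
import yaml  # assumes PyYAML

def load_config(path="miko_config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f) or {}

def resolve_model(cfg):
    provider = cfg.get("provider", "ollama")
    model = cfg.get("providers", {}).get(provider, {}).get("model")
    return model or cfg.get("ollama_config", {}).get("selected_model")  # fallback

def resolve_tts_url(cfg):
    # tts_config.server_url, falling back to the top-level tts_server_url
    return cfg.get("tts_config", {}).get("server_url") or cfg.get("tts_server_url", "http://127.0.0.1:9880")

if __name__ == "__main__":
    cfg = load_config()
    print("model:", resolve_model(cfg))
    print("tts url:", resolve_tts_url(cfg))
    print("vrm port:", cfg.get("vrm_websocket_port", 8765))
```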
Run the GUI to configure and launch.
python setup.py
- Select input/output devices, ASR, and voice settings
- Save updates to `miko_config.yaml` and `audio_config.json`
- Launch flow: starts `vrmloader/vrmloader.exe`, then runs `miko.py`
- WebSocket server: `ws://localhost:{vrm_websocket_port}` (default `8765`)
- Messages sent to all connected VRM clients:
{ "type": "tts_start", "text": "..." }
{ "type": "tts_end" }
- Use `vrmloader/vrmloader.exe` or your own VRM viewer that consumes these events to trigger lip-sync/animations (a minimal client sketch follows).
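If you are writing your own viewer, a minimal consumer of these messages might look like this (a sketch assuming the `websockets` package and the default port; mapping the events to actual lip-sync/blendshapes is up to your viewer):

```python
# vrm_client_sketch.py - minimal consumer of Miko's tts_start/tts_end events.
# Assumes `pip install websockets` and the default vrm_websocket_port (8765).
import asyncio
import json
import websockets

async def listen(uri="ws://localhost:8765"):
    async with websockets.connect(uri) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "tts_start":
                print("start talking animation for:", msg.get("text", ""))
            elif msg.get("type") == "tts_end":
                print("return to idle pose")

if __name__ == "__main__":
    asyncio.run(listen())
```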
Miko's TTS client performs a GET request to `/tts` with parameters like:
text, text_lang, ref_audio_path, prompt_text, prompt_lang,
streaming_mode=true, parallel_infer, media_type=wav,
batch_size, top_k, top_p, temperature, text_split_method,
speed_factor, fragment_interval, repetition_penalty, seed
Notes:
- For streaming WAV, the server should return the WAV header first, then raw PCM chunks
- If you get HTTP 400, verify the required params and make `ref_audio_path` absolute (see the client sketch below)
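To exercise the GET flow outside Miko, a small script like the one below can help (a sketch using the upstream-style parameters listed above; the reference-audio path and prompt text are placeholders for your own clip):

```python
# tts_stream_sketch.py - probe GET /tts with streaming_mode=true (illustrative only).
# ref_audio_path / prompt_text are placeholders for your own reference clip.
import requests

params = {
    "text": "Hello from Miko!",
    "text_lang": "en",
    "ref_audio_path": r"C:\path\to\reference.wav",  # absolute path recommended on Windows
    "prompt_text": "transcript of the reference clip",
    "prompt_lang": "en",
    "media_type": "wav",
    "streaming_mode": "true",   # server sends a WAV header first, then raw PCM chunks
    "parallel_infer": "true",
}

with requests.get("http://127.0.0.1:9880/tts", params=params, stream=True, timeout=60) as r:
    r.raise_for_status()  # HTTP 400 usually means a missing or invalid parameter
    with open("out.wav", "wb") as f:
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)
print("saved out.wav")
```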
You must run a TTS server separately. For best latency/quality, use a CUDA-enabled GPT-SoVITS build and run its API server.
- NVIDIA GPU + recent drivers
- CUDA-enabled PyTorch in the GPT-SoVITS environment (CUDA 12.x commonly used)
- FastAPI + Uvicorn in that environment
# 1) Create and activate a dedicated env (example with conda)
conda create -n gpt-sovits python=3.10 -y
conda activate gpt-sovits
# 2) Install CUDA-enabled PyTorch (adjust CUDA version as needed)
# Example for CUDA 12.1:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
# 3) Install project requirements (from GPT-SoVITS repo)
pip install -r requirements.txt
# 4) API server deps
pip install fastapi uvicorn soundfile websockets
# 5) Launch the API server (adjust paths)
python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml
- Use CUDA builds of PyTorch; verify with `python -c "import torch; print(torch.cuda.is_available())"`
- Prefer FP16 where supported (model/config dependent)
- Keep `streaming_mode=true` for GET streaming (Miko enforces this automatically)
- If you maintain a modified API (e.g., similar to `vrmloader/api_v3.py`) that emits `tts_start`/`tts_end`, run it inside the GPT-SoVITS environment (not this repo's venv) so that all model deps and CUDA builds are available
- Keep it on `127.0.0.1:9880` to match Miko's default configuration
If you downloaded a precompiled GPT-SoVITS v2pro package, edit its `go-webui.bat` to launch the API server on the expected host/port.
Example file: `go-webui.bat`
set "SCRIPT_DIR=%~dp0"
set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
cd /d "%SCRIPT_DIR%"
set "PATH=%SCRIPT_DIR%\runtime;%PATH%"
runtime\python.exe -I api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS\configs\tts_infer.yaml
pause
Notes:
- Ensure the `-c` path points to your actual `tts_infer.yaml`
- Keep the port `9880` (or change `tts_config.server_url` in `miko_config.yaml` accordingly)
- Hold the configured hotkey (default `shift`) to record; release to transcribe (a simplified sketch follows below)
- Requires `faster-whisper` and a working mic; enable it in YAML (`asr_enabled: true`)
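The loop is roughly the following (a simplified sketch, not the actual `ASRManager` code; it assumes `sounddevice`, `numpy`, `keyboard`, and `faster-whisper` are installed):

```python
# ptt_sketch.py - simplified push-to-talk: hold a key to record, release to transcribe.
# Not the actual ASRManager; assumes sounddevice, numpy, keyboard, faster-whisper.
import numpy as np
import sounddevice as sd
import keyboard
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
model = WhisperModel("base", device="cpu")  # maps to asr_model / asr_device in the config

def record_while_held(key="shift"):
    frames = []

    def callback(indata, frame_count, time_info, status):
        frames.append(indata.copy())

    keyboard.wait(key)  # block until the hotkey goes down
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32", callback=callback):
        while keyboard.is_pressed(key):
            sd.sleep(50)  # keep recording while the key is held
    return np.concatenate(frames).flatten() if frames else np.zeros(0, dtype="float32")

if __name__ == "__main__":
    audio = record_while_held()
    segments, _ = model.transcribe(audio, language="en")
    print("".join(seg.text for seg in segments))
```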
.\.venv\Scripts\Activate.ps1
python miko.py
- Ensure the TTS server is running at the configured URL
- If using Ollama, ensure `ollama serve` is running and the model is available (a quick preflight sketch follows)
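A quick preflight check can save a silent failure (a sketch; the `/tts` probe only confirms the port answers, and the Ollama check uses its standard `/api/tags` listing endpoint):

```python
# preflight_sketch.py - sanity checks before launching miko.py (illustrative only).
import requests

# TTS server: any HTTP response (even a 400 for missing params) means the port is up.
try:
    r = requests.get("http://127.0.0.1:9880/tts", timeout=3)
    print(f"TTS server reachable (HTTP {r.status_code})")
except requests.ConnectionError:
    print("TTS server not reachable - start the GPT-SoVITS API server first")

# Ollama: /api/tags lists locally available models (requires `ollama serve`).
try:
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=3).json()
    print("Ollama models:", [m["name"] for m in tags.get("models", [])])
except requests.ConnectionError:
    print("Ollama not reachable - run `ollama serve`")
```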
Runtime installs are skipped when frozen. Example onedir build:
pyinstaller --noconfirm --clean --onedir --name MikoVTuber ^
--add-data "miko_config.yaml;." ^
--add-data "audio_config.json;." ^
--add-data "main_sample.wav;." ^
--add-data "modules\miko_personality.json;modules" ^
--add-data "vrmloader\vrmloader.exe;vrmloader" ^
--hidden-import aiohttp --hidden-import websockets ^
--hidden-import sounddevice --hidden-import numpy --hidden-import requests ^
--paths modules miko.py
Run from `dist\MikoVTuber` so relative assets are found.
- TTS returns 400
  - Check that the required params are present
  - Use an absolute `ref_audio_path`
  - Keep `streaming_mode=true` for GET-based streaming
- Event loop warnings
  - VRM signal scheduling is guarded; run from a console to capture logs
- No audio output
  - Pick a different output device in the menu (the device-listing snippet below can help)
  - Verify Windows sound settings and sample rate
- ASR not working
  - Confirm `asr_enabled`, mic permissions, and `input_device_id`
- EXE instantly closes
  - Run from a console to capture output; ensure the external TTS server is started separately
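When audio output or ASR input misbehaves, listing what `sounddevice` actually sees helps pick a valid `device_index`/`input_device_id` (a quick diagnostic, not part of the app):

```python
# list_devices_sketch.py - quick diagnostic for choosing device_index / input_device_id.
import sounddevice as sd

for idx, dev in enumerate(sd.query_devices()):
    kinds = []
    if dev["max_output_channels"] > 0:
        kinds.append("output")
    if dev["max_input_channels"] > 0:
        kinds.append("input")
    print(f"{idx}: {dev['name']} ({'/'.join(kinds)}, {int(dev['default_samplerate'])} Hz)")
```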
- `miko.py`
  - YAML config → LLM chat via Ollama → sentence buffering → TTS GET `/tts` (sketched below)
  - Audio chunks → playback thread → `tts_start`/`tts_end` → VRM WebSocket clients
  - Optional ASR via `ASRManager` (push-to-talk)
- `setup.py`
  - PyQt6 GUI to configure YAML, select devices, and launch the VRM loader then Miko
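The per-sentence flow in `miko.py` can be approximated as follows (a condensed sketch, not the actual implementation; it assumes the `ollama` Python package and omits the reference-audio parameters shown earlier):

```python
# pipeline_sketch.py - condensed chat -> sentence buffer -> per-sentence TTS flow.
# Not the actual miko.py; assumes the `ollama` package and the /tts endpoint shown earlier.
import re
import ollama
import requests

TTS_URL = "http://127.0.0.1:9880/tts"
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def speak(sentence: str) -> None:
    # One GET per sentence keeps latency low; ref_audio_path/prompt_text omitted for brevity.
    params = {"text": sentence, "text_lang": "en", "streaming_mode": "true", "media_type": "wav"}
    with requests.get(TTS_URL, params=params, stream=True, timeout=60) as r:
        for chunk in r.iter_content(chunk_size=4096):
            pass  # hand each chunk to the audio playback thread here

def chat_and_speak(model: str, user_text: str) -> None:
    buffer = ""
    stream = ollama.chat(model=model, messages=[{"role": "user", "content": user_text}], stream=True)
    for part in stream:
        buffer += part["message"]["content"]
        *complete, buffer = SENTENCE_END.split(buffer)  # flush finished sentences
        for sentence in complete:
            speak(sentence)
    if buffer.strip():
        speak(buffer)

if __name__ == "__main__":
    chat_and_speak("llama3", "Introduce yourself in two sentences.")
```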
Built for creators who want reliable, low-latency AI VTubing with deterministic animation sync. Have fun with Miko!
- Inspired by the public Riko Project by Just Rayen. In canon, Miko is the "shameless clone with blue streaks" (Riko has red) who totally "stole" Riko's code and attitude – purely a joke and tribute. See: rayenfeng/riko_project
- TTS powered externally by GPT-SoVITS variants. For the latest builds and notes, see: RVC-Boss/GPT-SoVITS Releases
Reference: RVC-Boss/GPT-SoVITS api_v2.py
- CUDA/TF32 optimizations (see the sketch at the end of this section)
  - Enables cuDNN benchmark and TF32 fast paths; attempts `torch.set_float32_matmul_precision("high")`
  - Tries enabling CUDA SDP kernels (FlashAttention/mem-efficient)
  - Sets `BIGVGAN_USE_CUDA_KERNEL=1` and attempts fused-kernel toggles on BigVGAN
- Performance logging
  - Timer and GPU memory helpers around pipeline init and generation
  - Suppresses noisy HTTP logs (urllib3/httpx)
- Memory hygiene
  - Calls `torch.cuda.empty_cache()` after generation to reduce fragmentation
- Audio packing / formats
  - Unified `pack_audio` for wav/raw/ogg/aac (ogg via soundfile, aac via an ffmpeg pipe)
  - For streaming WAV, sends a one-time WAV header, then raw PCM chunks
- Request validation
  - Enforces required fields and validates languages against `tts_config.languages`
  - Rejects `ogg` when not in streaming mode
- Endpoints
  - GET/POST `/tts` compatible with upstream, with enhanced streaming behavior
  - `/control` (restart/exit), `/set_gpt_weights`, `/set_sovits_weights`, `/set_refer_audio`
  - New diagnostics: `/cuda_info` and `/health`
- Runtime/boot
  - CLI args: `-a` (bind), `-p` (port), `-c` (config) with explicit boot logs
  - Forces `workers=1` for uvicorn
In short: keeps upstream contract, adds GPU fast-paths (TF32/SDP/BigVGAN), stricter validation, richer formats, explicit WAV streaming headering, memory cleanup, and health/CUDA introspection for low-latency, long-running GPU use.
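For orientation, the GPU fast paths listed above correspond to standard PyTorch switches along these lines (a sketch, not the actual modified `api_v2.py`/`api_v3.py` code):

```python
# gpu_fastpath_sketch.py - the kind of startup toggles described above, in plain PyTorch.
import os
import torch

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True          # pick the fastest conv algorithms
    torch.backends.cuda.matmul.allow_tf32 = True   # TF32 matmul fast path
    torch.backends.cudnn.allow_tf32 = True         # TF32 conv fast path
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass  # older torch builds may not expose this
    try:
        # Prefer FlashAttention / memory-efficient scaled-dot-product kernels
        torch.backends.cuda.enable_flash_sdp(True)
        torch.backends.cuda.enable_mem_efficient_sdp(True)
    except Exception:
        pass

os.environ.setdefault("BIGVGAN_USE_CUDA_KERNEL", "1")  # opt into BigVGAN's fused CUDA kernel

# After each generation, torch.cuda.empty_cache() is called to reduce fragmentation.
```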