🎙️ GibbsTTS — Zero-Shot Voice Cloning TTS

Official demo for Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech.

Upload a short reference clip (a few seconds is enough). The reference transcript is optional: if you leave it blank, Whisper transcribes the clip automatically. Then type the text you want to synthesize, and the model will speak it in the reference voice. The demo supports English and Mandarin Chinese, plus experimental EN/ZH mixing.
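
The optional-transcript behavior can be pictured as a small fallback step. The following is a minimal sketch, not the demo's actual code: `resolve_reference_transcript` and the `transcribe` callable are hypothetical names, with `transcribe` standing in for whatever Whisper call the demo makes.

```python
from typing import Callable, Optional


def resolve_reference_transcript(
    audio_path: str,
    transcript: Optional[str],
    transcribe: Callable[[str], str],
) -> str:
    """Return the user's transcript, or fall back to ASR when it is blank.

    `transcribe` is a hypothetical stand-in for the demo's Whisper call.
    """
    # A transcript containing only whitespace counts as blank.
    if transcript is not None and transcript.strip():
        return transcript.strip()
    # No usable transcript was given, so transcribe the reference clip.
    return transcribe(audio_path).strip()
```

Passing the ASR step in as a callable keeps the sketch independent of any particular Whisper package and makes the fallback easy to test with a stub.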

ASR language

Language hint for Whisper. Choose None to use auto-detection.
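
As an illustration of the None option, a dropdown value could be turned into a language argument roughly as follows. The function name `asr_language_hint` and the label-to-code table are assumptions made for this sketch; the demo's real choices and codes may differ. In the open-source `whisper` package, for example, leaving the language as `None` makes `transcribe` detect it from the audio.

```python
from typing import Optional

# Hypothetical label-to-code table; the demo's actual dropdown choices may differ.
LANGUAGE_CODES = {"English": "en", "Chinese": "zh"}


def asr_language_hint(choice: Optional[str]) -> Optional[str]:
    """Map a dropdown choice to a language code, or None for auto-detection."""
    # Both a missing value and the literal "None" label mean "auto-detect".
    if choice is None or choice == "None":
        return None
    return LANGUAGE_CODES[choice]
```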

TTS language

Language used by GibbsTTS for synthesis.


If you find this work useful, please cite the paper. The model was trained on Emilia-en/zh.