[P] Added 8 Indian languages to Chatterbox TTS via LoRA — 1.4% of parameters, no phoneme engineering [P]
TL;DR:
Fine-tuned Chatterbox-Multilingual (Resemble AI's open-source TTS) to support Telugu, Kannada, Bengali, Tamil, Malayalam, Marathi, Gujarati, and Hindi using LoRA adapters + tokenizer extension. Only 7.8M / 544M parameters trained. Model + audio samples available.
---
The Problem
Chatterbox-Multilingual supports 23 languages with zero-shot voice cloning, but no Dravidian languages (Telugu, Kannada, Tamil, Malayalam) and limited Indo-Aryan coverage beyond Hindi. That's 500M+ speakers with no representation.
The conventional approach would be to build G2P (grapheme-to-phoneme) systems for each language and retrain the full model, a months-long effort. Hindi schwa deletion alone is an unsolved problem, and Bengali G2P is notoriously hard.
The Approach
Instead of phonemes, I went grapheme-level:

- **Tokenizer extension:** extended the BPE tokenizer with Indic script characters (2454 → 2871 tokens). Telugu, Kannada, Bengali, Tamil, Malayalam, and Gujarati graphemes added alongside the existing Devanagari.
- **Brahmic warm-start:** initialized new character embeddings from phonetically equivalent Devanagari characters. Telugu "క" (ka) gets initialized from Hindi "क" (ka). This works because Brahmic scripts share phonetic structure: same sounds, different glyphs. The model starts with a reasonable prior instead of random noise.
- **LoRA on the T3 backbone:** rank-32 adapters on the q/k/v/o projections of the Llama-based T3 module. ~7.8M trainable params (1.4% of 544M total). Everything else frozen: vocoder (S3Gen), speaker encoder, speech tokenizer.
- **Incremental language training:** added languages one at a time with weighted sampling. Started with Hindi-only (to validate the pipeline), then Telugu+Hindi, then Kannada+Telugu+Hindi, finally all 8 languages. This prevents catastrophic forgetting: Hindi CER actually improved after adding 7 new languages.
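The warm-start step above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the author's actual code: `WARM_START_MAP` and `warm_start_embeddings` are hypothetical names, and the character mapping shown is a tiny sample of what a full table would contain.

```python
import torch

# Illustrative mapping from new Indic characters to phonetically
# equivalent Devanagari characters already in the vocabulary.
WARM_START_MAP = {
    "క": "क",  # Telugu ka   <- Devanagari ka
    "ক": "क",  # Bengali ka  <- Devanagari ka
    "க": "க"[:0] or "क",  # Tamil ka <- Devanagari ka
}

def warm_start_embeddings(embed: torch.nn.Embedding,
                          old_vocab: dict[str, int],
                          new_tokens: list[str]) -> torch.nn.Embedding:
    """Grow the embedding table by len(new_tokens) rows.

    Rows for tokens found in WARM_START_MAP are copied from the mapped
    Devanagari token's embedding; unmapped tokens keep random init.
    """
    old_n, dim = embed.weight.shape
    grown = torch.nn.Embedding(old_n + len(new_tokens), dim)
    with torch.no_grad():
        grown.weight[:old_n] = embed.weight  # preserve existing rows
        for i, tok in enumerate(new_tokens):
            src = WARM_START_MAP.get(tok)
            if src is not None and src in old_vocab:
                grown.weight[old_n + i] = embed.weight[old_vocab[src]]
    return grown
```

The point of the copy is that the first gradient steps for "క" start from a vector the model already associates with the /ka/ sound, rather than from noise.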
Results
CER (Character Error Rate) via Whisper large-v3 ASR on 100 held-out samples per language:
| Language | CER | Notes |
|---|---|---|
| Hindi | 0.1058 | Improved from 0.29 baseline |
| Kannada | 0.1434 | |
| Tamil | 0.1608 | |
| Marathi | 0.1976 | |
| Gujarati | 0.2377 | |
| Bengali | 0.2450 | |
| Telugu | 0.2853 | |
| Malayalam | 0.8593 | Experimental — needs more data |
Malayalam struggles significantly. Likely needs more training data or a dedicated round. The rest produce intelligible, natural-sounding speech.
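For reference, CER as used here is character-level edit distance between the ASR transcript and the reference text, normalized by reference length. A minimal stdlib version (the post's actual eval pipeline around Whisper may differ):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    hyp = list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,              # deletion
                          curr[j - 1] + 1,          # insertion
                          prev[j - 1] + (r != h))   # substitution
        prev = curr
    return prev[len(hyp)] / max(len(reference), 1)
```

So a CER of 0.1058 for Hindi means roughly one character edit per ten reference characters, while Malayalam's 0.86 means most characters are wrong.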
What Didn't Work / Limitations

- **Malayalam:** CER 0.86 is essentially unintelligible. Possibly the script complexity (many conjuncts) or insufficient data.
- **No MOS evaluation yet:** CER tells you the words are right, not that it sounds natural. Subjective evaluation is pending.
- **2 speakers per language:** male + female from IndicTTS. Won't generalize to all voice types.
- **No code-mixing:** Hindi+English mixed sentences not specifically trained yet.
Links

- Model + audio samples: https://huggingface.co/reenigne314/chatterbox-indic-lora
- Article (full writeup): https://theatomsofai.substack.com/p/teaching-an-ai-to-speak-indian-languages
- Base model: [ResembleAI/chatterbox](https://github.com/resemble-ai/chatterbox) (MIT license)
Quick Start
```python
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load the base model with the Indic LoRA adapters applied
model = ChatterboxMultilingualTTS.from_indic_lora(device="cuda", speaker="te_female")

# Synthesize Telugu speech ("Hello, how are you?")
wav = model.generate("నమస్కారం, మీరు ఎలా ఉన్నారు?", language_id="te")
```
Training Details
- Hardware: 1x RTX PRO 6000 Blackwell (96GB)
- Data: SPRINGLab IndicTTS + ai4bharat Rasa
- 6 training rounds, incremental language addition
- LoRA rank 32, alpha 64, bf16
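The weighted sampling used in the incremental rounds can be sketched with the stdlib. The exact weights aren't stated in the post, so `new_weight` and `make_sampler` below are assumptions: the idea is simply that the newest language gets a larger share of draws while earlier languages stay in the mix to avoid forgetting.

```python
import random

def make_sampler(datasets: dict[str, list], new_lang: str,
                 new_weight: float = 0.5):
    """Yield (lang, sample) pairs. The newly added language receives
    `new_weight` of the draws; the remaining probability is split evenly
    across previously trained languages."""
    langs = list(datasets)
    old = [l for l in langs if l != new_lang]
    if not old:
        weights = [1.0] * len(langs)
    else:
        weights = [new_weight if l == new_lang else (1 - new_weight) / len(old)
                   for l in langs]
    while True:
        lang = random.choices(langs, weights=weights, k=1)[0]
        yield lang, random.choice(datasets[lang])
```

With, say, `new_weight=0.8` during the Kannada round, roughly 80% of batches would be Kannada and the rest split between Hindi and Telugu.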
Part 2 (technical deep-dive with code) coming this week. Happy to answer questions about the approach.