I built a real-time pipeline that reads game subtitles and converts them into dynamic voice acting (OCR → TTS → RVC) [P]
I've been experimenting with real-time pipelines that combine OCR + TTS + voice conversion, and I ended up building a desktop app that can "voice" game subtitles dynamically.
The idea is simple:

- Capture subtitles from the screen (OCR)
- Convert them into speech (TTS)
- Transform the voice per character (RVC)
But the hard parts were:

- Avoiding repeated subtitle spam (similarity filtering)
- Keeping latency low (~0.3s)
- Handling multiple characters with different voice models without reloading
- Running everything in a smooth pipeline (no audio gaps)
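For anyone wondering what similarity filtering can look like, here is a minimal sketch. It is not the app's actual code: the `SubtitleFilter` class, the 0.85 threshold, and the choice of `difflib.SequenceMatcher` are all assumptions for illustration. The idea is that OCR re-reads the same subtitle on every captured frame, often with small character errors, so each new read is compared against the last accepted line and dropped if it is nearly identical.

```python
from difflib import SequenceMatcher


class SubtitleFilter:
    """Drops OCR reads that are near-duplicates of the last accepted subtitle.

    Hypothetical helper for illustration; the threshold is an assumed value
    that would need tuning against real OCR noise.
    """

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.last_text = ""

    def is_new(self, text: str) -> bool:
        # Normalize case and whitespace so trivial OCR jitter does not count.
        normalized = " ".join(text.lower().split())
        if not normalized:
            return False
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
        ratio = SequenceMatcher(None, self.last_text, normalized).ratio()
        if ratio >= self.threshold:
            return False
        self.last_text = normalized
        return True
```

Only lines where `is_new` returns `True` would be sent on to TTS. One trade-off of comparing against just the previous line: a character who genuinely says the same line twice in a row gets spoken only once, so a timeout that resets `last_text` may be worth adding.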
One thing that helped a lot was a two-stage pipeline: while one sentence is playing, the next one is already being processed in the background.
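A rough sketch of that two-stage idea, assuming a background thread and a small bounded queue. The `run_pipeline` function and its `synthesize` and `play` callables are hypothetical stand-ins for the TTS + RVC step and the audio output, not names from the app.

```python
import queue
import threading
from typing import Any, Callable, Iterable


def run_pipeline(
    sentences: Iterable[str],
    synthesize: Callable[[str], Any],  # assumed: text -> audio clip (TTS + RVC)
    play: Callable[[Any], None],       # assumed: blocks until the clip finishes
    prefetch: int = 1,
) -> None:
    """Play clips in order while the next one is synthesized in the background."""
    ready: queue.Queue = queue.Queue(maxsize=prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        try:
            for sentence in sentences:
                # put() blocks when the buffer is full, so synthesis stays
                # at most `prefetch` clips ahead of playback.
                ready.put(synthesize(sentence))
        finally:
            # Always signal completion, even if synthesis raised.
            ready.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    while True:
        clip = ready.get()
        if clip is done:
            break
        play(clip)
    worker.join()
```

With `prefetch=1`, one finished clip waits in the queue while another is being synthesized, which hides synthesis time as long as it is shorter than playback. The bounded queue also keeps the voice from running far ahead of the subtitles on screen.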
I also experimented with:

- Emotion-based voice changes
- Real-time translation (EN → TR)
- Audio ducking (lowering game sound during speech)
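To make the ducking item concrete, here is a small sketch of one common way to do it: ramp the game-audio gain toward a lower target while speech is active, then ramp it back up, instead of switching abruptly. The `duck_gains` function, the 0.3 duck level, and the per-frame step are illustrative assumptions, and actually applying the gain to the game's audio stream is platform-specific and not shown.

```python
from typing import Iterable, List


def duck_gains(
    speech_active: Iterable[bool],
    duck_gain: float = 0.3,  # assumed game volume while speech plays
    step: float = 0.1,       # assumed maximum gain change per frame
) -> List[float]:
    """Return one game-audio gain per frame, ramping smoothly toward the target."""
    gains = []
    gain = 1.0  # start at full game volume
    for active in speech_active:
        target = duck_gain if active else 1.0
        if gain < target:
            gain = min(target, gain + step)  # release: fade game audio back in
        else:
            gain = max(target, gain - step)  # attack: fade game audio down
        gains.append(round(gain, 3))
    return gains
```

For example, `duck_gains([False, True, True, False], duck_gain=0.5, step=0.25)` returns `[1.0, 0.75, 0.5, 0.75]`: the gain steps down while speech is active and starts recovering as soon as it stops.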
I'm curious: How would you approach reducing latency further in a multi-model setup like this? Or is there a better alternative to RVC for real-time character voice conversion?
Happy to share more technical details if anyone is interested.