I built a real-time pipeline that reads game subtitles and converts them into dynamic voice acting (OCR → TTS → RVC) [P]

I've been experimenting with real-time pipelines that combine OCR + TTS + voice conversion, and I ended up building a desktop app that can "voice" game subtitles dynamically.

The idea is simple: - Capture subtitles from screen (OCR) - Convert them into speech (TTS) - Transform the voice per character (RVC)

But the hard parts were: - Avoiding repeated subtitle spam (similarity filtering) - Keeping latency low (~0.3s) - Handling multiple characters with different voice models without reloading - Running everything in a smooth pipeline (no audio gaps)

One thing that helped a lot was using a two-stage pipeline: While one sentence is playing, the next one is already processed in the background.

I also experimented with: - Emotion-based voice changes - Real-time translation (EN → TR) - Audio ducking (lowering game sound during speech)

I'm curious: How would you approach reducing latency further in a multi-model setup like this? Or is there a better alternative to RVC for real-time character voice conversion?

Happy to share more technical details if anyone is interested.

submitted by /u/fqtih0
[link] [comments]

I built a real-time pipeline that reads game subtitles and converts them into dynamic voice acting (OCR → TTS → RVC) [P]

Want to read more?

Tagged with