FireRedTTS-2

Long-form Streaming Text-to-Speech System for Multi-speaker Dialogue Generation

System Overview

FireRedTTS-2 is the second-generation text-to-speech system from the FireRed team at Xiaohongshu, designed specifically for multi-speaker dialogue generation. It produces stable, natural speech with reliable speaker switching and context-aware prosody control.

Core Improvements

Long-form Streaming Synthesis

Supports streaming synthesis of long-form content, which reduces latency and improves the user experience.
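
The idea of streaming synthesis can be illustrated with a minimal Python sketch. It is not the FireRedTTS-2 implementation: `fake_synthesize` is a hypothetical stand-in that returns silence, the sample rate is an assumed value, and sentence-level chunking is only one possible granularity. The point is that a generator hands audio to the caller chunk by chunk, so playback can start before the whole text has been processed.

```python
import re
from typing import Iterator, List

SAMPLE_RATE = 24_000  # assumed output sample rate, not taken from the source


def split_sentences(text: str) -> List[str]:
    """Split long-form text into sentence-sized pieces after ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_synthesize(sentence: str) -> List[float]:
    """Hypothetical stand-in for a TTS model: 50 ms of silence per word."""
    n_words = len(sentence.split())
    return [0.0] * (n_words * SAMPLE_RATE // 20)


def stream_speech(text: str) -> Iterator[List[float]]:
    """Yield audio one chunk at a time instead of returning one long buffer."""
    for sentence in split_sentences(text):
        yield fake_synthesize(sentence)


if __name__ == "__main__":
    text = "Hello there. How are you today? Fine."
    for i, chunk in enumerate(stream_speech(text)):
        # A real player would start output here, while later chunks are pending.
        print(f"chunk {i}: {len(chunk) / SAMPLE_RATE:.2f} s of audio")
```

Because `stream_speech` is a generator, the first chunk is available as soon as the first sentence is synthesized, which is what keeps perceived latency low for long inputs.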

Multi-speaker Dialogue

Optimized for multi-speaker dialogue, with natural switching between speakers.
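
As a rough illustration of what a multi-speaker input might look like to a caller, the sketch below splits a tagged script into ordered speaker turns. The `[S1]`/`[S2]` tag convention, the `Turn` type, and both helper functions are assumptions made for this example, not a description of the system's actual input format.

```python
import re
from typing import List, NamedTuple


class Turn(NamedTuple):
    speaker: str
    text: str


# Assumed tag convention: each turn starts with a marker such as [S1] or [S2].
TAG = re.compile(r"\[(S\d+)\]")


def parse_dialogue(script: str) -> List[Turn]:
    """Split a tagged script into (speaker, text) turns, preserving order."""
    # With one capture group, re.split returns [prefix, tag, text, tag, text, ...]
    pieces = TAG.split(script)
    turns = []
    for speaker, text in zip(pieces[1::2], pieces[2::2]):
        text = text.strip()
        if text:  # skip tags that are not followed by any text
            turns.append(Turn(speaker, text))
    return turns


def count_switches(turns: List[Turn]) -> int:
    """Count how often the speaker changes between consecutive turns."""
    return sum(1 for a, b in zip(turns, turns[1:]) if a.speaker != b.speaker)


if __name__ == "__main__":
    script = "[S1]Hi there. [S2]Hello! [S1]How are you?"
    for turn in parse_dialogue(script):
        print(turn.speaker, "->", turn.text)
```

Each speaker switch in the parsed turns marks a point where a synthesizer would need to change voice while keeping the dialogue continuous.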

Context-aware Prosody

Automatically adjusts prosody to the dialogue context, making the output sound more natural.
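
The source does not describe how dialogue context is represented internally. Purely as an illustration of the general idea, the sketch below shows how a caller might keep a bounded window of recent turns and pass it along with each new turn; the `DialogueContext` class and its `max_turns` parameter are hypothetical names introduced for this example.

```python
from collections import deque
from typing import Deque, List, Tuple

TurnPair = Tuple[str, str]  # (speaker, text)


class DialogueContext:
    """Keep the most recent turns so each new turn can be conditioned on them."""

    def __init__(self, max_turns: int = 3) -> None:
        # A deque with maxlen discards the oldest turn automatically.
        self._history: Deque[TurnPair] = deque(maxlen=max_turns)

    def build_input(self, speaker: str, text: str) -> List[TurnPair]:
        """Return the prior turns plus the current one, then record the current turn."""
        model_input = list(self._history) + [(speaker, text)]
        self._history.append((speaker, text))
        return model_input


if __name__ == "__main__":
    ctx = DialogueContext(max_turns=2)
    for speaker, text in [("S1", "Hi"), ("S2", "Hello"), ("S1", "Bye"), ("S2", "Ok")]:
        print(ctx.build_input(speaker, text))
```

Bounding the window keeps the input size predictable for long dialogues, at the cost of forgetting turns older than `max_turns`.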

Enhanced Stability

The improved system architecture provides stable, consistent operation over long periods.

Technical Advantages

Streaming Decoder: Supports real-time streaming speech synthesis, suitable for dialogue systems
Speaker Embedding Optimization: Improved speaker representation method for more reliable speaker switching
Context Modeling: Enhanced context understanding capabilities to generate more context-appropriate speech
End-to-End Training: Complete end-to-end training process, simplifying deployment and usage

Application Scenarios

AI Podcast Production

Automatically generate multi-character dialogue podcast content, supporting different character voices

Virtual Meetings

Provide multi-speaker speech synthesis capabilities for virtual meeting systems

Dialogue Systems

Provide more natural dialogue speech for chatbots and virtual assistants

Audio Drama Production

Quickly generate multi-character audio drama content, improving production efficiency

Comparison with FireRedTTS-1

Feature	FireRedTTS-1	FireRedTTS-2
Main Application Scenario	Single-speaker voice synthesis	Multi-speaker dialogue generation
Synthesis Method	Batch processing synthesis	Streaming synthesis
Speaker Switching	Basic support	Optimized support
Context Awareness	Limited support	Deep support
Long-form Content Processing	Segmented processing	Continuous streaming processing