FireRedTTS-2

Long-form Streaming Text-to-Speech System for Multi-speaker Dialogue Generation

System Overview

FireRedTTS-2 is the second-generation text-to-speech system launched by the FireRed team of Xiaohongshu, designed specifically for multi-speaker dialogue generation. The system provides stable and natural voice output while achieving reliable speaker switching and context-aware prosody control.

Feature Introduction Screenshots

FireRedTTS-2 Feature Demo

Core Improvements

Long-form Streaming Synthesis

Supports streaming speech synthesis of long-form content to reduce latency and improve user experience

Multi-speaker Dialogue

Optimized for multi-speaker dialogue scenarios to achieve natural speaker switching

Context-aware Prosody

Automatically adjusts voice prosody according to dialogue context to make output more natural

Enhanced Stability

Improved system architecture ensures long-term operational stability and consistency

Technical Advantages

  • Streaming Decoder: Supports real-time streaming speech synthesis, suitable for dialogue systems
  • Speaker Embedding Optimization: Improved speaker representation method for more reliable speaker switching

Application Scenarios

AI Podcast Production

Automatically generate multi-character dialogue podcast content, supporting different character voices

Virtual Meetings

Provide multi-speaker speech synthesis capabilities for virtual meeting systems

  • Context Modeling: Enhanced context understanding capabilities to generate more context-appropriate speech
  • End-to-End Training: Complete end-to-end training process, simplifying deployment and usage

Dialogue Systems

Provide more natural dialogue speech for chatbots and virtual assistants

Audioplay Production

Quickly generate multi-character audioplay content, improving production efficiency

Comparison with FireRedTTS-1

Feature FireRedTTS-1 FireRedTTS-2
Main Application Scenario Single-speaker voice synthesis Multi-speaker dialogue generation
Synthesis Method Batch processing synthesis Streaming synthesis
Speaker Switching Basic support Optimized support
Context Awareness Limited support Deep support
Long-form Content Processing Segmented processing Continuous streaming processing