FireRedTTS-1

Basic Text-to-Speech System

FireRedTTS-1

First version released in September 2024, FireRedTTS is a basic text-to-speech system that supports zero-shot voice cloning and emotional voice generation.

Current Version

FireRedTTS-2

Latest FireRedTTS-2 features: Long dialogue voice generation, supporting 4-minute, 4-character dialogue.

View New Version

System Overview

FireRedTTS-1 is the first open-source text-to-speech system launched by the FireRed team of Xiaohongshu, built on large language model technology. The system can achieve high-quality speech synthesis, supporting zero-shot voice cloning and emotional voice generation functions.

Core Technical Features

Semantic-Aware Tokenizer

Uses a semantic-aware speech tokenizer (SAST) to compress speech signals into discrete tokens, improving synthesis quality

Two-Stage Generation

Text-to-semantic tokens to waveform two-stage generation system, ensuring high-quality output

Zero-Shot Cloning

Only a few seconds of reference audio is needed to mimic any voice and speaking style

Emotional Expression

Supports multiple emotional expressions and rich paralinguistic feature generation

Technical Architecture

FireRedTTS-1 adopts a language model-based basic TTS system architecture, mainly including three modules:

  • Speech Tokenizer: Converts speech signals into discrete semantic tokens
  • Text-to-Speech Language Model: Maps text tokens to semantic tokens
  • Token-to-Waveform Generator: Converts semantic tokens into final audio waveforms

Application Scenarios

Video Dubbing

Provides personalized speech synthesis for short videos, educational content, etc.

Virtual Assistant

Provides natural speech output for intelligent customer service, voice assistants, and other applications

Audiobooks

Quickly generates high-quality audiobook content

FireRedTTS-1 System Architecture