
Step-Audio is an open-source framework for intelligent voice interaction. This overview of the repository covers:
Basic Information
- Multi-language support: Chinese, English, and Japanese README documents are provided for users of different languages.
- Project links: Includes links to the technical report and to related Hugging Face models and datasets, providing easy access to additional resources.
Main Elements and Features
1. Core functions
Step-Audio is the first production-ready open-source framework for intelligent voice interaction that harmonizes speech understanding and generation, with the following functional features (a small illustrative sketch follows the list):
- Multilingual dialogue: Supports conversations in Chinese, English, Japanese, and other languages.
- Emotional tone: Expresses different emotional tones such as joy and sadness.
- Regional dialects: Supports dialects such as Cantonese and Sichuanese.
- Speech rate adjustment: Speaking speed can be adjusted on demand.
- Prosodic styles: Supports different styles such as rap.
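The controllable axes above can be pictured as a set of generation attributes. The dictionary below is purely illustrative; the attribute names are assumptions, not the repository's actual API:

```python
# Illustrative only: these attribute names are assumptions used to
# visualize the controllable axes, not Step-Audio's real interface.
voice_controls = {
    "language": "Chinese",    # multilingual dialogue (Chinese / English / Japanese ...)
    "emotion": "joyful",      # emotional tone (joy, sadness ...)
    "dialect": "Cantonese",   # regional dialect (Cantonese, Sichuanese ...)
    "speed": "fast",          # speech-rate adjustment
    "style": "rap",           # prosodic style
}
```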
2. Key technological innovations
- Multimodal model with 130 billion parameters
- A unified model that integrates comprehension and generation capabilities, performing tasks such as speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis.
- The 130-billion-parameter Step-Audio-Chat variant is open-sourced.
- Generative data engine
- Eliminates traditional text-to-speech (TTS) systems' dependence on manually collected data by generating high-quality audio with the 130-billion-parameter multimodal model.
- These data were used to train and release Step-Audio-TTS-3B, a resource-efficient TTS model with enhanced instruction-following capabilities for controllable speech synthesis (sketched below).
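Conceptually, such a data engine can be pictured as a synthesize-filter-train loop. The sketch below is only an assumption about the general shape of the pipeline; `big_model.synthesize` and `quality_filter` are hypothetical helpers, and the repository's real implementation may differ:

```python
# Conceptual sketch of a generative data engine; the helper objects are
# hypothetical, not the repository's actual components.
def build_tts_corpus(texts, big_model, quality_filter):
    """Synthesize (text, audio) pairs with the large model, keep good ones."""
    corpus = []
    for text in texts:
        audio = big_model.synthesize(text)   # 130B model generates speech
        if quality_filter(text, audio):      # retain only high-quality pairs
            corpus.append((text, audio))
    return corpus  # training data for the smaller Step-Audio-TTS-3B
```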
- Fine-grained voice control
- Instruction-based control design enables precise regulation, supporting a wide range of emotions (anger, joy, sadness, etc.), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming, etc.) to meet diverse voice-generation needs (see the example below).
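One way to picture instruction-based control is as tags embedded in the synthesis text. The tag syntax below is hypothetical, not the repository's documented format:

```python
# Hypothetical tag syntax for instruction-based control; the actual
# documented format may differ.
requests = [
    "(Cantonese)(happy) 今日天氣真好!",           # dialect + emotion
    "(rap) Step-Audio keeps the rhythm tight.",   # vocal style
    "(slow)(sad) I will miss you, old friend.",   # rate + emotion
]
```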
- Enhanced agent intelligence
- ToolCall mechanism integration and role-playing enhancements improve agent performance on complex tasks (an example payload follows).
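The following is a sketch of what a structured tool call could look like; the field names follow common function-calling schemas and are not Step-Audio's exact format:

```python
# Sketch of a structured tool call; field names are assumptions modeled
# on common function-calling schemas, not Step-Audio's exact format.
tool_call = {
    "name": "get_weather",                                 # hypothetical tool
    "arguments": {"city": "Shanghai", "unit": "celsius"},
}
# The agent emits such a call, an external tool executes it, and the
# result is fed back into the dialogue to be spoken in the reply.
```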
3. Model architecture
- Dual-codebook framework: Audio streams are tokenized through a dual-codebook framework combining parallel semantic (16.7 Hz, 1024-entry codebook) and acoustic (25 Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving, as sketched below.
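A minimal sketch of the 2:3 interleaving: since semantic tokens arrive at 16.7 Hz and acoustic tokens at 25 Hz, every two semantic tokens align with three acoustic tokens. The function name and stream representation here are illustrative only:

```python
# Minimal sketch of 2:3 temporal interleaving: semantic tokens at 16.7 Hz
# and acoustic tokens at 25 Hz, i.e. 2 semantic per 3 acoustic.
def interleave(semantic, acoustic):
    """Merge the streams as [s, s, a, a, a, s, s, a, a, a, ...]."""
    merged = []
    for i in range(0, len(semantic), 2):
        merged.extend(semantic[i:i + 2])                    # 2 semantic tokens
        merged.extend(acoustic[3 * i // 2:3 * i // 2 + 3])  # 3 acoustic tokens
    return merged

print(interleave(["s0", "s1", "s2", "s3"],
                 ["a0", "a1", "a2", "a3", "a4", "a5"]))
# ['s0', 's1', 'a0', 'a1', 'a2', 's2', 's3', 'a3', 'a4', 'a5']
```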
- Language model: Step-1, a pre-trained 130-billion-parameter text-based large language model (LLM), undergoes continued audio pretraining so that Step-Audio can process speech information efficiently and achieve accurate speech-text alignment.
- Speech decoder: Converts discrete speech tokens containing semantic and acoustic information into continuous time-domain waveforms of natural speech. The decoder architecture combines a flow-matching model with a mel-to-waveform vocoder, trained using a dual-code interleaving approach to optimize the intelligibility and naturalness of the synthesized speech. A high-level sketch follows.
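A high-level sketch of the two decoder stages; `flow_matching_model` and `vocoder` are placeholders, not the repository's actual module names:

```python
# Two-stage decoding sketch; both callables are placeholders, not the
# repository's actual modules.
def decode_speech(tokens, flow_matching_model, vocoder):
    mel = flow_matching_model(tokens)  # discrete tokens -> mel spectrogram
    return vocoder(mel)                # mel spectrogram -> time-domain waveform
```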
- Real-time inference pipeline: An optimized inference pipeline is designed around a core Controller module that manages state transitions, coordinates speculative response generation, and ensures seamless coordination between key subsystems. These subsystems include Voice Activity Detection (VAD) for detecting the user's voice, a streaming audio tokenizer for real-time audio processing, the Step-Audio language model and speech decoder for processing and generating responses, and a context manager for maintaining dialogue continuity. A simplified control-flow sketch follows.
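A simplified sketch of how the Controller might coordinate these subsystems; every component interface here is assumed for illustration:

```python
# Simplified control-flow sketch of the real-time pipeline; all component
# interfaces are assumed for illustration.
def controller_loop(vad, tokenizer, model, decoder, context):
    while True:
        segment = vad.next_speech_segment()   # wait for user speech (VAD)
        tokens = tokenizer.stream(segment)    # streaming audio tokenization
        context.append_user(tokens)           # context manager keeps history
        reply = model.generate(context)       # LLM produces response tokens
        yield decoder.synthesize(reply)       # speech decoder renders audio
```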
Repository Structure
The repository contains the following main folders and files:
- Dockerfile and Dockerfile-vllm: Files used to build the Docker images.
- README.md, README_CN.md, README_JP.md: Documentation describing the project, including a project overview, a model summary, and usage instructions.
- requirements.txt and requirements-vllm.txt: Dependency files listing the Python packages needed to run the project.
- assets: Project asset files such as images and PDF documents.
- examples: Example code or data.
- funasr_detach: Likely contains code for speech-related functions.
- speakers: Prompt audio files and speaker information.
- cosyvoice: May contain additional speech-related resources.
Model Download and Use
- Model Download: Download links are provided on both Hugging Face and ModelScope for Step-Audio-Tokenizer, Step-Audio-Chat, and Step-Audio-TTS-3B (one possible way to fetch them is sketched below).
- Model Use: The documentation describes the requirements for running the Step-Audio models, such as the minimum GPU memory needed for each model.
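One possible way to fetch the released checkpoints with the huggingface_hub library; the repo ids below are assumptions inferred from the model names, so verify them against the README's download links:

```python
# Fetch the checkpoints via huggingface_hub; the repo ids are assumptions
# inferred from the model names, so verify against the README's links.
from huggingface_hub import snapshot_download

for repo_id in (
    "stepfun-ai/Step-Audio-Tokenizer",
    "stepfun-ai/Step-Audio-Chat",    # largest model, needs the most GPU memory
    "stepfun-ai/Step-Audio-TTS-3B",
):
    snapshot_download(repo_id, local_dir=repo_id.split("/")[-1])
```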
The Step-Audio repository provides a comprehensive and powerful framework for intelligent voice interaction and is a valuable open-source project for both researchers and developers.