Step-Audio: Intelligent Voice Interaction in Multiple Languages and Styles

Step-Audio is an open-source framework for intelligent voice interaction. This overview covers:

Basic Information

  • Multi-language documentation: README files are provided in Chinese, English, and Japanese for users of different languages.
  • Project links: Links to the technical report and to the related models and datasets on Hugging Face provide easy access to additional resources.

Main components and features

1. Core functions

Step-Audio is the first production-ready open-source framework for intelligent voice interaction that unifies speech understanding and generation. Its capabilities include:

  • Multilingual dialogue: supports conversations in Chinese, English, Japanese, and other languages.
  • Emotional tone: can express different emotions such as joy and sadness.
  • Regional dialects: supports dialects such as Cantonese and Sichuanese.
  • Speech-rate adjustment: speaking speed can be controlled.
  • Prosodic styles: supports styles such as rap.

2. Key technological innovations

  • Multimodal model with 130 billion parameters
    • A unified model that integrates comprehension and generation, performing speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis.
    • The 130-billion-parameter Step-Audio-Chat variant is open-sourced.
  • Generative data engine
    • Removes traditional text-to-speech (TTS) systems' dependence on manual data collection by generating high-quality audio with the 130-billion-parameter multimodal model.
    • A resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following for controllable speech synthesis was trained on this data and released.
  • Fine-grained voice control
    • Instruction-based control enables precise regulation of a wide range of emotions (anger, joy, sadness, etc.), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming, etc.) to meet diverse voice-generation needs.
  • Enhanced agent capabilities
    • ToolCall integration and role-playing enhancements improve agent performance on complex tasks.
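The ToolCall mechanism mentioned above can be sketched as a simple dispatch step: the model either emits plain dialogue text or a structured tool call that the runtime executes. This is an illustrative sketch only; the JSON call format, the `handle_model_output` helper, and the `get_weather` tool are all hypothetical and not the actual Step-Audio API.

```python
import json

# Hypothetical tool registry; a real agent would register actual functions.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def handle_model_output(output: str) -> str:
    """If the model emits a JSON tool call, run the tool; otherwise
    treat the output as a plain text reply and pass it through."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return output  # ordinary dialogue text, no tool call
    if not isinstance(call, dict):
        return output
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        return output  # unknown tool: fall back to the raw output
    return tool(**call.get("arguments", {}))
```

In a full agent loop, the tool result would be fed back to the model so it can produce the final spoken response.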

3. Model architecture

  • Dual-codebook framework: audio streams are tokenized with a dual-codebook scheme combining parallel semantic (16.7 Hz, 1024-entry codebook) and acoustic (25 Hz, 4096-entry codebook) tokenizers, interleaved 2:3 in time.
  • Language model: Step-1, a 130-billion-parameter pre-trained text-based Large Language Model (LLM), is further pre-trained on audio so that Step-Audio can process speech information efficiently and achieve accurate speech-to-text alignment.
  • Speech decoder: converts discrete speech tokens carrying semantic and acoustic information into continuous time-domain waveforms of natural speech. The decoder combines a flow-matching model with a mel-to-waveform vocoder, trained with the dual-codebook interleaving approach to optimize the intelligibility and naturalness of the synthesized speech.
  • Real-time inference pipeline: an optimized pipeline whose central Controller module manages state transitions, coordinates speculative response generation, and keeps the key subsystems in sync. These subsystems include Voice Activity Detection (VAD) for detecting user speech, a streaming audio tokenizer for real-time audio processing, the Step-Audio language model and speech decoder for generating responses, and a context manager for maintaining dialogue continuity.
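The 2:3 temporal interleaving of the dual-codebook tokenizer can be illustrated with a short sketch: at 16.7 Hz and 25 Hz, a 120 ms window contains 2 semantic and 3 acoustic tokens, so the streams merge in alternating groups of 2 and 3. The function name and list-based representation are illustrative, not the repository's actual implementation.

```python
def interleave_2_3(semantic, acoustic):
    """Merge semantic (16.7 Hz) and acoustic (25 Hz) token streams by
    alternating 2 semantic tokens with 3 acoustic tokens, matching the
    per-120-ms token counts of the two codebooks."""
    merged = []
    s = a = 0
    while s < len(semantic) and a < len(acoustic):
        merged.extend(semantic[s:s + 2])   # 2 semantic tokens (~120 ms)
        merged.extend(acoustic[a:a + 3])   # 3 acoustic tokens (~120 ms)
        s += 2
        a += 3
    return merged

# Example: two 120 ms windows' worth of tokens
print(interleave_2_3([1, 2, 3, 4], [10, 20, 30, 40, 50, 60]))
# → [1, 2, 10, 20, 30, 3, 4, 40, 50, 60]
```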

Repository structure

The repository contains the following main folders and files:

  • Dockerfile and Dockerfile-vllm: files used to build the Docker images.
  • README.md, README_CN.md, README_JP.md: project documentation covering the project description, model summary, and usage instructions.
  • requirements.txt and requirements-vllm.txt: dependency files listing the Python packages needed to run the project.
  • assets: Stores the project's asset files, such as images, PDF documents, etc.
  • examples: Stores example code or data.
  • funasr_detach: likely a vendored copy of the FunASR toolkit for speech-recognition functionality.
  • speakers: Stores voice-related prompt audio files and speaker information.
  • cosyvoice: likely vendored speech-synthesis components related to CosyVoice.

Model Download and Use

  • Model Download: Provides links to download models for both Hugging Face and Modelscope, including Step-Audio-Tokenizer, Step-Audio-Chat, and Step-Audio-TTS-3B.
  • Model Use: The documentation lists the requirements for running Step-Audio models, such as the minimum GPU memory needed by each model.
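As a rough sanity check on those GPU-memory figures, the weight footprint of a model can be estimated from its parameter count and precision. This is a back-of-envelope estimate covering weights only, at 2 bytes per parameter (bf16/fp16); actual requirements are higher once activations and the KV cache are included, so the official README numbers remain authoritative. The helper function here is illustrative, not part of the repository.

```python
def weights_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Lower bound on GPU memory (GiB) needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

# The 130B Step-Audio-Chat weights alone take roughly 242 GiB at bf16,
# while the 3B TTS model's weights take about 5.6 GiB.
```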

The Step-Audio repository provides a comprehensive, powerful framework for intelligent voice interaction and is a valuable open-source project for researchers and developers alike.
