Skip to content

A comprehensive ComfyUI integration for Microsoft's VibeVoice text-to-speech model, enabling high-quality single and multi-speaker voice synthesis directly within your ComfyUI workflows.

License

Notifications You must be signed in to change notification settings

Enemyx-net/VibeVoice-ComfyUI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VibeVoice ComfyUI Nodes

A comprehensive ComfyUI integration for Microsoft's VibeVoice text-to-speech model, enabling high-quality single and multi-speaker voice synthesis directly within your ComfyUI workflows.

Features

Core Functionality

  • 🎤 Single Speaker TTS: Generate natural speech with optional voice cloning
  • 👥 Multi-Speaker Conversations: Support for up to 4 distinct speakers
  • 🎯 Voice Cloning: Clone voices from audio samples
  • 📝 Text File Loading: Load scripts from text files
  • 📚 Automatic Text Chunking: Handles long texts seamlessly with configurable chunk size
  • 🔄 Node Chaining: Connect multiple VibeVoice nodes for complex workflows
  • ⏹️ Interruption Support: Cancel operations before or between generations

Model Options

  • 🚀 Three Model Variants:
    • VibeVoice 1.5B (faster, lower memory)
    • VibeVoice-Large (best quality, ~17GB VRAM)
    • VibeVoice-Large-Quant-4Bit (balanced, ~7GB VRAM)
  • đź”§ Flexible Configuration: Control temperature, sampling, and guidance scale

Performance & Optimization

  • ⚡ Attention Mechanisms: Choose between auto, eager, sdpa, flash_attention_2 or sage
  • 🎛️ Diffusion Steps: Adjustable quality vs speed trade-off (default: 20)
  • đź’ľ Memory Management: Toggle automatic VRAM cleanup after generation
  • đź§ą Free Memory Node: Manual memory control for complex workflows
  • 🍎 Apple Silicon Support: Native GPU acceleration on M1/M2/M3 Macs via MPS
  • 🔢 4-Bit Quantization: Reduced memory usage with minimal quality loss

Compatibility & Installation

  • 📦 Self-Contained: Embedded VibeVoice code, no external dependencies
  • 🔄 Universal Compatibility: Adaptive support for transformers v4.51.3+
  • 🖥️ Cross-Platform: Works on Windows, Linux, and macOS
  • 🎮 Multi-Backend: Supports CUDA, CPU, and MPS (Apple Silicon)

Video Demo

VibeVoice ComfyUI Wrapper Demo
Click to watch the demo video

Installation

Automatic Installation (Recommended)

  1. Clone this repository into your ComfyUI custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI
  1. Restart ComfyUI - the nodes will automatically install requirements on first use

Available Nodes

1. VibeVoice Load Text From File

Loads text content from files in ComfyUI's input/output/temp directories.

  • Supported formats: .txt
  • Output: Text string for TTS nodes

2. VibeVoice Single Speaker

Generates speech from text using a single voice.

  • Text Input: Direct text or connection from Load Text node
  • Models: VibeVoice-1.5B or VibeVoice-Large
  • Voice Cloning: Optional audio input for voice cloning
  • Parameters (in order):
    • text: Input text to convert to speech
    • model: VibeVoice-1.5B, VibeVoice-Large or VibeVoice-Large-Quant-4Bit
    • attention_type: auto, eager, sdpa, flash_attention_2 or sage (default: auto)
    • free_memory_after_generate: Free VRAM after generation (default: True)
    • diffusion_steps: Number of denoising steps (5-100, default: 20)
    • seed: Random seed for reproducibility (default: 42)
    • cfg_scale: Classifier-free guidance (1.0-2.0, default: 1.3)
    • use_sampling: Enable/disable deterministic generation (default: False)
  • Optional Parameters:
    • voice_to_clone: Audio input for voice cloning
    • temperature: Sampling temperature (0.1-2.0, default: 0.95)
    • top_p: Nucleus sampling parameter (0.1-1.0, default: 0.95)
    • max_words_per_chunk: Maximum words per chunk for long texts (100-500, default: 250)

3. VibeVoice Multiple Speakers

Generates multi-speaker conversations with distinct voices.

  • Speaker Format: Use [N]: notation where N is 1-4
  • Voice Assignment: Optional voice samples for each speaker
  • Recommended Model: VibeVoice-Large for better multi-speaker quality
  • Parameters (in order):
    • text: Input text with speaker labels
    • model: VibeVoice-1.5B, VibeVoice-Large or VibeVoice-Large-Quant-4Bit
    • attention_type: auto, eager, sdpa, flash_attention_2 or sage (default: auto)
    • free_memory_after_generate: Free VRAM after generation (default: True)
    • diffusion_steps: Number of denoising steps (5-100, default: 20)
    • seed: Random seed for reproducibility (default: 42)
    • cfg_scale: Classifier-free guidance (1.0-2.0, default: 1.3)
    • use_sampling: Enable/disable deterministic generation (default: False)
  • Optional Parameters:
    • speaker1_voice to speaker4_voice: Audio inputs for voice cloning
    • temperature: Sampling temperature (0.1-2.0, default: 0.95)
    • top_p: Nucleus sampling parameter (0.1-1.0, default: 0.95)

4. VibeVoice Free Memory

Manually frees all loaded VibeVoice models from memory.

  • Input: audio - Connect audio output to trigger memory cleanup
  • Output: audio - Passes through the input audio unchanged
  • Use Case: Insert between nodes to free VRAM/RAM at specific workflow points
  • Example: [VibeVoice Node] → [Free Memory] → [Save Audio]

Multi-Speaker Text Format

For multi-speaker generation, format your text using the [N]: notation:

[1]: Hello, how are you today?
[2]: I'm doing great, thanks for asking!
[1]: That's wonderful to hear.
[3]: Hey everyone, mind if I join the conversation?
[2]: Not at all, welcome!

Important Notes:

  • Use [1]:, [2]:, [3]:, [4]: for speaker labels
  • Maximum 4 speakers supported
  • The system automatically detects the number of speakers from your text
  • Each speaker can have an optional voice sample for cloning

Model Information

VibeVoice-1.5B

  • Size: ~5GB download
  • Speed: Faster inference
  • Quality: Good for single speaker
  • Use Case: Quick prototyping, single voices

VibeVoice-Large

  • Size: ~17GB download
  • Speed: Slower inference but optimized
  • Quality: Best available quality
  • Use Case: Highest quality production, multi-speaker conversations
  • Note: Latest official release from Microsoft

VibeVoice-Large-Quant-4Bit

  • Size: ~7GB download
  • Speed: Balanced inference
  • Quality: Good quality
  • Use Case: Good quality production with less VRAM, multi-speaker conversations
  • Note: Quantized by DevParker

Models are automatically downloaded on first use and cached in ComfyUI/models/vibevoice/.

Generation Modes

Deterministic Mode (Default)

  • use_sampling = False
  • Produces consistent, stable output
  • Recommended for production use

Sampling Mode

  • use_sampling = True
  • More variation in output
  • Uses temperature and top_p parameters
  • Good for creative exploration

Voice Cloning

To clone a voice:

  1. Connect an audio node to the voice_to_clone input (single speaker)
  2. Or connect to speaker1_voice, speaker2_voice, etc. (multi-speaker)
  3. The model will attempt to match the voice characteristics

Requirements for voice samples:

  • Clear audio with minimal background noise
  • Minimum 3–10 seconds. Recommended at least 30 seconds for better quality
  • Automatically resampled to 24kHz

Tips for Best Results

  1. Text Preparation:

    • Use proper punctuation for natural pauses
    • Break long texts into paragraphs
    • For multi-speaker, ensure clear speaker transitions
  2. Model Selection:

    • Use 1.5B for quick single-speaker tasks (fastest, ~8GB VRAM)
    • Use Large for best quality (~16GB VRAM)
    • Use Large for best quality and multi-speaker (~16GB VRAM)
  3. Seed Management:

    • Default seed (42) works well for most cases
    • Save good seeds for consistent character voices
    • Try random seeds if default doesn't work well
  4. Performance:

    • First run downloads models (5-17GB)
    • Subsequent runs use cached models
    • GPU recommended for faster inference

System Requirements

Hardware

  • Minimum: 8GB VRAM for VibeVoice-1.5B
  • Recommended: 17GB+ VRAM for VibeVoice-Large
  • RAM: 16GB+ system memory

Software

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA 11.8+ (for GPU acceleration)
  • Transformers 4.51.3+
  • ComfyUI (latest version)

Troubleshooting

Installation Issues

  • Ensure you're using ComfyUI's Python environment
  • Try manual installation if automatic fails
  • Restart ComfyUI after installation

Generation Issues

  • If voices sound unstable, try deterministic mode
  • For multi-speaker, ensure text has proper [N]: format
  • Check that speaker numbers are sequential (1,2,3 not 1,3,5)

Memory Issues

  • Large model requires ~16GB VRAM
  • Use 1.5B model for lower VRAM systems
  • Models use bfloat16 precision for efficiency

Examples

Single Speaker

Text: "Welcome to our presentation. Today we'll explore the fascinating world of artificial intelligence."
Model: VibeVoice-1.5B
cfg_scale: 1.3
use_sampling: False

Two Speakers

[1]: Have you seen the new AI developments?
[2]: Yes, they're quite impressive!
[1]: I think voice synthesis has come a long way.
[2]: Absolutely, it sounds so natural now.

Four Speaker Conversation

[1]: Welcome everyone to our meeting.
[2]: Thanks for having us!
[3]: Glad to be here.
[4]: Looking forward to the discussion.
[1]: Let's begin with the agenda.

Performance Benchmarks

Model VRAM Usage Context Length Max Audio Duration
VibeVoice-1.5B ~8GB 64K tokens ~90 minutes
VibeVoice-Large ~17GB 32K tokens ~45 minutes
VibeVoice-Large-Quant-4Bit ~7GB 32K tokens ~45 minutes

Known Limitations

  • Maximum 4 speakers in multi-speaker mode
  • Works best with English and Chinese text
  • Some seeds may produce unstable output
  • Background music generation cannot be directly controlled

License

This ComfyUI wrapper is released under the MIT License. See LICENSE file for details.

Note: The VibeVoice model itself is subject to Microsoft's licensing terms:

  • VibeVoice is for research purposes only
  • Check Microsoft's VibeVoice repository for full model license details

Links

Credits

  • VibeVoice Model: Microsoft Research
  • ComfyUI Integration: Fabio Sarracino
  • Base Model: Built on Qwen2.5 architecture

Support

For issues or questions:

  1. Check the troubleshooting section
  2. Review ComfyUI logs for error messages
  3. Ensure VibeVoice is properly installed
  4. Open an issue with detailed error information

Contributing

Contributions welcome! Please:

  1. Test changes thoroughly
  2. Follow existing code style
  3. Update documentation as needed
  4. Submit pull requests with clear descriptions

Changelog

Version 1.2.4

  • Added automatic text chunking for long texts in Single Speaker node
    • Single Speaker node now automatically splits texts longer than 250 words to prevent audio acceleration issues
    • New optional parameter max_words_per_chunk (range: 100-500 words, default: 250)
    • Maintains consistent voice characteristics across all chunks using the same seed
    • Seamlessly concatenates audio chunks for smooth, natural output

Version 1.2.3

  • Added SageAttention support for inference speedup
    • New attention option "sage" using quantized attention (INT8/FP8) for faster generation
    • Requirements: NVIDIA GPU with CUDA and sageattention library installation

Version 1.2.2

  • Added 4-bit quantized model support
    • New model in menu: VibeVoice-Large-Quant-4Bit using ~7GB VRAM instead of ~17GB
    • Requirements: NVIDIA GPU with CUDA and bitsandbytes library installed

Version 1.2.1

  • Bug Fixing

Version 1.2.0

  • MPS Support for Apple Silicon:
    • Added GPU acceleration support for Mac with Apple Silicon (M1/M2/M3)
    • Automatically detects and uses MPS backend when available, providing significant performance improvements over CPU

Version 1.1.1

  • Universal Transformers Compatibility:
    • Implemented adaptive system that automatically adjusts to different transformers versions
    • Guaranteed compatibility from v4.51.3 onwards
    • Auto-detects and adapts to API changes between versions

Version 1.1.0

  • Updated the URL for downloading the VibeVoice-Large model
  • Removed VibeVoice-Large-Preview deprecated model

Version 1.0.9

  • Embedded VibeVoice code directly into the wrapper
    • Added vvembed folder containing the complete VibeVoice code (MIT licensed)
    • No longer requires external VibeVoice installation
    • Ensures continued functionality for all users

Version 1.0.8

  • BFloat16 Compatibility Fix
    • Fixed tensor type compatibility issues with audio processing nodes
    • Input audio tensors are now converted from BFloat16 to Float32 for numpy compatibility
    • Output audio tensors are explicitly converted to Float32 to ensure compatibility with downstream nodes
    • Resolves "Got unsupported ScalarType BFloat16" errors when using voice cloning or saving audio

Version 1.0.7

  • Added interruption handler to detect user's cancel request
  • Bug fixing

Version 1.0.6

  • Fixed a bug that prevented VibeVoice nodes from receiving audio directly from another VibeVoice node

Version 1.0.5

  • Added support for Microsoft's official VibeVoice-Large model (stable release)

Version 1.0.4

  • Improved tokenizer dependency handling

Version 1.0.3

  • Added attention_type parameter to both Single Speaker and Multi Speaker nodes for performance optimization
    • auto (default): Automatic selection of best implementation
    • eager: Standard implementation without optimizations
    • sdpa: PyTorch's optimized Scaled Dot Product Attention
    • flash_attention_2: Flash Attention 2 for maximum performance (requires compatible GPU)
  • Added diffusion_steps parameter to control generation quality vs speed trade-off
    • Default: 20 (VibeVoice default)
    • Higher values: Better quality, longer generation time
    • Lower values: Faster generation, potentially lower quality

Version 1.0.2

  • Added free_memory_after_generate toggle to both Single Speaker and Multi Speaker nodes
  • New dedicated "Free Memory Node" for manual memory management in workflows
  • Improved VRAM/RAM usage optimization
  • Enhanced stability for long generation sessions
  • Users can now choose between automatic or manual memory management

Version 1.0.1

  • Fixed issue with line breaks in speaker text (both single and multi-speaker nodes)
  • Line breaks within individual speaker text are now automatically removed before generation
  • Improved text formatting handling for all generation modes

Version 1.0.0

  • Initial release
  • Single speaker node with voice cloning
  • Multi-speaker node with automatic speaker detection
  • Text file loading from ComfyUI directories
  • Deterministic and sampling generation modes
  • Support for VibeVoice 1.5B and Large models