Welcome to the ComfyUI Audio Toolkit (AudioTools), a comprehensive custom node pack for ComfyUI that brings a full suite of audio generation, processing, and analysis capabilities into your generative workflows.
This toolkit is designed for a wide range of audio tasks, from podcast enhancement and text-to-speech to creative music manipulation and fully automated, batch-processed audio-reactive visual generation.
- Amplify / Gain
- Normalize Audio
- Mix Audio Tracks
- Trim Audio
- Remove Silence
- Noise Gate
- De-Esser
- De-Plosive (Low Cut)
- De-Hum (50/60Hz)
- Parametric EQ for Voice
- Vocal Compressor
- Navigate to your ComfyUI
custom_nodes
directory:cd ComfyUI/custom_nodes/
- Clone this repository:
git clone https://github.com/lum3on/ComfyUI_AudioTools
- Install the required dependencies from within the new directory:
pip install -r ComfyUI_AudioTools/requirements.txt
- Restart ComfyUI.
- Batch Processing: Most processing, effects, and visualization nodes are now batch-aware. You can load an entire folder of audio files, process them all simultaneously, and visualize the results in one go.
Category: AudioTools/IO
Loads all audio files from a folder path that match a pattern (e.g., *.wav
). Can be configured to sort files in various ways, including by modification date to get the newest file.
Parameter | Type | Description |
---|---|---|
(Input) directory_path |
STRING |
The full path to the directory containing audio files. |
(Input) file_pattern |
STRING |
The pattern to match files (e.g., *.wav , *.mp3 , audio_*.flac ). |
(Input) sort_order |
COMBO |
The order to sort files before loading. |
(Input) skip_first |
INT |
Number of files to skip from the start of the sorted list. |
(Input) load_cap |
INT |
Maximum number of files to load. Use -1 for no limit. |
(Output) audio_batch |
AUDIO |
A single padded AUDIO object containing all loaded clips as a batch. |
(Output) audio_list |
AUDIO_LIST |
A list where each item is a separate, unpadded AUDIO clip. |
(Output) filenames |
STRING |
A list of the filenames (string) that were successfully loaded. |
Category: AudioTools/IO
Retrieves a single audio clip from the audio_list
output of the batch loader, allowing for individual processing within a workflow.
Parameter | Type | Description |
---|---|---|
(Input) audio_list |
AUDIO_LIST |
The list of audio clips from a batch loader. |
(Input) index |
INT |
The index of the audio clip to retrieve from the list. Wraps around if the index is out of bounds. |
(Output) AUDIO |
AUDIO |
The single audio clip selected from the list. |
Category: AudioTools/Processing
Converts audio to a standard format (mono or stereo) and data type to fix compatibility issues with other nodes that expect a specific layout.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio (or batch) to standardize. |
(Input) channel_layout |
COMBO |
Convert audio to mono (single channel) or ensure it is stereo (two channels). |
(Output) AUDIO |
AUDIO |
The standardized audio. |
Category: AudioTools/Generation
Converts text into spoken audio using your operating system's built-in text-to-speech engine.
Parameter | Type | Description |
---|---|---|
(Input) text |
STRING |
The text to be converted into speech. |
(Input) voice_index |
COMBO |
The system voice to use for synthesis. Voice options depend on your operating system. |
(Input) rate |
INT |
The speaking rate in words per minute. |
(Input) volume |
FLOAT |
The volume of the generated audio (0.0 to 1.0). |
(Output) AUDIO |
AUDIO |
The generated spoken audio clip. |
Category: AudioTools/Utility
Joins two audio clips together end-to-end. The first clip from each input batch is used.
Parameter | Type | Description |
---|---|---|
(Input) audio_a |
AUDIO |
The first audio clip (will be the beginning of the result). |
(Input) audio_b |
AUDIO |
The second audio clip (will be appended to the end of the first). |
(Output) AUDIO |
AUDIO |
The new, combined audio clip. |
Category: AudioTools/Utility
Positions a sound in the stereo (left/right) field. This is applied to all audio clips in a batch.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to pan. |
(Input) pan |
FLOAT |
Stereo position. -1.0 is hard left, 1.0 is hard right, 0.0 is center. |
(Output) AUDIO |
AUDIO |
The panned audio. Mono inputs are converted to stereo. |
Category: AudioTools/Utility
Adds a specified duration of silence to the beginning or end of audio. This is applied to all audio clips in a batch.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to pad with silence. |
(Input) pad_start_seconds |
FLOAT |
Duration of silence to add to the beginning of the audio. |
(Input) pad_end_seconds |
FLOAT |
Duration of silence to add to the end of the audio. |
(Output) AUDIO |
AUDIO |
The padded audio. |
Category: AudioTools/Processing
Adjusts the volume of the audio by a specified decibel (dB) value.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to apply gain to. |
(Input) gain_db |
FLOAT |
Amount of gain in decibels (dB) to apply. Positive values amplify, negative values attenuate. |
(Output) AUDIO |
AUDIO |
The amplified audio. |
Category: AudioTools/Processing
Normalizes the peak volume of the audio to a target dB level, maximizing loudness without clipping.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to normalize. |
(Input) target_level_db |
FLOAT |
The target peak volume in decibels (dB). A value of 0.0 is maximum, but -1.0 is a common target to avoid clipping. |
(Output) AUDIO |
AUDIO |
The normalized audio. |
Category: AudioTools/Processing
Combines two audio tracks into one. The first clip from each input batch is used.
Parameter | Type | Description |
---|---|---|
(Input) audio_1 |
AUDIO |
The first audio track. |
(Input) audio_2 |
AUDIO |
The second audio track. |
(Input) gain_1_db |
FLOAT |
Gain in dB for the first audio track. |
(Input) gain_2_db |
FLOAT |
Gain in dB for the second audio track. |
(Output) AUDIO |
AUDIO |
The mixed audio. |
Category: AudioTools/Processing
Cuts a specified number of seconds from the beginning or end of an audio clip. (Note: This node is not batch-aware).
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to trim. |
(Input) trim_start_seconds |
FLOAT |
Number of seconds to cut from the beginning of the audio. |
(Input) trim_end_seconds |
FLOAT |
Number of seconds to cut from the end of the audio. |
(Output) AUDIO |
AUDIO |
The trimmed audio clip. |
Category: AudioTools/Processing
Intelligently analyzes and trims silent sections from an audio clip. (Note: This node is not batch-aware).
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip from which to remove silent sections. |
(Input) silence_threshold_db |
FLOAT |
The volume level (in dB) below which audio is considered silent. |
(Input) min_silence_len_ms |
INT |
The minimum duration (in milliseconds) of silence to be removed. |
(Output) AUDIO |
AUDIO |
The audio with silent sections removed. |
Category: AudioTools/Processing
Silences audio that falls below a specified volume threshold, useful for removing background noise between words.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to apply the noise gate to. |
(Input) threshold_db |
FLOAT |
The volume level (dB) below which the gate will close and silence the audio. |
(Input) attack_ms |
FLOAT |
How quickly (in ms) the gate opens when the signal exceeds the threshold. |
(Input) release_ms |
FLOAT |
How quickly (in ms) the gate closes after the signal falls below the threshold. |
(Output) AUDIO |
AUDIO |
The gated audio. |
Category: AudioTools/Processing
Reduces harsh "s" sounds (sibilance) in voice recordings by applying a narrow-band EQ cut at a specified frequency.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to de-ess. |
(Input) frequency_hz |
INT |
The center frequency of sibilance to target (typically 5-8 kHz). |
(Input) reduction_db |
FLOAT |
The amount of gain reduction (in dB) to apply at the target frequency. |
(Input) q_factor |
FLOAT |
The width of the frequency band to affect. Higher Q is narrower. |
(Output) AUDIO |
AUDIO |
The de-essed audio. |
Category: AudioTools/Processing
Reduces low-frequency pops ("plosives") caused by air hitting the microphone (e.g., from 'p' and 'b' sounds) using a high-pass filter.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to de-plosive (low-cut). |
(Input) cutoff_hz |
INT |
The cutoff frequency for the high-pass filter. Frequencies below this will be rolled off. |
(Output) AUDIO |
AUDIO |
The filtered audio. |
Category: AudioTools/Processing
Removes electrical power line hum by applying very narrow notch filters at the fundamental frequency (50 or 60 Hz) and its first harmonic.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to de-hum. |
(Input) hum_freq |
COMBO |
The fundamental frequency of the electrical hum to remove. |
(Input) reduction_db |
FLOAT |
The amount of gain reduction (dB) to apply to the hum frequencies. |
(Input) q_factor |
FLOAT |
The narrowness of the filter. A high Q value is needed to target only the hum. |
(Output) AUDIO |
AUDIO |
The de-hummed audio. |
Category: AudioTools/Processing
A 3-band equalizer specifically tuned for enhancing vocal clarity, featuring a low-cut, a presence boost, and an "air" band.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to equalize. |
(Input) low_cut_hz |
INT |
Low-cut (high-pass) filter to remove rumble. 80-120Hz is common for voice. |
(Input) presence_boost_db |
FLOAT |
Boost/cut for vocal presence (around 4kHz). |
(Input) air_boost_db |
FLOAT |
High-shelf boost/cut for 'air' and clarity (around 12kHz). |
(Output) AUDIO |
AUDIO |
The equalized audio. |
Category: AudioTools/Processing
Evens out the dynamic range of an audio clip, making quiet parts louder and loud parts quieter for a more consistent volume level.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to compress. |
(Input) threshold_db |
FLOAT |
The volume level (dB) at which the compressor starts working. |
(Input) ratio |
FLOAT |
The amount of gain reduction (e.g., 4.0 means a 4:1 ratio). |
(Input) attack_ms |
FLOAT |
How quickly (in ms) the compressor reacts to loud sounds. |
(Input) release_ms |
FLOAT |
How quickly (in ms) the compressor stops after the sound falls below the threshold. |
(Input) makeup_gain_db |
FLOAT |
Volume boost to apply after compression to make up for the reduced level. |
(Output) AUDIO |
AUDIO |
The compressed audio. |
Category: AudioTools/Effects
Adds spatial reverberation to the audio, simulating the sound of a room or space.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to apply reverb to. |
(Input) room_size |
FLOAT |
The perceived size of the reverberant space (0-100). |
(Input) damping |
FLOAT |
How much the high frequencies are absorbed in the reverb tails (0-100). |
(Input) wet_level |
FLOAT |
The volume of the reverberated (wet) signal. |
(Input) dry_level |
FLOAT |
The volume of the original (dry) signal. |
(Output) AUDIO |
AUDIO |
The reverberated audio. |
Category: AudioTools/Effects
Creates a repeating, decaying echo effect on the audio.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to apply delay to. |
(Input) delay_ms |
FLOAT |
The time (in milliseconds) between each echo. |
(Input) feedback |
FLOAT |
How much of the delayed signal is fed back into the delay line, creating more echoes. |
(Input) mix |
FLOAT |
The balance between the original (dry) and delayed (wet) signal. 0.0 is all dry, 1.0 is all wet. |
(Output) AUDIO |
AUDIO |
The audio with the delay effect. |
Category: AudioTools/Effects
Applies a linear fade-in from silence to the start of the audio.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to apply a fade-in to. |
(Input) duration_seconds |
FLOAT |
The duration of the fade-in effect in seconds. |
(Output) AUDIO |
AUDIO |
The audio with the fade-in applied. |
Category: AudioTools/Effects
Applies a linear fade-out to silence at the end of the audio.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to apply a fade-out to. |
(Input) duration_seconds |
FLOAT |
The duration of the fade-out effect in seconds. |
(Output) AUDIO |
AUDIO |
The audio with the fade-out applied. |
Category: AudioTools/Effects
Changes the audio's pitch without changing its speed, and/or changes its speed without changing the pitch. (Note: This node is not batch-aware).
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio to pitch shift or time stretch. |
(Input) pitch_semitones |
FLOAT |
The number of semitones to shift the pitch up or down. |
(Input) tempo_factor |
FLOAT |
The factor by which to change the tempo. >1.0 is faster, <1.0 is slower. |
(Output) AUDIO |
AUDIO |
The processed audio. |
Category: AudioTools/AI
Uses the Demucs AI model to split a music track into its core components: vocals, bass, drums, and other instruments.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to be separated into stems. |
(Input) model_name |
COMBO |
The Demucs model to use for separation. 'htdemucs_ft' is a good general-purpose choice. |
(Output) vocals |
AUDIO |
The isolated vocal track. |
(Output) bass |
AUDIO |
The isolated bass track. |
(Output) drums |
AUDIO |
The isolated drum track. |
(Output) other |
AUDIO |
All other musical elements combined. |
Category: AudioTools/AI
Uses the Demucs AI model to isolate vocals from a recording, effectively removing background noise and non-vocal sounds.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip containing speech to be denoised. |
(Input) model_name |
COMBO |
The Demucs model to use for isolating vocals. It will remove non-vocal sounds. |
(Output) AUDIO |
AUDIO |
The denoised audio containing only the vocal signal. |
Category: AudioTools/AI
Transcribes audio to text using OpenAI's Whisper model. Can optionally generate a timed SRT (SubRip Subtitle) formatted string.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to be transcribed. |
(Input) model_size |
COMBO |
The size of the Whisper model to use. Larger models are more accurate but slower. |
(Input) language |
COMBO |
The language of the speech in the audio. 'Auto-Detect' is an option. |
(Input) task |
COMBO |
Choose between standard transcription or translating the speech directly to English. |
(Input) generate_srt |
BOOLEAN |
If enabled, generates a timestamped SRT subtitle string in the srt_text output. |
(Input) srt_max_line_len |
INT |
(Optional) The maximum number of characters allowed per line in an SRT block. |
(Input) srt_max_lines |
INT |
(Optional) The maximum number of lines allowed per SRT block. |
(Input) srt_max_duration_sec |
FLOAT |
(Optional) The maximum duration in seconds an SRT block can cover. |
(Output) text |
STRING |
The transcribed text as a single string. |
(Output) srt_text |
STRING |
The generated SRT subtitle string with timestamps. Empty if generate_srt is disabled. |
Category: AudioTools/Analysis
Measures the perceived loudness of an audio clip according to the EBU R 128 broadcast standard and outputs the result as a string.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to measure for loudness. |
(Output) loudness_info |
STRING |
A text string containing the measured integrated loudness in LUFS. |
Category: AudioTools/Analysis
Estimates the tempo (Beats Per Minute) of an audio clip and generates a frame-synced list of beat events.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to detect the tempo from. |
(Input) fps |
INT |
The target frames per second to sync the beat events list with. |
(Output) bpm_info |
STRING |
A text string with the estimated BPM. |
(Output) beat_events |
FLOAT |
A list of floats, with 1.0 on frames that land on a beat and 0.0 otherwise. |
Category: AudioTools/Analysis
Analyzes the volume envelope (RMS) of an audio clip and outputs it as a frame-by-frame list of floats, perfect for driving animations.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to analyze for its volume envelope. |
(Input) fps |
INT |
The target frames per second to sync the envelope list with. |
(Input) smoothing |
FLOAT |
Amount of smoothing to apply to the envelope. 0 is no smoothing, 1 is maximum smoothing. |
(Output) envelope |
FLOAT |
A list of floats (normalized 0-1) representing the audio's volume for each frame. |
Category: AudioTools/Visualization
An output node that displays technical details about an audio clip or batch, such as sample rate, duration, batch size, and more.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip to display information about. |
(Output) info |
STRING |
A text string containing the technical details of the audio. |
Category: AudioTools/Visualization
Generates and displays an image of the audio's waveform. This node is batch-aware and will produce one image for each audio clip in the batch.
Parameter | Type | Description |
---|---|---|
(Input) audio |
AUDIO |
The audio clip(s) to visualize. |
(Input) width |
INT |
The width of the output image in pixels. |
(Input) height |
INT |
The height of the output image in pixels. |
(Input) line_color |
STRING |
The color of the waveform line (hex code). |
(Input) bg_color |
STRING |
The background color of the image (hex code). |
(Input) show_axis |
BOOLEAN |
Whether to display the time and amplitude axes. |
(Output) IMAGE |
IMAGE |
An image (or batch of images) of the waveform. |
Category: AudioTools/Visualization
Creates an overlay image of two waveforms, making it easy to see the difference before and after processing. This node is batch-aware.
Parameter | Type | Description |
---|---|---|
(Input) audio_a |
AUDIO |
The first audio clip (or batch) to compare. |
(Input) audio_b |
AUDIO |
The second audio clip (or batch) to compare. |
(Input) width |
INT |
The width of the output image in pixels. |
(Input) height |
INT |
The height of the output image in pixels. |
(Input) color_a |
STRING |
The color for Audio A's waveform (hex code). |
(Input) color_b |
STRING |
The color for Audio B's waveform (hex code). |
(Input) bg_color |
STRING |
The background color of the image (hex code). |
(Input) show_axis |
BOOLEAN |
Whether to display the time and amplitude axes and a legend. |
(Output) IMAGE |
IMAGE |
An image (or batch of images) comparing the two waveforms. |
torch
&torchaudio
demucs
openai-whisper
pyttsx3
matplotlib
librosa
(for advanced analysis)pyloudnorm
(for LUFS measurement)
Changelog
Recent Changes:
Version 1.1.06
- fixed compatibility of the Speech-to-text Node with https://github.com/niknah/ComfyUI-F5-TTS
- added "Standardize Audio Format/Channels" Node to fix future compatibility issues
Version 1.1.05
- added srt output to the Speech-to-Text Node
- fixed some edge case errors in the Speech-to-text Node
Version 1.1.01
- added tooltips on hover to almost all in/outputs
- added licence + disclaimer
Version 1.1.00
- init