Merged
109 commits
0efab64
Add the base code for the annotator module
Nov 6, 2025
4338308
- Add model definition for the LLM structured output
Nov 7, 2025
ad19191
Make it work with LLM
Nov 7, 2025
9af6370
Add summary generation
Nov 7, 2025
cb4739b
- Move the Annotator's subclasses into two different files nlp and au…
Nov 7, 2025
2dcc327
- Add ASR task
Nov 7, 2025
75a7755
Merge branch 'idiap:main' into annotator
qanastek Nov 8, 2025
a92c551
Merge branch 'main' into annotator
Nov 18, 2025
0dbb406
- Make subfolders for the annotators in order to improve collaborativ…
Nov 19, 2025
2061756
Generate the annotations file
Nov 19, 2025
e9a581e
Merge branch 'idiap:main' into annotator
qanastek Nov 19, 2025
1dd2b7d
Merge branch 'idiap:main' into annotator
qanastek Nov 19, 2025
d418254
Merge branch 'idiap:main' into annotator
qanastek Nov 20, 2025
13428d5
Merge branch 'idiap:main' into annotator
qanastek Nov 22, 2025
7378d8c
Rename annotator into task
Nov 22, 2025
5da6f33
Merge branch 'annotator' of https://github.com/qanastek/sdialog into …
Nov 22, 2025
5fa9af7
Add NER and SLU tasks
Nov 23, 2025
90ed67f
Add the diarization task
Nov 23, 2025
9e7ac64
Update the documentation with the new changes on tasks
Nov 23, 2025
e868d1f
Rename NER and SLU in full name: NamedEntityRecognition and SpokenLan…
Nov 23, 2025
59878d2
Rename SummaryTask into SummarizationTask
Nov 23, 2025
0a9f496
Update diarization task and add support for speaker identification
Nov 23, 2025
84ab655
Create the audio.evaluation submodule and add to it a function for co…
Nov 24, 2025
e61e6b1
Add documentation section for the evaluation of the audio module.
Nov 24, 2025
313263f
- Add Result class
Nov 24, 2025
37e66ca
Add or improve the different audio evaluators
Nov 25, 2025
9dcc11d
- Fix bug when loading step 3 data from a file, the room is now corre…
Nov 27, 2025
dee5cb8
Merge branch 'idiap:main' into tasks+evaluation
qanastek Nov 27, 2025
8be55eb
Add the file saving feature in the audio evaluation module
Nov 28, 2025
b37b2ae
Merge branch 'idiap:main' into tasks+evaluation
qanastek Nov 28, 2025
6395be3
Add audio analytics
Nov 29, 2025
3bc1faa
- Update IndexTTS class to support also version 2.0
Nov 29, 2025
73c320e
- Add texts normalizer in the ASR task (whisper + lowercase)
Nov 30, 2025
4cf7ed7
- Make the TTS models take text normalizers
Nov 30, 2025
991cc0f
Merge branch 'idiap:main' into tasks+evaluation
qanastek Nov 30, 2025
edca7eb
- Add voices in the to_audio
Dec 5, 2025
803a698
Merge branch 'idiap:main' into tasks+evaluation
qanastek Dec 5, 2025
a4d7108
Fix dscaper problem with indextts 1 bit missing due to rounding
Dec 5, 2025
71f3b4a
Bug correction dscaper
Dec 6, 2025
40fe631
Revert bug fix
Dec 6, 2025
205bfd9
Update IndexTTS example
Dec 22, 2025
274e598
- Add compute_overlapping_and_pausing_llm
Jan 27, 2026
37a6a08
- Add overlap / pauses in to_audio
Jan 27, 2026
2f19b34
Merge branch 'idiap:main' into tasks+evaluation
qanastek Jan 27, 2026
6c1cea8
Update tutorial 10
Jan 27, 2026
7b6ef58
Merge pull request #1 from qanastek/tasks+evaluation
qanastek Jan 27, 2026
51ceaa4
Prepare sound effect feature addition
Feb 1, 2026
4e860e2
Make the audio event feature addition working.
Feb 9, 2026
da1eee5
lower the SNR
Feb 9, 2026
e584e31
Merge branch 'idiap:main' into main
qanastek Feb 9, 2026
4f35fcd
- Add event dropout
Feb 10, 2026
b813045
- Change SNR -10dB for audio events
Feb 11, 2026
eb1e77d
- Second variant of RTTM generator
Feb 11, 2026
4deabc5
Add threshold
Feb 11, 2026
48bca40
Remove unused imports
Feb 12, 2026
de8bf97
Merge remote-tracking branch 'upstream/main'
Feb 12, 2026
fe06aa6
Merge remote-tracking branch 'upstream/main'
Feb 12, 2026
6c1ceb8
Merge branch 'idiap:main' into main
qanastek Feb 12, 2026
1220ae7
Update tutorials
Feb 12, 2026
056e72e
Fix the bug introduced by merging to main branch
Feb 12, 2026
b2c4a62
Update qwen 3 models and pipeline to fix errors
Feb 13, 2026
7af70ef
- Merge sergio's fix for voice databases using Qwen 3 TTS
Feb 13, 2026
e2ac51f
Add annotation skipping
Feb 13, 2026
e6acc33
Add a tag normalizer
Feb 13, 2026
6f9efac
Improve recipe
Feb 13, 2026
58d10eb
Fix bug with foreground and background datasets
Feb 14, 2026
0fec97d
Fix case where llm cannot annotate
Feb 14, 2026
26c8ff9
Add normalization of the text Qwen 3 TTS
Feb 14, 2026
0a7fabf
Fix ndarray
Feb 14, 2026
c93b398
Add deterministic support for Qwen3TTS
SevKod Feb 15, 2026
5ac796f
Add trimming after audio generation, and add random gaps
SevKod Feb 15, 2026
0eeddae
Fix numpy conversion of audios
SevKod Feb 15, 2026
aed0ce5
Merge pull request #2 from SevKod/alternative
qanastek Feb 15, 2026
88d7aef
Update
Feb 15, 2026
316fe2f
Merge branch 'main' of https://github.com/qanastek/sdialog
Feb 16, 2026
19c6488
Update
Feb 16, 2026
0860161
Fix sound event error
Feb 16, 2026
64666d5
- Fix looping ac noise
Feb 16, 2026
a3a741e
Add normalization after dry and wet
SevKod Feb 16, 2026
56cbb37
Merge pull request #3 from SevKod/patch-3
qanastek Feb 16, 2026
cc54001
Move dScaper data sending before to be done only once
Feb 16, 2026
bd0e254
Remove implicit normalizer
Feb 16, 2026
eb698e9
Fix normalization for consistency with Qwen3-TTS Voice cloning
SevKod Feb 16, 2026
211cce7
Merge pull request #5 from SevKod/patch-7
qanastek Feb 17, 2026
636dbff
Add seed to the speaker placement
Feb 17, 2026
9ecab99
First try of the post processing
Feb 17, 2026
53da14e
Add ASCII to text normalization for most of the cases
SevKod Feb 18, 2026
9aaaf5d
Callback for final pyroom mix
Feb 18, 2026
2811fd6
Merge pull request #6 from SevKod/patch-8
qanastek Feb 18, 2026
df3cf4a
Add control of the snr callback
Feb 22, 2026
e91b7ec
Update the sound effect tutorial
Mar 12, 2026
4b22bff
Update title of tutorial 11
qanastek Mar 12, 2026
69b10eb
Merge remote-tracking branch 'upstream/main'
qanastek Mar 12, 2026
cd3e295
Remove 02_tasks tutorial from repository
qanastek Mar 12, 2026
ad78e78
Remove audio evaluation from GitHub tracking
qanastek Mar 12, 2026
3bd4bbc
Remove tasks from GitHub tracking
qanastek Mar 12, 2026
bb598f4
Remove audio evaluation and to tasks
qanastek Mar 12, 2026
670c7a2
Remove tasks from GitHub tracking
qanastek Mar 12, 2026
58d9f58
Remove audio evaluation from GitHub tracking
qanastek Mar 12, 2026
aeba194
Update code for passing Flake8
qanastek Mar 12, 2026
fa3dd55
Update test for audio pipeline
qanastek Mar 12, 2026
381c479
Fix Flake8
qanastek Mar 12, 2026
e01f733
Merge branch 'idiap:main' into main
qanastek Mar 12, 2026
367b3d3
Revert changes on the gitignore
qanastek Mar 12, 2026
b66c926
Remove tasks and audio evaluation from the documentation.
qanastek Mar 12, 2026
1d1a4e9
Remove internal function to get / set annotations in Dialog
qanastek Mar 12, 2026
fbb14fb
Remove file
qanastek Mar 12, 2026
8f19e4b
Remove more
qanastek Mar 12, 2026
cc18caf
Merge remote-tracking branch 'upstream/main'
qanastek Mar 13, 2026
1 change: 0 additions & 1 deletion docs/api/sdialog.rst
@@ -112,7 +112,6 @@ sdialog.evaluation.base
:members:
:show-inheritance:


----

sdialog.datasets
2 changes: 1 addition & 1 deletion docs/sdialog/index.rst
@@ -656,7 +656,7 @@ Audio Generation
The audio module of SDialog extends the core functionality by adding comprehensive audio generation and processing capabilities for dialogues. It enables transforming text dialogues into immersive audio experiences with realistic voices and simulated acoustic environments.

Setup and Installation
---------------------
----------------------

To work with audio features in SDialog, you'll need to install additional dependencies and system packages:

2 changes: 1 addition & 1 deletion requirements-audio-test.txt
@@ -5,4 +5,4 @@ jams
pyloudnorm
pyroomacoustics
huggingface_hub[cli]
dscaper>=1.7.0
dscaper>=1.7.7
6 changes: 3 additions & 3 deletions requirements-audio.txt
@@ -4,7 +4,7 @@ sox
jams
pyloudnorm
pyroomacoustics
datasets<=3.6.0
datasets<=2.21.0
huggingface_hub[cli]
dscaper>=1.7.0
qwen-tts
dscaper>=1.7.7
whisper-normalization
19 changes: 16 additions & 3 deletions src/sdialog/__init__.py
@@ -474,7 +474,7 @@ def to_audio(
This is a convenience wrapper around the full `sdialog.audio.pipeline.to_audio` function.
All keyword arguments are passed to it.

:param path: Directory path for storing audio outputs.
:param path: Path to the audio file or directory for storing audio outputs.
:type path: str
:param dialog_dir_name: Custom name for the dialogue directory.
:type dialog_dir_name: str
@@ -510,8 +510,6 @@ def to_audio(
:type audio_file_format: str
:param seed: Seed for random number generator.
:type seed: int
:param re_sampling_rate: Re-sampling rate for the output audio.
:type re_sampling_rate: Optional[int]
:param recording_devices: The identifiers of the recording devices to simulate.
:type recording_devices: Optional[List[Union[RecordingDevice, str]]]
:param impulse_response_database: The database for impulse responses.
@@ -520,6 +518,21 @@
:type override_tts_audio: Optional[bool]
:param verbose: Verbose mode for logging.
:type verbose: Optional[bool]
:param overlap_pauses: Generate the audio with overlapping and pausing between turns using LLM.
:type overlap_pauses: Optional[bool]
:param add_sound_effects: Add sound effects (such as door opening, footsteps, etc.) to the audio.
:type add_sound_effects: Optional[bool]
:param sound_effects_dropout: Dropout rate for sound effects.
:type sound_effects_dropout: Optional[float]
:param skip_annotation: Whether to skip the annotation of the sound effects
(if your dialogs are already annotated with sound effects tags, you can skip this step).
:type skip_annotation: Optional[bool]
:param remove_silences: Remove the silences at the beginning and the end of the audio.
:type remove_silences: Optional[bool]
:param callback_mix_fn: Callback function to apply to the mixed audio.
:type callback_mix_fn: Optional[Callable]
:param callback_mix_kwargs: Keyword arguments for the callback function.
:type callback_mix_kwargs: dict
:return: Audio dialogue with processed audio data.
:rtype: "sdialog.audio.dialog.AudioDialog"
:raises Exception: If the audio module is not installed.
76 changes: 53 additions & 23 deletions src/sdialog/audio/__init__.py
@@ -58,14 +58,14 @@
import numpy as np
from tqdm import tqdm
import soundfile as sf
from typing import Union
from typing import Union, Optional, Callable

from sdialog.audio.tts import BaseTTS
from sdialog.audio.dialog import AudioDialog
from sdialog.audio.room import Room, RoomPosition
from sdialog.audio.utils import AudioUtils, SourceVolume, Role, logger
from sdialog.audio.acoustics_simulator import AcousticsSimulator
from sdialog.audio.voice_database import BaseVoiceDatabase, Voice
from sdialog.audio.dialog import AudioDialog, RoomAcousticsConfig
from sdialog.audio.utils import SourceVolume, Role, logger

device = "cuda" if torch.cuda.is_available() else "cpu"

@@ -79,7 +79,8 @@ def generate_utterances_audios(
keep_duplicate: bool = False,
seed: int = None,
sampling_rate: int = 24_000,
tts_pipeline_kwargs: dict = {}
tts_pipeline_kwargs: dict = {},
remove_silences: bool = True
) -> AudioDialog:
"""
Generates audio for each utterance in an AudioDialog object using the specified TTS engine.
@@ -113,6 +114,8 @@
:type seed: int
:param sampling_rate: Sampling rate for the audio generation.
:type sampling_rate: int
:param remove_silences: If True, remove the silences at the beginning and the end of the audio.
:type remove_silences: bool
:return: The AudioDialog object with generated audio for each turn.
:rtype: AudioDialog
"""
@@ -137,7 +140,7 @@

# Generate the utterance audio
utterance_audio, utterance_sampling_rate = generate_utterance(
text=AudioUtils.remove_audio_tags(turn.text),
text=turn.text,
voice=turn.voice,
tts_pipeline=tts_pipeline,
tts_pipeline_kwargs=tts_pipeline_kwargs
@@ -156,9 +159,16 @@
target_sr=sampling_rate,
)

# Remove the silences at the beginning and the end of the audio
if remove_silences:
utterance_audio, _ = librosa.effects.trim(utterance_audio, top_db=60)

# Set the utterance audio to the turn
turn.set_audio(utterance_audio, sampling_rate)

# Set the audio duration of the turn
turn.audio_duration = utterance_audio.shape[0] / sampling_rate

return dialog
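The new `remove_silences` step in this hunk delegates to `librosa.effects.trim(utterance_audio, top_db=60)`, then recomputes the turn duration from the trimmed sample count. As a rough, self-contained sketch of what that trim does (librosa works frame-wise on dB-scaled RMS energy; this hypothetical `trim_silence` is a simpler sample-wise approximation):

```python
import numpy as np

def trim_silence(audio: np.ndarray, top_db: float = 60.0) -> np.ndarray:
    """Drop leading/trailing samples quieter than `top_db` dB below the peak.

    Hypothetical sample-wise stand-in for librosa.effects.trim(audio, top_db=60).
    """
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio
    threshold = peak * 10.0 ** (-top_db / 20.0)
    loud = np.nonzero(np.abs(audio) > threshold)[0]
    return audio[loud[0]:loud[-1] + 1]

# After trimming, the turn duration is samples / sampling_rate,
# as in the `turn.audio_duration` assignment above.
sr = 24_000
signal = np.concatenate([np.zeros(sr), np.ones(sr // 2), np.zeros(sr)])
trimmed = trim_silence(signal)
duration = trimmed.shape[0] / sr
```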


@@ -188,7 +198,12 @@ def generate_utterance(
:return: A tuple containing the audio data as a numpy array and the sampling rate.
:rtype: tuple[np.ndarray, int]
"""
return tts_pipeline.generate(text, speaker_voice=voice, tts_pipeline_kwargs=tts_pipeline_kwargs)
audio, sr = tts_pipeline.generate(text, speaker_voice=voice, tts_pipeline_kwargs=tts_pipeline_kwargs)

if isinstance(audio, torch.Tensor):
audio = audio.cpu().numpy()

return audio, sr
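The tensor guard added to `generate_utterance` above can be sketched duck-typed, so the sketch runs without torch installed (with torch, a `torch.Tensor` exposes `.cpu()` and takes the first branch):

```python
import numpy as np

def to_numpy_audio(audio):
    """Normalize TTS output to a NumPy array (sketch of the guard above)."""
    if hasattr(audio, "cpu"):       # torch.Tensor path: move off the GPU, then convert
        audio = audio.cpu().numpy()
    return np.asarray(audio)

# Stand-in for a torch.Tensor, to exercise the tensor branch without torch.
class FakeTensor:
    def __init__(self, data):
        self._data = np.asarray(data)
    def cpu(self):
        return self
    def numpy(self):
        return self._data

converted = to_numpy_audio(FakeTensor([0.1, 0.2]))
```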


def generate_audio_room_accoustic(
@@ -201,7 +216,9 @@ def generate_audio_room_accoustic(
audio_file_format: str = "wav",
background_effect: str = "white_noise",
foreground_effect: str = "ac_noise_minimal",
foreground_effect_position: RoomPosition = RoomPosition.TOP_RIGHT
foreground_effect_position: RoomPosition = RoomPosition.TOP_RIGHT,
callback_mix_fn: Optional[Callable] = None,
callback_mix_kwargs: dict = {}
) -> AudioDialog:
"""
Generates room acoustics simulation for the dialogue audio.
@@ -237,16 +254,29 @@
:type foreground_effect: str
:param foreground_effect_position: Position for foreground effects.
:type foreground_effect_position: RoomPosition
:param callback_mix_fn: Callback function to apply to the mixed audio.
:type callback_mix_fn: Optional[Callable]
:param callback_mix_kwargs: Keyword arguments for the callback function.
:type callback_mix_kwargs: dict
:return: The AudioDialog with room acoustics simulation results and file paths.
:rtype: AudioDialog
"""

# Create the room acoustics simulator
room_acoustics = AcousticsSimulator(room=room, kwargs_pyroom=kwargs_pyroom)

# Prepare callback kwargs
_callback_mix_kwargs = callback_mix_kwargs.copy() if callback_mix_kwargs is not None else {}

# Add dialog to kwargs if not present
if "dialog" not in _callback_mix_kwargs:
_callback_mix_kwargs["dialog"] = dialog

_audio_accoustic = room_acoustics.simulate(
sources=dialog.get_audio_sources(),
source_volumes=source_volumes
source_volumes=source_volumes,
callback_mix_fn=callback_mix_fn,
callback_mix_kwargs=_callback_mix_kwargs,
)

# Save the audio file
@@ -270,28 +300,28 @@
# If the audio paths post processing are already in the dialog, use them, otherwise create a new dictionary
if (
room_name in dialog.audio_step_3_filepaths
and "audio_paths_post_processing" in dialog.audio_step_3_filepaths[room_name]
and dialog.audio_step_3_filepaths[room_name]["audio_paths_post_processing"] != {}
and dialog.audio_step_3_filepaths[room_name].audio_paths_post_processing is not None
and dialog.audio_step_3_filepaths[room_name].audio_paths_post_processing != {}
):
audio_paths_post_processing = dialog.audio_step_3_filepaths[room_name]["audio_paths_post_processing"]
audio_paths_post_processing = dialog.audio_step_3_filepaths[room_name].audio_paths_post_processing
logger.info(
f"Existing audio paths for the post processing stage "
f"already exist for room name: '{room_name}' and are kept unchanged"
)
else:
audio_paths_post_processing = {}

dialog.audio_step_3_filepaths[room_name] = {
"audio_path": current_room_audio_path,
"microphone_position": room.mic_position,
"room_name": room_name,
"room": room,
"source_volumes": source_volumes,
"kwargs_pyroom": kwargs_pyroom,
"background_effect": background_effect,
"foreground_effect": foreground_effect,
"foreground_effect_position": foreground_effect_position,
"audio_paths_post_processing": audio_paths_post_processing
}
dialog.audio_step_3_filepaths[room_name] = RoomAcousticsConfig(
audio_path=current_room_audio_path,
microphone_position=room.mic_position,
room_name=room_name,
room=room,
source_volumes=source_volumes,
kwargs_pyroom=kwargs_pyroom,
background_effect=background_effect,
foreground_effect=foreground_effect,
foreground_effect_position=foreground_effect_position,
audio_paths_post_processing=audio_paths_post_processing,
)

return dialog
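pyroomacoustics hands `callback_mix` the per-source simulation results before summing them (a `premix` array of shape `(n_sources, n_mics, n_samples)`), along with `callback_mix_kwargs`. A hypothetical callback compatible with the wiring in this diff — note that `generate_audio_room_accoustic` injects `dialog` into the kwargs, so a callback should accept it:

```python
import numpy as np

def snr_mix(premix, snr=20.0, ref_src=0, dialog=None):
    """Scale everything but source `ref_src` to sit `snr` dB below it, then sum.

    Hypothetical example callback; `dialog` is accepted (and ignored here)
    because generate_audio_room_accoustic adds it to callback_mix_kwargs.
    """
    ref = premix[ref_src]
    rest = premix.sum(axis=0) - ref
    p_ref, p_rest = np.mean(ref ** 2), np.mean(rest ** 2)
    if p_rest > 0:
        rest = rest * np.sqrt(p_ref / (p_rest * 10.0 ** (snr / 10.0)))
    return ref + rest

# Dummy premix: one speaker at unit level, one louder "noise" source.
premix = np.stack([np.ones((1, 100)), 2.0 * np.ones((1, 100))])
mixed = snr_mix(premix, snr=20.0)
```

Such a function would be passed as `callback_mix_fn=snr_mix` to `to_audio` or `generate_audio_room_accoustic`.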
36 changes: 32 additions & 4 deletions src/sdialog/audio/acoustics_simulator.py
@@ -49,10 +49,10 @@
import os
import numpy as np
import soundfile as sf
from typing import List
from typing import List, Callable, Optional

from sdialog.audio.utils import logger, SourceVolume
from sdialog.audio.room import Room, AudioSource, RoomPosition, DirectivityType
from sdialog.audio.room import Room, AudioSource, RoomPosition, DirectivityType, Position3D


class AcousticsSimulator:
@@ -230,6 +230,8 @@ def _add_sources(

for i, audio_source in enumerate(audiosources):

# audio_source.position = audio_source.position.replace("sfx|", "")

self.audiosources.append(audio_source)

# Get the position of the audio source
@@ -242,6 +244,22 @@
elif audio_source.position.startswith("speaker_"): # speaker_ is the speaker sound
_position3d = self.room.speakers_positions[audio_source.position]

# Check if the position corresponds to a furniture
elif audio_source.position in self.room.furnitures:
furniture = self.room.furnitures[audio_source.position]
_position3d = Position3D(
furniture.x + furniture.width / 2,
furniture.y + furniture.depth / 2,
furniture.get_top_z()
)

else:
logger.warning(
f"Unknown position '{audio_source.position}' for audio source '{audio_source.name}'. "
"Placing it at the center of the room."
)
_position3d = self.room.room_position_to_position3d(RoomPosition.CENTER)

# Load the audio file from the file system for the audio source
if audio_source.source_file and os.path.exists(audio_source.source_file):

@@ -279,7 +297,9 @@ def simulate(
self,
sources: List[AudioSource] = [],
source_volumes: dict[str, SourceVolume] = {},
reset: bool = False
reset: bool = False,
callback_mix_fn: Optional[Callable] = None,
callback_mix_kwargs: Optional[dict] = None
):
"""
Simulates room acoustics for the given audio sources.
@@ -301,6 +321,10 @@
:type source_volumes: dict[str, SourceVolume]
:param reset: If True, resets the room acoustics simulator before simulation.
:type reset: bool
:param callback_mix_fn: Callback function to apply to the mixed audio.
:type callback_mix_fn: Optional[Callable]
:param callback_mix_kwargs: Keyword arguments for the callback function.
:type callback_mix_kwargs: dict
:return: Processed audio with room acoustics effects applied.
:rtype: np.ndarray
:raises ValueError: If audio sources are invalid or empty.
@@ -316,7 +340,10 @@
self._add_sources(sources, source_volumes)

logger.info("[Step 3] Simulating room acoustics...")
self._pyroom.simulate()
self._pyroom.simulate(
callback_mix=callback_mix_fn if callback_mix_fn is not None else None,
callback_mix_kwargs=callback_mix_kwargs if callback_mix_fn is not None else {}
)

except ValueError as e:

@@ -361,6 +388,7 @@ def reset(self):

del self._pyroom
self._pyroom = None
self.audiosources = []

@staticmethod
def apply_snr(x, snr):
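The furniture branch added to `_add_sources` earlier in this file's diff places a sound source at the centre of the furniture's top surface. A minimal stand-in (the `Furniture` class here is hypothetical; the real objects live in `sdialog.audio.room`):

```python
from dataclasses import dataclass

@dataclass
class Furniture:
    # Hypothetical stand-in: footprint origin (x, y), size, and height.
    x: float
    y: float
    width: float
    depth: float
    height: float

    def get_top_z(self) -> float:
        return self.height

def furniture_source_position(f):
    """Centre of the top surface, mirroring the Position3D built in _add_sources."""
    return (f.x + f.width / 2, f.y + f.depth / 2, f.get_top_z())

desk = Furniture(x=1.0, y=2.0, width=1.2, depth=0.6, height=0.75)
pos = furniture_source_position(desk)  # ≈ (1.6, 2.3, 0.75)
```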