paper/paper.md
authors:
  - name: Sebastian Rosenzweig
    orcid: 0000-0003-4964-9217
    equal-contrib: true
    corresponding: true
    affiliation: 1
  - name: Marius Kriegerowski
    orcid:
    equal-contrib: true
    corresponding: false
    affiliation: 2
  - name: Frank Scherbaum
bibliography: paper.bib

# Summary
Polyphonic singing is one of the most widespread forms of music-making. During a performance, singers must constantly adjust their pitch to stay in tune with one another — a complex skill that requires extensive practice. Research has shown that pitch monitoring tools can assist singers in fine-tuning their intonation during a performance [@BerglinPD22_VisualFeedback_JPM]. Specifically, real-time visualizations of the fundamental frequency (F0), which represents the pitch of the singing voice, help singers assess their pitch relative to a fixed reference or other voices.
To support the monitoring of polyphonic singing performances, we developed `pytch`, an interactive Python tool with a graphical user interface (GUI) designed to record, process, and visualize multiple voices in real time. The GUI displays vocal spectra and estimated F0 trajectories for all singers, as well as the harmonic intervals between them. Additionally, users can adjust visual and algorithmic parameters interactively to accommodate different input devices, microphone signals, singing styles and use cases. Written in Python, `pytch` utilizes the `libf0-realtime` library [@MeierSM25_RealTimeF0_ISMIR] for real-time F0 estimation and `pyqtgraph`[^1] for efficient visualizations of the analysis results.
Our tool builds upon a late-breaking demo presented in [@KriegerowskiS_Pytch_2017], which we refer to as version 1. Since then, the tool has been significantly extended with a new real-time graphics engine, a modular audio processing backend that facilitates the integration of additional algorithms, and improved support for a wider range of platforms and recording hardware; we refer to this extended tool as version 2. Over its seven years of development, `pytch` has been tested and refined through use in several rehearsals, workshops, and field studies, including Sardinian quartet singing (see demo video[^2]) and traditional Georgian singing (see demo video[^3]).
# Multitrack Singing Recordings
To fully leverage the capabilities of `pytch`, it is essential to record each singer with an individual microphone. Stereo recordings, such as those captured by a room microphone placed in front of the ensemble, often suffer from overlapping signals, making it difficult to analyze individual voices. While there is no hard limit on the number of channels, we recommend recording up to four individual singers to ensure the visibility of the charts and the responsiveness of the GUI. Suitable multitrack recordings can be obtained using handheld dynamic microphones or headset microphones. However, these setups are prone to cross-talk, especially when singers are positioned close together.
One way to reduce cross-talk is to increase the physical distance between singers or to record them in isolation. However, this is not always feasible, as singers need to hear one another to maintain accurate tuning. An effective workaround is the use of contact microphones, such as throat microphones, which capture vocal fold vibrations directly from the skin of the throat. This method offers a significant advantage: the recorded signals are largely immune to interference from other singers, resulting in much cleaner, more isolated recordings. Throat microphones have been successfully used to record vocal ensembles in several past studies [@Scherbaum16_LarynxMicrophones_IWFMA].
# Audio Processing
The real-time audio processing pipeline implemented in the file `audio.py` is the heart of `pytch` and consists of two main stages: recording and analysis. The recording stage captures multichannel audio waveforms from the soundcard or an external audio interface using the `sounddevice` library. The library is based on PortAudio and supports a wide range of operating systems, audio devices, and sampling rates. The recorded audio is received in chunks via a recording callback and fed into a ring buffer shared with the analysis process. When the buffer is sufficiently filled with audio chunks, the analysis process reads the recorded audio to compute several audio features.
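To illustrate the interplay of the recording callback and the ring buffer, the following minimal sketch shows how such a recording stage could be built with `sounddevice`. It is not taken from `pytch`'s sources; the channel count, block size, and buffer length are arbitrary example values.

```python
import numpy as np
import sounddevice as sd

CHANNELS = 2                    # e.g., two singers, one microphone each
SAMPLE_RATE = 44100
CHUNK = 512                     # block size delivered to the callback
BUFFER_LEN = SAMPLE_RATE * 10   # ring buffer holds 10 s of audio

ring = np.zeros((BUFFER_LEN, CHANNELS), dtype=np.float32)
write_pos = 0

def callback(indata, frames, time, status):
    """Recording callback: copy each incoming chunk into the ring buffer."""
    global write_pos
    if status:
        print(status)
    end = write_pos + frames
    if end <= BUFFER_LEN:
        ring[write_pos:end] = indata
    else:  # wrap around at the end of the buffer
        first = BUFFER_LEN - write_pos
        ring[write_pos:] = indata[:first]
        ring[:frames - first] = indata[first:]
    write_pos = end % BUFFER_LEN

with sd.InputStream(samplerate=SAMPLE_RATE, channels=CHANNELS,
                    blocksize=CHUNK, callback=callback):
    sd.sleep(10_000)  # record for 10 s while the analysis runs elsewhere
```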
For each channel, the analysis stage computes the audio level in dBFS, a time-frequency representation of the audio signal via the Short-Time Fourier Transform (see [@Mueller21_FMP_SPRINGER] for fundamentals of music processing), and an estimate of the F0 along with a confidence value using the `libf0-realtime` library [@MeierSM25_RealTimeF0_ISMIR]. The library includes several real-time implementations of well-known F0 estimation algorithms, such as YIN [@CheveigneK02_YIN_JASA] and SWIPE [@CamachoH08_SawtoothWaveform_JASA]. YIN is a time-domain algorithm that computes the F0 based on a modified autocorrelation function. It is computationally efficient and well-suited for low-latency applications, but it tends to suffer from estimation errors, particularly confusions with higher harmonics such as the octave. In contrast, SWIPE is a frequency-domain algorithm that estimates the F0 by matching different spectral representations of the audio with sawtooth-like kernels. While more computationally demanding, SWIPE typically yields more reliable estimates, in particular for vocal input signals. `pytch` allows users to choose between these algorithms depending on their specific needs and system capabilities. The obtained F0 estimates, which are natively computed in Hz, are converted to cents using a user-specified reference frequency. Depending on the audio quality and vocal characteristics, F0 estimates may exhibit artifacts such as discontinuities or pitch slides, which can make the resulting trajectories difficult to interpret [@RosenzweigSM19_StableF0_ISMIR]. Previous research has shown that using throat microphones can improve the isolation of individual voices in group singing contexts, resulting in cleaner signals and more accurate F0 estimates [@Scherbaum16_LarynxMicrophones_IWFMA]. To further enhance interpretability, `pytch` includes several optional post-processing steps: a confidence threshold to discard estimates with a low confidence score, a median filter to smooth the trajectories, and a gradient filter to suppress abrupt pitch slides. As a final step of the audio analysis, the harmonic intervals between the F0 trajectories are computed. Every audio feature is stored separately in a dedicated ring buffer. After processing, the pipeline sets a flag that notifies the GUI that new data is ready for visualization.
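The following sketch illustrates these analysis and post-processing steps with NumPy and SciPy. It is not `pytch`'s actual implementation: the `libf0-realtime` API is omitted, the F0 and confidence arrays are assumed to be given, and all parameter values (reference frequency, thresholds, filter lengths) are illustrative.

```python
import numpy as np
from scipy.signal import medfilt

def level_dbfs(chunk):
    """Audio level in dBFS for a float32 chunk scaled to [-1, 1]."""
    rms = np.sqrt(np.mean(chunk ** 2))
    return 20.0 * np.log10(max(rms, 1e-12))

def postprocess_f0(f0_hz, confidence, ref_hz=220.0, conf_thresh=0.5,
                   median_len=5, max_cents_per_frame=50.0):
    """Convert an F0 trajectory to cents and clean it up (illustrative)."""
    f0_hz = np.maximum(np.asarray(f0_hz, dtype=float), 1e-6)  # avoid log2(0)
    # Hz -> cents, relative to a user-specified reference frequency.
    cents = 1200.0 * np.log2(f0_hz / ref_hz)
    # Confidence threshold: discard estimates with a low confidence score.
    cents[np.asarray(confidence) < conf_thresh] = np.nan
    # Median filter to smooth the trajectory.
    valid = ~np.isnan(cents)
    cents[valid] = medfilt(cents[valid], kernel_size=median_len)
    # Gradient filter: suppress abrupt pitch slides between frames.
    jumps = np.abs(np.diff(cents, prepend=cents[0])) > max_cents_per_frame
    cents[jumps] = np.nan
    return cents

# The harmonic interval between two voices is then simply the difference
# of their trajectories in cents:
# interval = postprocess_f0(f0_a, conf_a) - postprocess_f0(f0_b, conf_b)
```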
# Graphical User Interface (GUI)
In this section, we provide a step-by-step explanation of the `pytch` GUI implemented in the file `gui.py`. Right after program start, a startup menu opens in which the user is asked to specify the soundcard, input channels, sampling rate, and window size for processing. Furthermore, the user can choose to store the recorded audio and the F0 trajectories on disk. These configuration choices are required to initialize the audio processing module and the main GUI, which is loaded when the user clicks "ok". A screenshot of the main GUI, which opens after successful initialization, is shown in \autoref{fig:GUI}.
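As an illustration of the device query behind such a startup menu, the `sounddevice` library can enumerate all available soundcards. The following snippet is a sketch of this step, not `pytch`'s actual menu code.

```python
import sounddevice as sd

# List all devices that offer at least one input channel, similar to
# what a soundcard selection dialog needs to display.
for idx, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:
        print(f"[{idx}] {dev['name']}: {dev['max_input_channels']} in, "
              f"{dev['default_samplerate']:.0f} Hz")
```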
![Screenshot of the main `pytch` GUI.\label{fig:GUI}]{ width=90% }