Commit 32c11db (1 parent: 3cf46bf)
Update README.

1 file changed: README.md (77 additions, 33 deletions)
@@ -3,20 +3,23 @@ I. Overview
 This suite supports evaluation of diarization system output relative
 to a reference diarization subject to the following conditions:
 
-- both the reference and system diarizations are saved within [Rich Transcription Time Marked (RTTM)](#rttm) files
+- both the reference and system diarizations are saved within [Rich
+  Transcription Time Marked (RTTM)](#rttm) files
 - for any pair of recordings, the sets of speakers are disjoint
 
 
 II. Dependencies
 ==========
 The following Python packages are required to run this software:
 
-- Python >= 2.7.1 (https://www.python.org/)
+- Python >= 2.7.1* (https://www.python.org/)
 - NumPy >= 1.6.1 (https://github.com/numpy/numpy)
-- SciPy >= 0.10.0 (https://github.com/scipy/scipy)
-- intervaltree >= 2.1.0 (https://pypi.python.org/pypi/intervaltree)
+- SciPy >= 0.17.0 (https://github.com/scipy/scipy)
+- intervaltree >= 3.0.0 (https://pypi.python.org/pypi/intervaltree)
 - tabulate >= 0.5.0 (https://pypi.python.org/pypi/tabulate)
 
+* Tested with Python 2.7.X, 3.6.X, and 3.7.X.
+
 
 III. Metrics
 ======
@@ -36,14 +39,52 @@ As with word error rate, a score of zero indicates perfect performance and
 higher scores (which may exceed 100) indicate poorer performance. For more
 details, consult section 6.1 of the [NIST RT-09 evaluation plan](https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf).
 
+Jaccard error rate
+------------------
+We also report Jaccard error rate (JER), a metric introduced for [DIHARD II](https://coml.lscp.ens.fr/dihard/index.html) that is based on the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index). The Jaccard index is a similarity
+measure typically used to evaluate the output of image segmentation systems and
+is defined as the ratio between the intersection and union of two segmentations.
+To compute Jaccard error rate, an optimal mapping between reference and system
+speakers is determined and for each pair the Jaccard index of their
+segmentations is computed. The Jaccard error rate is then 1 minus the average
+of these scores.
+
+More concretely, assume we have ``N`` reference speakers and ``M`` system
+speakers. An optimal mapping between speakers is determined using the
+Hungarian algorithm so that each reference speaker is paired with at most one
+system speaker and each system speaker with at most one reference speaker. Then,
+for each reference speaker ``ref`` the speaker-specific Jaccard error rate is
+``(FA + MISS)/TOTAL``, where:
+
+- ``TOTAL`` is the duration of the union of reference and system speaker
+  segments; if the reference speaker was not paired with a system speaker, it is
+  the duration of all reference speaker segments
+- ``FA`` is the total system speaker time not attributed to the reference
+  speaker; if the reference speaker was not paired with a system speaker, it is
+  0
+- ``MISS`` is the total reference speaker time not attributed to the system
+  speaker; if the reference speaker was not paired with a system speaker, it is
+  equal to ``TOTAL``
+
+The Jaccard error rate then is the average of the speaker-specific Jaccard
+error rates.
+
+JER and DER are highly correlated, with JER typically being higher, especially
+in recordings where one or more speakers is particularly dominant. Where it
+tends to diverge from DER is in outliers where the diarization is especially
+bad, resulting in one or more unmapped system speakers whose speech is not
+then penalized. In these cases, where DER can easily exceed 500%, JER will
+never exceed 100% and may be far lower if the reference speakers are handled
+correctly. For this reason, it may be useful to pair JER with another metric
+evaluating speech detection and/or speaker overlap detection.
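The JER computation added in this hunk can be sketched as follows. This is an illustrative toy, not the suite's implementation: speaker activity is approximated as sets of 10 ms frame indices, the speaker names and segments are invented, and the Hungarian algorithm comes from SciPy (a listed dependency).

```python
# Illustrative sketch of JER (not the suite's code): each speaker's
# activity is a set of 10 ms frame indices.
import numpy as np
from scipy.optimize import linear_sum_assignment

def jer(ref, sys):
    """Return JER in percent for dicts mapping speaker -> set of frames."""
    ref_names, sys_names = sorted(ref), sorted(sys)
    # Cost matrix of 1 - Jaccard index for every (reference, system) pair.
    cost = np.ones((len(ref_names), len(sys_names)))
    for i, r in enumerate(ref_names):
        for j, s in enumerate(sys_names):
            union = ref[r] | sys[s]
            if union:
                cost[i, j] = 1.0 - len(ref[r] & sys[s]) / len(union)
    # Hungarian algorithm: each reference speaker paired with at most one
    # system speaker and vice versa. Unpaired reference speakers score 1.0
    # (MISS == TOTAL, FA == 0).
    scores = dict.fromkeys(ref_names, 1.0)
    for i, j in zip(*linear_sum_assignment(cost)):
        scores[ref_names[i]] = cost[i, j]
    return 100.0 * sum(scores.values()) / len(scores)

# Toy recording: reference speakers A/B vs. system speakers s1/s2.
reference = {"A": set(range(0, 100)), "B": set(range(100, 150))}
system = {"s1": set(range(0, 90)), "s2": set(range(95, 150))}
print("%.2f" % jer(reference, system))  # 9.55
```

Note that SciPy >= 0.17.0, the floor raised by this commit, is the first release shipping ``linear_sum_assignment``.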

 Clustering metrics
 ---------------------------------
-An alternate approach to system evaluation is convert both the reference and
-system outputs to frame-level labels, then evaluate using one of many
-well-known approaches for evaluating clustering performance. Each recording
-is converted to a sequence of 10 ms frames, each of which is assigned a single
-label corresponding to one of the following cases:
+A third approach to system evaluation is to convert both the reference and
+system outputs to frame-level labels, then evaluate using one of many
+well-known approaches for evaluating clustering performance. Each recording is
+converted to a sequence of 10 ms frames, each of which is assigned a single
+label corresponding to one of the following cases:
 
 - the frame contains no speech
 - the frame contains speech from a single speaker (one label per speaker
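The frame labeling described in this hunk can be sketched as below. The helper and its label strings are hypothetical, not code from the suite; turns are (speaker, onset, offset) tuples in seconds, discretized to 10 ms frames.

```python
# Hypothetical sketch of the frame labeling described above; not the
# suite's implementation.
def frame_labels(turns, duration, step=0.010):
    n_frames = int(round(duration / step))
    active = [set() for _ in range(n_frames)]
    for speaker, onset, offset in turns:
        start = int(round(onset / step))
        end = min(int(round(offset / step)), n_frames)
        for i in range(start, end):
            active[i].add(speaker)
    labels = []
    for speakers in active:
        if not speakers:
            labels.append("NONSPEECH")            # frame contains no speech
        elif len(speakers) == 1:
            labels.append(min(speakers))          # one label per speaker
        else:
            labels.append("+".join(sorted(speakers)))  # one label per overlap set
    return labels

print(frame_labels([("A", 0.00, 0.03), ("B", 0.02, 0.05)], duration=0.06))
# ['A', 'A', 'A+B', 'B', 'B', 'NONSPEECH']
```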
@@ -56,7 +97,7 @@ These frame-level labelings are then scored with the following metrics:
 ### Goodman-Kruskal tau
 Goodman-Kruskal tau is an asymmetric association measure dating back to work
 by Leo Goodman and William Kruskal in the 1950s (Goodman and Kruskal, 1954).
-For a reference labeling ``ref`` and a system labeling ``ref``,
+For a reference labeling ``ref`` and a system labeling ``sys``,
 ``GKT(ref, sys)`` corresponds to the fraction of variability in ``sys`` that
 can be explained by ``ref``. Consequently, ``GKT(ref, sys)`` is 1 when ``ref``
 is perfectly predictive of ``sys`` and 0 when it is not predictive at all.
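For illustration, GKT can be computed from the contingency counts of two equal-length labelings using the standard proportional-reduction-in-Gini-variation formula. This sketch is not the suite's code:

```python
# Sketch of Goodman-Kruskal tau: the proportional reduction in Gini
# variation of sys once ref is known (illustration, not the suite's code).
from collections import Counter

def gk_tau(ref, sys):
    n = len(ref)
    joint = Counter(zip(ref, sys))
    ref_marg, sys_marg = Counter(ref), Counter(sys)
    # Gini variation of sys, and its expectation conditioned on ref.
    v_sys = 1.0 - sum((c / n) ** 2 for c in sys_marg.values())
    v_cond = 1.0 - sum(c ** 2 / (n * ref_marg[r]) for (r, _), c in joint.items())
    return (v_sys - v_cond) / v_sys

# ref perfectly predicts sys -> 1.0; ref independent of sys -> 0.0.
print(gk_tau(list("AABB"), list("xxyy")))  # 1.0
print(gk_tau(list("AABB"), list("xyxy")))  # 0.0
```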
@@ -113,7 +154,8 @@ files ``ref1.rttm``, ``ref2.rttm``, ...:
 which will calculate and report the following metrics both overall and on
 a per-file basis:
 
-- ``DER`` -- diarization error rate
+- ``DER`` -- diarization error rate (in percent)
+- ``JER`` -- Jaccard error rate (in percent)
 - ``B3-Precision`` -- B-cubed precision
 - ``B3-Recall`` -- B-cubed recall
 - ``B3-F1`` -- B-cubed F1
@@ -144,11 +186,10 @@ out and warn about any speaker turns not present in those files, and trim the
 remaining turns to the relevant scoring regions before computing the metrics
 as before.
 
-DER is scored using the NIST ``md-eval.pl`` tool with
-a default collar size of 0 ms and explicitly including regions that contain
-overlapping speech in the reference diarization. If desired, this behavior
-can be altered using the ``--collar`` and ``--ignore_overlaps`` flags. For
-instance
+DER is scored using the NIST ``md-eval.pl`` tool with a default collar size of
+0 ms and explicitly including regions that contain overlapping speech in the
+reference diarization. If desired, this behavior can be altered using the
+``--collar`` and ``--ignore_overlaps`` flags. For instance
 
     python score.py --collar 0.100 --ignore_overlaps -R ref.scp -S sys.scp

@@ -158,22 +199,24 @@ reference and system speaker turns **WITHOUT** any use of collars. The default
 frame step is 10 ms, which may be altered via the ``--step`` flag. For more
 details, consult the docstrings within the ``scorelib.metrics`` module.
 
-The overall and per-file results will be printed to STDOUT as a table; for instance
-
-    File                           DER  B3-Precision  B3-Recall  B3-F1  GKT(ref, sys)  GKT(sys, ref)  H(ref|sys)  H(sys|ref)    MI   NMI
-    ---------------------------  -----  ------------  ---------  -----  -------------  -------------  ----------  ----------  ----  ----
-    CMU_20020319-1400_d01_NONE    6.10          0.91       1.00   0.95           1.00           0.88        0.22        0.00  2.66  0.96
-    ICSI_20000807-1000_d05_NONE  17.37          0.72       1.00   0.84           1.00           0.68        0.65        0.00  2.79  0.90
-    ICSI_20011030-1030_d02_NONE  13.06          0.80       0.95   0.87           0.95           0.80        0.54        0.11  5.10  0.94
-    LDC_20011116-1400_d06_NONE    5.64          0.95       0.89   0.92           0.85           0.93        0.10        0.27  1.87  0.91
-    LDC_20011116-1500_d07_NONE    1.69          0.96       0.96   0.96           0.95           0.95        0.14        0.12  2.39  0.95
-    NIST_20020305-1007_d01_NONE  42.05          0.51       0.95   0.66           0.93           0.44        1.58        0.11  2.13  0.74
-    *** TOTAL ***                14.31          0.81       0.96   0.88           0.96           0.80        0.55        0.10  5.45  0.94
-
-Some basic control of the formatting of this table is possible via the ``--n_digits`` and
-``--table_format`` flags. The former controls the number of decimal places printed for floating
-point numbers, while the latter controls the table format. For a list of valid table formats plus example
-outputs, consult the [documentation](https://pypi.python.org/pypi/tabulate) for the ``tabulate`` package.
+The overall and per-file results will be printed to STDOUT as a table; for
+instance:
+
+    File                           DER    JER  B3-Precision  B3-Recall  B3-F1  GKT(ref, sys)  GKT(sys, ref)  H(ref|sys)  H(sys|ref)    MI   NMI
+    ---------------------------  -----  -----  ------------  ---------  -----  -------------  -------------  ----------  ----------  ----  ----
+    CMU_20020319-1400_d01_NONE    6.10  20.10          0.91       1.00   0.95           1.00           0.88        0.22        0.00  2.66  0.96
+    ICSI_20000807-1000_d05_NONE  17.37  21.92          0.72       1.00   0.84           1.00           0.68        0.65        0.00  2.79  0.90
+    ICSI_20011030-1030_d02_NONE  13.06  25.61          0.80       0.95   0.87           0.95           0.80        0.54        0.11  5.10  0.94
+    LDC_20011116-1400_d06_NONE    5.64  16.10          0.95       0.89   0.92           0.85           0.93        0.10        0.27  1.87  0.91
+    LDC_20011116-1500_d07_NONE    1.69   2.00          0.96       0.96   0.96           0.95           0.95        0.14        0.12  2.39  0.95
+    NIST_20020305-1007_d01_NONE  42.05  53.38          0.51       0.95   0.66           0.93           0.44        1.58        0.11  2.13  0.74
+    *** OVERALL ***              14.31  26.75          0.81       0.96   0.88           0.96           0.80        0.55        0.10  5.45  0.94
+
+Some basic control of the formatting of this table is possible via the
+``--n_digits`` and ``--table_format`` flags. The former controls the number of
+decimal places printed for floating point numbers, while the latter controls
+the table format. For a list of valid table formats plus example outputs,
+consult the [documentation](https://pypi.python.org/pypi/tabulate) for the ``tabulate`` package.
 
 For additional details consult the docstring of ``score.py``.

@@ -182,7 +225,8 @@ V. File formats
 ========
 RTTM
 -------
-Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields:
+Rich Transcription Time Marked (RTTM) files are space-delimited text files
+containing one turn per line, each line containing ten fields:
 
 - ``Type`` -- segment type; should always be ``SPEAKER``
 - ``File ID`` -- file name; basename of the recording minus extension (e.g.,
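A single turn can be pulled apart as below. The sample line is invented and the names for the fields past ``File ID`` (channel, onset, duration, the ``<NA>`` placeholders, speaker name) are taken from the standard RTTM layout rather than from this diff:

```python
# Parse one SPEAKER turn from an RTTM file. The sample line is invented;
# the ten space-delimited fields follow the standard RTTM layout.
line = "SPEAKER rec1 1 5.000 2.500 <NA> <NA> juliet <NA> <NA>"
(rec_type, file_id, channel, onset, dur,
 ortho, spkr_type, speaker, conf, lookahead) = line.split()
assert rec_type == "SPEAKER"  # segment type should always be SPEAKER
turn = (file_id, speaker, float(onset), float(onset) + float(dur))
print(turn)  # ('rec1', 'juliet', 5.0, 7.5)
```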
