ComputeTimeStampErrorJob: allow multiple caches as input, add output file with individual TSEs per segment pair (#605)
* `ComputeTimeStampErrorJob`: also accept list for ref/hyp caches
* Modify code to adapt to list format
* Add output of sorted highest TSE differences for analyzed segments
* Update imports
* Uniformize output
Sad but needed
* Remove "highest" prefix from output file
* Rename self parameter to plural
Technically it's a list of alignment caches
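The two behavioural changes listed above (accepting either a single cache or a list of caches, and writing per-segment TSE values in sorted order) can be sketched roughly as follows. This is only an illustration under assumptions: `as_list` and `sort_segments_by_tse` are hypothetical helper names, not functions from the actual job, and the real output format may differ.

```python
from typing import Dict, List, Sequence, Tuple, TypeVar, Union

T = TypeVar("T")


def as_list(value: Union[T, Sequence[T]]) -> List[T]:
    """Hypothetical helper: wrap a single cache in a list, copy a list/tuple as-is."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def sort_segments_by_tse(tse_per_segment: Dict[str, float]) -> List[Tuple[str, float]]:
    """Hypothetical helper: order segments from highest to lowest TSE."""
    return sorted(tse_per_segment.items(), key=lambda item: item[1], reverse=True)


# A single cache path and a list of cache paths are handled the same way afterwards.
single = as_list("alignment.cache.1")
several = as_list(["alignment.cache.1", "alignment.cache.2"])

# Segment names and values here are made up for illustration.
ranked = sort_segments_by_tse({"seg-a": 1.0, "seg-b": 3.5, "seg-c": 2.0})
```

Normalizing the input once at construction time keeps the rest of the code free of `isinstance` checks, which is presumably why the parameter was renamed to the plural form.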
f"Found different number of word starts ({len(hyp_word_starts)}) "
-     f"than word ends ({len(hyp_word_ends)}) in reference. Something seems to be broken."
- )
+ ref_allophone_map = ref_alignments.allophones

- if len(hyp_word_starts) != len(ref_word_starts):
-     logging.warning(
-         f"Sequence {hyp_seq_tag} ({idx} / {len(file_list)}:\n Discarded because the number of words in alignment ({len(hyp_word_starts)}) does not equal the number of words in reference ({len(ref_word_starts)})."
# Sometimes different feature extraction or subsampling may produce mismatched lengths that are different by a few frames, so cut off at the shorter length
f"Sequence {hyp_seq_tag} ({idx} / {len(file_list)}):\n Word start distances are {seq_word_start_diffs}\n Word end distances are {seq_word_end_diffs}\n Sequence TSE is {seq_tse} frames"
f"Found different number of word starts ({len(hyp_word_starts)}) "
+     f"than word ends ({len(hyp_word_ends)}) in reference. Something seems to be broken."
  )
- discarded_seqs += 1
- continue
+
+ if len(hyp_word_starts) != len(ref_word_starts):
+     logging.warning(
+         f"Sequence {hyp_seq_tag} ({idx} / {len(file_list)}:\n Discarded because the number of words in alignment ({len(hyp_word_starts)}) does not equal the number of words in reference ({len(ref_word_starts)})."
+     )
+     discarded_seqs += 1
+     continue
+
+ # Sometimes different feature extraction or subsampling may produce mismatched lengths that are different by a few frames, so cut off at the shorter length
f"Sequence {hyp_seq_tag} ({idx} / {len(file_list)}):\n Word start distances are {seq_word_start_diffs}\n Word end distances are {seq_word_end_diffs}\n Sequence TSE is {seq_tse} frames"
+     )
+     counted_seqs += 1
+ else:
+     logging.warning(
+         f"Sequence {hyp_seq_tag} ({idx} / {len(file_list)}):\n Discarded since all distances are over the upper limit"
+     )
+     discarded_seqs += 1
+     continue

  logging.info(
      f"Processing finished. Computed TSE value based on {counted_seqs} sequences; {discarded_seqs} sequences were discarded."
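For orientation, the per-sequence logic touched by this diff (discard on mismatched word counts, discard when every distance exceeds the upper limit, otherwise count the sequence) might look roughly like the sketch below. It is not the job's actual code: the function name `sequence_tse`, the `upper_limit` parameter, and the choice of the mean absolute boundary distance as the TSE definition are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Optional


def sequence_tse(
    hyp_word_starts: List[int],
    hyp_word_ends: List[int],
    ref_word_starts: List[int],
    ref_word_ends: List[int],
    upper_limit: Optional[int] = None,  # assumed parameter, in frames
) -> Optional[float]:
    """Return an assumed TSE (mean absolute boundary distance in frames),
    or None if the sequence would be discarded."""
    # Discard if alignment and reference disagree on the number of words.
    if len(hyp_word_starts) != len(ref_word_starts):
        return None
    if len(hyp_word_ends) != len(ref_word_ends):
        return None

    start_diffs = [abs(h - r) for h, r in zip(hyp_word_starts, ref_word_starts)]
    end_diffs = [abs(h - r) for h, r in zip(hyp_word_ends, ref_word_ends)]
    diffs = start_diffs + end_diffs

    # Drop distances above the upper limit, if one is given.
    if upper_limit is not None:
        diffs = [d for d in diffs if d <= upper_limit]

    # Discard if nothing is left (all distances were over the limit).
    if not diffs:
        return None
    return mean(diffs)


# Made-up frame indices for two words.
tse = sequence_tse([0, 10], [5, 20], [1, 12], [5, 19])
```

Returning `None` for discarded sequences lets the caller keep the `counted_seqs` and `discarded_seqs` tallies shown in the final log message.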