Speaker diarization is the problem of determining who spoke when from audio recordings when the number and identities of the speakers are unknown. Motivated by applications in automatic speech recognition and audio indexing, speaker diarization has been studied extensively over the past decade, and there is now a wide variety of approaches, including both top-down and bottom-up unsupervised clustering systems. The contributions of this talk are to provide a unified analysis of the current state of the art, to understand where and why mistakes occur, and to identify directions for improvement.
In the first part of the talk, we analyze the behavior of six state-of-the-art diarization systems, all evaluated on the 2009 National Institute of Standards and Technology (NIST) Rich Transcription evaluation dataset. While performance is typically assessed in terms of a single number, the diarization error rate (DER), we characterize the errors based on the durations of speech segments and their proximity to speaker changepoints. For all of the systems, performance degrades both as the segment duration decreases and as the proximity to the speaker changepoint increases. We show that although short segments are problematic, their overall impact on the DER is small. By contrast, the amount of time near speaker changepoints is relatively high, and thus poor performance near these changepoints contributes significantly to the DER. For example, for every evaluated system, over 33% of the DER in the single distant microphone (SDM) condition and over 40% of the DER in the multiple distant microphone (MDM) condition occurs within 0.5 seconds of a changepoint.
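As a rough illustration of this kind of breakdown, the following sketch estimates what share of frame-level errors falls within a fixed window of a reference changepoint. It is a simplified assumption-laden example, not the NIST scoring procedure: the inputs are assumed to be per-frame labels with hypothesis speakers already mapped to reference speakers, any label mismatch counts as one error frame, and the names `changepoints`, `error_share_near_changepoints`, `frame_step`, and `window` are hypothetical.

```python
from bisect import bisect_left


def changepoints(ref):
    """Return frame indices i where the reference label differs from frame i - 1."""
    return [i for i in range(1, len(ref)) if ref[i] != ref[i - 1]]


def error_share_near_changepoints(ref, hyp, frame_step=0.1, window=0.5):
    """Fraction of mismatched frames that start within `window` seconds
    of the nearest reference changepoint.

    Assumes `ref` and `hyp` are equal-length sequences of per-frame labels
    and that hypothesis speakers were already mapped to reference speakers.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")

    cps = changepoints(ref)  # sorted by construction
    near = 0
    total = 0
    for i, (r, h) in enumerate(zip(ref, hyp)):
        if r == h:
            continue  # frame is correct, nothing to count
        total += 1
        # Only the changepoints just before and just after frame i can be nearest.
        k = bisect_left(cps, i)
        candidates = cps[max(k - 1, 0):k + 1]
        if candidates:
            distance = min(abs(i - c) for c in candidates) * frame_step
            if distance <= window:
                near += 1
    return near / total if total else 0.0


# Toy example: one changepoint at frame 10, errors at frames 0, 8, and 9.
ref = ["A"] * 10 + ["B"] * 10
hyp = ["B"] + ["A"] * 7 + ["B"] * 12
share = error_share_near_changepoints(ref, hyp, frame_step=0.1, window=0.5)
```

In the toy example, two of the three error frames lie within 0.5 seconds of the changepoint. A real evaluation would also separate missed speech, false alarms, and speaker confusion, which this sketch lumps together.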
In the next part of the talk, we focus on the International Computer Science Institute (ICSI) speaker diarization system and explore the effects of various system modifications. This system comprises several steps, including speech activity detection, initialization, speaker segmentation, and speaker clustering. Inspired by our previous analysis, we focus on modifications that improve performance near speaker changepoints. We first implement an alternative to the minimum duration constraint, which sets the shortest amount of speech time before a speaker change can occur. This modification reduces the errors near speaker changepoints for the MDM condition. Next, we show that the difference between the largest and second-largest log-likelihood scores separates the correct segments from the incorrect ones, which has the potential to be useful for cluster purification.
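The log-likelihood margin mentioned above can be sketched as follows. This is a minimal illustration, assuming each segment comes with one log-likelihood score per candidate speaker cluster. The function names and the thresholding step are hypothetical and are not taken from the ICSI system.

```python
def loglik_margin(scores):
    """Difference between the largest and second-largest log-likelihood scores."""
    if len(scores) < 2:
        raise ValueError("need scores for at least two clusters")
    top, second = sorted(scores, reverse=True)[:2]
    return top - second


def flag_low_margin(segment_scores, threshold):
    """Return indices of segments whose margin falls below `threshold`.

    `threshold` is an illustrative tuning parameter: a small margin means the
    best and runner-up clusters explain the segment almost equally well.
    """
    return [
        i for i, scores in enumerate(segment_scores)
        if loglik_margin(scores) < threshold
    ]


# Toy scores for two segments over three clusters (values are made up).
segment_scores = [
    [-10.0, -12.5, -30.0],  # clear winner, margin 2.5
    [-10.0, -10.2, -25.0],  # ambiguous, margin about 0.2
]
suspect = flag_low_margin(segment_scores, threshold=1.0)
```

Segments flagged this way would be candidates for reassignment or removal during cluster purification. Whether a fixed threshold is the right decision rule is left open here.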
Lastly, we explore the potential of applying speaker diarization methodologies to other applications. Specifically, we investigate the use of a diarization-based algorithm for the duplication detection problem, where the goal is to detect whether a query is a copy of a reference recording. With minimal modifications to the ICSI diarization system, we are able to obtain better-than-random performance. However, our approach is not competitive with existing approaches designed specifically for duplication detection, and the extent to which diarization-based approaches are useful for this application remains an open question.