Speech processing can be divided into two broad categories: speech recognition, which detects the content of speech audio, and speaker recognition, which identifies the speakers in a conversation. Speaker diarization falls into the second category: it requires identifying each speaker along with the boundaries (frames) of the speech spoken by that speaker.
Speaker diarization is the task of identifying the start and end times of each speaker's speech in an audio file, together with the identity of the speaker, i.e. "who spoke when". It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. Speaker diarization combines speaker segmentation and speaker clustering: segmentation aims to find the speaker change points in an audio stream, and clustering aims to group speech segments on the basis of speaker characteristics. Diarization has many applications, including speaker indexing, retrieval, speech recognition with speaker identification, and the diarization of meetings and lectures.
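The relationship between clustered segments and speaker turns can be illustrated with a minimal Python sketch. It assumes that segmentation and clustering have already produced segments of the form (start, end, cluster label); the function name merge_into_turns, the max_gap parameter and the labels such as "spk_A" are hypothetical and chosen only for illustration. The sketch merges consecutive segments that share a cluster label into a single turn, which gives a simple "who spoke when" listing.

```python
from typing import List, Tuple

# (start_sec, end_sec, cluster_label); the labels are anonymous cluster names,
# not true speaker identities.
Segment = Tuple[float, float, str]


def merge_into_turns(segments: List[Segment], max_gap: float = 0.0) -> List[Segment]:
    """Merge consecutive segments with the same cluster label into speaker turns.

    max_gap is the largest silence (in seconds) that may be bridged inside a turn.
    """
    turns: List[Segment] = []
    for start, end, label in sorted(segments):
        same_speaker = bool(turns) and turns[-1][2] == label
        if same_speaker and start - turns[-1][1] <= max_gap:
            # Extend the previous turn instead of opening a new one.
            prev_start, prev_end, _ = turns[-1]
            turns[-1] = (prev_start, max(prev_end, end), label)
        else:
            turns.append((start, end, label))
    return turns


# Hypothetical output of a segmentation + clustering stage.
segments = [
    (0.0, 1.5, "spk_A"),
    (1.5, 3.0, "spk_A"),
    (3.0, 4.2, "spk_B"),
    (4.2, 6.0, "spk_A"),
]
turns = merge_into_turns(segments)
for start, end, label in turns:
    print(f"{start:6.2f} - {end:6.2f}  {label}")
```

Here the first two segments collapse into one turn for "spk_A", followed by one turn for "spk_B" and a second turn for "spk_A". Mapping the cluster labels to real names would require a separate speaker recognition step.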
With the increasing number of broadcasts, meeting recordings and voicemails collected every year, speaker diarization has received much attention from the speech community.
Work in this field has been carried out by an M.Tech student, Aishwary Joshi.
Speaker diarization is the task of identifying the start and end times of each speaker's speech in an audio file, together with the identity of the speaker, i.e. “who spoke when”. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. Diarization has many applications, including speaker indexing, retrieval, speech recognition with speaker identification, and the diarization of meetings and lectures. Diarization systems are applied to various data domains, and the “who spoke when” problem is modeled differently depending on the domain. For example, in the broadcast data domain the problem is to uniquely identify speakers across all recordings of the same broadcast show, whereas in the meeting data domain the diarization system has to compensate for the session variabilities present within a single recording. This study attempts to improve the diarization error rate for the meeting data domain by using a combination of multiple features. The objective is to reduce or remove the effect of session variabilities, which are most commonly present in meeting recordings. Reducing session variabilities is expected to improve the performance of the diarization system.
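The diarization error rate mentioned above is commonly defined as the sum of missed speech, false alarm speech and speaker confusion, divided by the total amount of reference speech. The following Python sketch is a simplified frame-level version of that definition and is not the scoring procedure used in the study: it assumes one label per fixed-length frame, uses None for non-speech, ignores overlapping speech and any forgiveness collar, and finds the best one-to-one mapping between hypothesis and reference labels by brute force, which is only practical for a handful of speakers. The function name frame_der and the example labels are hypothetical.

```python
from itertools import permutations
from typing import List, Optional


def frame_der(reference: List[Optional[str]], hypothesis: List[Optional[str]]) -> float:
    """Frame-level DER = (missed + false alarm + confusion) / reference speech frames."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Frames where both reference and hypothesis contain speech.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})

    # Try every one-to-one mapping of hypothesis labels onto reference labels.
    # Padding with None lets a hypothesis label stay unmatched.
    padded = ref_speakers + [None] * len(hyp_speakers)
    best_correct = 0
    for assignment in permutations(padded, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical frame labels: 7 reference speech frames, with one missed frame,
# one false alarm frame and one confused frame in the hypothesis.
reference = ["A", "A", "A", "B", "B", None, "B", "B"]
hypothesis = ["x", "x", "y", "y", "y", "y", None, "y"]
der = frame_der(reference, hypothesis)
print(f"DER = {der:.3f}")
```

In this example the three error frames out of seven reference speech frames give a DER of 3/7. The hypothesis labels "x" and "y" do not need to match the reference names, because the score is computed under the best label mapping.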