IITG-MV Phase III


In Phase III, recordings were made over the telephone network, with possible remote person authentication through speech in mind. Unlike in Phase I and Phase II, in Phase III a facilitator connected a call between two persons in their free time using conference call mode. The Phase-III database contains four kinds of variability, listed below.

  • Multi-environment: Speech data were recorded during conversations in a wide range of practical environments, such as coffee shops, workplaces, rooms, and laboratories.
  • Multi-sensor: Speech data were recorded over different mobile handsets at a sampling frequency of 8 kHz.
  • Multi-lingual: Every speaker spoke either in English or in his/her mother tongue (favorite language).
  • Conversation style: Every speaker spoke in conversational style over a conference call.

Speech data were collected from conversations between two speakers whom a facilitator connected in conference call mode. The subjects were requested to engage in a natural conversation with any of their friends or relatives. Over the whole conversation they mainly used two languages: their mother tongue and English. The subjects conversed predominantly in their mother tongue (favorite language), but switched to English from time to time. During the recordings, subjects were in various degraded environmental conditions that included background noise and reverberation. To give the corpus a practical dimension, we chose to record through the subjects' own personal mobile handsets. In this way we intended to cover the different types of mobile sensors used by the public. Two sessions were recorded for each speaker, in which they conversed with the same person but on different topics.

Figure-1 shows how the recording was done. The subjects were requested to give an appointment, that is, a stipulated time at which they and their friend would be free to talk to each other. At the specified time, the facilitator called the subject from his mobile handset. The subject, in turn, gave the mobile number of the person he or she wanted to talk to. The facilitator then put the subject on hold and called the other person. The facilitator connected the two persons to each other through a conference call. There was no precondition on the place of recording or the language of conversation. Subjects conversed freely, and both the language and the topic of discussion changed frequently over the course of the call. The subjects talked to each other for around fifteen minutes while the call was recorded on the facilitator's handset (in Guwahati). Session two was recorded in the same fashion for the same subject pair. On average, the time gap between the two sessions was one week. In this way, data were recorded from 100 speaker pairs, giving 100 conversations per session. Also, with a research perspective in mind, in almost 50 percent (47 out of 100) of the conversations, one of the subjects is common to both of the previously collected Phase-I and Phase-II databases. Please refer to the IITG DIT MV database documentation for more details.


Figure-1: Conference Call scenario for Phase III data collection
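
The session layout described above can be summarised in a small manifest. The following is only a minimal sketch in Python, not the actual organisation of the database: the class name, field names, and pair identifiers are hypothetical, and because the text does not say which 47 pairs include a Phase-I/Phase-II subject, the first 47 identifiers are used purely as placeholders. Only the counts (100 pairs, two sessions, 8 kHz, 47 overlapping pairs) come from the description.

```python
from dataclasses import dataclass

NUM_PAIRS = 100          # speaker pairs reported in the text
SESSIONS_PER_PAIR = 2    # each pair was recorded in two sessions
SAMPLE_RATE_HZ = 8000    # sampling frequency reported in the text
OVERLAP_PAIRS = 47       # pairs with a subject also in Phase I and Phase II


@dataclass(frozen=True)
class Conversation:
    pair_id: int              # hypothetical identifier, 1..NUM_PAIRS
    session: int              # 1 or 2
    sample_rate_hz: int
    in_earlier_phases: bool   # True if one subject is also in Phase I and II


def build_manifest(overlap_pair_ids):
    """Return one Conversation entry per (pair, session) combination."""
    overlap = set(overlap_pair_ids)
    return [
        Conversation(pair, session, SAMPLE_RATE_HZ, pair in overlap)
        for pair in range(1, NUM_PAIRS + 1)
        for session in range(1, SESSIONS_PER_PAIR + 1)
    ]


# Placeholder assumption: the overlapping pairs are not identified in the
# text, so the first 47 ids are used for illustration only.
manifest = build_manifest(range(1, OVERLAP_PAIRS + 1))

# Count the conversations in each session.
per_session = {s: sum(c.session == s for c in manifest) for s in (1, 2)}
print(len(manifest), per_session)
```

Keeping the session number as an explicit field makes it straightforward to separate the two sessions of a pair, for example when one session is used for enrolment and the other for testing.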