IITG-HingCoS: A Hindi-English Code-Switching Corpus

Description:

The Hindi-English (Hinglish) code-switching database was created at the Electro-Medical and Speech Technology (EMST) Laboratory, Indian Institute of Technology Guwahati (IITG). The corpus is designed primarily for Hinglish code-switching acoustic and language modeling in the context of automatic speech recognition (ASR). It comprises Hinglish code-switching text data of 25,988 sentences totaling 0.58 million words. In addition, the corpus contains 25 hours of matching speech data corresponding to 9,251 code-switching sentences and covering a vocabulary of 6,542 words. The speech was recorded in realistic environments over landline and mobile phones.
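
To give a rough sense of scale, the figures above can be turned into per-sentence averages. The sketch below is only back-of-the-envelope arithmetic on the reported numbers. It assumes the 0.58 million word count is rounded and that the 25 hours are spread evenly over the 9,251 spoken sentences. The function and constant names are illustrative and not part of the corpus distribution.

```python
# Figures as reported in the corpus description (the word count is rounded).
TEXT_SENTENCES = 25_988
TEXT_WORDS = 0.58e6          # "0.58 million words"
SPEECH_HOURS = 25
SPEECH_SENTENCES = 9_251


def corpus_averages():
    """Return (words per text sentence, seconds of audio per spoken sentence)."""
    words_per_sentence = TEXT_WORDS / TEXT_SENTENCES
    seconds_per_utterance = SPEECH_HOURS * 3600 / SPEECH_SENTENCES
    return words_per_sentence, seconds_per_utterance


if __name__ == "__main__":
    wps, spu = corpus_averages()
    print(f"~{wps:.1f} words per sentence, ~{spu:.1f} s per utterance")
```

These are derived estimates only: the recordings may include silence or repeated readings, so the actual per-utterance duration in the released data may differ.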

Example Hinglish sentences along with their English translations: Click here

Example Hinglish acoustic data:

Citation Details:

Sreeram Ganji, Kunal Dhawan, and Rohit Sinha, "IITG-HingCoS corpus: A Hinglish code-switching database for automatic speech recognition," Journal of Speech Communication, vol. 110, pp. 76-89, 2019.

Request for Database: Click here