Nijmegen Corpus of Casual Spanish

Corpus transcription

The corpus was orthographically transcribed in Barcelona by Verbio Speech Technologies S.L. using Transcriber software. The transcription process consisted of three passes. In the first pass, the speech of every pair of speakers was orthographically transcribed in a two-tier annotation file using stereo-channel audio streams. Confederates, who had been recorded in a separate mono channel, were transcribed separately in a one-tier annotation file. The transcribed text is organized into chunks, each corresponding to no more than 15 seconds of the speech signal. In the second pass, non-speech events (e.g. laughter, filled pauses) were added to the orthographic transcription, the location of chunk boundaries was readjusted, and the spelling of the transcription was checked using the Diccionario de la Real Academia Española as a reference. In the third pass, an automatic revision of the formatting of the transcription files was performed. Every pass was carried out by a different transcriber.
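
The 15-second limit on chunk length can be checked mechanically. The following is a minimal sketch in Python, assuming each chunk has already been read into a (start, end) pair of times in seconds. That representation, the function name find_overlong_chunks and the example values are hypothetical and are not part of the corpus distribution or of the Transcriber software.

```python
from typing import List, Tuple

# Hypothetical representation of a chunk: (start_seconds, end_seconds).
Chunk = Tuple[float, float]

# Limit stated in the corpus description.
MAX_CHUNK_SECONDS = 15.0


def find_overlong_chunks(
    chunks: List[Chunk], max_seconds: float = MAX_CHUNK_SECONDS
) -> List[Chunk]:
    """Return the chunks whose duration exceeds max_seconds."""
    return [(start, end) for start, end in chunks if end - start > max_seconds]


# Made-up example values, for illustration only.
example = [(0.0, 9.5), (9.5, 24.0), (24.0, 41.5)]
overlong = find_overlong_chunks(example)
print(overlong)
```

In this made-up example only the last chunk, which lasts 17.5 seconds, is reported.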

The orthographic transcription of the corpus contains around 393 000 word tokens and 16 500 word types (distinct orthographic forms) distributed over 98 000 chunks. Part 1 contains around 83 000 word tokens, while Parts 2 and 3 each contain around 155 000 word tokens.
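
As a quick consistency check on these figures, the short Python sketch below sums the per-part token counts and derives two summary values: the mean number of tokens per chunk and the type/token ratio. All inputs are the rounded figures quoted above, so the derived values are approximate. The variable names are illustrative only.

```python
# Rounded figures taken from the corpus description above.
part_tokens = {"Part 1": 83_000, "Part 2": 155_000, "Part 3": 155_000}
word_types = 16_500
chunks = 98_000

# The three parts should add up to the reported corpus total.
total_tokens = sum(part_tokens.values())

# Derived, approximate summary values.
tokens_per_chunk = total_tokens / chunks
type_token_ratio = word_types / total_tokens

print(f"total tokens:     {total_tokens}")
print(f"tokens per chunk: {tokens_per_chunk:.2f}")
print(f"type/token ratio: {type_token_ratio:.3f}")
```

The per-part counts add up to the reported total of 393 000 tokens, which corresponds to roughly four word tokens per chunk.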