Nijmegen Corpus of Casual French

Corpus transcription

The corpus was orthographically annotated by two professional transcribers using the Transcriber software. The audio stream was manually segmented, separately for each speaker, into short chunks of a few seconds. The transcription guidelines stated that each chunk should contain speech with a clear degree of syntactic and semantic coherence and no long stretches of silence. When possible, boundaries between chunks were placed during pauses. Each speaker was transcribed in a separate annotation file. Transcribers were asked to restore common elisions to their full orthographic forms. For instance, expressions characteristic of casual speech such as y a 'there is' and j'sais pas 'I don't know' were transcribed as il y a and je sais pas, respectively. Filled pauses were marked in the text with specific orthographic forms (e.g. euh, hum). Breathing, laughter and mouth noises were also indicated in the transcriptions.
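The elision-restoration convention can be illustrated with a minimal Python sketch. This is not the procedure used for the corpus, where restoration was done by hand by the transcribers. The rule table and the function name restore_elisions are hypothetical, and the rules cover only the two examples given above. Note that the sketch returns the restored form in lower case.

```python
import re

# Hypothetical rule table built only from the two examples in the text;
# the actual transcription guidelines are not reproduced here.
RULES = [
    # j'sais pas -> je sais pas
    (re.compile(r"(?<!\w)j'sais pas(?!\w)", re.IGNORECASE), "je sais pas"),
    # y a -> il y a, unless the full form "il y a" is already present
    (re.compile(r"(?<!\w)(?<!il )y a(?!\w)", re.IGNORECASE), "il y a"),
]


def restore_elisions(text: str) -> str:
    """Replace reduced casual-speech forms with their full orthographic forms."""
    for pattern, full_form in RULES:
        text = pattern.sub(full_form, text)
    return text


print(restore_elisions("y a un problème"))
print(restore_elisions("J'sais pas"))
```

The negative lookbehind on "il " keeps the second rule from touching text that already contains the full form, so applying the function twice gives the same result as applying it once.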