Corpus Contents
The Nijmegen Corpus of Casual Czech was created to provide large
amounts of high-quality recordings of casual speech suitable for
phonetic analysis. The corpus is distinctive in the following
respects:
- It contains around 30 hours of orthographically transcribed casual conversations (361,977 word tokens), elicited following a thoroughly tested procedure.
- It contains high-quality recordings captured with head-mounted microphones in a sound-attenuated room.
- It contains speech from 60 speakers (30 female and 30 male) of the same age and sharing the same geographic background. This allows researchers to study inter-speaker variation.
- It contains large amounts of data for every speaker (around 90 minutes of recorded conversation for every group of three speakers). This allows researchers to study within-speaker variability.
- It contains audio as well as video data, which can be used to study facial and body gestures during verbal communication.
The following screenshot illustrates a short excerpt from one of the
conversations in the corpus (click on image for audio):