Web Harvest of Minimal Intonational Pairs
Howell, Jonathan; Rooth, Mats
This paper describes experiments on gathering spoken-language data on the web that bears on issues of the phonetics-phonology and semantics-pragmatics of intonation. The target data are tokens of fixed word strings like "than I did", where intonation varies in a way which correlates with grammatical and pragmatic context. In a web harvest procedure, audio files were identified using a search engine based in speech-to-text, downloaded, and cut to a relevant segment under program control. In an application of such a database, an SVM classifier was trained to make a grammatically determined distinction in intonation based on purely acoustic cues. Sources of error in the retrieval are quantified.
Preliminary version of paper to be presented at Web as Corpus 5, September 2009. Final version will be substituted on July 17, 2009.
intonation; focus; web as corpus; machine learning; prosody; comparatives; speech recognition; linguistics