A web application for filtering and annotating web speech data
A vast and growing amount of recorded speech is freely available on the web, including podcasts, radio broadcasts, and posts on media-sharing sites. However, finding specific words or phrases in online speech data remains a challenge for researchers, not least because transcripts of this data are often automatically generated and imperfect. We have developed a web application, “ezra”, that addresses this challenge by allowing non-expert and potentially remote annotators to filter and annotate speech data collected from the web, producing large, high-quality data sets suitable for speech research. We have used this application to filter and annotate thousands of speech tokens. Ezra is freely available on GitHub, and development continues.
NSF 1035151 RAPID: Harvesting Speech Datasets for Linguistic Research on the Web (Digging into Data Challenge)
Special Interest Group of the Association for Computational Linguistics on Web as Corpus (ACL SIGWAC)
corpus; speech; web interface; annotation; filtering; prosody
Previously Published As
Stefan Evert, Egon Stemle and Paul Rayson (editors). Proceedings of the 8th Web as Corpus Workshop. July, 2013.