Combining cues to recognize speech

Posted on 1 Nov 2012 by Joe Toscano

Acoustic measurements of phonetic cues to word-medial voicing

Toscano, J. C., & McMurray, B. (2012, November). Poster presented at the 53rd Annual Meeting of the Psychonomic Society, Minneapolis, MN.

Abstract: A great deal of work in speech has argued that invariant acoustic cues do not exist, leading many researchers to conclude that listeners use specialized representations, such as talkers’ inferred gestures, instead. Other work has emphasized that many phonological distinctions are signaled by multiple cues; Lisker (1986, Language and Speech, 29, 3-11), for example, lists 16 cues to voicing. Yet, few studies have measured the reliability of multiple cues and asked whether combining them may provide a solution to the lack of invariance. Here, we present measurements of 12 potential cues to the voiced/voiceless distinction in stops (including many of those reported by Lisker, 1986) and use a statistical modeling approach to determine which ones distinguish the two categories and how reliable those cues are. We recorded two-syllable non-words (/sVCVs/) with one of six consonants (/b,p,d,t,g,k/) and one of four vowels (/a,æ,i,u/). We found that talkers used multiple cues, but that cues varied in their usefulness. In addition, a classifier trained on the cues was able to accurately identify voicing categories. We argue that by harnessing information from multiple cues, listeners can overcome ambiguity in individual cues in specific utterances, allowing them to recognize speech across talkers and phonological contexts.

Download poster:

Tagged with: categorization, computational modeling, cue integration, cue weighting, speech perception, statistical modeling, voicing
Posted in Presentations