Acquiring and adapting phonetic categories in a computational model of speech perception

Toscano, J. C. (2013, April). Invited paper presented at the Workshop on Current Issues and Methods in Speaker Adaptation, Ohio State University, Columbus, OH.


Recent work on perceptual adaptation has demonstrated that listeners can learn novel distributions of acoustic cues in unsupervised learning tasks with only a small amount of experience (Clayards, Tanenhaus, Aslin, & Jacobs, 2008, Cognition; Munson, 2011, dissertation). The learning problem faced by listeners in these tasks is similar to the one faced by infants acquiring the phonetic categories of their native language. In both cases, sounds are unlabeled and representations must be updated continuously as new input is received.

Can the same unsupervised learning algorithms that listeners use to acquire categories over development be used to adapt those categories in adulthood? Here, I present a computational model of speech perception (a Gaussian mixture model; McMurray, Aslin, & Toscano, 2009, Developmental Science) and simulations designed to address this question. The model represents phonetic categories as Gaussian distributions along acoustic cue dimensions, and it learns to map cues onto categories using a competitive statistical learning mechanism.
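As a rough illustration of the kind of model described above, the following sketch implements a one-dimensional mixture of Gaussians that is updated one token at a time by gradient ascent on the log-likelihood. It is not the implementation from McMurray, Aslin, and Toscano (2009). The class name, parameter names, learning-rate values, and the winner-take-all update of the mixing weights are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


class MixtureOfGaussians:
    """Hypothetical 1-D mixture model over a single cue dimension (e.g., VOT in ms)."""

    def __init__(self, mus, sigmas, phis, min_sigma=1.0):
        self.mus = list(mus)        # category means
        self.sigmas = list(sigmas)  # category standard deviations
        self.phis = list(phis)      # mixing weights (category priors)
        self.min_sigma = min_sigma  # floor that keeps SDs positive (assumption)

    def posteriors(self, x):
        """Posterior probability of each category given cue value x."""
        joint = [p * gaussian_pdf(x, m, s)
                 for p, m, s in zip(self.phis, self.mus, self.sigmas)]
        total = sum(joint)
        if total == 0.0:  # x is far from every category; fall back to uniform
            return [1.0 / len(joint)] * len(joint)
        return [j / total for j in joint]

    def update(self, x, rate_mu=1.0, rate_sigma=1.0, rate_phi=0.001):
        """One unsupervised learning step on a single unlabeled token."""
        post = self.posteriors(x)
        for k, (m, s) in enumerate(zip(self.mus, self.sigmas)):
            # Gradient of the log-likelihood with respect to mean and SD.
            self.mus[k] = m + rate_mu * post[k] * (x - m) / s ** 2
            grad_sigma = post[k] * ((x - m) ** 2 / s ** 3 - 1.0 / s)
            self.sigmas[k] = max(self.min_sigma, s + rate_sigma * grad_sigma)
        # Competition (assumed form): only the best-matching category gains
        # weight, and the weights are then renormalized to sum to 1.
        winner = max(range(len(post)), key=post.__getitem__)
        self.phis[winner] += rate_phi * post[winner]
        total = sum(self.phis)
        self.phis = [p / total for p in self.phis]
```

In this sketch, a call such as `MixtureOfGaussians([0.0, 50.0], [10.0, 10.0], [0.5, 0.5]).update(12.0)` nudges the category closest to the token toward it. No category labels are supplied at any point, which is the sense in which the learning is unsupervised.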

Previous work has shown that the model can successfully acquire phonetic categories. Given this, we can ask whether it can also adapt those categories in a perceptual learning task. These two processes are typically viewed as distinct: Language acquisition is seen as a slow process that occurs early in development and produces stable long-term representations, whereas adaptation is seen as a rapid process that can occur over the course of an hour and may produce only transient changes. As a result, it is not clear whether the learning rates that lead to successful development will also lead to successful adaptation. Moreover, it is unclear whether listeners’ behavior in perceptual learning experiments can be explained via adaptation of long-term representations of phonetic categories.

We examined these issues in the context of the perceptual learning task presented in Munson (2011). In this study, listeners heard words that formed minimal pairs differing in word-initial voicing and that varied in voice onset time (VOT). VOT values were drawn from distributions in which the category boundary between voiced and voiceless tokens fell at either a short VOT (15 ms) or a long VOT (35 ms). After hearing 300 tokens drawn from one of these distributions, listeners had shifted their category boundaries in the direction consistent with the distribution they heard.
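To make the training conditions concrete, the sketch below generates 300 unlabeled VOT tokens from a bimodal distribution whose optimal boundary sits at a chosen value. Only the boundary locations (15 and 35 ms) and the token count come from the description above. The function name, the spacing of the category means around the boundary, and the standard deviation are hypothetical values chosen for illustration, not the parameters used by Munson (2011).

```python
import random


def sample_vot_tokens(boundary_ms, n_tokens=300,
                      half_separation_ms=25.0, sd_ms=8.0, seed=0):
    """Draw unlabeled VOT tokens from two equally likely Gaussian categories.

    The category means are placed symmetrically around boundary_ms, so with
    equal SDs and equal priors the optimal boundary is boundary_ms itself.
    half_separation_ms and sd_ms are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = (boundary_ms - half_separation_ms,   # voiced category
             boundary_ms + half_separation_ms)   # voiceless category
    return [rng.gauss(rng.choice(means), sd_ms) for _ in range(n_tokens)]


short_condition = sample_vot_tokens(15.0)  # boundary at 15 ms VOT
long_condition = sample_vot_tokens(35.0)   # boundary at 35 ms VOT
```

Because the tokens carry no category labels, a learner has to recover the two clusters, and therefore the boundary between them, from the shape of the distribution alone.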

The model was tested in the same task. First, we assessed its ability to correctly learn English voicing categories at a variety of learning rates. As expected, slower rates were more likely to yield successful acquisition and produced more stable categories. Next, we trained the model on VOT values drawn from the distributions in Munson (2011) to ask whether a subset of the learning rates that worked for development also allowed for rapid adaptation. We found that this was the case: A common set of parameters can produce both successful acquisition and successful adaptation.
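The logic of the adaptation simulations can be sketched as a sweep over learning rates. The example below is deliberately simplified and is not the simulation reported in the talk: it updates only the two category means (standard deviations and mixing weights are held fixed and equal), and the starting means, standard deviation, token distributions, and learning rates are all hypothetical values.

```python
import math
import random


def sample_tokens(boundary_ms, n_tokens=300, half_sep_ms=25.0, sd_ms=8.0, seed=0):
    """Unlabeled VOT tokens from two Gaussians centered around boundary_ms."""
    rng = random.Random(seed)
    means = (boundary_ms - half_sep_ms, boundary_ms + half_sep_ms)
    return [rng.gauss(rng.choice(means), sd_ms) for _ in range(n_tokens)]


def posterior_voiceless(x, mu_voiced, mu_voiceless, sd):
    """P(voiceless | x) for two equal-weight, equal-SD Gaussian categories."""
    log_odds = ((x - mu_voiced) ** 2 - (x - mu_voiceless) ** 2) / (2.0 * sd * sd)
    return 1.0 / (1.0 + math.exp(-log_odds))


def adapted_boundary(tokens, rate, mu_voiced=0.0, mu_voiceless=50.0, sd=8.0):
    """Train on tokens one at a time and return the resulting boundary in ms.

    Each mean moves toward the token in proportion to the posterior
    probability that the token belongs to that category. With equal SDs and
    equal weights, the boundary is the midpoint of the two means.
    """
    for x in tokens:
        p = posterior_voiceless(x, mu_voiced, mu_voiceless, sd)
        mu_voiced += rate * (1.0 - p) * (x - mu_voiced)
        mu_voiceless += rate * p * (x - mu_voiceless)
    return (mu_voiced + mu_voiceless) / 2.0


# The starting boundary is 25 ms (midpoint of the assumed initial means).
rates = [0.001, 0.01, 0.05]  # hypothetical learning rates
long_results = {r: adapted_boundary(sample_tokens(35.0), r) for r in rates}
short_results = {r: adapted_boundary(sample_tokens(15.0), r) for r in rates}
for r in rates:
    print(f"rate={r}: long={long_results[r]:.1f} ms, short={short_results[r]:.1f} ms")
```

In a sweep like this, the quantity of interest is how far the boundary moves from its starting value after 300 tokens at each rate. A rate that is too slow leaves the boundary close to where it started, which is why it matters whether rates that are slow enough for stable acquisition are still fast enough for adaptation.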

These simulations show that relatively simple unsupervised learning algorithms are sufficient for explaining speech sound learning on vastly different time-scales without changes in plasticity. Further, they suggest that some aspects of perceptual adaptation can be explained simply by the adjustment of listeners’ long-term phonetic category representations.

Talk slides:

Toscano CIMSA 2013 talk
