{{LangPsy}}
{{Hearing}}

'''Speech perception''' refers to the processes by which humans are able to interpret and understand the sounds used in language. The study of speech perception is closely linked to the fields of [[phonetics]] and [[phonology]] in [[linguistics]] and [[cognitive psychology]] and [[perception]] in [[psychology]]. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech research has applications in building computer systems that can recognize speech, as well as improving speech recognition for hearing- and language-impaired listeners.
   
 
==Basics of speech perception==
The process of perceiving speech begins at the level of the sound signal and the process of audition. (For a complete description of the process of audition see [[Hearing (sense)|Hearing]].) After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition.
 
   
===Acoustic cues===

[[Image:Spectrograms of syllables dee dah doo.png|right|thumb|250px|Figure 1: Spectrograms of syllables "dee" (top), "dah" (middle), and "doo" (bottom) showing how the onset [[formant transition]]s that define perceptually the consonant {{IPA|[d]}} differ depending on the identity of the following vowel. ([[Formant]]s are highlighted by red dotted lines; transitions are the bending beginnings of the formant trajectories.)]]
 
   
 
The speech sound signal contains a number of [[acoustic cues]] that are used in speech perception. The cues differentiate speech sounds belonging to different [[phonetic categories]]. For example, one of the most studied cues in speech is [[voice onset time]] or VOT. VOT is a primary cue signaling the difference between voiced and voiceless stop consonants, such as "b" and "p". Other cues differentiate sounds that are produced at different [[place of articulation|places of articulation]] or [[manner of articulation|manners of articulation]]. The speech system must also combine these cues to determine the category of a specific speech sound. This is often thought of in terms of abstract representations of [[phonemes]]. These representations can then be combined for use in word recognition and other language processes.
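
A toy sketch in Python of how a single temporal cue such as VOT could, in an idealized case, separate two stop categories is given below. The 25 ms boundary is an invented value used only for illustration; as discussed below, real perceptual boundaries shift with place of articulation, stress, and speaking rate.

<pre>
# Toy illustration: labelling a stop consonant as voiced ("b") or
# voiceless ("p") from its voice onset time (VOT), measured in ms.
# The 25 ms boundary is a made-up value; actual perceptual boundaries
# depend on context, speaking rate, and language.

def classify_stop(vot_ms, boundary_ms=25.0):
    """Return 'b' (voiced) or 'p' (voiceless) for a given VOT value."""
    return "b" if vot_ms < boundary_ms else "p"

if __name__ == "__main__":
    for vot in (-30, 0, 10, 30, 60):   # a few VOT values along a continuum
        print(f"VOT = {vot:4d} ms -> /{classify_stop(vot)}/")
</pre>
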
 
   
It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound:
 
   
:''At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find, even after some forty-five years of research on the problem.''<ref name="np">{{cite encyclopedia |author=Nygaard, L.C., Pisoni, D.B. |date=1995 |title=Speech Perception: New Directions in Research and Theory |editor=J.L. Miller, P.D. Eimas |encyclopedia=Handbook of Perception and Cognition: Speech, Language, and Communication |location=San Diego |publisher=Academic Press}}</ref>
 
If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests using speech synthesizers would be sufficient to determine such a cue or cues. However, there are two significant obstacles:

# One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases (like American English {{IPA|/ɛ/}} and {{IPA|/æ/}}) it can distinguish the identity of vowels.<ref>{{cite journal |author=Klatt, D.H. |year=1976 |title=Linguistic uses of segmental duration in English: Acoustic and perceptual evidence |journal=Journal of the Acoustical Society of America |volume=59(5) |pages=1208-1221}}</ref> Some experts even argue that duration can help to distinguish what are traditionally called short and long vowels in English.<ref>{{cite journal |author=Halle, M., Mohanan, K.P. |year=1985 |title=Segmental phonology of modern English |journal=Linguistic Inquiry |volume=16(1) |pages=57-116}}</ref>
# One linguistic unit can be cued by several acoustic properties. For example, in a classic experiment, [[Alvin Liberman]] (1957) showed that the onset [[formant transitions]] of {{IPA|/d/}} differ depending on the following vowel (see Figure 1), but they are all interpreted as the phoneme {{IPA|/d/}} by listeners.<ref>{{cite journal |author=Liberman, A.M. |year=1957 |title=Some results of research on speech perception |journal=Journal of the Acoustical Society of America |volume=29(1) |pages=117-123 |url=http://www.haskins.yale.edu/Reprints/HL0016.pdf |format=[[PDF]] |accessdate=2007-05-17}}</ref>
{{Anchor|segmentation}}

===Linearity and the segmentation problem===

[[Image:Spectrogram of I owe you.png|right|thumb|300px|Figure 2: A spectrogram of the phrase "I owe you". There are no clearly distinguishable boundaries between speech sounds.]]

Although listeners perceive speech as a stream of discrete units ([[phonemes]], [[syllables]], and [[words]]), this linearity is difficult to see in the physical speech signal (see Figure 2 for an example). Speech sounds do not strictly follow one another; rather, they overlap.<ref name="fow">{{cite encyclopedia |author=Fowler, C. A. |date=1995 |title=Speech production |editor=J.L. Miller, P.D. Eimas |encyclopedia=Handbook of Perception and Cognition: Speech, Language, and Communication |location=San Diego |publisher=Academic Press}}</ref> A speech sound is influenced by the ones that precede it and the ones that follow. This influence can even be exerted at a distance of two or more segments (and across syllable and word boundaries).<ref name="fow"/>

Once the linearity of the speech signal is questioned, the problem of segmentation arises: one encounters serious difficulties trying to delimit a stretch of the speech signal as belonging to a single perceptual unit. This can again be illustrated by the fact that the acoustic properties of the phoneme {{IPA|/d/}} will depend on the identity of the following vowel (because of [[coarticulation]]).
===Lack of Invariance===

The research and application of speech perception have to deal with several problems that result from what has been termed the lack of invariance. As suggested above, reliable, constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:
* ''Context-induced variation.'' Phonetic environment affects the acoustic properties of speech sounds. For example, {{IPA|/u/}} in English is fronted when surrounded by [[coronal consonant]]s.<ref>{{cite journal |author=Hillenbrand, J.M., Clark, M.J., Nearey, T.M. |year=2001 |title=Effects of consonant environment on vowel formant patterns |journal=Journal of the Acoustical Society of America |volume=109(2) |pages=748–763}}</ref> Likewise, the VOT values marking the boundary between voiced and voiceless stops are different for labial, alveolar and velar stops, and they shift under stress or depending on the position within a syllable.<ref>{{cite journal |author=Lisker, L., Abramson, A.S. |year=1967 |title=Some effects of context on voice onset time in English stops |journal=Language and Speech |volume=10 |pages=1-28 |url=http://www.haskins.yale.edu/Reprints/HL0067.pdf |format= [[PDF]] |accessdate = 2007-05-17}}</ref>

* ''Variation due to differing speech conditions.'' One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives, stops vs. glides, voiced vs. voiceless stops, etc.) and they are certainly affected by changes in speaking tempo.<ref name="np"/> Another major source of variation is articulatory carefulness versus sloppiness, which is typical of connected speech (articulatory ‘undershoot’ is naturally reflected in the acoustic properties of the sounds produced).

* ''Variation due to different speaker identity.'' The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices with different pitch. Because speakers have vocal tracts of different sizes (due especially to sex and age), the resonant frequencies ([[formants]]), which are important for the recognition of speech sounds, will vary in their absolute values across individuals<ref name="hill">{{cite journal |author=Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K. |year=1995 |title=Acoustic characteristics of American English vowels |journal=Journal of the Acoustical Society of America |volume=97 |pages=3099-3111}}</ref> (see Figure 3 for an illustration of this). Dialect and foreign accent cause variation as well.
===Perceptual constancy and normalization===

[[Image:Standard and normalized vowel space2.png|right|thumb|300px|Figure 3: The left panel shows the 3 peripheral American English vowels {{IPA|/i/}}, {{IPA|/ɑ/}}, and {{IPA|/u/}} in a standard F1 by F2 plot (in Hz). The mismatch between male, female, and child values is apparent. In the right panel formant distances (in [[Bark scale|Bark]]) rather than absolute values are plotted using the normalization procedure proposed by Syrdal and Gopal in 1986.<ref name="sg">{{cite journal |author=Syrdal, A.K., Gopal, H.S. |year=1986 |title=A perceptual model of vowel recognition based on the auditory representation of American English vowels |journal=Journal of the Acoustical Society of America |volume=79 |pages=1086-1100}}</ref> Formant values are taken from Hillenbrand et al. (1995).<ref name="hill"/>]]

Given the lack of invariance, it is remarkable that listeners perceive vowels and consonants produced under different conditions and by different speakers as constant categories. It has been proposed that this is achieved by means of the perceptual normalization process in which listeners filter out the noise (i.e. variation) to arrive at the underlying category. Vocal-tract-size differences result in formant-frequency variation across speakers; therefore a listener has to adjust his/her perceptual system to the acoustic characteristics of a particular speaker. This may be accomplished by considering the ratios of formants rather than their absolute values.<ref name="sg"/><ref>{{cite encyclopedia |author=Strange, W. |date=1999 |title=Perception of vowels: Dynamic constancy |editor=J.M. Pickett |encyclopedia=The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology |location=Needham Heights (MA) |publisher=Allyn & Bacon}}</ref><ref name=john>{{cite encyclopedia |author=Johnson, K. |date=2005 |title=Speaker Normalization in speech perception |editor=Pisoni, D.B., Remez, R. |encyclopedia=The Handbook of Speech Perception |location=Oxford |publisher=Blackwell Publishers |url=http://corpus.linguistics.berkeley.edu/~kjohnson/papers/revised_chapter.pdf |accessdate=2007-05-17}}</ref> This process has been called vocal tract normalization (see Figure 3 for an example). Similarly, listeners are believed to adjust the perception of duration to the current tempo of the speech they are listening to – this has been referred to as speech rate normalization.
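
As a rough numerical sketch of what "considering formant distances rather than absolute values" can mean, the Python fragment below converts formant frequencies to the [[Bark scale]] using Traunmüller's (1990) approximation and computes inter-formant Bark differences in the spirit of Syrdal and Gopal's procedure. The sample formant values are invented for illustration.

<pre>
# Sketch of speaker normalization via formant distances on the Bark scale,
# in the spirit of Syrdal & Gopal (1986). The Hz-to-Bark conversion is
# Traunmüller's (1990) approximation; the two vowel tokens are invented
# examples of an adult-male-like and a child-like /i/.

def hz_to_bark(f_hz):
    """Approximate critical-band (Bark) value for a frequency in Hz."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_distances(f0, f1, f2, f3):
    """Return Bark-scale formant distances, which are more directly
    comparable across speakers than raw frequencies in Hz."""
    b0, b1, b2, b3 = (hz_to_bark(f) for f in (f0, f1, f2, f3))
    return {"F1-F0": b1 - b0, "F2-F1": b2 - b1, "F3-F2": b3 - b2}

if __name__ == "__main__":
    male_i  = bark_distances(f0=120, f1=300, f2=2300, f3=3000)  # hypothetical male /i/
    child_i = bark_distances(f0=250, f1=400, f2=3100, f3=3700)  # hypothetical child /i/
    print("male  /i/:", {k: round(v, 2) for k, v in male_i.items()})
    print("child /i/:", {k: round(v, 2) for k, v in child_i.items()})
</pre>
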
Whether normalization actually takes place, and what its exact nature is, remains a matter of theoretical controversy (see [[Speech perception#Theories|theories]] below). [[Perceptual constancy]] is not a phenomenon specific to speech perception; it exists in other types of perception as well.
===Categorical perception===
{{main|Categorical perception}}

[[Image:Categorization-and-discrimination-curves.png|right|thumb|300px|Figure 4: Example identification (red) and discrimination (blue) functions]]

Categorical perception is involved in processes of perceptual differentiation. We perceive speech sounds categorically, that is to say, we are more likely to notice the differences ''between'' categories (phonemes) than ''within'' categories. The perceptual space between categories is therefore warped, the centers of categories (or 'prototypes') working like a sieve<ref>{{cite book |last=Trubetzkoy |first=Nikolay S. |authorlink=Nikolai Trubetzkoy |title=Principles of phonology |publisher=University of California Press |location=Berkeley and Los Angeles |date=1969}}</ref> or like magnets<ref>{{cite journal |author=Iverson, P., Kuhl, P.K. |year=1995 |title=Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling |journal=Journal of the Acoustical Society of America |volume=97(1) |pages=553-562}}</ref> for incoming speech sounds.

Let us consider an artificial continuum between a voiceless and a voiced [[bilabial stop]] where each new step differs from the preceding one in the amount of [[Voice onset time|VOT]]. The first sound is a [[Pre-voicing (phonetics)|pre-voiced]] {{IPA|[b]}}, i.e. it has a negative VOT. Then, increasing the VOT, we get to a point where it is zero, i.e. the stop is a plain [[Aspiration (phonetics)|unaspirated]] voiceless {{IPA|[p]}}. Gradually, adding the same amount of VOT at a time, we reach the point where the stop is a strongly aspirated voiceless bilabial {{IPA|[pʰ]}}. (Such a continuum was used in an experiment by [[Leigh Lisker|Lisker]] and [[Arthur S. Abramson|Abramson]] in 1970.<ref name=la/> The sounds they used are [http://www.haskins.yale.edu/featured/demo-liskabram/index.html available online].) In this continuum of, for example, seven sounds, native English listeners will identify the first three sounds as {{IPA|/b/}} and the last three sounds as {{IPA|/p/}} with a clear boundary between the two categories.<ref name=la>{{cite conference |author=Lisker, L., Abramson, A.S. |year=1970 |title=The voicing dimension: Some experiments in comparative phonetics |booktitle=Proc. 6th International Congress of Phonetic Sciences |pages=563-567 |location=Prague |publisher=Academia |url=http://www.haskins.yale.edu/Reprints/HL0087.pdf |format= [[PDF]] |accessdate = 2007-05-17}}</ref> A two-alternative identification (or categorization) test will yield a discontinuous categorization function (see the red curve in Figure 4).

If we test the ability to discriminate between two sounds with varying VOT values but having a constant VOT distance from each other (20 ms, for instance), listeners are likely to perform at chance level if both sounds fall within the same category and at a nearly 100% level if each sound falls in a different category (see the blue discrimination curve in Figure 4).

The conclusion to draw from both the identification and the discrimination tests is that listeners will show different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories was crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
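
A toy calculation can show how curves of the shape in Figure 4 arise. Below, identification along a hypothetical VOT continuum is modelled with a logistic function, and discrimination of two neighbouring stimuli is approximated from the difference between their labelling probabilities. All numbers are invented and serve only to illustrate why within-category pairs sit near chance while pairs straddling the boundary approach 100%; this is a sketch, not the analysis used in the experiments cited above.

<pre>
import math

# Toy identification/discrimination functions over a hypothetical VOT
# continuum (in ms). The boundary (25 ms) and slope are invented values.

def p_voiceless(vot_ms, boundary=25.0, slope=0.5):
    """Probability of labelling a stimulus as /p/ (logistic identification curve)."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

def discrimination(vot_a, vot_b):
    """Crude proxy for discrimination accuracy of a stimulus pair:
    chance (0.5) when both members receive the same label, near 1.0
    when the pair straddles the category boundary."""
    return 0.5 + 0.5 * abs(p_voiceless(vot_a) - p_voiceless(vot_b))

if __name__ == "__main__":
    continuum = [-10, 0, 10, 20, 30, 40, 50]          # seven VOT steps
    for vot in continuum:
        print(f"VOT {vot:3d} ms: P(/p/) = {p_voiceless(vot):.2f}")
    for a, b in zip(continuum, continuum[1:]):        # neighbouring pairs, 10 ms apart
        print(f"pair ({a:3d}, {b:3d}): predicted discrimination = {discrimination(a, b):.2f}")
</pre>
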
===Top-down influences on speech perception===

The process of speech perception is not necessarily uni-directional. That is, higher-level language processes connected with [[Morphology (linguistics)|morphology]], [[syntax]], or [[semantics]] may interact with basic speech perception processes to aid in the recognition of speech sounds. It may not be necessary, or even possible, for a listener to recognize phonemes before recognizing higher units such as words. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners are able to compensate for missing or noise-masked phonemes using their knowledge of the spoken language.

In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with a cough-like sound. His subjects restored the missing speech sound perceptually without any difficulty; what is more, they were not able to identify accurately which phoneme had been disturbed.<ref>{{cite journal |author=Warren, R.M. |year=1970 |title=Restoration of missing speech sounds |journal=Science |volume=167 |pages=392-393}}</ref> Another basic experiment compares recognition of naturally spoken words presented in a sentence (or at least a phrase) with the same words presented in isolation; perception accuracy usually drops in the latter condition. Garnes and Bond (1976) also used carrier sentences when researching the influence of semantic knowledge on perception. They created a series of words differing in one phoneme (bay / day / gay, for example). The quality of the first phoneme changed along a continuum. All these stimuli were put into different sentences, each of which made sense with only one of the words. Listeners had a tendency to judge the ambiguous words (when the first segment was at the boundary between categories) according to the meaning of the whole sentence.<ref>{{cite conference |author=Garnes, S., Bond, Z.S. |date=1976 |title=The relationship between acoustic information and semantic expectation |booktitle=Phonologica 1976 |location=Innsbruck |pages=285-293}}</ref>
   
 
==Research topics==
 
===Infant speech perception===
{{Main|Speech perception in infants}}

Infants begin the process of language acquisition by being able to detect very small differences between speech sounds. They can discriminate all possible speech contrasts (phonemes). Gradually, as they are exposed to their native language, their perception becomes language-specific: they learn to ignore the differences within phonemic categories of the language (differences that may well be contrastive in other languages – for example, English distinguishes two voicing categories of [[stop consonants]], whereas [[Thai language#Consonants|Thai has three categories]]; infants must learn which differences their native language treats as distinctive, and which it does not). As infants learn how to sort incoming speech sounds into categories, ignoring irrelevant differences and reinforcing the contrastive ones, their perception becomes [[Speech perception#Categorical perception|categorical]]. Infants learn to contrast different vowel phonemes of their native language by approximately 6 months of age. The native consonantal contrasts are acquired by 11 or 12 months of age.<ref name=kawai>{{cite journal |author=Minagawa-Kawai, Y., Mori, K., Naoi, N., Kojima, S. |year=2006 |title=Neural Attunement Processes in Infants during the Acquisition of a Language-Specific Phonemic Contrast |journal=The Journal of Neuroscience |volume=27(2) |pages=315-321}}</ref> Some researchers have proposed that infants may be able to learn the sound categories of their native language through passive listening, using a process called [[statistical learning]]. Others even claim that certain sound categories are innate, that is, genetically specified (see the discussion of [[Categorical perception#Acquired distinctiveness|innate vs. acquired categorical distinctiveness]]).
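
A highly simplified sketch of what distributional (statistical) learning over a single cue could look like is given below: a bimodal set of VOT values is generated and a two-component Gaussian mixture is fitted to it, recovering two "categories" without any labels. This is only an illustration of the idea, not a model of infant cognition; the VOT means, the sample sizes, and the use of scikit-learn's GaussianMixture are all assumptions of the sketch.

<pre>
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative distributional learning over one acoustic cue (VOT, in ms):
# generate a bimodal distribution resembling an English-like voiced/voiceless
# contrast and let an unsupervised two-component Gaussian mixture find the
# two clusters. All parameter values are invented for the example.

rng = np.random.default_rng(0)
voiced    = rng.normal(loc=5.0,  scale=8.0,  size=300)   # short-lag tokens
voiceless = rng.normal(loc=55.0, scale=12.0, size=300)   # long-lag tokens
tokens = np.concatenate([voiced, voiceless]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(tokens)

means = sorted(gmm.means_.ravel())
print(f"recovered category means: {means[0]:.1f} ms and {means[1]:.1f} ms")

# classify a few new tokens with the learned, label-free categories
for vot in (0.0, 20.0, 40.0, 70.0):
    component = gmm.predict([[vot]])[0]
    print(f"VOT {vot:4.1f} ms -> category {component}")
</pre>
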
 
   
If day-old babies are presented with their mother’s voice speaking normally, their mother’s voice speaking abnormally (in monotone), and a stranger’s voice, they react only to their mother’s voice speaking normally. When a human sound and a non-human sound are played, babies turn their heads only toward the source of the human sound. It has been suggested that auditory learning begins as early as the prenatal period.<ref name=cd>{{cite book |last=Crystal |first=David |date=2005 |title=The Cambridge Encyclopedia of Language |location=Cambridge |publisher=CUP}}</ref>
 
   
How do researchers know whether infants can distinguish between speech sounds? One of the techniques used to examine how infants perceive speech, besides the head-turn procedure mentioned above, is measuring their sucking rate. In such an experiment, a baby sucks on a special nipple while being presented with sounds. First, the baby’s normal sucking rate is established. Then a stimulus is played repeatedly. When the baby hears the stimulus for the first time the sucking rate increases, but as the baby becomes [[Habituation|habituated]] to the stimulation the sucking rate decreases and levels off. Then, a new stimulus is played to the baby. If the baby perceives the newly introduced stimulus as different from the background stimulus, the sucking rate will show an increase.<ref name=cd/> The sucking-rate and the head-turn methods are some of the more traditional, behavioral methods for studying speech perception. Among the newer methods (see [[Speech perception#Research methods|research methods]] below) that help us to study speech perception, [[NIRS]] is widely used with infants.<ref name=kawai/>
 
   
===Cross-language and second-language speech perception===

A large amount of research has focused on how users of a language perceive [[Foreign language|foreign]] speech (referred to as cross-language speech perception) or [[Second language|second-language]] speech (second-language speech perception). The latter falls within the domain of [[second language acquisition]].
 
   
Languages differ in their phonemic inventories. Naturally, this creates difficulties when a foreign language is encountered. For example, if two foreign-language sounds are assimilated to a single mother-tongue category the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English will have problems with identifying or distinguishing the English liquids {{IPA|/l/}} and {{IPA|/r/}}.<ref>{{cite journal |author=Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., Siebert, C. |year=2003 |title=A perceptual interference account of acquisition difficulties for non-native phonemes |journal=Cognition |volume=89 |pages=B47–B57}}</ref>

Best (1995) proposed a Perceptual Assimilation Model which describes possible cross-language category assimilation patterns and predicts their consequences.<ref>{{cite encyclopedia |author=Best, C.T. |date=1995 |title=A direct realist view of cross-language speech perception |editor=Winifred Strange |encyclopedia=Speech perception and linguistic experience: Theoretical and methodological issues |location=Baltimore |publisher=York Press |pages=171–204}}</ref>

Flege (1995) formulated a Speech Learning Model which combines several hypotheses about second-language (L2) speech acquisition and which predicts, put simply, that an L2 sound that is not too similar to a native-language (L1) sound will be easier to acquire than an L2 sound that is relatively similar to an L1 sound (because it will be perceived as more obviously ‘different’ by the learner).<ref>{{cite encyclopedia |author=Flege, J. |date=1995 |title=Second language speech learning: Theory, findings and problems |editor=Winifred Strange |encyclopedia=Speech perception and linguistic experience: Theoretical and methodological issues |location=Baltimore |publisher=York Press |pages=233–277}}</ref>
 
   
===Speech perception in language or hearing impairment===

Research into how people with language or hearing impairment perceive speech is not intended only to discover possible treatments. It can also provide insight into the principles underlying non-impaired speech perception. Two areas of research can serve as an example:
 
* ''Listeners with aphasia.'' [[Aphasia]] affects both the expression and reception of language. The two most common types, [[Broca's aphasia|Broca's]] and [[Receptive aphasia|Wernicke's aphasia]], affect speech perception to some extent. Broca’s aphasia causes moderate difficulties for language understanding, whereas the effect of Wernicke’s aphasia on understanding is much more severe. It is generally agreed that aphasics suffer from perceptual deficits: they are usually unable to fully distinguish place of articulation and voicing.<ref name="cse2001">{{cite journal |author=Csépe, V., Osman-Sagi, J., Molnar M., Gosy M. |year=2001 |title=Impaired speech perception in aphasic patients: event-related potential and neuropsychological assessment |journal=Neuropsychologia |volume=39(11) |pages=1194-1208}}</ref> As for other features, the difficulties vary. It has not yet been proven whether low-level speech-perception skills are affected in aphasia sufferers or whether their difficulties are caused by higher-level impairment alone.<ref name="cse2001"/>
 
   
* ''Listeners with cochlear implants.'' [[Cochlear implant]]ation allows partial restoration of hearing in deaf people. The acoustic information conveyed by an implant is usually sufficient for implant users to properly recognize the speech of people they know even without visual cues.<ref name="loi1998"/> It is more difficult for cochlear implant users to understand unknown speakers and sounds. The perceptual abilities of children who received an implant after the age of two are significantly better than those of people who were implanted in adulthood. A number of factors have been shown to influence perceptual performance, especially the duration of deafness prior to implantation, the age of onset of deafness, the age at implantation (such age effects may be related to the [[Critical period hypothesis]]) and the duration of implant use. There are differences between children with congenital and acquired deafness. Postlingually deaf children have better results than the prelingually deaf and adapt to a cochlear implant faster.<ref name="loi1998">{{cite journal |author=Loizou, P. |year=1998 |title=Introduction to cochlear implants |journal=IEEE Signal Processing Magazine |volume=39(11) |pages=101-130}}</ref>

===Noise===
One of the basic problems in the study of speech is how to deal with the noise in the speech signal. This is shown by the difficulty that computer [[speech recognition]] systems have with recognizing human speech. These programs can do well at recognizing speech when they have been trained on a specific speaker's voice, and under quiet conditions. However, these systems often do poorly in more realistic listening situations where humans are able to understand speech without difficulty.
===Research methods===

The methods used in speech perception research can be roughly divided into three groups: behavioral, computational, and, more recently, neurophysiological methods. Behavioral experiments are based on an active role of a participant, i.e. subjects are presented with stimuli and asked to make conscious decisions about them. This can take the form of an identification test, a [[discrimination test]], similarity rating, etc. These types of experiments help to provide a basic description of how listeners perceive and categorize speech sounds.

Computational modeling has also been used to simulate how speech may be processed by the brain to produce behaviors that are observed. Computer models have been used to address several questions in speech perception, including how the sound signal itself is processed to extract the acoustic cues used in speech, as well as how speech information is used for higher-level processes, such as word recognition.<ref name="mcc1986">{{cite journal |author=McClelland, J. L. and Elman, J. L. |year=1986 |title=The TRACE model of speech perception |journal=Cognitive Psychology |volume=18 |pages=1-86 |url=http://www.cnbc.cmu.edu/~jlm/papers/McClellandElman86.pdf |format=[[PDF]] |accessdate=2007-05-19}}</ref>
   
Neurophysiological methods rely on utilizing information stemming from more direct and not necessarily conscious (pre-attentive) processes. Subjects are presented with speech stimuli in different types of tasks and the responses of the brain are measured. The brain itself can be more sensitive than it appears to be through behavioral responses. For example, the subject may not show sensitivity to the difference between two speech sounds in a discrimination test, but brain responses may reveal sensitivity to these differences.<ref name=kawai/> Methods used to measure neural responses to speech include [[event-related potential]]s, [[magnetoencephalography]], and [[near infrared spectroscopy]]. One important response used with [[event-related potential]]s is the [[mismatch negativity]], which occurs when speech stimuli are acoustically different from a stimulus that the subject heard previously.
   
Neurophysiological methods were introduced into speech perception research for several reasons:

:''Behavioral responses may reflect late, conscious processes and be affected by other systems such as orthography, and thus they may mask speaker’s ability to recognize sounds based on lower-level acoustic distributions.''<ref name="kaz">{{cite conference |author=Kazanina, N., Phillips, C., Idsardi, W. |year=2006 |title=The influence of meaning on the perception of speech sounds | booktitle=PNAS |volume=30 |pages=11381-11386 |url=http://aix1.uottawa.ca/~nkazanin/Papers/kazanina-phillips-idsardi_PNAS_2006_reprint.pdf |format=[[PDF]] |accessdate=2007-05-19}}</ref>

Because no active participation in the test is necessary, even infants can be tested; this feature is crucial in research into acquisition processes. The possibility of observing low-level auditory processes independently from the higher-level ones makes it possible to address long-standing theoretical issues such as whether or not humans possess a specialized module for perceiving speech<ref>{{cite journal |author= Gocken, J. M. & Fox R. A. |year=2001 |title= Neurological Evidence in Support of a Specialized Phonetic Processing Module |journal= Brain and Language |volume=78 |pages=241-253}}</ref><ref>{{cite journal |author= Dehaene-Lambertz, G., Pallier, C., Serniclaes, W., Sprenger-Charolles, L., Jobert, A., & Dehaene, S. |year=2005 |title= Neural correlates of switching from auditory to speech perception |journal= NeuroImage |volume=24 |pages=21-33 |url=http://www.pallier.org/papers/Dehaene-LambertzPallier_Sinewaves_Neuroimgage_2004.pdf |format= [[PDF]] |accessdate = 2007-07-04}}</ref> or whether or not some complex acoustic invariance (see [[Speech perception#Lack of Invariance|lack of invariance]] above) underlies the recognition of a speech sound.<ref>{{cite journal |author= Näätänen, R. |year=2001 |title= The perception of speech sounds by the human brain as reflected by the mismatch negativity (MMN) and its magnetic equivalent (MMNm) |journal= Psychophysiology |volume=38 |pages=1-21}}</ref>
   
 
==Theories==

Research into speech perception (SP) has by no means explained every aspect of the processes involved; much of what has been said about SP remains a matter of theory. Several theories have been devised to address the issues mentioned above and other open questions. Not all of them explain every problem satisfactorily, but the research they have inspired has yielded a great deal of useful data.
===Motor theory of SP===
 
Some of the earliest work in the study of how humans perceive speech sounds was conducted by [[Alvin Liberman]] and his colleagues at [[Haskins Laboratories]].<ref name="lib57">{{cite journal |author=Liberman, A.M., Harris, K.S., Hoffman, H.S., Griffith, B.C. |year=1957 |title=The discrimination of speech sounds within and across phoneme boundaries |journal=Journal of Experimental Psychology |volume=54 |pages=358-368 |url=http://www.haskins.yale.edu/Reprints/HL0022.pdf |format=[[PDF]] |accessdate=2007-05-18}}</ref> Using a speech synthesizer, they constructed speech sounds that varied in [[place of articulation]] along a continuum from {{IPA|/ba/}} to {{IPA|/da/}} to {{IPA|/ga/}}. Listeners were asked to identify which sound they heard and to discriminate between two different sounds. The results of the experiment showed that listeners grouped sounds into discrete categories, even though the sounds they were hearing varied continuously. Based on these results, they proposed the notion of [[categorical perception]] as a mechanism by which humans are able to identify speech sounds.
 
More recent research using different tasks and methodologies suggests that listeners are highly sensitive to acoustic differences within a single phonetic category, contrary to a strict categorical account of speech perception.

In order to provide a theoretical account of the [[categorical perception]] data, Liberman and colleagues<ref name="lib67">{{cite journal |author= Liberman, A.M., Cooper, F.S., Shankweiler, D.P., & Studdert-Kennedy, M. |year=1967 |title= Perception of the speech code. |journal= Psychological Review |volume=74 |pages=431-461 |url=http://www.haskins.yale.edu/Reprints/HL0069.pdf |format=[[PDF]] |accessdate=2007-05-19}}</ref> worked out the motor theory of speech perception, where “the complicated articulatory encoding was assumed to be decoded in the perception of speech by the same processes that are involved in production”<ref name="np"/> (this is referred to as analysis-by-synthesis). For instance, the English consonant {{IPA|/d/}} may vary in its acoustic details across different phonetic contexts (see [[Speech perception#Acoustic cues|above]]), yet all {{IPA|/d/}}’s as perceived by a listener fall within one category (voiced alveolar stop), and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments."<ref name="np"/> When describing units of perception, Liberman later abandoned articulatory movements and proceeded to the neural commands to the articulators<ref name="lib70">{{cite journal |author= Liberman, A. M. |year=1970 |title=The grammars of speech and language |journal=Cognitive Psychology |volume=1 |pages=301-323 |url=http://www.haskins.yale.edu/Reprints/HL0099.pdf |format=[[PDF]] |accessdate=2007-07-19}}</ref> and even later to intended articulatory gestures,<ref name="lib85">{{cite journal |author= Liberman, A. M. & Mattingly, I. G. |year=1985 |title=The motor theory of speech perception revised |journal=Cognition |volume=21 |pages=1-36 |url=http://www.haskins.yale.edu/Reprints/HL0519.pdf |format=[[PDF]] |accessdate=2007-07-19}}</ref> thus "the neural representation of the utterance that determines the speaker’s production is the [[distal stimulus|distal object]] the listener perceives".<ref name="lib85"/> The theory is closely related to the [[Modularity of mind|modularity]] hypothesis, which proposes the existence of a special-purpose module, which is supposed to be innate and probably human-specific.

The theory has been criticized for not being able to "provide an account of just how acoustic signals are translated into intended gestures"<ref name=hay/> by listeners. Furthermore, it is unclear how indexical information (e.g. talker identity) is encoded/decoded along with linguistically relevant information.
===Direct realist theory of SP===

The direct realist theory of speech perception (mostly associated with [[Carol Fowler]]) is a part of the more general theory of [[direct realism]], which postulates that perception allows us to have direct awareness of the world because it involves direct recovery of the [[Distal stimulus|distal source]] of the event that is perceived. For speech perception, the theory asserts that the [[Distal stimulus|objects of perception]] are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e. intended gestures. Listeners perceive gestures not by means of a specialized decoder (as in the Motor Theory) but because information in the acoustic signal specifies the gestures that form it.<ref>{{cite journal |author=Diehl, R., Lotto, A., Holt, L. |year=2004 |title=Speech perception |journal=Annual Review of Psychology |volume=55 |pages=149–179}}</ref> By claiming that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception, the theory bypasses the problem of [[Speech perception#Lack of Invariance|lack of invariance]].

===Fuzzy-logical model of SP===

The fuzzy logical theory of speech perception developed by Massaro<ref>{{cite journal |author=Massaro, D.W. |year=1989 |title=Testing between the TRACE Model and the Fuzzy Logical Model of Speech perception |journal=Cognitive Psychology |volume=21 |pages=398-421}}</ref> proposes that people remember speech sounds in a probabilistic, or graded, way. It suggests that people remember descriptions of the perceptual units of language, called prototypes. Within each prototype various features may combine. However, features are not just binary (true or false); there is a [[fuzzy logic|fuzzy]] value corresponding to how likely it is that a sound belongs to a particular speech category. Thus, when perceiving a speech signal, our decision about what we actually hear is based on the relative goodness of the match between the stimulus information and the values of particular prototypes. The final decision is based on multiple features or sources of information, even visual information (this explains the [[McGurk effect]]).<ref name=hay>{{cite book |last=Hayward |first=Katrina |title=Experimental Phonetics: An Introduction |publisher=Longman |location=Harlow |year=2000}}</ref> Computer models of the fuzzy logical theory have been used to demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners.<ref name="oden1978">{{cite journal |author=Oden, G. C., Massaro, D. W. |year=1978 |title=Integration of featural information in speech perception |journal=Psychological Review |volume=85 |pages=172-191.}}</ref>
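
A worked numerical sketch of the kind of multiplicative integration and relative-goodness decision the fuzzy logical model assumes is given below for a two-alternative audiovisual /ba/–/da/ judgement. The support values are invented for illustration and the two-alternative simplification is an assumption of the sketch.

<pre>
# Sketch of the integration and decision stages assumed by the fuzzy logical
# model for a two-alternative /ba/ vs /da/ judgement. Each information source
# supports /da/ to some degree between 0 and 1 (and /ba/ to the complementary
# degree); the sources are combined multiplicatively and the decision uses the
# relative goodness of match. The support values below are invented.

def flmp_p_da(auditory_da, visual_da):
    """Probability of responding /da/ given auditory and visual support for /da/."""
    match_da = auditory_da * visual_da
    match_ba = (1.0 - auditory_da) * (1.0 - visual_da)
    return match_da / (match_da + match_ba)

if __name__ == "__main__":
    # ambiguous auditory token (0.5) paired with a clearly /da/-like face (0.9):
    print(f"P(/da/) = {flmp_p_da(auditory_da=0.5, visual_da=0.9):.2f}")
    # /ba/-like audio (0.2) with the same /da/-like face (a McGurk-style conflict):
    print(f"P(/da/) = {flmp_p_da(auditory_da=0.2, visual_da=0.9):.2f}")
</pre>
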
===Acoustic landmarks and distinctive features===
{{main|Acoustic landmarks and distinctive features}}

In addition to the proposals of Motor Theory and Direct Realism about the relation between phonological features and articulatory gestures, [[Kenneth N. Stevens]] proposed another kind of relation: between phonological features and auditory properties. According to this view, listeners inspect the incoming signal for so-called acoustic landmarks, which are particular events in the spectrum carrying information about the gestures that produced them. Since these gestures are limited by the capacities of humans’ articulators and listeners are sensitive to their auditory correlates, the [[Speech perception#Lack of Invariance|lack of invariance]] simply does not exist in this model. The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. Bundles of these features uniquely specify phonetic segments (phonemes, syllables, words).<ref>{{cite journal |author=Stevens, K.N. |year=2002|title=Toward a model of lexical access based on acoustic landmarks and distinctive features |journal=Journal of the Acoustical Society of America |volume=111(4) |pages=1872-1891 |url=http://linguistics.berkeley.edu/~kjohnson/ling210/stevens2002.pdf |format= [[PDF]] |accessdate = 2007-05-17}}</ref>
===Exemplar theory===

Exemplar models of speech perception differ from the four theories mentioned above, which suppose that there is no connection between word- and talker-recognition and that the variation across talkers is ‘noise’ to be filtered out.

The exemplar-based approaches claim that listeners store information for word- as well as talker-recognition. According to this theory, particular instances of speech sounds are stored in the memory of a listener. In the process of speech perception, the remembered instances of e.g. a syllable stored in the listener’s memory are compared with the incoming stimulus so that the stimulus can be categorized. Similarly, when recognizing a talker, all the memory traces of utterances produced by that talker are activated and the talker’s identity is determined. Supporting this theory are several experiments reported by Johnson<ref name=john/> suggesting that signal identification is more accurate when we are familiar with the talker or when we have a visual representation of the talker’s gender. When the talker is unpredictable or the sex is misidentified, the error rate in word identification is much higher.
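
The kind of comparison with stored traces that exemplar accounts assume can be caricatured with a nearest-neighbour sketch: each stored exemplar keeps both its category label and its talker label, so the same memory comparison can inform word (here: vowel) recognition and talker recognition. The formant values and labels below are invented for illustration and are not drawn from any of the studies cited here.

<pre>
import math

# Caricature of exemplar-based recognition: stored exemplars keep both a
# vowel label and a talker label, so one memory comparison can inform both
# vowel identification and talker recognition.
# The (F1, F2) values in Hz and the labels are invented for illustration.

EXEMPLARS = [
    ((300, 2300), "i", "talker_A"),
    ((320, 2250), "i", "talker_B"),
    ((700, 1200), "a", "talker_A"),
    ((680, 1100), "a", "talker_B"),
    ((320, 800),  "u", "talker_A"),
]

def nearest_exemplar(f1, f2):
    """Return the stored exemplar closest to the incoming token in F1/F2 space."""
    return min(EXEMPLARS, key=lambda ex: math.dist((f1, f2), ex[0]))

if __name__ == "__main__":
    token = (310, 2280)                       # incoming vowel token
    formants, vowel, talker = nearest_exemplar(*token)
    print(f"token {token} -> vowel /{vowel}/, most similar to {talker}")
</pre>
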
The exemplar models have to face several objections, two of which are (1) whether memory capacity is sufficient to store every utterance ever heard and, concerning the ability to produce what was heard, (2) whether the talker’s own articulatory gestures are also stored or computed when producing utterances that would sound like the auditory memories.<ref name=hay/><ref name=john/>
==Prominent workers in the field==
*[[Alvin Liberman]]
*[[Michael Studdert-Kennedy]]
==Journals==

==See also==

==References==
{{reflist}}
[[Category:Perception]]
[[Category:Developmental psychology]]
[[Category:Linguistics]]
[[Category:Phonetics]]
[[Category:Cognition]]
[[Category:Speech perception]]

[[de:Sprachverständnis]]

{{enWP|Speech perception}}

Revision as of 16:47, 14 December 2007

Assessment | Biopsychology | Comparative | Cognitive | Developmental | Language | Individual differences | Personality | Philosophy | Social |
Methods | Statistics | Clinical | Educational | Industrial | Professional items | World psychology |

Language: Linguistics · Semiotics · Speech


Speech perception refers to the processes by which humans are able to interpret and understand the sounds used in language. The study of speech perception is closely linked to the fields of phonetics and phonology in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech research has applications in building computer systems that can recognize speech, as well as improving speech recognition for hearing- and language-impaired listeners.

Basics of speech perception

The process of perceiving speech begins at the level of the sound signal and the process of audition. (For a complete description of the process of audition see Hearing.) After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition.

Acoustic cues

File:Spectrograms of syllables dee dah doo.png

Figure 1: Spectrograms of syllables "dee" (top), "dah" (middle), and "doo" (bottom) showing how the onset formant transitions that define perceptually the consonant [d] differ depending on the identity of the following vowel. (Formants are highlighted by red dotted lines; transitions are the bending beginnings of the formant trajectories.)

The speech sound signal contains a number of acoustic cues that are used in speech perception. The cues differentiate speech sounds belonging to different phonetic categories. For example, one of the most studied cues in speech is voice onset time or VOT. VOT is a primary cue signaling the difference between voiced and voiceless stop consonants, such as "b" and "p". Other cues differentiate sounds that are produced at different places of articulation or manners of articulation. The speech system must also combine these cues to determine the category of a specific speech sound. This is often thought of in terms of abstract representations of phonemes. These representations can then be combined for use in word recognition and other language processes.

It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound:

At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find, even after some forty-five years of research on the problem.[1]

If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests using speech synthesizers would be sufficient to determine such a cue or cues. However, there are two significant obstacles:

  1. One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases (like American English /ɛ/ and /æ/) it can distinguish the identity of vowels.[2] Some experts even argue that duration can help in distinguishing of what is traditionally called short and long vowels in English.[3]
  2. One linguistic unit can be cued by several acoustic properties. For example in a classic experiment, Alvin Liberman (1957) showed that the onset formant transitions of /d/ differ depending on the following vowel (see Figure 1) but they are all interpreted as the phoneme /d/ by listeners.[4]

Linearity and the segmentation problem

File:Spectrogram of I owe you.png

Figure 2: A spectrogram of the phrase "I owe you". There are no clearly distinguishable boundaries between speech sounds.

Although listeners perceive speech as a stream of discrete units (phonemes, syllables, and words), this linearity is difficult to be seen in the physical speech signal (see Figure 2 for an example). Speech sounds do not strictly follow one another, rather, they overlap.[5] A speech sound is influenced by the ones that precede and the ones that follow. This influence can even be exerted at a distance of two or more segments (and across syllable- and word-boundaries)[5].

Having disputed the linearity of the speech signal, the problem of segmentation arises: one encounters serious difficulties trying to delimit a stretch of speech signal as belonging to a single perceptual unit. This can be again illustrated by the fact that the acoustic properties of the phoneme /d/ will depend on the identity of the following vowel (because of coarticulation).

Lack of Invariance

The research and application of speech perception has to deal with several problems which result from what has been termed the lack of invariance. As was suggested above, reliable constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:

  • Context-induced variation. Phonetic environment affects the acoustic properties of speech sounds. For example, /u/ in English is fronted when surrounded by coronal consonants.[6] Or, the VOT values marking the boundary between voiced and voiceless stops are different for labial, alveolar and velar stops and they shift under stress or depending on the position within a syllable.[7]
  • Variation due to differing speech conditions. One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives, stops vs. glides, voiced vs. voiceless stops, etc.) and they are certainly affected by changes in speaking tempo.[1] Another major source of variation is articulatory carefulness versus sloppiness which is typical for connected speech (articulatory ‘undershoot’ is obviously reflected in the acoustic properties of the sounds produced).
  • Variation due to different speaker identity. The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices having different pitch. Because speakers have vocal tracts of different sizes (due to sex and age especially) the resonant frequencies (formants), which are important for recognition of speech sounds, will vary in their absolute values across individuals[8] (see Figure 3 for an illustration of this). Dialect and foreign accent cause variation as well.

Perceptual constancy and normalization

File:Standard and normalized vowel space2.png

Figure 3: The left panel shows the 3 peripheral American English vowels /i/, /ɑ/, and /u/ in a standard F1 by F2 plot (in Hz). The mismatch between male, female, and child values is apparent. In the right panel formant distances (in Bark) rather than absolute values are plotted using the normalization procedure proposed by Syrdal and Gopal in 1986.[9]. Formant values are taken from Hillenbrand et al. (1995)[8]

Given the lack of invariance, it is remarkable that listeners perceive vowels and consonants produced under different conditions and by different speakers as constant categories. It has been proposed that this is achieved by means of the perceptual normalization process in which listeners filter out the noise (i.e. variation) to arrive at the underlying category. Vocal-tract-size differences result in formant-frequency variation across speakers; therefore a listener has to adjust his/her perceptual system to the acoustic characteristics of a particular speaker. This may be accomplished by considering the ratios of formants rather than their absolute values.[9][10][11] This process has been called vocal tract normalization (see Figure 3 for an example). Similarly, listeners are believed to adjust the perception of duration to the current tempo of the speech they are listening to – this has been referred to as speech rate normalization.

Whether or not normalization actually takes place and what is its exact nature is a matter of theoretical controversy (see theories below). Perceptual constancy is a phenomenon not specific to speech perception only; it exists in other types of perception too.

Categorical perception

Main article: Categorical perception
File:Categorization-and-discrimination-curves.png

Figure 4: Example identification (red) and discrimination (blue) functions

Categorical perception is involved in processes of perceptual differentiation. We perceive speech sounds categorically, that is to say, we are more likely to notice the differences between categories (phonemes) than within categories. The perceptual space between categories is therefore warped, the centers of categories (or 'prototypes') working like a sieve[12] or like magnets[13] for in-coming speech sounds.

Let us consider an artificial continuum between a voiceless and a voiced bilabial stop where each new step differs from the preceding one in the amount of VOT. The first sound is a pre-voiced [b], i.e. it has a negative VOT. Then, increasing the VOT, we get to a point where it is zero, i.e. the stop is a plain unaspirated voiceless [p]. Gradually, adding the same amount of VOT at a time, we reach the point where the stop is a strongly aspirated voiceless bilabial [pʰ]. (Such a continuum was used in an experiment by Lisker and Abramson in 1970.[14] The sounds they used are available online.) In this continuum of, for example, seven sounds, native English listeners will identify the first three sounds as /b/ and the last three sounds as /p/ with a clear boundary between the two categories.[14] A two-alternative identification (or categorization) test will yield a discontinuous categorization function (see red curve in Figure 4).

If we test the ability to discriminate between two sounds with varying VOT values but having a constant VOT distance from each other (20 ms, for instance), listeners are likely to perform at chance level if both sounds fall within the same category and at a nearly 100% level if each sound falls in a different category (see the blue discrimination curve in Figure 4).

The conclusion to draw from both the identification and the discrimination test is that listeners show different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories is crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
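
The relation between the two curves in Figure 4 can be illustrated with a short sketch: identification along a hypothetical VOT continuum is modelled here as a logistic function, and a deliberately simplified discrimination prediction is derived from the labelling probabilities alone, so that accuracy peaks where neighbouring steps straddle the category boundary. The boundary location, slope, and the label-based prediction rule are illustrative assumptions, not values from Lisker and Abramson's study.

    import numpy as np

    vot_ms = np.arange(-30, 91, 20)       # a hypothetical 7-step VOT continuum (ms)
    boundary_ms, slope = 25.0, 0.25       # assumed English-like boundary and sharpness

    # Identification: probability of labelling each step as /p/ (a logistic curve
    # gives the sharp, discontinuous-looking function of Figure 4).
    p_labelled_p = 1.0 / (1.0 + np.exp(-slope * (vot_ms - boundary_ms)))

    # A simplified label-based prediction of pair discrimination: neighbouring
    # steps that straddle the boundary are labelled differently more often, so
    # predicted accuracy rises from chance (0.5) towards 1 near the boundary.
    predicted_discrimination = 0.5 + 0.5 * np.abs(np.diff(p_labelled_p))

    for vot, p in zip(vot_ms, p_labelled_p):
        print(f"VOT {int(vot):+4d} ms -> P(/p/) = {p:.2f}")
    print("predicted pair discrimination:", np.round(predicted_discrimination, 2))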

Top-down influences on speech perception

The process of speech perception is not necessarily uni-directional. That is, higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in recognition of speech sounds. It may not be necessary, and perhaps not even possible, for a listener to recognize phonemes before recognizing higher units, such as words. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners are able to compensate for missing or noise-masked phonemes using their knowledge of the spoken language.

In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with a cough-like sound. His subjects restored the missing speech sound perceptually without any difficulty and, what is more, were not able to identify accurately which phoneme had been disturbed.[15] Another basic experiment compares recognition of naturally spoken words presented in a sentence (or at least a phrase) with the same words presented in isolation; perception accuracy usually drops in the isolation condition. Garnes and Bond (1976) also used carrier sentences when researching the influence of semantic knowledge on perception. They created a series of words differing in one phoneme (bay / day / gay, for example). The quality of the first phoneme changed along a continuum. All these stimuli were put into different sentences, each of which made sense with only one of the words. Listeners tended to judge the ambiguous words (those whose first segment was at the boundary between categories) according to the meaning of the whole sentence.[16]
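
The following toy sketch (not a model taken from the cited studies) shows one simple way such top-down effects are often conceptualized: sentence context acts as a prior that is combined with ambiguous acoustic evidence for the candidate words, so the contextually plausible word wins. All of the numbers are invented.

    # Ambiguous first segment: the acoustic evidence barely favours any word.
    acoustic_likelihood = {"bay": 0.34, "day": 0.33, "gay": 0.33}
    # Hypothetical sentence context that makes sense only with "bay".
    context_prior = {"bay": 0.70, "day": 0.15, "gay": 0.15}

    posterior = {w: acoustic_likelihood[w] * context_prior[w] for w in acoustic_likelihood}
    total = sum(posterior.values())
    posterior = {w: round(p / total, 2) for w, p in posterior.items()}
    print(posterior)   # the contextually plausible word wins despite the ambiguous acoustics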

Research topics

Infant speech perception

Infants begin the process of language acquisition by being able to detect very small differences between speech sounds. They can discriminate all possible speech contrasts (phonemes). Gradually, as they are exposed to their native language, their perception becomes language-specific, i.e. they learn how to ignore the differences within the phonemic categories of that language (differences that may well be contrastive in other languages; for example, English distinguishes two voicing categories of stop consonants, whereas Thai has three, so infants must learn which differences their native language treats as distinctive and which it does not). As infants learn to sort incoming speech sounds into categories, ignoring irrelevant differences and reinforcing the contrastive ones, their perception becomes categorical. Infants learn to contrast the different vowel phonemes of their native language by approximately 6 months of age; the native consonantal contrasts are acquired by 11 or 12 months of age.[17] Some researchers have proposed that infants may be able to learn the sound categories of their native language through passive listening, using a process called statistical learning. Others even claim that certain sound categories are innate, that is, genetically specified (see the discussion about innate versus acquired categorical distinctiveness).
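
As a toy illustration of the distributional-learning idea (an assumption-laden sketch, not a model of infant learning), an unsupervised two-component Gaussian mixture can recover two VOT categories from unlabeled tokens; the VOT distributions below are simply invented to resemble short-lag and long-lag stops.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    vot = np.concatenate([rng.normal(5, 8, 300),      # short-lag tokens (/b/-like), in ms
                          rng.normal(60, 12, 300)])   # long-lag tokens (/p/-like), in ms

    # Fit two categories without any labels, as a stand-in for passive listening.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(vot.reshape(-1, 1))
    print("learned category means (ms):", np.sort(gmm.means_.ravel()).round(1))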

If day-old babies are presented with their mother's voice speaking normally, their mother's voice speaking abnormally (in monotone), and a stranger's voice, they react only to their mother's voice speaking normally. When a human sound and a non-human sound are played, babies turn their head only towards the source of the human sound. It has been suggested that auditory learning begins already in the prenatal period.[18]

How do researchers know whether infants can distinguish between speech sounds? One of the techniques used to examine how infants perceive speech, besides the head-turn procedure mentioned above, is measuring their sucking rate. In such an experiment, a baby sucks on a special nipple while being presented with sounds. First, the baby's normal sucking rate is established. Then a stimulus is played repeatedly. When the baby hears the stimulus for the first time, the sucking rate increases, but as the baby becomes habituated to the stimulation the sucking rate decreases and levels off. Then a new stimulus is played to the baby. If the baby perceives the newly introduced stimulus as different from the background stimulus, the sucking rate shows an increase.[18] The sucking-rate and the head-turn methods are among the more traditional, behavioral methods for studying speech perception. Among the newer methods (see Research methods below), near-infrared spectroscopy (NIRS) is widely used with infants.[17]
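
The decision logic of such a sucking-rate study can be summarized in a few lines; the criterion used below is an invented illustration, not a threshold taken from any particular experiment.

    def dishabituated(habituated_rate, post_switch_rate, criterion=1.25):
        # Renewed interest is inferred when sucking after the stimulus change
        # clearly exceeds the habituated (levelled-off) rate.
        return post_switch_rate > criterion * habituated_rate

    print(dishabituated(habituated_rate=28, post_switch_rate=41))  # True: change detected
    print(dishabituated(habituated_rate=28, post_switch_rate=30))  # False: no evidence of discrimination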

Cross-language and second-language speech perception

A large body of research focuses on how users of a language perceive foreign speech (referred to as cross-language speech perception) or second-language speech (second-language speech perception). The latter falls within the domain of second language acquisition.

Languages differ in their phonemic inventories. Naturally, this creates difficulties when a foreign language is encountered. For example, if two foreign-language sounds are assimilated to a single mother-tongue category, the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English have problems identifying or distinguishing the English liquids /l/ and /r/.[19]

Best (1995) proposed the Perceptual Assimilation Model, which describes possible cross-language category assimilation patterns and predicts their consequences.[20] Flege (1995) formulated the Speech Learning Model, which combines several hypotheses about second-language (L2) speech acquisition and which predicts, in simple terms, that an L2 sound that is not too similar to a native-language (L1) sound will be easier to acquire than an L2 sound that is relatively similar to an L1 sound (because the former will be perceived as more obviously 'different' by the learner).[21]
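
The following sketch illustrates the general logic behind predictions of this kind (a deliberate simplification inspired by the Perceptual Assimilation Model, not Best's or Flege's actual formalization): how well an L2 contrast is discriminated depends on whether its two sounds map onto one or two native categories, and on how good a fit each sound is. The categories, goodness values, and cut-off are all invented.

    def predicted_discrimination(l1_category_a, l1_category_b,
                                 goodness_a=1.0, goodness_b=1.0):
        if l1_category_a != l1_category_b:
            return "two-category assimilation: discrimination expected to be good"
        if abs(goodness_a - goodness_b) > 0.3:
            return "category-goodness difference: discrimination expected to be moderate"
        return "single-category assimilation: discrimination expected to be poor"

    # Japanese listeners assimilating English /r/ and /l/ to a single native category:
    print(predicted_discrimination("r-like", "r-like", goodness_a=0.6, goodness_b=0.5))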

Speech perception in language or hearing impairment

Research into how people with language or hearing impairment perceive speech is not intended only to discover possible treatments; it can also provide insight into the principles underlying non-impaired speech perception. Two areas of research can serve as examples:

  • Listeners with aphasia. Aphasia affects both the expression and reception of language. The two most common types, Broca's and Wernicke's aphasia, affect speech perception to some extent. Broca's aphasia causes moderate difficulties for language understanding; the effect of Wernicke's aphasia on understanding is much more severe. It is generally agreed that people with aphasia suffer from perceptual deficits: they are usually unable to fully distinguish place of articulation and voicing.[22] As for other features, the difficulties vary. It has not yet been established whether low-level speech-perception skills are affected in aphasia sufferers or whether their difficulties are caused by higher-level impairment alone.[22]
  • Listeners with cochlear implants. Cochlear implantation allows partial restoration of hearing in deaf people. The acoustic information conveyed by an implant is usually sufficient for implant users to properly recognize the speech of people they know, even without visual cues.[23] It is more difficult for cochlear implant users to understand unknown speakers and sounds. The perceptual abilities of children that received an implant after the age of two are significantly better than those of people who were implanted in adulthood. A number of factors have been shown to influence perceptual performance, in particular the duration of deafness prior to implantation, the age of onset of deafness, the age at implantation (such age effects may be related to the Critical period hypothesis), and the duration of implant use. There are differences between children with congenital and acquired deafness. Postlingually deaf children achieve better results than the prelingually deaf and adapt to a cochlear implant faster.[23]

Noise

One of the basic problems in the study of speech is how to deal with the noise in the speech signal. This is shown by the difficulty that computer speech recognition systems have with recognizing human speech. These programs can do well at recognizing speech when they have been trained on a specific speaker's voice, and under quiet conditions. However, these systems often do poorly in more realistic listening situations where humans are able to understand speech without difficulty.
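
Studies of speech perception in noise (and tests of recognition systems) typically require mixing clean speech with noise at a controlled signal-to-noise ratio. A minimal sketch of that common procedure is given below; the signals are random placeholders rather than real recordings.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        # Scale the noise so that the speech-to-noise power ratio equals the
        # target SNR (in dB), then add the two signals.
        noise = noise[:len(speech)]
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2)
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise

    rng = np.random.default_rng(0)
    clean = rng.normal(size=16000)    # stand-in for one second of speech at 16 kHz
    babble = rng.normal(size=16000)   # stand-in for background noise
    noisy = mix_at_snr(clean, babble, snr_db=0.0)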

Research methods

The methods used in speech perception research can be roughly divided into three groups: behavioral, computational, and, more recently, neurophysiological methods. Behavioral experiments rely on the participant taking an active role: subjects are presented with stimuli and asked to make conscious decisions about them. This can take the form of an identification test, a discrimination test, similarity rating, etc. These types of experiments help to provide a basic description of how listeners perceive and categorize speech sounds.

Computational modeling has also been used to simulate how speech may be processed by the brain to produce behaviors that are observed. Computer models have been used to address several questions in speech perception, including how the sound signal itself is processed to extract the acoustic cues used in speech, as well as how speech information is used for higher-level processes, such as word recognition.[24]
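
A toy sketch in the spirit of interactive-activation models such as TRACE (it is not the published implementation, and the miniature lexicon and phoneme coding are invented) can illustrate how word units accumulate support from incoming phonemes:

    lexicon = {"cat": "kat", "cap": "kap", "map": "map"}   # toy phoneme strings

    def word_activations(heard_so_far):
        activation = {word: 0.0 for word in lexicon}
        for position, phoneme in enumerate(heard_so_far):
            for word, phonemes in lexicon.items():
                if position < len(phonemes) and phonemes[position] == phoneme:
                    activation[word] += 1.0   # bottom-up support from a matching phoneme
        total = sum(activation.values()) or 1.0
        return {word: act / total for word, act in activation.items()}

    print(word_activations("ka"))   # "cat" and "cap" share most activation; "map" gets less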

Neurophysiological methods rely on information stemming from more direct and not necessarily conscious (pre-attentive) processes. Subjects are presented with speech stimuli in different types of tasks and the responses of the brain are measured. The brain itself can be more sensitive than it appears to be through behavioral responses. For example, a subject may not show sensitivity to the difference between two speech sounds in a discrimination test, yet brain responses may reveal sensitivity to these differences.[17] Methods used to measure neural responses to speech include event-related potentials, magnetoencephalography, and near-infrared spectroscopy. One important response used with event-related potentials is the mismatch negativity, which occurs when speech stimuli are acoustically different from a stimulus that the subject heard previously.
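
In its simplest form, the mismatch response is quantified by averaging the epochs recorded for the frequent "standard" stimulus and for the rare "deviant" stimulus and subtracting the two; the sketch below shows that arithmetic with random placeholder data in place of real EEG recordings.

    import numpy as np

    def mismatch_response(standard_epochs, deviant_epochs):
        # Both arguments: arrays of shape (n_trials, n_samples).
        # The MMN appears as a negativity in the deviant-minus-standard wave.
        return deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)

    rng = np.random.default_rng(0)
    standards = rng.normal(size=(200, 256))   # placeholder epochs for the standard stimulus
    deviants = rng.normal(size=(40, 256))     # placeholder epochs for the deviant stimulus
    difference_wave = mismatch_response(standards, deviants)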

Neurophysiological methods were introduced into speech perception research for several reasons:

Behavioral responses may reflect late, conscious processes and be affected by other systems such as orthography, and thus they may mask a listener's ability to recognize sounds based on lower-level acoustic distributions.[25]

Since active participation in the test is not required, even infants can be tested; this feature is crucial in research into acquisition processes. The possibility of observing low-level auditory processes independently of higher-level ones makes it possible to address long-standing theoretical issues, such as whether or not humans possess a specialized module for perceiving speech[26][27] or whether or not some complex acoustic invariance (see lack of invariance above) underlies the recognition of a speech sound.[28]

Theories

Research into speech perception (SP) has by no means explained every aspect of the processes involved; much of what has been said about SP remains a matter of theory. Several theories have been devised to account for some of the issues mentioned above and for other open questions. Not all of them provide satisfactory explanations of every problem, but the research they have inspired has yielded a great deal of useful data.

Motor theory of SP

Some of the earliest work in the study of how humans perceive speech sounds was conducted by Alvin Liberman and his colleagues at Haskins Laboratories.[29] Using a speech synthesizer, they constructed speech sounds that varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /gɑ/. Listeners were asked to identify which sound they heard and to discriminate between two different sounds. The results of the experiment showed that listeners grouped sounds into discrete categories, even though the sounds they were hearing were varying continuously. Based on these results, they proposed the notion of categorical perception as a mechanism by which humans are able to identify speech sounds.

More recent research using different tasks and methodologies suggests that listeners are highly sensitive to acoustic differences within a single phonetic category, contrary to a strict categorical account of speech perception.

In order to provide a theoretical account of the categorical perception data, Liberman and colleagues[30] worked out the motor theory of speech perception, in which "the complicated articulatory encoding was assumed to be decoded in the perception of speech by the same processes that are involved in production"[1] (this is referred to as analysis-by-synthesis). For instance, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (see above), yet all /d/'s as perceived by a listener fall within one category (voiced alveolar stop), and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments."[1] When describing units of perception, Liberman later abandoned articulatory movements and moved on to the neural commands to the articulators[31] and, later still, to intended articulatory gestures[32]; thus "the neural representation of the utterance that determines the speaker's production is the distal object the listener perceives".[32] The theory is closely related to the modularity hypothesis, which proposes the existence of a special-purpose module that is supposed to be innate and probably human-specific.

The theory has been criticized for not being able to "provide an account of just how acoustic signals are translated into intended gestures"[33] by listeners. Furthermore, it is unclear how indexical information (e.g. talker identity) is encoded/decoded along with linguistically relevant information.

Direct realist theory of SP

The direct realist theory of speech perception (mostly associated with Carol Fowler) is a part of the more general theory of direct realism, which postulates that perception allows us to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. For speech perception, the theory asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e. intended gestures. Listeners perceive gestures not by means of a specialized decoder (as in the Motor Theory) but because information in the acoustic signal specifies the gestures that form it.[34] By claiming that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception, the theory bypasses the problem of lack of invariance.

Fuzzy-logical model of SP

The fuzzy logical theory of speech perception developed by Massaro[35] proposes that people remember speech sounds in a probabilistic, or graded, way. It suggests that people remember descriptions of the perceptual units of language, called prototypes, within which various features may combine. However, features are not just binary (true or false); there is a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category. Thus, when perceiving a speech signal, our decision about what we actually hear is based on the relative goodness of the match between the stimulus information and the values of particular prototypes. The final decision is based on multiple features or sources of information, including visual information (this explains the McGurk effect).[33] Computer models of the fuzzy logical theory have been used to demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners.[36]
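
The kind of multiplicative integration the model proposes can be written in a few lines; the fuzzy support values below are invented, and the sketch is only meant to show how two sources of information are combined and normalized across candidate categories.

    def integrate(auditory_support, visual_support):
        # Multiply the fuzzy support values from each source per category,
        # then normalize so the outputs sum to 1 (relative goodness of match).
        combined = {c: auditory_support[c] * visual_support[c] for c in auditory_support}
        total = sum(combined.values())
        return {c: round(value / total, 2) for c, value in combined.items()}

    # Acoustically ambiguous /ba/-/da/ paired with a visually clear "da":
    print(integrate({"ba": 0.5, "da": 0.5}, {"ba": 0.1, "da": 0.9}))   # "da" dominates, a McGurk-like outcome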

Acoustic landmarks and distinctive features

Main article: Acoustic landmarks and distinctive features

In addition to the proposals of Motor Theory and Direct Realism about the relation between phonological features and articulatory gestures, Kenneth N. Stevens proposed another kind of relation: between phonological features and auditory properties. According to this view, listeners inspect the incoming signal for so-called acoustic landmarks, which are particular events in the spectrum carrying information about the gestures that produced them. Since these gestures are limited by the capacities of the human articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. The acoustic properties of the landmarks constitute the basis for establishing the distinctive features, and bundles of these features uniquely specify phonetic segments (phonemes, syllables, words).[37]

Exemplar theory

Exemplar models of speech perception differ from the four theories mentioned above, which suppose that there is no connection between word- and talker-recognition and that the variation across talkers is 'noise' to be filtered out.

The exemplar-based approaches claim that listeners store information for word recognition as well as talker recognition. According to this theory, particular instances of speech sounds are stored in the memory of a listener. In the process of speech perception, the remembered instances of, for example, a syllable stored in the listener's memory are compared with the incoming stimulus so that the stimulus can be categorized. Similarly, when recognizing a talker, all the memory traces of utterances produced by that talker are activated and the talker's identity is determined. Supporting this theory are several experiments reported by Johnson[11] suggesting that signal identification is more accurate when we are familiar with the talker or when we have a visual representation of the talker's gender. When the talker is unpredictable or the sex is misidentified, the error rate in word identification is much higher.
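
A toy exemplar-style categorizer (in the spirit of exemplar models generally, not any specific published implementation) makes the comparison process concrete: an incoming token is compared with every stored trace, similarity is summed per category, and talker labels are stored alongside; all of the values below are invented.

    import math

    stored_exemplars = [            # (VOT in ms, category, talker) for each remembered token
        (2, "b", "talker1"), (8, "b", "talker2"),
        (55, "p", "talker1"), (70, "p", "talker2"),
    ]

    def categorize(vot_ms, sensitivity=0.05):
        support = {}
        for value, category, _talker in stored_exemplars:
            similarity = math.exp(-sensitivity * abs(vot_ms - value))   # decays with distance
            support[category] = support.get(category, 0.0) + similarity
        return max(support, key=support.get)

    print(categorize(15))   # "b"
    print(categorize(48))   # "p"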

The exemplar models face several objections, two of which are (1) insufficient memory capacity to store every utterance ever heard and, concerning the ability to produce what was heard, (2) the question of whether the talker's own articulatory gestures are also stored or computed when producing utterances that would sound like the auditory memories.[33][11]


Prominent workers in the field

Journals

See also

References

  1. 1.0 1.1 1.2 1.3 Nygaard, L.C., Pisoni, D.B. (1995). "Speech Perception: New Directions in Research and Theory". Handbook of Perception and Cognition: Speech, Language, and Communication. Ed. J.L. Miller, P.D. Eimas. San Diego: Academic Press. 
  2. Klatt, D.H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America 59(5): 1208-1221.
  3. Halle, M., Mohanan, K.P. (1985). Segmental phonology of modern English. Linguistic Inquiry 16(1): 57-116.
  4. Liberman, A.M. (1957). Some results of research on speech perception. Journal of the Acoustical Society of America 29(1): 117-123.
  5. 5.0 5.1 Fowler, C. A. (1995). "Speech production". Handbook of Perception and Cognition: Speech, Language, and Communication. Ed. J.L. Miller, P.D. Eimas. San Diego: Academic Press. 
  6. Hillenbrand, J.M., Clark, M.J., Nearey, T.M. (2001). Effects of consonant environment on vowel formant patterns. Journal of the Acoustical Society of America 109(2): 748–763.
  7. Lisker, L., Abramson, A.S. (1967). Some effects of context on voice onset time in English stops. Language and Speech 10: 1-28.
  8. 8.0 8.1 Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America 97: 3099-3111.
  9. 9.0 9.1 Syrdal, A.K., Gopal, H.S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America 79: 1086-1100.
  10. Strange, W. (1999). "Perception of vowels: Dynamic constancy". The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Ed. J.M. Pickett. Needham Heights (MA): Allyn & Bacon. 
  11. 11.0 11.1 11.2 Johnson, K. (2005). "Speaker Normalization in speech perception". The Handbook of Speech Perception. Ed. Pisoni, D.B., Remez, R. Oxford: Blackwell Publishers. Retrieved on 2007-05-17.
  12. Trubetzkoy, Nikolay S. (1969). Principles of phonology, Berkeley and Los Angeles: University of California Press.
  13. Iverson, P., Kuhl, P.K. (1995). Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling. Journal of the Acoustical Society of America 97(1): 553-562.
  14. 14.0 14.1 Lisker, L., Abramson, A.S. (1970). "The voicing dimension: Some experiments in comparative phonetics" (PDF). Proc. 6th International Congress of Phonetic Sciences: 563-567, Prague: Academia. Retrieved on 2007-05-17. 
  15. Warren, R.M. (1970). Restoration of missing speech sounds. Science 167: 392-393.
  16. Garnes, S., Bond, Z.S. (1976). "The relationship between acoustic information and semantic expectation". Phonologica 1976: 285-293. 
  17. 17.0 17.1 17.2 Minagawa-Kawai, Y., Mori, K., Naoi, N., Kojima, S. (2006). Neural Attunement Processes in Infants during the Acquisition of a Language-Specific Phonemic Contrast. The Journal of Neuroscience 27(2): 315-321.
  18. 18.0 18.1 Crystal, David (2005). The Cambridge Encyclopedia of Language, Cambridge: CUP.
  19. Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., Siebert, C. (2003). A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition 89: B47–B57.
  20. Best, C.T. (1995). "A direct realist view of cross-language speech perception: New Directions in Research and Theory". Speech perception and linguistic experience: Theoretical and methodological issues. Ed. Winifred Strange. Baltimore: York Press. 171–204.
  21. Flege, J. (1995). "Second language speech learning: Theory, findings and problems". Speech perception and linguistic experience: Theoretical and methodological issues. Ed. Winifred Strange. Baltimore: York Press. 233–277.
  22. 22.0 22.1 Csépe, V., Osman-Sagi, J., Molnar M., Gosy M. (2001). Impaired speech perception in aphasic patients: event-related potential and neuropsychological assessment. Neuropsychologia 39(11): 1194-1208.
  23. 23.0 23.1 Loizou, P. (1998). Introduction to cochlear implants. IEEE Signal Processing Magazine 39(11): 101-130.
  24. McClelland, J. L. and Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology 18: 1-86.
  25. Kazanina, N., Phillips, C., Idsardi, W. (2006). "The influence of meaning on the perception of speech sounds" (PDF). PNAS 30: 11381-11386. Retrieved on 2007-05-19. 
  26. Gocken, J. M. & Fox R. A. (2001). Neurological Evidence in Support of a Specialized Phonetic Processing Module. Brain and Language 78: 241-253.
  27. Dehaene-Lambertz, G., Pallier, C., Serniclaes, W., Sprenger-Charolles, L., Jobert, A., & Dehaene, S. (2005). Neural correlates of switching from auditory to speech perception. NeuroImage 24: 21-33.
  28. Näätänen, R. (2001). The perception of speech sounds by the human brain as reflected by the mismatch negativity (MMN) and its magnetic equivalent (MMNm). Psychophysiology 38: 1-21.
  29. Liberman, A.M., Harris, K.S., Hoffman, H.S., Griffith, B.C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology 54: 358-368.
  30. Liberman, A.M., Cooper, F.S., Shankweiler, D.P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review 74: 431-461.
  31. Liberman, A. M. (1970). The grammars of speech and language. Cognitive Psychology 1: 301-323.
  32. 32.0 32.1 Liberman, A. M. & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition 21: 1-36.
  33. 33.0 33.1 33.2 Hayward, Katrina (2000). Experimental Phonetics: An Introduction, Harlow: Longman.
  34. Diehl, R., Lotto, A., Holt, L. (2004). Speech perception. Annual Review of Psychology 55: 149–179.
  35. Massaro, D.W. (1989). Testing between the TRACE Model and the Fuzzy Logical Model of Speech perception. Cognitive Psychology 21: 398-421.
  36. Oden, G. C., Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review 85: 172-191.
  37. Stevens, K.N. (2002). Toward a model of lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America 111(4): 1872-1891.



{{enWP|Speech perception}}