The perception of dynamic pitch in speech and non-speech
Pitch in speech varies continuously and conveys information about linguistic structure, as well as paralinguistic information about a speaker’s identity and emotional state. Although pitch plays a significant role in communication, we have little understanding of how dynamic pitch is perceived.
The focus of this study was to investigate how the perception of pitch height is affected by the direction of the pitch movement (rise-fall forming a ‘peak’ vs. fall-rise forming a ‘valley’) and by the shape of the F0 turn (sharp vs. plateau). Moreover, we aimed to understand whether pitch contours are perceived differently in speech compared to non-speech, i.e., whether linguistic information interacts with the auditory information. This question arose because in speech perception the shape of the pitch movement affects the perceived height and timing of the pitch event. Two examples are the findings that an F0 plateau following a rise sounds higher than a sharp peak with the same maximum F0, and that pitch perception may differ for the same amount of change in F0 depending on whether it is rising or falling. Whether these perceptual effects are unique to speech sounds has not been explored.
We will discuss the results of an experiment that used 3 stimulus types [speech, nonsense, complex tone], 2 directions of pitch movement [peak, valley] and 2 turning point types [sharp turn, plateau] (all crossed). The speech stimuli were four English sentences: ‘is Lemmy near Nelly?’, ‘is Nelly near Lemmy?’, ‘does Mona know Nina?’, and ‘does Nina know Mona?’. Duration and intensity contours were extracted from these sentences and embedded in the nonsense and complex tone stimuli. Nonsense stimuli were reiterated speech. Complex tones were harmonic complexes with energy between 200 Hz and 6000 Hz. All stimuli had a reference line F0 (F0 at the start and end of the stimulus) of 200 Hz. All stimuli had two F0 turns; the first one always formed a sharp turn and the second one was either a sharp turn or a 100 ms plateau. The difference between the reference line and the first turn was always 5.4 semitones, and that between the reference line and the second turn varied between 3.4 and 7.4 semitones. All stimuli were resynthesised with Praat. Young native English speakers with normal hearing listened to stimuli of all three types and judged which turning point sounded higher for the ‘peak’ stimuli or lower for the ‘valley’ stimuli.
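To make the semitone offsets concrete, the following is a minimal sketch, not the procedure used in the study (the stimuli were resynthesised with Praat). It applies the standard conversion between semitones and Hz relative to the 200 Hz reference line and enumerates the crossed design. The function names are hypothetical, and it assumes that turns lie above the reference line for ‘peak’ stimuli and below it for ‘valley’ stimuli. Only the stated endpoints of the second-turn range (3.4 and 7.4 semitones) are used, because the step size is not given.

```python
import math
from itertools import product

REFERENCE_F0_HZ = 200.0  # reference line F0 stated in the abstract


def semitones_to_hz(offset_st, reference_hz=REFERENCE_F0_HZ):
    """Convert a signed semitone offset from reference_hz into Hz."""
    return reference_hz * 2.0 ** (offset_st / 12.0)


def hz_to_semitones(f0_hz, reference_hz=REFERENCE_F0_HZ):
    """Express f0_hz as a signed semitone offset from reference_hz."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def turn_f0_hz(offset_st, direction):
    """F0 at a turning point (hypothetical helper).

    Assumption: 'peak' turns lie above the reference line and
    'valley' turns lie below it by the same number of semitones.
    """
    sign = 1.0 if direction == "peak" else -1.0
    return semitones_to_hz(sign * offset_st)


# Factors named in the abstract; all levels are crossed (3 x 2 x 2).
STIMULUS_TYPES = ("speech", "nonsense", "complex tone")
DIRECTIONS = ("peak", "valley")
TURN_TYPES = ("sharp", "plateau")
conditions = list(product(STIMULUS_TYPES, DIRECTIONS, TURN_TYPES))

if __name__ == "__main__":
    print(f"{len(conditions)} crossed conditions")
    for direction in DIRECTIONS:
        first = turn_f0_hz(5.4, direction)
        low = turn_f0_hz(3.4, direction)
        high = turn_f0_hz(7.4, direction)
        print(f"{direction}: first turn {first:.1f} Hz, "
              f"second turn between {low:.1f} and {high:.1f} Hz")
```

Under these assumptions, the first turn of a ‘peak’ stimulus sits at roughly 273 Hz. Because semitones are a logarithmic unit, equal semitone offsets above and below the reference line correspond to unequal differences in Hz.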