One of our long-term research goals is to build a synthesis model that
can produce spontaneous speech, including disfluencies. A related aim is
to gain a better understanding of which features contribute to the
impression of hesitant speech on a surface level. The current
study investigates acoustic correlates of perceived hesitation, building on
previous work showing that pause duration and final lengthening both
contribute to the perception of hesitation. It is the total increase in
duration, rather than the contribution of either factor alone, that is
the valid cue. The present experiment, which used speech synthesis, was
designed to evaluate F0 slope and the presence vs. absence of creaky voice
before the inserted hesitation, in addition to durational cues. The manipulations
were applied in two syntactic positions: within a phrase and between two
phrases. The results showed that, in addition to the durational increase,
variation in both F0 slope and creaky voice had perceptual effects,
although to a much lesser degree. The results have a bearing on
efforts to model spontaneous speech, including disfluencies, which could
be explored, for example, in spoken dialogue systems.