Speaker State
Emotional Speech
Jennifer Venditti, Jackson Liscombe, and I have looked at methods of eliciting both subjective and objective judgments and of correlating judgments of single tokens on multiple emotion scales -- i.e., if subjects rate a token high for frustration, what other emotional states do they also rate it high for, and which do they rate low ("Classifying Subjective Ratings of Emotional Speech," Eurospeech 2003). We conducted eye-tracking experiments that allow us to compare subjective judgments to more objective cues to the decision process. We have also worked with colleagues at the University of Pittsburgh to study speaker state in student speech in a tutorial system, examining emotional states such as anger, frustration, confidence, and uncertainty ("Detecting Certainness in Spoken Tutorial Dialogues," INTERSPEECH 2005). We have
also studied question form and function in this domain and performed machine learning experiments to identify Question-Bearing Turns, as well as their form and function, automatically (“Detecting question-bearing turns in spoken tutorial dialogues” and “Intonational cues to student questions in tutoring dialogs”, INTERSPEECH 2006). Agus Gravano, Elisa Sneed, Gregory Ward, and I
have also looked at intonational contour and syntactic construction in the
conveyance of speaker certainty (“The effect
of contour type and epistemic modality on the assessment of speaker certainty”,
Speech Prosody 2008), and Frank Enos and I have proposed a new
methodology for eliciting emotional speech in “A
framework for eliciting emotional speech: Capitalizing on the actor's process”,
LREC Workshop on Emotional Corpora.
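As a rough illustration of the rating-correlation analysis described above (a minimal sketch on invented data, not the method of the Eurospeech 2003 paper), the following computes pairwise Spearman correlations between ratings of the same tokens on several emotion scales:

    # Illustrative sketch only: correlate ratings of the same tokens across
    # emotion scales to see which states tend to be rated high (or low) together.
    # The scale names and rating values below are invented placeholders.
    import pandas as pd
    from scipy.stats import spearmanr

    ratings = pd.DataFrame({          # one row per token, one column per scale
        "frustration": [4, 5, 2, 1, 3, 4],
        "anger":       [4, 4, 1, 1, 2, 5],
        "confidence":  [2, 1, 4, 5, 3, 2],
        "uncertainty": [3, 4, 2, 1, 3, 4],
    })

    for a in ratings.columns:
        for b in ratings.columns:
            if a < b:                 # each unordered pair once
                rho, p = spearmanr(ratings[a], ratings[b])
                print(f"{a:12s} vs {b:12s}  rho={rho:+.2f}  p={p:.3f}")

A positive rho between, say, frustration and anger would mean tokens rated high on one tend to be rated high on the other; a negative rho marks scales that trade off.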
Deceptive Speech
Frank Enos, Stefan Benus, and I are working with colleagues at SRI/ICSI and the
University of Colorado on automatic methods of distinguishing deceptive from
non-deceptive speech ("Distinguishing
Deceptive from Non-Deceptive Speech," INTERSPEECH 2005; “Detecting
deception using critical segments”, INTERSPEECH 2007). For this work we collected and annotated a
large corpus of deceptive and non-deceptive speech, the CSC Deception Corpus.
We have also looked at the role of pausing in deception (“Pauses in
deceptive speech”,
Speech Prosody 2006) and examined the role of personality in the accuracy of
human judges of deception (“Personality
factors in human deception detection: Comparing human to machine performance”,
INTERSPEECH 2006).
Charismatic Speech
Andrew Rosenberg, Fadi Biadsy, and I are studying the acoustic, prosodic, and lexical cues to charismatic speech in American English ("Acoustic/Prosodic and Lexical Correlates of Charismatic Speech", INTERSPEECH 2005). With Fadi Biadsy we have extended this work to Palestinian Arabic, and with Rolf Carlson (KTH) and Eva Strangert (Umeå) we have investigated cross-cultural perceptions of charisma and their acoustic, prosodic, and lexical correlates (“A cross-cultural comparison of American, Palestinian, and Swedish perception of charismatic speech”, Speech Prosody 2008).
Speech Summarization and Distillation
With Sameer Maskey, Andrew Rosenberg, and Fadi Biadsy, I have worked on speech summarization, exploring new techniques that take advantage of prosodic and acoustic information, in addition to lexical and structural cues, to 'gist' news broadcasts (“Automatic
speech summarization of broadcast news using structural features”,
EUROSPEECH 2003; "Comparing Lexical, Acoustic/Prosodic,
Structural and Discourse Features for Speech Summarization,"
INTERSPEECH 2005; "Summarizing Speech without Text Using Hidden
Markov Models," HLT/NAACL 2006; and “Intonational Phrases for
Speech Summarization”, INTERSPEECH 2008). We have also looked at the
segmentation of news broadcasts into stories ("Story Segmentation of Broadcast News in English, Mandarin and Arabic," HLT/NAACL 2006), the determination of speaker roles (e.g., anchor, reporter, interviewee; see R. Barzilay et al., "Identification of Speaker Role in Radio Broadcasts", AAAI 2000, for earlier work), and the extraction of soundbites (spoken ‘quotes’ included in a show) from broadcasts and the identification of their speakers, as well as biography production (“An unsupervised approach to biography production using Wikipedia”, ACL/NAACL 2008). Elena Filatova, Martin Jansche, Mehrbod Sharifi, and Wisam Dakka are also co-authors of some of this work.
Spoken Dialogue Systems
The Columbia Games Corpus
Agus Gravano, Stefan Benus, and I have been collecting and analyzing a large corpus of spontaneous dialogues, produced by subjects playing a computer game we created. We collected this data to test several theories of the way speakers produce ‘given’ (as opposed to ‘new’) information. We are currently labeling this corpus for intonation, in the ToBI framework; we have also labeled turn-taking behaviors, cue phrases, questions (identified as to form and function), and other aspects of the corpus. This is joint work with Gregory Ward and Elisa Sneed at Northwestern University.
Cue Phrases
Work on cue phrases, or discourse markers, is described in Julia Hirschberg and
Diane Litman, "Empirical Studies
on the Disambiguation of Cue Phrases," Computational Linguistics,
1992 (some figures are missing in this version). More recently, Agus Gravano, Stefan Benus, Lauren Wilcox, Hector Chavez, Shira Mitchell, Ilia Vovsha, and I have been looking at cue phrase production and detection in the
Games corpus (“On the
role of context and prosody in the interpretation of okay”, ACL
2007; “Classification
of discourse functions of affirmative words in spoken dialogue”,
Interspeech 2007; “The prosody
of backchannels in American English”, ICPhS 2007).
Speaker Entrainment
Ani Nenkova, Agus Gravano, and I are looking at various types of speaker entrainment
in the Games Corpus (“High frequency
word entrainment in spoken dialogue”, ACL 2008). We are also examining acoustic/prosodic entrainment.
The Given/New Distinction
Agus Gravano, Ani Nenkova, Gregory Ward,
Elisa Sneed and I have studied the different ways speakers produce ‘given’ vs.
‘new’ information in “Effect
of genre, speaker, and word class on the realization of given and new
information”, INTERSPEECH 2006 and “Intonational
overload: Uses of the H* !H* L- L% contour in read and spontaneous speech”,
Laboratory Phonology 9.
Misrecognitions, Corrections, and Error Awareness
Diane Litman, Marc Swerts, and I have studied the prosodic consequences of recognition errors in Spoken Dialogue Systems. We are studying whether prosodic features of user utterances can tell us (a) whether a speech recognition error has occurred, as a user reacts to it (e.g., System: "Did you say you want to go to ...").
Predicting Prosodic Events
Intonational Variation in Synthetic Speech
Most of my early work on predicting intonational phrase boundaries and prominences was done in the Text-to-Speech synthesis group at Bell Labs. Some papers describing that work are Philipp
Koehn, Steven Abney, Julia Hirschberg, and Michael Collins, "Improving Intonational Phrasing with
Syntactic Information," ICASSP-00; Julia Hirschberg and Pilar Prieto,
"Training intonational
phrasing rules automatically for English and Spanish Text-to-Speech,"
Speech Communication, 1996; Julia Hirschberg, "Pitch Accent in Context: Predicting Intonational Prominence
from Text," Artificial Intelligence, 1993; and Michelle Wang and Julia
Hirschberg, "Automatic
Classification of Intonational Phrase Boundaries," Computer Speech and
Language, 1992. These methods were used
to assign intonational variation automatically in the Bell Labs Text-to-Speech
System. I also collaborated on two
projects in concept-to-speech generation (generating speech from an abstract
representation of the concepts to be conveyed). One of these was with Shimei Pan and Kathy McKeown of Columbia University.
Detecting Prosodic Events
More recent work on prosody detection has been done with Andrew Rosenberg, who has developed new ways to combine energy-based features with other acoustic and lexical features to achieve very high accuracy in prediction. Papers documenting this work include “On the correlation between energy and pitch accent in read English speech”, INTERSPEECH 2006, and “Detecting pitch accent using pitch-corrected energy-based predictors”, INTERSPEECH 2007.
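Purely as an illustration of the general idea of combining an energy feature with a pitch feature for pitch accent detection (a toy sketch on synthetic data, not Rosenberg's actual features or models), one might fit a simple word-level classifier like this:

    # Toy sketch: combine a word-level energy feature with a pitch feature in a
    # logistic-regression pitch-accent detector. All data here are synthetic.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 400
    energy = rng.normal(size=n)       # stand-in for z-scored RMS energy per word
    max_f0 = rng.normal(size=n)       # stand-in for z-scored pitch peak per word

    # Fake ground truth: higher energy and pitch excursion -> more likely accented.
    accented = (0.8 * energy + 0.6 * max_f0 + rng.normal(scale=0.5, size=n) > 0).astype(int)

    X = np.column_stack([energy, max_f0])
    X_tr, X_te, y_tr, y_te = train_test_split(X, accented, test_size=0.25, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))

In real systems the features come from the speech signal aligned to words rather than random numbers, and the cited papers explore considerably more sophisticated energy-based predictors than this toy setup.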
Audio Browsing and Retrieval
Work on our SCAN (Spoken
Content-Based Audio Navigation) browsing and retrieval system is summarized in
John Choi et al., "Spoken
Content-Based Audio Navigation (SCAN)," ICPhS-99. This project combines
ASR and IR technology to enable search of large audio databases, such as
broadcast news archives or voicemail. It started life as `AudioGrep'. Current
collaborators include Steve Abney, Brian Amento, Michiel Bacchiani, Phil
Isenhour, Diane Litman, Larry Stead, and Steve Whittaker. My particular
interests lie in the use of acoustic information to segment audio (Julia
Hirschberg and Christine Nakatani, "Acoustic Indicators of Topic Segmentation," ICSLP-98)
and the study of how people browse and search audio databases such as broadcast
news collections (Steve Whittaker et al., "SCAN: Designing and Evaluating User Interfaces to Support
Retrieval from Speech Archives", SIGIR-99) and voicemail (Steve
Whittaker, Julia Hirschberg and Christine Nakatani, "Play it again: a study of the
factors underlying speech browsing behavior," and Steve Whittaker,
Julia Hirschberg and Christine Nakatani, "All talk and all action: strategies for managing voicemail
messages," both presented at CHI-98). We have also studied how
differences in ASR accuracy (comparing 100%, 84%, 69%, 50% accuracy
transcripts) affect users' ability to perform tasks, finding effects for
transcript accuracy on time to solution, amount of speech played, likelihood of
subjects abandoning the transcript, and various subjective measures; however, our
results hold only when we collapse our four categories into two; i.e., there
are no differences between perfect and 84% accurate transcripts or between 69%
and 50% accurate ones (Litza Stark, Steve Whittaker, and Julia Hirschberg,
"ASR Satisficing: The
effects of ASR accuracy on speech retrieval", ICSLP-00). More recently, we have ported SCAN technology to the voicemail domain in a new application, SCANMail, now in friendly trial: users are able to browse and retrieve
their voicemail by content. See J. Hirschberg et al., "SCANMail: Browsing and Searching
Speech Data by Content Domain" and A. Rosenberg et al., "Caller Identification for the
SCANMail Voicemail Browser" (both presented at Eurospeech 2001).
Meredith Ringel and I have also worked on ranking voicemail messages as to
urgency and distinguishing personal from business messages, using machine
learning techniques ("Automated
Message Prioritization: Making Voicemail Retrieval More Efficient,"
presented at CHI 2002).
Intonation and Discourse Structure
Some results of a long collaboration with Barbara Grosz and Christine Nakatani on the intonational correlates of discourse structure in read and spontaneous speech are described in "A
Prosodic Analysis of Discourse Segments in Direction-Giving Monologues,"
(ACL-96). The BDC corpus (with ToBI labels) is available online. Results of earlier studies of read speech are described in "Some Intonational Characteristics of Discourse Structure" (ICSLP-92).
Intonational Disambiguation
Empirical studies comparing the way native speakers of different languages employ
intonational variation to disambiguate potentially ambiguous utterances are
described in Julia Hirschberg and Cinzia Avesani, "The Role of Prosody in Disambiguating
Potentially Ambiguous Utterances in English and Italian," ESCA
Tutorial and Research Workshop on Intonation, Athens, 1997.
Disfluencies in Spontaneous Speech
Christine Nakatani and Julia Hirschberg, "A
Corpus-based study of repair cues in spontaneous speech," JASA, 1994,
describes studies of the acoustic/prosodic characteristics of self-repairs.
Labeling Conventions and Labeled Corpora
I have been an active participant in the development of the ToBI Labeling Standard for the prosodic labeling of Standard American English (see the ToBI conventions for a quick overview). This standard was developed by a number of researchers from
industry and academia and has been extended for other dialects of English and
for other languages, including Italian, German, Spanish, Japanese and more.
Interlabeler reliability ratings (see John Pitrelli, Mary Beckman, and Julia
Hirschberg, "Evaluation
of Prosodic Transcription Labeling Reliability in the ToBI Framework,"
Proceedings of the Third International Conference on Spoken Language
Processing, Yokohama, September 1994, pp. 123-126) are quite good, and tools and training materials, in PDF and HTML versions with accompanying Praat files, are available. A Wavesurfer version and another Praat version with cardinal examples, created by Agus Gravano, are also available from the Columbia ToBI site.
The Boston Directions Corpus (with ToBI labels) is available online.
Julia Hirschberg
Professor, Computer Science
Columbia University
Department of Computer Science
1214 Amsterdam Avenue
M/C 0401
450 CS Building
New York, NY 10027
email: julia@cs.columbia.edu
phone: (212) 939-7114


