Making Grounding Decisions: Data-driven Estimation of Dialogue Costs and Confidence Thresholds

I found this paper pretty straightforward and easy to follow. Everything was explained clearly and well-defined. They mention that "confidence scores typically delivered by speech recognizers should not be used as a direct measure of probability". I didn't understand why this was the case. Additionally, how should the probability be determined then? The R-squared value for the "best predictor" [total number of syllables uttered] was kind of low [0.622]. I wonder how this affected the system, as well as what the R-squared value for all feature variables was.

- Since efficiency is used as a proxy for satisfaction that is expressed at the end of the dialogue, it would have been useful to see whether their measures actually correlated (i.e. check that satisfaction and efficiency do align).
- The paper did not conclude on any one particular grounding action that would minimize cost, or if it did, an example of the different dialogues would have been helpful.
- What do they mean by handcrafted thresholds?

● Two sources of uncertainty are ambiguity of language and the error-prone speech recognition process. The level of uncertainty is one of the important factors for making grounding decisions. The model used in this paper only includes the speech recognition confidence score, which measures uncertainty from the speech recognition process. How do dialogue systems measure the level of uncertainty that comes from ambiguity of language?
● Efficiency is an indicator of user satisfaction. The paper says that the best predictor of efficiency is the total number of syllables uttered by both the user and the system. Are there any studies on which affects user satisfaction more, the length of user utterances or that of system utterances?
● Would there be any user utterances in the guiding phase? If so, how are grounding decisions made based on their model?

1. Is the dialogue data in Swedish? If translated to English, how does working with translated data affect the analysis?
2. They use number of syllables as a cost, based on their best predictor for user satisfaction. However, some exchanges might inherently require more syllables than others. Is number of turns a more general measure of efficiency?
3. The author mentions that this method has not been evaluated yet. I'd like to see some evaluation that checks whether this method yields any improvement over previous, simpler methods.

This study was conducted in Sweden – is the data in Swedish? If so, they make some assertions regarding length of utterances (in syllables) and user satisfaction. How can we compare this to English-language data? (The authors could have included some sort of analysis of Swedish that tells us how it differs from English in structure or phonetics.) The data was collected by having users navigate a virtual city – what are the stats of those interactions? How long were the dialogues? What kind of specific issues did the system encounter? What was the WER/misrecognition rate of the ASR? The authors don't seem to have tested their system – they only used it to estimate parameters. In fact, they seem to state that an actual system would redo all of their calculations and get different values – so what was the point of this paper?
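On the earlier question about why raw ASR confidence scores should not be read directly as probabilities, and how a probability could be obtained instead: one common data-driven option (a sketch of the general idea, not necessarily this paper's exact procedure) is to calibrate the scores against labeled recognition outcomes, e.g. with logistic regression. The toy data and names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: the recognizer's raw confidence score for each
# recognized utterance, and whether a transcriber judged the recognition
# correct (1) or not (0).
conf_scores = np.array([0.92, 0.85, 0.40, 0.31, 0.77, 0.55, 0.95, 0.20]).reshape(-1, 1)
is_correct = np.array([1, 1, 0, 0, 1, 1, 1, 0])

# Fit a calibration curve: an estimate of P(recognition correct | raw score).
calibrator = LogisticRegression()
calibrator.fit(conf_scores, is_correct)

# A raw score of 0.7 maps to an estimated probability of correctness, which
# need not equal 0.7 if the recognizer's scores are miscalibrated.
p_correct = calibrator.predict_proba([[0.7]])[0, 1]
print(f"Estimated P(correct | score=0.7) = {p_correct:.2f}")
```

With enough labeled data, the calibrated probability (rather than the raw score) could then be compared against the grounding thresholds the paper discusses.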
A. Why might the best indicator of user satisfaction be number of syllables? Does this mean that users never got frustrated and walked away in this instance? Would this always be a good indicator in situations where users did have that option?
B. Would considering more than one hypothesis per statement result in an increase or decrease in the utility of the parameters used for this experiment?
C. Is the trichotomy of "goal assertion, positioning, and guiding" phases applicable to other dialogues that don't concern 3D models of a city (is this a general concept or a specific one)?

Currently, the cost function only takes into account dialogue efficiency, consequence of task failure, and information gain. Is it possible to take other factors into account to improve the comprehensiveness of the cost function? And currently, the calculation of information gain is not computationally efficient – is it possible to improve this using another algorithm from the field?

****************************************************************

Detecting Inappropriate Clarification Requests in Spoken Dialogue Systems

I didn't know that dialogue systems were capable of so many different types of clarification requests. I'd heard a few from automated phone systems, but more often than not, I just hear simple requests for the user to repeat the entire sentence. The use of a decision tree classifier was interesting to me, since decision tree classifiers are often said to have relatively poor performance. And yet the results were really good, surprisingly. I wonder what other classifiers were tested and how well they performed in comparison. I looked it up and this paper appears to have been published in 2014. Has work begun on adding acoustic and prosodic features yet? I'm curious to see if that can improve the classification even more, particularly in correcting false positives.

- The only downside to the training I can think of is that user responses to inappropriate clarifications may differ between lab and real-life usage, i.e. a person who is in a hurry or has spoken with the system for too long may respond differently than a lab subject.
- Is there another way to catch prosodic responses from users? It seems to me that if the next iterations of the classifier just catch specific exceptions, it will overfit or grow too large.
- This paper was very clear!

This paper proposed a novel way of identifying inappropriate clarification requests. The researchers trained a classifier that can identify whether the clarification prompts given by the system are appropriate or not, based on user responses to the prompts. They listed evaluation of the use of an inappropriate-clarification-request component in a speech-to-speech translation system as one task for future work. How would the information returned by the inappropriate-clarification-request component be used by the system? In other words, if the classifier identifies a clarification request as inappropriate, what would the system do next?
● Is it possible to extend this work and build an n-class (maybe 6-class) classifier so that it does not only identify whether the clarification requests are appropriate, but also, for inappropriate requests, identifies the correct type of clarification the system should give to the user?
● Were there any features that the authors tried but that did not give any improvement to the classifier? Did they try features such as counts of trigrams and the length of the user response?
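Since several of the comments above ask about the decision tree classifier and about which lexical features were tried, here is a minimal sketch of that style of model, assuming scikit-learn; the features, data, and labels are invented for illustration and are not the authors' actual feature set.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical lexical features of each user response to a clarification
# request: [words in response, negation count ("no", "not", ...),
#  words repeated from the system's request, 1 if the response starts
#  with a wh-word else 0]. Label: 1 = the request was appropriate, 0 = not.
X = [
    [3, 0, 2, 0],
    [8, 2, 0, 1],
    [2, 0, 1, 0],
    [12, 1, 0, 1],
    [4, 0, 3, 0],
    [9, 2, 1, 1],
]
y = [1, 0, 1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=3)  # tiny sample, illustration only
print("Mean cross-validated accuracy:", scores.mean())
```

A fitted tree of this kind is also easy to inspect, which may be part of why a decision tree was chosen despite the reputation single trees have for weaker accuracy.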
1. Were the instances of the appropriate/inappropriate clarifications that they created based on realistic instances that came up in actual dialogues? How prevalent are these inappropriate questions?
2. Is the goal to detect inappropriate clarification questions from the user response, in order to switch to a generic question, or is the goal to try to preempt inappropriate questions before they are asked?
3. How would the model that they built to detect inappropriate clarification questions be integrated into an actual dialogue system? For example, parsing can add latency to the system – would syntactic features be included?

The authors say the types of clarification questions asked by the system were the ones used by an SRI speech-to-speech translation system. The clarification questions weren't actually based on real misrecognitions, but on hand-selected ones. How realistic is this scenario, then? In the example that's given, how often is 'furor' actually misrecognized? If it's not misrecognized that often (or, more realistically, if it just doesn't come up in regular human-computer interaction), then what's the point of basing an experiment on that word/phrase? The authors say they prepared 12 sample clarification requests for each of the 19 types of requests, for a total of 228, and each subject answered 114 requests. The total amount of data they gathered is 1938 responses. However, shouldn't there be a condition for each type of clarification-question mismatch (19*18 = 342)? We would probably want several examples of each type of mismatch, and to have several people respond to each mismatch (or match), so wouldn't we want more data? The authors say there were instances when appropriate requests were classified as inappropriate – how many times did that occur? The authors hypothesize this occurred when the users answered the request the system should have asked instead of what it actually asked – can we test this hypothesis?

A. Are rephrase part-requests, which play back part of the user's own utterance, associated with higher or lower frustration among users?
B. What happens when the system asks a user to spell a part of an utterance that the system mispronounces? Or does the system always replay part of the user's utterance when asking for this? (e.g. "How do you spell 'Afdhal'?", but 'Afdhal' is mispronounced.)
C. Is the strategy of removing features one at a time to determine the ones with the heaviest impact a common strategy? (See the ablation sketch below.)

The experimental approach described in the paper can be summarized as simulating appropriate and inappropriate responses and extracting features for the classifier. However, how could this approach be applied to other scenarios? Would the researchers have to retrain the classifier, or could they apply it directly to determine which responses are appropriate? At the current stage of system implementation, only lexical features are used to investigate user responses. As mentioned in the literature, acoustic and prosodic features could also be added. What would the difficulty be in adding such features? How much improvement could foreseeably be obtained, to justify the work and effort devoted to the additional implementation?
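On question C above: removing one feature at a time and re-measuring performance is indeed a common feature-ablation strategy. A minimal sketch, with hypothetical feature names and synthetic data rather than the paper's, might look like this:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features of the user response (not the paper's real features).
feature_names = ["word_count", "negation_count", "prompt_overlap"]
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(40, 3)).astype(float)
# Toy labels tied loosely to one feature so the ablation has something to find.
y = (X[:, 2] > 4).astype(int)

def mean_accuracy(features, labels):
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(clf, features, labels, cv=5).mean()

baseline = mean_accuracy(X, y)
print(f"All features: {baseline:.2f}")

# Leave-one-feature-out ablation: the larger the drop from the baseline,
# the more the classifier depended on that feature.
for i, name in enumerate(feature_names):
    reduced = np.delete(X, i, axis=1)
    print(f"Without {name}: drop of {baseline - mean_accuracy(reduced, y):.2f}")
```

The same loop structure could be used in the other direction, adding candidate acoustic or prosodic features one at a time to see whether the gain justifies the implementation effort.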