Evaluating the Effectiveness of Information Presentation in a Full End-To-End Dialogue System

1. At the beginning of a dialogue, how does the system decide which category the current user falls into (e.g. student or business person)?
2. What is the time complexity for building the option tree, and for dynamically rebuilding it during the course of the dialogue if changes need to be made?
3. The results of the user study indicate that users were not very confident that they heard all options, and thought there might be better options than the ones they were offered. This seems to be a drawback of deciding which options to present to the user instead of asking the user about their preferences.

1. How did they generate the initial user model for their system? Was it done automatically, or manually through a series of questions and answers?
2. Since they already have the user model, do you think they would have better results if they just picked one choice based upon the constraints, and then fell back to the UMSR approach if the user didn't pick it?
3. They only mentioned a few other approaches to this. Are there others besides UMSR and TownInfo?

1. There seem to be a few parameters that have to be guessed (such as the length of the preference list and the number of target clusters). How do these parameters affect user satisfaction?
2. In the natural language generation, when there are multiple pieces of information to convey, how are the sentences ordered in the output?
3. What are some approaches to making the user models more personalized? How can they adapt over time?

1) The authors say that they extended the database in TownInfo to be more realistic – but their final database only contained 80 restaurants. In an earlier section, they give an example for the Boston area where a user query generated 983 restaurants. How applicable are the results found in this study to a more realistic setting? (And the size of the database is clearly important, since that’s their whole motivation for the UMSR approach.)
2) What is the runtime of this approach? The authors state that they rebuild the tree after every user query, and from what I can tell, the tree has sum_{r=1}^{#preferences} 3^r nodes, which may take a long time to reconstruct. (And if the database contains 800 instead of 80 restaurants, I can see this approach taking a very long time.) A rough sketch of this estimate follows this set of comments.
3) In terms of the user model, the authors state that they expect a consistent user model; they may modify it during the dialogue, but overall it stays the same (e.g. the user prefers Indian food that is cheap and close to the city center). I do not see how that is applicable to a real-life situation – for the most part, I do not have a preferred cuisine (although some days I may have a craving for Chinese or Thai, for example), and I may be looking for restaurants in whatever area I currently find myself in (which has nothing to do with the location of my house or any other fixed location). In this case, the system assuming that I am looking for Indian cuisine (based on my user model) would be an annoying part of the dialogue that I would have to fix every single time. Can their UMSR approach work with a user model that is constructed from scratch every time? What are the implications for runtime and system likeability?
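A rough back-of-envelope check of the node-count worry in 2) above. This is only a sketch under the reviewer's own assumption of three-way branching per preference attribute; the helper name `option_tree_nodes` is invented here, and the paper's actual tree construction may differ.

```python
def option_tree_nodes(num_preferences, branching=3):
    # Geometric series: 3 + 9 + ... + 3^k = (3^(k+1) - 3) / 2
    return sum(branching ** r for r in range(1, num_preferences + 1))

for k in (3, 5, 8):
    print(k, option_tree_nodes(k))
# prints: 3 39, 5 363, 8 9840
```

If the quoted formula is right, the bound grows with the number of preference attributes rather than with the database size (and in practice the tree cannot have more leaves than matching entries), so with a handful of attributes the rebuild stays in the thousands of nodes; whether it remains cheap for much larger databases and richer user models is exactly the open question raised here.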
I think this paper covered a very logical extension of previous work, combining user-model-based systems with the summarize-and-refine approach. They then compared it with a sequential system, which I didn't fully understand the workings of. When they discuss their results, I did not understand the difference between perceived task completion and actual task completion; there was quite a large gap between the two as well. How were these measured, and how do they actually differ? Their results end up being less significant than I expected. I don't know if that was because the previous system was good enough or because their method failed to be as large an improvement as expected. Either way, I was a little disappointed. I appreciated that they did include the actual p-values.

There are 80 restaurants in the system, and the average number of dialogue turns for UMSR is 9.24. If we do a simple calculation, we can see that each dialogue turn eliminates around 1/3 of the restaurants. I am wondering how the number of turns would change as the number of restaurants in the system increases (a back-of-envelope check of this scaling appears at the end of this paper's comments). One thing I am curious about, which the paper does not state clearly, is which user model the system would use for future dialogues after the user has interacted with it. Should it use the predefined user model, or the updated user model that reflects previous situational adjustments?

This paper covered a lot of relevant techniques and approaches developed by other researchers to build an end-to-end dialogue system. They said that previous works did the evaluation using an “overhearer” methodology, whereas they did the evaluation in a more realistic setting. The evaluation part is innovative, but I didn’t see any innovation besides that.

- Would it make more sense to present the user with different ways to cluster information, and then let him/her choose which one they preferred? This way, the system would learn preferences instead of requiring the user to input them or having to pinpoint a change in their behavior.
- Perhaps the users should be quizzed for comprehension of the summarized results to determine whether they were indeed representative. I know the researchers asked the users if they thought they had gotten an overview of their choices, but that isn't the same thing.
- Does this type of UMSR depend on having only a couple of turns of dialogue and simple decision trees? The system seems to depend on a heavily structured database, requiring knowledge of the relations between each field and the types of data each field contains.

Instead of using questionnaires to rate the various performance measures, could other, more objective measures such as sentence length be used? What does the system fall back to in the case that the restaurants cannot be further filtered?
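Following up on the turn-elimination calculation above: a minimal back-of-envelope model of how the number of turns might scale, assuming each turn keeps roughly 2/3 of the remaining candidates (the estimate above) and that the dialogue ends at about two remaining options. The function name and the stopping target are illustrative assumptions, not anything stated in the paper.

```python
from math import log

def expected_turns(num_restaurants, fraction_kept=2/3, target=2):
    # Number of turns t such that num_restaurants * fraction_kept**t ~= target
    return log(target / num_restaurants) / log(fraction_kept)

for n in (80, 800, 8000):
    print(n, round(expected_turns(n), 1))
# prints: 80 9.1, 800 14.8, 8000 20.5
```

Under this crude model the predicted ~9.1 turns for 80 restaurants is close to the reported 9.24, and the turn count grows only logarithmically with database size, so a ten-fold larger database would add roughly five or six turns rather than ten times as many; whether the real system behaves this way is the open question.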
*************************

Crowdsourcing Language Generation Templates for Dialogue Systems

1. How useful is this technique if a system developer went through all of the crowdsourced templates and only accepted 33% as usable? Is it that much better than a group of system developers brainstorming about different ways to say something?
2. The motivation for this work was to produce a variety of templates to avoid sounding repetitive and unnatural. However, one of the problems the system developer identified was that some of the paraphrased templates were in the incorrect register. Is it possible that you would want a spoken dialogue system to have one consistent way of speaking, in order to create a certain persona?
3. Why did they use ASR output for user turns in the HITs? This introduces noise into the HITs, and might have made things less clear for the crowd workers.

1. Couldn't they have built HITs for paraphrase evaluation that better mirrored the heuristics eventually used by the developer for inclusion? I feel like this wouldn't have required many more resources and would have resulted in a higher success rate.
2. Would testing with only one OOC evaluation HIT potentially cause problems where the new paraphrase doesn't fit well with other contexts? I think they could have improved their experiment by doing paraphrase evaluation with multiple OOCs.
3. I would have liked to see whether these new paraphrases actually led to better user satisfaction when interacting with the SDS.

1. Lack of grammaticality was one of the main reasons why a crowd template was excluded. However, this does not seem to be a good criterion, as a lack of grammaticality could make the system seem more natural.
2. Their approach to excluding spam results is outlier elimination. It would be interesting to see if you could build a basic NLU system to beat this detection.
3. Why are Nao robots used in so many papers? Is it because many robotics projects want to add an SDS component? If that is the case, why do so many robotics projects use Nao?

1) The authors talk about the benefits of crowdsourcing and automating the SDS building process – and yet they still submit all templates to a system developer to assess. First of all, how reliable are the ratings of one individual? And second, wouldn’t a better experiment be to actually implement these crowdsourced paraphrases and see how they work from there?
2) How do we determine if a paraphrase is different enough? (E.g. “I can help you find that” vs. “I’ll help you find that ok” – to me, these two examples are similar enough that they don’t warrant inclusion in the dataset, and yet both would rank high on the makes-sense and same-meaning scales.) The authors don’t say anything on this topic, but it seems to me that many such instances may be generated for phrases that simply do not have a lot of possible variation. (A toy similarity check along these lines is sketched further down.)
3) The developer excluded crowdsourced templates for a variety of reasons, such as lack of grammaticality or punctuation. Could those problems be fixed (perhaps even automatically, through spell-checker tools) in order to augment the paraphrase database?

I think this paper tackled a very interesting problem for dialogue systems. I feel like a lot of dialogue systems use very limited or simple templates for generation, so this take on expanding existing templates was intriguing to me. Perhaps I failed to understand the system or its goals properly, but I was confused by the fact that most of the generated text they were looking to expand upon was clarification from the system about what the user said. I recognize that speech recognition is somewhat lacking, but I imagine the system has other interactions with the user. I appreciated that the researchers included details of how they gathered their data, removed spam, and evaluated it. I think it helps the reproducibility of their experiment. Additionally, its generality allows for reproduction on a different spoken dialogue system.
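On the question in 2) above of whether a paraphrase is "different enough": nothing in the paper (as described in these comments) implies such a filter, but one crude, purely illustrative way to flag near-duplicates would be to score surface overlap between the original template and each candidate. The helper name, the example strings, and any cutoff are assumptions made up for this sketch.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Character-level similarity in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

original = "I can help you find that"
candidates = ["I'll help you find that ok", "Let me look that up for you"]
for cand in candidates:
    print(round(similarity(original, cand), 2), cand)

# A developer could drop candidates that score above some chosen cutoff as
# near-duplicates; surface overlap is of course only a rough proxy for
# "different enough" and says nothing about register or tone.
```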
The purpose of crowdsourcing is to reduce the workload of experts. In their experiment, 90% of the paraphrases were accepted during crowd evaluation, but only 30% were accepted during developer evaluation. Ideally, the percentage accepted in the crowd evaluation should be lower, because that would mean the crowd is doing more of the work of eliminating the paraphrases that do not meet the requirements. A better generation HIT would present the crowd with multiple dialogues that use the same template and ask the crowd to paraphrase the highlighted text while preserving its meaning in all of the dialogues presented. For future work, they are interested in extending beyond paraphrasing single templates to entire system turns. I think the same procedure and approach would work. Are there any changes they would need to make?

- Would the paraphrases have been more suitable had the context included categorization or additional information (that is, clues about what tone, register, etc. the paraphrase should have)?
- How would the results have changed with a greater number of turns in the dialogues, and is 167 too few?
- Are there platforms that allow scientists to crowdsource from only one type of crowd? It appears to me that whether a phrase "makes sense" depends on who is reading it, and that person may not represent the intended user pool.

Is the D-Score something unique to this particular study? If not, what are the theoretical underpinnings of choosing this measure? Could the template data be particularly skewed toward certain patterns of speech? Workers may be of a different background than the target audience. Is the developer selection in any way further refined or substantiated by peer review?