Integrated Sentence Planner

It would be interesting to see two of these systems interacting with each other, especially if they were trained on different user populations. How would the two systems adapt to each other? How would they score on the task?

How does changing the features affect performance on the task? Can I get more details on how the features and the grammar are combined to generate a cohesive sentence?

For the first step of mining COREF's dialogue experience, the positive and negative instances are derived. Here, I was a bit confused about the extent to which "the NLG output string matches what the user actually said." I wish this paper also provided at least brief descriptions of each of the outside papers it refers to, for better understanding. As mentioned in the discussion section, it would be useful to adapt to the degree of ambiguity of users' utterances.

This kind of system only works well if the system and the users have similar dialogue roles. Do most dialogue systems fit this model, or do systems usually play a different dialogue role than users? I wonder if this method would work well in a different, less narrow domain.

This COREF task is somewhat contrived – the user and agent describe a limited set of shapes and colors. Real-world situations are much more complex. I like how they describe the previous work and clearly explain how they are building on the research of others.

- What are the parameters used to judge a phrase's ambiguity?
- Isn't it dangerous to base a system's utterances on users' utterances? What if the data used to train the model isn't reliable?
- Although I understand why they wanted to keep the features from DeVault and Stone, it would have been nice to see how other features affected the results.
- I found it hard to believe this paper's conclusion, since it seems hard to generalize the quality of Amazon Turk to all projects based on a single study of it.
- Won't the quality of evaluation be affected by how many dialogues the users were asked to engage in? In the observed trial, 20 dialogues were asked of each user. How many were completed by each Turker?
- To me, the strongest point of this study is that the cost scales better when using Turkers than without.

* The paper states that Jordan and Walker's distractor features were not included, because the architecture only considers unambiguous utterances. What are these features, and how might they affect the performance of the system?
* In collecting the mixed-initiative data, what types of additional data were collected in the additional turns?
* In an example given in the appendix, some ungrammatical or repetitive examples such as "the blue blue object" were included as candidates. Could a more complex language model be included to reduce invalid candidates?

1. What does the paper mean when it talks about 'positive examples' vs. 'negative examples' in training data (section 4)? This wasn't clear for a while, until the end of the data analysis part, when I inferred that they were talking about positive reinforcement vs. negative reinforcement in the update function. (See the sketch following these questions.)
2. How does MALLET select a set of features? Are those features specific to user goals for this dataset? Do they have to be predefined? This is particularly for the presupposition features.
3. Why do the authors assume that humans and robots interact on a more equal footing, whereas in spoken dialogue systems the field is tilted toward one participant?
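On the positive/negative examples in section 4: the following is a minimal, hypothetical Python sketch of how such training instances might be derived from dialogue experience and then used in an update, assuming that labeling amounts to a simple string match between a grammar candidate and what the user actually said in the same context. The Candidate structure and function names are invented, and the perceptron-style update is only a stand-in for the maximum-entropy training the paper performs with MALLET.

from dataclasses import dataclass

@dataclass
class Candidate:
    context_id: str  # identifier of the dialogue context the grammar generated for
    surface: str     # candidate output string produced by the grammar
    features: dict   # feature name -> value (e.g. length, ambiguity indicators)

def derive_instances(candidates, user_utterances):
    """Label a candidate +1 if its surface string matches what the user
    actually said in the same context (after trivial normalization), else -1."""
    instances = []
    for cand in candidates:
        gold = user_utterances.get(cand.context_id, "")
        label = 1 if cand.surface.strip().lower() == gold.strip().lower() else -1
        instances.append((cand.features, label))
    return instances

def perceptron_update(weights, features, label, lr=1.0):
    """One positive/negative reinforcement step on the feature weights."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    if label * score <= 0:  # misclassified: move the weights toward the label
        for f, v in features.items():
            weights[f] = weights.get(f, 0.0) + lr * label * v
    return weights

if __name__ == "__main__":
    cands = [
        Candidate("c1", "the blue square", {"len": 3, "color_words": 1}),
        Candidate("c1", "the blue blue object", {"len": 4, "color_words": 2}),
    ]
    said = {"c1": "the blue square"}
    weights = {}
    for feats, y in derive_instances(cands, said):
        weights = perceptron_update(weights, feats, y)
    print(weights)  # learned feature weights after one pass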
****************** AMT evaluation

I think it's very interesting that many AMT-ers were initially found to be non-native speakers of English. This could potentially be a major problem for people who want to use AMT for certain NLU tasks. How do content and scenario affect the results? For example, if the goal was to find a cheap Chinese restaurant, a user who is particularly ignorant or uncultured may mistakenly settle for a Japanese or Korean restaurant. It seems that if the paper's results had been negative instead, a lot of other researchers' results would have been invalidated...

I appreciated how the paper explained the terminology and the background well in the introduction. I also thought it was interesting to see the effort they put into testing the telephone framework. One question I had was whether the simpler AMT workers would yield the same result with the goal inference algorithm.

How did they come up with their heuristic algorithm to compute objective success rates based on inferred user goals? Did they evaluate these success rates? In general, users were optimistic about the success of a dialogue. If the users perceive a dialogue as successful, even if it wasn't according to the objective metric, should it be considered successful? Is the ultimate goal to have happy users?

Why do they want to encourage Turkers to make more calls on average, instead of aiming for a diverse population with few calls per user? Which model is more similar to real users of the dialogue system?

The paper concludes that recruiting AMT workers was more efficient and cheaper than recruiting Cambridge workers, leading to a greater amount of AMT data relative to Cambridge data. Might this difference in sample size affect the results?

* Cambridge English speakers were not filtered for accent, and objectively different standards were applied to them.
* The study's conclusions seem restricted to domains that the average AMT worker would be capable of interacting with.

1. Isn't this an already-studied topic? Did this paper come out near the advent of Mechanical Turk?
2. The way they 'discovered' whether users were native English speakers or not seemed strange: they performed a 'manual inspection of the data.' It's unclear what threshold they used to determine language proficiency.
3. How can they claim that their findings are significant if they used mismatched ASR and acoustic language models?

1) Conventionally, dialogue systems are evaluated in a controlled environment under supervision. This is very demanding and costly: subjects are required to perform tasks in the lab according to instructions and to evaluate the system consistently on its performance at the end. This paper aims instead to perform evaluation in a real environment using Amazon's Mechanical Turk.
2) Amazon's Mechanical Turk enables large-scale crowdsourcing, where a large number of small tasks are performed by people; each task is referred to as a HIT (Human Intelligence Task).
3) The evaluation uses both a web-based framework and a telephone framework. The latter is intended primarily to ensure that native speakers of English contribute to the dataset, which led to the implementation of a telephone-based evaluation framework.
4) Given that AMT tasks are performed without supervision, their accuracy is a concern, so the results were compared to the "closed environment" test results, where "success" is achieved if the user obtains the information they are looking for. (A rough sketch of what such an objective success check might look like follows this summary.)
5) In the AMT trial, contributors were primarily native speakers of North American English, while in the Cambridge trial British speakers primarily contributed. In spite of this, the speech performance did not vary significantly.
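On the question above about the heuristic for computing objective success rates from inferred user goals: the paper's actual heuristic is not reproduced here, but a success check of that general kind could look like the hypothetical sketch below. The goal/dialogue representation, function name, and slot names are all invented for illustration.

def dialogue_successful(inferred_goal, offered_entity, provided_slots, requested_slots):
    """Hypothetical objective success check.
    inferred_goal:   dict of constraint -> required value, e.g. {"food": "chinese"}
    offered_entity:  dict describing the venue the system finally offered
    provided_slots:  set of slot names whose values the system gave the user
    requested_slots: set of slot names the user asked about, e.g. {"phone"}"""
    # 1. The offered entity must satisfy every constraint in the inferred goal.
    constraints_met = all(
        offered_entity.get(slot) == value for slot, value in inferred_goal.items()
    )
    # 2. Every piece of information the user requested must have been provided.
    requests_met = requested_slots.issubset(provided_slots)
    return constraints_met and requests_met

if __name__ == "__main__":
    goal = {"food": "chinese", "pricerange": "cheap"}
    offer = {"name": "Golden Wok", "food": "chinese", "pricerange": "cheap"}
    print(dialogue_successful(goal, offer, {"phone", "addr"}, {"phone"}))  # True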
************************* User Simulations

I'm not sure whether evaluating the user simulator against a learnt system policy is necessarily a good measure of how well the user simulator simulates a user. Perhaps there is a more direct way to compare the user simulator against a real user. When evaluating a user simulator, the goal shouldn't be to maximize the success of the task, but rather to simulate the user as closely as possible. Shouldn't the results of the user simulator and the learnt system policy be compared to the results of the actual data? In other words, we should expect a good user simulator to make the same mistakes as real users, in addition to succeeding where real users also succeeded.

What are Good-Turing and Witten-Bell discounting (in the n-gram models)?

When evaluating on the COMMUNICATOR corpus, the study assumes that word error rate is correlated with automatic speech recognition confidence scores. I am not sure how they drew this conclusion and would like to hear more about it. I also wish to hear more about the linear combination of features they used for learning with feature-based representations of states. When evaluating the different user simulations, I wondered whether the restriction they placed on the users – only single-leg flight booking – would yield accurate results for all kinds of users in real life.

They use WER as a measure of ASR confidence – is this a fair assumption? They compare different simulation methods to each other (linear combinations, n-grams). How do these simulation methods compare to results from actual users? Their evaluation against the system policy seems pretty simple, since they only consider cases where all users have the same single goal.

- What is the efficiency of the implementation with a model that takes into account the rest of the dialogue?
- What is the performance of the model when the simulations don't have a set pattern to emulate?
- Can this model be extended to emulate user errors as well, so as to be successful in testing a given dialogue system?

1. Their definition of 'intention' is "the minimum piece of information that can be conveyed independently within a given application." What do they mean by 'independently'?
2. Why is it so important to have ASR confidence scores? Is this just to make discrete correct/incorrect word recognition errors into a more continuous function?
3. What is the immediate applicability of this work? How were the 4 out of 8 systems chosen?

* Is it safe to assume, as the paper does, that WER is correlated with ASR confidence?
* It seems that the biggest gains come from the transition from the 2-gram to the 3-gram model.

1) The primary aim of this paper is the comparison of two methods used for simulating user behaviour, since such simulations are important for automatic dialogue strategy learning. User simulation is essential for system evaluation because it is inexpensive and less tedious, and any change in the dialogue strategy does not entail repeating the experiments.
2) User behaviour is modelled for simulation at the intention level, where an intention is the minimum piece of information that can be conveyed independently within an application. Both strategies use this representation.
3) The system is built on the DIPPER architecture, which enables the development of multimodal dialogue systems by providing speech synthesisers, recognisers, etc.
4) User simulation based on n-grams treats a dialogue as a sequence of pairs of <system action, user action>. The length of the history is restricted by the order of the n-gram; if no n-gram matches the history, the model backs off to smaller n-grams (see the sketch below). The other method, based on supervised learning, uses a linear feature combination, where the user is modelled as a stochastic process and the features are based on the whole history of the dialogue (the dialogue state).
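To make the back-off behaviour concrete, here is a minimal illustrative Python sketch (not the paper's code) of an n-gram user simulator. The class and method names are invented, and plain relative-frequency counts stand in for the Good-Turing or Witten-Bell discounting the paper applies; the point is only how the next user act is sampled from the longest matching context, falling back to shorter ones.

import random
from collections import defaultdict

class NGramUserSimulator:
    """Toy n-gram user simulation: sample the next user act from counts
    conditioned on the most recent dialogue acts, backing off to shorter
    contexts when the full context was never observed in training."""

    def __init__(self, n=3):
        self.n = n
        # counts[context_tuple][user_act] = frequency in the training dialogues
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, dialogues):
        """dialogues: lists of alternating system/user dialogue acts (strings)."""
        for acts in dialogues:
            for i, act in enumerate(acts):
                if i % 2 == 1:  # odd positions are user acts in this toy encoding
                    # store every context length from 0 up to n-1
                    contexts = {tuple(acts[max(0, i - k):i]) for k in range(self.n)}
                    for context in contexts:
                        self.counts[context][act] += 1

    def next_user_act(self, history):
        """Sample the next user act given the dialogue-act history so far,
        backing off from (n-1)-length contexts down to the unigram case."""
        for k in range(self.n - 1, -1, -1):
            context = tuple(history[-k:]) if k else tuple()
            dist = self.counts.get(context)
            if dist:
                acts, freqs = zip(*dist.items())
                return random.choices(acts, weights=freqs)[0]
        return None  # model has seen no training data at all

if __name__ == "__main__":
    sim = NGramUserSimulator(n=3)
    sim.train([["ask_origin", "provide_origin", "ask_dest", "provide_dest"]])
    print(sim.next_user_act(["ask_origin"]))  # -> "provide_origin"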