Model-based Bayesian Reinforcement Learning for Dialogue Management

- What constitutes a reward for moving from one state to the next?
- Would changing the assumption that an observation depends only on the last user act change the results? Would including more dependencies make the model more robust?
- As a general comment, I don't know if I had enough statistical background to truly understand the findings of this paper.

1) In terms of paper clarity – while the author does a good job of including formulas to explain his method, there were so many variables and probabilities floating around that I was left quite confused. When dealing with a paper of a more mathematical bent, I would appreciate more diagrams and examples rather than just formulas. On the other hand, he does include pseudocode for the algorithm used, which is always helpful.
2) The author does not spend much time discussing the training data collection. Specifically: were the dialogues annotated by multiple annotators or a single one? If multiple, what was the agreement? Were there any difficulties or issues with the annotation? Basically, we have no idea how good or bad this training data is, and since it is the basis of the rest of the paper, I find it somewhat troubling that this important information was skipped over.
3) In terms of paper organization, I find it strange to see the Related Work section at the end; I would have preferred to see it closer to the beginning so I could see where this work fits in with the rest of the field. How do these model-based methods compare with hand-crafted models? With model-free methods? It seems that a model with no domain-specific structure (the multinomial) performs worse in the short term than a more structured model, but eventually converges in the long run. Why were these specific models chosen?

A. The second paper said that reinforcement learning has the downside that it requires a small set of initial researcher-given states and actions. Is this a big hindrance in this case?
B. The researchers said they assume conditional independence under the Bayesian learning model. Is this a significant simplification that could lead to problems?
C. "We started by gathering empirical data for our dialogue domain using Wizard-of-Oz experiments, after which we built a user simulator on the basis of the collected data." I thought the point of this method was that they didn't have to use Wizard-of-Oz style data collection?

I found this paper very dense and hard to understand. It was hard not to get caught up in the math, which I found very difficult to follow. I'm curious how they came up with the points system for the reward model. It's not explained, so the exact numbers feel arbitrary; presumably they tried a few different parameters and these worked best. I think it's interesting that ignoring the user incurs a lower penalty than performing the wrong action, and that asking to confirm the correct intention is also penalized (a purely illustrative sketch of such a reward table follows this block of comments). I was also surprised that model-based strategies for dialogue management had not been tried before now.

This paper was in general rather dense. Although they did a fair job of trying to make it accessible, there were still several concepts that weren't fully explained due to lack of space. I hope the presentation will give me a fuller understanding.
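To make the comments about the reward numbers more concrete, here is a purely illustrative sketch of what a hand-crafted reward table for this kind of intention-recognition dialogue task could look like. The action names and every numeric value are invented for illustration and are not the values used in the paper; the sketch only reproduces the relative ordering remarked on above (executing the wrong action is worst, ignoring the user is penalized less, and even confirming the correct intention carries a small cost).

```python
# Hypothetical reward table for an intention-based dialogue manager.
# All action names and numbers are invented -- they are NOT the paper's values.

def reward(system_action: str, user_intention: str) -> float:
    """Return an (invented) immediate reward for taking `system_action`
    when the user's true intention is `user_intention`."""
    if system_action == "execute:" + user_intention:
        return 10.0   # carrying out the correct action is rewarded
    if system_action.startswith("execute:"):
        return -15.0  # executing the wrong action is the worst outcome
    if system_action == "confirm:" + user_intention:
        return -1.0   # even confirming the right intention costs a turn
    if system_action.startswith("confirm:"):
        return -3.0   # confirming the wrong intention is more annoying
    if system_action == "ignore":
        return -5.0   # ignoring the user: penalized, but less than a wrong action
    return -2.0       # any other clarification / filler action

print(reward("execute:bring_coffee", "bring_coffee"))  # -> 10.0
print(reward("ignore", "bring_coffee"))                # -> -5.0
print(reward("execute:bring_tea", "bring_coffee"))     # -> -15.0
```

Tables like this are usually tuned by hand against a simulator until the learned policies look sensible, which would be consistent with the suspicion above that the exact parameters were simply tried and kept.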
1. How did they come up with the reward model used for the experiment? Is it hand constructed, or is it learned from some data?
2. For the data collection – did users know which questions/commands were in the robot's domain?
3. I didn't fully understand the Wizard-of-Oz data collection. The users interacted with a robot – were the robot's utterances chosen by a human?

In the evaluation section, the author only compared and evaluated two alternative models of his approach and concluded that both achieve higher returns as the number of turns increases. In the Related Work section, the author mentions that there has been earlier work on model-based reinforcement learning for dialogue management; I would like to see a comparison between his approach and those earlier approaches. How do the transition, reward and observation models work together with each other? (A generic sketch of how these pieces interact is given below.) What work would need to be done to extend the framework to estimate the reward model in parallel to the state transitions? I'm having a hard time understanding Figure 1. I know what each node represents, but what does each arrow mean?
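As for the question of how the transition, reward and observation models work together, the following is a minimal, generic POMDP-style sketch, not the paper's actual implementation: two invented user intentions and made-up probability tables stand in for the real models. The transition model predicts how the hidden user state evolves, the observation model reweights that prediction after each noisy observed user act, and the reward model only enters when candidate system actions are scored against the resulting belief.

```python
# Generic POMDP sketch (toy states and made-up probabilities, not the paper's models)
# showing how transition, observation and reward models interact in belief tracking.

from collections import defaultdict

STATES = ["want_coffee", "want_tea"]  # hidden user intentions (invented)

def transition(s_next, s, action):
    """P(s' | s, a): here the user intention is simply assumed to persist."""
    return 1.0 if s_next == s else 0.0

def observation(obs, s_next, action):
    """P(o | s', a): noisy channel from intention to observed user act."""
    table = {("say_coffee", "want_coffee"): 0.8, ("say_tea", "want_coffee"): 0.2,
             ("say_coffee", "want_tea"): 0.3,  ("say_tea", "want_tea"): 0.7}
    return table[(obs, s_next)]

def reward(s, action):
    """R(s, a): invented values, in the same spirit as the reward sketch above."""
    return 10.0 if action == "serve_" + s.split("_")[1] else -15.0

def belief_update(belief, action, obs):
    """b'(s') proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    new_belief = defaultdict(float)
    for s_next in STATES:
        predicted = sum(transition(s_next, s, action) * belief[s] for s in STATES)
        new_belief[s_next] = observation(obs, s_next, action) * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}

def expected_reward(belief, action):
    """The reward model enters when ranking actions under the current belief."""
    return sum(belief[s] * reward(s, action) for s in STATES)

belief = {"want_coffee": 0.5, "want_tea": 0.5}
belief = belief_update(belief, "ask", "say_coffee")
print(belief)                                   # belief shifts toward want_coffee
print(expected_reward(belief, "serve_coffee"))  # higher expected reward
print(expected_reward(belief, "serve_tea"))     # lower expected reward
```

In a setup like this, estimating the reward model in parallel to the state transitions would amount to also maintaining a posterior over the reward values for each state-action pair, updated from whatever reward signal is observed after each system action.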