Open Dialogue Management for Relational Databases

Unlike the papers we discussed last class, where the authors collected real-world data from users' voice input to test and evaluate the proposed system, this paper uses synthetic data as its evaluation input. However accessible and representative that data may be, it still diverges from a real-world test. How difficult is it to gather real-world data from the specific fields a particular spoken dialogue research project calls for? Based on my experience with the Natural Language Processing literature review, I found that testing the impact of domain knowledge is a unique point of contribution for understanding and evaluating the performance of a speech recognition algorithm.

- It doesn't seem that facilitating the querying of database information can be akin to a system's "dialogue policy" of choosing which strategies to use in dialogue with a user: to some extent, querying also has implicitly defined limits on what a user may ask about, as opposed to the act of conversing, which is more open-ended.
- It also strikes me as being of relatively low importance to want a more natural interface to a database when, say, SQL queries already pretty much look like natural language.
- Is the fact that non-randomized prompts yielded shorter interaction dialogues a finding translatable to other domains? That is, are most ODMs concerned with interacting with their users in a way that makes logical sense (as opposed to being semantically correct, etc.)?

- 1) The organization of the paper is a bit lacking: for several topics they touched upon (What is open dialogue management? How does ODDMER do vocabulary selection?), I was left to wonder for several pages before the authors finally answered my questions. In this case, I think some definitions at the very beginning of the paper would be useful.
- 2) The corpus annotated for the vocabulary selection task has a particularly low inter-annotator agreement value, and the authors do not seem to have been certain of what they actually wanted the annotators to do. The question is: could this have been done better? Perhaps in any annotation exercise a smaller test corpus should be annotated first, giving the researchers an opportunity to refine the annotation task.
- 3) The authors state that identifying a table as a candidate dialogue focus is as easy as selecting the biggest table with the most connectivity and the most NLP content. However, doesn't this assume that the database is constructed in the most clean-cut and logical way? That is, would this work in a 'real' database, one where columns (and relations) are added as needed to fix issues at the time they arise? (At least this has been my experience with real-world databases.) A rough sketch of this kind of heuristic appears after these comments.

The disadvantage of hand-selected vocabulary instead of using a classifier seems to be an issue of scale rather than accuracy, but wouldn't most systems in other domains require their own annotated training sets? The paper mentions that an online game such as Eve has fields with low verbality but great interest to the user. In other fields, where numbers are the important values, how might the vocabulary selection be adjusted? In this situation we are only querying a set of related databases. How might ODDMER perform, or be adjusted, for a set of disjoint relational databases, as in a general system such as Siri?
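Regarding point 3): here is a back-of-the-envelope sketch of what a "biggest, most connected, most text-heavy table" heuristic could look like. It is purely illustrative; the Table fields, the equal weighting, and the normalization are my own assumptions, not ODDMER's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Table:
    name: str
    row_count: int       # "biggest table"
    fk_degree: int       # foreign-key links in or out ("most connectivity")
    text_columns: int    # columns holding free-text, natural-language values
    total_columns: int

def candidate_foci(tables: List[Table], top_k: int = 2) -> List[Table]:
    """Rank tables as dialogue-focus candidates; higher combined score wins."""
    max_rows = max(t.row_count for t in tables)
    max_fk = max(t.fk_degree for t in tables)

    def score(t: Table) -> float:
        size = t.row_count / max_rows if max_rows else 0.0
        conn = t.fk_degree / max_fk if max_fk else 0.0
        text = t.text_columns / max(t.total_columns, 1)
        return size + conn + text  # equal weights, purely illustrative

    return sorted(tables, key=score, reverse=True)[:top_k]

# Toy library schema: a smaller but well-connected, text-heavy 'books' table
# beats a larger but text-free 'loans' table under this combined score.
books = Table("books", 120_000, 6, 5, 9)
loans = Table("loans", 300_000, 3, 0, 6)
print([t.name for t in candidate_foci([books, loans], top_k=1)])  # ['books']
```

In this toy case the raw biggest table loses to the more connected, text-heavy one; whether such a score still picks something sensible in a schema that has grown by ad-hoc fixes is exactly the question point 3) raises.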
A. This approach still requires hand-tagging of words that would likely be known by users. Is this necessary? Wouldn't it be more realistic to read in a general conversation corpus and train on just the presence of words (or, in the case of birthdates, patterns) to estimate whether attributes are user-intelligible?

B. How is 'basic user focus' determined? Was it the study authors' idea of what was important to the user, based on their human knowledge of the purpose of a library?

C. "If the return size is small enough (here, a single tuple) it announces the result. Otherwise it continues to elicit constraints until all intelligible attributes have elicited values, at which point it announces all matching results." Why would it wait for all constraints to be satisfied before giving a list of potential matches? Often users only know one type of attribute of the book they're looking for, and normal search engines are good enough to give them ten pretty good results based on that. This seems frustrating. (A rough sketch of this loop appears after these comments.)

I would like to hear more description of the reliability testing, especially of move segmentation. I think the paper does a good job of describing it, but I don't feel that I fully understood that part. When checking the reliability of transaction coding, I was wondering whether simplifying the task would alter the result.

Though I was at first very confused by the terminology used, I ultimately found this paper very clear and thorough. I thought it was interesting how they managed to build a system that showed potential for cross-domain usage, as they demonstrated how it worked on a few different databases. They did mention, however, that identifying intelligible data was not always a straightforward task [in part due to the wording of the question asked]. I wonder whether changing the question would change the results, and whether or not more obscure database information would work with their system.

In general I liked the article. They mentioned that table names are in general not very descriptive of their actual content, and that they had to manually change the names of some of the attributes/tables to be more humanly understandable. I was wondering whether this is something that could have been done with a classifier as well. Instead of doing what they did, could they have built a category classifier and then used the contents of the database rows to come up with an appropriate, humanly understandable category? It was also interesting how they used user simulation; it would still have been nice to see some results with non-simulated users.

1. How is semantic specificity calculated?
2. Are simulated users an effective evaluation method? (Are they typically representative of real users?)
3. The authors assume that "unintelligible" keywords are less desirable. However, there might be cases where you would want to search using a keyword that is unintelligible. For example, if you search by ISBN you are guaranteed to get the exact book (and edition) that you want.

I have a few questions about the training process of the intelligible-attributes classifier. First, the training set contains only 67 attributes, which is relatively small, and all the training data are attributes from the Microsoft AdventureWorks Cycle Company database. The attributes in that database may not be representative of how database designers name attributes in general. Second, some of the features used for attribute classification do not make intuitive sense to me, such as the mean ratio of unique to total characters per value. If that ratio is high, does it mean the attribute is less intelligible or more intelligible?
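On that last question: here is one plausible reading of the "mean ratio of unique to total characters per value" feature, just to pin down what it measures. This is my interpretation, not code from the paper: per value, count distinct characters over total characters, then average across the column's values.

```python
def mean_unique_char_ratio(values):
    """Average over values of (number of distinct characters / total characters)."""
    ratios = []
    for v in values:
        s = str(v)
        if s:  # skip empty values
            ratios.append(len(set(s)) / len(s))
    return sum(ratios) / len(ratios) if ratios else 0.0

print(mean_unique_char_ratio(["1111-2222", "0000-9999"]))   # about 0.33
print(mean_unique_char_ratio(["War and Peace", "Dune"]))     # about 0.85
```

Under this reading, repetitive, code-like values (IDs, formatted dates) score low and short free-text values score high; which direction the classifier ultimately associates with intelligibility is exactly what the question above asks.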
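Returning to the policy quoted in item C, here is a minimal sketch of that elicitation loop as I read it; the in-memory list of row dicts and the console I/O are stand-ins for the real components, not the system's actual code.

```python
def elicit_and_announce(rows, intelligible_attrs, announce_threshold=1):
    """Keep eliciting constraints until the match set is small enough to announce.
    rows: list of dicts, e.g. [{'title': 'Dune', 'author': 'Herbert'}, ...]"""
    constraints = {}

    def matches():
        return [r for r in rows
                if all(r.get(a) == v for a, v in constraints.items())]

    for attr in intelligible_attrs:
        found = matches()
        if len(found) <= announce_threshold:   # "return size is small enough"
            return found                       # announce the result(s)
        answer = input(f"Any constraint on {attr}? ").strip()
        if answer:
            constraints[attr] = answer
    # Every intelligible attribute has had a chance to constrain the query:
    # announce all remaining matches.
    return matches()
```

The frustration described in item C corresponds to announce_threshold=1 (the "single tuple" in the quote); raising it to ten or so would approximate the short ranked list the commenter expects from ordinary search engines.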
They estimate the probabilities for attribute value knowledge by calculating the frequency of total value occurrences in a subset of the English Gigaword corpus (a toy version of this kind of estimate appears at the end of these notes). Is this a standard way to estimate human knowledge? I expected the probability for Place Published to be higher than for TITLE, because TITLE may contain words that do not co-occur in a single article. However, the probability for TITLE is significantly higher.

In Table 4, it is interesting to see that regardless of the number of tables in each database and the different domains of the databases, the shortest dialogue lengths for each database are all 9.0, without any deviation from their means.

What does this mean: "For each candidate focus, ODDMER instantiates a focus agent that prompts users for values of intelligible attributes, ordered by efficiency"? Is the user ranking the different important attributes/data extracted from the DB?
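On the Gigaword-based estimate (and the related suggestion in question A above), here is a toy version of what a corpus-frequency proxy for attribute value knowledge could look like. The mini-corpus, the word-level tokenization, and the scoring function are all invented for illustration; this is my reading of the idea, not the paper's implementation.

```python
from collections import Counter

def value_knowledge_score(attribute_values, corpus_tokens):
    """Average per-word corpus frequency of an attribute's values: a crude
    proxy for how familiar users are likely to be with those values."""
    counts = Counter(corpus_tokens)
    total_tokens = len(corpus_tokens)
    words = [w for v in attribute_values for w in str(v).lower().split()]
    if not words:
        return 0.0
    return sum(counts[w] for w in words) / (len(words) * total_tokens)

# A made-up mini-corpus standing in for a Gigaword subset.
corpus = ("the war in the city ended and peace returned to london "
          "while paris published the news").split()
titles = ["War and Peace", "The City and the City"]
places = ["London", "Paris", "Springfield"]
isbns = ["9780143105428"]

print(value_knowledge_score(titles, corpus))  # highest: made of common English words
print(value_knowledge_score(places, corpus))  # lower: only some place names occur
print(value_knowledge_score(isbns, corpus))   # 0.0: ISBNs never occur in running text
```

If the paper's estimate works at the word level in roughly this way, TITLE scoring higher than Place Published is less surprising: titles are largely composed of very common English words even when the full title never appears in any one article. That may partly explain the result noted above, though this is only speculation about the paper's setup.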