Natural language utterances generated by machine can expect an audience of at most one. When there are many readers or many listeners, there is usually also the time and money for a human writer or human speaker. In evaluating NLG, therefore, the thing that matters most is whether that one person understands the system easily and can put its information to good use. This standard applies no matter how the language is generated. It calls for end-to-end systems, or at least settings in which human subjects put the system's information to use, and for the fine-grained methodology and theories of psycholinguistics.

I don't think there are any shortcuts to this evaluation. But it may make sense to lower our standards for it. The fact is that all methods of NLG are making rapid progress and delivering genuine improvements. There are dangers: we could be misled by vague intuitions, or focus on unimportant problems. We must make a good-faith effort to avoid these dangers. But we don't have to run subject after subject until we hammer out trends to p < 0.01 every time we introduce a new tweak to our algorithms. In HCI you usually see the big bugs with a handful of subjects. If a researcher can report a trend in the right direction at that scale, we should be satisfied; the first sketch at the end of this section puts rough numbers on what a handful of subjects can and cannot show. Just around the corner lies the next incarnation of their new idea, different for sure and probably better.

What is around the corner? Consider this. I take it that a method is symbolic to the extent that it is formulated in terms of richly structured representations with a clear correspondence to real-world phenomena. In computational linguistics, that means real parse trees, real semantic representations, and principled formalizations of the commitments that speakers make when they adopt the intention to use an utterance to push a conversation forward. I take it that a method is statistical to the extent that it involves estimating the parameters of a model from noisy data that only imperfectly reflect underlying real-world regularities.

So I see every reason to look forward to NLG that is both statistical and symbolic, where systems deploy models that feature richly structured, principled representations and are moreover estimated from noisy real-world data. After all, each of us at some level knows what we mean, and what our utterances mean, in a principled way, despite having learned our language through observation of and interaction with members of our own sometimes fallible and obscure species.

But I don't know whether we can expect our data for this enterprise to come from official repositories annotated just right, once and for all. When it comes to the good stuff, purpose and meaning, there may be no practical alternative to the interactive bootstrapping that everything else that learns the good stuff seems to rely on: working collaboratively with a conversational partner to understand or to produce each new utterance in light of background knowledge, and using that understanding to refine the knowledge for the future.
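To put rough numbers on the evaluation point, here is a minimal Python sketch of the small-sample arithmetic. The sign test is my own stand-in, not a recommendation from the argument above, for whatever directional comparison a study actually reports.

    from math import comb

    def sign_test_p(successes, n):
        """One-sided exact sign test: the chance of at least `successes`
        out of `n` subjects favoring the new system if preferences were
        really a coin flip (the null hypothesis, p = 0.5)."""
        return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

    # A handful of subjects can show a clear trend but will rarely reach
    # p < 0.01: even a unanimous five-subject study falls short.
    print(sign_test_p(5, 5))  # 0.03125: a trend in the right direction
    print(sign_test_p(8, 8))  # ~0.0039: it takes eight unanimous subjects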
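To make "statistical and symbolic" concrete in miniature, here is a toy sketch, not any particular system's method: a probabilistic context-free grammar whose symbolic side is a set of explicit rules over parse trees, and whose statistical side is rule probabilities estimated by relative frequency from a treebank. The grammar and treebank below are hypothetical.

    from collections import Counter

    # Hypothetical observed derivations: each tree is the list of
    # (lhs, rhs) rule applications read off a hand-annotated parse.
    treebank = [
        [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("V", "NP")),
         ("V", ("saw",)), ("NP", ("stars",))],
        [("S", ("NP", "VP")), ("NP", ("stars",)), ("VP", ("V",)),
         ("V", ("fell",))],
    ]

    def estimate_pcfg(trees):
        """Maximum-likelihood rule probabilities: count each rule and
        normalize by the count of its left-hand side. The symbolic
        structure (the rules) is given; only the numbers are learned."""
        rule_counts, lhs_counts = Counter(), Counter()
        for tree in trees:
            for lhs, rhs in tree:
                rule_counts[(lhs, rhs)] += 1
                lhs_counts[lhs] += 1
        return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

    for (lhs, rhs), p in sorted(estimate_pcfg(treebank).items()):
        print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")

The same shape scales up: the representations stay principled (trees, meanings, commitments) while the estimated parameters absorb the noise in the data.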
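Finally, the interactive bootstrapping of the last paragraph, reduced to a schematic loop. Every name and interface here (interpret, ask_partner, refine, the toy lexicon) is hypothetical, a placeholder for what real collaborative grounding with a partner would involve.

    def bootstrap(utterances, knowledge, interpret, ask_partner, refine):
        """Understand each new utterance in light of current knowledge,
        settle its meaning collaboratively with a partner, and fold the
        outcome back into the knowledge for next time."""
        for utterance in utterances:
            candidate = interpret(utterance, knowledge)
            agreed = ask_partner(utterance, candidate)
            knowledge = refine(knowledge, utterance, agreed)
        return knowledge

    # Trivial stand-ins so the loop runs end to end: a meaning is just a
    # label, and the "partner" is a fixed gold table rather than a person.
    gold = {"hello": "GREET", "bye": "CLOSE"}
    interpret = lambda u, k: k.get(u, "UNKNOWN")
    ask_partner = lambda u, m: gold[u]
    refine = lambda k, u, m: {**k, u: m}

    print(bootstrap(["hello", "bye", "bye"], {"hello": "GREET"},
                    interpret, ask_partner, refine))
    # {'hello': 'GREET', 'bye': 'CLOSE'}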