Natural language utterances generated by machine can expect an audience of at most one. When there are many readers or many listeners, there is usually also the time and money for a human writer or human speaker. In evaluating NLG, therefore, the thing that matters most is whether that one person understands the system easily and can put its information to good use. This standard applies no matter how the language is generated. It calls for end-to-end systems, or at least settings in which human subjects put the system's information to use, and for the fine-grained methodology and theories of psycholinguistics.

I don't think there are any shortcuts to this evaluation. But it may make sense to lower our standards for it. The fact is that all methods of NLG are making rapid progress and delivering genuine improvements. There are dangers: we could be misled by vague intuitions, or focus on unimportant problems. We must make a good-faith effort to avoid these dangers. But we don't have to run subject after subject until we hammer out trends to p < 0.01 every time we introduce a new tweak to our algorithms. In HCI you usually see the big bugs with a handful of subjects. If a researcher can report a trend in the right direction at that scale, we should be satisfied; the first sketch at the end of this section puts rough numbers on what a handful of subjects can and cannot show. Just around the corner lies the next incarnation of their new idea, different for sure and probably better.

What is around the corner? Consider this. I take it that a method is symbolic to the extent that it is formulated in terms of richly structured representations with a clear correspondence to real-world phenomena. In computational linguistics, that means real parse trees, real semantic representations, and principled formalizations of the commitments that speakers make when they adopt the intention to use an utterance to push a conversation forward. I take it that a method is statistical to the extent that it involves estimating the parameters of a model from noisy data that only imperfectly reflect underlying real-world regularities.

So I see every reason to look forward to NLG that is both statistical and symbolic, where systems deploy models that feature richly structured, principled representations and are moreover estimated from noisy real-world data. After all, each of us at some level knows what we mean, and what our utterances mean, in a principled way, despite having learned our language through observation of and interaction with members of our own sometimes fallible and obscure species.

But I don't know whether we can expect our data for this enterprise to come from official repositories annotated just right, once and for all. When it comes to the good stuff, purpose and meaning, there may be no practical alternative to the interactive bootstrapping that everything else that learns the good stuff seems to rely on: working collaboratively with a conversational partner to understand or to produce each new utterance in light of background knowledge, and using that understanding to refine the knowledge for the future.
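To put rough numbers on the evaluation point, here is a minimal Python sketch of the small-sample arithmetic. The sign test is my own stand-in, not a recommendation from the argument above, for whatever directional comparison a study actually reports.

    from math import comb

    def sign_test_p(successes, n):
        """One-sided exact sign test: the chance of at least `successes`
        out of `n` subjects favoring the new system if preferences were
        really a coin flip (the null hypothesis, p = 0.5)."""
        return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

    # A handful of subjects can show a clear trend but will rarely reach
    # p < 0.01: even a unanimous five-subject study falls short.
    print(sign_test_p(5, 5))  # 0.03125: a trend in the right direction
    print(sign_test_p(8, 8))  # ~0.0039: it takes eight unanimous subjects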
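To make "statistical and symbolic" concrete in miniature, here is a toy sketch, not any particular system's method: a probabilistic context-free grammar whose symbolic side is a set of explicit rules over parse trees, and whose statistical side is rule probabilities estimated by relative frequency from a treebank. The grammar and treebank below are hypothetical.

    from collections import Counter

    # Hypothetical observed derivations: each tree is the list of
    # (lhs, rhs) rule applications read off a hand-annotated parse.
    treebank = [
        [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("V", "NP")),
         ("V", ("saw",)), ("NP", ("stars",))],
        [("S", ("NP", "VP")), ("NP", ("stars",)), ("VP", ("V",)),
         ("V", ("fell",))],
    ]

    def estimate_pcfg(trees):
        """Maximum-likelihood rule probabilities: count each rule and
        normalize by the count of its left-hand side. The symbolic
        structure (the rules) is given; only the numbers are learned."""
        rule_counts, lhs_counts = Counter(), Counter()
        for tree in trees:
            for lhs, rhs in tree:
                rule_counts[(lhs, rhs)] += 1
                lhs_counts[lhs] += 1
        return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

    for (lhs, rhs), p in sorted(estimate_pcfg(treebank).items()):
        print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")

The same shape scales up: the representations stay principled (trees, meanings, commitments) while the estimated parameters absorb the noise in the data.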
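Finally, the interactive bootstrapping of the last paragraph, reduced to a schematic loop. Every name and interface here (interpret, ask_partner, refine, the toy lexicon) is hypothetical, a placeholder for what real collaborative grounding with a partner would involve.

    def bootstrap(utterances, knowledge, interpret, ask_partner, refine):
        """Understand each new utterance in light of current knowledge,
        settle its meaning collaboratively with a partner, and fold the
        outcome back into the knowledge for next time."""
        for utterance in utterances:
            candidate = interpret(utterance, knowledge)
            agreed = ask_partner(utterance, candidate)
            knowledge = refine(knowledge, utterance, agreed)
        return knowledge

    # Trivial stand-ins so the loop runs end to end: a meaning is just a
    # label, and the "partner" is a fixed gold table rather than a person.
    gold = {"hello": "GREET", "bye": "CLOSE"}
    interpret = lambda u, k: k.get(u, "UNKNOWN")
    ask_partner = lambda u, m: gold[u]
    refine = lambda k, u, m: {**k, u: m}

    print(bootstrap(["hello", "bye", "bye"], {"hello": "GREET"},
                    interpret, ask_partner, refine))
    # {'hello': 'GREET', 'bye': 'CLOSE'}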