Creating .utt Files for English

Create prompt file and general setup

First you need to make a .data file (conventionally etc/txt.done.data) containing the base filename and transcript of each utterance, e.g.:
( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )
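If your transcripts are stored one per file, a short loop can generate this file automatically. The txt/<basename>.txt layout below is a hypothetical example (not a Festvox requirement); adjust it to however your corpus stores transcripts:

```shell
# Sketch: build a prompt file from per-utterance transcript files.
# The txt/<basename>.txt layout is a made-up example layout.
mkdir -p txt etc
printf 'a whole joy was reaping.\n' > txt/uniph_0001.txt
printf "but they've gone south.\n" > txt/uniph_0002.txt

for f in txt/*.txt; do
    base=$(basename "$f" .txt)
    text=$(cat "$f")
    printf '( %s "%s" )\n' "$base" "$text"
done > etc/txt.done.data

cat etc/txt.done.data
```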

You will also need to do some general setup to get Festival and related tools on your path. Add the following lines to your .bashrc, and don't forget to source .bashrc afterwards. These point to the newest version of Festival (2.4), which includes EHMM, from the Babel Festvox scripts:
export PATH=/proj/tts/tools/babel_scripts/build/festival/bin:$PATH
export PATH=/proj/tts/tools/babel_scripts/build/speech_tools/bin:$PATH
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools

** Note that these are the paths that Speech Lab students should use. If you are not in the Speech Lab, set these paths to wherever you have Festival, Festvox, and the Edinburgh Speech Tools (EST) installed.

** Also note that any labels created using the old version of Festival (in /proj/speech/tools) will be missing the feature "vowel in current syllable," which especially affects the quality of Merlin voices. Make sure the labels you use are consistent: if you are using old-style labels, compare the voice against a baseline that also uses old-style labels.

Fullcontext labels using EHMM alignment

EHMM stands for "ergodic HMM" and is an alignment method that accounts for the possibility of pauses between phoneme labels, which should in theory result in better duration models. It is fairly commonly used and is built into Festival. More information on EHMM can be found in this paper: Sub-Phonetic Modeling for Capturing Pronunciation Variations for Conversational Synthetic Speech (Prahallad et al. 2006).

Source: modified from http://www.nguyenquyhy.com/2014/07/create-full-context-labels-for-hts/

In the label step, if you get the error "Wave files are missing. Aborting ehmm.", check the filenames in txt.done.data against those in wav/ -- something is likely missing or duplicated, since the set of utterances in the two places must match exactly. If you only removed a transcript line from txt.done.data and did not remove any .wav files, you can just continue with label; you don't have to re-run the build_prompts step.
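When that error comes up, a comm-based check like the following can pinpoint the mismatch. The fixture files created at the top are stand-ins for a real etc/txt.done.data and wav/ directory:

```shell
# Sketch: diagnose "Wave files are missing" by comparing the utterance sets
# in etc/txt.done.data and wav/. The fixtures below are stand-ins for real data.
mkdir -p etc wav
cat > etc/txt.done.data <<'EOF'
( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
EOF
touch wav/uniph_0001.wav

# Names only in txt.done.data print flush left; names only in wav/ are indented.
comm -3 \
    <(sed -e 's/^( *//' -e 's/ .*//' etc/txt.done.data | sort) \
    <(ls wav | sed 's/\.wav$//' | sort)
```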

Getting alignment score from EHMM

EHMM reports the average likelihood after each training round (under ehmm/mod/log100.txt), but by default does not record the likelihood of each individual utterance. Speech Lab students: our version of Festvox prints this to stdout as it runs. Everyone else: you can add this yourself by editing $FESTVOXDIR/src/ehmm/src/ehmm.cc -- in the function ProcessSentence, add this line:

cout << "Utterance: " << tF << " LL: " << lh << endl;

after the point where the variable lh is computed for the utterance. Then recompile by going to the top-level $FESTVOXDIR and running make.
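With that line in place, you can save the build's stdout to a file and sort on the LL field to surface the worst-aligned utterances. The ehmm_ll.txt file and the likelihood values in it below are made-up example data:

```shell
# Sketch: rank utterances by EHMM log-likelihood, lowest (worst) first.
# ehmm_ll.txt stands in for a saved copy of the build's stdout.
cat > ehmm_ll.txt <<'EOF'
Utterance: uniph_0001 LL: -68.2
Utterance: uniph_0003 LL: -51.9
Utterance: uniph_0002 LL: -95.7
EOF

grep '^Utterance:' ehmm_ll.txt | sort -k4,4n | head -5
```

Utterances at the top of this list are good candidates for listening checks or for pruning from the training data.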

Fullcontext labels using DTW alignment [DEPRECATED]

This method synthesizes all of the utterances with an existing English Festival voice, then uses dynamic time warping (DTW) between the synthesized and actual audio to align the actual audio with the text. This is what we used for many of our English voices so far, but better methods exist that we should use instead (see EHMM above). This method is included for reference.

Source: modified from http://festvox.org/bsv/x3082.html