Creating .utt Files for Babel Languages

Please note: these instructions are for Columbia Speech Lab students only. Many of these tools and data sets are not yet publicly available.

See this reference page for a list of the Babel languages and their language codes.

0. General setup

Make sure these are all in your .bashrc:

export PATH=/proj/tts/tools/babel_scripts/build/festival/bin:$PATH
export BABELDIR=/proj/tts/data/babeldir
export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export SPTKDIR=/proj/tts/tools/babel_scripts/build/SPTK

Then make sure the Babel language you want, e.g. BABEL_BP_105 (Turkish), is in $BABELDIR. If it is not, symlink it in.
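The check-and-symlink step can be scripted. The following is a minimal sketch, not an existing lab tool: the function name ensure_babel_language and the source_dir argument (wherever the language release actually lives on disk) are assumptions made for illustration. It links with an absolute path, in line with the advice about symlinks later on this page.

```python
import os
from pathlib import Path


def ensure_babel_language(language_id, source_dir, babeldir=None):
    """Return the path of language_id under BABELDIR, symlinking it in if absent.

    source_dir is a hypothetical location of the language release; replace it
    with the real path for your data.
    """
    # Fall back to the BABELDIR environment variable set in .bashrc.
    babeldir = Path(babeldir or os.environ["BABELDIR"])
    target = babeldir / language_id
    if target.exists():
        return target  # already present, nothing to do

    source = Path(source_dir).resolve()  # absolute path for the symlink
    if not source.is_dir():
        raise FileNotFoundError(f"No such language directory: {source}")
    target.symlink_to(source, target_is_directory=True)
    return target
```

For example, ensure_babel_language("BABEL_BP_105", "/some/path/to/BABEL_BP_105") would leave $BABELDIR/BABEL_BP_105 in place, where the source path shown is only a placeholder.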

1. Directory setup

For this example, we assume Turkish-language data from the Omniglot dataset. Change the directory name (turkish_omniglot) and the Babel language ID (BABEL_BP_105) to match whichever language and dataset you are using. Things you will want to tailor to your own language and data are in italics.

cd /proj/tts/tools/babel_scripts
mkdir turkish_omniglot
cd turkish_omniglot

Then run the following, which should all be on one line:
/proj/tts/tools/babel_scripts/make_build setup_voice turkish_omniglot $BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt $BABELDIR/BABEL_BP_105/conversational/training/transcription $BABELDIR/BABEL_BP_105/conversational/training/audio

2. Drop in our data

Put .wav files under wav/. These should be 16 kHz, 16-bit audio.
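Before going further, it can help to confirm that every file really has that sample rate and bit depth. Here is a small sketch using Python's standard wave module; the function name find_bad_wavs is our own, and the check assumes plain PCM .wav files (the wave module raises wave.Error on other encodings).

```python
import wave
from pathlib import Path


def find_bad_wavs(wav_dir, rate=16000, sample_width_bytes=2):
    """Map each non-conforming .wav file name to its (rate, sample width in bytes)."""
    bad = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            found = (w.getframerate(), w.getsampwidth())
        # 16-bit audio has a sample width of 2 bytes.
        if found != (rate, sample_width_bytes):
            bad[path.name] = found
    return bad
```

Calling find_bad_wavs("wav") from the voice directory returns an empty dict when all files match.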

Create a txt.done.data file with the transcripts under etc/. This file lists each utterance's filename ID together with its transcript. It should look something like this:

( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )
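If your transcripts start out in some other form, a few lines of Python can write them in this layout. This is only a sketch: the function names are made up for illustration, and the backslash-escaping of embedded double quotes is an assumption about how the file is read, so check it against your own data.

```python
def format_entry(utt_id, text):
    """Format one txt.done.data line: ( utt_id "transcript" )."""
    # Assumption: embedded backslashes and double quotes need escaping.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'( {utt_id} "{escaped}" )'


def write_txt_done_data(path, transcripts):
    """Write a dict of {utterance ID: transcript} as UTF-8, one entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for utt_id, text in transcripts.items():
            f.write(format_entry(utt_id, text) + "\n")
```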

3. Build prompts and identify OOVs

Run this from your top-level turkish_omniglot directory:
./bin/do_build parallel build_prompts etc/txt.done.data

We recommend running this command in an emacs shell, since it appears to handle UTF-8 best and causes the fewest problems.

This step will reveal words that are not in the lexicon. Get a list of those words and use Sequitur G2P or Phonetisaurus to generate their pronunciations.
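A rough pre-check of the transcripts against the lexicon can also produce a candidate OOV list. The sketch below is not what build_prompts does internally, and its output should be treated as a hint only. Two things are assumptions: that the word is the first tab-separated field on each lexicon line, and that splitting on non-word characters approximates Festival's tokenization. Case folding is deliberately left out because it is language-specific (Turkish dotted and dotless i, for example).

```python
import re

# Assumption: a "word" is a run of word characters, optionally with internal apostrophes.
WORD = re.compile(r"\w+(?:'\w+)*")


def load_lexicon_words(lexicon_path):
    """Collect the first tab-separated field of every non-empty lexicon line."""
    with open(lexicon_path, encoding="utf-8") as f:
        return {line.split("\t", 1)[0].strip() for line in f if line.strip()}


def find_oovs(transcripts, lexicon_words):
    """Return the sorted words that occur in transcripts but not in lexicon_words."""
    seen = set()
    for text in transcripts:
        seen.update(WORD.findall(text))
    return sorted(seen - set(lexicon_words))
```

The resulting list can then be passed to whichever G2P tool you are using.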

For syllabification, there are a few options:

For the stress markers (the numbers at the end of each syllable unit), we currently set every value to 0. Stress information is available in the Babel lexicon, but it is not yet incorporated.

Once you have both the pronunciations and the syllabifications for the OOV words, add them to festvox/lex.scm in the proper format, anywhere in the file. Then, use this Festival command to sort them into the appropriate order and create the final lexicon:

cd yourvoicedirectory

$ESTDIR/../festival/bin/festival -b festvox/yourvoicename_phoneset.scm '(set! lex_syllabification nil)' '(lex.compile "festvox/lex.scm" "festvox/cmu_babel_lex.out")'
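For the new entries that go into festvox/lex.scm before compiling, a small formatter can save typing. This sketch follows Festival's usual lexicon entry layout of word, part of speech, and a list of syllables, each with a stress number. The nil part-of-speech field and the example phone names are assumptions; compare the output against entries already in your lex.scm and match those.

```python
def lex_entry(word, syllables, pos="nil"):
    """Format one lexicon entry: ("word" pos (((p1 p2) 0) ((p3) 0)))

    syllables is a list of phone lists, one per syllable. Every stress
    marker is written as 0, as described above.
    """
    sylls = " ".join(f"(({' '.join(phones)}) 0)" for phones in syllables)
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'("{escaped}" {pos} ({sylls}))'


# Hypothetical pronunciation and syllabification, for illustration only.
print(lex_entry("merhaba", [["m", "e", "r"], ["h", "a"], ["b", "a"]]))
```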

Once all of the OOV pronunciations have been added, re-run the build_prompts command. If errors remain, fix them and re-run until the command completes without errors.

4. Label using EHMM alignment

./bin/do_build label etc/txt.done.data

If you get the error "Wave files are missing. Aborting ehmm.", compare the file names in txt.done.data with those in wav/: something is likely missing or duplicated. We have also found that using symlinks for wavs sometimes causes this. Use full paths for symlinks, not relative paths.
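The comparison between txt.done.data and wav/ is tedious by hand, so here is a sketch that reports the likely culprits in one pass. The function name and the shape of the returned report are our own choices, and the parsing assumes the ( id "text" ) layout shown in step 2.

```python
import os
import re
from collections import Counter

# Matches the utterance ID at the start of a line such as: ( uniph_0001 "text" )
ENTRY = re.compile(r'^\(\s*(\S+)\s+"')


def compare_ids_with_wavs(txt_done_data, wav_dir):
    """Report duplicate IDs, missing or unlisted wavs, and suspicious symlinks."""
    with open(txt_done_data, encoding="utf-8") as f:
        matches = (ENTRY.match(line.strip()) for line in f)
        ids = [m.group(1) for m in matches if m]

    wav_names = [n for n in os.listdir(wav_dir) if n.endswith(".wav")]
    stems = {n[: -len(".wav")] for n in wav_names}
    links = [n for n in wav_names if os.path.islink(os.path.join(wav_dir, n))]

    return {
        "duplicate_ids": sorted(i for i, c in Counter(ids).items() if c > 1),
        "missing_wavs": sorted(set(ids) - stems),
        "unlisted_wavs": sorted(stems - set(ids)),
        # Symlinks whose stored target is a relative path.
        "relative_symlinks": sorted(
            n for n in links
            if not os.path.isabs(os.readlink(os.path.join(wav_dir, n)))
        ),
        # Symlinks whose target does not exist.
        "broken_symlinks": sorted(
            n for n in links if not os.path.exists(os.path.join(wav_dir, n))
        ),
    }
```

Run from the voice directory as compare_ids_with_wavs("etc/txt.done.data", "wav"); every list in the result should be empty before you retry the label step.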

5. Last steps

./bin/do_clustergen generate_statenames
./bin/do_clustergen generate_filters
./bin/do_clustergen parallel build_utts etc/txt.done.data

The .utt files should then be in festival/utts.

6. Phone Mapping

If you want to use this data with HTS, you will have to phone-map it so that no phoneme names contain non-alphabetical characters (see /proj/tts/examples/map_phones.py). Then, you will have to run make label to create both full and mono labels out of these .utt files. Finally, don't forget to run make list once all full and gen labels are in place.
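To illustrate the idea of the phone mapping, here is a minimal sketch. It is not the contents of map_phones.py, and the replacement names in the example are invented. The function takes a hand-written table of overrides and checks that every resulting name is purely alphabetic (ASCII letters only) and that no two phones end up with the same name.

```python
def build_phone_map(phones, overrides):
    """Map each phone to an alphabetic name, failing loudly on problems.

    overrides is a hand-written dict such as {"@": "schwa"}; phones that are
    already alphabetic map to themselves.
    """
    mapping = {}
    for phone in phones:
        new_name = overrides.get(phone, phone)
        if not (new_name.isascii() and new_name.isalpha()):
            raise ValueError(f"{phone!r} -> {new_name!r} is not purely alphabetic")
        mapping[phone] = new_name

    # Two phones must never collapse into the same mapped name.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two phones map to the same name")
    return mapping


# Hypothetical phone names and replacements, for illustration only.
phone_map = build_phone_map(["a", "@", "i:"], {"@": "schwa", "i:": "ilong"})
```

Failing on an unmapped or colliding phone is a deliberate choice here: a silent fallback would only surface later as a confusing label error.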

Merlin also requires phone mapping if any of the phoneme names are the same as any of the delimiters in the label file format. Do this mapping before converting to lab, and follow it with lab normalization for Merlin.