Creating Label Files for Training Data

When preparing a new corpus for TTS using either Merlin or HTS, label files in the .lab format are required. Corpora that already have label files may also need them regenerated to fix Merlin errors. This page describes the procedure for creating these files.

Please see here for generating test synthesis labels only.

0. Understanding the Label File Format

The fullcontext label format represents phonemes in context. The context elements are extracted by a frontend tool (we use Festival). Each line of a label file holds one phoneme-in-context: the start and end times of that phoneme, in units of 100 ns (ten-millionths of a second), followed by the context elements, which are separated by unique delimiters to enable pattern matching. See lab_format.pdf from the HTS demo for more information on the format and the standard contextual features.
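As a rough illustration of the format, the sketch below splits one label line into its start time, end time, and context string, converts the times to seconds, and pulls out the current phone. It assumes the standard HTS quinphone prefix p1^p2-p3+p4=p5@... described in lab_format.pdf; the sample line is truncated and made up, and the name parse_label_line is hypothetical.

```python
import re

# HTS label times are integers in units of 100 ns (ten-millionths of a second).
HTS_UNITS_PER_SECOND = 10_000_000

# Assumed standard HTS prefix: p1^p2-p3+p4=p5@...
# The current phone (p3) sits between the first '-' and the first '+'.
QUINPHONE = re.compile(r"^[^\^]+\^[^-]+-([^+]+)\+")


def parse_label_line(line):
    """Split one full-context label line into times (in seconds) and context."""
    start, end, context = line.split(None, 2)
    context = context.strip()
    match = QUINPHONE.match(context)
    if match is None:
        raise ValueError("unexpected context format: " + context)
    return {
        "start_sec": int(start) / HTS_UNITS_PER_SECOND,
        "end_sec": int(end) / HTS_UNITS_PER_SECOND,
        "phone": match.group(1),
        "context": context,
    }


# Made-up, truncated example line (real lines carry many more context fields).
sample = "2500000 3100000 x^pau-hh+ax=l@1_2/A:0_0_0"
info = parse_label_line(sample)
print(info["phone"], info["start_sec"], info["end_sec"])
```

Dividing by 10,000,000 is all that is needed to compare label times against audio durations reported in seconds.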

1. Use Festival to Create Utts

Festival is the frontend tool we use to extract phonemes and contextual features from text. For training data, we have both text and audio, and Festival does the alignment with the audio to get the start and end times for each phone. Festival produces utt-format files containing a structured representation of the utterance and its properties.

Are you working with English-language data? Follow these instructions.

Are you working with Babel data? Follow these instructions. They assume you already have a baseline Clustergen voice trained for your language, so make sure that has been done first.

You should end up with a set of .utt files that correspond to your data.

2. Use HTS to Convert from Utt to Lab

HTS and Merlin use the same label file format. Use HTS to convert from the structured .utt format to the flattened .lab format. Speech lab students, do the following:

First, check the following:
- Do you need to use a phone mapping? If you are working with Babel data, the lexicon files contain phoneme symbols that are illegal in HTS and Merlin. More info about that here. Make sure that your .utt files have the correct mapped phonemes for your language; if they do not, you will need to create phone-mapped versions of your .utt files and use those for converting to .lab. Look in /proj/tts/examples/map_phones.py for an example of how to map phones in .utt files.
- Are vowels indicated correctly? You only need to check this if you are working with new Babel data; otherwise it has probably been done already. To fill in the "name of the vowel of the current syllable" feature in the label files, the conversion consults the phoneset of the Festival voice that you point to. All vowels must therefore be marked as such in the voice's phoneset file, which does not always happen by default. Look in /proj/tts/tools/babel_scripts/yourchosenvoice/festvox/*_phoneset.scm and make sure that all vowel phonemes are marked + in the second column. Check the LSP document for the language if you are unsure. Also make sure that the mapped phoneme names are present in the phoneset file. The steps below explain what yourchosenvoice should be. To check whether the Festival voice you are using contains the correct phoneset, run the following in Festival:
  festival> (voice_yourchosenvoice)
  festival> (PhoneSet.description nil)
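The vowel check can also be done offline on the phoneset file itself. The sketch below is a minimal reader that assumes the usual festvox layout, (defPhoneSet name (feature definitions) (phone entries)), with the vowel flag in the second column of each phone entry. The sample phoneset is made up, the function names are hypothetical, and a real Babel phoneset file may need extra handling.

```python
import re


def parse_sexp(text):
    """Parse Scheme-like text into nested Python lists of string atoms."""
    text = re.sub(r";[^\n]*", "", text)           # drop ';' comments
    tokens = re.findall(r"[()]|[^\s()]+", text)   # parens and bare atoms
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    return stack[0]


def phones_by_vowel_flag(scm_text):
    """Return (vowels, non_vowels) from the first defPhoneSet form found."""
    for form in parse_sexp(scm_text):
        if isinstance(form, list) and form and form[0] == "defPhoneSet":
            # Assumed layout: (defPhoneSet name (feature defs) (phone entries))
            entries = form[3]
            vowels = [e[0] for e in entries if e[1] == "+"]
            others = [e[0] for e in entries if e[1] != "+"]
            return vowels, others
    raise ValueError("no defPhoneSet form found")


# Made-up miniature phoneset, for illustration only.
sample = """
(defPhoneSet
  demo
  ;; feature definitions
  ((vc + - 0)
   (vlng s l d a 0))
  ;; phone entries: name, vowel flag, vowel length
  ((aa  + l)
   (b   - 0)
   (pau - 0)))
"""
vowels, others = phones_by_vowel_flag(sample)
print("vowels:", vowels)
print("non-vowels:", others)
```

Comparing the printed vowel list against the vowels in the LSP document shows at a glance which phonemes still need a + in the second column.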
Copy the directory /proj/tts/hts-2.3/template_si_htsengine/data into wherever you are working. Non-speech-lab people: this is just the data directory from the SLT HTS demo, with all the ARCTIC-specific data removed.
Copy or symlink your utts into a directory called utts under data/
Inform Festival about the frontend you want to use. If you are using the default English frontend, then you don't have to do anything. If you are using a different language, you will have to modify the following line in the Makefile:
DUMPFEATS = /path/to/your/festival/examples/dumpfeats
Append the following to the end of that line: -eval "(voice_yourchosenvoice)"
Speech lab students: We want to point it to the version of Festival where we have our voices installed, and choose e.g. a Babel voice. So e.g. if you are making labels for Turkish, change this line to:
DUMPFEATS = /proj/tts/tools/babel_scripts/build/festival/examples/dumpfeats -eval "(voice_cmu_babel_turkish_cg)"
If you want to use a different voice frontend, put its name there instead. To see which voices are available, run (voice.list) inside Festival. If the voice you want is not there yet, you will have to add it by copying the etc, festvox, and festival directories from your Clustergen frontend into /proj/tts/tools/babel_scripts/build/festival/lib/voices/language/yourvoicename. You may have to create the language directory, and you will definitely have to create the yourvoicename directory.
In the data/ directory, run:
make lab
The output .lab files should be under labels/full/
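Before moving on, it can be worth confirming that every utterance actually produced a label file. The sketch below compares the two directories by file stem and also reports zero-byte .lab files as an extra precaution. The function name check_labels is hypothetical, and the directory names are the ones used on this page.

```python
from pathlib import Path


def check_labels(utt_dir, lab_dir):
    """Return (missing, empty): utterance IDs with no .lab file, and
    IDs whose .lab file exists but has zero size."""
    utt_ids = {p.stem for p in Path(utt_dir).glob("*.utt")}
    lab_paths = list(Path(lab_dir).glob("*.lab"))
    lab_ids = {p.stem for p in lab_paths}
    missing = sorted(utt_ids - lab_ids)
    empty = sorted(p.stem for p in lab_paths if p.stat().st_size == 0)
    return missing, empty
```

Run from inside data/, check_labels("utts", "labels/full") should return two empty lists if the conversion covered every utterance.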
Finally, don't forget to normalize the labels for Merlin. The HTS and Merlin label formats are mostly the same but have slight differences; use the conversion script in merlin/misc/scripts/frontend/utils/normalize_lab_for_merlin.py with the phone_align setting.