Creating Label Files for Training Data

When preparing a new corpus for TTS with either Merlin or HTS, you need label files in the .lab format. For corpora that already have label files, you may still need to regenerate them to fix Merlin errors. This page describes the procedure for creating these files.

Please see here for generating test synthesis labels only.

0. Understanding the Label File Format

The fullcontext label format represents phonemes in context. The context elements are extracted by a frontend tool (we use Festival). Each line of a label file describes one phoneme in context: the start and end times of that phoneme, in units of 100 nanoseconds (ten-millionths of a second), followed by the context elements, each separated by a unique delimiter to enable pattern matching. See lab_format.pdf from the HTS demo for more information on the format and the standard contextual features.
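
As a rough illustration of the format described above, the following Python sketch splits one label line into its times and context string, and pulls the current phoneme out of the leading p1^p2-p3+p4=p5@ quinphone field. The function name, the regular expression, and the example line are assumptions made for illustration; the example line is shortened and is not taken from real data. Check lab_format.pdf for the authoritative field layout.

```python
import re

# Label times are integers in units of 100 ns, so 10,000,000 units = 1 second.
UNITS_PER_SECOND = 10_000_000

# Leading quinphone field, assumed to look like p1^p2-p3+p4=p5@...
QUINPHONE = re.compile(
    r"^(?P<ll>[^\^]+)\^(?P<l>[^-]+)-(?P<c>[^+]+)\+(?P<r>[^=]+)=(?P<rr>[^@]+)@"
)


def parse_label_line(line):
    """Split one fullcontext label line into times (seconds) and context."""
    start, end, context = line.split(None, 2)
    match = QUINPHONE.match(context)
    return {
        "start_s": int(start) / UNITS_PER_SECOND,
        "end_s": int(end) / UNITS_PER_SECOND,
        # Current phoneme (p3), or None if the prefix does not match.
        "phone": match.group("c") if match else None,
        "context": context.strip(),
    }


# Hypothetical, shortened example line (illustration only).
example = "0 2050000 x^x-sil+hh=ax@x_x/A:0_0_0/B:x-x-x"
parsed = parse_label_line(example)
print(parsed["start_s"], parsed["end_s"], parsed["phone"])
```

A parser like this can be handy for spot-checking regenerated labels, for example to confirm that end times are always greater than start times.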

1. Use Festival to Create Utts

Festival is the frontend tool we use to extract phonemes and contextual features from text. For training data, we have both text and audio, so Festival also aligns the text with the audio to get the start and end times of each phoneme. Festival produces utt-format files containing a structured representation of the utterance and its properties.

Are you working with English-language data? Follow these instructions.

Are you working with Babel data? Follow these instructions. They assume you already have a baseline Clustergen voice trained for your language, so make sure that has been done first.

You should end up with a set of .utt files that correspond to your data.
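
Before moving on, it can be worth confirming that every audio file has a matching .utt file. The sketch below is one way to do that in Python. It assumes that .wav and .utt files share the same base name and that each set lives in a single flat directory; both are assumptions about your setup, and the function name is hypothetical.

```python
from pathlib import Path


def missing_utts(wav_dir, utt_dir):
    """Return sorted base names that have a .wav file but no .utt file.

    Assumes matching files share a base name, e.g. a0001.wav and a0001.utt.
    """
    wav_ids = {p.stem for p in Path(wav_dir).glob("*.wav")}
    utt_ids = {p.stem for p in Path(utt_dir).glob("*.utt")}
    return sorted(wav_ids - utt_ids)
```

Calling missing_utts with your own wav and utt directories should return an empty list if every recording has a corresponding utt.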

2. Use HTS to Convert from Utt to Lab

HTS and Merlin use the same label file format. Use HTS to convert from the structured .utt format to the flattened .lab format. Speech lab students, do the following: