Preparing Data for Training an HTS Voice

This tutorial assumes that you have HTS and all of its dependencies installed, and that you have successfully run the demo training scripts. This site does not provide comprehensive installation instructions, since those are included in the HTS README; however, some notes on installation and on various errors you might encounter can be found here.

These are instructions for preparing new data to train a voice using the HTS 2.3 speaker-independent demo. For information about training a voice with speaker adaptation and about other variants, please see the variants page.

In the diagram above, the blue items are files you must provide before running the HTS data preparation steps, and the green items are the output created by each step.
Note that the .raw and .utt files are not required by the training scripts themselves, so if you have your own way of creating acoustic features or HTS-format .lab files, you can simply drop those in.

0. Background: Acoustic and Linguistic Features

There are two main kinds of data used to train TTS voices: acoustic and linguistic. These typically start out as the audio recordings of speech and their text transcripts, respectively. The model we are learning is a regression that maps text (typically transformed into a richer linguistic representation) to its acoustic realization, which is how we can synthesize new utterances.

The acoustic features are extracted from the raw audio signal (raw in the diagram above, in the make features step); these include lf0 (log F0, a representation of pitch) and mgc (mel-generalized cepstral features, which represent the spectral properties of the audio).

The linguistic features are produced from the text transcripts, and typically require additional resources such as pronunciation dictionaries for the language. The part of a TTS system that transforms plain text into a linguistic representation is called a frontend; we use Festival for our frontend tools. HTS does not include frontend processing, and it assumes that you are giving it the text data in already-processed form. The .utt files are the linguistic representation of the text that Festival outputs, and the HTS scripts convert that format into the HTS .lab format, which 'flattens' the structured Festival representation into a list of the phonemes in the utterance along with their contextual information. Have a look at lab_format.pdf (part of the HTS documentation) for information about the .lab format and the kind of information it includes.

1. Directory Setup

To start, copy the empty template to a directory with a name of your choosing, e.g. yourvoicename. You will then fill in the template with your data.

cp -r /proj/tts/hts-2.3/template_si_htsengine /path/to/yourvoicename
cd /path/to/yourvoicename

Then, in scripts/Config.pm, set $prjdir to the path of your voice directory.
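
For example, if your voice directory is /path/to/yourvoicename, the relevant line in Config.pm should look something like this (Perl syntax; the surrounding file may differ slightly between HTS versions):

$prjdir = '/path/to/yourvoicename';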

2. Prerequisites for HTS Data Setup

You will need each of the following before you can proceed with the HTS data preparation scripts:

Raw audio (.raw)
Fullcontext training labels (.utt)
Generation labels for synthesis (.lab)

It is expected that you already have these before proceeding with the next steps. Click on each item above to learn more about how to create it.

Place your .raw files in yourvoicename/data/raw.
Place your .utt files in yourvoicename/data/utts.
Place your gen labels in yourvoicename/data/labels/gen.
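
Once the files are in place, your data directory should look something like the following sketch (utt0001 and gen0001 are placeholder names; yours will differ):

yourvoicename/data/raw/utt0001.raw
yourvoicename/data/utts/utt0001.utt
yourvoicename/data/labels/gen/gen0001.lab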

3. make data steps

In yourvoicename/data you will see a Makefile. We will walk through the steps in this Makefile to set up the data in HTS format. All of these steps should be run from the yourvoicename/data directory.

3.0 Changes to the Makefile

Please make the following changes in your Makefile:

3.1 make features

This step extracts the various acoustic features from the raw audio; for each utterance it creates an lf0 file and an mgc file (see the Background section above). Run:

make features
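
If the step succeeds, the extracted features should appear in per-feature subdirectories, one file per utterance, named after the corresponding .raw files (a sketch; utt0001 is a placeholder):

lf0/utt0001.lf0
mgc/utt0001.mgc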

3.2 make cmp

This step combines the different acoustic features extracted in the previous step into one .cmp file per utterance. Run:

make cmp

3.3 make lab

This step "flattens" the structured .utt files into the HTS .lab format. It creates labels/full/*.lab, the fullcontext labels, and labels/mono/*.lab, the monophone labels, for each utterance. Run:

make lab

The fullcontext labels (full) contain the phonemes in context, as determined by the frontend. The monophone labels (mono) are just the phoneme sequence. Both formats give the start and end times of each phoneme in ten-millionths of a second, so to get times in seconds, place a decimal point seven digits from the end.
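
For example, a (hypothetical) line from a monophone label file might look like:

3050000 4600000 ah

which says that the phoneme ah starts at 0.3050000 seconds and ends at 0.4600000 seconds.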

3.4 make mlf

These files are "Master Label Files," which can either contain all of the information from the .lab files in a single file or contain pointers to the individual .lab files. We will be creating .mlf files that point to the .lab files. Run:

make mlf
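
The generated files should look something like this sketch of a pointer-style MLF, in which the quoted pattern is matched against label file names and the arrow tells the HTK/HTS tools which directory holds the individual .lab files (the path is a placeholder for your own absolute path):

#!MLF!#
"*/*.lab" -> "/path/to/yourvoicename/data/labels/mono"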

3.5 make list

This step creates full.list, full_all.list, and mono.list, which are lists of all of the unique labels: full.list contains all of the unique fullcontext labels in the training data, full_all.list contains those plus all of the gen labels, and mono.list contains the unique monophones. Run:

make list

Note that make list, as written in the demo scripts, relies on the .cmp files already existing: it only adds an utterance's labels to the list if the utterance has both a .cmp file and a .lab file. However, it does not use any of the information inside the .cmp file beyond checking that it exists.
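
A quick sanity check before running make list is to confirm that the .cmp and .lab files line up, for example with standard shell commands run from yourvoicename/data (the two counts should match):

ls cmp/*.cmp | wc -l
ls labels/full/*.lab | wc -l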

3.6 make scp

This step creates the training and generation script files, train.scp and gen.scp. These are simply a list of the files you want to use to train the voice and a list of the files from which you want to synthesize examples. Run:

make scp

If you ever want to train on just a subset of utterances, you only have to modify train.scp.
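
Each line of train.scp is just the path to one training utterance's .cmp file, so subsetting is a matter of editing that list. For example, to train on only the first 100 utterances (a sketch assuming the demo's usual scp/train.scp location; back up the full list first so you can restore it):

cp scp/train.scp scp/train.scp.full
head -n 100 scp/train.scp.full > scp/train.scp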

4. Questions File

Make sure you are using a questions file appropriate to your data. The default one in the template is for English; we have also created a questions file for Turkish, as well as ones for custom frontend features. Read more about questions files here.
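
For reference, a questions file is a list of QS entries, each pairing a named question with the set of fullcontext-label patterns that answer it "yes"; the decision-tree clustering step uses these questions to split nodes. A hypothetical English-style entry might look like:

QS "C-Voiced_Plosive" {*-b+*,*-d+*,*-g+*}

This question asks whether the current phoneme (the segment between - and + in the fullcontext label) is b, d, or g.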

Errors and Solutions

Next:

Continue on to voice training.

Notes for Columbia Speech Lab Students:

We typically keep the data separate from the voice training, since we train many voices from the same data. Data lives in /proj/tts/data, which contains what is essentially the data subdirectory for each voice. Voices themselves live in /proj/tts/voices, and the data is symbolically linked into each voice's data directory to avoid copying it multiple times. For more information about what gets copied, symlinked, or changed for each voice, see the voice training page notes for Speech Lab students.