Festvox and Clustergen Scripts for Training Voices from Babel Data

Many thanks to Alan Black for providing these scripts.

Please note that these instructions are meant for Columbia Speech Lab students only.

Run these scripts on kucing, where all the dependencies are already installed and working.

Data and Setup

Always run these exports first, or add them to your .bashrc (recommended):

export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export SPTKDIR=/proj/tts/tools/babel_scripts/build/SPTK
export BABELDIR=/proj/tts/data/babeldir
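
Before going further, it can help to confirm that all four variables are set and point to real directories. The following is only a minimal sketch; check_babel_env is a hypothetical helper name and is not part of the babel_scripts directory.

```shell
# Hypothetical helper (not part of babel_scripts): report any of the four
# variables that is unset or does not point to an existing directory.
check_babel_env() {
    rc=0
    for name in ESTDIR FESTVOXDIR SPTKDIR BABELDIR; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value" >&2
            rc=1
        fi
    done
    return $rc
}
```

Calling check_babel_env after the exports prints one line per problem and returns a non-zero status if anything is wrong.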

You will need to make sure that the language's Babel data is present in $BABELDIR. E.g., to add Amharic:

ln -s /proj/speech/corpora/babel/IARPA/IARPA-babel307b-v1.0b-build/BABEL_OP3_307 /proj/tts/data/babeldir/

You'll have to find the original directory under /proj/speech/corpora/babel by looking around, as each is named somewhat differently. The numerical language codes for each language can be found here.

Then, when you run each of the voice training commands, replace BABEL_BP_105 (the Turkish language directory name) with the directory for your new language everywhere it appears, and replace turkish with the name of your voice directory.

These scripts are supposed to work on the Babel language packs as-is, and for the most part they do. However, we have run into issues with languages that have both .wav and .sph-format audio data, since the scripts expect .sph data only. (.sph files are telephone conversations; .wav files come from other recording conditions.) So before you start on a new language, check whether there are .wav files mixed in with the audio data under
$BABELDIR/[yourlanguagecode]/conversational/training/audio
If there are, create your own directories: one containing just the .sph files, and another containing just the corresponding .txt transcript files for those .sph audio files. Use those directories instead of $BABELDIR/[yourlanguagecode]/conversational/training/audio and $BABELDIR/[yourlanguagecode]/conversational/training/transcription respectively, in all of the commands that require them.
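
One way to build the .sph-only directories without copying any audio is to fill them with symlinks. This is a rough sketch under two assumptions that you should verify for your language pack: each transcript has the same base name as its audio file (foo.sph and foo.txt), and the training scripts follow symlinks (if they do not, replace ln -sf with cp). make_sph_only is a hypothetical helper name.

```shell
# Hypothetical helper: build sph-only audio and transcript directories
# out of symlinks, leaving the original language pack untouched.
# Assumption: a transcript shares its audio file's base name (foo.sph -> foo.txt).
# Pass absolute paths so that the symlinks resolve from anywhere.
make_sph_only() {
    src_audio=$1; src_trans=$2; out=$3
    mkdir -p "$out/audio" "$out/transcription" || return 1
    for sph in "$src_audio"/*.sph; do
        [ -e "$sph" ] || continue      # the glob matched nothing
        base=$(basename "$sph" .sph)
        ln -sf "$sph" "$out/audio/$base.sph"
        if [ -e "$src_trans/$base.txt" ]; then
            ln -sf "$src_trans/$base.txt" "$out/transcription/$base.txt"
        else
            echo "no transcript found for $base" >&2
        fi
    done
}
```

You would call it with the original audio directory, the original transcription directory, and an output directory of your choosing, then pass the resulting audio and transcription subdirectories to the training commands.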

Languages we have run these scripts on so far, and issues we ran into:

Basic Voice Training

e.g. for Turkish:

cd /proj/tts/tools/babel_scripts
mkdir turkish
cd turkish
/proj/tts/tools/babel_scripts/make_build setup_voice turkish \
  $BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt \
  $BABELDIR/BABEL_BP_105/conversational/training/transcription \
  $BABELDIR/BABEL_BP_105/conversational/training/audio

/proj/tts/tools/babel_scripts/make_build make_voice turkish \
  $BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt \
  $BABELDIR/BABEL_BP_105/conversational/training/transcription \
  $BABELDIR/BABEL_BP_105/conversational/training/audio
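
Since the same three paths appear in both commands, you may find it less error-prone to generate them from the language directory name instead of editing each line by hand. This is just a sketch; babel_args is a hypothetical helper name, and it assumes your language pack uses the same conversational/ layout as the Turkish one.

```shell
# Hypothetical helper: print the lexicon, transcription, and audio paths
# for one Babel language directory, one path per line.
# Assumption: the pack follows the same layout as BABEL_BP_105.
babel_args() {
    langdir=$1
    base="$BABELDIR/$langdir/conversational"
    printf '%s\n' \
        "$base/reference_materials/lexicon.txt" \
        "$base/training/transcription" \
        "$base/training/audio"
}
```

For example (this relies on word splitting, so it only works when the paths contain no whitespace):

/proj/tts/tools/babel_scripts/make_build setup_voice turkish $(babel_args BABEL_BP_105)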

make_voice will take a long time, so be sure to run it under screen.

The resulting voice may be used to synthesize new utterances as follows:

./bin/do_clustergen cg_test tts tts_test etc/txt.done.data.test

(You must provide your own txt.done.data.test. tts_test is the name of the directory under test/ where your output .wav files will go.)
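
If your test sentences are in a plain text file with one sentence per line, a small filter can wrap them in the Festvox prompt-file format, ( utterance_id "text" ). The sketch below is an illustration only: make_prompt_file is a hypothetical helper, and the ID scheme (a prefix plus a four-digit counter) is an arbitrary choice.

```shell
# Hypothetical helper: read sentences on stdin (one per line) and write
# Festvox prompt-file lines of the form: ( prefix_0001 "text" )
make_prompt_file() {
    prefix=$1
    # Drop blank lines, escape backslashes and double quotes, then number.
    sed -e '/^[[:space:]]*$/d' -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        awk -v prefix="$prefix" '{ printf "( %s_%04d \"%s\" )\n", prefix, NR, $0 }'
}
```

For example:

make_prompt_file tts < sentences.txt > etc/txt.done.data.test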

Note that, with these scripts as-is, any words in your test utterances that are OOV (out of vocabulary with respect to the lexicon used for training) are simply skipped during synthesis.
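
Because skipped words are easy to miss by ear, you may want to list them before synthesizing. The following sketch assumes that the lexicon is tab-separated with the word in the first column, that matching is case-sensitive, and that words in the prompt file are separated by whitespace with no punctuation attached. list_oov is a hypothetical helper name.

```shell
# Hypothetical helper: print each word in a prompt file that does not
# appear in the first (tab-separated) column of the lexicon.
list_oov() {
    lexicon=$1; prompts=$2
    # Strip the ( id " ... " ) wrapper, then split into one word per line.
    sed -e 's/^( *[^ ]* *"//' -e 's/" *)$//' "$prompts" |
        tr -s '[:space:]' '\n' | sort -u |
        awk -F '\t' 'NR == FNR { lex[$1] = 1; next } $0 != "" && !($0 in lex)' "$lexicon" -
}
```

For example:

list_oov $BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt etc/txt.done.data.test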

Using your Own Data

You might have your own .wav files and transcripts, and want to use these scripts to train a voice. Run the export commands above, create a yourvoicename directory and cd to it, then run the setup_voice command, substituting yourvoicename for turkish.

Then you can drop in your own data: put .wav files under wav/, and for transcripts, replace etc/txt.done.data.
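
Before training, it is worth checking that every utterance listed in etc/txt.done.data has a matching audio file. This sketch assumes the usual Festvox convention that an utterance with ID foo has its audio in wav/foo.wav; check_voice_data is a hypothetical helper name. As noted in the TODO below, we have not run this workflow end to end ourselves.

```shell
# Hypothetical helper: verify that each utterance ID in etc/txt.done.data
# has a corresponding wav/<id>.wav under the given voice directory.
check_voice_data() {
    voicedir=$1
    missing=0
    # The utterance ID is the first token after the opening parenthesis.
    for id in $(sed -n 's/^( *\([^ ]*\) .*/\1/p' "$voicedir/etc/txt.done.data"); do
        if [ ! -f "$voicedir/wav/$id.wav" ]; then
            echo "missing wav/$id.wav" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Run it from inside your voice directory as check_voice_data . and it will print one line per missing file and return a non-zero status if any are missing.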

Then you can run the make_voice command, again substituting in your own voice name. This used to also re-run setup_voice, which would clobber any new data you dropped in, but we have commented this out.

TODO: we haven't actually done this. We've only ever used the frontend and then dropped the resulting files into HTS to train a voice there. We should try this out and document it, along with any errors we come across.

Use Frontend Only to Get Labels

Training labels (.utt)

Generation labels (.lab)