Festvox and Clustergen Scripts for Training Voices from Babel Data

Many thanks to Alan Black for providing these scripts.

Please note that these instructions are meant for Columbia Speech Lab students only.

Run these scripts on kucing, where all the dependencies are installed and working. The old instructions for full Clustergen voice training can be found here. However, since we mainly use these scripts for frontend processing (to get utts), the instructions below cover only that.

Data and Setup

Always do this first, or make sure these are in your .bashrc (recommended):

export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export SPTKDIR=/proj/tts/tools/babel_scripts/build/SPTK
export BABELDIR=/proj/tts/data/babeldir
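A quick sanity check can catch a missing export before one of the build commands fails partway through. The following is a minimal sketch; check_babel_env is a hypothetical helper name, not part of the Babel scripts.

```shell
# Sketch: verify that the four environment variables are set and point at directories.
# check_babel_env is a hypothetical helper, not something make_build provides.
check_babel_env() {
    local name value status=0
    for name in ESTDIR FESTVOXDIR SPTKDIR BABELDIR; do
        # Look up the variable whose name is stored in $name (empty if unset).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    return $status
}
```

Calling check_babel_env at the start of a session prints one line per problem and returns non-zero if anything is missing.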

You will need to make sure that the language's Babel data is present in $BABELDIR. E.g., to add Amharic:

ln -s /proj/speech/corpora/babel/IARPA/IARPA-babel307b-v1.0b-build/BABEL_OP3_307 /proj/tts/data/babeldir/

You'll have to find the original directory under /proj/speech/corpora/babel by looking around, as each is named somewhat differently. The numerical language codes for each language can be found here.

Then, when you run each of the voice training commands, replace BABEL_BP_105 (the Turkish language directory name) with the directory for your new language everywhere it appears, and replace turkish with the name of your voice directory.

These scripts are supposed to work on the Babel language packs as-is, and for the most part they do. However, we have run into issues with languages that have both .wav and .sph-format audio data, since the scripts expect .sph data only. (.sph files are telephone conversations; .wav files come from other recording conditions.) So before you start on a new language, check whether there are .wav files mixed in with the audio data under
$BABELDIR/[yourlanguagecode]/conversational/training/audio
If so, create your own directories, one containing just the .sph files and another containing just the corresponding .txt transcript files for those .sph files. Then use those directories instead of $BABELDIR/[yourlanguagecode]/conversational/training/audio and $BABELDIR/[yourlanguagecode]/conversational/training/transcription, respectively, in all of the commands that require them.
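The check and the directory split described above can be sketched as two small shell functions. This is only an illustration: the function names and the audio/ and transcription/ layout under the output directory are our own choices, and it assumes each transcript has the same basename as its audio file (foo.sph and foo.txt). Verify that against your language pack before relying on it.

```shell
# Sketch: detect .wav files and build sph-only audio/transcript directories.
# Function names and output layout are hypothetical, not required by make_build.

# Print how many .wav files sit directly in the given audio directory.
count_wav_files() {
    find "$1" -maxdepth 1 -name '*.wav' | wc -l | tr -d ' '
}

# Usage: make_sph_only AUDIO_DIR TRANSCRIPT_DIR OUT_DIR
make_sph_only() {
    local audio_dir trans_dir out_dir sph base
    # Resolve to absolute paths so the symlinks stay valid from anywhere.
    audio_dir=$(cd "$1" && pwd) || return 1
    trans_dir=$(cd "$2" && pwd) || return 1
    out_dir=$3
    mkdir -p "$out_dir/audio" "$out_dir/transcription"
    for sph in "$audio_dir"/*.sph; do
        [ -e "$sph" ] || continue    # directory contains no .sph files
        base=$(basename "$sph" .sph)
        # Symlink instead of copying, to avoid duplicating the corpus.
        ln -sf "$sph" "$out_dir/audio/$base.sph"
        # Assumption: the transcript shares the audio file's basename.
        if [ -e "$trans_dir/$base.txt" ]; then
            ln -sf "$trans_dir/$base.txt" "$out_dir/transcription/$base.txt"
        else
            echo "warning: no transcript found for $base" >&2
        fi
    done
}
```

If count_wav_files reports anything other than 0 for the training audio directory, run make_sph_only and pass the resulting audio and transcription directories to the later commands.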

Voice Setup:

e.g. for Turkish:

cd /proj/tts/tools/babel_scripts
mkdir yourusername
cd yourusername
cp ../make_build .
mkdir turkish
cd turkish
../make_build setup_voice turkish \
  $BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt \
  $BABELDIR/BABEL_BP_105/conversational/training/transcription \
  $BABELDIR/BABEL_BP_105/conversational/training/audio
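To avoid retyping the language directory in three places when adapting the command above to a new language, the paths can be generated from a single argument. This is a sketch only: babel_setup_args is a hypothetical helper, and it assumes the language pack follows the same conversational/ layout as the Turkish example.

```shell
# Sketch: print the three path arguments that setup_voice takes, one per line.
# babel_setup_args is a hypothetical helper, not part of make_build.
babel_setup_args() {
    local lang_dir=$1
    # Abort with a message if BABELDIR has not been exported.
    local base="${BABELDIR:?set BABELDIR first}/$lang_dir/conversational"
    printf '%s\n' \
        "$base/reference_materials/lexicon.txt" \
        "$base/training/transcription" \
        "$base/training/audio"
}
```

For example, with the Amharic directory linked earlier, the call would look like `../make_build setup_voice amharic $(babel_setup_args BABEL_OP3_307)`. The unquoted substitution relies on the paths containing no spaces.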

Check Phoneset:

Under the festvox directory, check the phoneset file to make sure there are no phonemes with special characters that will break things later on. Brackets should have gotten replaced already, but we have also been replacing things like underscores (just removing them) and tildes (replace with TL).

Also, make sure that all the vowels are in fact marked as vowels in the phoneset file. Any vowel that is not already in the default Festival phoneset ('radio') will not be marked as one automatically. Check the LSP file for the language if you are unsure.

Also, in the Babel lexicon files, the symbol # is commonly used to denote word boundaries. This should get converted to wb because # is a delimiter character in the label file format.
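The renaming conventions described in this section can be written down as a single sed pipeline, which helps keep the phoneset and the lexicon consistent. This is a minimal sketch: sanitize_phone is a hypothetical helper that operates on one phoneme name at a time, not on the phoneset file itself. Note that removing underscores can make two distinct phonemes collide (for example, t_h becomes th), so check the result for duplicates.

```shell
# Sketch: apply the renaming conventions to a single phoneme name.
# sanitize_phone is a hypothetical helper, not part of the Babel scripts.
sanitize_phone() {
    # Remove underscores, replace tildes with TL, replace # with wb.
    printf '%s\n' "$1" | sed -e 's/_//g' -e 's/~/TL/g' -e 's/#/wb/g'
}
```

For instance, `sanitize_phone 'a~'` prints aTL and `sanitize_phone '#'` prints wb.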

Check Lexicon:

In the file cmu_babel_lex.out, find and replace any phonemes that you renamed in the phoneset file. Make sure to replace them ONLY on the phoneme side, not on the word side.

Also check whether there are any weird characters on the word side. E.g. for Lithuanian, letters which were spoken as letters were in the lexicon like this: /C/ /D/ /T/ etc. This broke the scripts, and the fix was to remove the slashes in the lexicon entries.

Segment audio into utterances:

Back in the top-level directory for your language, run these commands one by one:

If you are working with conversational data:

../make_build make_raw_waves /path/to/babel/audio
../make_build make_prompts /path/to/babel/transcripts
../make_build reduce_prompts etc/txt.done.data.all
../make_build make_extract_subutts etc/txt.done.data
./bin/do_build parallel build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data

If you are working with scripted data:

../make_build make_raw_waves /path/to/babel/audio      # This should create recording/*.wav
../make_build make_scripted_prompts /path/to/babel/transcripts      # This should create etc/txt.done.data.all
../make_build reduce_prompts etc/txt.done.data.all      # This should create etc/txt.done.data
../make_build clean_conv_subutts      # This does some audio cleanup and should create wav/*.wav
./bin/do_build parallel build_prompts etc/txt.done.data      # This creates prompt-utt/*.utt and prompt-lab/*.lab
./bin/do_build label etc/txt.done.data      # This does EHMM alignment. It takes a long time. Save the output when done so you can get the log likelihoods later on.
./bin/do_clustergen parallel build_utts etc/txt.done.data      # This produces utterance files in festival/utts/*.utt

[[TODO this is still buggy]]

Your .utt files should be present under festival/utts.
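Two small helpers can make the last steps easier to verify: one saves a command's output to a log file (useful for the label step, whose output contains the log likelihoods), and one compares the number of prompts with the number of .utt files produced. Both are sketches with hypothetical names. The count check assumes etc/txt.done.data holds one prompt per line, and some prompts may legitimately be dropped during the build, so treat a mismatch as something to investigate rather than a definite failure.

```shell
# Sketch: hypothetical helpers, not part of make_build or do_build.

# Usage: run_logged LOGFILE COMMAND [ARGS...]
# Runs the command, showing its output and saving a copy to LOGFILE.
# Note: the return status is tee's, not the command's.
run_logged() {
    local log=$1
    shift
    "$@" 2>&1 | tee "$log"
}

# Usage: check_utts [VOICE_DIR]   (defaults to the current directory)
# Succeeds only if the prompt count matches the .utt file count.
check_utts() {
    local voice_dir=${1:-.}
    local prompts utts
    prompts=$(wc -l < "$voice_dir/etc/txt.done.data" | tr -d ' ')
    utts=$(find "$voice_dir/festival/utts" -maxdepth 1 -name '*.utt' | wc -l | tr -d ' ')
    echo "prompts: $prompts, utts: $utts"
    [ "$prompts" -eq "$utts" ]
}
```

For example, `run_logged label.log ./bin/do_build label etc/txt.done.data` keeps the alignment output in label.log, and running check_utts from the top-level directory for your language reports both counts.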

Languages we have run these scripts on so far, and issues we ran into: