Preparing .raw Audio for HTS Voice Training

16k .wav to .raw

Your .wav files should be 16 kHz, 16-bit PCM. You can check this with soxi, which should report something like:

Input File     : 'fae_0001.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:08.49 = 135872 samples ~ 636.9 CDDA sectors
File Size      : 272k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
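To run the same check over a whole directory, something like the sketch below may help. check_wav_format is a hypothetical helper name, not part of sox or HTS; it relies only on the soxi options -r (sample rate), -c (channels) and -b (bits per sample), and it assumes you want mono files, as in the example above.

```shell
# check_wav_format (hypothetical helper): list every .wav in a directory
# that is not 16 kHz, mono, 16-bit. Returns 1 if any file is off-spec.
check_wav_format() {
    local dir="$1" status=0 f rate chans bits
    for f in "$dir"/*.wav; do
        [ -e "$f" ] || continue          # the glob matched nothing
        rate=$(soxi -r "$f")             # sample rate in Hz
        chans=$(soxi -c "$f")            # number of channels
        bits=$(soxi -b "$f")             # bits per sample
        if [ "$rate" != 16000 ] || [ "$chans" != 1 ] || [ "$bits" != 16 ]; then
            echo "$f: rate=$rate channels=$chans bits=$bits"
            status=1
        fi
    done
    return "$status"
}
```

Usage: check_wav_format /path/to/wav_dir prints one line per file that needs converting and prints nothing if everything is already in the expected format.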

First, make sure these are on your $PATH:

/proj/tts/hts-2.3/SPTK-3.9/installation/bin/
/proj/tts/hts-2.3/speech_tools/bin
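One way to do this for the current shell session is sketched below. It assumes a Bourne-style shell (sh, bash); the loop at the end is only a sanity check that the tools used in the conversion command can be found.

```shell
# Prepend the SPTK and speech_tools directories to PATH for this session.
PATH="/proj/tts/hts-2.3/SPTK-3.9/installation/bin/:/proj/tts/hts-2.3/speech_tools/bin:$PATH"
export PATH

# Sanity check: report any tool that still cannot be found.
for tool in ch_wave x2x interpolate ds; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool not found on PATH"
done
```

To make the change permanent, the two PATH lines could go in your shell startup file.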

Next, assuming your audio is already in 16k .wav format, use this command to convert to the appropriate .raw format for HTS:

ch_wave -c 0 -F 32000 -otype raw in.wav | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > out.raw

See below for converting other formats.
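For a whole corpus, the conversion command above can be wrapped in a loop. This is a minimal sketch: wav_to_raw and convert_dir are hypothetical helper names, and the layout (one directory of .wav files in, one directory of .raw files out, same base names) is an assumption.

```shell
# wav_to_raw (hypothetical helper): the conversion pipeline for one file.
wav_to_raw() {
    ch_wave -c 0 -F 32000 -otype raw "$1" \
        | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > "$2"
}

# convert_dir (hypothetical helper): WAV_DIR/NAME.wav -> RAW_DIR/NAME.raw
convert_dir() {
    local wav_dir="$1" raw_dir="$2" f base
    mkdir -p "$raw_dir"
    for f in "$wav_dir"/*.wav; do
        [ -e "$f" ] || continue              # the glob matched nothing
        base=$(basename "$f" .wav)
        wav_to_raw "$f" "$raw_dir/$base.raw"
    done
}
```

Usage: convert_dir wav raw. Note that this loop does not check for errors, so watch the terminal output for the messages described in the next section.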

Errors and Solutions

Segmentation fault - This is a heisenbug: it does not reproduce reliably. Running the same command again usually works.
rateconv: failed to convert from 8000 to 32000 - This happens for very short utterances, often backchannels, that contain little to no audio. Exclude these from training. Be careful: the .raw file is still created, so check for it and remove it.
x2x : error: input data is over the range of type 'short'!
ds : File write error! - Both of these come from maxed-out audio or clipping. The .raw file is still written.
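Because the .raw file is written even when one of these errors occurs, it can help to capture the error messages and clean up afterwards. The sketch below assumes that any output on stderr from the pipeline means the result should not be trusted. convert_checked and the bad-list file are hypothetical names. A crash that prints nothing on the captured stderr would not be caught this way.

```shell
# convert_checked (hypothetical helper): run the conversion for one file,
# and if any stage printed an error, delete the .raw and log the input name.
convert_checked() {
    local src="$1" dst="$2" bad_list="$3" err
    # stdout of the pipeline goes to $dst; stderr of every stage is captured.
    err=$( { ch_wave -c 0 -F 32000 -otype raw "$src" \
               | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > "$dst"; } 2>&1 )
    if [ -n "$err" ]; then
        printf '%s\n' "$err" >&2             # show the original message
        rm -f "$dst"                         # do not leave a bad .raw behind
        printf '%s\n' "$src" >> "$bad_list"  # remember what to exclude or retry
        return 1
    fi
}
```

Usage: convert_checked in.wav out.raw bad_utts.txt. Files listed in bad_utts.txt can be retried once (for the segmentation fault case) and excluded from training if they fail again.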

Any .wav to 16k .wav

If your audio is a .wav file with a sample rate other than 16 kHz, use sox to convert it:

sox input.wav -r 16000 output.wav
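To resample a whole directory, a loop along these lines should work. resample_dir is a hypothetical helper name. The extra -c 1 -b 16 output options are an addition to the command above: they also force mono, 16-bit output, which is an assumption about what you want (with -c 1, sox mixes multi-channel input down to one channel).

```shell
# resample_dir (hypothetical helper): SRC_DIR/NAME.wav -> DST_DIR/NAME.wav
# at 16 kHz, mono, 16-bit.
resample_dir() {
    local src_dir="$1" dst_dir="$2" f
    mkdir -p "$dst_dir"
    for f in "$src_dir"/*.wav; do
        [ -e "$f" ] || continue              # the glob matched nothing
        sox "$f" -r 16000 -c 1 -b 16 "$dst_dir/$(basename "$f")"
    done
}
```

Usage: resample_dir wav_orig wav_16k. Writing to a separate directory keeps the original files untouched.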

Converting .sph to .wav

.sph is the format that many LDC corpora use. Certain versions of sox can do this conversion:

sox inputfile.sph outputfile.wav

This won't work, however, if the file is in the particular .sph format that uses 'shorten' compression. In that case, you'll see this error:
sph: unsupported coding `ulaw,embedded-shorten-v2.00'

In that case, you need to use the NIST tool sph2pipe:

sph2pipe -p [-c 1|2] infile outfile

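The -p forces conversion to 16-bit linear PCM. It does not change the sample rate, so if the audio is not 16 kHz you still need to resample it with sox as described above.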
The -c picks channel 1 or 2 if you want to separate them, e.g. for speakers.
Speech lab students: we have a copy of this under /proj/speech/tools/sph2pipe_v2.5

Even after you do this, the output can still have a .sph header (sph2pipe keeps the sph header type unless -f selects another one), so you'll next have to use
sox infile outfile
to convert it to a regular .wav.

Also, depending on the particular version of the .sph format, you sometimes need to run sph2pipe again with an explicit output header type:

sph2pipe -p -f wav in.sph out.wav

-p -- force conversion to 16-bit linear pcm
-f typ -- select alternate output header format 'typ'
          five types: sph, raw, au, rif(wav), aif(mac)
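The two routes above (sox first, sph2pipe when sox cannot read the file) can be combined in one step. This is a sketch under assumptions: sph_to_wav is a hypothetical helper name, and it treats any sox failure as a reason to fall back to sph2pipe, not only the 'shorten' error.

```shell
# sph_to_wav (hypothetical helper): try sox first, fall back to sph2pipe.
sph_to_wav() {
    local src="$1" dst="$2"
    if sox "$src" "$dst" 2>/dev/null; then
        return 0
    fi
    rm -f "$dst"                             # discard any partial sox output
    # -p: 16-bit linear PCM, -f wav: write a .wav header instead of .sph
    sph2pipe -p -f wav "$src" "$dst"
}
```

Usage: sph_to_wav in.sph out.wav. Add -c 1 or -c 2 to the sph2pipe line if you need a single channel. The result keeps the original sample rate, so it may still need the sox resampling step before the .raw conversion.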