Preparing .raw Audio for HTS Voice Training

16k .wav to .raw

Your .wav files should be 16 kHz, mono, 16-bit PCM. You can check this with soxi, which should report something like:

Input File : 'fae_0001.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:08.49 = 135872 samples ~ 636.9 CDDA sectors
File Size : 272k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
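
If you have a whole directory of files, reading the full soxi report for each one gets tedious. Here is a minimal sketch that checks the three fields that matter, using soxi's single-field options (-r, -c, -b). The function name check_wav and the wav/ directory are made up for illustration.

```shell
# Report whether a .wav file is 16 kHz, mono, 16-bit.
# soxi -r prints the sample rate, -c the channel count, -b the bits per sample.
check_wav() {
  rate=$(soxi -r "$1") || return 2
  chans=$(soxi -c "$1") || return 2
  bits=$(soxi -b "$1") || return 2
  if [ "$rate" = 16000 ] && [ "$chans" = 1 ] && [ "$bits" = 16 ]; then
    echo "OK   $1"
  else
    echo "BAD  $1 (rate=$rate channels=$chans bits=$bits)"
    return 1
  fi
}

# Usage (wav/ is a hypothetical directory name):
#   for f in wav/*.wav; do check_wav "$f"; done
```

Any file reported as BAD needs converting first; see the sox command further down.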

First, make sure these are on your $PATH:

/proj/tts/hts-2.3/SPTK-3.9/installation/bin/
/proj/tts/hts-2.3/speech_tools/bin
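
One way to do this for the current shell session is sketched below. It also checks that the four tools used in the conversion command can be found (ch_wave comes from speech_tools; x2x, interpolate and ds come from SPTK). Adjust the paths if your installation lives somewhere else.

```shell
# Prepend the SPTK and speech_tools directories to PATH for this session.
export PATH="/proj/tts/hts-2.3/SPTK-3.9/installation/bin/:/proj/tts/hts-2.3/speech_tools/bin:$PATH"

# Warn about any tool the conversion pipeline needs that is still not found.
for tool in ch_wave x2x interpolate ds; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool" >&2
done
```

To make the change permanent, you could put the export line in your shell startup file.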

Next, assuming your audio is already in 16k .wav format, use this command to convert to the appropriate .raw format for HTS:

ch_wave -c 0 -F 32000 -otype raw in.wav | x2x +sf | interpolate -p 2 -d | ds -s 43 | x2x +fs > out.raw
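
To run the same conversion over many files, you can wrap the pipeline in a function and loop over a directory. This is only a sketch: the function name wav_to_raw and the wav/ and raw/ directories are made up. Going by the SPTK option descriptions, the chain appears to resample to 32 kHz (ch_wave -F 32000), double that to 64 kHz (interpolate -p 2), and then downsample 4:3 (ds -s 43), which would leave 48 kHz 16-bit samples. Treat that reading as an assumption and check it against your SPTK version.

```shell
# Convert one 16 kHz .wav file ($1) to a headerless .raw file ($2),
# using exactly the pipeline given above.
# Assumed rate chain (not verified here): 32 kHz -> 64 kHz -> 48 kHz.
wav_to_raw() {
  ch_wave -c 0 -F 32000 -otype raw "$1" \
    | x2x +sf \
    | interpolate -p 2 -d \
    | ds -s 43 \
    | x2x +fs > "$2"
}

# Usage (wav/ and raw/ are hypothetical directory names):
#   mkdir -p raw
#   for f in wav/*.wav; do
#     wav_to_raw "$f" "raw/$(basename "$f" .wav).raw"
#   done
```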

See below for converting other formats.

Errors and Solutions

Any .wav to 16k .wav

If your audio is a .wav file at some sample rate other than 16 kHz, use sox to convert it:

sox input.wav -r 16000 output.wav
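
If the file also differs in channel count or bit depth, sox can fix those in the same pass. The sketch below assumes you want the mono, 16-bit layout shown in the soxi output at the top of this page. The function name to_16k_wav is made up, and note that -c 1 mixes a stereo file down to one channel rather than picking one.

```shell
# Convert any .wav ($1) to 16 kHz, mono, 16-bit ($2).
# -r sets the output sample rate, -c the channel count, -b the bit depth.
to_16k_wav() {
  sox "$1" -r 16000 -c 1 -b 16 "$2"
}

# Usage (file names are placeholders):
#   to_16k_wav input.wav output.wav
```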

Converting .sph to .wav

.sph is the format that many LDC corpora use. Certain versions of sox can do this conversion:

sox inputfile.sph outputfile.wav

However, this won't work if the file is in the particular .sph variant that uses 'shorten' compression. If that's the case, you'll see this error:
sph: unsupported coding `ulaw,embedded-shorten-v2.00'

In that case, you need to use the NIST tool sph2pipe:

sph2pipe -p [-c 1|2] infile outfile

The -p forces the output to 16-bit linear PCM, the sample encoding required above (it does not change the sample rate).
The -c picks channel 1 or 2 if you want to separate them, e.g. one channel per speaker.
Speech lab students: we have a copy of this under /proj/speech/tools/sph2pipe_v2.5

Even after you do this, the output still seems to retain a .sph header, so you'll next have to run
sox infile outfile
to convert it to a regular .wav.

Also, depending on the particular version of the .sph format, you sometimes need to ask sph2pipe for a .wav header explicitly:

sph2pipe -p -f wav in.sph out.wav

-p     -- force conversion to 16-bit linear pcm
-f typ -- select alternate output header format 'typ'
          five types: sph, raw, au, rif(wav), aif(mac)
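
Putting the .sph steps together: the sketch below tries plain sox first and falls back to sph2pipe only if that fails, then runs the sox pass described above on the result. The function name sph_to_wav and the temporary file naming are made up. The final sox pass may be redundant when -f wav is given, but it is kept here to follow the steps as written.

```shell
# Convert a .sph file ($1) to a regular .wav file ($2).
sph_to_wav() {
  # Plain sox works for uncompressed .sph files.
  if sox "$1" "$2" 2>/dev/null; then
    return 0
  fi
  # Fall back to sph2pipe (e.g. for 'shorten'-compressed files),
  # then let sox rewrite the result as a regular .wav.
  tmp="${2%.wav}.tmp.wav"
  sph2pipe -p -f wav "$1" "$tmp" && sox "$tmp" "$2"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage (file names are placeholders):
#   sph_to_wav in.sph out.wav
```

The result is not guaranteed to be 16 kHz, so check it with soxi and resample with sox if needed.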