Speaker-Adaptive Training Using HTS

Using the HTS SAT demo as-is

If you are dropping in your own data, the main difference is that each speaker gets their own directory for each type of data. When extracting acoustic features, remember to set an appropriate f0 search range for each speaker: use a generic male/female range or, better still, pick the range individually per speaker, as sketched below.
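
As a hedged illustration only (the directory layout, speaker names, and LOWERF0/UPPERF0 variable names below are modeled on the CMU ARCTIC adaptation demo, not taken verbatim from any particular release):

    # Hypothetical layout: one subdirectory per speaker for each data type.
    #   data/raw/awb/awb_arctic_a0001.raw
    #   data/raw/bdl/bdl_arctic_a0001.raw
    #   data/raw/slt/slt_arctic_a0001.raw

    # Sketch of per-speaker f0 search ranges. The values are rule-of-thumb
    # defaults; inspecting each speaker's actual pitch tracks and choosing
    # a tighter range is better still.
    case "$SPEAKER" in
      awb|bdl|rms) LOWERF0=40;  UPPERF0=280 ;;  # generic male range
      clb|slt)     LOWERF0=80;  UPPERF0=400 ;;  # generic female range
    esac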

For synthesis, you need gen labels that match the speaker you want to adapt to; see how this is done in data/Makefile.
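
As a sketch only (the paths are assumptions; the real rules live in data/Makefile), the idea is that the labels used at generation time carry the target speaker's name:

    # Hypothetical sketch: stage the target speaker's full-context labels
    # as the gen labels used at synthesis time.
    TARGET=slt                               # speaker to adapt to
    mkdir -p data/labels/gen
    for lab in data/labels/full/$TARGET/*.lab; do
        cp "$lab" data/labels/gen/
    done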

Other changes you have to make in data/Makefile:

Changes you have to make in scripts/Config.pm:

About the training steps in scripts/Config.pm, the following breakdown is from a thread on the HTS mailing list:

Steps 1-5: adaptation based on the speaker-independent (SI) model.
Steps 6-9: speaker-adaptive training of the average voice model (AVM).
Steps 10-13: adaptation based on the average voice model.

Synthesizing directly from a SAT-trained AVM without adapting to a specific speaker

Note that this is theoretically not something you should do: the AVM lives in an undefined space until it is adapted to a particular speaker. In practice, however, the HTS implementation of AVMs produces reasonable, average-sounding speech.

Speech lab students: see /proj/tts/examples/HTS-demo_AVM for an example. Modified CONVM and ENGIN steps were added to convert the AVM MMFs to the hts_engine format and synthesize from them. In essence, hts_engine synthesis was added to the "synthesize from SAT-trained AVM" step, after the SPTK synthesis that is already done there, with everything referring to speaker transforms removed, since no transform is wanted.
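
As a minimal sketch of that transform-free hts_engine call (the voice and label file locations are assumptions; only the -m and -ow options are standard hts_engine flags):

    # Hedged sketch: synthesize each gen utterance from the converted AVM
    # voice with hts_engine, applying no speaker transform at all.
    VOICE=voices/qst001/ver1/avm.htsvoice    # output of the modified CONVM step
    mkdir -p gen/wav
    for lab in data/labels/gen/*.lab; do
        base=$(basename "$lab" .lab)
        hts_engine -m "$VOICE" -ow "gen/wav/$base.wav" "$lab"
    done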