Speaker-Adaptive Training Using HTS
Using the HTS SAT demo as-is
If you are dropping in your own data, the
main difference is that for each type of data, each speaker gets
their own directory. When extracting acoustic features, set an
appropriate f0 range for each speaker: use a general male/female
range or, better, pick a range tailored to each speaker.
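One way to pick a per-speaker range is to derive it from f0 values you have already extracted. The sketch below is our own illustration, not part of the demo: it assumes you have a list of log-f0 frame values for a speaker (HTS-style, with unvoiced frames marked by a large negative number) and returns a padded min/max in Hz.

```python
import math

# HTS marks unvoiced frames in .lf0 data with a large negative value.
LF0_UNVOICED = -1.0e10

def f0_range(lf0_values, lo_pct=0.05, hi_pct=0.95, margin=1.2):
    """Estimate an (fmin, fmax) f0 search range in Hz from log-f0 frames.

    Drops unvoiced frames, takes robust low/high percentiles of the
    voiced f0 values, and widens them by `margin` so the extractor is
    not clipped exactly at the observed extremes.
    """
    voiced = sorted(math.exp(v) for v in lf0_values if v > LF0_UNVOICED / 2)
    if not voiced:
        raise ValueError("no voiced frames")
    lo = voiced[int(lo_pct * (len(voiced) - 1))]
    hi = voiced[int(hi_pct * (len(voiced) - 1))]
    return round(lo / margin), round(hi * margin)
```

The resulting pair is what you would then put into F0_RANGES in data/Makefile for that speaker.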
For synthesis, you have to have gen labels that match the
speaker you want to adapt to; see how this is done in
Other changes you have to make in data/Makefile:
- Change DATASET if you are using your own data.
- Change TRAINSPKR, ADAPTSPKR, and ALLSPKR to the appropriate
  speaker IDs.
- Change ADAPTHEAD to whatever is appropriate.
- Change F0_RANGES to the correct range for each speaker. We are
  currently using 110 280 for female speakers and 50 280 for male
  speakers, but it is better to customize the range for each speaker
  if possible.
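Put together, the variable block in data/Makefile might look like the sketch below. The speaker IDs and values are placeholders patterned on the CMU ARCTIC adaptation demo, and we are assuming F0_RANGES takes speaker/lower/upper triples; check your own Makefile before copying anything.

```make
# Illustrative values only -- speaker IDs and ranges are placeholders.
DATASET    = cmu_us_arctic
TRAINSPKR  = awb bdl clb
ADAPTSPKR  = slt
ALLSPKR    = $(TRAINSPKR) $(ADAPTSPKR)
ADAPTHEAD  = b05    # filename prefix of the adaptation utterances

# speaker-ID / lower-Hz / upper-Hz triples
F0_RANGES  = awb 50 280 bdl 50 280 clb 110 280 slt 110 280
```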
Changes you have to make in scripts/Config.pm:
About the training steps in scripts/Config.pm, from a
thread on the HTS mailing list:
- $spkrPat -- the %%% is the mask for the part
of the filename that represents the speaker ID.
- Steps 1-5: adaptation based on the SI (speaker-independent) model.
- Steps 6-9: speaker-adaptive training of the average voice model.
- Steps 10-13: adaptation based on the average voice model.
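To make the mask concrete, here is a small illustration of extracting the speaker ID from an utterance name, which is the job of the %%% slot in $spkrPat. The filename scheme and the three-letter ID pattern are hypothetical; adjust the pattern to however your files are actually named.

```python
import re

# Hypothetical naming scheme: the capture group plays the role of the
# %%% mask in Config.pm, i.e. the slot holding the speaker ID.
SPKR_RE = re.compile(r"^cmu_us_arctic_([a-z]{3})_")

def speaker_of(fname):
    """Return the speaker ID embedded in an utterance filename."""
    m = SPKR_RE.match(fname)
    if m is None:
        raise ValueError(f"no speaker ID in {fname!r}")
    return m.group(1)
```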
Synthesizing directly from a SAT-trained AVM without adapting to a speaker
Note that this is theoretically not something you should do, since the
AVM is in some undefined space until you adapt it to a particular
speaker. However, the implementation of AVMs in HTS produces
reasonable, average-sounding speech.
Speech lab students: see /proj/tts/examples/HTS-demo_AVM for
an example. Modified CONVM and ENGIN steps were
added to convert AVM MMFs to the HTS-engine format and synthesize from
them. We basically added hts_engine synthesis to the "synthesize from
SAT-trained AVM" step, after the SPTK synthesis that is already done
there, and removed everything referring to speaker transforms, since
we do not want to use one.
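For reference, once the AVM has been converted to the engine format, synthesizing from it is a plain hts_engine call. This is a sketch only: the paths and voice name are placeholders, and the exact flags depend on which hts_engine_API version you have installed.

```sh
# Placeholder paths: -m loads the converted voice, -ow writes the waveform,
# and the positional argument is the full-context label file to synthesize.
hts_engine -m voices/avm.htsvoice -ow gen/utt01.wav gen/labels/utt01.lab
```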