Adding and Modifying Frontend Features

Adding new frontend features at the utterance level

If you are only adding a new feature at the utterance level, then every line in the label file can get the same value for that feature, and you don't need to dig into the Festival utterance structure at all.

See /proj/tts/examples/hts_labeler/add_labels.py for an example script that adds new labels at the utterance level. Public version here.
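
As a rough illustration of the idea (not the contents of add_labels.py), the sketch below appends the same field to every line of a label file. It assumes the new feature is written as a trailing `/K:<value>` field, matching the `/K:` key used in the questions further down this page. The function name `add_utterance_feature` and the shortened label strings are made up for the example.

```python
def add_utterance_feature(label_lines, value, key="K"):
    """Append '/<key>:<value>' to every non-empty label line.

    Assumption: the feature lives in a new trailing field; this is a
    sketch, not the behaviour of the real add_labels.py script.
    """
    suffix = "/{}:{}".format(key, value)
    out = []
    for line in label_lines:
        stripped = line.rstrip("\n")
        if stripped.strip():
            # Every line of the utterance gets the same value.
            out.append(stripped + suffix)
        else:
            # Leave blank lines untouched.
            out.append(stripped)
    return out


# Shortened, invented label lines (start time, end time, context label).
example = [
    "0 50000 x^x-sil+h=e@x_x/A:0_0_0",
    "50000 100000 x^sil-h+e=l@1_2/A:0_0_0",
]
labelled = add_utterance_feature(example, "hi")
```

Because the value is constant across the utterance, no alignment with the Festival utterance structure is needed.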

See the Trustworthy Voices project page for an example of a full pipeline of adding multiple new frontend features, training a voice using them, and synthesizing new utterances with those voices.

Modifying the Questions File

The questions file is used to parse the label files and turn them into the features fed to the model. If no question corresponds to a new feature you added to the label file, that feature is never extracted and the model will not use it.

If you are adding one new feature at the frontend (e.g. some acoustic or prosodic feature) using the script above, and the feature takes categorical values such as 'hi', 'med', or 'lo', then you need to add the corresponding questions to your questions file (e.g. at the top):

  QS "Ftr-Hi" {/K:hi}
  QS "Ftr-Med" {/K:med}
  QS "Ftr-Lo" {/K:lo}
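
If a feature has many values, the question lines can be generated rather than typed by hand. This is a small sketch under the same assumptions as the example above (a `/K:` field key and an `Ftr` question-name prefix); the helper `make_categorical_questions` is hypothetical and not part of Merlin or HTS.

```python
def make_categorical_questions(name, values, key="K"):
    """Build one QS line per categorical value of a feature."""
    lines = []
    for v in values:
        # Doubled braces are literal braces in str.format.
        lines.append('QS "{}-{}" {{/{}:{}}}'.format(name, v.capitalize(), key, v))
    return lines


questions = make_categorical_questions("Ftr", ["hi", "med", "lo"])
for q in questions:
    print(q)
```

The generated lines can then be pasted into the questions file alongside the existing questions.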

Adding continuous numerical features

Neural networks accept continuous numeric features, but the legacy HTS label and question file format is built around discrete categorical features. Merlin nevertheless makes it easy to add continuous features by introducing a new question type to the question file format, the "continuous question".

Usage info from label_normalisation.py:

  (\d+) -- handles digit without decimal point
  ([\d\.]+) -- handles digits with and without decimal point

These digit patterns can be used in larger regular expressions in place of categorical feature values that you want to exactly match; see ends of existing question files for examples.
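
To illustrate how such a pattern pulls a number out of a label, here is a small sketch that uses Python's `re` module directly rather than Merlin's own code. It assumes a hypothetical numeric feature stored as `/K:<number>`; the function name, the example labels, and the fallback value for labels without the field are all assumptions of this sketch. Check the existing question files for the exact syntax of a continuous question entry.

```python
import re

# Hypothetical pattern: the literal field key followed by the digit
# pattern that accepts numbers with or without a decimal point.
PATTERN = re.compile(r"/K:([\d\.]+)")


def extract_continuous(label, pattern=PATTERN, default=-1.0):
    """Return the captured number as a float, or `default` if absent."""
    match = pattern.search(label)
    if match is None:
        return default
    return float(match.group(1))


# Shortened, invented label strings.
value = extract_continuous("x^sil-h+e=l@1_2/A:0_0_0/K:0.75")
missing = extract_continuous("x^sil-h+e=l@1_2/A:0_0_0")
```

The capture group is what distinguishes this from a categorical question: instead of a yes/no match, the matched digits themselves become the feature value.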

We have not tried this yet. It remains to be decided whether and how numeric features should be normalized.

Adding new features at the word/phrase/syllable etc. level

TODO. This would involve cross-referencing the label files with the Festival utterance structure.

Frontend experiments we have done

BURNC radio news with high, middle, and low values for acoustic and prosodic features labeled at utterance level at the frontend and then synthesis done at each setting. [paper]

Yishak: Amharic audiobible with high, middle, and low acoustic and prosodic features labeled at utterance level at the frontend.

Rose: Different ways of assigning phrase boundaries.

Trustworthy voices: label acoustic and prosodic features associated with trustworthiness at the frontend, and then set them to trustworthy or untrustworthy settings for synthesis.