Trustworthy Voices Project

We are using BURNC data and features which correlate with trusted or untrusted speech. These are the features for which a high value corresponds with trustworthiness or untrustworthiness:

Trusted:

Untrusted:

Basic Voice

We did a first experimental pass at training a voice with these switches built in, so that it can synthesize in "trusted" or "untrusted" styles. There are many things we could experiment with doing differently, so the basic process is documented here. Suggestions for future experiments are noted inline in the steps below.

1. Feature extraction using Praat

You shouldn't really have to redo this unless you want to look at different features. See these two scripts in /proj/tts/data/english/brn/trustworthy/scripts/:

extractAcousticFeatures.praat
extractVoiceQualityFeatures.praat

Output is under trustworthy/ftrs/ (raw feature csv files).

2. Z-score normalization

In the lab's prior studies on deceptive speech, the features examined were normalized by speaker, so we also normalize the features by speaker here. See scripts/zscore.py; output is ftrs/*_znorm.csv.
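For reference, a minimal sketch of per-speaker z-score normalization (the file and column names here are hypothetical; zscore.py is the authoritative version):

    import pandas as pd

    # Hypothetical file/column names -- adjust to match the raw csvs in ftrs/.
    df = pd.read_csv("ftrs/acoustic_features.csv")
    feature_cols = [c for c in df.columns if c not in ("utterance", "speaker")]

    # Z-score each feature within each speaker: (x - speaker mean) / speaker std.
    df[feature_cols] = df.groupby("speaker")[feature_cols].transform(
        lambda x: (x - x.mean()) / x.std())
    df.to_csv("ftrs/acoustic_features_znorm.csv", index=False)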

3. Thirds partitions

In accordance with our prior work on using frontend features to alter the style of synthesized speech, we decided to set thresholds for what counts as a "high," "medium," or "low" value for each feature, such that each partition contains a third of the data. This is very simplistic, and there are probably better ways to partition the data, for example based on standard deviations around the mean, or based on the value at which a feature becomes salient for trustworthiness. These are the steps for getting hi/med/lo partitions of the data based on each feature:
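For illustration, one way such terciles could be computed (hypothetical file and column names; not necessarily the exact procedure used here):

    import pandas as pd

    # Hypothetical names -- the znorm csvs from step 2 are the input here.
    df = pd.read_csv("ftrs/acoustic_features_znorm.csv")

    # qcut splits at the 33rd/67th percentiles so each bin holds ~1/3 of the data.
    for feat in ["mean_f0", "jitter"]:  # example feature names only
        df[feat + "_third"] = pd.qcut(df[feat], q=3, labels=["lo", "med", "hi"])

    df.to_csv("ftrs/acoustic_features_thirds.csv", index=False)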

4. Label files

Label files are needed for voice training, and the method we are using to alter the speaking style relies on augmenting the frontend label files with this additional information about acoustic and prosodic features. The fullcontext label format represents a list of phonemes in context for one utterance; the different context elements are extracted by a frontend tool (we use Festival). Each line of the label file contains one phoneme-in-context, with the start and end times of that phoneme in units of ten-millionths of a second (100 ns), and the context elements separated by unique delimiters to enable pattern matching. See lab_format.pdf from the HTS demo for more information on the format and the standard contextual features. We are adding features at the utterance level; that is, every phoneme in the utterance gets the same value.
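A minimal sketch of appending an utterance-level value to every line of a fullcontext label file (the /T: tag, file names, and value are made up for illustration; the actual labels for this voice may use different delimiters):

    def add_utterance_feature(in_lab, out_lab, value):
        """Append one utterance-level tag (e.g. /T:hi) to every phoneme line."""
        with open(in_lab) as f_in, open(out_lab, "w") as f_out:
            for line in f_in:
                # Each line: start_time end_time fullcontext-phoneme-string
                f_out.write(line.rstrip("\n") + "/T:" + value + "\n")

    # Every phoneme in the utterance gets the same hi/med/lo value.
    add_utterance_feature("labels/utt001.lab", "labels_trust/utt001.lab", "hi")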

5. Questions file

In order for the new features in the label files to actually get used in voice training, they have to be included in the questions file. The questions file is used to parse the label files and feed features to the model: each line of a label file is converted into a binary representation corresponding to the answers to the yes/no questions, where an answer is "yes" (1) if the pattern given in the question matches the label.

The default English questions file is $MERLINDIR/misc/questions/questions-radio_dnn_416.hed. On the left side is the name of the question (e.g. "LL-Vowel" is basically asking, "Is the phoneme two to the left of the current one a vowel?") and on the right side is a list of patterns for which a match means the answer is "yes" (e.g. all vowels in the English phoneset). The fullcontext label file uses symbols to delimit the different features; that is why everything pertaining to, say, the current phoneme (questions starting with "C-") has its possible matches placed between - and +, since that is how you find the current phoneme in the label, according to lab_format.pdf.

We made a custom questions file for this project by copying the default English questions file and adding new questions for our new features. This questions file is here: /proj/tts/tools/ecooper/merlin/misc/questions/questions-trustworthy.hed. The questions that were added are at the top of the file, starting with Trust.
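For illustration, a question for one of the added utterance-level features might look something like this (the question name and the /T: delimiter are hypothetical; see questions-trustworthy.hed for the actual questions and patterns used):

    QS "Trust-MeanF0-hi"    {*/T:hi*}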

The representation in the questions file is a good area for future experimentation. For instance, right now the qfile only checks whether a given feature value is "hi" or "not hi" -- basically a binary switch. Merlin allows numeric features as well as discrete symbolic pattern-match features; see the CQS question type at the end of the default English qfile, and the Merlin documentation's section on continuous numerical features for more info. Since it is common practice to normalize numeric features, it may make sense to use the actual z-score-normalized feature value for each utterance in place of the 'hi,' 'med,' and 'lo' symbolic values. As a somewhat simpler experiment, it may also be worth trying 1, 0, and -1 in place of hi, med, and lo in a numeric-feature setting, and then seeing whether extrapolation can be done (e.g., setting a value of '2' in the test label files), which in theory is possible but we have not tried.
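If the numeric route is tried, the question would use the CQS type instead, along these lines (hypothetical tag and pattern; mirror the CQS examples at the end of the default qfile for the exact syntax Merlin expects):

    CQS "Trust-MeanF0"    {/T:([\d\.\-]+)}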

6. Voice Training

We based voice training on the basic "build your own voice" recipe in Merlin. The voice training directory is here: /proj/tts/tools/ecooper/merlin/egs/trustworthy/s1. The training recipe is thirds_voice.sh, which is run by uncommenting each section one by one. The test synthesis output is in experiments/thirds_voice/test/synthesis/wav/*.wav.

If you train a new voice based on this recipe, the important things to change are the voice name (so the old voice doesn't get overwritten), and the label files and questions file if new ones were created.

7. Synthesis

See this script:

/proj/tts/tools/ecooper/merlin/egs/trustworthy/s1/synthesize.py

It will synthesize your input sentences in both trusted and untrusted styles.