Intelligibility Testing

Semantically Unpredictable Sentences

We use "semantically unpredictable sentences" (SUS) for intelligibility testing: sentences that are syntactically correct but semantically unpredictable, e.g. "The bright fence ate the rain." This ensures that listeners cannot use context to correctly guess a word they did not actually understand.

Speech lab students: most of the time, we will be using a set of SUS that have already been generated, but if you need to generate new ones for English, use /proj/tts/examples/susgen.py.

Everyone else: you can generate SUS for English in the standard form used by NITECH for the Blizzard Challenge evaluation (http://research.nii.ac.jp/src/en/NITECH-EN.html) here: https://github.com/ecooper7/SUSgen
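
For reference, Blizzard-style SUS are built from a small set of fixed syntactic templates whose content-word slots are filled at random. Here is a minimal sketch of that idea; the template and word lists are illustrative stand-ins, not SUSgen's actual data:

    import random

    # Illustrative word lists; SUSgen draws from its own lexicon.
    ADJECTIVES = ["green", "sudden", "hollow", "quiet"]
    NOUNS = ["table", "river", "thought", "engine"]
    VERBS = ["paints", "follows", "carries", "hears"]

    def make_sus():
        # One syntactic frame: "The ADJ NOUN VERBs the NOUN."
        # Each slot is filled independently, so the result is
        # grammatical but semantically unpredictable.
        return "The {} {} {} the {}.".format(
            random.choice(ADJECTIVES),
            random.choice(NOUNS),
            random.choice(VERBS),
            random.choice(NOUNS),
        )

    print(make_sus())  # e.g. "The hollow river carries the thought."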

Mechanical Turk Intelligibility HITs

On cheshire, under /var/www/amt/, create new directories, one per voice, named with the next unused numbers, and put the 12 SUS .wav files for each voice into its directory. I typically also put a README file in each directory saying which voice the audio files came from, just to keep track.
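
A minimal sketch of this setup step; the source directory and voice name are placeholders:

    import os
    import shutil

    AMT_ROOT = "/var/www/amt"
    SOURCE_DIR = "/path/to/sus_wavs_for_this_voice"  # placeholder
    VOICE_NAME = "my_new_voice"                      # placeholder

    # Name the new directory with the next unused number.
    used = [int(d) for d in os.listdir(AMT_ROOT) if d.isdigit()]
    voice_dir = os.path.join(AMT_ROOT, str(max(used, default=0) + 1))
    os.makedirs(voice_dir)

    # Copy the 12 SUS .wav files for this voice.
    for name in sorted(os.listdir(SOURCE_DIR)):
        if name.endswith(".wav"):
            shutil.copy(os.path.join(SOURCE_DIR, name), voice_dir)

    # Keep track of which voice the audio came from.
    with open(os.path.join(voice_dir, "README"), "w") as f:
        f.write("SUS audio from voice: %s\n" % VOICE_NAME)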

Then, open write_csv.py and edit the folders variable to list the folder numbers for the new voices you want to evaluate. Run the script, and upload the resulting .csv file to our "Transcription Data Sel New" task.
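
write_csv.py is not reproduced here, but its output is an ordinary MTurk batch CSV with one audio URL per row. A rough sketch of that shape; the folder numbers, column name, base URL, and filename pattern are all assumptions, so check the actual script:

    import csv

    folders = [41, 42, 43]  # hypothetical: the numbered directories to evaluate
    BASE_URL = "https://example.edu/amt"  # hypothetical hosting URL

    with open("intelligibility.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_url"])  # column name referenced by the HIT template
        for folder in folders:
            for i in range(1, 13):  # the 12 SUS files per voice
                writer.writerow(["%s/%d/%02d.wav" % (BASE_URL, folder, i)])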

Latin Square Intelligibility HIT

Since listener bias is known to influence results, we want each listener's bias to be spread over every voice. We have therefore switched from having each listener transcribe sentences all spoken by one voice to a Latin-square setup, in which each listener hears every sentence exactly once, with the voices rotated so that each voice is heard equally often.
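
A minimal sketch of the underlying assignment, where rows are listeners and columns are sentence slots; the cyclic shift guarantees each voice appears exactly once in every row and column. Kai-Zhan's scripts may construct the square differently, and the voice names here are placeholders:

    def latin_square(n):
        # Row r is 0..n-1 cyclically shifted by r, so every voice index
        # occurs exactly once per row (listener) and per column (sentence).
        return [[(r + c) % n for c in range(n)] for r in range(n)]

    voices = ["voice_a", "voice_b", "voice_c"]  # placeholder voice names
    for listener, row in enumerate(latin_square(len(voices))):
        assignment = ", ".join(
            "sentence %d -> %s" % (slot, voices[v]) for slot, v in enumerate(row)
        )
        print("listener %d: %s" % (listener, assignment))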

To use Kai-Zhan's scripts for generating Latin-square CSV files for MTurk, please see:
cheshire.cs.columbia.edu:/var/www/macrophone/README.md

Automatic Intelligibility Evaluation using ASR APIs
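
The idea is to transcribe the SUS audio with an off-the-shelf ASR service and score the transcripts against the reference sentences. As one possible example, here is a minimal sketch using Google Cloud Speech-to-Text; the choice of API, the file path, and the audio parameters are assumptions, not a fixed part of our pipeline:

    from google.cloud import speech

    client = speech.SpeechClient()

    # Read one SUS recording; the path is a placeholder.
    with open("/var/www/amt/41/01.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    # Assumed audio format: 16 kHz, 16-bit linear PCM.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)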

Computing Word Error Rate

For evaluating Latin-square HIT results from MTurk, please see:
/proj/tts/examples/wer/process_latinsquare.py
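
WER counts the substitutions (S), deletions (D), and insertions (I) needed to turn the hypothesis into the reference, normalized by the number of reference words (N): WER = (S + D + I) / N. A minimal self-contained sketch of the computation; process_latinsquare.py may differ in details such as text normalization:

    def wer(reference, hypothesis):
        # Word error rate via word-level Levenshtein distance.
        # Assumes a non-empty reference.
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution and one deletion against a 6-word reference: 2/6.
    print(wer("the green table hears the river", "the green cable hears river"))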