Useful Festival Info

Creating and Using a Frontend Only

You might want to create a frontend for a language without creating an entire Festival voice, because you plan to use the frontend with a different backend, such as HTS or Merlin. An example script for getting training and test labels (.utt format) using Festival/Festvox can be found here. In short, you can create a frontend from training audio, transcripts, and a pronunciation lexicon that covers your vocabulary.

Digging Into the Utterance Structure

You can load an .utt file into Festival like this:

festival> (define myutt (utt.load nil "cmu_us_arctic_slt_a0001.utt"))

Then the following commands can be used to look into the structure in different ways:

(utt.relationnames myutt)
(utt.relation.items myutt "Token")
(utt.relation.items myutt "Word")
(utt.relation_tree myutt "Token")
(utt.relation_tree myutt "Word")

Any of the names returned by utt.relationnames (e.g. "Phrase", "Syllable", "IntEvent") can be given as the last argument instead of "Token" or "Word".
The "Segment" relation holds both pauses and phonemes, so it is the relation to use when you want phonemes.

You can get different information about the items in each relation through feature functions, which are described in the Festival manual (http://www.festvox.org/docs/manual-1.4.2/festival_32.html#SEC141). These are implemented as C++ functions, but they are also accessible from the Festival Scheme REPL, e.g. for Syllable.syllable_duration:

festival> (define myutt (utt.load nil "f1a0001.utt"))
#<Utterance 0x7f993b6d0130>
festival> (define firstsyl (car (utt.relation.items myutt "Syllable")))
#<item 0x288bdc0>
festival> (item.feat firstsyl 'syllable_duration)
0.054999992

Getting Syllables with their Phonemes

A list of syllables along with the phonemes they contain can be useful for debugging and for seeing what is going on in an utterance. This information seems like it should be easy to read off the utterance structure, but if it is, the documentation does not make that clear. Here is one way to get it from Scheme. It assumes that every syllable has a unique start time, which should generally be the case. According to Festival, pauses (pau) do not belong to any syllable, so each one appears on its own with a "syllable start time" of 0.

  1. Get your utterance into a list of pairs of (this Segment's syllable's start time, this Segment's name).

    festival> (define myutt (utt.load nil "f1a0001.utt"))
    festival> (define allsegs (utt.relation.items myutt "Segment"))
    festival> (define pairsegs (mapcar (lambda (x) (list (item.feat x 'R:SylStructure.parent.syllable_start) (item.name x))) allsegs))
    ((0 "pau")
    (0.2 "ax")
    (0.255 "k")
    (0.255 "ey")
    (0.255 "p")
    .....

  2. Merge all segments with the same start time together.

    festival> (define (combine fthing lst final) (if (eq? '() lst) (append final (list fthing)) (if (eqv? (car fthing) (car (car lst))) (combine (append fthing (cdr (car lst))) (cdr lst) final) (combine (car lst) (cdr lst) (append final (list fthing))))))
    festival> (combine (car pairsegs) (cdr pairsegs) '())
    ((0 "pau")
    (0.2 "ax")
    (0.255 "k" "ey" "p")
    .....

    (fthing is the group currently being built, so the call starts it off with the first pair and passes the rest of the list to be merged. When the list runs out, the group in progress is appended to the result so that the last one is not dropped.) As you can see, this combines the phonemes by syllable and also shows you the syllable start time.
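
The merge in step 2 can also be sketched outside Festival. The following is a minimal Python illustration of the same idea, not part of Festival itself: it assumes the (start time, segment name) pairs have already been copied out of the REPL by some means, and the sample list is hypothetical (the first five pairs mirror the output above, and the trailing pause is an assumption). The helper name group_by_syllable is made up for this example.

```python
from itertools import groupby

# Hypothetical sample in the shape of pairsegs: (syllable start time, segment name).
# Pauses carry a start time of 0; the trailing pause is assumed for illustration.
pairsegs = [
    (0, "pau"),
    (0.2, "ax"),
    (0.255, "k"),
    (0.255, "ey"),
    (0.255, "p"),
    (0, "pau"),
]


def group_by_syllable(pairs):
    """Merge consecutive segments that share the same syllable start time."""
    return [
        (start, [name for _, name in run])
        for start, run in groupby(pairs, key=lambda pair: pair[0])
    ]


syllables = group_by_syllable(pairsegs)
for start, names in syllables:
    print(start, " ".join(names))
```

Like the Scheme version, this only merges neighbouring segments, so two pauses in a row would end up in one group, while pauses separated by speech stay separate.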

Using Scheme in Festival

Festival uses the SIOD implementation of Scheme. Reference and documentation of built-in functions can be found here.