Get Out-of-Vocabulary Pronunciations using Sequitur G2P

Sequitur G2P is a trainable grapheme-to-phoneme converter which can be found here: https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html

Judging from various Q&A threads we've found online, this software is no longer actively supported, and we've had trouble training it on languages written in non-Latin scripts such as Amharic, so we are looking to switch over to CMU Sphinx G2P. However, we already use Sequitur G2P for English and Turkish.

1. Train pronunciation models on your existing pronunciation dictionary

Skip this step if you already have trained models for your language.
Speech Lab students: we already have trained models for English and Turkish. Our version of Sequitur G2P lives here:
/proj/tts/tools/g2p/bin/g2p.py
It only runs on kucing as the other machines do not have the dependencies.

From the Sequitur G2P README file:

Obtain a pronunciation dictionary for training. The format is one word per line. Each line contains the orthographic form of the word followed by the corresponding phonemic transcription. The word and all phonemes need to be separated by white space. The word and phoneme symbols may thus not contain blanks. We'll assume your training lexicon is called train.lex, and that you set aside some portion for testing purposes as test.lex, which is disjoint from train.lex.
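
The lexicon format and the train/test split described above can be checked with a small script. This is only a sketch and not part of Sequitur G2P: the function name split_lexicon, the 10% test fraction, and the choice to split by word (so that every pronunciation of a word lands on the same side) are our own assumptions.

```python
import random


def split_lexicon(lines, test_fraction=0.1, seed=0):
    """Split lexicon lines into disjoint train and test lists.

    Each non-blank line must hold a word followed by at least one
    phoneme, all separated by white space. The split is done per word,
    so a word with several pronunciations never appears in both sets.
    """
    by_word = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2:
            raise ValueError(f"line {number}: no phonemes for {fields[0]!r}")
        # Normalise the separators to single spaces.
        by_word.setdefault(fields[0], []).append(" ".join(fields))

    # Shuffle the unique words reproducibly and set aside the test portion.
    words = sorted(by_word)
    random.Random(seed).shuffle(words)
    test_words = set(words[: int(len(words) * test_fraction)])

    train, test = [], []
    for word in sorted(by_word):
        target = test if word in test_words else train
        target.extend(by_word[word])
    return train, test
```

To use it, read the full dictionary into a list of lines, call split_lexicon on it, and write the two returned lists to train.lex and test.lex, one entry per line.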
Train a model.
To create a first model type:
g2p.py --train train.lex --devel 5% --write-model model-1
This first model will be rather poor because it is only a unigram.
To create higher order models you need to run g2p.py again:
g2p.py --model model-1 --ramp-up --train train.lex --devel 5% --write-model model-2
Repeat this a couple of times:
g2p.py --model model-2 --ramp-up --train train.lex --devel 5% --write-model model-3
g2p.py --model model-3 --ramp-up --train train.lex --devel 5% --write-model model-4
...

Speech Lab students: We have typically been training up to model-3.
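
Since each training step differs only in the model numbers, the command sequence can be generated rather than typed by hand. The sketch below only builds and prints the commands, using the flags quoted from the README above; the helper name training_commands and its default arguments are hypothetical.

```python
def training_commands(lexicon="train.lex", last_model=3, devel="5%", g2p="g2p.py"):
    """Return the argument lists for training model-1 up to model-<last_model>."""
    if last_model < 1:
        raise ValueError("last_model must be at least 1")
    commands = []
    for order in range(1, last_model + 1):
        command = [g2p]
        if order > 1:
            # Every model after the first is ramped up from the previous one.
            command += ["--model", f"model-{order - 1}", "--ramp-up"]
        command += ["--train", lexicon, "--devel", devel,
                    "--write-model", f"model-{order}"]
        commands.append(command)
    return commands


# Print the commands for review before running anything.
for command in training_commands():
    print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) to run the steps in order, stopping at the first failure.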

2. Use trained models to generate pronunciations for unseen words

Speech Lab students: Trained models for English and Turkish live here:
/proj/tts/resources/g2p/cmudict/
/proj/tts/resources/g2p/babel_turkish/

From the Sequitur G2P README:

Prepare a list of words you want to transcribe as a simple text file words.txt with one word per line (and no phonemic transcription), then type:
/proj/tts/tools/g2p/bin/g2p.py --model model-3 --apply words.txt
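
One way to build words.txt is to keep only the words that are missing from your existing pronunciation dictionary. The following is a minimal sketch under our own assumptions: the function name oov_words is hypothetical, the dictionary is in the one-entry-per-line format described in step 1, and no case folding is applied, so the casing of your words must already match the dictionary.

```python
def oov_words(text_words, lexicon_lines):
    """Return the words absent from the lexicon, in first-seen order, without repeats."""
    # The first whitespace-separated field of each lexicon line is the word.
    known = {line.split()[0] for line in lexicon_lines if line.strip()}
    seen = set()
    result = []
    for word in text_words:
        if word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

Writing the returned list to words.txt, one word per line, gives the input expected by the --apply command above.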