Installing Merlin on Habanero

Habanero is a shared research computing cluster at Columbia with GPU nodes; see the Habanero documentation for more information. You can install and run Merlin on Habanero, but it requires some installation and setup, which is described below.

Your Habanero account will have 10 GB of quota for your own use in your home directory, and the Speech Lab has 2 TB of shared space in /rigel/katt.
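If you want to see how close you are to those limits, ordinary Linux commands are enough (nothing Habanero-specific is assumed here):

du -sh ~
df -h /rigel/katt

The first shows your home-directory usage against the 10 GB quota; the second shows how full the shared space is.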

1. Get a Habanero User Account

Please talk to Rose if you want an account and don't already have one.

2. Install Dependencies

Pip - lets you install Python libraries. Download get-pip.py from the pip project site, upload it to Habanero (I just put it in my home directory), and run it as follows:

python get-pip.py --prefix=/usr/local/
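If /usr/local/ isn't writable for you (often the case on a shared cluster), a per-user install is an alternative; afterwards make sure pip ends up on your PATH:

python get-pip.py --user
export PATH=$HOME/.local/bin:$PATH
pip --version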

Python Libraries - numpy, theano, and bandmat, plus (new in the most recent version of Merlin) regex, tensorflow, and keras. Run the following:

pip install numpy --user
pip install theano --user
export CFLAGS=-I/rigel/home/yourusername/.local/lib/python2.7/site-packages/numpy/core/include/
pip install bandmat --user
pip install regex --user
pip install tensorflow --user
pip install keras --user
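As a quick sanity check that everything installed, you can try importing the libraries (warnings from theano or tensorflow on the login node are normal):

python -c "import numpy, theano, bandmat, regex, tensorflow, keras"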

3. Install Merlin and Run Demo

First make yourself a directory under /rigel/katt/users - this is shared space with no quota, unlike your home directory, which has one. For example (the directory name is just a suggestion):
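mkdir -p /rigel/katt/users/yourusername
cd /rigel/katt/users/yourusername

From that directory, run the following to clone Merlin and compile its tools: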

git clone https://github.com/CSTR-Edinburgh/merlin.git
module load gcc/4.8.5
module load anaconda/2-4.2.0
module load cudnn/5.1
module load cuda80/blas
cd merlin/tools
./compile_tools.sh

You should see "All tools successfully installed!"

Next, run the demo on the GPU cluster:

cd ../egs/slt_arctic/s1

Create a new file there called slurm_run_demo.sh (or whatever you want) and put the following in it:

#!/bin/sh
#
# Simple Merlin demo submit script for Slurm.
#
#SBATCH --account=katt # The account name for the job.
#SBATCH --job-name=MerlinDemo # The job name.
#SBATCH -c 1 # The number of cpu cores to use.
#SBATCH --gres=gpu:1 # request 1 GPU
#SBATCH --time=1:00:00 # The maximum time the job may run
#SBATCH --mem-per-cpu=10gb # The memory the job will use per cpu core.
#SBATCH --mail-user=youremail@columbia.edu
#SBATCH --mail-type=END

module load gcc/4.8.5
module load anaconda/2-4.2.0
module load cudnn/5.1
module load cuda80/blas

./run_demo.sh

# End of script

Then, run it as follows:

sbatch slurm_run_demo.sh

You will get an email when the demo finishes (about 7 minutes). The output .wav files are under experiments/slt_arctic_demo/test_synthesis/wav, and the log output is in slurm-######.out.
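If you don't want to wait for the email, you can also check on the job directly (standard Slurm commands, substituting your own username and job number):

squeue -u yourusername
tail -f slurm-######.out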

Note that you can simply remove the --time line if you don't want to specify a time limit; this does not appear to have any negative consequences.

Troubleshooting

GPU Locks: Merlin puts a lock on the GPU card it is using, to prevent crashes from running multiple jobs on the same card at once. More info here: http://homepages.inf.ed.ac.uk/imurray2/code/gpu_monitoring/ This means we can only run as many voices as there are GPU cards; any additional voices you start after that will run on the CPU rather than waiting for a GPU to free up. Training on the CPU is about an order of magnitude slower than on the GPU, so this is not recommended.

You can check which GPU cards are in use, and by whom, by running:

python yourmerlindir/src/gpu_lock.py

On Habanero, you can submit this as a slurm script. By default, this locking mechanism frees the lock automatically if whatever you are running crashes; however, Merlin uses a manual lock, which has to be freed explicitly. This means that if voice training crashes, or you kill it before it completes, the locks might not get released. If this happens, check whether your locks are still there by running the command above, then free them manually, like this:

python yourmerlindir/src/gpu_lock.py --free lockid

Here, lockid is the integer ID of the GPU card that you have locked.
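If you'd rather run the lock check as a batch job on Habanero, as mentioned above, a minimal submit script in the same style as the demo script should work (same assumptions about the account name and your Merlin path):

#!/bin/sh
#
# Check GPU lock status via Slurm.
#
#SBATCH --account=katt # The account name for the job.
#SBATCH --job-name=CheckGPULocks # The job name.
#SBATCH -c 1 # The number of cpu cores to use.
#SBATCH --gres=gpu:1 # request a GPU node so the cards are visible
#SBATCH --time=0:05:00 # A few minutes is plenty.

python yourmerlindir/src/gpu_lock.py

Submit it with sbatch as before; the lock status will show up in the corresponding slurm-######.out file.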

Because of locking, we can only train one voice per GPU card. Here are the numbers of cards we have on each machine:

GPU Drivers: This applies not to Habanero but to our GPU machines in the lab. They run Ubuntu 16.04, and apt installs CUDA version 7.5. This did not work immediately, but did after the machines were rebooted. Hecate has CUDA version 8 on Ubuntu 14.04.
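If you're not sure which CUDA version a machine ended up with, you can check directly (these are standard NVIDIA tools, nothing Merlin-specific):

nvcc --version
nvidia-smi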