A big problem with training neural networks is having enough computational power. In search of more, I gained access to my university's cluster computer.

At my university, if you ask for an account on the locally hosted cluster computer, you can probably get one. Does that solve our problem? Not really. The system administrators don't have any software for training networks installed, so that task falls on us. On top of that, the cluster's core operating system and software are rarely updated, which makes installing newer pre-built binaries difficult. So, let's get into it.


After trying several different methods to install or build Python, I found that the easiest way was to use Miniconda. The cluster's curl command was built without SSL support, so we can't use it; an alternative that worked was the wget command. So, to install Miniconda in my local environment, I ran these commands:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
rm Miniconda3-latest-Linux-x86_64.sh

To add conda to the bash environment variables after installing, I ran the following command:

~/miniconda/bin/conda init

The script adds an initialization block to the .bashrc file. I personally prefer to activate conda explicitly, and to be able to do so from any node a job might run on. So I moved that block out of .bashrc and into a new file called activate. My file ended up being:

# file: activate
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ubrdog/miniconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/ubrdog/miniconda/etc/profile.d/conda.sh" ]; then
        . "/home/ubrdog/miniconda/etc/profile.d/conda.sh"
    else
        export PATH="/home/ubrdog/miniconda/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

The environment can be activated by running source activate.


With Python installed, I went ahead and installed TensorFlow, along with some useful Python packages, in an environment called tensorflow.

conda create --name tensorflow python=3.6 Pillow matplotlib numpy scipy keras tensorflow

After creating the environment, we can have it activate as soon as conda is activated by adding this line at the end of the activate file:

conda activate tensorflow

This setup can now be run on a single node. The cluster I used runs the Sun Grid Engine scheduler, where you have to submit jobs to a queue.

To run it on a single node, which in my case has 12 cores, I need to submit it as a symmetric multiprocessing (SMP) job with the following configuration file:

#$ -N jobname
#$ -M email
#$ -o terminaloutputfile
#$ -e terminalerrorfile
#$ -S /bin/bash
#$ -V
#$ -q smp.q
#$ -cwd
#$ -pe smp 12
source activate
cd flower-conv-net-Inception-v4-horovod
python3 ./start.py

If the file above is called run.job, you can submit it to the Sun Grid Engine by running the command qsub run.job.


To be able to run this on more than one node at a time, we need to use OpenMPI. On my particular cluster, OpenMPI is available through a module, but it is version 1.10.7, which is quite old by now. GCC 6.4.0 is available, so I went ahead and built OpenMPI 4.0.0 from source.

wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar vxf openmpi-4.0.0.tar.gz
rm openmpi-4.0.0.tar.gz
cd openmpi-4.0.0/
./configure --enable-mpi-cxx --prefix=$HOME/mpi
make install
cd ../
rm -r openmpi-4.0.0/

I then needed to add OpenMPI to the environment variables. At the end of the activate file, add:

export PATH=$HOME/mpi/bin${PATH:+:${PATH}}
export LD_RUN_PATH=$HOME/mpi/lib${LD_RUN_PATH:+:${LD_RUN_PATH}}
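
As a quick sanity check after running source activate, it can help to confirm which executables the shell will actually pick up, in particular that mpirun resolves to the copy under $HOME/mpi/bin rather than the older module-provided one. The following is a minimal sketch using only the Python standard library; the helper name locate and the list of commands checked are my own choices for illustration.

```python
import shutil


def locate(commands):
    """Map each command name to the executable found on PATH, or None."""
    return {name: shutil.which(name) for name in commands}


# Print where each tool resolves from in the current environment.
for name, path in locate(["conda", "mpirun", "python3"]).items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If mpirun shows up somewhere other than your home directory, the export lines were probably not sourced in the current shell.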


Horovod is a great Python package that provides an MPI interface for TensorFlow, Keras, and others. To install it, be sure to have sourced the activate file first. Unfortunately or fortunately, depending on your case, versions starting with 0.15.1 require that your cluster's CPUs support AVX. So, in my case, to install it I use this pip command:

pip install horovod==0.15.0

Or, if your cluster has AVX:

pip install horovod
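
If you are not sure whether your nodes have AVX, one way to find out is to look at the CPU flags the kernel reports. The sketch below assumes a Linux x86 node where /proc/cpuinfo contains a "flags" line; the function names are hypothetical helpers, not part of Horovod. Compute nodes may differ from the login node, so it is safest to run this inside a submitted job.

```python
def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text):
    """True if the exact flag 'avx' is listed (not just 'avx2')."""
    return "avx" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


print("AVX available:", has_avx(read_cpuinfo()))
```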

And you are ready.

For examples of how to use Horovod, see the examples page of the Horovod GitHub repository. (link)
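
To give a rough idea of what those examples look like, here is a minimal sketch of the usual Horovod pattern with Keras: initialize Horovod, wrap the optimizer, and broadcast the initial weights from rank 0. The function names train and scaled_learning_rate, the SGD optimizer, and the loss are assumptions for illustration, and the lr argument assumes an older Keras release like the one installed above. Treat it as an outline, not a tested training script.

```python
def scaled_learning_rate(base_lr, n_workers):
    """Scale the single-process learning rate by the number of workers."""
    return base_lr * n_workers


def train(model, x, y, base_lr=0.01, epochs=1):
    """Fit a Keras model with Horovod (hypothetical helper)."""
    # Imported here so this file can be loaded on a machine without Horovod.
    import keras
    import horovod.keras as hvd

    hvd.init()  # must be called once per process before other hvd calls

    # Wrap the optimizer so gradients are averaged across all ranks.
    optimizer = keras.optimizers.SGD(
        lr=scaled_learning_rate(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(loss="categorical_crossentropy", optimizer=optimizer)

    # Start every rank from the weights held by rank 0.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    # Only rank 0 prints progress, to keep the job's output file readable.
    model.fit(x, y, epochs=epochs, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)
    return model
```

A script built around this would be launched through mpirun so that one Python process is started per rank.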

One catch: if you are training on a large dataset, the data is replicated once per MPI process, so it is best to shard it so that each process only keeps its own slice.

def shard(arr, shard_index, n_shards):
    shard_size = arr.shape[0] // n_shards
    shard_start = shard_index * shard_size
    shard_end = (shard_index + 1) * shard_size
    if shard_end > arr.shape[0]:
        shard_end = arr.shape[0]
    return arr[shard_start:shard_end]

This works for both pandas dataframes and numpy arrays. Its use is like so:

dataframe = shard(dataframe, hvd.rank(), hvd.size())

Here hvd comes from import horovod.keras as hvd or something similar.
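
To see which rows each rank ends up with, the index arithmetic of the shard function above can be reproduced without numpy or Horovod. This is only an illustration; shard_bounds is a hypothetical helper that mirrors the same calculation on a plain row count.

```python
def shard_bounds(n_rows, shard_index, n_shards):
    """Return the (start, end) row range that shard() would slice."""
    shard_size = n_rows // n_shards
    start = shard_index * shard_size
    end = min((shard_index + 1) * shard_size, n_rows)
    return start, end


# Example: 10 rows split across 4 ranks.
n_rows, n_shards = 10, 4
for rank in range(n_shards):
    start, end = shard_bounds(n_rows, rank, n_shards)
    print(f"rank {rank}: rows [{start}, {end})")
```

With 10 rows and 4 ranks, each rank gets 2 rows and rows 8 and 9 are not assigned to anyone. Every rank sees the same number of samples, at the cost of dropping up to n_shards - 1 rows at the end.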

While the cluster I am using only has CPU nodes, which makes training neural networks inefficient, being able to start a training run on 4 or more nodes with 12 cores each and forget about it until I get an email saying that my program has finished training the network is great.