A big problem with training neural networks is getting enough computational power, and in search of more of it, I've gained access to my university's cluster computer.
At my university, if you ask for an account on the locally hosted cluster, you can probably get one. Problem solved? Not really. The cluster's system administrators don't have any software for training networks installed, so that task falls on us. On top of that, the cluster's core operating system and software are rarely updated, which makes installing newer pre-built binaries difficult. So, let's get into it.
Python
After trying several different ways to install or build Python, I found that the easiest was Miniconda. The cluster's curl command was built without SSL support, so it couldn't be used; the wget command worked as an alternative. To install Miniconda in the local environment, I ran these commands:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
rm Miniconda3-latest-Linux-x86_64.sh
After installing, I ran the following command to add conda to the bash environment variables:
~/miniconda/bin/conda init
The script adds a block of shell code to the .bashrc file. I prefer to activate conda explicitly, and to be able to activate it from whichever node the job lands on, so I moved that code into a new file called activate. My file ended up being:
# file: activate
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ubrdog/miniconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/ubrdog/miniconda/etc/profile.d/conda.sh" ]; then
        . "/home/ubrdog/miniconda/etc/profile.d/conda.sh"
    else
        export PATH="/home/ubrdog/miniconda/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
The environment can be activated by running source activate.
Tensorflow
Now, with Python installed, I went ahead and installed TensorFlow in an environment called tensorflow, along with some other useful Python packages:
conda create --name tensorflow python=3.6 Pillow matplotlib numpy scipy keras tensorflow
After creating the environment, we can have it activate as soon as conda does by adding this line to the end of the activate file:
conda activate tensorflow
Everything can now be run on a single node. The cluster I used runs the Sun Grid Engine scheduler, where you have to submit jobs to a queue. To run on a single node, which in my case has 12 cores, I need to submit the job as a symmetric multiprocessing (SMP) job with the following configuration file:
#!/bin/bash
#$ -N jobname
#$ -M email
#$ -o terminaloutputfile
#$ -e terminalerrorfile
#$ -S /bin/bash
#$ -V
#$ -q smp.q
#$ -cwd
#$ -pe smp 12
source activate
cd flower-conv-net-Inception-v4-horovod
python3 ./start.py
With the Sun Grid Engine, if the file above is called run.job, you can submit it by running the command qsub run.job.
OpenMPI
To be able to run this on more than one node at a time, we need to use OpenMPI. On my cluster, OpenMPI is available through a module, but it is version 1.10.7, which is quite old by now. GCC 6.4.0 is available, so I went ahead and built OpenMPI 4.0.0 from source.
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar vxf openmpi-4.0.0.tar.gz
rm openmpi-4.0.0.tar.gz
cd openmpi-4.0.0/
./configure --enable-mpi-cxx --prefix=$HOME/mpi
make
make install
cd ../
rm -r openmpi-4.0.0/
In the activate file, I now needed to add OpenMPI to the environment variables, so at the end of the activate file add:
export PATH=$HOME/mpi/bin${PATH:+:${PATH}}
export C_INCLUDE_PATH=$HOME/mpi/include${C_INCLUDE_PATH:+:${C_INCLUDE_PATH}}
export CPLUS_INCLUDE_PATH=$HOME/mpi/include${CPLUS_INCLUDE_PATH:+:${CPLUS_INCLUDE_PATH}}
export LD_LIBRARY_PATH=$HOME/mpi/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_RUN_PATH=$HOME/mpi/lib${LD_RUN_PATH:+:${LD_RUN_PATH}}
Horovod
Horovod is a great Python package that provides an MPI interface to TensorFlow, Keras, and others. Before installing it, be sure the activate file has been sourced. Unfortunately or fortunately, depending on your case, the newer versions, starting with 0.15.1, require CPUs with AVX support, which my cluster's nodes lack. So, in my case, to install I use this pip command:
pip install horovod==0.15.0
or, if your cluster supports AVX:
pip install horovod
And you are ready.
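If you are not sure whether your nodes have AVX, one way to check on Linux is to look for the avx flag in /proc/cpuinfo. Run this on a compute node, since login nodes can have different CPUs than the compute nodes:

```shell
# Check for the AVX instruction-set flag on the current node.
if grep -q avx /proc/cpuinfo; then
    echo "AVX available: a recent horovod should work"
else
    echo "No AVX: stick with horovod==0.15.0"
fi
```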
For examples of how to use Horovod, see the examples page of the Horovod GitHub repository. (link)
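Those examples cover the training-script side; on the submission side, a multi-node Horovod run can be described with an SGE job file along these lines. This is only a sketch for my setup: the parallel environment name mpi and the 48-slot count (4 nodes with 12 cores each) are assumptions, and the parallel environments actually available on your cluster can be listed with qconf -spl.

```shell
#!/bin/bash
#$ -N jobname
#$ -M email
#$ -o terminaloutputfile
#$ -e terminalerrorfile
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -pe mpi 48
source activate
cd flower-conv-net-Inception-v4-horovod
# One MPI process per slot; Open MPI picks up the host list from SGE.
mpirun -np 48 python3 ./start.py
```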
Unfortunately, if you are training on a large dataset, the data is replicated once per MPI process, so it is best to shard it:
def shard(arr, shard_index, n_shards):
    # Give each of the n_shards workers a contiguous slice of arr;
    # the last shard also takes any remainder rows.
    shard_size = arr.shape[0] // n_shards
    shard_start = shard_index * shard_size
    shard_end = (shard_index + 1) * shard_size
    if shard_index == n_shards - 1:
        shard_end = arr.shape[0]
    return arr[shard_start:shard_end]
This works for both pandas DataFrames and NumPy arrays. Its use looks like this:
dataframe = shard(dataframe, hvd.rank(), hvd.size())
Where hvd comes from import horovod.keras as hvd or something similar.
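To sanity-check the sharding logic without a cluster, you can simulate the ranks locally with plain NumPy, no Horovod required. In this sketch the last shard takes any remainder rows, and the loop index stands in for hvd.rank():

```python
import numpy as np

def shard(arr, shard_index, n_shards):
    # Contiguous slice per worker; the last shard takes the remainder.
    shard_size = arr.shape[0] // n_shards
    shard_start = shard_index * shard_size
    shard_end = (shard_index + 1) * shard_size
    if shard_index == n_shards - 1:
        shard_end = arr.shape[0]
    return arr[shard_start:shard_end]

data = np.arange(103)   # size deliberately not divisible by the shard count
n_shards = 4            # stands in for hvd.size()

shards = [shard(data, rank, n_shards) for rank in range(n_shards)]
print([len(s) for s in shards])   # [25, 25, 25, 28]

# Every row lands in exactly one shard, in order.
assert np.array_equal(np.concatenate(shards), data)
```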
While the cluster I am using only has CPU nodes, making neural network training inefficient, being able to start a training run on 4 or more nodes with 12 cores each and forget about it until I get an email saying my program has finished training the network is great.