PARALLELIZING NEURAL NETWORK TRAINING FOR CLUSTER SYSTEMS
ABSTRACT
We present a technique for parallelizing the training of neural
networks. Our technique is designed for parallelization
on a cluster of workstations. To take advantage of
parallelization on clusters, a solution must account for the
higher network latencies and lower bandwidths of clusters
as compared to custom parallel architectures. Parallelization
approaches that may work well on special-purpose parallel
hardware, such as distributing the neurons of the neural
network across processors, are not likely to work well
on cluster systems, where the communication cost of processing
a single training pattern is prohibitive. Our solution,
Pattern Parallel Training, duplicates the full neural network
at each cluster node. Each cooperating process in the cluster
trains the neural network on a subset of the training set
each epoch. We demonstrate the effectiveness of our approach
by implementing and testing an MPI version of Pattern
Parallel Training for the eight-bit parity problem.
Introduction
Artificial neural networks (ANNs) are tools for non-linear
statistical data modeling. They can be used to solve a wide
variety of problems while being robust to error in training
data. ANNs have been successfully applied to a host of pattern
recognition and classification tasks, time series prediction,
data mining, function approximation, data clustering
and filtering, and data compression.
ANNs are trained on a collection of {input, desired
output} pairs called training patterns. The set of training
patterns is typically quite large. Backpropagation [7] is one
of the most widely used training algorithms for ANNs. It
can take a very long time to train an ANN using backpropagation,
even on a moderately sized training set. Our work
addresses the long training times of sequential ANN training
by parallelizing the training in a way that is optimized
for cluster computing.
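To make the notion of a training pattern concrete, consider the
eight-bit parity problem used as our test case: each input is a
vector of eight bits and the desired output is their parity. A
minimal C sketch that enumerates this training set follows; the
array layout and variable names are ours, for illustration only,
not code from the paper.

    #include <stdio.h>

    #define NUM_BITS 8
    #define NUM_PATTERNS (1 << NUM_BITS)  /* 256 {input, desired output} pairs */

    int main(void)
    {
        /* Each training pattern: 8 input bits and 1 desired output (the parity). */
        float inputs[NUM_PATTERNS][NUM_BITS];
        float outputs[NUM_PATTERNS];

        for (int p = 0; p < NUM_PATTERNS; p++) {
            int parity = 0;
            for (int b = 0; b < NUM_BITS; b++) {
                int bit = (p >> b) & 1;
                inputs[p][b] = (float)bit;
                parity ^= bit;  /* parity is the XOR of all input bits */
            }
            outputs[p] = (float)parity;  /* desired output for this input */
        }

        printf("generated %d training patterns\n", NUM_PATTERNS);
        return 0;
    }

Even this small problem yields 256 patterns per epoch; realistic
tasks have far larger training sets, which is what makes sequential
backpropagation slow.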
Related Work
Most of the previous work in parallelizing neural network
training has focused on creating special purpose neural network
hardware [2, 4] and using an approach similar to what
we call Network Parallel Training, in which the neurons of the
ANN are distributed across processors. The obvious problem with
special-purpose hardware is that it is very expensive to acquire
and time-consuming to build.
There is some work on parallelizing ANN training on
cluster systems. Omer et al. [6] use genetic algorithms
to parallelize ANN training on a cluster of workstations.
They use a hybrid approach that combines genetic algorithms
and backpropagation. They create a diverse population
of ANNs that are distributed across the nodes. Each
node, in parallel, performs an independent full sequential
training of its ANN. At the end of a parallel training round,
a single master node collects results and chooses good candidates
for generating a new population of ANNs to independently
train in the next round. In short, they use parallelism to
train multiple distinct ANNs concurrently, each trained
sequentially, and choose the best result. In contrast, we use
parallelization to speed up
training of a single ANN.
Neural Network Issues
In designing our solution, we needed to address several issues
related to parallelizing the ANN. These include calculating
weight updates and error when the training is distributed,
determining the stopping condition, ensuring that
the duplicated ANNs are identical, and determining how
many training patterns should be presented each epoch.
In a training epoch in backpropagation, the gradient
of the error on the training data with respect to the connection
weights is used to compute the new weights. The
weight updates are defined as the difference between the
new weights and the old weights. For each pattern in a
batch of data, the incremental weight updates for that pattern
are added to a running total of weight updates. The
total weight updates are applied only once at the end of the
epoch. This means that when the batch is split across multiple
processes, the weight updates from each process can
be summed to produce a single final set of weight updates.
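Concretely, the correctness of this summation is just a regrouping
of the batch sum. In the notation below (which is ours; the text
above gives no formulas), P processes each accumulate updates over
a local subset S_p of the batch:

\Delta w \;=\; \sum_{i \in \mathrm{batch}} \Delta w_i
        \;=\; \sum_{p=1}^{P} \sum_{i \in S_p} \Delta w_i
        \;=\; \sum_{p=1}^{P} \Delta w^{(p)}

Because the local subsets partition the batch and updates are
applied only once at the end of the epoch, the distributed total
equals the sequential batch total, so every replica applies the
same update and the duplicated ANNs remain identical.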
Implementation
Currently, our system is implemented as an MPI [3] application
that uses the FANN [5] open source neural network
training library. We use the FANN library to perform the
local training of the ANN on each node. Our system handles
the initial replication of the full neural network and
training data across all processes. It also determines the
local set of training patterns for each local epoch by randomly
selecting them from the full set. In addition, it handles
broadcasting the weight updates at the end of each local
epoch, applying the weight updates from other nodes
to the local ANN, and computing the error to determine if
another epoch of learning is necessary. We use MPI_Allgather
to synchronize epochs across cluster nodes. To implement
our system, we needed to modify FANN to export
some of its internal structures so that we could communicate
weight values at the end of each epoch. Our plan is
to eventually implement a Pattern Parallel Training Library
as an extension to FANN. ANN programmers could then
use our library to parallelize the training of their ANN on
clusters.
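A minimal sketch of the per-epoch loop described above is shown
below. It is not our actual implementation: the two helper
functions are placeholders standing in for the FANN-backed local
training and for the internal weight access we exported from FANN,
the weight count is a placeholder, and aggregating the error with
MPI_Allreduce is one possible choice (only MPI_Allgather is named
above).

    #include <mpi.h>
    #include <stdlib.h>

    #define NUM_WEIGHTS 1024   /* placeholder; the real count comes from the ANN */
    #define MAX_EPOCHS  5000
    #define TARGET_MSE  0.001f

    /* Placeholder for the FANN-backed step: train on a randomly selected
     * local subset of the training set, accumulate the resulting weight
     * updates into delta[], and return the local mean squared error. */
    static float local_train_epoch(float *delta, int n)
    {
        for (int i = 0; i < n; i++) delta[i] = 0.0f;
        return 1.0f;
    }

    /* Placeholder for applying the summed updates to the local ANN replica. */
    static void apply_updates(const float *delta, int n)
    {
        (void)delta; (void)n;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Seed differently per process so each would draw a different
         * random subset of training patterns each local epoch. */
        srand((unsigned)rank + 1);

        float local_delta[NUM_WEIGHTS], sum[NUM_WEIGHTS];
        float *all = malloc((size_t)size * NUM_WEIGHTS * sizeof(float));

        for (int epoch = 0; epoch < MAX_EPOCHS; epoch++) {
            float local_mse = local_train_epoch(local_delta, NUM_WEIGHTS);

            /* Exchange every process's weight updates (MPI_Allgather, as in
             * the text), then sum them locally so all replicas stay identical. */
            MPI_Allgather(local_delta, NUM_WEIGHTS, MPI_FLOAT,
                          all, NUM_WEIGHTS, MPI_FLOAT, MPI_COMM_WORLD);
            for (int w = 0; w < NUM_WEIGHTS; w++) {
                sum[w] = 0.0f;
                for (int p = 0; p < size; p++)
                    sum[w] += all[p * NUM_WEIGHTS + w];
            }
            apply_updates(sum, NUM_WEIGHTS);

            /* Decide whether another epoch is needed by averaging the
             * per-process errors (our assumption, not stated above). */
            float global_mse;
            MPI_Allreduce(&local_mse, &global_mse, 1, MPI_FLOAT,
                          MPI_SUM, MPI_COMM_WORLD);
            if (global_mse / size < TARGET_MSE)
                break;
        }

        free(all);
        MPI_Finalize();
        return 0;
    }

Launching such a program with, for example, mpirun -np 8 starts
eight cooperating processes, each holding a full replica of the
network, matching the replication scheme described above.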
Conclusions and Future Work
We have shown that our Pattern Parallel Training (PPT) technique
can be used to significantly speed up training of Artificial
Neural Networks. Our results for the eight-bit parity problem
show up to a factor of 11 speed-up in training time. Because
backpropagation is such a widely used training algorithm,
we expect that Pattern Parallel Training can be used
to improve training time of a large number of ANN learning
problems.
Some areas of future work include trying PPT on a
larger set of ANN problems. We would like to examine
problems with much larger training sets than the eight-bit
parity problem. This will allow us to further analyze
our approach of randomly selecting patterns from the training
set at each epoch, and to perform more thorough tests of
epoch-size selection. We also want to examine
how well PPT works on problems that train well with
incremental serial training; this is a class of problems for
which we are less confident that our approach will always work
well. Our goal is to more completely characterize the types
of ANN problems for which PPT works particularly well.
We would also like to examine using PPT on training algorithms
other than backpropagation.