Abstract:
High performance computing (HPC) applications have increasingly tended to use "off the shelf" commodity clusters of workstations as their execution platform over the last decade, owing to their generic nature and low cost. A typical cluster of nodes is, as is well known, a distributed memory structure, for which the message passing programming paradigm is well established.
This thesis presents a study of the impact of hybrid memory architectures, composed of distributed memory and distributed shared memory within a multi-processor cluster, on the design of parallel algorithms. The thesis proposes a novel configurable virtual cluster arrangement, interconnected by message passing links, as a programming metaphor for parallel algorithm design. The study uses parallel numerical algorithms and parallel back-propagation neural network algorithms as case studies. The parallel implementation of the shallow water equations to model a tsunami is emphasized.
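For context, the two-dimensional shallow water equations in a commonly used non-conservative form are sketched below; the exact formulation, bathymetry, and source terms used in the thesis may differ.

\[
\begin{aligned}
\frac{\partial \eta}{\partial t} + \frac{\partial}{\partial x}\bigl[(H+\eta)u\bigr] + \frac{\partial}{\partial y}\bigl[(H+\eta)v\bigr] &= 0,\\
\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + g\frac{\partial \eta}{\partial x} &= 0,\\
\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + g\frac{\partial \eta}{\partial y} &= 0,
\end{aligned}
\]

where \(\eta\) is the free-surface elevation, \(H\) the undisturbed water depth, \(u\) and \(v\) the depth-averaged velocities, and \(g\) the gravitational acceleration.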
In the tsunami model, a rectangular array of data is partitioned into sub-domains, namely a four by three grid scheme and an eight by six grid scheme, which are then used for the parallel implementation. In the preliminary study of hybrid memory architectures, four versions of the parallel algorithm are used for each grid scheme: distributed memory without threads, distributed memory with threads, distributed shared memory without threads, and distributed shared memory with threads. Experiments are carried out using the Message Passing Interface (MPI) library, C/Linda, and Linux pthreads. Subject to the availability of memory, the distributed shared memory version without threads performs best, but as the task is scaled up, the threaded version becomes efficient in both the distributed memory and distributed shared memory implementations.
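A minimal sketch of the distributed-memory-with-threads pattern is given below, assuming a generic row-partitioned relaxation update; the grid sizes, thread count, and update kernel are illustrative rather than the thesis code.

```c
/* Hypothetical sketch: MPI for inter-node domain decomposition,
 * pthreads for intra-node parallelism over the local sub-domain. */
#include <mpi.h>
#include <pthread.h>

#define NX 512          /* local sub-domain width  (illustrative) */
#define NY 512          /* local sub-domain height (illustrative) */
#define NTHREADS 4      /* threads per MPI process (illustrative) */

static double u[NY + 2][NX + 2];      /* field with one-cell halo */
static double u_new[NY + 2][NX + 2];

typedef struct { int first_row, last_row; } slice_t;

/* Each thread relaxes a band of rows of the local sub-domain. */
static void *relax_band(void *arg)
{
    slice_t *s = (slice_t *)arg;
    for (int j = s->first_row; j <= s->last_row; ++j)
        for (int i = 1; i <= NX; ++i)
            u_new[j][i] = 0.25 * (u[j-1][i] + u[j+1][i] +
                                  u[j][i-1] + u[j][i+1]);
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, size, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; ++step) {
        /* Halo exchange between neighbouring sub-domains (MPI). */
        MPI_Sendrecv(u[1],      NX + 2, MPI_DOUBLE, up,   0,
                     u[NY + 1], NX + 2, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[NY],     NX + 2, MPI_DOUBLE, down, 1,
                     u[0],      NX + 2, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Intra-node update with pthreads over row bands. */
        pthread_t tid[NTHREADS];
        slice_t   sl[NTHREADS];
        int rows = NY / NTHREADS;
        for (int t = 0; t < NTHREADS; ++t) {
            sl[t].first_row = 1 + t * rows;
            sl[t].last_row  = (t == NTHREADS - 1) ? NY : (t + 1) * rows;
            pthread_create(&tid[t], NULL, relax_band, &sl[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);

        /* Copy the interior back for the next step. */
        for (int j = 1; j <= NY; ++j)
            for (int i = 1; i <= NX; ++i)
                u[j][i] = u_new[j][i];
    }

    MPI_Finalize();
    return 0;
}
```

With MPI_THREAD_FUNNELED, all MPI calls are made by the main thread, which matches this structure: communication and computation alternate, and the worker threads touch only the local arrays.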
In the case of parallel algorithms for neural network training, the neural network is partitioned into sub-networks by applying a hybrid partitioning scheme, so that each partitioned network is evaluated as a matrix multiplication. Three different sizes of neural network were used, with exchange rate prediction as the reference problem. Parallel implementations were obtained for each of the distributed memory and distributed shared memory scenarios. The partitioned matrix multiplication had the fastest execution time, and the MPI implementation was always faster than the TCP-Linda equivalent.
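A minimal sketch of evaluating one partitioned layer as a row-block matrix-vector product over MPI ranks is given below; the dimensions, activation function, and data layout are illustrative assumptions, not the thesis's hybrid partitioning scheme.

```c
/* Hypothetical sketch: one neural-network layer evaluated as a
 * row-block matrix-vector product distributed over MPI ranks.
 * Assumes N_OUT is divisible by the number of ranks. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

#define N_IN  256      /* layer input size  (illustrative) */
#define N_OUT 256      /* layer output size (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N_OUT / size;                       /* rows of W on this rank */
    double *W = malloc(rows * N_IN * sizeof *W);   /* local weight block     */
    double *x = malloc(N_IN * sizeof *x);          /* full input vector      */
    double *y_local = malloc(rows * sizeof *y_local);
    double *y = malloc(N_OUT * sizeof *y);         /* gathered activations   */

    /* Dummy initialisation; real weights and inputs come from training data. */
    for (int i = 0; i < rows * N_IN; ++i) W[i] = 0.01;
    for (int i = 0; i < N_IN; ++i)        x[i] = 1.0;

    /* Local block of the matrix-vector product, then sigmoid activation. */
    for (int r = 0; r < rows; ++r) {
        double s = 0.0;
        for (int c = 0; c < N_IN; ++c)
            s += W[r * N_IN + c] * x[c];
        y_local[r] = 1.0 / (1.0 + exp(-s));
    }

    /* Assemble the full activation vector on every rank for the next layer. */
    MPI_Allgather(y_local, rows, MPI_DOUBLE,
                  y, rows, MPI_DOUBLE, MPI_COMM_WORLD);

    free(W); free(x); free(y_local); free(y);
    MPI_Finalize();
    return 0;
}
```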
The configurable hybrid cluster scheme was implemented as two UPC/MPI sub-cluster configurations: in the first, configuration-1, twelve computing nodes are partitioned into two equal-sized DSM sub-clusters, and in the second, configuration-2, twelve computing nodes are partitioned into four DSM sub-clusters. UPC is used for intra-sub-cluster communication, while MPI is used for inter-sub-cluster communication. The shallow water model was implemented on each of these configurations.
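A minimal UPC sketch of the intra-sub-cluster computation within one DSM sub-cluster is given below; the one-dimensional data layout is illustrative, and the inter-sub-cluster MPI exchange is indicated only as comments, since the UPC/MPI coupling is build-specific.

```c
/* Hypothetical UPC sketch of the intra-sub-cluster update in one
 * DSM sub-cluster; inter-sub-cluster boundary exchange via MPI is
 * indicated by comments only. */
#include <upc.h>

#define BLOCK 256
/* Field distributed block-wise over the UPC threads of this sub-cluster
   (illustrative, not the thesis data layout). */
shared [BLOCK] double eta[BLOCK * THREADS];
shared [BLOCK] double eta_new[BLOCK * THREADS];

int main(void)
{
    int i, step;
    for (step = 0; step < 100; ++step) {
        /* ... here the sub-cluster boundary values would be sent to and
           received from neighbouring sub-clusters using MPI ... */

        /* Each UPC thread updates the elements it has affinity to. */
        upc_forall (i = 1; i < BLOCK * THREADS - 1; ++i; &eta[i])
            eta_new[i] = 0.5 * (eta[i - 1] + eta[i + 1]);

        upc_barrier;

        upc_forall (i = 1; i < BLOCK * THREADS - 1; ++i; &eta[i])
            eta[i] = eta_new[i];

        upc_barrier;
    }
    return 0;
}
```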
In terms of running time, the parallel algorithm for the uniform DSM implementation exhibits moderately better performance than either of the two parallel algorithms for the UPC/MPI hybrid cluster configurations, while these two algorithms individually exhibit better overall performance than the parallel algorithm for the uniform DM architecture. Configuration-1 exhibits moderately better performance than configuration-2. As the number of computing nodes per DSM sub-cluster decreases, the overall performance approaches that of a DM implementation.
A cluster with a large number of computing nodes, when configured as a UPC/MPI hybrid sub-cluster arrangement, yields faster algorithmic running times than the pure DM programming model. Moreover, the set of DSM sub-clusters can be configured into arbitrary topologies, such as a two-dimensional toroidal mesh, enabling a more flexible programming approach and the possibility of reusing well-known algorithm designs for high performance applications.
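As one illustration of such a topology, the sketch below (not taken from the thesis) arranges twelve processes, for example one representative per DSM sub-cluster, as a four by three two-dimensional torus using MPI's Cartesian topology facilities.

```c
/* Hypothetical sketch: twelve processes arranged as a 4 x 3 torus.
 * Run with at least 12 MPI processes, e.g. mpirun -np 12 ./torus */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int dims[2] = {4, 3}, periods[2] = {1, 1};   /* wrap-around => torus */
    MPI_Comm torus;

    MPI_Init(&argc, &argv);

    /* Create the periodic 2-D Cartesian communicator (reorder allowed). */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &torus);

    if (torus != MPI_COMM_NULL) {
        int rank, coords[2], north, south, west, east;
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 2, coords);
        MPI_Cart_shift(torus, 0, 1, &north, &south);  /* neighbours, dim 0 */
        MPI_Cart_shift(torus, 1, 1, &west, &east);    /* neighbours, dim 1 */
        printf("rank %d at (%d,%d): N=%d S=%d W=%d E=%d\n",
               rank, coords[0], coords[1], north, south, west, east);
        MPI_Comm_free(&torus);
    }
    MPI_Finalize();
    return 0;
}
```

Because the grid is periodic in both dimensions, every process (or sub-cluster representative) has exactly four neighbours, which is the property that allows well-known mesh and torus algorithm designs to be reused directly.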