Abstract:
High performance computing (HPC) applications have increasingly tended to use "off the shelf" commodity clusters of workstations as their execution platform over the last decade, owing to their generic nature and low cost. A typical cluster of nodes is, as is well known, a distributed memory structure, for which the message passing programming paradigm is well established.
This thesis presents a study of the impact of hybrid memory architectures, composed of distributed memory and distributed shared memory within a multi-processor cluster, on the design of parallel algorithms. The thesis proposes a novel configurable virtual cluster arrangement, interconnected by message passing links, as a programming metaphor for parallel algorithm design. The study uses parallel numerical algorithms and parallel back-propagation neural network algorithms as case studies. The parallel implementation of the shallow water equations to model a tsunami is emphasized.
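For context, the two-dimensional shallow water equations in a commonly used non-conservative form are sketched below; the exact formulation, bathymetry, and source terms used in the thesis may differ.

\[
\begin{aligned}
\frac{\partial \eta}{\partial t} + \frac{\partial}{\partial x}\bigl[(H+\eta)u\bigr] + \frac{\partial}{\partial y}\bigl[(H+\eta)v\bigr] &= 0,\\
\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + g\frac{\partial \eta}{\partial x} &= 0,\\
\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + g\frac{\partial \eta}{\partial y} &= 0,
\end{aligned}
\]

where \(\eta\) is the free-surface elevation, \(H\) the undisturbed water depth, \(u\) and \(v\) the depth-averaged velocities, and \(g\) the gravitational acceleration.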
In the tsunami model, a rectangular array of data is partitioned into sub-domains, namely a four by three grid scheme and an eight by six grid scheme, which are then used for the parallel implementation. In the preliminary study of hybrid memory architectures, four versions of the parallel algorithm are used for each grid scheme: distributed memory without threads, distributed memory with threads, distributed shared memory without threads, and distributed shared memory with threads. Experiments are carried out using the Message Passing Interface (MPI) library, C/Linda, and Linux pthreads. Subject to the availability of memory, the distributed shared memory version without threads performs best, but as the task is scaled up, the threaded version becomes efficient in both the distributed memory and distributed shared memory implementations.
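A minimal sketch of the distributed-memory-with-threads pattern is given below, assuming a generic row-partitioned relaxation update; the grid sizes, thread count, and update kernel are illustrative rather than the thesis code.

```c
/* Hypothetical sketch: MPI for inter-node domain decomposition,
 * pthreads for intra-node parallelism over the local sub-domain. */
#include <mpi.h>
#include <pthread.h>

#define NX 512          /* local sub-domain width  (illustrative) */
#define NY 512          /* local sub-domain height (illustrative) */
#define NTHREADS 4      /* threads per MPI process (illustrative) */

static double u[NY + 2][NX + 2];      /* field with one-cell halo */
static double u_new[NY + 2][NX + 2];

typedef struct { int first_row, last_row; } slice_t;

/* Each thread relaxes a band of rows of the local sub-domain. */
static void *relax_band(void *arg)
{
    slice_t *s = (slice_t *)arg;
    for (int j = s->first_row; j <= s->last_row; ++j)
        for (int i = 1; i <= NX; ++i)
            u_new[j][i] = 0.25 * (u[j-1][i] + u[j+1][i] +
                                  u[j][i-1] + u[j][i+1]);
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, size, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; ++step) {
        /* Halo exchange between neighbouring sub-domains (MPI). */
        MPI_Sendrecv(u[1],      NX + 2, MPI_DOUBLE, up,   0,
                     u[NY + 1], NX + 2, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[NY],     NX + 2, MPI_DOUBLE, down, 1,
                     u[0],      NX + 2, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Intra-node update with pthreads over row bands. */
        pthread_t tid[NTHREADS];
        slice_t   sl[NTHREADS];
        int rows = NY / NTHREADS;
        for (int t = 0; t < NTHREADS; ++t) {
            sl[t].first_row = 1 + t * rows;
            sl[t].last_row  = (t == NTHREADS - 1) ? NY : (t + 1) * rows;
            pthread_create(&tid[t], NULL, relax_band, &sl[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);

        /* Copy the interior back for the next step. */
        for (int j = 1; j <= NY; ++j)
            for (int i = 1; i <= NX; ++i)
                u[j][i] = u_new[j][i];
    }

    MPI_Finalize();
    return 0;
}
```

With MPI_THREAD_FUNNELED, all MPI calls are made by the main thread, which matches this structure: communication and computation alternate, and the worker threads touch only the local arrays.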
In the case of parallel algorithms for neural network training, the neural network is partitioned into sub-networks by applying a hybrid partitioning scheme, so that each partitioned network is evaluated as a matrix multiplication. Three different sizes of neural network were used, with exchange rate prediction as the reference problem. Parallel implementations were obtained for each of the distributed memory and distributed shared memory scenarios. The partitioned matrix multiplication had the fastest execution time, and the MPI implementation was always faster than the TCP-Linda equivalent.
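A minimal sketch of evaluating one partitioned layer as a row-block matrix-vector product over MPI ranks is given below; the dimensions, activation function, and data layout are illustrative assumptions, not the thesis's hybrid partitioning scheme.

```c
/* Hypothetical sketch: one neural-network layer evaluated as a
 * row-block matrix-vector product distributed over MPI ranks.
 * Assumes N_OUT is divisible by the number of ranks. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

#define N_IN  256      /* layer input size  (illustrative) */
#define N_OUT 256      /* layer output size (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N_OUT / size;                       /* rows of W on this rank */
    double *W = malloc(rows * N_IN * sizeof *W);   /* local weight block     */
    double *x = malloc(N_IN * sizeof *x);          /* full input vector      */
    double *y_local = malloc(rows * sizeof *y_local);
    double *y = malloc(N_OUT * sizeof *y);         /* gathered activations   */

    /* Dummy initialisation; real weights and inputs come from training data. */
    for (int i = 0; i < rows * N_IN; ++i) W[i] = 0.01;
    for (int i = 0; i < N_IN; ++i)        x[i] = 1.0;

    /* Local block of the matrix-vector product, then sigmoid activation. */
    for (int r = 0; r < rows; ++r) {
        double s = 0.0;
        for (int c = 0; c < N_IN; ++c)
            s += W[r * N_IN + c] * x[c];
        y_local[r] = 1.0 / (1.0 + exp(-s));
    }

    /* Assemble the full activation vector on every rank for the next layer. */
    MPI_Allgather(y_local, rows, MPI_DOUBLE,
                  y, rows, MPI_DOUBLE, MPI_COMM_WORLD);

    free(W); free(x); free(y_local); free(y);
    MPI_Finalize();
    return 0;
}
```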
The configurable hybrid cluster scheme was implemented as two UPC/MPI sub-cluster configurations: in the first, configuration-1, twelve computing nodes are partitioned into two equal-sized DSM sub-clusters, and in the second, configuration-2, twelve computing nodes are partitioned into four DSM sub-clusters. UPC is used for intra-sub-cluster communication, while MPI is used for inter-sub-cluster communication. The shallow water model was implemented on each of these configurations.
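A minimal UPC sketch of the intra-sub-cluster computation within one DSM sub-cluster is given below; the one-dimensional data layout is illustrative, and the inter-sub-cluster MPI exchange is indicated only as comments, since the UPC/MPI coupling is build-specific.

```c
/* Hypothetical UPC sketch of the intra-sub-cluster update in one
 * DSM sub-cluster; inter-sub-cluster boundary exchange via MPI is
 * indicated by comments only. */
#include <upc.h>

#define BLOCK 256
/* Field distributed block-wise over the UPC threads of this sub-cluster
   (illustrative, not the thesis data layout). */
shared [BLOCK] double eta[BLOCK * THREADS];
shared [BLOCK] double eta_new[BLOCK * THREADS];

int main(void)
{
    int i, step;
    for (step = 0; step < 100; ++step) {
        /* ... here the sub-cluster boundary values would be sent to and
           received from neighbouring sub-clusters using MPI ... */

        /* Each UPC thread updates the elements it has affinity to. */
        upc_forall (i = 1; i < BLOCK * THREADS - 1; ++i; &eta[i])
            eta_new[i] = 0.5 * (eta[i - 1] + eta[i + 1]);

        upc_barrier;

        upc_forall (i = 1; i < BLOCK * THREADS - 1; ++i; &eta[i])
            eta[i] = eta_new[i];

        upc_barrier;
    }
    return 0;
}
```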
In terms of running time, the parallel algorithm for the uniform DSM implementation exhibits moderately better performance than either of the two parallel algorithms for the UPC/MPI hybrid cluster configurations, while these two algorithms individually exhibit better overall performance than the parallel algorithm for the uniform DM architecture. Configuration-1 exhibits moderately better performance than configuration-2. As the number of computing nodes per DSM sub-cluster decreases, the overall performance approaches that of a DM implementation.
A cluster with a large number of computing nodes, when configured as a UPC/MPI hybrid sub-cluster arrangement, yields faster algorithmic running times than the pure DM programming model. Moreover, the set of DSM sub-clusters can be configured into arbitrary topologies, such as a two-dimensional toroidal mesh, enabling a more flexible programming approach and the possibility of reusing well-known algorithm designs for high performance applications.
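As one illustration of such a topology, the sketch below (not taken from the thesis) arranges twelve processes, for example one representative per DSM sub-cluster, as a four by three two-dimensional torus using MPI's Cartesian topology facilities.

```c
/* Hypothetical sketch: twelve processes arranged as a 4 x 3 torus.
 * Run with at least 12 MPI processes, e.g. mpirun -np 12 ./torus */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int dims[2] = {4, 3}, periods[2] = {1, 1};   /* wrap-around => torus */
    MPI_Comm torus;

    MPI_Init(&argc, &argv);

    /* Create the periodic 2-D Cartesian communicator (reorder allowed). */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &torus);

    if (torus != MPI_COMM_NULL) {
        int rank, coords[2], north, south, west, east;
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 2, coords);
        MPI_Cart_shift(torus, 0, 1, &north, &south);  /* neighbours, dim 0 */
        MPI_Cart_shift(torus, 1, 1, &west, &east);    /* neighbours, dim 1 */
        printf("rank %d at (%d,%d): N=%d S=%d W=%d E=%d\n",
               rank, coords[0], coords[1], north, south, west, east);
        MPI_Comm_free(&torus);
    }
    MPI_Finalize();
    return 0;
}
```

Because the grid is periodic in both dimensions, every process (or sub-cluster representative) has exactly four neighbours, which is the property that allows well-known mesh and torus algorithm designs to be reused directly.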