|
Abstract : |
A number of parallel formulations of dense matrix multiplication algorithm have been developed. For arbitrarily large number of processors, any of these algorithms or their variants can provide near linear speedup for sufficiently large matrix sizes and none of the algorithms can be clearly claimed to be superior than the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others. We present a parallel formulation for hypercube and related architectures that performs better than any of the schemes described in the literature so far for a wide range of matrix sizes and number of processors. The superior performance and the analytical scalability expressions for this algorithm are verified through experiments on the Thinking Machines Corporation's CM-5 TM y parallel computer for up to 512 processors. We show that special hardware permitting simultaneous communication on all the ports of the processors does not improve the overall scalability of the matrix multiplication algorithms on a hypercube. We also discuss the dependence of scalability on technology dependent factors such as communication and computation speeds and show that under certain conditions, it may be better to have a parallel computer with k-fold as many processors rather than one with the same number of processors, each k-fold as fast., |