Abstract:
We consider a data-parallel implementation of LU factorization based on the LAPACK routine DGETRF. We analyze the performance of the required BLAS routines and show that high performance is inhibited by current compiler limitations. In particular, we show that optimal data movement when performing rank-1 updates is crucial. The rank-1 update is available as a BLAS-2 routine and can also easily be expressed using the intrinsic SPREAD in Fortran 90. However, in order to minimize processor communication, this operation should be explicitly inlined in the computational kernels. Using this observation, we identify the need for an explicit LU factorization applied to a single block. With the freedom to adjust the block size to the hardware, this is a much simpler task than writing the full code in a low-level machine language. With this extension, we show that high performance is achievable without modifying the block structure of the LAPACK routine. We expect similar observations to hold for other modules in LAPACK.
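As an illustration of what "inlining the rank-1 update" into a single-block LU factorization can look like, the following is a minimal serial sketch in plain Python. It is not the data-parallel kernel described in the paper: the function name `lu_single_block` and the list-of-rows storage are assumptions made for this example only. Partial pivoting is included because DGETRF uses it. The point of the sketch is that the update a(i,j) = a(i,j) - l(i,k) * u(k,j) is applied element by element inside the factorization loop, so no temporary outer-product matrix (as a SPREAD-based formulation would build) is formed.

```python
def lu_single_block(a):
    """Factor the square block `a` (a list of row lists) in place as P*A = L*U.

    Hypothetical helper for illustration. On return, the unit lower
    triangular factor L is stored below the diagonal of `a`, and U is stored
    on and above it. The returned list holds the pivot row chosen at each step.
    """
    n = len(a)
    pivots = []
    for k in range(n):
        # Partial pivoting: largest entry in column k at or below row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("block is singular")
        a[k], a[p] = a[p], a[k]
        pivots.append(p)

        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            row_i[k] /= row_k[k]      # multiplier l(i,k), stored in place
            m = row_i[k]
            # Inlined rank-1 update of the trailing submatrix:
            # a(i,j) -= l(i,k) * u(k,j), with no temporary matrix.
            for j in range(k + 1, n):
                row_i[j] -= m * row_k[j]
    return pivots
```

Calling `lu_single_block` on a small block such as `[[4.0, 3.0], [6.0, 3.0]]` overwrites it with the packed L and U factors and returns the pivot rows. In a blocked routine, a kernel of this kind would be applied only to the current diagonal block or panel, with the remaining work left to the BLAS-3 calls.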