HIGH PERFORMANCE FORTRAN VERSUS EXPLICIT MESSAGE PASSING

ON THE IBM SP-2 FOR THE

PARALLEL LU, QR, AND CHOLESKY FACTORIZATIONS 1

 

by

 

Glenn R. Luecke and James J. Coyle

grl@iastate.edu and jjc@iastate.edu

Iowa State University

Ames, Iowa 50011-2251, USA

January, 1997

 

Abstract: The goal of this work was to provide high performance parallel implementations of the LU, QR, and Cholesky factorizations for the IBM SP-2 for Visual Numerics, Inc. For ease of development and maintenance, it was hoped that this could be done by implementing these routines in High Performance Fortran, rather than using explicit message passing. At the present time, this does not appear to be possible. However, we have found that the version 1.4 SCALAPACK LU, Cholesky and QR factorization routines do provide high performance for parallel execution on the IBM SP-2 when using the MPI version of the BLACS and the BLAS from IBM's ESSL library. Moreover, we expect that these SCALAPACK routines will perform well on any parallel computer that provides fast communication via MPI and has high performance (selected level 3) BLAS routines available.

 

 

INTRODUCTION

 

The goal of this work was to provide high performance parallel implementations of the LU, QR, and Cholesky factorizations [8] and their associated solvers, in double precision and double precision complex, for the IBM SP-2 for Visual Numerics, Inc. (VNI). For ease of development and maintenance, it was hoped that this could be done by implementing these routines in High Performance Fortran (HPF) [10] rather than using explicit message passing. For usability with existing VNI libraries, these implementations were required to work with advanced Fortran 90 features such as derived data types, modules, user-defined operators, and user-defined generic procedures.

 

 

HPF IMPLEMENTATIONS

 

In the spring of 1996, HPF compilers from Applied Parallel Research and the Portland Group were available for the SP-2, but they did not support the needed Fortran 90 features. In August 1996, IBM's HPF compiler became available at the Maui High Performance Computing Center, and it did support the Fortran 90 features required by VNI. The question then became whether or not the IBM HPF compiler would provide the needed parallel performance.

 

When using the IBM xlf Fortran compiler, programming techniques such as blocking for cache and loop unrolling will often enhance performance, see [12]. However, when using IBM's HPF compiler, it was found that these programming techniques actually degraded performance. It was also discovered that, unlike Cray Research's CRAFT Fortran [4], HPF cannot have each processor call vendor-optimized serial BLAS routines [5] for the concurrent processing of its local data. This is a major problem because one then cannot use these routines to enhance performance on each node.
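As an illustration of the kind of single-node tuning referred to above (a sketch only, not code from this study, and the routine name is ours), the following four-way unrolled matrix-vector update is typical of the hand transformations that help under xlf but were found to hurt performance under IBM's HPF compiler:

      ! Illustrative only: 4-way outer loop unrolling of y = y + A*x.
      ! Processing four columns per pass improves register and cache
      ! reuse when compiled with xlf; under IBM's HPF compiler this
      ! kind of transformation was found to degrade performance.
      subroutine dgemv_unrolled4( n, a, lda, x, y )
      implicit none
      integer, intent(in)             :: n, lda
      double precision, intent(in)    :: a(lda,n), x(n)
      double precision, intent(inout) :: y(n)
      integer :: i, j
      do j = 1, n - mod(n,4), 4            ! four columns per pass
         do i = 1, n
            y(i) = y(i) + a(i,j  )*x(j  ) + a(i,j+1)*x(j+1)
            y(i) = y(i) + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3)
         end do
      end do
      do j = n - mod(n,4) + 1, n           ! clean-up columns
         do i = 1, n
            y(i) = y(i) + a(i,j)*x(j)
         end do
      end do
      end subroutine dgemv_unrolled4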

 

 

1 This work was supported by a contract from Visual Numerics, Inc.


 

Various HPF implementations of the LU factorization were written. A listing of the HPF program for the LU factorization that provided the best overall performance for the problems considered is appended at the end of this paper. This code uses the (*,cyclic) data distribution [10], since all other distributions gave poorer performance. The following table gives the performance in mflops of the best overall HPF LU factorization program.

 

 

    # of proc    n=512    n=1024    n=2048
        1           28        29        29
        2           39        50        50
        4           57        77        88
        8           58        91        98
       16           44        70        89

    Table 1: HPF double precision LU with all numbers in mflops.

 

 

IBM's LU factorization from the ESSL library yields 218, 218 and 227 mflops for n = 512, 1024 and 2048, respectively. Thus, in all cases, the HPF LU never even comes close to the performance of the serial ESSL LU, no matter how many processors are used!

 

It was also discovered that the HPF LU factorization routine required excessive amounts of memory. Storing the matrix A to be factored is the dominant storage requirement for the LU factorization. For double precision arithmetic and a problem size of 4096, 128 MB are required to store A (4096 x 4096 elements at 8 bytes each), which translates to 16 MB per processor when 8 processors are used. When using 8 wide nodes [12] on the Maui SP-2, each with 256 MB of memory, this program failed to complete and the following message was issued: "This program is unable to allocate sufficient memory to perform the required operation. The program will stop." However, this same program compiled with the xlf Fortran compiler executed with no problems when run on one of these wide nodes. In fact, the HPF program could not be run with 1, 2, 4 or 8 processors. These experiments were carried out using SP-2 nodes dedicated to the execution of our program so that memory requirements could be assessed accurately.

 

Because of the poor performance and excessive memory requirements, it was decided that it was not feasible at this time to use IBM's HPF compiler for developing HPF implementations of the LU, QR and Cholesky factorizations. The only alternative was to develop parallel versions of these routines with explicit message passing.

 

 

IMPLEMENTATIONS WITH EXPLICIT MESSAGE PASSING

 

On the IBM SP-2, one can use IBM's native message passing library, PVM [7], or MPI [9] to develop codes with explicit message passing. Of these three, only MPI is a (de facto) standard. We have run performance tests comparing PVM and MPI on the SP-2 and have found MPI to provide significantly better performance (roughly a factor of 10) than PVM. SCALAPACK [3] from Oak Ridge National Laboratories can now be run using MPI on the IBM SP-2; see the next paragraph for more information. For these reasons, it was decided to use the latest version of SCALAPACK as the basis for implementing high performance parallel versions of the LU, QR and Cholesky factorizations, see [1]. On November 17, 1996, version 1.4 of SCALAPACK was released, and this version was used for this project.
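A simple MPI ping-pong timing kernel of the kind commonly used for such comparisons is sketched below (illustrative only, not the benchmark code actually run; the message length and repetition count are arbitrary choices). The PVM analogue is written with pvmfsend and pvmfrecv.

      ! Illustrative MPI ping-pong round-trip timing between ranks 0 and 1.
      program pingpong
      implicit none
      include 'mpif.h'
      integer, parameter :: nwords = 4096, nreps = 100
      double precision   :: buf(nwords), t0, t1
      integer            :: ierr, rank, i, status(MPI_STATUS_SIZE)
      call mpi_init( ierr )
      call mpi_comm_rank( MPI_COMM_WORLD, rank, ierr )
      buf = 1.d0
      t0 = mpi_wtime()
      do i = 1, nreps
         if ( rank .eq. 0 ) then
            call mpi_send( buf, nwords, MPI_DOUBLE_PRECISION, 1, 0, &
                           MPI_COMM_WORLD, ierr )
            call mpi_recv( buf, nwords, MPI_DOUBLE_PRECISION, 1, 0, &
                           MPI_COMM_WORLD, status, ierr )
         else if ( rank .eq. 1 ) then
            call mpi_recv( buf, nwords, MPI_DOUBLE_PRECISION, 0, 0, &
                           MPI_COMM_WORLD, status, ierr )
            call mpi_send( buf, nwords, MPI_DOUBLE_PRECISION, 0, 0, &
                           MPI_COMM_WORLD, ierr )
         end if
      end do
      t1 = mpi_wtime()
      if ( rank .eq. 0 ) write(*,*) 'average round trip (s): ', (t1-t0)/nreps
      call mpi_finalize( ierr )
      end program pingpong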

 

SCALAPACK (see http://www.netlib.org/scalapack/index.html/) is a collection of linear algebra routines written for the distributed memory environment. It is highly modular, encapsulating all communication, data distribution and synchronization within the BLACS (Basic Linear Algebra Communication Subroutines) [6]. It also contains routines for manipulating distributed arrays, called the PBLAS (Parallel Basic Linear Algebra Subroutines) [2], that use the BLACS for message passing. SCALAPACK routines proceed by working on local blocks of data as much as possible with efficient serial routines and by handling non-local data through calls to the BLACS and/or the PBLAS. The LU, QR and Cholesky factorization routines were implemented using the

 

(cyclic(block_size),cyclic(block_size))

 

data distribution [1]. The BLACS were originally implemented on top of PVM; during the fall of 1996, an MPI version of the BLACS became available, making it possible to execute all SCALAPACK routines on any computer with MPI. However, to get good performance from SCALAPACK routines, the computer must have efficient implementations of both MPI and the (selected level 3) BLAS [5]. On the SP-2, MPI has been efficiently implemented, and efficient BLAS routines are available from IBM's ESSL library.
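To make the structure of a SCALAPACK call concrete, the following is a minimal sketch (not the driver code used for this project) of invoking the parallel LU routine PDGETRF: a BLACS process grid is created, an array descriptor for the (cyclic(nb),cyclic(nb)) distribution is built with DESCINIT, and the factorization is called. The grid shape, problem size and block size below are illustrative only.

      ! Illustrative SCALAPACK LU factorization on a 2 x 2 process grid.
      program scalapack_lu_sketch
      implicit none
      integer, parameter :: n = 1024, nb = 64
      integer            :: ictxt, nprow, npcol, myrow, mycol
      integer            :: np, nq, lld, info, desca(9)
      integer, external  :: numroc
      double precision, allocatable :: a(:)
      integer, allocatable          :: ipiv(:)

      nprow = 2                                  ! 2 x 2 grid (illustrative)
      npcol = 2
      call blacs_get( -1, 0, ictxt )             ! default system context
      call blacs_gridinit( ictxt, 'Row-major', nprow, npcol )
      call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )

      np  = numroc( n, nb, myrow, 0, nprow )     ! local rows on this process
      nq  = numroc( n, nb, mycol, 0, npcol )     ! local columns
      lld = max( 1, np )
      allocate( a(lld*nq), ipiv(np+nb) )

      call descinit( desca, n, n, nb, nb, 0, 0, ictxt, lld, info )
      call random_number( a )                    ! fill the local blocks

      call pdgetrf( n, n, a, 1, 1, desca, ipiv, info )   ! parallel LU

      call blacs_gridexit( ictxt )
      call blacs_exit( 0 )
      end program scalapack_lu_sketch

On the SP-2 such a program is linked against the SCALAPACK library, the MPI version of the BLACS, and the BLAS from ESSL, and is started with one MPI process per processor.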

 

Documentation for SCALAPACK (see http://www.netlib.org/scalapack/html/scalapack_doc.html) is available from Oak Ridge National Laboratories. A partial list of the items included in this documentation is:

 

 

 

 

 

Oak Ridge National Laboratories is currently planning to incorporate the following in future versions of SCALAPACK, see http://www.netlib.org/scalapack/html/scalapack_future.html:

 

 

 

Both of these features should enhance the usefulness and efficiency of the SCALAPACK routines.

 

 

TESTING AND PERFORMANCE RESULTS

 

All performance and functionality tests were run on the IBM SP-2 at the Maui High Performance Computing Center under AIX 4.1, using version 1.4 of SCALAPACK compiled with version 3.2.5 of xlf Fortran with the options "-O3 -qarch=pwr2", the MPI version of the BLACS, and the BLAS from IBM's ESSL (ESSLP2) library. We installed version 1.4 of SCALAPACK on the Maui SP-2 by taking SCALAPACK from the Oak Ridge National Laboratories web server. As part of the installation, the file SLmake.inc was modified to reference the following, all of which were already available on the Maui SP-2: the MPI BLACS, the mpxlf Fortran compiler, version 1.1 of LAPACK, and the ESSL library. All functionality and performance testing was done using SCALAPACK installed as described above.

 

Over 870,000 functionality tests of the LU, QR and Cholesky decomposition routines were carried out, for problem sizes 1, 2, 3, 4, ..., 128 and for 1, 2, 3, 4, ..., 16 processors, with the following data distributions for the matrix A, which, using HPF notation, can be described as:

 

!hpf$ processors G(p,q)

!hpf$ distribute A(cyclic(nb),cyclic(nb)) onto G

 

where nb = 17 or 64, and all combinations of p*q = (the number of actual processors) were used. All these tests were run with drivers for a single right-hand side and for multiple right-hand sides for the associated linear system, using both double precision and double precision complex data types. Here "double precision" means "real*8" or "real(kind(1d0))" and "double precision complex" means "complex*16" or "complex(kind(1d0))". Normed residuals were computed, and in all cases these residuals were no more than 3 machine epsilons after scaling was taken into account.
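The acceptance criterion can be illustrated with the following serial sketch of a scaled residual check (the SCALAPACK test drivers perform the equivalent computation on distributed data); the routine name and the explicit pass threshold of 3 are ours, chosen to mirror the observation above.

      ! Illustrative scaled residual check for a computed solution x of Ax = b.
      subroutine check_residual( n, a, lda, x, b )
      implicit none
      integer, intent(in)          :: n, lda
      double precision, intent(in) :: a(lda,n), x(n), b(n)
      double precision :: r(n), anorm, resid
      integer :: i

      r = b - matmul( a(1:n,1:n), x )            ! residual b - A*x
      anorm = 0.d0                               ! infinity norm of A
      do i = 1, n
         anorm = max( anorm, sum( abs( a(i,1:n) ) ) )
      end do
      resid = maxval( abs(r) ) / &
              ( anorm * maxval( abs(x) ) * dble(n) * epsilon(1.d0) )
      if ( resid .le. 3.d0 ) then
         write(*,*) 'PASS: scaled residual = ', resid
      else
         write(*,*) 'FAIL: scaled residual = ', resid
      end if
      end subroutine check_residual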

 

Over 1200 performance tests were conducted using wide nodes on the SP-2. IBM's LoadLeveler was used to dedicate the nodes to our jobs for accurate performance assessment. The directory scalapack/testing/lin/ is created when the SCALAPACK tar file is unpacked and contains the following driver programs that were used for the performance tests:

 

 

 

 

The 'totmem' parameter in the above driver programs had to be increased to 200 million to provide enough memory to run our tests. Moreover, in the data files for these driver programs, error checking was disabled by using -1 for the tolerance parameter and by specifying 'false' for error checking; this was done because error checking requires additional arrays and hence increases memory requirements. Mflops for the complex routines were calculated by counting the real floating point operations. All mflops were calculated using a wall-clock timer.
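For reference, the sketch below shows how such an mflops figure is obtained from a wall-clock time, assuming the standard operation counts of 2n**3/3 for LU, n**3/3 for Cholesky and 4n**3/3 for QR; the function is illustrative and covers only the LU count.

      ! mflops from a wall-clock time using the standard real LU operation
      ! count 2*n**3/3 (Cholesky uses n**3/3, QR uses 4*n**3/3).
      double precision function lu_mflops( n, wall_seconds )
      implicit none
      integer, intent(in)          :: n
      double precision, intent(in) :: wall_seconds
      double precision :: ops
      ops = 2.d0 * dble(n)**3 / 3.d0
      lu_mflops = ops / wall_seconds * 1.d-6
      end function lu_mflops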

 

Tables 3-8 report performance in mflops for the LU, Cholesky and QR factorizations where, using HPF syntax, the data distribution of the matrix A can be described as

 

!hpf$ processors G(p,q)

!hpf$ distribute A(cyclic(64),cyclic(64)) onto G

 

where p*q is the actual number of processors used. For a given number of processors, all combinations of positive integers p and q were executed and the combination giving the best performance was reported. The values given to p and q can greatly affect performance, as illustrated by the following table:

 

 

    Processor Grid      LU    Cholesky      QR
    p=16, q=01         607        1443    1826
    p=08, q=02         965        2003    2079
    p=04, q=04        1353        2001    2327
    p=02, q=08        1604        1553    2172
    p=01, q=16        1244         934    1519

    Table 2: Problem size is 4096 with performance in mflops.
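The grid shapes in Table 2 are simply all factorizations p*q = 16; for a given processor count, the combinations tested can be enumerated with a few lines of Fortran (an illustrative sketch):

      ! List all process grid shapes p x q with p*q = nprocs;
      ! nprocs = 16 yields the five grids shown in Table 2.
      subroutine list_grids( nprocs )
      implicit none
      integer, intent(in) :: nprocs
      integer :: p, q
      do p = 1, nprocs
         if ( mod(nprocs,p) .eq. 0 ) then
            q = nprocs / p
            write(*,'(a,i4,a,i4)') ' p =', p, ',  q =', q
         end if
      end do
      end subroutine list_grids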

 

 

The peak theoretical performance of each wide node is 267 mflops and the matrix multiply routine in IBM's ESSL power2 (esslp2) library attains about 210 mflops for square matrices of size 512 to 1024. Thus, we consider the SCALAPACK performance numbers in Tables 3-8 to be quite good.

 

 

    # of proc    n=512    n=1024    n=2048    n=4096
        1           84       151       183       201
        2          157       235       327       363
        4          195       322       502       588
        8          236       408       720      1013
       16          223       464       898      1604

    Table 3: SCALAPACK double precision LU performance in mflops.

 

 

    # of proc    n=512    n=1024    n=2048    n=4096
        1          155       195       215     na(2)
        2          238       316       383       428
        4          298       452       619       759
        8          363       619       997      1350
       16          371       806      1449      2208

    Table 4: SCALAPACK double precision complex LU performance in mflops.
    (2) There was insufficient memory to run this test.

 

 

    # of proc    n=512    n=1024    n=2048    n=4096
        1          130       177       201       219
        2          155       303       367       415
        4          265       442       580       717
        8          308       580       899      1289
       16          340       762      1388      2003

    Table 5: SCALAPACK double precision Cholesky performance in mflops.

 

 

    # of proc    n=512    n=1024    n=2048    n=4096
        1          138       187       211     na(2)
        2          213       320       385       429
        4          314       484       650       780
        8          366       709      1089      1409
       16          489       969      1646      2365

    Table 6: SCALAPACK double precision complex Cholesky performance in mflops.
    (2) There was insufficient memory to run this test.

 

 

    # of proc    n=512    n=1024    n=2048    n=4096
        1          152       197       219       227
        2          234       311       388       438
        4          271       466       670       812
        8          316       606      1014      1404
       16          313       738      1473      2327

    Table 7: SCALAPACK double precision QR performance in mflops.

 

 

    # of proc    n=512    n=1024    n=2048    n=4096
        1          182       210       230     na(2)
        2          277       367       427       463
        4          397       611       765       877
        8          467       877      1316      1612
       16          562      1235      2116      2845

    Table 8: SCALAPACK double precision complex QR performance in mflops.
    (2) There was insufficient memory to run this test.

 

 

CONCLUSIONS

 

At the present time, it does not appear to be possible to develop high performance LU, Cholesky and QR factorizations using IBM's HPF compiler. Moreover, this compiler can require excessive amounts of memory, making it unacceptable for problems requiring large amounts of memory. We consider it unfortunate not to be able to use HPF for these routines, because code written in HPF is easier to develop and maintain than code with explicit message passing. However, HPF compilers are still new, and we expect them to continue to improve, so that within a few years it may be possible to develop high performance linear algebra routines using HPF.

 

We have found that the version 1.4 SCALAPACK LU, Cholesky and QR factorization routines do provide high performance for parallel execution on the IBM SP-2 when using the MPI version of the BLACS and the BLAS from IBM's ESSL library. Moreover, we expect that these SCALAPACK routines will perform well on any parallel computer that provides fast communication via MPI and has high performance (selected level 3) BLAS routines available.

 

 

ACKNOWLEDGMENTS

 

We would like to thank Visual Numerics for supporting this work and Gyan Bhanot of the IBM Watson Research Center for his willingness to review our HPF LU factorization code for this project. We also thank Richard Hanson from Visual Numerics Inc. for reviewing this manuscript and for his assistance during this project.

 

Computer time on the Maui High Performance Computer Center’s SP-2 was sponsored by the Phillips Laboratory, Air Force Material Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.

 

 

REFERENCES

 

  1. J. Choi, J. Dongarra, L. Ostrouchov, A. Petitet, D. Walker, and R. Whaley, The Design and Implementation of the SCALAPACK LU, QR, and Cholesky Factorization Routines, ORNL/TM-12470, September 1994.

  2. J. Choi, J. Dongarra, L. Ostrouchov, A. Petitet, D. Walker, and R. Whaley, A Proposal for a Set of Parallel Basic Linear Algebra Subprograms, Computer Science Department Technical Report CS-95-292, Univ. of Tennessee, Knoxville, 1995. (Available as LAPACK Working Note 100 at http://www.netlib.org/lapack/lawns/lawn100.ps)

  3. J. Choi, J. Dongarra, R. Pozo, and D. Walker, SCALAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers, Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, McLean, Virginia, pp. 120-127, IEEE Publishers, October 1992.

  4. Cray MPP Fortran Reference Manual, SR-2504 6.2.2, Cray Research, Inc., June 1995.

  5. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, Algorithm 679: A Set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 16 (1990), pp. 18-28.

  6. J. Dongarra and R. Whaley, A User's Guide to the BLACS v1.0, Computer Science Department Technical Report CS-95-281, Univ. of Tennessee, 1995. (Available as LAPACK Working Note 94 at http://www.netlib.org/lapack/lawns/lawn94.ps)

  7. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.

  8. G. Golub and C. van Loan, Matrix Computations, Second Edition, The Johns Hopkins University Press, 1989.

  9. W. Gropp, E. Lusk, and A. Skjellum, Using MPI, The MIT Press, 1994.

  10. C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel, The High Performance Fortran Handbook, The MIT Press, 1994.

  11. G. Luecke, J. Coyle, W. Haque, J. Hoekstra, and H. Jespersen, Performance Comparison of Workstation Clusters for Scientific Computing, SUPERCOMPUTER, vol. XII, no. 2, pp. 4-20, March 1996.

  12. Optimization and Tuning Guide for Fortran, C, and C++ for AIX Version 4, Second Edition, IBM, June 1996.

 

 

HPF LU CODE LISTING

 

      parameter ( n = 1024, lda = n+2 )
! Compile this with:
!   xlhpf -O3 -qarch=pwr2 -qhot -qnosave -qreport=hpflist -o prog x.f
! secondr() is the name of the double precision wall-clock timer used.
      implicit double precision (a-h,o-z)
      double precision secondr, mflops
      double precision a(lda,n), b(n), v(n), t(n)
      dimension ipvt(n)
!HPF$ PROCESSORS processors(number_of_processors())
!HPF$ distribute a(*,cyclic) onto processors
!HPF$ align t(:) with a(*,:)
! !HPF$ align b(:) with a(:,*)

! Choose b so that the exact solution of Ax = b is a vector of ones.
      call random_number(a)
      b = 1.d0
      b = matmul(a,b)
      ipvt(1:1) = maxloc(abs(b))

      t2 = -1.d0
      t1 = -1.d0
      t0 = secondr()
      t1 = secondr()

! LU factorization with partial pivoting; only this loop nest is timed.
      do k = 1, n
         ipvt(k:k) = maxloc(abs(a(k:n,k))) + k - 1
         L = ipvt(k)
         if ( L .ne. k ) then
!HPF$ INDEPENDENT, NEW(tmp)
            do j = 1, n
               tmp    = a(k,j)
               a(k,j) = a(L,j)
               a(L,j) = tmp
            enddo
         endif
         if ( a(k,k) .ne. 0.d0 ) then
            a(k+1:n,k) = a(k+1:n,k)*(1.d0/a(k,k))
         endif
!HPF$ INDEPENDENT, NEW(tmp1,L3)
         do L2 = k+1, n
            tmp1 = a(k,L2)
!HPF$ INDEPENDENT
            do L3 = k+1, n
               a(L3,L2) = a(L3,L2) - a(L3,k)*tmp1
            enddo
         enddo
      enddo
      t2 = secondr()

! Apply the row interchanges to b, then do the forward elimination.
      do k = 1, n
         L = ipvt(k)
         if ( L .ne. k ) then
            tmp  = b(k)
            b(k) = b(L)
            b(L) = tmp
         endif
      enddo
      do k = 1, n
         b(k+1:n) = b(k+1:n) - a(k+1:n,k)*b(k)
      enddo

! Backsolve
      do j = n, 1, -1
         b(j)     = b(j)/a(j,j)
         b(1:j-1) = b(1:j-1) - b(j)*a(1:j-1,j)
      enddo

      time = (t2-t1)
      if (time .eq. 0.d0) time = -1.d0
      ops    = 2.d0*(1.d0*n)**3/3.d0
      mflops = ops/time*1.d-6
      write(*,*) 'N, # of processors : ', N, number_of_processors()
      write(*,*) 'time, mflops = ', time, mflops
      err_max = maxval(abs(b-1.d0))
      write(*,*) 'Max error is ', err_max
      end