COMPARING THE PERFORMANCE OF MPI

 

ON THE CRAY RESEARCH T3E AND IBM SP-2 1

 

by

 

Glenn R. Luecke and James J. Coyle

grl@iastate.edu and jjc@iastate.edu

Iowa State University

Ames, Iowa 50011-2251, USA

 

Waqar ul Haque

haque@unbc.edu

University of Northern British Columbia

Prince George, British Columbia, Canada V2N 4Z9

February 8, 1997

 

Abstract: This paper reports the performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Communication tests were performed for messages of 8 bytes, 1 KB, 100 KB, and 10 MB with 2, 4, 8, 16, 32, and 64 processors. Both machines provided a very high level of concurrency on the nearest-neighbor communication tests and moderate concurrency on the broadcast operations. On the tests used, the T3E significantly outperformed the SP-2, running at least three times faster on most tests.

 

INTRODUCTION

 

The Message Passing Interface (MPI) [5,8] is rapidly becoming the standard for writing scientific programs with explicit message passing, displacing PVM [3] and the BLACS [2]. The communication network of a parallel computer plays a vitally important role in its overall performance; see [4,6]. Communication in scientific programs can occur in so many different ways that it is not possible to test all of them. The MPI communication tests developed here have therefore been designed to exercise those communication patterns that we feel are likely to occur in scientific programs. The purpose of this study is to evaluate the communication performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing.

 

DESCRIPTION OF THE PERFORMANCE TESTS AND RESULTS

 

All our communication tests have been written in Fortran with calls to MPI routines for the message passing. These tests were run with message sizes ranging from 8 bytes to 10 MB and with the number of processors ranging from 2 to 64. Some of these communication patterns take very little time to execute, so they are looped until the total wall-clock time is at least one second in order to obtain accurate timings; the time of the communication pattern is then the total time divided by the number of iterations. All timings were done using a wall-clock timer. Tests were run at least ten times and the best performance numbers are reported.

1 A follow-on study is planned that will evaluate the performance of MPI on the SGI ORIGIN 2000, IBM's follow-on to the SP-2, and the Cray Research T3E-900.
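The looped-timing scheme can be sketched as follows. This is a minimal illustration in Python rather than the Fortran used for the actual tests; `communication_pattern` is a hypothetical stand-in for one execution of the pattern being timed.

```python
import time

def time_pattern(communication_pattern, min_elapsed=1.0):
    """Repeat the pattern until at least `min_elapsed` seconds of
    wall-clock time have passed, then return seconds per iteration."""
    iterations = 0
    start = time.perf_counter()
    while True:
        communication_pattern()
        iterations += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_elapsed:
            return elapsed / iterations

# Time a trivial stand-in operation (a real test would run an MPI
# communication pattern here, and keep the best of ten runs).
per_call = time_pattern(lambda: sum(range(100)), min_elapsed=0.1)
print(per_call > 0.0)
```

In the actual study this measurement was repeated at least ten times and the best (smallest) per-iteration time kept.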

 

Tests for the Cray T3E were run on a 64-processor T3E located at Cray Research's corporate headquarters in Eagan, Minnesota. The peak theoretical performance of each processor is 600 Mflops. The communication network, a bi-directional 3-D torus, has a bandwidth of 350 MB/second and a latency of 1.5 microseconds. For more information on the T3E see http://www.cray.com. The operating system used was UNICOS/mk version 1.3.1 and the Fortran compiler was cf90 version 3.0 with the -O2 optimization level.

 

Tests for the IBM SP-2 were run at the Maui High Performance Computing Center. The peak theoretical performance of each processor is 267 Mflops. The communication network has a peak bi-directional bandwidth of 40 MB/second, with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Wide (thin) nodes have a 256 (64) KB data cache, a 256 (64) bit path from memory to the data cache, and a 256 (128) bit path from the data cache to the processor bus. The SP-2 uses a bi-directional multistage interconnection network and may be configured with thin and/or wide nodes; performance tests were done separately for thin nodes and for wide nodes. For more information about the SP-2, see http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html. The machine did not have 64 wide nodes available for our tests, so no 64-processor results are reported for wide nodes. AIX version 4.1 was used and the Fortran compiler was xlf version 3.2.4 with the -O2 optimization level.

 

Communication Test 1 (Table 1):

The first communication test sends a message from one processor to another; the second processor then sends a single-integer message back to the sender indicating that the message was received. This test differs from the COMMS1 test described in section 3.3.1 of [4], in which the same message is sent back to the first processor. Table 1 gives performance rates in KB per second. Notice that the T3E consistently outperforms the wide nodes by more than a factor of three. Thin-node rates were 5% to 14% lower than wide-node rates.
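The send-and-acknowledge protocol of this test can be sketched as follows. The actual tests used Fortran with MPI point-to-point calls; here, as a hedged illustration only, two Python threads with queues stand in for the two processors and their MPI channels.

```python
import threading
import queue

def ping_with_ack(message):
    """Send `message` from a 'sender' to a 'receiver'; the receiver
    replies with a single-integer acknowledgement, as in Test 1."""
    to_receiver = queue.Queue()   # stands in for the MPI send channel
    to_sender = queue.Queue()     # stands in for the ack channel
    received = {}

    def receiver():
        received["msg"] = to_receiver.get()  # receive the message
        to_sender.put(1)                     # single-integer ack

    t = threading.Thread(target=receiver)
    t.start()
    to_receiver.put(message)                 # send the message
    ack = to_sender.get()                    # wait for the ack
    t.join()
    return received["msg"], ack

msg, ack = ping_with_ack(b"x" * 8)   # 8-byte message, as in Table 1
print(len(msg), ack)                 # prints: 8 1
```

Timing the full exchange and dividing the message size by that time yields the KB/second rates reported in Table 1.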

 

 

Message Size (bytes)   SP-2 thin node   SP-2 wide node   Cray T3E   Ratio T3E/wide
8                                 106              114        374              3.3
1,000                           5,625            6,519     20,374              3.1
100,000                        27,619           31,349    111,742              3.6
10,000,000                     32,304           33,878    111,551              3.3

Table 1: Processor to processor communication rate in KB/second.

 

 

Communication Test 2 (Table 2):

We next measure communication rates for sending a message from one processor to all of the other processors. This test is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a Normalized Broadcast Rate as

 

(total data rate)/(N-1)

 

where N is the total number of processors involved in the communication and the total data rate is the total amount of data sent on the communication network per unit time, measured in KB per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the N-1 other processors. If the broadcast operation and communication network were able to transmit the messages fully concurrently, then D = R*(N-1) and the Normalized Broadcast Rate would remain constant as N varied for a given message size. Thus, the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far the broadcast operation is from being ideal. Table 2 gives the Normalized Broadcast Rates for the T3E and for both wide and thin nodes on the SP-2. Notice that in all cases, instead of being constant for a given message size, the Normalized Broadcast Rate decreases significantly as the number of processors increases. Also notice that for all message sizes the T3E is roughly 3 to 5 times faster than wide nodes, but this factor decreases as the number of processors used increases.
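The ideal-concurrency argument can be checked numerically. In this small sketch, the 100 KB/s point-to-point rate R is an arbitrary illustrative value, not a measured one.

```python
def normalized_broadcast_rate(total_data_rate, n):
    """Normalized Broadcast Rate: total data rate divided by the
    number of receiving processors, N - 1."""
    return total_data_rate / (n - 1)

def ideal_total_rate(point_to_point_rate, n):
    """Under perfect concurrency the broadcast moves N - 1 copies
    at once, so D = R * (N - 1)."""
    return point_to_point_rate * (n - 1)

# With R = 100 KB/s and ideal concurrency, the normalized rate
# stays equal to R no matter how many processors participate:
for n in (2, 4, 8):
    d = ideal_total_rate(100, n)
    print(normalized_broadcast_rate(d, n))   # prints 100.0 each time
```

The measured rates in Table 2 fall well below this constant-rate ideal as N grows, which is exactly the deviation the normalization is designed to expose.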

 

 

Message Size (bytes)   Procs   SP-2 thin   SP-2 wide   Cray T3E   Ratio T3E/wide
8                          2          75          78        345        4.4
8                          4          52          55        195        3.5
8                          8          31          38        101        2.7
8                         16          20          24         47        2.0
8                         32          12          14         16        1.1
8                         64           5          na          4
1,000                      2       4,587       5,184     23,609        4.6
1,000                      4       2,711       3,374     13,485        4.0
1,000                      8       1,837       2,433      7,380        3.0
1,000                     16       1,366       1,717      3,759        2.2
1,000                     32         833       1,073      1,398        1.3
1,000                     64         378          na        412
100,000                    2      27,181      30,685     98,729        3.2
100,000                    4       9,755      10,826     51,874        4.8
100,000                    8       5,057       5,516     35,357        6.4
100,000                   16       4,650       8,268     26,632        3.2
100,000                   32       3,190       4,049     21,545        5.3
100,000                   64       1,943          na     17,833
10,000,000                 2      32,298      33,614    100,336        3.0
10,000,000                 4      10,970      11,332     53,220        4.7
10,000,000                 8       5,507       5,680     36,821        6.5
10,000,000                16      10,638      16,472     27,321        1.7
10,000,000                32       9,662      14,523     22,371        1.5
10,000,000                64       8,147          na     18,466

Table 2: Normalized Broadcast Rates in KB/second.

 

 

To see that there actually is concurrency occurring in the broadcast operation, define the Log Normalized Broadcast Rate as

 

(total data rate)/Log(N),

 

where N is the number of processors involved in the communication and Log(N) is the log base 2 of N. Thus, if binary-tree parallelism were being utilized, the Log Normalized Broadcast Rate would be constant for a given message size as N varies. Table 2-b gives the Log Normalized Broadcast Rates and shows that concurrency is in fact being utilized in the broadcast operation on both machines.
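Since Tables 2 and 2-b normalize the same total data rate, each Table 2-b entry equals the corresponding Table 2 entry multiplied by (N-1)/Log(N). A quick consistency check on the 8-byte thin-node column (small rounding effects aside, since Table 2 itself is rounded):

```python
import math

def log_normalized_from_normalized(nbr, n):
    """Convert a Normalized Broadcast Rate (total rate / (N-1))
    into a Log Normalized Broadcast Rate (total rate / log2(N))."""
    total_rate = nbr * (n - 1)
    return total_rate / math.log2(n)

# 8-byte thin-node entries from Table 2, converted and rounded,
# reproduce the corresponding Table 2-b entries: 75, 78, 72, 75, 74.
table2_thin = {2: 75, 4: 52, 8: 31, 16: 20, 32: 12}
for n, nbr in table2_thin.items():
    print(n, round(log_normalized_from_normalized(nbr, n)))
```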

 

 

Message Size (bytes)   Procs   SP-2 thin   SP-2 wide   Cray T3E   Ratio T3E/wide
8                          2          75          78        345        4.4
8                          4          78          83        293        3.5
8                          8          72          89        236        2.7
8                         16          75          90        176        2.0
8                         32          74          87         99        1.1
8                         64          53          na         42
1,000                      2       4,587       5,184     23,609        4.6
1,000                      4       4,067       5,061     20,228        4.0
1,000                      8       4,286       5,677     17,220        3.0
1,000                     16       5,123       6,439     14,096        2.2
1,000                     32       5,165       6,653      8,668        1.3
1,000                     64       3,969          na      4,326
100,000                    2      27,181      30,685     98,729        3.2
100,000                    4      14,633      16,239     77,811        4.8
100,000                    8      11,800      12,871     82,500        6.4
100,000                   16      17,438      31,005     99,870        3.2
100,000                   32      19,778      25,104    133,579        5.3
100,000                   64      20,402          na    187,247
10,000,000                 2      32,298      33,614    100,336        3.0
10,000,000                 4      16,455      16,998     79,830        4.7
10,000,000                 8      12,850      13,253     85,916        6.5
10,000,000                16      39,893      61,770    102,454        1.7
10,000,000                32      59,904      90,043    138,700        1.5
10,000,000                64      85,544          na    193,893

Table 2-b: Log Normalized Broadcast Rates in KB/second.

 

 

Communication Test 3 (Table 3):

Communication Test 3 measures the rates for broadcasting a message from one processor to all other processors and then having each receiving processor return the message to the originating processor. To prevent an optimizing compiler from recognizing that the same message is being returned, and hence that it need not be sent, each receiving processor alters one element of the message before sending it back. Notice that this communication pattern causes significantly more data traffic on the communication network than the simple broadcast operation of Communication Test 2. Table 3 gives the Normalized Broadcast Rates for this operation. The trends in Table 3 are similar to those of Table 2, but the data rates for 10 MB messages with 16, 32, and 64 processors drop off significantly on the SP-2.
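The receiver-side trick of altering one element before returning the message can be sketched as follows (an illustrative Python sketch; the actual tests did this in Fortran on the received buffer):

```python
def echo_with_alteration(message):
    """Receiver-side step of Test 3: flip one element before
    returning, so a compiler cannot conclude the returned message
    is identical to the received one and elide the transfer."""
    reply = bytearray(message)
    reply[0] ^= 1          # alter a single element
    return bytes(reply)

returned = echo_with_alteration(b"abc")
print(returned != b"abc", len(returned))   # prints: True 3
```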

 

 

Message Size (bytes)   Procs   SP-2 thin   SP-2 wide   Cray T3E   Ratio T3E/wide
8                          2         103         124        426        3.4
8                          4          64          74        202        2.7
8                          8          40          45         96        2.1
8                         16          23          26         37        1.4
8                         32          12          13         12        0.9
8                         64           5          na          3
1,000                      2       5,925       7,676     28,811        3.8
1,000                      4       3,665       4,735     11,482        2.4
1,000                      8       2,243       2,841      5,009        1.8
1,000                     16       1,229       1,578      1,584        1.0
1,000                     32         638         835        418        0.5
1,000                     64         294          na        106
100,000                    2      27,226      31,479    108,573        3.4
100,000                    4      11,292      12,778     68,032        5.3
100,000                    8       5,629       6,420     42,286        6.6
100,000                   16       2,768       3,419     25,039        7.3
100,000                   32       1,466       1,731     14,037        8.1
100,000                   64         758          na      7,568
10,000,000                 2      32,109      33,766    108,512        3.2
10,000,000                 4      13,117      13,542     70,423        5.2
10,000,000                 8       6,518       6,764     47,711        7.1
10,000,000                16       3,333       3,964     29,399        7.4
10,000,000                32       1,899       2,016     16,962        8.4
10,000,000                64         946          na      9,310

Table 3: Normalized Broadcast Rates in KB/second for a broadcast and return.

 

 

Communication Test 4 (Table 4):

The rest of the communication tests are designed to measure communication between "neighboring" processors. As above, let N be the total number of processors and assume that they have been numbered from 0 to N-1. This communication test sends a message from processor i to processor (i+1) mod N, for i = 0, 1, …, N-1. Observe that the total data rate for this test should increase proportionally with N, since the transfers can (hopefully) be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, we define the Normalized Data Rate to be

 

(total data rate)/N.

 

In an ideal parallel computer, the Normalized Data Rate for the above communication would be constant since all communication would be done concurrently. Thus, the degree to which the Normalized Data Rate fails to be constant indicates how far from ideal the machine performs this type of communication. Table 4 gives the Normalized Data Rates for the above communication in KB/second. Notice that for all message sizes the data rate scales well on both machines and that the T3E significantly outperforms the SP-2.
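The ring pattern and its normalization can be sketched as follows (0-based processor indices; a small illustrative check rather than the measured test):

```python
def ring_destinations(n):
    """Destination of each processor i in Communication Test 4:
    processor i sends to (i + 1) mod N."""
    return [(i + 1) % n for i in range(n)]

def normalized_data_rate(total_data_rate, n):
    """Normalized Data Rate: total data rate divided by N; constant
    in N when all N transfers proceed fully concurrently."""
    return total_data_rate / n

# Every processor sends exactly once and receives exactly once,
# so all N transfers can in principle overlap:
dests = ring_destinations(4)
print(dests)                            # prints: [1, 2, 3, 0]
print(sorted(dests) == list(range(4)))  # prints: True
```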

 

 

Message Size (bytes)   Procs   SP-2 thin   SP-2 wide   Cray T3E   Ratio T3E/wide
8                          2          67          75        259        3.5
8                          4          67          72        231        3.2
8                          8          65          70        228        3.2
8                         16          61          67        226        3.4
8                         32          55          66        223        3.7
8                         64          43          na        222
1,000                      2       3,994       4,936     15,845        3.2
1,000                      4       3,642       4,671     14,382        3.1
1,000                      8       3,644       4,507     14,247        3.2
1,000                     16       3,475       4,303     14,142        3.3
1,000                     32       3,444       3,964     14,093        3.6
1,000                     64       3,294          na     13,864
100,000                    2      13,612      15,832     57,801        3.7
100,000                    4      13,554      15,694     57,126        3.6
100,000                    8      13,364      15,645     57,166        3.7
100,000                   16      13,290      15,389     57,079        3.7
100,000                   32      13,197      14,899     56,869        3.8
100,000                   64      12,630          na     55,955
10,000,000                 2      16,091      16,893     56,001        3.3
10,000,000                 4      16,056      16,939     47,475        2.8
10,000,000                 8      15,722      16,808     40,355        2.4
10,000,000                16      15,791      16,767     40,336        2.4
10,000,000                32      15,497      16,739     40,307        2.4
10,000,000                64      13,245          na     40,126

Table 4: Normalized Data Rates for processor i to processor i+1 mod N.

 

 

Communication Test 5 (Table 5):

This next communication pattern sends a message from each processor i to both of its neighbors, (i-1) mod N and (i+1) mod N, for i = 0, 1, …, N-1, where N is the number of processors used in the test. Thus, the amount of data being moved on the network is twice that of the previous test. Table 5 shows the performance results. Notice that the T3E handles the extra data traffic better than the SP-2, and that the data rates again scale well as the number of processors increases.
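The two-neighbor pattern can be sketched as follows (0-based indices; an illustrative sketch of the pattern, not the measured test itself):

```python
def neighbor_destinations(n):
    """Destinations in Communication Test 5: each processor i sends
    to both (i - 1) mod N and (i + 1) mod N."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# Each processor sends two messages, so total network traffic is
# twice that of the one-neighbor ring of Test 4.
dests = neighbor_destinations(4)
print(dests[0])                                  # prints: [3, 1]
print(sum(len(v) for v in dests.values()))       # prints: 8, i.e. 2*N
```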

 

 

Message Size (bytes)   Procs   SP-2 thin   SP-2 wide   Cray T3E   Ratio T3E/wide
8                          2          90         100        487        4.9
8                          4          85          95        478        5.1
8                          8          78          92        475        5.2
8                         16          69          89        470        5.3
8                         32          65          84        414        4.9
8                         64          54          na        386
1,000                      2       6,177       7,374     27,599        3.7
1,000                      4       5,420       6,965     26,804        3.8
1,000                      8       5,166       6,576     26,810        4.1
1,000                     16       4,809       6,616     26,294        4.0
1,000                     32       4,775       6,088     21,783        3.6
1,000                     64       3,845          na     20,773
100,000                    2      14,793      17,715    100,498        5.7
100,000                    4      15,849      20,493     94,391        4.6
100,000                    8      15,969      20,445    100,140        4.9
100,000                   16      15,706      20,389    100,059        4.9
100,000                   32      15,233      19,828     99,950        5.0
100,000                   64      14,914          na     99,212
10,000,000                 2      17,106      18,638    104,615        5.6
10,000,000                 4      18,821      22,556     81,668        3.6
10,000,000                 8      18,334      22,396     72,293        3.2
10,000,000                16      18,122      22,437     72,372        3.2
10,000,000                32      17,799      22,111     72,483        3.3
10,000,000                64      16,518          na     72,512

Table 5: Normalized Data Rates for processor i to processors (i+1) mod N & (i-1) mod N.

 

 

Communication Test 6 (Table 6):

To describe this last communication pattern, let N be the number of processors used for the test and let P be a permutation of the processor numbers 0, 1, …, N-1. This communication pattern sends a message from processor P(i) to processor P((i+1) mod N). Notice that this is exactly the communication pattern of Communication Test 4 if one reorders the processor numbering from 0, 1, …, N-1 to P(0), P(1), …, P(N-1). The purpose of this test is therefore to determine the impact on performance of reordering the processors; of course, the performance will likely depend on the particular permutation selected. Comparing Table 6 with Table 4, one observes that the Normalized Data Rates for both the T3E and the SP-2 can in fact change significantly depending on the permutation selected.
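The claim that this is Test 4 under a relabeling can be checked directly: once each processor p is renumbered by its position in P, the permuted ring becomes the ordinary ring. A small check with 0-based indices and an arbitrary permutation:

```python
def permuted_ring_edges(perm):
    """Edges of Communication Test 6: P(i) -> P((i+1) mod N)."""
    n = len(perm)
    return {(perm[i], perm[(i + 1) % n]) for i in range(n)}

# Relabel each processor p by its position in P; the permuted ring
# is then exactly the ring of Test 4.
perm = [2, 0, 3, 1]
pos = {p: i for i, p in enumerate(perm)}
relabeled = {(pos[a], pos[b]) for a, b in permuted_ring_edges(perm)}
ring = {(i, (i + 1) % 4) for i in range(4)}
print(relabeled == ring)   # prints: True
```

The performance difference between Tables 4 and 6 therefore comes entirely from where the communicating pairs land on the physical network, not from the logical pattern.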

 

 

Message Size (bytes)   Procs   SP-2 thin   SP-2 wide   Cray T3E   Ratio T3E/wide
8                          2          88          99        483        4.9
8                          4          86          96        472        4.9
8                          8          79          94        461        4.9
8                         16          75          91        440        4.8
8                         32          69          85        408        4.8
8                         64          55          na        398
1,000                      2       5,957       7,318     28,127        3.8
1,000                      4       5,697       6,851     27,067        4.0
1,000                      8       5,353       6,770     26,578        3.9
1,000                     16       5,256       6,593     21,808        3.3
1,000                     32       5,046       6,339     20,775        3.3
1,000                     64       4,972          na     20,150
100,000                    2      14,599      17,868    100,425        5.6
100,000                    4      16,015      20,453     90,487        4.4
100,000                    8      15,793      20,474     97,556        4.8
100,000                   16      14,866      20,417     96,906        4.7
100,000                   32      15,554      20,328     89,936        4.4
100,000                   64      14,176          na     80,868
10,000,000                 2      17,171      19,413     99,158        5.1
10,000,000                 4      18,275      22,565     69,569        3.1
10,000,000                 8      17,311      22,520     76,064        3.8
10,000,000                16      17,111      22,378     67,416        3.0
10,000,000                32      17,010      22,179     59,208        2.7
10,000,000                64      14,667          na     54,516

Table 6: Normalized Data Rates for processor P(i) to processor P((i+1) mod N).

 

 

CONCLUSIONS

 

The purpose of this study was to evaluate the communication performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests written in MPI. These tests were designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Both machines showed a very high level of concurrency on the nearest-neighbor communication tests and moderate concurrency on the broadcast tests. On the SP-2, thin nodes delivered rates roughly 10% below wide nodes on our tests. The T3E significantly outperformed the SP-2, running at least three times faster than wide nodes on most tests.

 

 

ACKNOWLEDGMENTS

Computer time on the Maui High Performance Computing Center's SP-2 was sponsored by the Phillips Laboratory, Air Force Materiel Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.

 

We would like to thank Cray Research Inc. for allowing us to use their T3E at their corporate headquarters in Eagan, Minnesota, USA.

 

 

REFERENCES

 

1. Cray MPP Fortran Reference Manual, SR 2504 6.2.2, Cray Research, Inc., June 1995.

2. J. Dongarra and R. Whaley, A User's Guide to the BLACS v1.0, Computer Science Department Technical Report CS-95-281, University of Tennessee, 1995. (Available as LAPACK Working Note 94 at http://www.netlib.org/lapack/lawns/lawn94.ps)

3. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.

4. R. Hockney and M. Berry, Public International Benchmarks for Parallel Computers: PARKBENCH Committee, Report-1, February 7, 1994.

5. W. Gropp, E. Lusk, and A. Skjellum, Using MPI, The MIT Press, 1994.

6. G. Luecke, J. Coyle, W. Haque, J. Hoekstra, and H. Jespersen, Performance Comparison of Workstation Clusters for Scientific Computing, SUPERCOMPUTER, vol. XII, no. 2, pp. 4-20, March 1996.

7. Optimization and Tuning Guide for Fortran, C, and C++ for AIX Version 4, second edition, IBM, June 1996.

8. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference, The MIT Press, 1996.