The first step in evaluating the performance of a multiprocessor system is to establish the base level performance of the uniprocessor relative to a well-known system such as the VAX-11/780. A large number of single-user benchmarks were used to establish this base level.
Single-User Performance
Single-user performance was evaluated by using traditional synthetic benchmarks, well-known industry standards, and real application programs from engineering, scientific, commercial, and general timesharing environments. Most of the synthetic benchmarks are in FORTRAN; the industry standards are Whetstones, Dhrystones, Linpack, and others. The real applications, as mentioned, represent four environments.
Performance Evaluation of the VAX 6200 Systems
[Figure 6: histogram. Vertical axis: number of benchmarks, 0 to 35. Horizontal axis: VAX 6200 performance relative to VAX-11/780 (VAX-11/780 system = 1), in bins 1.6-1.9, 1.9-2.2, 2.2-2.5, 2.5-2.8, 2.8-3.1, 3.1-3.4, 3.4-3.7, 3.7-4.0, 4.0-4.3.]
Figure 6 Frequency Distribution of the VAX 6210 Performance on the Single-User Benchmark Set
These benchmarks were used to evaluate uniprocessor speed compared to a VAX-11/780 system. A frequency distribution of the speedup factors on all these benchmarks was plotted, and the central tendency was examined. (See Figure 6.) A high percentage of the benchmarks fell between 2.2 and 2.8.
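The binning behind a frequency distribution like the one in Figure 6 can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the sample values are invented for the example and are not the measured VAX 6210 results, and the choice to place a boundary value such as 2.5 in the upper bin is an assumption, since the source bin labels share their endpoints.

```python
from collections import Counter
from statistics import mean, median

def bin_label(speedup, start=1.6, width=0.3):
    """Return the 'low-high' label of the bin that contains `speedup`."""
    # Work in integer tenths so values on a bin boundary are not
    # misplaced by floating-point rounding.
    tenths = round(speedup * 10)
    start_t = round(start * 10)
    width_t = round(width * 10)
    index = (tenths - start_t) // width_t
    low = start_t + index * width_t
    return f"{low / 10:.1f}-{(low + width_t) / 10:.1f}"

def frequency_distribution(speedups):
    """Count how many speedup factors fall in each bin."""
    return Counter(bin_label(s) for s in speedups)

# Hypothetical speedup factors, NOT the measured VAX 6210 results.
sample = [1.8, 2.3, 2.4, 2.6, 2.7, 2.7, 3.0, 3.3]
dist = frequency_distribution(sample)
center_mean = mean(sample)      # one measure of central tendency
center_median = median(sample)  # another, less sensitive to outliers
```

Comparing the mean with the median is one simple way to examine central tendency when a few benchmarks sit far from the bulk of the distribution.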
Table 7 summarizes the performance of the VAX 6210 in the single-user environment relative to a VAX-11/780 system. The performance average of the VAX 6210 system, across all these benchmarks, is 2.8 times the performance of a VAX-11/780 system.
Decomposed Single-User Performance
VAX 6200 performance on decomposed programs was evaluated through the use of manual and directed decomposition techniques. To begin with, a program is evaluated to see if some
Table 7 Performance of the VAX 6210 in the Single-User Environment
Synthetic Benchmark Set:
    Single-user set
Industry-standard Benchmarks:
    Whet-s & -d
    Linpack-s
    Linpack-d
    Dhrystone
Real Application Benchmark Set:
    Engineering set
    Scientific set
Performance relative to VAX-11/780, in source order (row alignment lost in extraction): 7 2   2.5   2.3   2.7   3.2   2.8   2.8   2.6
segments can be separated into parallel threads that can be run independently. Then the program is decomposed and run, either manually or through directives. The program is initiated as a single job; then the segments of the program that lend themselves to decomposition are divided into subprocesses and executed in parallel on different processors. In the manual decomposition method, the optimal number of subprocesses for various levels of multiprocessor systems is evaluated by varying the number of subprocesses and calculating the speedup factors. In the directed decomposition method, the compiler takes care of various optimization factors. These programs were run standalone with no interference from any other programs on the system. Figure 7 illustrates the decomposition process.
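The manual method described above, varying the number of subprocesses and calculating the speedup factor for each, can be sketched as a small calculation. The function names and the elapsed times below are hypothetical and chosen only for illustration; they are not measurements from the VAX 6200 systems.

```python
def speedup_table(serial_time, parallel_times):
    """Map each subprocess count to its speedup over the single-process run."""
    return {n: serial_time / t for n, t in parallel_times.items()}

def best_subprocess_count(serial_time, parallel_times):
    """Return the subprocess count that gives the largest speedup."""
    speedups = speedup_table(serial_time, parallel_times)
    return max(speedups, key=speedups.get)

# Hypothetical elapsed times in seconds, keyed by number of subprocesses.
timings = {1: 100.0, 2: 52.0, 4: 27.0, 8: 29.0}
table = speedup_table(100.0, timings)
best = best_subprocess_count(100.0, timings)
```

In this invented data set, going beyond four subprocesses adds overhead without adding speed, which is the kind of turning point the manual procedure is meant to find for each multiprocessor configuration.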
The benchmark description is as follows. To evaluate the maximum speedup factors that can be achieved through decomposition, code segments were selected. Such segments as matrix multiplication and convolution are widely used in engineering/scientific applications. Different array sizes (from 100 to 1000) were used with various arithmetic data types such as integer, and single and double precision.
An image processing program and the Linpack 1000D program were used to represent real application programs, where only certain segments can be decomposed.
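To make the idea of a decomposable kernel concrete, the sketch below splits a matrix multiplication into independent row blocks, one per worker. It is an illustration only: the original kernels were not written in Python, all names here are hypothetical, and the use of a thread pool is an assumption made to keep the example self-contained. In CPython, threads show the structure of the decomposition rather than a real speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]

def split_rows(a, parts):
    """Split A into at most `parts` contiguous row blocks of near-equal size."""
    size, extra = divmod(len(a), parts)
    blocks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty blocks when parts > number of rows
            blocks.append(a[start:end])
        start = end
    return blocks

def parallel_matmul(a, b, workers=4):
    """Compute A x B by giving each worker its own block of rows of A."""
    blocks = split_rows(a, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(multiply_rows, blocks, [b] * len(blocks)))
    # Reassemble the row blocks in their original order.
    return [row for block in results for row in block]
```

Each row block depends only on its own rows of A and on all of B, so the blocks need no synchronization until the results are reassembled. That independence is what makes such kernels good candidates for decomposition.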
The performance results are as follows. The multiprocessor efficiency measure, defined as the relative speedup obtained by the addition of each
processor, is the key metric used here to evaluate
Digital Technical Journal
[Figure 7: one program decomposed into parallel code running on processors 1 through 4 as subprocesses 1 through 4. Speed = time taken to complete the job.]
Figure 7 Program Decomposition Process
performance. As seen in Figure 8, the multiprocessor efficiency measure on the program kernels is fairly linear. Multiprocessor synchronization is minimal in this computing environment. The performance was very close to the theoretical maximum. A speedup of 3.9 times the uniprocessor performance was achieved on the four-processor 6240 system. The performance on the image processing program is slightly lower than what was observed on the program kernels. Thus performance gained by decomposition depends directly on the amount of code that can be run in parallel. (Note: On the Linpack 1000D program, directed decomposition was used, whereas on the other programs, manual decomposition was used.)
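The relationship between speedup and the amount of parallel code can be illustrated with a short calculation. Two assumptions are made here that the text does not state. First, efficiency is read as speedup divided by processor count, which is one common interpretation of the measure. Second, Amdahl's law is used as the model linking parallel fraction to speedup. The function names are hypothetical.

```python
def efficiency(speedup, processors):
    """Speedup per processor; 1.0 would be perfectly linear scaling."""
    return speedup / processors

def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

def implied_parallel_fraction(speedup, processors):
    """Invert Amdahl's law: the parallel fraction that would yield `speedup`."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)

# The 3.9x speedup on four processors reported for the program kernels.
eff = efficiency(3.9, 4)                  # about 0.975
frac = implied_parallel_fraction(3.9, 4)  # fraction of work run in parallel
```

Under this model, a speedup of 3.9 on four processors corresponds to a program that is almost entirely parallel, while a program with a larger serial portion, such as the image processing program, would be expected to fall further below the theoretical maximum.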