Introduction
System performance may be viewed quite differently by a computational physicist and an
Internet Service Provider. And while one ISP views performance in terms of transactions per second,
another may focus on system availability. Each user must decide which of the many performance metrics
are most important to their operation and choose a benchmarking suite that measures them. Benchmark
results can uncover bottlenecks in your system and help quantify its performance. Once a baseline
performance measurement is established, subsequent benchmarking can be used to validate a system tuning
effort.
This Technical Area page includes links to a variety of benchmarking tools. Some of them give very
detailed results about a specific computing subsystem and some are designed to present an overall picture
of your system's performance. While we can neither endorse nor recommend any of them, each of them may
contain tutorial information that can help you better understand what a benchmark can do for you, and what
it cannot.
Benchmarking Links
HPL: the HPL benchmark is a software package that solves a
(random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can
thus be regarded as a portable as well as freely available implementation of the High Performance Computing
Linpack Benchmark.
LINPACK: the LINPACK benchmark is a collection of Fortran subroutines
that analyze and solve linear equations and linear least-squares problems. The package solves linear systems
whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal
square. In addition, the package computes the QR and singular value decompositions of rectangular matrices and
applies them to least-squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving
locality of reference.
LMBENCH: the LMBENCH benchmark is a suite of simple,
portable, ANSI/C microbenchmarks for UNIX/POSIX. In general, it measures two key features: latency and bandwidth.
LMBENCH is intended to give system developers insight into basic costs of key operations.
NAS: the NAS Parallel Benchmarks (NPB) are a small set of
programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived
from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The
NPB come in several flavors. NAS solicits performance results for each from all sources.
SPEC: the Standard Performance Evaluation Corporation is a non-profit
corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be
applied to the newest generation of high-performance computers. SPEC develops benchmark suites and also
reviews and publishes submitted results from SPEC member organizations and other benchmark licensees.
SPLASH: Stanford's SPLASH benchmark evaluates performance
of shared memory multiprocessors.
STREAM: the STREAM benchmark is the de facto industry standard benchmark
for the measurement of computer memory bandwidth. The STREAM benchmark measures "real world" bandwidth sustainable
from ordinary user programs -- not the theoretical "peak bandwidth" provided by most vendors.
TPC: the Transaction Processing Performance Council defines transaction processing
and database benchmarks and delivers trusted results to the industry.
Performance Metrics
Cost-Effectiveness: An application may have an execution cost requirement that is just as important as execution
time. As the number of nodes in use by the application increases, the execution time goes down (good), but the cost goes
up (bad) due to the increased inter-process communication overhead. It may be important to determine the
number of nodes that best balances execution time against operating cost.
Cost/Performance Ratio: One modern approach to low-cost computing is to interconnect a whole lot of PCs over a
very fast external data network, and either customize the operating system or run specialized middleware that turns them
into a multicomputer, or cluster. A parallel application that was intended to run on a shared memory multiprocessor
will run on a cluster. The cost/performance ratio tells you how much it would cost to buy a multiprocessor that would
give the SAME performance as the cluster, compared to the cost of the cluster. Ratios of 10:1 were common in the early
days of cluster computing, meaning the cluster ran an application as well as a commercial multiprocessor costing ten times as
much. Today, the ratios have dropped to about 3:1, but a project with more time than money has options that it did
not have twenty years ago.
Execution Times: Real-time applications may have an execution time limit. Calculating the location of a landing
zone for an incoming missile, for example, should complete before it arrives.
Processor Speed: Some applications have to process a task with a complexity that varies dramatically over time.
Predicting the weather, for example, has a complexity that depends upon current local conditions. In these cases, the
system may have a speed requirement in gigaflops (billions of floating-point operations per second), for example,
rather than a specific execution time limit.
Speedup: This standard measure of performance improvement is the ratio of the completion time on some
reference system to the completion time on the system under test. Completion time is often that of a collection of
batch jobs or tasks, measured from the time the first one starts until the last one completes.
Throughput: Many applications process a stream of tasks whose arrival rate is more critical than the complexity
of each task; the time it takes to process one of them is not as important as how many can be processed over a finite
period of time. Be sure to include more tasks in your test stream than you have nodes in your cluster to process them
to ensure a backlog at some point in the processing time period.
Utilization/Loading: Parallel applications may distribute processes across a network of computer nodes. A
balanced system is one in which the loading at each node, or processor utilization, is about equal. Fault
tolerance aside, one would like to have the workload distributed evenly across all of the nodes in the distributed
system's network.
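Several of the metrics above reduce to simple ratios, and a short sketch may make them concrete. The Python below is only an illustration under stated assumptions: the function names are hypothetical, the sample numbers are invented for the example, and the load-imbalance measure (busiest node divided by the mean) is one common choice rather than something this page prescribes.

```python
from statistics import mean


def speedup(reference_seconds, test_seconds):
    """Completion time on the reference system divided by that on the system under test."""
    if reference_seconds <= 0 or test_seconds <= 0:
        raise ValueError("completion times must be positive")
    return reference_seconds / test_seconds


def throughput(tasks_completed, elapsed_seconds):
    """Tasks processed per second over a finite measurement period."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return tasks_completed / elapsed_seconds


def load_imbalance(node_utilizations):
    """Busiest node's utilization divided by the mean (assumed measure).

    A value of 1.0 means every node carries the same load; larger
    values mean the work is spread less evenly.
    """
    average = mean(node_utilizations)
    if average <= 0:
        raise ValueError("mean utilization must be positive")
    return max(node_utilizations) / average


def cost_performance_ratio(multiprocessor_cost, cluster_cost):
    """Cost of an equally fast multiprocessor relative to the cost of the cluster."""
    if cluster_cost <= 0:
        raise ValueError("cluster cost must be positive")
    return multiprocessor_cost / cluster_cost


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print("speedup:", speedup(120.0, 30.0))
    print("throughput (tasks/s):", throughput(500, 20.0))
    print("load imbalance:", load_imbalance([0.75, 0.25, 0.5]))
    print("cost/performance:", cost_performance_ratio(300_000, 100_000))
```

When measuring throughput this way, the task count should exceed the node count, as noted above, so that the measurement period includes a backlog.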
Note: If you would like a related item added to any of the above lists, please send the details to the
Technical Area Coordinator:
Contact
Dr. Alex Vrenios
DSRLab, LLC
Distributed Systems
Research Laboratory
Phoenix, Arizona, USA
Voice: +1 602 377-7720
Email: alex@DSRLab.com
