Cluster Performance:
Far too often one gets wrapped up in the sales hype, believing that more
and faster processors, more physical memory, and a faster interconnect
will always make one's particular application run more quickly. To some
extent that's true, but a clear-headed approach to performance improvement
can net you the same benefit at a potentially much lower cost. And, as
we've said elsewhere, you really must consider the costs and benefits
together.
PHILOSOPHY:
Automobile enthusiasts may fondly recall the muscle car era
of the 1960s. We marveled at the raw power of the 327s, and we wondered
where it might end when the 348s and 409s hit the streets. Competition
from the 413s, 421s and 427s turned the sights and sounds at our local
drag strip into unforgettable memories.
Buying a street-legal racecar off the showroom floor wasn't within
everyone's reach, however, and a closer look at the winners revealed
that they went to great lengths to squeeze every drop of performance out
of their investments. More than a few of the sponsored, trailered, and
spectacularly painted behemoths lost; we sincerely hope that history
will not have to repeat itself in the cluster computing arena.
Let's begin with an understanding of what good performance
means to you: performance can be viewed quite differently by a scientist
who simulates the Earth's atmosphere and by the webmaster of an Internet
search engine. One might view website performance in terms of transactions
per second, while another may focus on system availability. And we all
hope that the weather simulation gets done before the picnic!
Each of us must decide which of the many performance metrics are the most
important and choose a benchmarking suite that measures them. Benchmark
results can uncover bottlenecks in your cluster, and help you quantify
and (hopefully) improve its performance.
The most important question to ask about your cluster is,
"How good is it?"
That is, you must find a way to quantify its performance: get a
baseline measurement for each set of metrics that are important
to your personal goals.
An important ongoing question to ask about your cluster is,
"How much better is it now?"
That is, you must also be diligent in keeping records of what you do to your
cluster's hardware and software, and what effect those actions have on its
performance.
A list of popular metrics (measurable properties of your system),
and benchmarks (utility programs that measure certain of these
properties) follows.
METRICS:
Cost-Effectiveness: An application may have an execution cost
requirement that is just as important as its execution time. As the number of nodes
in use by the application increases, the execution time goes down (good), but the
cost goes up (bad) because of the increased inter-process communication overhead. It
may be important to determine the number of nodes that best balances
execution time against operating cost.
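The trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the run-time model (perfectly divisible work plus a fixed communication penalty for each extra node), the use of node-seconds as a stand-in for cost, and all of the function names are hypothetical, not measurements of any real cluster.

```python
def run_time(nodes, serial_seconds, overhead_per_node):
    """Assumed model: the work divides evenly across the nodes, and each
    extra node adds a fixed communication penalty to the run."""
    return serial_seconds / nodes + overhead_per_node * (nodes - 1)


def node_seconds(nodes, serial_seconds, overhead_per_node):
    """Cost proxy (an assumption): every node is billed for the whole run."""
    return nodes * run_time(nodes, serial_seconds, overhead_per_node)


def cheapest_within_deadline(max_nodes, serial_seconds, overhead_per_node, deadline):
    """Return the node count with the lowest cost among those that finish
    by the deadline, or None if no node count up to max_nodes is fast enough."""
    candidates = [
        n for n in range(1, max_nodes + 1)
        if run_time(n, serial_seconds, overhead_per_node) <= deadline
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda n: node_seconds(n, serial_seconds, overhead_per_node),
    )


if __name__ == "__main__":
    # A 1000-second serial job, 2 seconds of overhead per extra node,
    # and a 150-second deadline on a cluster of up to 32 nodes.
    best = cheapest_within_deadline(32, 1000, 2, 150)
    print(best, run_time(best, 1000, 2))  # prints: 8 139.0
```

Under this model the cost in node-seconds works out to the serial time plus overhead that grows with the square of the node count, so the cheapest acceptable choice is the smallest node count that still meets the deadline. Real applications will need their own measured model.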
Cost/Performance Ratio: One modern approach to low-cost computing
is to interconnect a whole lot of PCs over a very fast external data network, and
either customize the operating system or run specialized middleware that turns them
into a multicomputer, or cluster. A parallel application that was intended to run on
a shared memory multiprocessor can then run on the cluster. The cost/performance ratio
tells you how much it would cost to buy a multiprocessor that gives the SAME performance
as the cluster, compared to the cost of the cluster. Ratios of 10:1 were common in the
early days of cluster computing, meaning the cluster ran an application as well as a
commercial multiprocessor costing ten times as much. Today, the ratios have dropped to
about 3:1, but a project with more time than money has options that it did not have
twenty years ago.
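As a quick worked example, the ratio is a single division. The prices below are purely hypothetical and the function name is ours; the only thing taken from the definition above is that both systems must deliver the same performance on your application.

```python
def cost_performance_ratio(multiprocessor_cost, cluster_cost):
    """Cost of a multiprocessor with the SAME performance as the cluster,
    divided by the cost of the cluster itself."""
    if cluster_cost <= 0:
        raise ValueError("cluster_cost must be positive")
    return multiprocessor_cost / cluster_cost


if __name__ == "__main__":
    # Hypothetical prices: a 300,000 multiprocessor matched by a 100,000 cluster.
    ratio = cost_performance_ratio(300_000, 100_000)
    print(f"{ratio:.0f}:1")  # prints: 3:1
```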
Execution Times: Real-time applications may have an execution time
limit. Calculating the location of a landing zone for an incoming missile, for example,
should complete before it arrives.
Processor Speed: Some applications have to process a task with a
complexity that varies dramatically over time. Predicting the weather, for example, has
a complexity that depends upon current local conditions. In these cases, the system may
have a speed requirement in gigaflops (billions of floating-point operations per
second), for example, rather than a specific execution time limit.
Speedup: This standard measure of performance improvement is the
ratio of the completion time on some reference system to the completion time on the
system under test. Completion time is often that of a collection of batch jobs or tasks,
measured from the time the first one starts until the last one completes.
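A minimal sketch of that calculation follows. The job timestamps are invented for illustration, and the function names are ours; any consistent time unit will do.

```python
def batch_completion_time(jobs):
    """jobs is a list of (start, end) timestamps in seconds. Completion time
    runs from the earliest start to the latest end."""
    first_start = min(start for start, _ in jobs)
    last_end = max(end for _, end in jobs)
    return last_end - first_start


def speedup(reference_seconds, test_seconds):
    """Completion time on the reference system over that on the system under test."""
    return reference_seconds / test_seconds


if __name__ == "__main__":
    # Hypothetical batch of three jobs, run back to back on the reference system...
    reference_jobs = [(0, 400), (400, 800), (800, 1200)]
    # ...and overlapped on the system under test.
    cluster_jobs = [(0, 150), (0, 140), (10, 300)]
    print(speedup(batch_completion_time(reference_jobs),
                  batch_completion_time(cluster_jobs)))  # prints: 4.0
```

A value above 1 means the system under test finished the batch sooner than the reference system.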
Throughput: Many applications process a stream of tasks whose arrival
rate is more critical than the complexity of each task; the time it takes to process one
of them is not as important as how many can be processed over a finite period of time.
Be sure to include more tasks in your test stream than you have nodes in your cluster to
process them to ensure a backlog at some point in the processing time period.
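The reason for the backlog can be seen in a toy simulation. This is a sketch under simplifying assumptions: every task is already queued at time zero, each task goes to whichever node frees up first, and the service times are made up.

```python
import heapq


def simulate_throughput(service_times, nodes):
    """Return tasks completed per second for a stream of tasks that are all
    queued at time zero and handed, in order, to the next free node."""
    if not service_times:
        raise ValueError("need at least one task")
    free_at = [0.0] * nodes          # time at which each node becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for seconds in service_times:
        start = heapq.heappop(free_at)   # earliest available node
        end = start + seconds
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return len(service_times) / finish


if __name__ == "__main__":
    # Twelve 2-second tasks on 4 nodes: every node stays busy.
    print(simulate_throughput([2.0] * 12, 4))  # prints: 2.0
    # Only three tasks on 4 nodes: one node idles, so the figure understates capacity.
    print(simulate_throughput([2.0] * 3, 4))   # prints: 1.5
```

With fewer tasks than nodes, part of the cluster never does any work, and the measured throughput says more about the test stream than about the cluster.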
Utilization/Loading: Parallel applications may distribute processes
across a network of computer nodes. A balanced system is one in which the loading
at each node, or processor utilization, is about equal. Fault tolerance aside, one would
like to have the workload distributed evenly across each node in the distributed system's
network.
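One simple way to put a number on balance is to compare the busiest node with the average. The sketch below assumes you have already collected the busy time of each node over a common measurement interval; the imbalance figure used here (maximum over mean) is one common choice among several, not the only definition.

```python
from statistics import mean


def utilization(busy_seconds, elapsed_seconds):
    """Fraction of the measurement interval that each node spent busy."""
    return [busy / elapsed_seconds for busy in busy_seconds]


def imbalance(utilizations):
    """Busiest node's utilization over the average; 1.0 means perfectly balanced."""
    return max(utilizations) / mean(utilizations)


if __name__ == "__main__":
    # Hypothetical busy times for four nodes over a 100-second run.
    loads = utilization([75, 50, 25, 50], 100)
    print(loads)             # prints: [0.75, 0.5, 0.25, 0.5]
    print(imbalance(loads))  # prints: 1.5
```

Here the busiest node carries one and a half times the average load, which suggests that some of its work could be moved to the lightly loaded node.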
BENCHMARKS:
HPL: the HPL benchmark is a software package that solves a (random) dense
linear system in double precision (64 bits) arithmetic on distributed-memory computers. It
can thus be regarded as a portable as well as freely available implementation of the High
Performance Computing Linpack Benchmark.
LINPACK: the LINPACK benchmark is a collection of Fortran subroutines that
analyze and solve linear equations and linear least-squares problems. The package solves
linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive
definite, triangular, and tridiagonal square. In addition, the package computes the QR and
singular value decompositions of rectangular matrices and applies them to least-squares
problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving
locality of reference.
LMBENCH: the LMBENCH benchmark is a suite of simple, portable, ANSI/C
micro-benchmarks for UNIX/POSIX. In general, it measures two key features: latency and
bandwidth. LMBENCH is intended to give system developers insight into basic costs of key
operations.
NAS: the NAS Parallel Benchmarks (NPB) are a small set of programs
designed to help evaluate the performance of parallel supercomputers. The benchmarks,
which are derived from computational fluid dynamics (CFD) applications, consist of five
kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits
performance results for each from all sources.
SPEC: the Standard Performance Evaluation Corporation is a non-profit
corporation formed to establish, maintain and endorse a standardized set of relevant
benchmarks that can be applied to the newest generation of high-performance computers.
SPEC develops benchmark suites and also reviews and publishes submitted results from SPEC
member organizations and other benchmark licensees.
SPLASH: Stanford's SPLASH benchmark evaluates performance of shared
memory multiprocessors.
STREAM: the STREAM benchmark is the de facto industry standard benchmark
for the measurement of computer memory bandwidth. The STREAM benchmark measures "real world"
bandwidth sustainable from ordinary user programs -- not the theoretical "peak bandwidth"
provided by most vendors.
TPC: the Transaction Processing Performance Council defines transaction
processing and database benchmarks and delivers trusted results to the industry.