Performance


Copyright (c) 2003 - 2008, DSRLab, LLC

All Rights Reserved

"Open source software in support of distributed systems research."


Cluster Performance:

Far too often one gets wrapped up in the sales hype, believing that the more processors you have, the faster they are, the more physical memory they have, and the faster the interconnection speed, the quicker your particular application will run. To some extent, that's true, but a clear-headed approach to performance improvement can net you the same benefit at a potentially much lower cost. And, as we've said elsewhere, you really must consider the costs and benefits together.

PHILOSOPHY:

Automobile enthusiasts may fondly recall the muscle car era of the 1960s. We marveled at the raw power of the 327s, and we wondered where it might end when the 348s and 409s hit the streets. Competition from the 413s, 421s and 427s turned the sights and sounds at our local drag strip into unforgettable memories.

Buying a street-legal racecar off the showroom floor wasn't within everyone's reach, however, and a closer look at the winners revealed that they went to great lengths to squeeze every drop of performance out of their investments. More than a few of the sponsored, trailered, and spectacularly painted behemoths lost; we sincerely hope that history will not have to repeat itself in the cluster computing arena.

Let's begin with an understanding of what good performance means to you: performance can be viewed quite differently by a scientist who simulates the Earth's atmosphere and by the webmaster of an Internet search engine. One might view website performance in terms of transactions per second, while another may focus on system availability. And we all hope that the weather simulation gets done before the picnic!

Each of us must decide which of the many performance metrics are the most important and choose a benchmarking suite that measures them. Benchmark results can uncover bottlenecks in your cluster, and help you quantify and (hopefully) improve its performance.

The most important question to ask about your cluster is,

"How good is it?"

That is, you must find a way to quantify its performance: get a baseline measurement for each metric that is important to your personal goals.

An important ongoing question to ask about your cluster is,

"How much better is it now?"

That is, you must also be diligent in keeping records of what you do to your cluster's hardware and software, and what effect those actions have on its performance.
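One way to keep such records is sketched below. It is only an illustration: the Measurement record, its field names, and the metric name "tps" are our own assumptions, not part of any particular tool. The first entry logged for a metric is treated as the baseline, and every later entry is reported as a percentage change from it.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    label: str    # what was changed on the cluster (free-text note)
    metric: str   # name of the metric that was measured
    value: float  # measured value, in whatever unit the metric uses


def percent_change(baseline, current):
    """Relative change from the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (current - baseline) / baseline


def changes_vs_baseline(log):
    """Compare each entry with the first (baseline) entry for the same metric."""
    baselines = {}
    report = []
    for m in log:
        if m.metric not in baselines:
            baselines[m.metric] = m.value  # first entry becomes the baseline
            continue
        change = percent_change(baselines[m.metric], m.value)
        report.append((m.label, m.metric, change))
    return report


if __name__ == "__main__":
    # Hypothetical log: a baseline run, then a run after a hardware change.
    log = [
        Measurement("baseline", "tps", 200.0),
        Measurement("upgraded switch", "tps", 250.0),
    ]
    for label, metric, change in changes_vs_baseline(log):
        print(f"{label}: {metric} changed by {change:+.1f}%")
```

Whether a positive change is good depends on the metric: more transactions per second is an improvement, a longer execution time is not.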

A list of popular metrics (measurable properties of your system), and benchmarks (utility programs that measure certain of these properties) follows.

METRICS:

Cost-Effectiveness: An application may have an execution cost requirement that is just as important as execution time. As the number of nodes in use by the application increases, the execution time goes down (good), but the cost goes up (bad) due to the increased inter-process communication overhead. Because the two pull in opposite directions, it may be important to determine the number of nodes that gives the best balance between execution time and operating cost.
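The trade-off can be sketched with a few lines of code. The cost model here is an assumption of ours (nodes in use, times run time, times an hourly rate per node), and the timings are made-up numbers standing in for your own measurements. The sketch picks the cheapest node count that still finishes within a time limit.

```python
def run_cost(nodes, hours, rate_per_node_hour):
    """Assumed cost model: every node in use is billed for the whole run."""
    return nodes * hours * rate_per_node_hour


def cheapest_within_deadline(measured_hours, rate_per_node_hour, deadline_hours):
    """Pick the cheapest node count whose measured run time meets the deadline.

    measured_hours maps a node count to the run time measured with that count.
    Returns (nodes, hours, cost), or None if no configuration is fast enough.
    """
    candidates = [
        (run_cost(n, h, rate_per_node_hour), n, h)
        for n, h in measured_hours.items()
        if h <= deadline_hours
    ]
    if not candidates:
        return None
    cost, nodes, hours = min(candidates)
    return nodes, hours, cost


if __name__ == "__main__":
    # Hypothetical timings: run time shrinks with more nodes, but less and
    # less each time because of communication overhead.
    measured = {1: 10.0, 2: 5.5, 4: 3.0, 8: 2.0}
    print(cheapest_within_deadline(measured, 1.0, deadline_hours=4.0))
```

With these numbers, eight nodes finish sooner than four, but four nodes meet the limit at a lower cost.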

Cost/Performance Ratio: One modern approach to low-cost computing is to interconnect a whole lot of PCs over a very fast external data network, and either customize the operating system or run specialized middleware that turns them into a multicomputer, or cluster. A parallel application that was intended to run on a shared memory multiprocessor can then run on the cluster. The cost/performance ratio tells you how much it would cost to buy a multiprocessor that would give the SAME performance as the cluster, compared to the cost of the cluster. Ratios of 10:1 were common in the early days of cluster computing, meaning the cluster ran an application as well as a commercial multiprocessor costing ten times as much. Today, the ratios have dropped to about 3:1, but a project with more time than money has options that it did not have twenty years ago.
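As a small worked example, the ratio is just one price divided by the other. The function name and the prices below are hypothetical, chosen only to show the arithmetic.

```python
def cost_performance_ratio(multiprocessor_cost, cluster_cost):
    """How many times more the equally fast multiprocessor costs than the cluster."""
    if cluster_cost <= 0:
        raise ValueError("cluster_cost must be positive")
    return multiprocessor_cost / cluster_cost


if __name__ == "__main__":
    # Hypothetical prices for two systems that run the application equally well.
    ratio = cost_performance_ratio(multiprocessor_cost=300_000, cluster_cost=100_000)
    print(f"{ratio:.0f}:1")  # prints 3:1
```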

Execution Times: Real-time applications may have an execution time limit. Calculating the location of a landing zone for an incoming missile, for example, should complete before it arrives.

Processor Speed: Some applications have to process a task with a complexity that varies dramatically over time. Predicting the weather, for example, has a complexity that depends upon current local conditions. In these cases, the system may have a speed requirement in gflops (billions of floating-point operations per second), for example, rather than a specific execution time limit.
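To make the unit concrete, here is a minimal sketch that times a naive matrix multiplication and converts the result to gflops. It assumes the usual estimate of roughly 2n^3 floating-point operations for multiplying two n-by-n matrices. Pure Python is far slower than compiled numerical code, so treat the figure it prints as an illustration of the arithmetic, not as a rating of your hardware.

```python
import time


def gflops(flop_count, seconds):
    """Convert an operation count and an elapsed time into gflops."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds / 1e9


def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def measure_matmul_gflops(n=60):
    """Time one n-by-n multiply and report the achieved rate in gflops."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    # Roughly n^3 multiplications plus n^3 additions.
    return gflops(2 * n**3, elapsed)


if __name__ == "__main__":
    print(f"{measure_matmul_gflops():.4f} gflops")
```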

Speedup: This standard measure of performance improvement is the ratio of the completion time on some reference system to the completion time on the system under test, so a value greater than one means the system under test is faster. Completion time is often that of a collection of batch jobs or tasks, measured from the time the first starts until the last one completes.
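The calculation can be sketched as follows. Each job is represented as a (start, end) pair of timestamps, a representation we have assumed for the example, and the job times themselves are hypothetical.

```python
def completion_time(jobs):
    """Time from the first job's start until the last job's end.

    jobs is a collection of (start, end) timestamps in the same unit.
    """
    jobs = list(jobs)
    if not jobs:
        raise ValueError("need at least one job")
    first_start = min(start for start, _ in jobs)
    last_end = max(end for _, end in jobs)
    return last_end - first_start


def speedup(reference_jobs, test_jobs):
    """Reference completion time divided by completion time on the system under test."""
    return completion_time(reference_jobs) / completion_time(test_jobs)


if __name__ == "__main__":
    # Hypothetical batch of three 40-second jobs: run one after another on the
    # reference system, and side by side on the system under test.
    reference = [(0.0, 40.0), (40.0, 80.0), (80.0, 120.0)]
    under_test = [(0.0, 40.0), (0.0, 40.0), (0.0, 40.0)]
    print(speedup(reference, under_test))
```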

Throughput: Many applications process a stream of tasks whose arrival rate is more critical than the complexity of each task; the time it takes to process one of them is not as important as how many can be processed over a finite period of time. Be sure your test stream contains more tasks than your cluster has nodes, so that a backlog builds up at some point during the test period.
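A toy model shows why the backlog matters. The sketch below is our own simplification, not a real scheduler: all tasks are assumed to be waiting at time zero, and each one is handed to whichever node frees up first. Throughput is the number of tasks divided by the time the last one finishes.

```python
import heapq


def simulate_throughput(task_seconds, nodes):
    """Tasks completed per second under a simple first-free-node dispatch model."""
    if nodes < 1 or not task_seconds:
        raise ValueError("need at least one node and one task")
    free_at = [0.0] * nodes  # time at which each node next becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for duration in task_seconds:
        start = heapq.heappop(free_at)  # earliest available node
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return len(task_seconds) / finish


if __name__ == "__main__":
    # Hypothetical 2-second tasks on a 4-node cluster.
    print(simulate_throughput([2.0] * 8, nodes=4))  # every node stays busy
    print(simulate_throughput([2.0] * 6, nodes=4))  # two nodes idle in the second round
```

In this model, a stream that does not keep every node busy reports a lower throughput than the cluster can actually sustain.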

Utilization/Loading: Parallel applications may distribute processes across a network of computer nodes. A balanced system is one in which the loading at each node, or processor utilization, is about equal. Fault tolerance aside, one would like to have the workload distributed evenly across each node in the distributed system's network.
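Balance can be checked with a short calculation like the one below. The imbalance measure used here (busiest node divided by the average) is one common choice that we have assumed for the example, and the busy times are made-up numbers.

```python
def utilization(busy_seconds, elapsed_seconds):
    """Fraction of the measurement period that each node spent busy."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return [busy / elapsed_seconds for busy in busy_seconds]


def imbalance(utilizations):
    """Busiest node relative to the average; 1.0 means perfectly balanced."""
    mean = sum(utilizations) / len(utilizations)
    if mean == 0:
        raise ValueError("no node did any work")
    return max(utilizations) / mean


if __name__ == "__main__":
    # Hypothetical busy times for three nodes over a 120-second run.
    per_node = utilization([30.0, 60.0, 90.0], elapsed_seconds=120.0)
    print(per_node, imbalance(per_node))
```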

BENCHMARKS:

HPL: the HPL benchmark is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

LINPACK: the LINPACK benchmark is built on LINPACK, a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, the package computes the QR and singular value decompositions of rectangular matrices and applies them to least-squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference.

LMBENCH: the LMBENCH benchmark is a suite of simple, portable, ANSI/C micro-benchmarks for UNIX/POSIX. In general, it measures two key features: latency and bandwidth. LMBENCH is intended to give system developers insight into basic costs of key operations.

NAS: the NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources.

SPEC: the Standard Performance Evaluation Corporation is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops benchmark suites and also reviews and publishes submitted results from SPEC member organizations and other benchmark licensees.

SPLASH: Stanford's SPLASH benchmark evaluates performance of shared memory multiprocessors.

STREAM: the STREAM benchmark is the de facto industry standard benchmark for the measurement of computer memory bandwidth. The STREAM benchmark measures "real world" bandwidth sustainable from ordinary user programs -- not the theoretical "peak bandwidth" provided by most vendors.
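To show the kind of measurement involved, here is a rough sketch modeled on STREAM's "triad" kernel, a[i] = b[i] + q*c[i], which touches three arrays of 8-byte values per element. This is not the STREAM benchmark itself: in pure Python the interpreter overhead dominates, so the number it prints will be far below your real memory bandwidth. The function names and default sizes are our own.

```python
import time
from array import array


def triad(b, c, q):
    """Triad-style kernel: a[i] = b[i] + q * c[i]."""
    return array("d", (bi + q * ci for bi, ci in zip(b, c)))


def triad_bandwidth_mb_s(n=200_000, q=3.0):
    """Time one pass of the kernel and report megabytes moved per second."""
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    a = triad(b, c, q)
    elapsed = time.perf_counter() - start
    # Two arrays read and one written, itemsize bytes per element each.
    bytes_moved = 3 * n * a.itemsize
    return bytes_moved / elapsed / 1e6


if __name__ == "__main__":
    print(f"{triad_bandwidth_mb_s():.1f} MB/s (interpreter-bound, illustration only)")
```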

TPC: the Transaction Processing Performance Council defines transaction processing and database benchmarks and delivers trusted results to the industry.