IEEE Technical Committee on Scalable Computing

- Technical Area -

Performance Evaluation: Benchmarking

 

Introduction

System performance may be viewed quite differently by a computational physicist and an Internet Service Provider. And while one ISP views performance in terms of transactions per second, another may focus on system availability. Each user must decide which of the many performance metrics are most important to their operation and choose a benchmarking suite that measures them. Benchmark results can uncover bottlenecks in your system and help quantify its performance. Once a baseline performance measurement is established, subsequent benchmarking can be used to validate a system tuning effort.

This Technical Area page includes links to a variety of benchmarking tools. Some of them give very detailed results about a specific computing subsystem and some are designed to present an overall picture of your system's performance. While we can neither endorse nor recommend any of them, each of them may contain tutorial information that can help you better understand what a benchmark can do for you, and what it cannot.

Benchmarking Links

HPL: the HPL benchmark is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

LINPACK: the LINPACK benchmark is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, the package computes the QR and singular value decompositions of rectangular matrices and applies them to least-squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference.

LMBENCH: the LMBENCH benchmark is a suite of simple, portable, ANSI/C microbenchmarks for UNIX/POSIX. In general, it measures two key features: latency and bandwidth. LMBENCH is intended to give system developers insight into basic costs of key operations.

NAS: the NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources.

SPEC: the Standard Performance Evaluation Corporation is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC develops benchmark suites and also reviews and publishes submitted results from SPEC member organizations and other benchmark licensees.

SPLASH: Stanford's SPLASH benchmark evaluates performance of shared memory multiprocessors.

STREAM: the STREAM benchmark is the de facto industry standard benchmark for the measurement of computer memory bandwidth. The STREAM benchmark measures "real world" bandwidth sustainable from ordinary user programs -- not the theoretical "peak bandwidth" provided by most vendors.

TPC: the Transaction Processing Performance Council defines transaction processing and database benchmarks and delivers trusted results to the industry.

Performance Metrics

Cost-Effectiveness: An application may have an execution cost requirement that is just as important as its execution time. As the number of nodes in use by the application increases, the execution time goes down (good), but the cost goes up (bad), in part because of the increased inter-process communication overhead. Because these two goals pull in opposite directions, it may be important to determine the number of nodes that gives the best balance between execution time and operating cost.
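One way to make this trade-off concrete is to pick the cheapest node count that still meets a time limit. The sketch below is only an illustration: the timing table, the per-node hourly rate, the deadline, and the function names are all hypothetical, and it assumes that operating cost is simply nodes times hours times a flat rate.

```python
def run_cost(nodes, hours, rate_per_node_hour):
    """Operating cost of one run, assuming a flat per-node hourly rate."""
    return nodes * hours * rate_per_node_hour


def cheapest_within_deadline(measurements, rate_per_node_hour, deadline_hours):
    """Return (nodes, cost) for the cheapest run that meets the deadline.

    measurements maps a node count to a measured execution time in hours.
    Returns None when no configuration is fast enough.
    """
    candidates = [
        (run_cost(nodes, hours, rate_per_node_hour), nodes)
        for nodes, hours in measurements.items()
        if hours <= deadline_hours
    ]
    if not candidates:
        return None
    cost, nodes = min(candidates)
    return nodes, cost


# Hypothetical benchmark results: node count -> execution time in hours.
measurements = {1: 10.0, 2: 5.5, 4: 3.2, 8: 2.1, 16: 1.8}

best = cheapest_within_deadline(measurements, rate_per_node_hour=2.0,
                                deadline_hours=4.0)
print("cheapest configuration within the deadline (nodes, cost):", best)
```

With these made-up numbers, adding nodes beyond four still shortens the run but raises the bill, so the four-node run is the cheapest one that meets the four-hour limit.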

Cost/Performance Ratio: One modern approach to low-cost computing is to interconnect a large number of PCs over a very fast external data network, and either customize the operating system or run specialized middleware that turns them into a multicomputer, or cluster. A parallel application that was intended to run on a shared memory multiprocessor can then run on the cluster. The cost/performance ratio tells you how much it would cost to buy a multiprocessor that gives the same performance as the cluster, compared to the cost of the cluster. Ratios of 10:1 were common in the early days of cluster computing, meaning the cluster ran an application as well as a commercial multiprocessor costing ten times as much. Today, the ratios have dropped to about 3:1, but a project with more time than money has options that it did not have twenty years ago.
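As a small worked example of the ratio described above, the sketch below divides the price of a multiprocessor by the price of a cluster that delivers the same performance. The prices and the function name are hypothetical, chosen only to reproduce a 3:1 ratio.

```python
def cost_performance_ratio(multiprocessor_cost, cluster_cost):
    """Cost of an equally fast multiprocessor relative to the cluster."""
    if cluster_cost <= 0:
        raise ValueError("cluster_cost must be positive")
    return multiprocessor_cost / cluster_cost


# Hypothetical prices for two systems that run the application equally well.
ratio = cost_performance_ratio(multiprocessor_cost=300_000,
                               cluster_cost=100_000)
print(f"cost/performance ratio: {ratio:.0f}:1")
```

The comparison is only meaningful when both systems have been benchmarked on the same application and shown to perform about equally.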

Execution Times: Real-time applications may have an execution time limit. Calculating the location of a landing zone for an incoming missile, for example, must complete before the missile arrives.
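A simple way to check a run against a time limit is to wrap it in a wall-clock timer. The sketch below is an assumption-laden illustration: sample_task is a hypothetical stand-in for the real computation, and the half-second limit is arbitrary.

```python
import time


def meets_deadline(task, limit_seconds):
    """Run task() once and report (finished_in_time, elapsed_seconds)."""
    start = time.perf_counter()
    task()
    elapsed = time.perf_counter() - start
    return elapsed <= limit_seconds, elapsed


def sample_task():
    # Hypothetical stand-in for the real computation being timed.
    return sum(i * i for i in range(100_000))


ok, elapsed = meets_deadline(sample_task, limit_seconds=0.5)
print(f"finished in {elapsed:.4f} s; within limit: {ok}")
```

A single timed run shows only one sample. A real-time requirement is usually judged against the slowest of many runs under realistic load, not the average.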

Processor Speed: Some applications have to process a task with a complexity that varies dramatically over time. Predicting the weather, for example, has a complexity that depends upon current local conditions. In these cases, the system may have a speed requirement in gigaflops (billions of floating-point operations per second), for example, rather than a specific execution time limit.
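A floating-point rate is obtained by dividing the number of operations performed by the elapsed time. The sketch below illustrates this with a dense matrix multiply, which takes roughly 2n^3 floating-point operations for n-by-n matrices; the function names, the matrix size, and the four-second timing are hypothetical.

```python
def gflops(flop_count, seconds):
    """Billions of floating-point operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds / 1e9


def matmul_flops(n):
    """Approximate operation count for an n-by-n dense matrix multiply."""
    return 2 * n ** 3


# Hypothetical measurement: a 1000-by-1000 multiply that took 4 seconds.
rate = gflops(matmul_flops(1000), seconds=4.0)
print(f"sustained rate: {rate} gigaflops")
```

A rate measured this way is a sustained figure for one kernel and problem size; it will generally be lower than a vendor's quoted peak.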

Speedup: This standard measure of performance improvement is the ratio of the completion time on some reference system to the completion time on the system under test. Completion time is often that of a collection of batch jobs or tasks, measured from the time the first one starts until the last one completes.
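The definition above can be sketched in a few lines: completion time is the span from the earliest start to the latest finish, and speedup is the reference span divided by the span on the system under test. The job timestamps and function names below are hypothetical.

```python
def completion_time(jobs):
    """Span from the first job's start to the last job's finish.

    jobs is a list of (start, end) timestamps in seconds.
    """
    first_start = min(start for start, _ in jobs)
    last_end = max(end for _, end in jobs)
    return last_end - first_start


def speedup(reference_jobs, test_jobs):
    """Reference completion time divided by test completion time."""
    return completion_time(reference_jobs) / completion_time(test_jobs)


# Hypothetical (start, end) times for the same three batch jobs.
reference_jobs = [(0, 40), (5, 60), (10, 120)]
test_jobs = [(0, 15), (2, 25), (4, 40)]

print("speedup:", speedup(reference_jobs, test_jobs))
```

A value above 1 means the system under test finished the batch sooner than the reference system; with the sample numbers the batch takes 120 seconds on the reference and 40 seconds on the system under test.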

Throughput: Many applications process a stream of tasks whose arrival rate is more critical than the complexity of each task; the time it takes to process one of them is not as important as how many can be processed over a finite period of time. Be sure to include more tasks in your test stream than you have nodes in your cluster, so that a backlog forms at some point during the test period.
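Throughput can be estimated before a real test by simulating how a queue of tasks drains across a set of nodes. The sketch below makes several simplifying assumptions that are not from the text above: all tasks are queued at time zero, each node runs one task at a time, and the next task always goes to the node that frees up first. The function name and task durations are hypothetical.

```python
import heapq


def simulate_throughput(task_durations, nodes):
    """Tasks completed per second when the queue drains across the nodes."""
    if len(task_durations) <= nodes:
        raise ValueError("use more tasks than nodes so that a backlog forms")
    if any(duration <= 0 for duration in task_durations):
        raise ValueError("task durations must be positive")

    # Min-heap of the times at which each node next becomes free.
    free_at = [0.0] * nodes
    heapq.heapify(free_at)

    finish = 0.0
    for duration in task_durations:
        start = heapq.heappop(free_at)  # earliest available node
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)

    return len(task_durations) / finish


# Hypothetical stream: twelve 2-second tasks on a four-node cluster.
rate = simulate_throughput([2.0] * 12, nodes=4)
print("throughput (tasks per second):", rate)
```

The guard on the task count reflects the advice above: with no more tasks than nodes, every task starts immediately and the measurement says little about sustained throughput.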

Utilization/Loading: Parallel applications may distribute processes across a network of computer nodes. A balanced system is one in which the loading at each node, or processor utilization, is about equal. Fault tolerance aside, one would like to have the workload distributed evenly across all of the nodes in the distributed system's network.
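Balance can be checked by computing each node's utilization and comparing the busiest node to the average. The sketch below uses the ratio of maximum to mean utilization as an imbalance figure; that particular figure, the function names, and the busy times are illustrative choices, not something prescribed above.

```python
from statistics import mean


def utilization(busy_seconds, elapsed_seconds):
    """Fraction of the measurement interval each node spent busy."""
    return [busy / elapsed_seconds for busy in busy_seconds]


def imbalance(utilizations):
    """Busiest node's utilization divided by the mean (1.0 is balanced)."""
    return max(utilizations) / mean(utilizations)


# Hypothetical busy times, in seconds, for four nodes over a 60-second run.
per_node = utilization([45, 60, 45, 30], elapsed_seconds=60)

print("utilization per node:", per_node)
print(f"imbalance (max/mean): {imbalance(per_node):.2f}")
```

A result close to 1.0 suggests the work is spread evenly; a larger value means at least one node is carrying noticeably more than its share.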


Note: If you would like a related item added to any of the above lists, please send the details to the Technical Area Coordinator:

Contact

Dr. Alex Vrenios
DSRLab, LLC
Distributed Systems
Research Laboratory
Phoenix, Arizona, USA
Voice: +1 602 377-7720
Email: alex@DSRLab.com