Understanding Disk Performance Metrics

Article ID: 21202
Find potential performance problems and improve disk performance

In "Using Wait State Accounting to Determine Disk Performance" (November 2006, article ID 20700 at SystemiNetwork.com), I explained how you can query i5/OS performance data files to decide whether access to data stored on disk significantly affects the performance of your application. "Understanding Disk Performance, Part 2: Disk Operation on i5/OS" (April 2007, article ID 20870) described how i5/OS performs disk I/O operations in different disk attachment scenarios.

Here, we examine i5/OS performance metrics that provide a detailed view of disk performance. Analysis of these metrics can shed light on possible performance problems and suggest potential actions to improve disk performance.

Disk Performance Data

A simple query over job wait state accounting file QAPMJOBWT (see "Using Wait State Accounting to Determine Disk Performance" for more information) shows you how much time a job has spent waiting for disk I/O to complete. If this time makes up a significant part of a batch application's runtime or online transaction response time, it would be nice to be able to do something about it. However, job wait state accounting data does not provide details about the contribution of specific disk units, the reason for poor disk response time, the components of disk response time, and so forth.

For this kind of data, we will be looking at another i5/OS performance database file — QAPMDISK. The IBM Information Center documents the data found in the QAPMDISK performance database file (publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=/rzahx/rzahxqapmdisk.htm).

QAPMDISK contains raw performance counters that are applicable to disk performance. This file has a record for each disk resource for every performance collection interval. A disk resource is one unique path between the operating system and disk unit. Internal disks have a single path to each unit, whereas external disk units may have more than one path.

Performance data collected by the collection services (including the QAPMDISK file) represents interval summaries, so we can talk only about averages, not the execution details of individual disk operations. That would be a task for performance trace tools, which is beyond the scope of this article.

Some tools massage the raw data from the QAPMDISK file and show higher-level metrics to a user. For example, IBM Performance Tools includes disk performance data in many of the reports it produces. However, for this discussion, I concentrate on how you can calculate higher-level disk performance metrics directly from the raw data in the QAPMDISK file. You can compare the discussion in this article with the reports produced by the Performance Tools licensed program. IBM Performance Tools is documented in the IBM Information Center (publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=/rzahx/rzahxperftoolsdesc.htm).

Metrics of Disk Performance

In the following sections, I describe various metrics of disk performance, including how to calculate them based on the performance data from the QAPMDISK file. Most of the discussion in these sections is conceptual. To keep things simple, the formulas I use do not take into account some finer details of data organization in the QAPMDISK file. The SQL query (available for download at SystemiNetwork.com/code) returns all the metrics discussed in this article and considers all of the complexity. The discussion in this article is not meant to be exhaustive — you don't expect to find all the secrets of the trade in one place, do you?

A few general considerations apply to most metrics. The metrics discussed here are all averages — the average for a specific interval for a single disk unit, for all disk units in the interval, for all disk units in a disk pool over the time range of several hours, and so forth.

Statistical averages are important tools, but remember that averages can be deceiving. As the aphorism goes, "There are lies, damned lies, and statistics." Well, statistics do not "lie," but given sufficient data, "good" results can easily mask "bad" results. For example, on a big system with several hundred disk units, a couple of disk units with bad response time will not noticeably change overall average response time, which makes it easy to overlook a problem.

i5/OS stripes system objects across all available disk units. This is a good practice that generally improves disk performance (many SAN systems use this now), but there is a price to pay. With striping, it is hard to control the colocation of different system objects on disk. If a disk unit performs poorly, it has a high probability of affecting many applications. Even a single poorly performing disk unit can slow the entire system, but systemwide averages won't show you that.

Averages are fine when used carefully, but keep an eye on individual disk units and individual time intervals, especially if you have reason to suspect a problem.
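To make the masking effect concrete, here is a small Python sketch with made-up numbers: 300 disk units, two of which respond in 100 milliseconds while the rest respond in 4. The data and the 3x outlier threshold are illustrative assumptions, and the average is a simple unweighted mean of per-unit averages.

```python
# Hypothetical per-unit average response times (ms) for one interval.
response_ms = [4.0] * 298 + [100.0, 100.0]

system_avg = sum(response_ms) / len(response_ms)
worst = max(response_ms)

# Flag units far above the system average; the factor 3 is an arbitrary choice.
outliers = [i for i, r in enumerate(response_ms) if r > 3 * system_avg]

print(f"average {system_avg:.2f} ms, worst {worst:.0f} ms, outliers {outliers}")
```

The systemwide average stays under 5 milliseconds even though two units are 25 times slower than the rest, which is why a per-unit check is worth the extra step.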

Disk Response Time

Disk response time is the starting point for any disk performance analysis. It shows how long it takes to complete disk I/O operations from start to finish. Until V5R4, individual disk operations were not timed by the system, and response time was calculated based on Little's Law, which states

Average number of requests on the server = arrival rate * average response time

"Server" is used here in the queuing-theory sense: it is a disk subsystem with an IOP, a storage adapter, and disk devices attached to it. The "number of requests" includes disk operations in progress, plus operations that are waiting in a queue. All ingredients for this formula are available in the QAPMDISK file. Average disk response time in milliseconds is

(DSQUEL * INTSEC * 1000) / (DSSMPL * (DSWRTS + DSRDS))

Notice that the number of requests is obtained by sampling: DSSMPL is the number of samples made during the interval. To be accurate, the sampling technique requires a sufficient number of samples and/or a sufficient number of events sampled. The sampling rate is not particularly frequent (twice per second), the practical implication of which is that the response time calculated according to this formula is valid only if there are a significant number of operations issued to the disk unit. For a relatively idle unit, the formula can produce wildly erratic results. This is one reason why disk capacity planning should never be based on runs with insignificant disk load.
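The sampled formula can be written as a short Python sketch that keeps the two Little's Law ingredients visible. The function name is hypothetical, the parameters simply mirror the QAPMDISK field names, and the input values are invented for illustration.

```python
def sampled_response_ms(dsquel, dssmpl, intsec, dsrds, dswrts):
    """Average response time (ms) from sampled queue length, via Little's Law."""
    ops = dsrds + dswrts
    if dssmpl == 0 or ops == 0:
        return None  # no samples or no operations: nothing to average
    avg_requests = dsquel / dssmpl   # average number of requests on the server
    arrival_rate = ops / intsec      # operations per second
    return 1000 * avg_requests / arrival_rate

# Hypothetical 300-second interval sampled twice per second (600 samples).
resp = sampled_response_ms(dsquel=300, dssmpl=600, intsec=300,
                           dsrds=21000, dswrts=9000)
print(resp)
```

Returning None for an idle unit is a design choice of this sketch; it avoids a division by zero and makes it obvious that no meaningful average exists for that interval.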

Starting with V5R4, the operating system directly times disk I/O operations. This new data is now part of the QAPMDISK file. With these new counters, we can calculate disk response time much more accurately using this formula:

(DSSRVT + DSWT)/(DSWRTS + DSRDS)

This function was added in a set of PTFs in V5R4. The PTFs are included in recent cumulative packages, so if you are current on PTFs, you probably already have this new function installed.

The new fields more accurately measure disk response time, but there is a difference. The "old" data was measured by the IOP code. The "new" data, however, is measured before operations are sent to the IOP for processing, so it includes the delays between the IOP and the operating system. In general, this leads to more useful data, because the "new" response time is closer to the actual delays that applications experience. But there's a catch.

Processing of disk operations on the IOP is not affected by what's going on inside the partition or by the partitioning process itself. But processing of disk operations by the system is. The "new" response time now includes delays caused by switching partitions on and off the physical processors and by CPU contention inside the partition. The latter type is not a problem for most systems, except those with extreme CPU contention. But the hypervisor delays can easily add a few milliseconds to a response time, especially for micro-partitions (partitions with very small processing capacity).

This consideration also applies to IOP-less disk units. Such units do not have a real IOP, so even if response time is calculated by the old formula, the result still represents the system view of disk operation delays.

The popular question is, "What is the recommended guideline for good disk response time?" There is no easy answer to that. Too many factors affect disk response time: disk technology, cache-friendliness of a disk workload, mixture of reads and writes, and tolerance of the particular application to the disk delays. Generally, with the current disk technology, I would say that an average response time of less than 5 milliseconds is good, between 5 and 10 milliseconds is normal, above 10 milliseconds requires analysis, and above 100 milliseconds is poor.

Watch disk response time for individual disk units in specific collection intervals. Poor response times that are significantly different from averages can indicate a local problem with a storage adapter (e.g., failed cache battery) or a hot spot, that is, a disk location that, for some reason, gets more than its fair share of accesses. One reason for a hot spot could be system configuration, such as an unbalanced distribution of data across disk units in the auxiliary storage pool (ASP). It makes sense to try disk-balancing tools, which can do a good job of removing a hot spot. For more information, see the Start ASP Balance (STRASPBAL) command documentation at publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/cl/straspba.htm.

Hot spots can also be caused by application logic that produces an unusually high number of accesses to some common piece of data. This kind of hot spot is much harder to investigate and fix. There are tools that trace individual disk operations to determine which job issued an operation and which system objects were accessed. Two such tools are Performance Explorer, or PEX (publib.boulder.ibm.com/infocenter/iseries/v5r4/index.jsp?topic=/rzahx/rzahxpexparent.htm), and Disk Watcher (www-1.ibm.com/support/docview.wss?uid=nas36dd24a9d824a23da862572b2000c2999). Disk Watcher is a new function that was recently added in a set of PTFs.

Disk Response Time Buckets

Good response times masking poor response times in averages is such a pervasive problem that disk response time buckets were added to V5R4. As the system measures response times for disk operations, it classifies them into six ranges (buckets): operations taking less than 1 millisecond, 1-16 milliseconds, 16-64 milliseconds, 64-256 milliseconds, 256-1,024 milliseconds, and more than 1,024 milliseconds.

This data is reported in new fields in the QAPMDISK file. Each bucket is represented by three fields: the number of operations that fall within the bucket, the total response time of these operations, and the total service time of these operations (look for fields DSBKCTxx, DSBKRTxx and DSBKSTxx). Response and service times reported in the buckets are measured at the system level and are consistent with the DSSRVT and DSWT fields.
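As a rough illustration of how the bucket counts can expose a slow tail that an average hides, the following Python sketch turns six operation counts into percentage shares. It assumes the counts are passed in order from the fastest bucket to the slowest; the labels, function name, and sample numbers are illustrative, and the mapping of individual DSBKCTxx fields to list positions should be checked against the field documentation.

```python
BUCKETS = ["<1 ms", "1-16 ms", "16-64 ms", "64-256 ms", "256-1,024 ms", ">1,024 ms"]

def bucket_shares(counts):
    """Percent of operations that fall into each response time bucket."""
    total = sum(counts)
    if total == 0:
        return None  # no operations recorded in this interval
    return {label: 100 * n / total for label, n in zip(BUCKETS, counts)}

# Hypothetical operation counts for one disk unit in one interval.
shares = bucket_shares([9000, 900, 80, 15, 4, 1])
for label, pct in shares.items():
    print(f"{label:>14}: {pct:6.2f}%")
```

In this made-up interval, 99 percent of operations finish within 16 milliseconds, yet a handful take longer than a quarter of a second; those are the operations worth investigating.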

Disk Percent Busy

Disk percent busy is probably the most used measure of the health of a disk subsystem. This metric shows a percentage of the time that the disk was busy processing disk operations. This metric is also often incorrectly called disk utilization. Disk percent busy and disk utilization were identical when disk units could process only one operation at a time. Now, the execution of two or more disk operations can overlap. Technically, if a disk is 100 percent busy, it might still be able to perform additional disk operations, so we can't really say that disk utilization is 100 percent in this case.

However, this metric is a convenient way to tell at a glance whether there is a potential disk performance problem. If the disk has too many operations sent to it by the system, or if disk operations are taking too long to complete, increased disk "busy-ness" will be reflected in the percent busy metric.

The disk percent busy is calculated as

100 * (DSSMPL - DSNBSY) / DSSMPL

This data is also obtained through sampling (with all the implications already discussed). Disk units with light loads might demonstrate wide variations in "busy-ness" calculated by this formula. This is the same metric as the one on the Work with Disk Status (WRKDSKSTS) screen.
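A minimal Python sketch of the percent busy calculation follows. The function name and the sample values are hypothetical; the parameters mirror the QAPMDISK fields used in the formula.

```python
def percent_busy(dssmpl, dsnbsy):
    """Percent of samples in which the disk unit was found busy."""
    if dssmpl == 0:
        return None  # no samples taken in this interval
    return 100 * (dssmpl - dsnbsy) / dssmpl

# Hypothetical interval: 600 samples, 450 of which found the unit not busy.
busy = percent_busy(dssmpl=600, dsnbsy=450)
print(busy)
```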

Which value of this metric is considered normal? All performance guidelines are to a certain extent arbitrary. Years ago, the rule was to keep disk percent busy below 40 percent. These days, however, benchmarks show that with the faster disk hardware and new storage adapters that can process more than one operation in parallel, you can achieve acceptable response times at up to 70 percent. IBM Workload Estimator uses the significantly more conservative guideline of 25 percent. This number is heavily biased toward online transaction processing environments, which are more sensitive to variations in disk response time. It also makes room for natural workload growth. Batch-oriented workloads might be able to tolerate much higher disk utilizations than that.

Service Time and Wait Time

Service time and wait time are components of the disk response time. When response time goes up, it is interesting to know whether time is spent actually performing the disk operation or waiting in a queue.

The balance between service and wait times is closely related to disk utilization: as utilization increases, the wait component of the response time grows. Quite often, results of queuing theory are applied to disk performance (e.g., a queuing multiplier that relates service time and response time). A word of caution, though: The most commonly used model (known as M/M/1 in queuing theory) is based on certain assumptions about the behavior of the disk unit. Years ago, you could reasonably make these assumptions about disk performance. However, with technology changes this is no longer the case. It is theoretically possible to build a more realistic mathematical model for the behavior of the disk unit, but such a model would be too complex for practical use.

The usual way to estimate average service time is to divide disk busy time by the number of operations executed, as the following formula shows (the result is in milliseconds):

((DSSMPL - DSNBSY) * INTSEC * 1000) / (DSSMPL * (DSWRTS + DSRDS))

The wait time is the difference between disk response time and disk service time.

There are two problems with this formula. The first, as I said previously, is that disk busy time is determined through relatively infrequent sampling, which yields unreliable results at low disk loads. The second problem is that this formula does not account for the overlap between disk operations. So the formula will underestimate service time and therefore overestimate wait time. You should keep that in mind when looking at the results that various performance tools show.
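The sampled estimate, together with wait time as the remainder, can be sketched in Python as shown here. The function name and all input values are hypothetical, and the 5-millisecond response time is simply assumed as an input rather than derived.

```python
def estimated_service_ms(dssmpl, dsnbsy, intsec, dsrds, dswrts):
    """Estimate average service time (ms) as disk busy time per operation."""
    ops = dsrds + dswrts
    if dssmpl == 0 or ops == 0:
        return None
    busy_seconds = (dssmpl - dsnbsy) / dssmpl * intsec  # sampled busy time
    return 1000 * busy_seconds / ops

# Hypothetical interval: 300 seconds, 600 samples, 30,000 operations.
service = estimated_service_ms(dssmpl=600, dsnbsy=450, intsec=300,
                               dsrds=21000, dswrts=9000)
response = 5.0              # assumed average response time (ms) for the same interval
wait = response - service   # wait time is whatever service time does not explain
print(service, wait)
```

Because busy time is divided evenly among operations, any overlap between operations shrinks the service estimate and inflates the wait remainder, as described above.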

To overcome these deficiencies, i5/OS (in V5R4) directly measures the components of each disk operation. Two new fields in the QAPMDISK file report the total service time and the total wait time of all disk operations. The formula for average service time is now

DSSRVT / (DSWRTS + DSRDS)

and for average wait time is

DSWT / (DSWRTS + DSRDS)

However, these new counters are measured at the system level, not an IOP level. Service time measured on a system level will now include not only actual operation execution time, but also time spent waiting in a queue at the storage adapter level. Accordingly, wait time measured on a system level will represent waiting in a system queue only — it will not show how much time was spent waiting in a storage adapter queue. This part, which used to be included in the wait time, will now be included in the service time. "New" formulas will therefore overestimate service time. The real service time experienced by the storage subsystem is somewhere in between.
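With the directly measured counters, the same breakdown becomes a pair of divisions. The Python sketch below uses a hypothetical function name and invented values; the results come out in whatever time unit DSSRVT and DSWT are recorded in, which should be confirmed in the QAPMDISK field documentation.

```python
def measured_times(dssrvt, dswt, dsrds, dswrts):
    """Average service, wait, and response time per operation (counter units)."""
    ops = dsrds + dswrts
    if ops == 0:
        return None
    service = dssrvt / ops
    wait = dswt / ops
    return service, wait, service + wait  # response time is the sum of both parts

# Hypothetical totals for one disk unit in one interval.
service, wait, response = measured_times(dssrvt=90000, dswt=30000,
                                         dsrds=21000, dswrts=9000)
print(service, wait, response)
```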

The wait time component of disk response time will naturally grow as disk utilization grows, according to the results of queuing theory. However, it can also be an indication of the "bursty" nature of a workload. A burst of disk operations arriving very close together will cause excessive operation queuing, even if disk utilization is low overall. To improve disk performance for bursty workloads, you can try to distribute data over as many disk units as possible. This is easier to achieve for external disk storage, where you can reconfigure the same physical devices to report to the system as a larger number of smaller logical units.

Read Cache Hit Ratio

Read cache hit ratio shows, usually as a percentage, how many read operations were satisfied from cache. Going to the data saved in cache is much faster than going to the data on a disk surface, so obviously we want this number to be as high as possible.

Two sets of cache statistics are reported in QAPMDISK: controller cache counters and device cache counters. For internal disk units, both caches are reported. However, device cache, which is located on a physical disk drive itself, is relatively small and does not have much effect on performance. This is why in practical cases we usually consider only controller cache.

For internal disk units, read cache hit ratio is

100 * DSCCRH / DSRDS

By the way, if the data shows a reasonable number of read operations (DSRDS) but the cache hit counter (DSCCRH) is consistently zero, that tells you at a glance that the storage adapter most likely has no cache.

For external disk units, the storage adapter (architecturally a controller) is just a communications adapter that has no cache. The cache that resides in a storage subsystem is reported as a device cache. Therefore, for the external disk units, read cache hit ratio is

100 * DSDCRH / DSDROP
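Both variants of the ratio have the same shape and differ only in which counters are supplied, as this Python sketch shows. The function name and the sample values are hypothetical.

```python
def read_cache_hit_pct(hits, reads):
    """Percent of read operations that were satisfied from cache."""
    if reads == 0:
        return None  # no reads, so the ratio is undefined
    return 100 * hits / reads

# Internal unit: controller cache hits (DSCCRH) over system reads (DSRDS).
internal = read_cache_hit_pct(hits=4200, reads=21000)
# External unit: device cache hits (DSDCRH) over device reads (DSDROP).
external = read_cache_hit_pct(hits=15000, reads=20000)
print(internal, external)
```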

Obviously, no cache or a tiny cache is not good for performance. But with a reasonable cache size and a low read cache hit ratio, it is hard to predict whether adding more cache will improve performance. This requires complex analysis of the workload and the data access pattern of an application. It might be easier to just add more cache to see whether it improves performance.

In general, the read cache hit ratio tells more about the characteristics of a workload than about the functioning of disk units. It is not unusual to see a very small read cache hit ratio, even with a very large cache if the application processes a lot of data and rarely accesses the same data more than once. For a detailed discussion of cache behavior, see "Understanding Disk Performance, Part 2: Disk Operation on i5/OS."

Write Cache Overrun Ratio

Write cache overrun ratio shows, usually as a percentage, how many write operations had to wait for the destage operations to free up buffer space in the storage adapter memory. Writing data to the buffer space is many times faster than accessing the disk surface, so we want this number to be as small as possible, ideally zero.

For internal disk units, the write cache overrun ratio is

100 - DSCCFW * 100 / DSWRTS

DSCCFW shows how many operations were "fast writes" — in other words, how many operations did not have to wait for a buffer space. If the DSCCFW counter is consistently zero, there are two possibilities. One is that this particular storage adapter has no cache. Another common situation is when the adapter does have cache, but its write cache is inhibited because the adapter cannot guarantee that its content is preserved in case of a system failure. Usually this happens when the cache battery is dead.

For the external disk units, write cache overrun ratio is

100 - DSDCFW * 100 / DSDWOP
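A short Python sketch of the overrun calculation, usable for either attachment type, might look like this. The function name and the numbers are hypothetical.

```python
def write_cache_overrun_pct(fast_writes, writes):
    """Percent of write operations that had to wait for cache buffer space."""
    if writes == 0:
        return None  # no writes, so the ratio is undefined
    return 100 - 100 * fast_writes / writes

# Internal unit: DSCCFW over DSWRTS. External unit: DSDCFW over DSDWOP.
overrun = write_cache_overrun_pct(fast_writes=9800, writes=10000)
no_cache = write_cache_overrun_pct(fast_writes=0, writes=10000)
print(overrun, no_cache)
```

A result of 100 percent over many intervals corresponds to the "fast write counter is consistently zero" case: either there is no write cache or it is inhibited.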

Write cache is crucial to the performance of write-intensive applications, especially when database journaling is involved. In fact, with RAID configurations, write cache is a must. It is hardly possible to have any decent disk performance with RAID and without write cache.

Total Disk Device Operation Rate

Eventually, you must access the disk surface to read or write actual data. Every disk device has limits to its throughput. When this limit is approached, the deterioration of disk performance can be sudden and dramatic. This is especially true in RAID 5 (and even more so in RAID 6) configurations, where disk devices have to carry the additional burden of parity overhead operations. This is why it is important to keep an eye on the total disk operation rate, the formula for which is

(DSDWOP + DSDROP) / INTSEC

This metric is applicable only to internal disk units. External disk units are logical units that are mapped to physical disk devices by an additional layer of software inside the external storage server. To calculate the actual load on the physical disk devices in an external storage server, you must use the built-in tools provided with the storage server.

What is the acceptable disk operation rate for a physical disk? It depends on several factors — disk technology, of course, but also the pattern of access to the disk surface. In particular, long seek distances will make the disk unit work that much harder. (Hint: Seek statistics are also available in the QAPMDISK file. Look for fields DSSKn.)

In general, with the current disk technology, the maximum operation rate is in the range of 120-150 accesses per second. This is a maximum; you do not want to be anywhere near that for an extended period.
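The device operation rate and a simple comparison against the lower end of the 120-150 range can be sketched in Python as follows. The function name, the sample counters, and the choice of 120 as the warning level are assumptions made for illustration.

```python
def device_ops_per_sec(dsdwop, dsdrop, intsec):
    """Physical device operations per second for one internal disk unit."""
    if intsec == 0:
        return None
    return (dsdwop + dsdrop) / intsec

# Hypothetical 300-second interval.
rate = device_ops_per_sec(dsdwop=18000, dsdrop=12000, intsec=300)
near_limit = rate >= 120  # lower end of the 120-150 range, used as a warning level
print(rate, near_limit)
```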

In RAID 5 or RAID 6 configurations, the disk device has to perform parity operations for other disk devices in the same RAID set. This is usually not noticeable because these operations are balanced out as different disk units are accessed. However, if disk units from the same RAID set are allocated to different ASPs, you might be surprised to see that some disk units in the otherwise idle ASP report activity — this is work they have to do for disk units from another ASP.

Write Cache Efficiency

As data is sitting in the write cache waiting to be destaged to the disk surface, cache management logic tries to apply all kinds of optimizations to reduce the actual number of disk accesses. For example, a series of writes to the adjacent sectors can be coalesced into a single multisector operation. Multiple writes to the same sector can result in multiple updates to the data in cache before data is ever written out.

The purpose of the write cache efficiency ratio metric is to show how successful the cache software was at this. For example, if write cache efficiency is 15 percent, it means that the cache management software reduced the number of write operations by 15 percent. This metric is defined only for internal disk units and is calculated as

100 - ((DSDWOP/DIVISOR) * 100) / DSWRTS

DIVISOR depends on the disk configuration: it is 1 for non-RAID configurations, 2 for RAID 5, and 3 for RAID 6. It reflects the additional disk writes that result from the RAID overhead.

In general, this metric is more important to the storage adapter cache management software designers, but it can also give an insight into workload behavior. You have to be very careful with this metric. If the RAID configuration is in the exposed mode because of a single disk failure, the number of physical disk writes will no longer correspond to the assumptions, and this metric will no longer be valid.
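Here is a minimal Python sketch of the efficiency calculation with the divisor chosen by configuration. The dictionary keys, function name, and sample counters are hypothetical; the divisor values are the ones given above.

```python
RAID_DIVISOR = {"none": 1, "raid5": 2, "raid6": 3}  # extra device writes per system write

def write_cache_efficiency_pct(dsdwop, dswrts, raid):
    """Percent of system writes that the cache kept from reaching the disk."""
    if dswrts == 0:
        return None
    device_writes = dsdwop / RAID_DIVISOR[raid]  # remove RAID parity overhead
    return 100 - 100 * device_writes / dswrts

# Hypothetical RAID 5 unit: 10,000 system writes became 17,000 device writes.
eff = write_cache_efficiency_pct(dsdwop=17000, dswrts=10000, raid="raid5")
print(eff)
```

The sketch assumes a healthy RAID set; as noted above, the fixed divisors stop being valid when the set runs in exposed mode.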

Operation Rate for Read/Write Operations

Watch for the rate of system disk operations. Some of the disk performance metrics are obtained through relatively slow sampling. When the operation rate is low, some calculated metrics (e.g., disk busy percentage) may become unstable. Note that here we are talking about disk operations sent by the system, which is different from disk device operations discussed before.

Read and write operations have different behavior in terms of performance, so in addition to the total operation rate, it is important to look at read and write operations separately. A high rate of write operations is often indicative of the database journaling environment. Healthy i5/OS database performance — especially in an OLTP environment — is very sensitive to the write operations' response time staying in the sub-millisecond range. This is usually not a problem with an internal disk, provided the write cache overrun ratio is low enough. However, this might be harder to achieve with external storage, as higher communication latency is typical for this kind of attachment. Read operations are inherently slower because they have to occasionally go to the disk surface — unless the same data is being read over and over again and ends up staying in the cache all the time.

Averaging numbers across both types of operations lumps together data with very different behavior. This is fine for a high-level health check, but for detailed analysis, it helps to look at reads and writes separately.

The operation rate per second is

DSRDS / INTSEC for read operations

DSWRTS / INTSEC for write operations
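In Python, the read and write rates and their total could be computed as in this sketch. The function name and sample values are hypothetical.

```python
def op_rates(dsrds, dswrts, intsec):
    """Read, write, and total system disk operations per second."""
    if intsec == 0:
        return None
    reads = dsrds / intsec
    writes = dswrts / intsec
    return reads, writes, reads + writes

# Hypothetical 300-second interval.
reads, writes, total = op_rates(dsrds=21000, dswrts=9000, intsec=300)
print(reads, writes, total)
```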

Average Operation Size

It is more efficient to issue fewer disk operations that transfer more data than a large number of very small operations. This is why it is important to keep an eye on average operation size (i.e., how much data is transferred by an average operation). Operation size in kilobytes is

( DSBLKR * 520 ) / DSRDS / 1024 for reads

( DSBLKW * 520 ) / DSWRTS / 1024 for writes
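The operation size formulas translate into Python as shown in this sketch. The function name and the counter values are hypothetical; the 520-byte block size comes from the formulas above.

```python
BLOCK_BYTES = 520  # bytes per block, as used in the formulas above

def avg_op_size_kb(blocks, ops):
    """Average amount of data (KB) transferred per disk operation."""
    if ops == 0:
        return None
    return blocks * BLOCK_BYTES / ops / 1024

# Hypothetical counters: DSBLKR/DSRDS for reads, DSBLKW/DSWRTS for writes.
read_kb = avg_op_size_kb(blocks=8000, ops=1000)
write_kb = avg_op_size_kb(blocks=16000, ops=4000)
print(read_kb, write_kb)
```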

These metrics are important for understanding the application behavior. There are ways to improve these ratios by making sure that applications use blocking when appropriate, by changing logical page sizes of access paths used by the application, and so forth.

When Expert Cache is turned on for one or more main storage pools, you may notice that the average size of disk operations increases. This is one technique that the Expert Cache uses to improve performance.

Monitor Performance Regularly

This article is just a small excursion into the world of disk performance. More often than not, disk performance is a key to the overall performance of a computer system. Regular monitoring of vital disk performance metrics and taking timely measures based on their analysis will help you successfully manage the performance of your System i server.

Alexei Pytel is a software engineer with more than 15 years of experience on the AS/400/iSeries/System i platform. Formerly an employee of IBM Russia, Alexei now works on performance analysis and performance tools development in Rochester, Minnesota.
