Identifying a Disk Bottleneck Using filemon
This post walks through the steps required to identify an I/O problem in the storage area network and/or disk arrays on AIX.
Note: Do not run filemon on AIX 6.1 Technology Level 6 Service Pack 1 if WebSphere MQ is running; WebSphere MQ will terminate abnormally on that AIX level.
Running filemon: As a rule of thumb, a write to a cached, fiber-attached disk array should average less than 2.5 ms and a read from a cached, fiber-attached disk array should average less than 15 ms. To confirm the responsiveness of the storage area network and disk array, filemon can be used. The following example collects statistics over a 90-second interval.
# filemon -PT 268435184 -O pv,detailed -o /tmp/filemon.rpt;sleep 90;trcstop
Run trcstop command to signal end of trace.
Tue Sep 15 13:42:12 2015
System: AIX 6.1 Node: hostname Machine: 0000868CF300
[filemon: Reporting started]
#
[filemon: Reporting completed]
[filemon: 90.027 secs in measured interval]
Then, review the generated report (/tmp/filemon.rpt).
# more /tmp/filemon.rpt
. . .
------------------------------------------------------------------------
Detailed Physical Volume Stats   (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk11  description: XP MPIO Disk P9500   (Fibre)
reads:                  437296  (0 errs)
  read sizes (blks):    avg     8.0 min       8 max       8 sdev     0.0
  read times (msec):    avg  11.111 min   0.122 max  75.429 sdev   0.347
  read sequences:       1
  read seq. lengths:    avg 3498368.0 min 3498368 max 3498368 sdev     0.0
seeks:                  1       (0.0%)
  seek dist (blks):     init 3067240
  seek dist (%tot blks):init  4.87525
time to next req(msec): avg   0.206 min   0.018 max 461.074 sdev   1.736
throughput:             19429.5 KB/sec
utilization:            0.77

VOLUME: /dev/hdisk12  description: XP MPIO Disk P9500   (Fibre)
writes:                 434036  (0 errs)
  write sizes (blks):   avg     8.1 min       8 max      56 sdev     1.4
  write times (msec):   avg   2.222 min   0.159 max  79.639 sdev   0.915
  write sequences:      1
  write seq. lengths:   avg 3498344.0 min 3498344 max 3498344 sdev     0.0
seeks:                  1       (0.0%)
  seek dist (blks):     init 3067216
  seek dist (%tot blks):init  4.87521
time to next req(msec): avg   0.206 min   0.005 max 536.330 sdev   1.875
throughput:             19429.3 KB/sec
utilization:            0.72
. . .
In the above report, hdisk11 was the busiest disk on the system during the 90-second sample. Reads from hdisk11 averaged 11.111 ms. Since this is less than 15 ms, the storage area network and disk array were performing within the guideline for reads.
hdisk12 was the second busiest disk on the system during the sample. Writes to hdisk12 averaged 2.222 ms. Since this is less than 2.5 ms, the storage area network and disk array were performing within the guideline for writes.
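If the report lists many disks, a small script can flag the ones that fall outside these guidelines. The following is a minimal ksh/awk sketch, assuming the report layout shown above and the 2.5 ms write / 15 ms read rules of thumb; adjust the field positions if your version of filemon formats the report differently.

#!/bin/ksh
# Sketch: summarize the average read/write times per physical volume from the
# filemon report and flag values above the rule-of-thumb guidelines.
RPT=/tmp/filemon.rpt

awk '
  /^VOLUME:/              { vol = $2 }   # e.g. /dev/hdisk11
  /read times \(msec\):/  { printf "%-16s avg read  %8.3f ms  %s\n", vol, $5, ($5 > 15  ? "<-- above 15 ms read guideline"  : "ok") }
  /write times \(msec\):/ { printf "%-16s avg write %8.3f ms  %s\n", vol, $5, ($5 > 2.5 ? "<-- above 2.5 ms write guideline" : "ok") }
' "$RPT"

For the sample report above, this would print hdisk11 at 11.111 ms average reads and hdisk12 at 2.222 ms average writes, both marked "ok".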
Other methods to measure similar information:
You can use the topas command with the -D option to get an overview of the busiest disks on the system:
# topas -D
In the output, the ART and AWT columns provide similar information: ART is the average time to receive a response from the hosting server for a read request, and AWT is the average time to receive a response from the hosting server for a write request.
You can also use the iostat command, using the -D (for drive utilization) and -l (for long listing mode) options:
# iostat -Dl 60
This reports statistics for your disks at 60-second intervals. The "avg serv" column under the read and write sections shows the average service time for reads and writes for each disk.
An occasional peak value recorded on a system doesn't necessarily mean there is a disk bottleneck. Longer periods of monitoring are required to determine whether a certain disk is indeed a bottleneck for your system.
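As a rough sketch of that kind of longer-term collection (the log path and sample count below are illustrative choices, not required values), iostat can simply be left running and the samples reviewed afterwards:

#!/bin/ksh
# Collect 60-second iostat -Dl samples for 24 hours (1440 reports) so that a
# sustained pattern, rather than a single peak, can be identified.
LOG=/tmp/iostat_disk.$(date +%Y%m%d).log

date >> "$LOG"
iostat -Dl 60 1440 >> "$LOG"

Running this under nohup (or starting it from cron) keeps the collection going after you log out, and the resulting log can then be compared against the filemon report to see whether high service times persist.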