How to Analyze or Read OS Watcher Output in three easy steps -- With Example ?
Article Number: 272 | Rating: Unrated | Last Updated: Thu, Jun 21, 2018 3:22 PM
How to Analyze or Read OS Watcher Output in three easy steps -- With Example ?
I assume, you have got Oracle OS Watcher (OSWatcher) logs and Database Administrator is facing slow database performance or System hang kind of situations. So, Next step is to Analyze or Reading Oracle OS Watcher (OSWatcher) logs collected.
As we now, OS Watcher logs are collection of one main folder and it's sub folders with OS Command name suffix. OS Watcher Analysis is basically reading output taken of few Unix command at a fixed time interval. Here, we will only Read OS Watcher few commands output which are most useful for Database Administrator.
1. Read or Analyze oswiostat Output: These files will contain output from the 'iostat' command that is obtained and archive by OS Watcher on hourly basis (default). These files will only exist if 'iostat' is installed on the OS and if the osw user has privileges to run the utility.
The iostat command is used for monitoring system input/output device loading by observing the time the physical disks and their average data transfer rate. Below are two example of OSWiostat.
Time: 00:01:09 avg-cpu: %user %nice %system %iowait %steal %idle 5.22 0.01 1.77 0.10 0.00 92.90 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 10.24 101.92 10.28 29.60 569.53 1057.09 40.79 0.21 5.30 0.53 2.11 sda1 0.00 0.00 0.00 0.00 0.17 0.00 138.66 0.00 12.37 3.45 0.00 sda2 10.24 101.92 10.28 29.57 569.36 1057.09 40.81 0.21 5.30 0.53 2.11 dm-0 0.00 0.00 1.72 77.98 75.95 623.85 8.78 1.20 14.99 0.08 0.67 dm-1 0.00 0.00 0.46 2.37 3.80 18.94 8.04 0.01 2.71 0.29 0.08 dm-2 0.00 0.00 7.44 50.74 278.30 410.79 11.84 0.72 12.30 0.23 1.33 dm-3 0.00 0.00 0.00 0.00 0.15 0.00 509.61 0.00 46.78 7.53 0.00 dm-4 0.00 0.00 0.49 0.00 117.41 0.02 238.95 0.00 1.94 1.05 0.05 dm-5 0.00 0.00 0.05 0.00 10.84 0.00 230.78 0.00 2.58 1.34 0.01 dm-6 0.00 0.00 0.00 0.00 0.10 0.00 479.96 0.00 54.94 8.70 0.00Important Field's Description
device: Device name.
r/s: Shows the number of reads/second.
w/s: Shows the number of writes/second.
rsec/s: Shows the number of kilobytes read/second.
wsec/s: Shows the number of kilobytes written/second.
avgrq-sz: Average number of transactions waiting for service.
avgqu-sz : Average number of transactions actively being serviced.
%util: Percent of time the disk is busy
Time: 07:32:57 avg-cpu: %user %nice %system %iowait %steal %idle 8.16 0.00 70.29 21.55 0.00 0.00 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 163.40 7.73 2074.74 53.95 73642.61 493.47 34.83 107.13 50.07 0.47 100.07 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda2 163.40 7.73 2074.74 53.95 73642.61 493.47 34.83 107.13 50.07 0.47 100.07 dm-0 0.00 0.00 201.03 0.86 8412.37 6.87 41.70 58.68 281.80 4.96 100.07 dm-1 0.00 0.00 180.76 26.46 1446.05 211.68 8.00 25.24 119.01 4.83 100.07 dm-2 0.00 0.00 1868.90 34.54 63913.40 276.29 33.72 332.23 172.22 0.53 100.09 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00In this stat Remote DBA has to look at read and write happening into the system. Sudden change in read write pattern of system means some major change has done into the system.
Observation:- These are two iostats output shown above from a system having good performance at 12 o'clock (Time: 00:01:09) and dull at 7 O'clock (Time: 07:32:57). We can easily judge, that few disks sda, sda2,dm-0, dm-1 and dm-2 is 100% busy (%util) and has very high read per sec (r/s) in above screen shot. Which clearly show that, system is performing high no of read operations.
This is also evident from "avg-cpu" stats, at good time system is 92% idle and at load time it 0% idle.
To overcome this, DBA has to migrate data from disks sda, sda2, dm-0, dm-1 and dm-2 to dm-3, dm-4, dm-5 and dm-6 which has low read write and this will balance system read writes.
2. Read or Analyze mpstat Output: These files will contain output from the 'mpstat' command that is obtained and archive by OS Watcher on hourly basis (default). These files will only exist if 'mpstat' is installed on the OS and if the osw user has privileges to run the utility.
I prefer to start all OS Watcher related investigations from this file. These stats has a summary of the system for database administrator. The below output is from a Linux system.
zzz ***Tue Apr 23 06:13:44 EDT 2013 Sample interval: 5 seconds Linux 2.6.32-400.21.1.el5uek (remote.database.com) 04/23/13 06:13:44 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 06:13:49 all 6.26 0.00 1.32 0.00 0.01 0.02 0.00 92.39 36448.70 06:13:54 all 8.17 0.00 1.92 0.01 0.00 0.05 0.00 89.86 38918.09 06:13:59 all 8.11 0.00 1.18 0.01 0.00 0.05 0.00 90.65 40989.86 06:14:04 all 8.04 0.00 1.25 0.06 0.00 0.05 0.00 90.61 40242.86 06:14:09 all 6.62 0.00 1.27 0.02 0.00 0.05 0.00 92.04 37460.32 06:14:14 all 7.56 0.00 1.47 0.02 0.00 0.02 0.00 90.94 37288.67 06:14:19 all 7.19 0.00 1.21 0.14 0.00 0.02 0.00 91.44 36947.91 06:14:24 all 6.50 0.00 1.02 0.01 0.00 0.02 0.00 92.45 35792.86 06:14:29 all 7.28 0.00 1.82 0.01 0.00 0.02 0.00 90.87 36795.42 06:14:34 all 7.37 0.02 1.20 0.02 0.00 0.01 0.00 91.37 36818.80 06:14:39 all 7.41 0.00 1.05 0.02 0.00 0.02 0.00 91.49 36874.90 06:14:44 all 7.15 0.01 1.62 0.04 0.00 0.02 0.00 91.16 35904.77 06:14:49 all 7.21 0.00 1.22 0.14 0.00 0.02 0.00 91.41 38867.73 06:14:54 all 7.31 0.00 1.00 0.00 0.00 0.03 0.00 91.65 39378.74Important Field's Description: However, most of the fields has descriptive name.
CUP: "all" represents stats about all available CPU's in the system.
%user: Out of 100% CPU, How much is getting used by USER programs.
%sys: Out of 100% CPU, How much is getting used by system programs.
%iowait: Out of 100% CPU, How much is consumed by IO Waits.
%idle: After allocation CUP to all above resource, How much is still idle.
zzz ***Tue Apr 23 07:23:02 EDT 2013 Sample interval: 5 seconds Linux 2.6.32-400.21.1.el5uek (remote.database.com) 04/23/13 07:24:20 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 07:24:25 all 2.74 0.00 97.16 0.00 0.00 0.10 0.00 0.00 39066.67 07:24:30 all 3.06 0.00 96.87 0.00 0.00 0.07 0.00 0.00 37637.52 07:24:37 all 3.13 0.00 96.79 0.01 0.00 0.07 0.00 0.00 36788.64 07:24:42 all 2.69 0.00 97.17 0.05 0.00 0.09 0.00 0.00 38270.04 07:24:48 all 3.86 0.01 94.92 1.02 0.00 0.20 0.00 0.00 43247.39 07:24:53 all 3.51 0.00 96.19 0.20 0.00 0.11 0.00 0.00 39887.45 07:24:59 all 4.22 0.00 93.51 2.12 0.00 0.15 0.00 0.00 40638.08 07:25:04 all 6.26 0.00 85.04 8.56 0.00 0.13 0.00 0.00 41915.79 07:25:09 all 8.69 0.00 67.31 23.85 0.00 0.11 0.00 0.03 44586.56 07:25:15 all 8.09 0.00 80.62 11.17 0.00 0.12 0.00 0.00 44321.66 07:25:21 all 7.18 0.00 71.95 20.80 0.00 0.07 0.00 0.00 35399.65 07:25:26 all 6.69 0.00 68.20 24.97 0.01 0.12 0.00 0.00 38734.99 07:25:31 all 7.11 0.01 74.71 18.09 0.00 0.08 0.00 0.00 36695.68 07:25:36 all 7.46 0.00 14.17 78.20 0.00 0.05 0.00 0.13 32934.53 07:25:50 all 9.71 0.00 23.99 66.24 0.00 0.05 0.00 0.00 33617.64 07:25:56 all 7.80 0.00 85.97 6.13 0.00 0.10 0.00 0.00 41234.83
I hope difference is clear after seeing both mpstat output good at Tue Apr 23 06:13:44 and heavy load at Tue Apr 23 07:23:02. At good time, %sys CPU utilization was 1% while at time of load it has reached upto 97% and RAC node was rebooted after this time due to lack of Resources. The important fields to look at are %user, %sys, %iowait and %idle.
So, If Remote DBA observe any kind of load increase on system, He has to look at mpstat out put and then further dig out what has added load on the system. The reason could be many like some more database sessions are introduced, a batch job has started, RMAN backup is executing, some OS utility has introduced. We will also look into this in top command analysis detail.
3. Read or Analyze top Output: Top provides output of Top command on hourly basis on Operating system. The process listed in top output are in descending order of CPU utilization. i.e highest CUP utilizing process first and so on.
There are many situations, when Remote DBA figured out that, CPU utilization on the system has suddenly increased but number of process on the system are still same.
A situation:- Yesterday, there was CPU utilization spike despite of no load increase. In these type of situation Top command is fist place to start analysis for a DBA.
zzz ***Tue Apr 23 03:13:44 EDT 2013 Sample interval: 5 seconds. All measurements in KB (1024 bytes) top - 04:13:44 up 22 days, 21:12, 10 users, load average: 65.80, 169.78, 117.65 Tasks: 2297 total, 4 running, 2229 sleeping, 0 stopped, 64 zombie Cpu0 : 12.7%us, 2.6%sy, 0.0%ni, 84.2%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 6.7%us, 2.0%sy, 0.0%ni, 91.1%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st Cpu2 : 6.4%us, 1.7%sy, 0.0%ni, 91.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 5.5%us, 1.3%sy, 0.0%ni, 93.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 7.6%us, 1.6%sy, 0.0%ni, 90.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 5.3%us, 1.1%sy, 0.0%ni, 93.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 11.8%us, 2.7%sy, 0.0%ni, 85.3%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st Cpu7 : 7.0%us, 2.2%sy, 0.0%ni, 90.6%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st Cpu8 : 5.8%us, 1.5%sy, 0.0%ni, 91.8%id, 0.8%wa, 0.0%hi, 0.1%si, 0.0%st Cpu9 : 8.0%us, 1.7%sy, 0.0%ni, 90.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st Cpu10 : 3.8%us, 1.2%sy, 0.0%ni, 94.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 3.6%us, 1.0%sy, 0.0%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 99060540k total, 91969324k used, 7091216k free, 84044k buffers Swap: 25165816k total, 17797404k used, 7368412k free, 609612k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20343 oracle 20 0 13.4g 10g 5864 R 98.4 10.7 18:56.54 oraclevntrd2 (LOCAL=NO) 30180 root 20 0 11872 2312 656 R 98.4 0.0 0:00.68 /bin/netstat -n -p -l 6568 root 39 19 0 0 0 R 89.9 0.0 263:39.04 [kipmi0] 30262 root 20 0 23704 3116 1048 R 11.9 0.0 0:00.15 /usr/bin/top -b -d 5 -n 720 4921 root RT 0 247m 86m 55m S 6.8 0.1 328:08.44 /u01/app/11.2.0.3/grid/bin/osysmond.bin 28116 oracle 20 0 2623m 71m 14m S 6.8 0.1 51:51.62 /u01/app/11.2.0.3/grid/bin/oraagent.bin 4970 grid RT 0 359m 176m 54m S 5.1 0.2 157:05.89 /u01/app/11.2.0.3/grid/bin/ocssd.bin 64 root 20 0 0 0 0 S 1.7 0.0 4:39.22 [ksoftirqd/20] 4903 root 20 0 367m 20m 13m S 1.7 0.0 26:09.97 /u01/app/11.2.0.3/grid/bin/orarootagent.bin 6496 root 20 0 1274m 15m 11m S 1.7 0.0 28:27.53 /u01/app/11.2.0.3/grid/bin/orarootagent.bin 6535 oracle 20 0 1830m 263m 4620 S 1.7 0.3 88:05.31 /u01/app/oracle/product/agent12c/core/12.1.0.2.0/jdk/bin/java -Xmx128M -server -Djava.secu 7803 oracle -2 0 1266m 11m 4068 S 1.7 0.0 9:15.42 ora_lms0_oradb2 7874 oracle -2 0 1266m 15m 4188 S 1.7 0.0 9:16.20 ora_lms0_oradb2 7999 oracle 20 0 1284m 10m 3292 S 1.7 0.0 2:49.08 ora_lmd0_oradb2 8297 oracle 20 0 1230m 3368 2864 S 1.7 0.0 0:39.95 ora_pmon_oradb2 8333 oracle -2 0 1252m 2380 2108 S 1.7 0.0 13:19.99 ora_vktm_bid2 8443 oracle -2 0 1252m 2340 2096 S 1.7 0.0 13:21.86 ora_vktm_oradb2 8535 oracle 20 0 1253m 2712 2412 S 1.7 0.0 0:14.28 ora_dskm_oradb2 8727 oracle -2 0 1266m 11m 3656 S 1.7 0.0 9:01.37 ora_lms0_im1d2 8905 oracle 20 0 1267m 13m 3468 S 1.7 0.0 9:52.75 ora_dia0_pstd2Let's understand what does it means.
"zzz ***Tue Apr 23 03:13:44 EDT 2013 Sample interval: 5 seconds. All measurements in KB (1024 bytes)"
This Shows time when this stats was collected.
"top - 04:13:44 up 22 days, 21:12, 10 users"
Time with how long system has been up and running (22 days).
"load average: 65.80, 169.78, 117.65"
Load average is defined as the average number of processes in the run queue over the last 1, 5, and 15 minutes. A runnable Unix process is one that is available right now to consume CPU resources and is not blocked on I/O or on a system call. The higher the load average, the more work your machine is doing.
In this example, There were 65.80 process read to run one minute ago, 169.78 five minutes ago and 117.65 process read to use CPU fifteen minutes before. While in below screen shot load average is very low (load average: 2.93, 43.22, 75.56) A dramatic rise in the load average can indicate a serious performance problem.
top - 04:20:53 up 22 days, 21:19, 10 users, load average: 2.93, 43.22, 75.56 Tasks: 2247 total, 2 running, 2181 sleeping, 0 stopped, 64 zombie Cpu0 : 6.8%us, 2.7%sy, 0.0%ni, 15.4%id, 75.1%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 2.3%us, 1.0%sy, 0.0%ni, 96.5%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 3.9%us, 0.6%sy, 0.0%ni, 95.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 28.6%us, 0.4%sy, 0.0%ni, 71.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 20.2%us, 0.8%sy, 0.0%ni, 79.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 10.7%us, 6.4%sy, 0.0%ni, 82.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 7.7%us, 1.4%sy, 0.0%ni, 90.7%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 29.8%us, 1.0%sy, 0.0%ni, 69.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 1.6%us, 1.0%sy, 0.0%ni, 39.3%id, 57.9%wa, 0.0%hi, 0.2%si, 0.0%st Cpu9 : 3.1%us, 1.0%sy, 0.0%ni, 91.7%id, 4.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 1.6%us, 0.4%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st Cpu11 : 1.0%us, 0.4%sy, 0.0%ni, 98.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 99060540k total, 97924488k used, 1136052k free, 219788k buffers Swap: 25165816k total, 17130968k used, 8034848k free, 741692k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 26721 oracle 20 0 1234m 15m 12m S 28.5 0.0 8:25.77 oracleoracledb2 (LOCAL=NO) 26729 oracle 20 0 1234m 15m 12m S 28.3 0.0 8:17.08 oracleoracledb2 (LOCAL=NO) 26215 oracle 20 0 1234m 13m 10m S 19.4 0.0 56:10.91 oracleoracledb2 (LOCAL=NO) 25867 oracle 20 0 1238m 17m 11m R 8.6 0.0 8:08.25 oracleoracledb2 (LOCAL=NO) 4921 root RT 0 247m 86m 55m S 7.7 0.1 328:43.44 /u01/app/11.2.0.3/grid/bin/osysmond.bin 26706 oracle 20 0 1235m 16m 13m S 6.3 0.0 8:00.49 oracleoracledb2 (LOCAL=NO) 28116 oracle 20 0 2623m 71m 14m S 5.7 0.1 52:15.90 /u01/app/11.2.0.3/grid/bin/oraagent.bin 26710 oracle 20 0 1235m 15m 12m S 3.7 0.0 8:06.62 oracleoracledb2 (LOCAL=NO) 4970 grid RT 0 359m 176m 54m S 3.3 0.2 157:20.60 /u01/app/11.2.0.3/grid/bin/ocssd.bin 26708 oracle 20 0 1235m 17m 13m S 2.4 0.0 8:09.64 oracleoracledb2 (LOCAL=NO) 30262 root 20 0 23712 3228 1128 R 1.8 0.0 0:08.14 /usr/bin/top -b -d 5 -n 720 26719 oracle 20 0 1235m 16m 13m S 1.2 0.0 8:33.47 oracleoracledb2 (LOCAL=NO) 8399 oracle -2 0 1242m 15m 4684 S 1.0 0.0 20:43.88 ora_lms0_oracledb2Tasks: 2247 total, 2 running, 2181 sleeping,
This line displays the total number of processes exists (2247) at the time of the last update. how many are sleeping (blocked on I/O or a system call) 2181, and how many are actually assigned to a CPU (2). The running processes will not be greater than the number of processors on the machine, Like load average, the total number of processes on a healthy machine usually varies just a small amount over time. Suddenly having a significantly larger or smaller number of processes could be a warning sign.
Mem: 99060540k total, 97924488k used, 1136052k free, 219788k buffers
The above information shows the Physical memory (RAM) utilization on machine in kilo bytes. Total is RAM is 99060540k out of which 97924488k is used and 1136052k is free. A remote DBA should also notice a change in the RAM utilization over time.
Swap: 25165816k total, 17130968k used, 8034848k free, 741692k cached
Swap memory is an extension of Physical memory, when system exhaust with RAM it starts using Swap memory, Swap utilization up to some extent (40%) is fine, but when it's utilization increase means system needs more RAM to work. A DBA should keep any eye on Swap utilization, high swap utilization can badly impact system performance and 100% swap utilization can tend to system reboot as well.
Cpu0 : 6.8%us, 2.7%sy, 0.0%ni, 15.4%id, 75.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 2.3%us, 1.0%sy, 0.0%ni, 96.5%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 3.9%us, 0.6%sy, 0.0%ni, 95.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 28.6%us, 0.4%sy, 0.0%ni, 71.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 20.2%us, 0.8%sy, 0.0%ni, 79.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 10.7%us, 6.4%sy, 0.0%ni, 82.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Above information shows no. of CPU's into the system and each CUP utilization. CPU's are named as cpu0, cpu1 and cpu2, Here, this system has 12 CPU's and each CPU utilization is shown above.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 26721 oracle 20 0 1234m 15m 12m S 28.5 0.0 8:25.77 oracleoracledb2 (LOCAL=NO) 26729 oracle 20 0 1234m 15m 12m S 28.3 0.0 8:17.08 oracleoracledb2 (LOCAL=NO) 26215 oracle 20 0 1234m 13m 10m S 19.4 0.0 56:10.91 oracleoracledb2 (LOCAL=NO) 25867 oracle 20 0 1238m 17m 11m R 8.6 0.0 8:08.25 oracleoracledb2 (LOCAL=NO)This list shows the number of process and their details running at time of snapshot taken by OS Watcher. Important fields to look at are
PID: OS process ID of the process
USER: Owner of the process like root, oracle
%CPU: How much cpu% a process is taking from a CPU.
%MEM: How much memory% out of total memory a process is taking.
COMMAND: What command a process is executing.
Things to look at:
Check if a process is taking high CPU for a consistent amount of time. If, this is some SQL command having high load, then this could be candidate for tuning. Notice, if a process is using high memory, then DBA has to look further if it's a background process or forground.
However, There are few most stats available in OS Watcher to look at, but after looking at above three stats can give you a fair idea about load on the system and DBA can start investigations accordingly.
In conclusion, I would say if you have identified the load increase on the system with OS Watcher analysis that is great, but if there was no load increase on the system at all, then this is not end of the world, you have next tool available for analyze why Database is performing slow i.e. AWR Report, Generate an awr report and start reading or analyzing AWR report to look at actual cause of the issue.