Top 4 Reasons for Node Reboot or Node Eviction in Real Application Cluster (RAC) Environment
Whenever a Database Administrator faces a node reboot issue, the first place to look should be /var/log/messages and the OS Watcher logs of the database node that was rebooted.
/var/log/messages gives an actual picture of the reboot: the exact time of the restart, the status of resources such as swap and RAM, and so on.
So, every time you see a node eviction, start the investigation with /var/log/messages and analyze the OS Watcher logs. Below is a situation where a database node was rebooted due to high load.
/var/log/messages output from a Database Node just before Node eviction:
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA: 2*4kB 1*8kB 0*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15792kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA32: 150*4kB 277*8kB 229*16kB 186*32kB 51*64kB 65*128kB 82*256kB 13*512kB 3*1024kB 3*2048kB 78*4096kB = 380368kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 Normal: 12362*4kB 58*8kB 0*16kB 0*32kB 0*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 52984kB
Apr 23 08:15:09 remotedb06 kernel: 83907 total pagecache pages
Apr 23 08:15:11 remotedb06 kernel: 39826 pages in swap cache
Apr 23 08:15:11 remotedb06 kernel: Swap cache stats: add 30820220, delete 30780387, find 18044378/19694662
Apr 23 08:15:14 remotedb06 kernel: Free swap = 4kB
Apr 23 08:15:15 remotedb06 kernel: Total swap = 25165816kB
Apr 23 08:15:16 remotedb06 kernel: 25165808 pages RAM
Apr 23 08:15:28 remotedb06 kernel: 400673 pages reserved
Apr 23 08:15:30 remotedb06 kernel: 77691135 pages shared
Apr 23 08:15:31 remotedb06 kernel: 9226743 pages non-shared
Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4, mode:0xd0
08:15:29  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
08:15:29  all   0.67   0.00  82.60    16.60  0.00   0.06    0.00   0.07  42521.98
08:15:35  all   0.84   0.00  23.40    73.19  0.00   0.11    0.00   2.47  35859.59
08:15:39  all   1.22   0.00  85.13    13.47  0.00   0.13    0.00   0.04  40569.31
08:15:45  all   1.57   0.00  98.22     0.13  0.00   0.08    0.00   0.00  36584.31
08:15:50  all   1.41   0.00  98.48     0.04  0.00   0.07    0.00   0.00  36643.10
08:15:54  all   0.84   0.00  99.09     0.03  0.00   0.05    0.00   0.00  36257.02
08:16:06  all   0.95   0.00  98.88     0.09  0.00   0.08    0.00   0.00  39113.15
08:16:11  all   0.87   0.00  99.00     0.07  0.00   0.06    0.00   0.00  37490.22
08:16:16  all   0.89   0.00  98.97     0.07  0.00   0.07    0.00   0.00  37681.04
08:16:22  all   0.78   0.00  99.12     0.05  0.00   0.05    0.00   0.00  36963.75
08:16:38  all   0.79   0.00  98.86     0.28  0.00   0.08    0.00   0.00  36639.21
08:16:43  all   0.78   0.00  98.79     0.34  0.00   0.08    0.00   0.01  37405.99
08:16:54  all   1.06   0.00  98.71     0.12  0.00   0.11    0.00   0.00  38102.37
08:17:08  all   1.69   0.00  67.02    30.93  0.00   0.06    0.00   0.29  37316.55
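The tell-tale signs in these excerpts — free swap exhausted, a kernel page allocation failure, and %sys pinned near 100% — can also be spotted programmatically when scanning many hours of /var/log/messages. Here is a minimal sketch; the sample lines come from the excerpt above, while the 1 MB swap threshold is an illustrative assumption, not an Oracle-recommended value:

```python
import re

def memory_pressure_signals(lines):
    """Scan /var/log/messages lines for the memory exhaustion that
    often precedes a load-related node reboot."""
    signals = []
    for line in lines:
        # Kernel reports remaining free swap just before the reboot
        m = re.search(r"Free swap\s*=\s*(\d+)kB", line)
        if m and int(m.group(1)) < 1024:  # under ~1 MB left (illustrative)
            signals.append("swap exhausted: %skB free" % m.group(1))
        # Kernel could not satisfy an allocation for a process
        if "page allocation failure" in line:
            proc = line.split("kernel:")[1].split(":")[0].strip()
            signals.append("allocation failure in %s" % proc)
    return signals

sample = [
    "Apr 23 08:15:14 remotedb06 kernel: Free swap = 4kB",
    "Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4, mode:0xd0",
]
print(memory_pressure_signals(sample))
```

Finding `osysmond.bin` (a clusterware daemon) failing to allocate memory, as in the excerpt, is a strong hint that the node ran out of memory rather than hitting a clusterware bug.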
How to avoid Node Reboot due to High Load?
Here is a scenario where the voting disk is not reachable:
2013-01-26 10:15:47.177 [cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log
There could be two reasons for this issue:
A. Connection to the voting disk is interrupted.
B. Only one voting disk is in use and the version is less than 11.2.0.3.4, hitting known bug 13869978.
How to Solve Voting Disk Outage?
There could be many reasons for a voting disk being unreachable. Here are a few general approaches for a DBA to follow.
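A good first step is to confirm exactly what the clusterware reported. On a live cluster, `crsctl query css votedisk` shows the current voting disk status; after the fact, the CRS-1606 message in the alert log (as in the example above) tells you how many voting files were available versus required. As a sketch, that message can be parsed like this (the regex is an assumption based on the message format shown above):

```python
import re

# CRS-1606 reports available vs. required voting files before CSSD terminates
CRS1606 = re.compile(
    r"CRS-1606:The number of voting files available, (\d+), "
    r"is less than the minimum number of voting files required, (\d+)"
)

def voting_file_shortfall(alert_line):
    """Return (available, required) if the line is a CRS-1606 message, else None."""
    m = CRS1606.search(alert_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = ("2013-01-26 10:15:47.177 [cssd(3743)]CRS-1606:The number of voting "
        "files available, 1, is less than the minimum number of voting files "
        "required, 2, resulting in CSSD termination to ensure data integrity")
print(voting_file_shortfall(line))  # (1, 2)
```

With an odd number of voting files configured, CSSD requires a strict majority to stay up, which is why losing 2 of 3 files (only 1 available, 2 required) terminates the node.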
3. Missed Network Connection between Nodes: In technical terms this is called a Missed Network Heartbeat (NHB). Whenever there is a communication gap, or no communication at all, between nodes over the private network (interconnect) due to a network outage or some other reason, a node aborts itself to avoid a "split brain" situation. The most common (but not exclusive) cause of a missed NHB is network problems communicating over the private interconnect.
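The eviction decision described above is governed by the CSS misscount parameter: if no network heartbeat arrives from a peer within misscount seconds (30 by default on Linux; check your cluster with `crsctl get css misscount`), the peer is declared dead and evicted. A deliberately simplified sketch of that rule — not Oracle's actual implementation:

```python
MISSCOUNT_SECS = 30  # CSS misscount default on Linux; site-specific in practice

def peer_evicted(last_heartbeat_at, now, misscount=MISSCOUNT_SECS):
    """Return True if the peer has missed network heartbeats for longer
    than misscount seconds (simplified CSS eviction rule)."""
    return (now - last_heartbeat_at) > misscount

# Peer last checked in 31.5 seconds ago -> beyond misscount, evict
print(peer_evicted(last_heartbeat_at=100.0, now=131.5))  # True
# Peer checked in 20 seconds ago -> still considered alive
print(peer_evicted(last_heartbeat_at=100.0, now=120.0))  # False
```

This is why even a short interconnect outage (a switch reboot, a flapping NIC) that lasts longer than misscount is enough to evict an otherwise healthy node.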
Suggestions to troubleshoot a Missed Network Heartbeat:
Database alert log entries for a database hang situation:
Wed Jun 05 03:25:01 2013
Sweep [inc][122970]: completed
System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122970/remotedb2_m005_118200_i122970_a.trc
Wed Jun 05 03:25:33 2013
DSKM process appears to be hung. Initiating system state dump.
Wed Jun 05 03:25:33 2013
System state dump requested by (instance=2, osid=9757 (GEN0)), summary=[system state dump request (ksz_check_ds)].
System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc
Wed Jun 05 03:27:04 2013
Errors in file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_pmon_9707.trc (incident=122897):
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x143642090], [3], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122897/remotedb2_pmon_9707_i122897.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_pmon_9707.trc:
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x143642090], [3], [], [], [], [], [], [], [], [], []
PMON (ospid: 9707): terminating the instance due to error 472
Wed Jun 05 03:27:05 2013
System state dump requested by (instance=2, osid=9707 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc
Instance terminated by PMON, pid = 9707
Wed Jun 05 03:45:58 2013
Starting ORACLE instance (normal)
Clusterware alert log entries from the same time window:
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "PCAB1DE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:27:06.649
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "P1PANIE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:32:41.360
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "P1ESXPE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:36:25.675
[cssd(67762)]CRS-1662:Member kill requested by node exs2db03 for member number 1, group DBPCAB1DE
2013-06-05 03:36:58.775
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5818:Aborted command 'clean' for resource 'ora.pcab1de.db'. Details at (:CRSAGF00113:) {0:6:28} in /u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2013-06-05 03:37:06.784
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5818:Aborted command 'clean' for resource 'ora.p1panie.db'. Details at (:CRSAGF00113:) {0:6:29} in /u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2013-06-05 03:37:26.307
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.318
[cssd(67762)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log
2013-06-05 03:37:26.318
[cssd(67762)]CRS-1652:Starting clean up of CRSD resources.
2013-06-05 03:37:26.340
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.450
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.647
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.869
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.921
The root cause of this incident was "DSKM process appears to be hung." This was definitely not due to high load, since OS Watcher does not show any load at that time, so I believe it could be due to a bug in the database.
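When an eviction like this happens, the first triage question is usually "which node did the evicting?" — that tells you whose ocssd.log to read next. The CRS-1608 message carries that information, and a small sketch can pull it out (the regex is an assumption based on the message format in the log above):

```python
import re

# CRS-1608 names the node that initiated the eviction
CRS1608 = re.compile(r"CRS-1608:This node was evicted by node (\d+), (\S+);")

def evicting_node(alert_lines):
    """Return (node_number, node_name) of the evicting node, if logged."""
    for line in alert_lines:
        m = CRS1608.search(line)
        if m:
            return int(m.group(1)), m.group(2)
    return None

lines = ["[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; "
         "details at (:CSSNM00005:) in "
         "/u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log."]
print(evicting_node(lines))  # (3, 'exs2db03')
```

Here the answer is exs2db03, which matches the earlier CRS-1662 member kill request from that same node: the surviving instance asked for the member kill, and when the kill could not complete ('clean' commands aborted), CSS escalated to a full node eviction.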