Replacing (hot-swapping) a failed disk in rootvg

IBM called me because one of my hosts had called them, reporting a failed disk in rootvg. It was hdisk0 that decided to quit. A few minutes later, a service engineer called me to let me know that he was on his way with the new disk. I had about an hour to get ready. The following text describes the process of replacing a non-SAN disk in rootvg.

After logging in, I had to verify that it was indeed hdisk0 that had died:

grdoras1:/root >  lsvg -p rootvg
rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk1    active    546         383         109..18..38..109..109
hdisk0    missing   546         383         109..18..38..109..109
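The same check can be scripted. The following is a minimal sketch, assuming the column layout shown above; the function name missing_pvs is only illustrative and not part of AIX.

```shell
# Print the name of every physical volume whose PV STATE is not "active".
# Expects the output of `lsvg -p <vg>` on standard input.
missing_pvs() {
    # Line 1 is the "<vg>:" label and line 2 the column header; skip both.
    awk 'NR > 2 && NF >= 2 && $2 != "active" { print $1 }'
}
```

On the host it would be fed with: lsvg -p rootvg | missing_pvs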

Let’s try to “wake” this disk. Maybe it will come back online?

grdoras1:/root >  varyonvg rootvg
0516-1747 varyonvg: Cannot varyon volume group with an active dump device on a missing physical volume. Use sysdumpdev to temporarily replace the dump device with /dev/sysdumpnull and try again.
grdoras1:/root >

I want to use this opportunity to resize the dump volumes of this host, so I deactivate both of them now; I will remove them later.

grdoras1:/root >  sysdumpdev -P -p /dev/sysdumpnull
primary              /dev/sysdumpnull
secondary            /dev/dump1
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     ON
grdoras1:/root >  sysdumpdev -P -s /dev/sysdumpnull
primary              /dev/sysdumpnull
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     ON

For the very last time – can I wake it up?

grdoras1:/root >  varyonvg rootvg
grdoras1:/root >  0516-934 /etc/syncvg: Unable to synchronize logical volume hd5.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd8.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd4.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd2.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd9var.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd3.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd1.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd10opt.
0516-934 /etc/syncvg: Unable to synchronize logical volume fslv00.
0516-934 /etc/syncvg: Unable to synchronize logical volume rootlv.
0516-934 /etc/syncvg: Unable to synchronize logical volume local_lv.
0516-932 /etc/syncvg: Unable to synchronize volume group rootvg.

Apparently, hdisk0 is really dead. No joke. Let’s check whether any paging space on hdisk0 needs to be deactivated as well.

grdoras1:/root >  lsps -a
Page Space Physical Volume Volume Group  Size %Used Active  Auto  Type
hd6      hdisk1      rootvg      8192MB     1   yes   yes    lv

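Today, I am lucky. No paging space has been defined on hdisk0. Otherwise, I would have to execute the next command, where swap_lv is the name of the paging space to be removed. Note that chps -a n only keeps the paging space from being activated at the next restart; a paging space that is currently active also has to be deactivated with swapoff /dev/swap_lv before it can be removed.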

chps -a n swap_lv

The deactivated paging space is then removed by executing:

rmps swap_lv
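
Whether a paging space lives on the failed disk can be read from the second column of the lsps -a listing. A small sketch, assuming the layout shown earlier; paging_on_pv is a made-up helper name.

```shell
# Print the paging spaces that reside on the given physical volume.
# Expects the output of `lsps -a` on standard input.
paging_on_pv() {
    # Line 1 is the column header; column 2 holds the physical volume name.
    awk -v pv="$1" 'NR > 1 && $2 == pv { print $1 }'
}
```

Running lsps -a | paging_on_pv hdisk0 on this host would print nothing, which is the lucky case described above.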

The call-home feature had already given IBM all the information about the missing disk, so I was not asked to provide any FRU or Z? info (lscfg -vl hdisk0 would provide all the answers). To satisfy my own curiosity, I execute the next command and proceed with the remaining tasks.

grdoras1:/root >  lsdev -Cc disk | grep hdisk0
hdisk0  Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive

The bootlist has to be modified as hdisk0 is useless and it cannot be used as a boot device.

grdoras1:/root >  bootlist -m normal -o
hdisk0
hdisk1 blv=hd5

hdisk1 will be the only device the host knows to boot from.

grdoras1:/root >  bootlist -m normal hdisk1
grdoras1:/root >  bootlist -m normal -o
hdisk1 blv=hd5
grdoras1:/root >  savebase
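
To double-check in a script that the dead disk is really gone from the boot list, the output of bootlist -m normal -o can be tested. This is only a sketch; in_bootlist is a hypothetical helper, and it assumes the disk name is the first word on each line, as in the output above.

```shell
# Succeed (exit status 0) if the given disk appears in the boot list.
# Expects the output of `bootlist -m normal -o` on standard input.
in_bootlist() {
    awk -v d="$1" '$1 == d { found = 1 } END { exit (found ? 0 : 1) }'
}
```

For example: bootlist -m normal -o | in_bootlist hdisk0 && echo "hdisk0 is still in the bootlist"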

The dead disk has to be removed from its volume group, which is a two-step process. First, the mirror has to be broken by removing the copy residing on hdisk0.

grdoras1:/root >  unmirrorvg -c 1 rootvg hdisk0
0516-1246 rmlvcopy: If hd5 is the boot logical volume, please run 'chpv -c ' as root user to clear the boot record and avoid a potential boot off an old  boot image that may reside on the disk from which this logical volume is moved/removed.
 
0516-1804 chvg: The quorum change takes effect immediately.
0516-1144 unmirrorvg: rootvg successfully unmirrored, user should perform bosboot of system to reinitialize boot records.  Then, user must modify bootlist to just include:  hdisk1.
grdoras1:/root >

The next step is the actual removal of the disk from its volume group.

grdoras1:/root >  reducevg rootvg hdisk0
0516-016 ldeletepv: Cannot delete physical volume with allocated partitions. Use either migratepv to move the partitions or reducevg with the -d option to delete the partitions.
0516-884 reducevg: Unable to remove physical volume hdisk0.

The command did not fail because of a system fault. Well, we could say that it is an error and that the error is mine. I deactivated the dump volume residing on hdisk0, but as far as AIX is concerned this volume still exists and still has partitions allocated on the disk. So I have to remove it, regardless of whether the disk is dead or not. As you can see next, AIX can still show what is allocated on the disk.

grdoras1:/root >  lspv -M hdisk0
hdisk0:1-193
hdisk0:194      dump0:1
hdisk0:195      dump0:2
hdisk0:196      dump0:3
hdisk0:197      dump0:4
hdisk0:198      dump0:5
hdisk0:199      dump0:6
hdisk0:200      dump0:7
hdisk0:201      dump0:8
hdisk0:202-546
grdoras1:/root >  rmlv dump0
Warning, all data contained on logical volume dump0 will be destroyed.
rmlv: Do you wish to continue? y(es) n(o)?
y
rmlv: Logical volume dump0 is removed.
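
Before retrying reducevg, it may be worth confirming that nothing else is left on the disk. The sketch below condenses an lspv -M map to the names of the logical volumes that still own partitions; lvs_on_pv is an illustrative name, and the parsing assumes the PV:PP LV:LP layout seen above.

```shell
# Print each logical volume that still has partitions on the disk, once.
# Expects the output of `lspv -M <hdisk>` on standard input.
lvs_on_pv() {
    # Free ranges such as "hdisk0:1-193" have a single field and are skipped.
    # Allocated lines carry "LVname:LPnum" in the second field.
    awk 'NF >= 2 { split($2, a, ":"); print a[1] }' | sort -u
}
```

After the rmlv above, lspv -M hdisk0 | lvs_on_pv should print nothing; any name it does print is what reducevg would still complain about.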

Next in line is the removal of the disk from the volume group.

grdoras1:/root >  reducevg rootvg hdisk0
grdoras1:/root >  lsvg -p rootvg
rootvg:
PV_NAME   PV STATE   TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk1    active    546         383         109..18..38..109..109
grdoras1:/root >  lspv
hdisk0          00ca1ef03c7d9ff2                    None
hdisk1          00ca1ef0454c8913                    rootvg          active

If I executed the next command and removed the disk definition, I would not be able to see the disk in the diag hot plug tasks…

grdoras1:/root >  rmdev -dl hdisk0
hdisk0 deleted
grdoras1:/root >

I would have to execute the configmangler, aka the cfgmgr command, to get it back so that diag could present it to me. Start diag, hit the ENTER key, move down to Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.), hit CTRL-V once, and move down to Hot Plug Task. On the next screen select SCSI and SCSI RAID Hot Plug Manager, then move down and select Replace/Remove a Device Attached to an SCSI Hot Swap Enclosure Device. On the next and final screen select the desired disk (hdisk0 in our case). Hit the ENTER key to confirm your selection and proceed with the disk replacement.
Hit the same key again to declare the disk swap completed. After you leave diag and are back at the command prompt, execute cfgmgr (for good measure).

If you are a purist, you may execute the next two steps. Otherwise, proceed directly to the task at hand and add the new hdisk0 to the host's rootvg.

chdev -l hdisk0 -a pv=clear
chdev -l hdisk0 -a pv=yes

extendvg rootvg hdisk0

Before we get lost in the activities ahead, let’s stop for a moment and regroup. What do we need to complete this process? We need (not necessarily in the order shown) to create the dump volumes (one per disk), re-mirror rootvg, and modify the bootlist so that it again includes both disks. OK.
To re-create and re-sync the mirrors in rootvg, the following command has to be executed.

grdoras1:/root >  mirrorvg -S -c 2 rootvg hdisk0 hdisk1
0516-1804 chvg: The quorum change takes effect immediately.
0516-1126 mirrorvg: rootvg successfully mirrored, user should perform
bosboot of system to initialize boot records.  Then, user must modify
bootlist to include:  hdisk1 hdisk0.

After my return, I check the state of my volume group. I notice that some logical volumes are still stale; apparently I returned sooner than expected.
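
One way to see which logical volumes are still catching up is to filter the lsvg -l rootvg listing on its LV STATE column. The sketch below assumes the usual layout of that listing (a "rootvg:" label, a header line, then LV STATE as the sixth field with values such as open/syncd or open/stale); stale_lvs is a hypothetical helper name.

```shell
# Print the logical volumes that still have stale partitions.
# Expects the output of `lsvg -l <vg>` on standard input.
stale_lvs() {
    # Skip the "<vg>:" label and the header; field 6 is the LV STATE.
    awk 'NR > 2 && $6 ~ /stale/ { print $1 }'
}
```

Once lsvg -l rootvg | stale_lvs prints nothing, the background sync started by mirrorvg -S has finished, and the bosboot and bootlist steps requested in the mirrorvg message can follow.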