AIX Power replacing (hot-swap) failed disk in rootvg
After logging in, I had to verify that it was indeed hdisk0 that had died:
    grdoras1:/root > lsvg -p rootvg
    rootvg:
    PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
    hdisk1            active            546         383         109..18..38..109..109
    hdisk0            missing           546         383         109..18..38..109..109
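The same check can be scripted. What follows is only a minimal sketch: the function name missing_pvs is my own invention (not an AIX command), and it assumes the lsvg -p layout shown above, with the PV name in the first column and the state in the second.

```shell
# Print the names of physical volumes whose PV STATE is "missing".
# Reads `lsvg -p <vg>` output on stdin; the first two lines
# (volume group name and column header) are skipped.
missing_pvs() {
    awk 'NR > 2 && $2 == "missing" { print $1 }'
}
```

On the host it would be used as `lsvg -p rootvg | missing_pvs`; no output means no disk is reported missing.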
Let’s try to “wake” this disk. Maybe it will come back online?
    grdoras1:/root > varyonvg rootvg
    0516-1747 varyonvg: Cannot varyon volume group with an active dump device on a missing physical volume. Use sysdumpdev to temporarily replace the dump device with /dev/sysdumpnull and try again.
    grdoras1:/root >
I want to use this opportunity to resize the dump volumes of this host, so I deactivate both of them, as I will remove them both later.
    grdoras1:/root > sysdumpdev -P -p /dev/sysdumpnull
    primary              /dev/sysdumpnull
    secondary            /dev/dump1
    copy directory       /var/adm/ras
    forced copy flag     TRUE
    always allow dump    TRUE
    dump compression     ON
    grdoras1:/root > sysdumpdev -P -s /dev/sysdumpnull
    primary              /dev/sysdumpnull
    secondary            /dev/sysdumpnull
    copy directory       /var/adm/ras
    forced copy flag     TRUE
    always allow dump    TRUE
    dump compression     ON
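Before retrying varyonvg, it may be worth confirming that both dump devices really point at /dev/sysdumpnull. A small sketch, assuming the output of sysdumpdev -l has the same "primary"/"secondary" lines as above; the function name dump_devices_parked is hypothetical.

```shell
# Succeed (exit 0) only if both the primary and the secondary dump
# device are /dev/sysdumpnull. Reads `sysdumpdev -l` output on stdin.
dump_devices_parked() {
    awk '
        $1 == "primary"   { p = $2 }
        $1 == "secondary" { s = $2 }
        END { exit !(p == "/dev/sysdumpnull" && s == "/dev/sysdumpnull") }
    '
}
```

Typical use would be `sysdumpdev -l | dump_devices_parked && varyonvg rootvg`, so the varyon is attempted only when neither dump device lives on the missing disk.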
For the very last time – can I wake it up?
    grdoras1:/root > varyonvg rootvg
    grdoras1:/root > 0516-934 /etc/syncvg: Unable to synchronize logical volume hd5.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd8.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd4.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd2.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd9var.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd3.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd1.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd10opt.
    0516-934 /etc/syncvg: Unable to synchronize logical volume fslv00.
    0516-934 /etc/syncvg: Unable to synchronize logical volume rootlv.
    0516-934 /etc/syncvg: Unable to synchronize logical volume local_lv.
    0516-932 /etc/syncvg: Unable to synchronize volume group rootvg.
Apparently, hdisk0 is really dead. No joke. Let’s check if any swap needs to be de-activated (the one on hdisk0) as well.
    grdoras1:/root > lsps -a
    Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
    hd6             hdisk1            rootvg        8192MB     1   yes   yes    lv
Today, I am lucky. No swap has been defined on hdisk0. Otherwise, we would have to execute the next command (where swap_lv is the name of the swap volume to be removed), which stops the paging space from being activated at the next restart.
    chps -a n swap_lv
Once the paging space is no longer active (swapoff deactivates one that is still in use), it is removed by executing:
    rmps swap_lv
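To find out which paging spaces (if any) sit on a particular disk, the lsps -a listing can be filtered by its second column. This is a sketch only; paging_on_disk is a made-up helper name, and it assumes the column order shown in the lsps -a output above.

```shell
# Print the paging spaces located on the given physical volume.
# $1 is the disk name; `lsps -a` output is read from stdin.
paging_on_disk() {
    awk -v pv="$1" 'NR > 1 && $2 == pv { print $1 }'
}
```

For example, `lsps -a | paging_on_disk hdisk0` prints nothing on this host, which is why chps and rmps were not needed here.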
The host's call-home facility provided IBM with all the information about the missing disk, so I was not asked to provide any FRU or Z? info (lscfg -vl hdisk0 would provide all the answers). To satisfy my own curiosity, I execute the next command and proceed with the remaining tasks.
    grdoras1:/root > lsdev -Cc disk | grep hdisk0
    hdisk0 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
The bootlist has to be modified as hdisk0 is useless and it cannot be used as a boot device.
    grdoras1:/root > bootlist -m normal -o
    hdisk0
    hdisk1 blv=hd5
hdisk1 will be the only device the host knows to boot from.
    grdoras1:/root > bootlist -m normal hdisk1
    grdoras1:/root > bootlist -m normal -o
    hdisk1 blv=hd5
    grdoras1:/root > savebase
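A quick way to double-check that the dead disk is really gone from the boot list is to look for its name in the first column of the bootlist -m normal -o output. A minimal sketch; in_bootlist is a hypothetical helper, not part of AIX.

```shell
# Succeed (exit 0) if the given disk appears in the boot list.
# $1 is the disk name; `bootlist -m normal -o` output is read from stdin.
in_bootlist() {
    awk -v d="$1" '$1 == d { found = 1 } END { exit !found }'
}
```

Used as `bootlist -m normal -o | in_bootlist hdisk0 || echo "hdisk0 is not a boot device"`.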
The dead disk has to be removed from its volume group, which is a two-step process. First, the mirrors have to be broken by removing the mirror copy residing on hdisk0.
    grdoras1:/root > unmirrorvg -c 1 rootvg hdisk0
    0516-1246 rmlvcopy: If hd5 is the boot logical volume, please run 'chpv -c ' as root user to clear the boot record and avoid a potential boot off an old boot image that may reside on the disk from which this logical volume is moved/removed.
    0516-1804 chvg: The quorum change takes effect immediately.
    0516-1144 unmirrorvg: rootvg successfully unmirrored, user should perform bosboot of system to reinitialize boot records. Then, user must modify bootlist to just include: hdisk1.
    grdoras1:/root >
The next step is the actual removal of the disk from its volume group.
    grdoras1:/root > reducevg rootvg hdisk0
    0516-016 ldeletepv: Cannot delete physical volume with allocated partitions. Use either migratepv to move the partitions or reducevg with the -d option to delete the partitions.
    0516-884 reducevg: Unable to remove physical volume hdisk0.
The command does not fail because of a system fault. Well, we could say that there is an error and that the error is mine. I deactivated the dump volume residing on hdisk0, but as far as AIX is concerned this volume still exists and still has partitions allocated on the disk. So I have to remove it (regardless of whether the disk is dead or not). As you can see next, AIX still lists the dump0 partitions on hdisk0.
    grdoras1:/root > lspv -M hdisk0
    hdisk0:1-193
    hdisk0:194      dump0:1
    hdisk0:195      dump0:2
    hdisk0:196      dump0:3
    hdisk0:197      dump0:4
    hdisk0:198      dump0:5
    hdisk0:199      dump0:6
    hdisk0:200      dump0:7
    hdisk0:201      dump0:8
    hdisk0:202-546
    grdoras1:/root > rmlv dump0
    Warning, all data contained on logical volume dump0 will be destroyed.
    rmlv: Do you wish to continue? y(es) n(o)?
    y
    rmlv: Logical volume dump0 is removed.
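To see at a glance which logical volumes still hold partitions on a disk (and would therefore block reducevg), the lspv -M map can be reduced to a list of LV names. A sketch under the assumption that allocated lines look like "hdisk0:194 dump0:1" as in the listing above, while free ranges have a single field; lvs_on_pv is an invented name.

```shell
# Print each logical volume that still has partitions on the PV, once.
# Reads `lspv -M <pv>` output on stdin. Lines describing free ranges
# have only one field and are ignored.
lvs_on_pv() {
    awk 'NF >= 2 { split($2, a, ":"); if (!seen[a[1]]++) print a[1] }'
}
```

Running `lspv -M hdisk0 | lvs_on_pv` before the rmlv would have printed dump0; once it prints nothing, reducevg has no allocated partitions left to complain about.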
Next in line is the removal of the disk from the volume group.
    grdoras1:/root > reducevg rootvg hdisk0
    grdoras1:/root > lsvg -p rootvg
    rootvg:
    PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
    hdisk1            active            546         383         109..18..38..109..109
    grdoras1:/root > lspv
    hdisk0          00ca1ef03c7d9ff2                    None
    hdisk1          00ca1ef0454c8913                    rootvg          active
If I executed the next command and removed the disk definition, I would not be able to see the disk when executing the diag hot-plug tasks…
    grdoras1:/root > rmdev -dl hdisk0
    hdisk0 deleted
    grdoras1:/root >
I would have to execute the configmangler aka the cfgmgr command to get it back so diag could present it for me. Execute diag, hit the ENTER key, slide down to the Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.), hit CTRL-V once, and slide to the Hot Plug Task. From the next screen select SCSI and SCSI RAID Hot Plug Manager, slide down and select Replace/Remove a Device Attached to an SCSI Hot Swap Enclosure Device. From the next and the final screen select the desired disk (hdisk0 in our case). Hit the ENTER key to define your selection and proceed with the disk replacement.
Hit the same key again to declare the swap of the disk completed, and after you leave diag and are back at the command prompt, execute cfgmgr (for good measure).
If you are a purist, you may execute the next two steps. Otherwise, proceed directly to the task at hand and add the new hdisk0 to the host's rootvg.
    chdev -l hdisk0 -a pv=clear
    chdev -l hdisk0 -a pv=yes
    extendvg rootvg hdisk0
Before we get lost in the activities ahead, let’s stop for a moment and regroup. What do we need to complete this process? We need (not necessarily in the order shown) to create the dump volumes (one per disk), re-mirror rootvg, and modify the bootlist so that it again includes both disks. OK.
To re-create and re-sync the mirrors in rootvg, the following command has to be executed.
    grdoras1:/root > mirrorvg -S -c 2 rootvg hdisk0 hdisk1
    0516-1804 chvg: The quorum change takes effect immediately.
    0516-1126 mirrorvg: rootvg successfully mirrored, user should perform
    bosboot of system to initialize boot records. Then, user must modify
    bootlist to include: hdisk1 hdisk0.
After my return, I check the state of my volume group. I notice that some logical volumes are still stale; apparently I returned sooner than expected (with the -S flag, mirrorvg returns immediately and synchronizes the copies in the background).
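One way to watch the background synchronization is to filter the lsvg -l listing for logical volumes whose state still contains "stale". This is only a sketch: stale_lvs is a hypothetical helper, and it assumes the usual lsvg -l layout (a volume group line, a header line, then rows where the sixth field is the LV STATE, for example open/stale or open/syncd).

```shell
# Print the logical volumes whose LV STATE still contains "stale".
# Reads `lsvg -l <vg>` output on stdin; the first two lines
# (volume group name and column header) are skipped.
stale_lvs() {
    awk 'NR > 2 && $6 ~ /stale/ { print $1 }'
}
```

Repeating `lsvg -l rootvg | stale_lvs` until it prints nothing would tell me when it is safe to move on to the bosboot and bootlist steps that mirrorvg asks for.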

