AIX Power replacing (hot-swap) failed disk in rootvg
After logging in, I had to verify that it was indeed hdisk0 that died:
    grdoras1:/root > lsvg -p rootvg
    rootvg:
    PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
    hdisk1            active            546         383         109..18..38..109..109
    hdisk0            missing           546         383         109..18..38..109..109
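If you would rather not eyeball the listing, the same check can be scripted. This is only a minimal sketch: the function name missing_pvs is my own invention, and it assumes the `lsvg -p` layout shown above (a volume group line, a header line, then one line per disk with the state in the second column).

```shell
# Print the names of physical volumes whose state is not "active".
# Reads `lsvg -p <vg>` output on stdin. Assumes line 1 is "<vg>:" and
# line 2 is the column header, as in the listing above.
missing_pvs() {
    awk 'NR > 2 && NF >= 2 && $2 != "active" { print $1 }'
}

# Hypothetical usage on the AIX host:
#   lsvg -p rootvg | missing_pvs
```

An empty result means every disk in the volume group is reported as active.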
Let’s try to “wake” this disk. Maybe it will come back on line?
    grdoras1:/root > varyonvg rootvg
    0516-1747 varyonvg: Cannot varyon volume group with an active dump device on a missing physical volume. Use sysdumpdev to temporarily replace the dump device with /dev/sysdumpnull and try again.
    grdoras1:/root >
I want to use this opportunity to resize the dump volumes of this host, so I deactivate both of them, as I will remove them later anyway.
    grdoras1:/root > sysdumpdev -P -p /dev/sysdumpnull
    primary              /dev/sysdumpnull
    secondary            /dev/dump1
    copy directory       /var/adm/ras
    forced copy flag     TRUE
    always allow dump    TRUE
    dump compression     ON
    grdoras1:/root > sysdumpdev -P -s /dev/sysdumpnull
    primary              /dev/sysdumpnull
    secondary            /dev/sysdumpnull
    copy directory       /var/adm/ras
    forced copy flag     TRUE
    always allow dump    TRUE
    dump compression     ON
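Before retrying varyonvg, it is easy to confirm that no real dump device is left configured. The helper below is a sketch of my own (active_dump_devices is a made-up name); it assumes the "primary" and "secondary" lines look the way sysdumpdev printed them above.

```shell
# Print any dump device that is still pointing at something other than
# /dev/sysdumpnull. Reads sysdumpdev output on stdin and assumes the
# "primary <device>" / "secondary <device>" lines shown above.
active_dump_devices() {
    awk '($1 == "primary" || $1 == "secondary") && $2 != "/dev/sysdumpnull" { print $1, $2 }'
}

# Hypothetical usage on the AIX host:
#   sysdumpdev -l | active_dump_devices
```

No output means both dump devices have been parked on /dev/sysdumpnull.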
For the very last time – can I wake it up?
    grdoras1:/root > varyonvg rootvg
    grdoras1:/root > 0516-934 /etc/syncvg: Unable to synchronize logical volume hd5.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd8.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd4.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd2.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd9var.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd3.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd1.
    0516-934 /etc/syncvg: Unable to synchronize logical volume hd10opt.
    0516-934 /etc/syncvg: Unable to synchronize logical volume fslv00.
    0516-934 /etc/syncvg: Unable to synchronize logical volume rootlv.
    0516-934 /etc/syncvg: Unable to synchronize logical volume local_lv.
    0516-932 /etc/syncvg: Unable to synchronize volume group rootvg.
Apparently, hdisk0 is really dead. No joke. Let's check whether any swap space on hdisk0 needs to be deactivated as well.
    grdoras1:/root > lsps -a
    Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
    hd6             hdisk1            rootvg        8192MB     1   yes   yes    lv
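On a host with many paging spaces, the same question ("is anything paging on the dead disk?") can be answered with a small filter. This is a sketch only; paging_on_disk is a name I made up, and it relies on the `lsps -a` column order shown above (paging space first, physical volume second).

```shell
# Print the paging spaces that reside on the given physical volume.
# Reads `lsps -a` output on stdin; line 1 is the header.
# $1: physical volume name, e.g. hdisk0
paging_on_disk() {
    awk -v disk="$1" 'NR > 1 && $2 == disk { print $1 }'
}

# Hypothetical usage on the AIX host:
#   lsps -a | paging_on_disk hdisk0
```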
Today, I am lucky. No swap has been defined on hdisk0. Otherwise, we would have to execute the next command (where swap_lv is the name of the swap volume to be removed).
    chps -a n swap_lv
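Note that chps -a n only keeps the paging space from being activated at the next restart; a paging space that is currently active also has to be taken offline (swapoff /dev/swap_lv) before it can be removed. The deactivated volume is then removed by executing: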
    rmps swap_lv
The call-home facility provided IBM with all the information about the missing disk, so I was not asked to provide any FRU or Z? info (lscfg -vl hdisk0 would provide all the answers). To satisfy my own curiosity, I execute the next command and proceed with the remaining tasks.
    grdoras1:/root > lsdev -Cc disk | grep hdisk0
    hdisk0 Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive
The bootlist has to be modified, as hdisk0 is useless and cannot be used as a boot device.
    grdoras1:/root > bootlist -m normal -o
    hdisk0
    hdisk1 blv=hd5
From now on, hdisk1 will be the only device the host knows to boot from.
    grdoras1:/root > bootlist -m normal hdisk1
    grdoras1:/root > bootlist -m normal -o
    hdisk1 blv=hd5
    grdoras1:/root > savebase
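With only two disks, typing the surviving one by hand is the obvious way. For longer boot lists, the new list could be derived from the old one. This is a rough sketch under my own assumptions: boot_devices_without is a hypothetical helper, and it expects the one-device-per-line output of `bootlist -m normal -o` shown above.

```shell
# Print the current boot devices on one line, leaving out the failed disk.
# Reads `bootlist -m normal -o` output on stdin: one device per line,
# optionally followed by attributes such as blv=hd5 (which are dropped).
# $1: device to exclude, e.g. hdisk0
boot_devices_without() {
    awk -v bad="$1" 'NF > 0 && $1 != bad { printf "%s%s", sep, $1; sep = " " }
                     END { print "" }'
}

# Hypothetical usage on the AIX host (check that the list is not empty
# before handing it to bootlist):
#   new_list=$(bootlist -m normal -o | boot_devices_without hdisk0)
#   [ -n "$new_list" ] && bootlist -m normal $new_list
```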
The dead disk has to be removed from its volume group, which is a two-step process. First, the mirror has to be broken by removing the copy residing on hdisk0.
    grdoras1:/root > unmirrorvg -c 1 rootvg hdisk0
    0516-1246 rmlvcopy: If hd5 is the boot logical volume, please run 'chpv -c ' as root user to clear the boot record and avoid a potential boot off an old boot image that may reside on the disk from which this logical volume is moved/removed.
    0516-1804 chvg: The quorum change takes effect immediately.
    0516-1144 unmirrorvg: rootvg successfully unmirrored, user should perform bosboot of system to reinitialize boot records. Then, user must modify bootlist to just include: hdisk1.
    grdoras1:/root >
The next step is the actual removal of the disk from its volume group.
    grdoras1:/root > reducevg rootvg hdisk0
    0516-016 ldeletepv: Cannot delete physical volume with allocated partitions. Use either migratepv to move the partitions or reducevg with the -d option to delete the partitions.
    0516-884 reducevg: Unable to remove physical volume hdisk0.
The command fails, but not because something is broken. Well, we could say that it is an error and that the error is mine. OK. I deactivated the dump volume residing on hdisk0, but as far as AIX is concerned this volume still exists: its partitions are still allocated on the disk. So I have to remove it (regardless of whether the disk is dead or not). As you can see next, AIX can still report what is allocated on the disk.
    grdoras1:/root > lspv -M hdisk0
    hdisk0:1-193
    hdisk0:194      dump0:1
    hdisk0:195      dump0:2
    hdisk0:196      dump0:3
    hdisk0:197      dump0:4
    hdisk0:198      dump0:5
    hdisk0:199      dump0:6
    hdisk0:200      dump0:7
    hdisk0:201      dump0:8
    hdisk0:202-546
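Here the map is short enough to read, but on a busier disk it helps to boil it down to the list of logical volumes that still hold partitions. A minimal sketch, assuming the `lspv -M` layout above (free ranges have one field, allocated partitions have the LV name and LP number in the second field); lvs_on_pv is my own name for it.

```shell
# Print each logical volume that still has partitions on the disk, once.
# Reads `lspv -M <pv>` output on stdin. Lines with a single field are
# free partition ranges and are skipped.
lvs_on_pv() {
    awk 'NF >= 2 {
             split($2, part, ":")          # "dump0:1" -> part[1] = "dump0"
             if (!(part[1] in seen)) { seen[part[1]] = 1; print part[1] }
         }'
}

# Hypothetical usage on the AIX host:
#   lspv -M hdisk0 | lvs_on_pv
```

Anything this prints has to be moved or removed before reducevg will let the disk go.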
    grdoras1:/root > rmlv dump0
    Warning, all data contained on logical volume dump0 will be destroyed.
    rmlv: Do you wish to continue? y(es) n(o)? y
    rmlv: Logical volume dump0 is removed.
Next in line is the removal of the disk from the volume group.
    grdoras1:/root > reducevg rootvg hdisk0
    grdoras1:/root > lsvg -p rootvg
    rootvg:
    PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
    hdisk1            active            546         383         109..18..38..109..109
    grdoras1:/root > lspv
    hdisk0          00ca1ef03c7d9ff2                    None
    hdisk1          00ca1ef0454c8913                    rootvg          active
If I executed the next command and removed the disk definition, I would not be able to see the disk in the diag hot-plug tasks…
    grdoras1:/root > rmdev -dl hdisk0
    hdisk0 deleted
    grdoras1:/root >
I would have to execute the configmangler aka the cfgmgr command to get it back so diag could present it for me. Execute diag, hit the ENTER key, slide down to the Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.), hit CTRL-V once, and slide to the Hot Plug Task. From the next screen select SCSI and SCSI RAID Hot Plug Manager, slide down and select Replace/Remove a Device Attached to an SCSI Hot Swap Enclosure Device. From the next and final screen select the desired disk (hdisk0 in our case). Hit the ENTER key to confirm your selection and proceed with the disk replacement.
Hit the same key again to declare the swap of the disk completed. After you leave diag and are back at the command prompt, execute cfgmgr (for good measure).
If you are a purist, you may execute the next two steps. Otherwise, proceed directly to the task at hand and add the new hdisk0 to the host's rootvg.
    chdev -l hdisk0 -a pv=clear
    chdev -l hdisk0 -a pv=yes
    extendvg rootvg hdisk0
Before we get lost in the activities ahead, let's stop for a moment and regroup. What do we need to complete this process? We need (not necessarily in the order shown) to create the dump volumes (one per disk), re-mirror rootvg, and modify the bootlist so that it again includes both disks. OK.
To re-create and re-sync the mirrors in rootvg, the following command has to be executed.
    grdoras1:/root > mirrorvg -S -c 2 rootvg hdisk0 hdisk1
    0516-1804 chvg: The quorum change takes effect immediately.
    0516-1126 mirrorvg: rootvg successfully mirrored, user should perform bosboot of system to initialize boot records. Then, user must modify bootlist to include: hdisk1 hdisk0.
After my return, I check the state of my volume group. I notice that some logical volumes are still stale; apparently I returned sooner than expected, since the -S flag of mirrorvg lets the synchronization run in the background.
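To see which logical volumes are still catching up without scanning the whole listing, the output of `lsvg -l rootvg` can be filtered. This is a sketch under an assumption of mine: that the listing starts with a "rootvg:" line and a header line, and that each logical volume line carries a state such as open/syncd or open/stale. The function name stale_lvs is made up.

```shell
# Print the logical volumes whose state ends in "/stale".
# Reads `lsvg -l <vg>` output on stdin; the first two lines (volume group
# name and column header) are skipped. The state column is located by
# its "/stale" suffix rather than by position.
stale_lvs() {
    awk 'NR > 2 {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /\/stale$/) { print $1; break }
         }'
}

# Hypothetical usage on the AIX host:
#   lsvg -l rootvg | stale_lvs
```

When this prints nothing, the mirrors have finished synchronizing.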