AIX Power replacing (hot-swap) failed disk in rootvg

After logging in, I had to verify that it was indeed hdisk0 that had died:

grdoras1:/root >  lsvg -p rootvg
rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk1    active    546         383         109..18..38..109..109
hdisk0    missing   546         383         109..18..38..109..109
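
The same check can be scripted. What follows is only a minimal sketch, assuming the column layout shown above; missing_pvs is a hypothetical helper name, and it reads the lsvg -p output on standard input so it can be tried against saved output first.

```shell
# List physical volumes whose PV STATE is anything other than "active".
# Reads `lsvg -p <vg>` output on stdin; line 1 is the VG name, line 2 the header.
missing_pvs() {
    awk 'NR > 2 && $2 != "active" { print $1 }'
}

# Sample taken from the output above.
sample='rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk1    active    546         383         109..18..38..109..109
hdisk0    missing   546         383         109..18..38..109..109'

printf '%s\n' "$sample" | missing_pvs
```

On the host itself this would be fed from the live command, as in lsvg -p rootvg | missing_pvs.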

Let’s try to “wake” this disk. Maybe it will come back online?

grdoras1:/root >  varyonvg rootvg
0516-1747 varyonvg: Cannot varyon volume group with an active dump device on a missing physical volume. Use sysdumpdev to temporarily replace the dump device with /dev/sysdumpnull and try again.
grdoras1:/root >

I want to use this opportunity to resize the dump volumes on this host, so I deactivate both of them, as I will remove them both later.

grdoras1:/root >  sysdumpdev -P -p /dev/sysdumpnull
primary              /dev/sysdumpnull
secondary            /dev/dump1
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     ON
grdoras1:/root >  sysdumpdev -P -s /dev/sysdumpnull
primary              /dev/sysdumpnull
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     ON
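
To see at a glance which dump devices still point at a real logical volume, the listing can be filtered. This is a sketch under the assumption that the layout matches the output above; active_dump_devices is a hypothetical helper name.

```shell
# Print dump devices that still point at something other than /dev/sysdumpnull.
# Reads sysdumpdev-style output (as printed above, or by `sysdumpdev -l`) on stdin.
active_dump_devices() {
    awk '($1 == "primary" || $1 == "secondary") && $2 != "/dev/sysdumpnull" { print $1, $2 }'
}

# Sample: the state after only the primary device was redirected (first output above).
after_primary='primary              /dev/sysdumpnull
secondary            /dev/dump1
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     ON'

printf '%s\n' "$after_primary" | active_dump_devices
```

For the first listing this prints the secondary device; once both devices are redirected it prints nothing.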

For the very last time – can I wake it up?

grdoras1:/root >  varyonvg rootvg
grdoras1:/root >  0516-934 /etc/syncvg: Unable to synchronize logical volume hd5.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd8.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd4.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd2.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd9var.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd3.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd1.
0516-934 /etc/syncvg: Unable to synchronize logical volume hd10opt.
0516-934 /etc/syncvg: Unable to synchronize logical volume fslv00.
0516-934 /etc/syncvg: Unable to synchronize logical volume rootlv.
0516-934 /etc/syncvg: Unable to synchronize logical volume local_lv.
0516-932 /etc/syncvg: Unable to synchronize volume group rootvg.

Apparently, hdisk0 is really dead. No joke. Let’s check whether any paging space on hdisk0 needs to be deactivated as well.

grdoras1:/root >  lsps -a
Page Space Physical Volume Volume Group  Size %Used Active  Auto  Type
hd6      hdisk1      rootvg      8192MB     1   yes   yes    lv
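
With more paging spaces than this, picking out the ones on a given disk by eye gets tedious. A minimal sketch, assuming the lsps -a column order shown above (paging space first, physical volume second); paging_on_disk is a hypothetical helper name.

```shell
# Print the paging spaces that live on the given disk.
# Reads `lsps -a` output on stdin; line 1 is the header.
paging_on_disk() {
    awk -v disk="$1" 'NR > 1 && $2 == disk { print $1 }'
}

# Sample taken from the output above.
sample='Page Space Physical Volume Volume Group  Size %Used Active  Auto  Type
hd6      hdisk1      rootvg      8192MB     1   yes   yes    lv'

printf '%s\n' "$sample" | paging_on_disk hdisk1
```

Asking the same sample about hdisk0 prints nothing, which matches what the listing shows.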

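Today, I am lucky. No paging space has been defined on hdisk0. Otherwise, I would have to execute the next command (where swap_lv is the name of the paging space to be removed). Note that chps -a n only stops the paging space from being activated at the next restart; a paging space that is currently active also has to be deactivated (for example with swapoff /dev/swap_lv) before it can be removed.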

chps -a n swap_lv

The deactivated paging space then has to be removed by executing:

rmps swap_lv

The call-home facility of the host provided IBM with all the information about the missing disk, so I was not asked to provide any FRU or Z? information (lscfg -vl hdisk0 would provide all the answers). To satisfy my own curiosity, I execute the next command and proceed with the remaining tasks.

grdoras1:/root >  lsdev -Cc disk | grep hdisk0
hdisk0  Available 06-08-01-5,0 16 Bit LVD SCSI Disk Drive

The bootlist has to be modified as hdisk0 is useless and it cannot be used as a boot device.

grdoras1:/root >  bootlist -m normal -o
hdisk0
hdisk1 blv=hd5

hdisk1 will be the only device the host knows to boot from.

grdoras1:/root >  bootlist -m normal hdisk1
grdoras1:/root >  bootlist -m normal -o
hdisk1 blv=hd5
grdoras1:/root >  savebase
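
Whether a disk is still named in the boot list can also be tested from a script. This is a small sketch, assuming the bootlist -m normal -o layout shown above (device name in the first column); in_bootlist is a hypothetical helper name.

```shell
# Succeed (exit status 0) if the given disk appears in the boot list.
# Reads `bootlist -m normal -o` output on stdin.
in_bootlist() {
    awk -v disk="$1" '$1 == disk { found = 1 } END { exit !found }'
}

# Samples taken from the two listings above.
before='hdisk0
hdisk1 blv=hd5'
after='hdisk1 blv=hd5'

if printf '%s\n' "$before" | in_bootlist hdisk0; then
    echo "before: hdisk0 is in the boot list"
fi
if ! printf '%s\n' "$after" | in_bootlist hdisk0; then
    echo "after: hdisk0 is no longer in the boot list"
fi
```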

The dead disk has to be removed from its volume group, which is a two-step process. First, the mirror has to be broken by removing the copy residing on hdisk0.

grdoras1:/root >  unmirrorvg -c 1 rootvg hdisk0
0516-1246 rmlvcopy: If hd5 is the boot logical volume, please run 'chpv -c ' as root user to clear the boot record and avoid a potential boot off an old  boot image that may reside on the disk from which this logical volume is moved/removed.
 
0516-1804 chvg: The quorum change takes effect immediately.
0516-1144 unmirrorvg: rootvg successfully unmirrored, user should perform bosboot of system to reinitialize boot records.  Then, user must modify bootlist to just include:  hdisk1.
grdoras1:/root >

The next step is the actual removal of the disk from its volume group.

grdoras1:/root >  reducevg rootvg hdisk0
0516-016 ldeletepv: Cannot delete physical volume with allocated partitions. Use either migratepv to move the partitions or reducevg with the -d option to delete the partitions.
0516-884 reducevg: Unable to remove physical volume hdisk0.

The command fails, and not because of a bug. Well, we could say that it is an error and that the error is mine. I deactivated the dump volume residing on hdisk0, but as far as AIX is concerned this logical volume still exists and still has partitions allocated on the disk. So I have to remove it (regardless of whether the disk is dead or not). As you can see next, AIX can still report the partition map of the disk.

grdoras1:/root >  lspv -M hdisk0
hdisk0:1-193
hdisk0:194      dump0:1
hdisk0:195      dump0:2
hdisk0:196      dump0:3
hdisk0:197      dump0:4
hdisk0:198      dump0:5
hdisk0:199      dump0:6
hdisk0:200      dump0:7
hdisk0:201      dump0:8
hdisk0:202-546
grdoras1:/root >  rmlv dump0
Warning, all data contained on logical volume dump0 will be destroyed.
rmlv: Do you wish to continue? y(es) n(o)?
y
rmlv: Logical volume dump0 is removed.
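
The lspv -M listing used above can be reduced to just the names of the logical volumes that still hold partitions on a disk, which is what reducevg objects to. A minimal sketch, assuming the layout shown above (allocated partitions carry a second LV:LP field, free ranges do not); lvs_on_disk is a hypothetical helper name.

```shell
# Print the logical volumes that still have partitions on a disk.
# Reads `lspv -M <disk>` output on stdin.
lvs_on_disk() {
    awk 'NF >= 2 { split($2, f, ":"); print f[1] }' | sort -u
}

# Sample taken from the lspv -M output above (shortened to three allocated partitions).
sample='hdisk0:1-193
hdisk0:194      dump0:1
hdisk0:195      dump0:2
hdisk0:196      dump0:3
hdisk0:202-546'

printf '%s\n' "$sample" | lvs_on_disk
```

An empty result means nothing is allocated on the disk any more and reducevg should have nothing left to complain about.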

Next in line is the removal of the disk from the volume group.

grdoras1:/root >  reducevg rootvg hdisk0
grdoras1:/root >  lsvg -p rootvg
rootvg:
PV_NAME   PV STATE   TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk1    active    546         383         109..18..38..109..109
grdoras1:/root >  lspv
hdisk0          00ca1ef03c7d9ff2                    None
hdisk1          00ca1ef0454c8913                    rootvg          active

If I executed the next command and removed the disk definition, I would not be able to see the disk when running the diag hot-plug tasks…

grdoras1:/root >  rmdev -dl hdisk0
hdisk0 deleted
grdoras1:/root >

I would have to execute the configuration manager, aka the cfgmgr command, to get it back so that diag could present it to me. Execute diag, hit the ENTER key, move down to Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.), hit CTRL-V once, and move down to Hot Plug Task. On the next screen select SCSI and SCSI RAID Hot Plug Manager, then move down and select Replace/Remove a Device Attached to an SCSI Hot Swap Enclosure Device. On the next and final screen select the desired disk (hdisk0 in our case). Hit the ENTER key to confirm your selection and proceed with the disk replacement.
Hit the same key again to declare the disk swap complete. After you leave diag and are back at the command prompt, execute cfgmgr (for good measure).

If you are a purist, you may execute the next two steps. Otherwise, proceed directly to the task at hand and add the new hdisk0 to rootvg.

chdev -l hdisk0 -a pv=clear
chdev -l hdisk0 -a pv=yes

extendvg rootvg hdisk0

Before we get lost in the activities ahead, let’s stop for a moment and regroup. What do we need to complete this process? We need (not necessarily in the order shown) to re-create the dump volumes (one per disk), to re-mirror rootvg, and to modify the bootlist so that it again includes both disks. OK.
To re-create and re-sync the mirrors in rootvg, the following command has to be executed.

grdoras1:/root >  mirrorvg -S -c 2 rootvg hdisk0 hdisk1
0516-1804 chvg: The quorum change takes effect immediately.
0516-1126 mirrorvg: rootvg successfully mirrored, user should perform
bosboot of system to initialize boot records.  Then, user must modify
bootlist to include:  hdisk1 hdisk0.
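
The mirrorvg message names two follow-up steps, bosboot and bootlist. Below is a dry-run sketch of them, not a record of what was run on this host: run and finish_mirror are hypothetical helper names, and by default the script only prints the commands. The dump volumes are left out on purpose, since their new size is not given here.

```shell
# Dry-run sketch: prints the follow-up commands instead of running them.
# Set DRY_RUN=0 on the AIX host to execute them for real.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

finish_mirror() {
    new_disk=$1    # the replaced disk, hdisk0 in this article
    old_disk=$2    # the surviving disk, hdisk1 in this article
    run bosboot -ad "/dev/$new_disk"                  # write a boot image to the new disk
    run bootlist -m normal "$old_disk" "$new_disk"    # allow booting from either disk
    run bootlist -m normal -o                         # display the resulting boot list
}

finish_mirror hdisk0 hdisk1
```

Keeping the commands behind a dry-run switch makes it possible to review the exact sequence before anything touches the boot records.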

After my return, I check the state of my volume group. I notice that some logical volumes are still stale; apparently I returned sooner than expected.
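
One way to check for stale logical volumes is to filter the lsvg -l listing on its LV STATE column. This is a sketch only: stale_lvs is a hypothetical helper name, and the sample below is made up in the lsvg -l layout rather than copied from this host.

```shell
# Print logical volumes whose LV STATE contains "stale".
# Reads `lsvg -l <vg>` output on stdin; line 1 is the VG name, line 2 the header.
stale_lvs() {
    awk 'NR > 2 && $6 ~ /stale/ { print $1 }'
}

# Hypothetical sample in the `lsvg -l` layout; the values are illustrative only.
sample='rootvg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
hd5                 boot       1       2       2    closed/syncd  N/A
hd4                 jfs2       4       8       2    open/stale    /
hd2                 jfs2       20      40      2    open/syncd    /usr'

printf '%s\n' "$sample" | stale_lvs
```

On the host this would be lsvg -l rootvg | stale_lvs, repeated until it prints nothing.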