Got Duplicate PVIDs in Your User VG? Try Recreatevg!
I recently installed a new SP (service pack) level on AIX and then rebooted the box, as is usual practice for the changes to take effect. Upon checking the mounted file-systems post reboot, I noticed one of the file-systems hadn't mounted. I checked the VG (Volume Group) in question, called data_vg, and indeed its LVs (Logical Volumes) were closed. In fact the whole VG was offline, which I confirmed by listing the online VGs only:
# lsvg -o
rootvg
The VG (data_vg) was definitely offline but was present as far as AIX was aware.
# lsvg
rootvg
data_vg
A configuration listing is always generated and mailed to the team prior to a reboot, so I knew what devices should be present on the box. From that report I noticed that hdisk41 and hdisk44 had duplicate PVIDs (Physical Volume Identifiers) prior to the reboot, and they still had duplicate PVIDs after the reboot.
hdisk0              00cd94b6d734dfa2    rootvg
hdisk40             00cd94b6d382cda5    data_vg
hdisk41             00cd94b6d2bc9362    data_vg
hdisk44             00cd94b6d2bc9362    data_vg
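As an aside, if you don't already capture this kind of snapshot, a minimal pre-reboot report can be put together from a few standard commands. The script name, recipient address and layout below are my own illustration, not the report we actually use:

#!/bin/ksh
# pre_reboot_report.sh - sketch of a pre-reboot configuration report
(
  echo "Pre-reboot report for $(hostname) - $(date)"
  echo "--- Physical volumes ---";     lspv
  echo "--- Online volume groups ---"; lsvg -o
  echo "--- Mounted file-systems ---"; df -g
) | mail -s "pre-reboot config: $(hostname)" unix-team@example.com

Having the lspv output from before the reboot is what made the duplicate PVIDs jump out here.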
For the LVs, I had the following in the VG:
data_vg:
LV NAME    TYPE       LPs   PPs   PVs   LV STATE       MOUNT POINT
fslv03     jfs2log    1     1     1     closed/syncd   N/A
fslv06     jfs2       1     1     1     closed/syncd   /maps_uk
I then decided to manually vary the VG online:
# varyonvg data_vg
0516-1775 varyonvg: Physical volumes hdisk41 and hdisk44 have identical PVIDS (00cd94b6d2bc9362).
Now this was interesting, because no new disks had been imported that could cause this issue, and as far as I was aware no disk copies had been taken via the SAN. The correct multi-path drivers were installed, so I couldn't see them as the cause of the problem either. It needed looking into, but first I needed to get the VG online and the file-system mounted.
So I decided to export the VG and re-import to see what I ended up with:
# exportvg data_vg
# importvg -y data_vg hdisk44
0516-776 importvg: Cannot import hdisk44 as data_vg.
As can be seen from the above output, I couldn't import the VG data_vg; AIX was complaining about hdisk44. I needed to investigate this further.
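One quick sanity check at this point is to read the VGDA straight off the offending disk and see what the disk itself thinks it belongs to. This is how I would typically do it rather than a capture from this incident, so treat it as an aside:

# lqueryvg -p hdisk44 -At

The -A flag lists all the VGDA information held on the disk (VG ID, LV names and PV IDs), which is useful for deciding whether you are looking at a genuine member of the VG or a stray copy.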
Query the ODM
The next task was to check the two disks carrying the duplicate PVID by querying the disk headers directly, to make sure there were no mismatches on the disks themselves:
# lquerypv -h /dev/hdisk41 80 10
00000080   00CD94B6 D2BC9362 00000000 00000000  |.......b........|
# lquerypv -h /dev/hdisk44 80 10
00000080   00CD94B6 D2BC9362 00000000 00000000  |.......b........|
The second and third columns of the output together report the current PVID of the disk. In this case two different disks had the same PVID: not looking good, but fixable. So I queried the ODM (Object Data Manager) with the odmget command to see what AIX actually thought it had regarding the duplicate PVIDs:
# odmget -q "name=hdisk44 and attribute=pvid" CuAt CuAt: name = "hdisk44" attribute = "pvid" value = "00cd94b6d2bc93620000000000000000" type = "R" … # odmget -q "name=hdisk41 and attribute=pvid" CuAt CuAt: name = "hdisk41" attribute = "pvid" value = "00cd94b6d2bc93620000000000000000" …
Looking at the above output, the ODM reports both disks with the same PVID, so there was no mismatch between disk and ODM, which is what I expected to find.
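If you have more than a couple of suspect disks, the two checks can be wrapped into one small loop. This is just a sketch I would run from ksh; the awk field positions assume the lquerypv and odmget output formats shown above:

for d in hdisk41 hdisk44
do
  ondisk=$(lquerypv -h /dev/$d 80 10 | awk '{print $2 $3}')
  inodm=$(odmget -q "name=$d and attribute=pvid" CuAt | awk -F'"' '/value/ {print $2}')
  echo "$d  on-disk: $ondisk  ODM: $inodm"
done

Note that the on-disk value comes out in upper case and the ODM value in lower case with trailing zeros, so compare them by eye rather than with a string test.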
At this point I thought I would try and re-generate the PVID of one of the disks.
# chdev -a pv=yes -l hdisk41
hdisk41 changed
Hooray! AIX says it has changed the PVID, but let’s check that:
# lspv
hdisk0              00cd94b6d734dfa2    rootvg
hdisk40             00cd94b6d382cda5    None
hdisk41             00cd94b6d2bc9362    None
hdisk44             00cd94b6d2bc9362    None
The PVID hadn't changed at all. I tried the same procedure on the other disk and unfortunately got the same result: no change to the PVID.
I then considered clearing the PVID, with:
# chdev -a pv=clear -l hdisk41
However, this operation is permanent, and I didn't know what effect it would have on the data contained in the exported VG. As this was a production box, I decided to tread carefully; I didn't want to be put in the position of requesting a restore of the file-system from the NetBackup team.
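For the record, had I been prepared to throw the PVID away and re-assign it, the sequence would presumably have looked like this (I did not run it on the box, so treat it as an untested sketch):

# chdev -a pv=clear -l hdisk41
# chdev -a pv=yes -l hdisk41

The first command wipes the PVID from the disk, the second writes a brand new one. On a disk that still belongs to an exported VG I wasn't prepared to gamble on what that would do to the data.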
Try Redefining the VG
I couldn’t import the VG or change the PVID of the disks. The next option, with one eye on not losing the data, was to use redefinevg. This command is actually called as part of importvg, so I thought it was my best bet: not the only option available to me, but certainly the safest one. It redefines the disks that make up the volume group in the ODM, and you only need to specify one member disk of the VG; if all goes well it should bring the VG definition back in as well. So I tried that:
# redefinevg -d hdisk41 data_vg
No output was returned, which is a good sign. So let’s see if the VG is present:
# lsvg
rootvg
data_vg
So far so good. Now I tried to varyon the VG, to make it come online:
# varyonvg data_vg
0516-1775 varyonvg: Physical volumes hdisk41 and hdisk44 have identical PVIDs (00cd94b6d2bc9362).
Same issue: duplicate PVIDs! At least I was back to where I started. At this point I decided to let AIX regenerate the PVIDs with the recreatevg command.
First I decided to export the VG, and then remove the disks associated with the VG (data_vg):
# exportvg data_vg
# rmdev -dl hdisk41; rmdev -dl hdisk44; rmdev -dl hdisk40
Next I ran the cfgmgr command to rediscover the devices:
# cfgmgr
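Before going any further it is worth confirming that the disks have come back into the configuration as Available devices. A couple of quick checks will do it; the grep pattern reflects my hdisk numbering, so adjust to suit:

# lsdev -Cc disk | grep hdisk4
# lspv | grep hdisk4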
Recreate the VG with Confidence
Now I was in a strong position to bring the data back in and online, by using recreatevg. This command will literally recreate the VG in question: it reads the LVM information held on the disks, rebuilds the ODM entries and assigns new PVIDs to the disks contained in the recreated VG. Typically this command is used when you have cloned a VG's disks on the SAN and want to import the copy back onto the same box.
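As an illustration of that typical cloning use, you point recreatevg at the copied disks and give the clone its own VG name and prefixes. The names below (clone_vg, the fs prefix, /clone and the hdisk numbers) are made up for the example:

# recreatevg -y clone_vg -Y fs -L /clone hdisk60 hdisk61

That would bring the copy in as clone_vg, with the LV names prefixed fs and the mount points placed under /clone in /etc/filesystems, so nothing clashes with the original VG.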
The format of the command I was going to use was:
recreatevg -Y NA <vg_name> <hdisk ...>
With recreatevg, unlike the importvg command, you need to specify all the disks that are associated with the VG; it doesn’t matter what order the hdisks are in. With -Y you can specify a prefix for the LV names, and with -L a prefix for the labels, which are the mount points recorded in the /etc/filesystems file. As I wasn’t cloning a set of file-systems, and just wanted the VG and file-systems back as they were, I specified NA, which means do not prefix.
# recreatevg -YNA data_vg hdisk40 hdisk41 hdisk44
# echo $?
0
There was no output from the above command, though I had expected it to echo out what LVs or file-systems it had imported, so I checked the exit status from the command line. A zero (0) means it completed with no errors, which filled me with confidence. It was all starting to look good.
# lsvg -o
rootvg
data_vg
The VG data_vg had been successfully recreated and varied on. I then checked that I had no duplicate PVIDs; looking at the output below, no duplicates were found.
# lspv
hdisk0              00cd94b6d734dfa2    rootvg
hdisk40             00cd94b6d3152d26    data_vg
hdisk41             00cd94b6d3466c48    data_vg
hdisk44             00cd94b6d2bc8373    data_vg
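On a box with many more disks, eyeballing lspv for duplicates gets error-prone, so a one-liner along these lines (my own habit, not part of the fix) will print any PVID that appears more than once:

# lspv | awk '{print $2}' | sort | uniq -d

Ignore a duplicated "none" in that output; it just means more than one disk has no PVID assigned.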
Then I checked the LV listing contained in the VG; all looked correct. Next I checked the file-system by running a fsck on it:
# fsck -y /dev/fslv06
This returned OK. Next task: mount the file-system. But first I double-checked /etc/filesystems to make sure AIX hadn’t prefixed the mount point. It hadn’t, which is what I expected:
# grep "maps_uk" /etc/filesystems /maps_uk:
No prefixes to change, so OK to mount then:
# mount -a
# df -g | grep maps
/dev/fslv06      40.00     32.45   19%     1880     1% /maps_uk
Lastly, I synced the ODM with any changes the recreatevg might have made to the VG data_vg:
# synclvodm data_vg
synclvodm: Physical volume data updated.
synclvodm: Logical volume fslv06 updated.
synclvodm: Logical volume fslv03 updated.
Now the system was usable to the business. I could hand it back to the application team.
Conclusion
Using recreatevg, the duplicate PVIDs disappeared, new PVIDs were generated and the VG was recreated successfully. More importantly, I didn’t have to restore any data, as I had managed to recreate the VG with no data loss.
The next task was to try to find out why I had duplicate PVIDs in the first place. The file /tmp/lvmt.log was my friend here, and it pointed me in the right direction. The lvmt.log file records the LVM commands and processes that affect an LV; think of it as a history file of everything that has been executed against an LV or VG. It looks like at some point someone had taken a SAN snapshot of one of the disks but hadn’t removed it, and at some point it got pulled into the VG.
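If you need to do the same kind of digging, and the log on your box is plain text, a simple search for the hdisks involved narrows things down quickly; the disk names below are obviously specific to my incident:

# egrep "hdisk41|hdisk44" /tmp/lvmt.log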