Replacing a hard drive with Solaris Volume Manager

Last time I posted about my experience replacing a drive in an array created with Veritas Volume Manager. That was a RAID 0 array that lost its data the moment the drive died, so I didn't worry about preserving anything while rebuilding it. This time I was rebuilding a RAID 1 array built with Solaris Volume Manager. When this drive died, the data was still intact on the remaining functional drive, and I wanted to keep it that way. As far as disclaimers go: I kept my data with no problems, but don't just copy and paste commands without understanding how they will affect your system. Every server is different, so don't assume yours is set up the same as mine!

This is a Solaris 9 server with two IDE drives set up as four separate mirrors: d0, d1, d3, and d7. Since there are two hard drives, there are two components, or submirrors, under each mirror. The submirror on the first drive is always named 1x, where x is the number of the parent mirror, and the submirror on the second drive is always 2x. So d0 is the parent mirror, d10 is the drive-one submirror, and d20 is the drive-two submirror. Make sense? Let's go!

The first problem shows up if you have two drives, don't notice that you've lost one, and then the system reboots for some reason. The issue is with the metadevice state database replicas. Sun recommends storing at least one replica on each of your hard drives, so with only two drives you very likely have an even number of replicas split between them. Solaris 9 uses a majority consensus algorithm to detect stale databases and will not boot unless more than half of the total replicas are online. With two drives and one dead, only half of your replicas are left, so the system will not boot on its own. Here's the console output for this situation:

metainit: hostname: stale databases

Insufficient metadevice database replicas located.

Use metadb to delete databases which are broken.
Ignore any "Read-only file system" error messages.
Reboot the system when finished to reload the metadevice database.
After reboot, repair any broken database replicas which were deleted.

Type control-d to proceed with normal startup,
(or give root password for system maintenance):
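
Before deleting anything, it can help to double-check the quorum arithmetic yourself. Here's a minimal sketch, relying on the metadb convention that problem replicas carry capital-letter flags (like the M above) while healthy ones use lowercase only:

# metadb | grep -c /dev/dsk
# metadb | grep /dev/dsk | grep -vc "[A-Z]"

The first number is the total replica count and the second is the healthy count; if the healthy count isn't more than half of the total, you'll keep landing at this maintenance prompt until the bad replicas are dealt with.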

The solution is to simply remove the bad replicas and reboot. First check to see how the replicas are defined:

# metadb -i
        flags           first blk       block count
    M     p             16              unknown         /dev/dsk/c0t0d0s4
    M     p             4112            unknown         /dev/dsk/c0t0d0s4
     a m  p  lu         16              4096            /dev/dsk/c0t2d0s4
     a    p  l          4112            4096            /dev/dsk/c0t2d0s4

Remove the replicas from the dead drive:

# metadb -d c0t0d0s4
metadb: rembrandt: c0t0d0s4: no metadevice database replica on device

The metadb command complains because the device itself is gone, but recheck and you'll see the replica entries were removed anyway:

# metadb -i
        flags           first blk       block count
     a m  p  lu         16              4096            /dev/dsk/c0t2d0s4
     a    p  l          4112            4096            /dev/dsk/c0t2d0s4

Once you log out of the maintenance shell, the system continues booting, then resets once more before coming up fully.

# exit
logout
Resuming system initialization. Metadevice database will remain stale.

Once the system is back up and running, we can go about replacing the bad drive. The first thing you want to do is back up your LVM configs:

# cp -r /etc/lvm /root/lvm.backup
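
While you're at it, it doesn't hurt to also capture the current metadevice state for later reference; a quick sketch (the file names are just examples):

# metastat > /root/lvm.backup.metastat
# metadb -i > /root/lvm.backup.metadb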

Now take a look at which mirrors and drives you have down. To keep the post shorter, I am only showing one mirror here; the others look the same. You'll also notice at the end that, in my case, the dead drive doesn't show up at all in the Device Relocation Information, which is a pretty good sign that it really is dead. If it is still listed, there's a chance the drive is still good. (I'll go over that scenario in my next post.)

# metastat

… <SNIP> …

d0: Mirror
Submirror 0: d10
State: Needs maintenance
Submirror 1: d20
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 6295440 blocks (3.0 GB)

d10: Submirror of d0
State: Needs maintenance
Invoke: metareplace d0 c0t0d0s0
Size: 6295440 blocks (3.0 GB)
Stripe 0:
Device       Start Block  Dbase  State        Reloc  Hot Spare
c0t0d0s0     0            No     Maintenance  Yes

d20: Submirror of d0
State: Okay
Size: 6295440 blocks (3.0 GB)
Stripe 0:
Device       Start Block  Dbase  State        Reloc  Hot Spare
c0t2d0s0     0            No     Okay         Yes

Device Relocation Information:
Device   Reloc  Device ID
c0t2d0   Yes    id1,dad@AWDC_WD1200BB-00CAA1=WD-WMA8C2168885

This can also be verified with iostat. If the drive still shows up there, the output gives you useful information like the model and serial number. If it's gone, the output is still useful: you know which drive(s) are good, and by process of elimination, which one is bad.

# iostat -En
c0t2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: WDC WD1200BB-00C Revision: 17.07W17 Serial No: WD-WMA8C2168885
Size: 120.03GB <120031641600 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0

This can also be verified through format and cfgadm:

# format
Searching for disks…done

AVAILABLE DISK SELECTIONS:
0. c0t2d0
/pci@1f,0/ide@d/dad@2,0
Specify disk (enter its number):

# cfgadm -al
Ap_Id            Type      Receptacle  Occupant      Condition
c0               scsi-bus  connected   configured    unknown
c0::dsk/c0t2d0   disk      connected   configured    unknown
c0::dsk/c0t3d0   CD-ROM    connected   configured    unknown
usb0/1           unknown   empty       unconfigured  ok
usb0/2           unknown   empty       unconfigured  ok

Since I had all the information I needed, I shut down the server (these are IDE drives, not hot-swappable) and replaced the faulty drive with another of the same model and size. Once the system was back up, I went into format, confirmed it saw the new drive, and cleared out the old partitions since I had previously used this drive for something else. I'm not going to walk through all the format screens; they're self-explanatory. The sequence was format > 0 > part > zero out every partition except partition 2 > label > quit > quit. Next, I needed to partition the drive to match the drive it was going to mirror, and the prtvtoc command makes this easy. Just make sure you get the drive names the right way around (read from the good drive, write to the new one):

# prtvtoc -h /dev/rdsk/c0t2d0s2 | fmthard -s - /dev/rdsk/c0t0d0s2
fmthard: New volume table of contents now in place.

You can verify that both partition tables now match, if you like, by comparing the output of these two commands:

# prtvtoc -h /dev/rdsk/c0t0d0s2
# prtvtoc -h /dev/rdsk/c0t2d0s2
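
Or, as a quick sketch, dump each table to a temporary file and diff them (the file names here are just examples):

# prtvtoc -h /dev/rdsk/c0t0d0s2 > /tmp/vtoc.c0t0d0
# prtvtoc -h /dev/rdsk/c0t2d0s2 > /tmp/vtoc.c0t2d0
# diff /tmp/vtoc.c0t0d0 /tmp/vtoc.c0t2d0

No output from diff means the two tables match.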

Now go into your LVM backup and cat out md.cf, and keep a copy of it in another window, since you will be referring back to it several times. The lines you'll need are the submirror entries for the failed drive (the d1x lines on c0t0d0 below, in my case):

# cd /root/lvm.backup/
# cat md.cf
d3 -m d13 d23 1
d13 1 1 c0t0d0s3
d23 1 1 c0t2d0s3
d7 -m d17 d27 1
d17 1 1 c0t0d0s7
d27 1 1 c0t2d0s7
d1 -m d11 d21 1
d11 1 1 c0t0d0s1
d21 1 1 c0t2d0s1
d0 -m d10 d20 1
d10 1 1 c0t0d0s0
d20 1 1 c0t2d0s0

Now we will go through the process of detaching the failed submirror, clearing it, rebuilding it, and then reattaching it. Go through this for every submirror that has failed. I will show the output from the first round, then just the commands for the remaining rounds. The third command (metainit) simply reuses that submirror's line from the md.cf file:

# metadetach -f d3 d13
d3: submirror d13 is detached
# metaclear d13
d13: Concat/Stripe is cleared
# metainit d13 1 1 c0t0d0s3
d13: Concat/Stripe is setup
# metattach d3 d13
d3: submirror d13 is attached

Now you can check the status of the resync with metastat:

# metastat d3
d3: Mirror
Submirror 0: d13
State: Resyncing
Submirror 1: d23
State: Okay
Resync in progress: 12 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 4198320 blocks (2.0 GB)

d13: Submirror of d3
State: Resyncing
Size: 4198320 blocks (2.0 GB)
Stripe 0:
Device       Start Block  Dbase  State  Reloc  Hot Spare
c0t0d0s3     0            No     Okay   Yes

d23: Submirror of d3
State: Okay
Size: 4198320 blocks (2.0 GB)
Stripe 0:
Device       Start Block  Dbase  State  Reloc  Hot Spare
c0t2d0s3     0            No     Okay   Yes

Device Relocation Information:
Device   Reloc  Device ID
c0t0d0   Yes    id1,dad@AWDC_WD1200BB-00CAA1=WD-WMA8C2114374
c0t2d0   Yes    id1,dad@AWDC_WD1200BB-00CAA1=WD-WMA8C2168885

Now run through the rest of the commands:

# metadetach -f d7 d17
# metaclear d17
# metainit d17 1 1 c0t0d0s7
# metattach d7 d17

# metadetach -f d1 d11
# metaclear d11
# metainit d11 1 1 c0t0d0s1
# metattach d1 d11

# metadetach -f d0 d10
# metaclear d10
# metainit d10 1 1 c0t0d0s0
# metattach d0 d10
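
If you'd rather script it than type each round by hand, the same four steps can be wrapped in a small Bourne-shell loop. This is only a sketch; the mirror/submirror/slice triples below come from my md.cf, so check them against your own before running anything like it:

for triple in "d0 d10 s0" "d1 d11 s1" "d3 d13 s3" "d7 d17 s7"
do
    set -- $triple                # $1=mirror  $2=submirror  $3=slice
    metadetach -f $1 $2           # force-detach the failed submirror
    metaclear $2                  # delete its old definition
    metainit $2 1 1 c0t0d0$3      # recreate it on the replacement drive
    metattach $1 $2               # reattach it and kick off the resync
done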

Of course, you can check any of the mirrors or submirrors at any time to see where things stand. If you want to check the status of multiple rebuilds at once, just run:

# metastat | grep Resync
State: Resyncing
Resync in progress: 2 % done
State: Resyncing
State: Resyncing
Resync in progress: 99 % done
State: Resyncing
State: Resyncing
Resync in progress: 1 % done
State: Resyncing
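
If you'd rather not re-run that by hand, here's a rough one-liner that polls it every minute (tweak the interval to taste, and Ctrl-C to stop):

# while true; do metastat | grep "Resync in progress"; sleep 60; done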

Once the drives have finished syncing, we need to make sure we can still boot off the new drive in case the other one fails, so install the boot block on it:

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
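
While you're thinking about boot paths, it's also worth a quick look at which devices the OpenBoot PROM will try at boot time. Device aliases vary from machine to machine, so treat this as a sketch:

# eeprom boot-device

If only one of the two disks (or an alias pointing at it) is listed, consider adding the other so the box can boot unattended from either drive.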

And, of course, we need to re-add the metadevice database replicas on the new drive, for the same quorum reasons we ran into earlier:

# metadb -a -c 2 c0t0d0s4
# metadb -i
        flags           first blk       block count
     a        u         16              4096            /dev/dsk/c0t0d0s4
     a        u         4112            4096            /dev/dsk/c0t0d0s4
     a m  p  luo        16              4096            /dev/dsk/c0t2d0s4
     a    p  luo        4112            4096            /dev/dsk/c0t2d0s4
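
Once every resync has finished, one last sanity check never hurts; this rough one-liner just tallies the state lines from metastat:

# metastat | grep "State:" | sort | uniq -c

At this point everything should report Okay.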





Comments



  1. It was pointed out to me that since the drive had the right partitions on it and had the same device name, it would be possible to replace all of the lines like this:

    # metadetach -f d7 d17
    # metaclear d17
    # metainit d17 1 1 c0t0d0s7
    # metattach d7 d17

    to just this:

    # metareplace -e d7 c0t0d0s7

    The -e is the important part. Without it, metareplace complains; with -e it re-enables the existing component and resyncs it under the same name.

    Posted September 29, 2008, 4:32 pm
