Recently we purchased a Sun Fire V210 server with a pair of 73GB drives. The Sun Fire doesn’t come with a RAID controller on board and I wasn’t interested in paying Sun another $1000 to install one, since the machine is primarily for development, so I decided to set up Solaris Volume Manager (a/k/a Logical Volume Manager, or LVM) to mirror the drives.
Sadly, there’s no way to do so during the initial operating system install. Sun’s documentation says that in order to configure an LVM set during installation, you must use JumpStart. It seemed like a waste of time to set up JumpStart for one machine, so I decided to just do a regular installation onto the primary disk and configure LVM later, by mirroring the primary drive onto the secondary, creating a RAID-1 set.
LVM stores all of its metadata on-disk in a state database, and you must configure it with enough copies (“replicas”) to survive the potential loss of one or more replicas. Sun’s documentation recommends that for a RAID-1 set you create two replicas per drive, so that even if a drive is lost, you still have enough copies of the metadata to run in degraded mode.
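The quorum rule behind this recommendation can be sketched as simple arithmetic (my own illustration, not from Sun’s documentation): LVM keeps running as long as at least half of the replicas are available, so with two replicas on each of two drives, losing a whole drive still leaves exactly half.

```shell
# Replica quorum arithmetic for the two-drive, two-replicas-per-drive
# layout described above (illustrative numbers only).
replicas_per_drive=2
drives=2
total=$((replicas_per_drive * drives))     # 4 replicas in all
surviving=$((total - replicas_per_drive))  # one whole drive lost
# LVM can keep running (degraded) while at least half survive.
if [ "$surviving" -ge $((total / 2)) ]; then
    echo "quorum held: $surviving of $total replicas"
fi
```

As I understand the quorum rule, running needs half the replicas but an unattended boot needs a majority, so a two-drive configuration may still need manual intervention after losing a disk.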
Sun’s documentation claims that you can store state database replicas on:
- A dedicated local disk partition
- A local partition that will be part of a volume
- A local partition that will be part of a UFS logging device
Additionally, the same document says:
Replicas cannot be stored on the root (/), swap, or /usr slices. Nor can replicas be stored on slices that contain existing file systems or data. After the replicas have been stored, volumes or file systems can be placed on the same slice.
Unfortunately, I didn’t know enough to set up a dedicated local partition for the state database, so I figured that I could just use an empty local partition to store the state databases, and newfs(1M) it later.
For reference, here’s how my prtvtoc(1M) output appeared right after the initial install:
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00     8395200  22529664  30924863  /
       1      7    00    30924864  12292608  43217471  /var
       2      5    00           0 143349312 143349311
       3      3    01           0   8395200   8395199  swap
       5      0    00    43217472  40968576  84186047  /opt
       6      4    00    84186048  40968576 125154623  /usr
       7      8    00   125154624  18194688 143349311  /export/home
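As a sanity check on the VTOC arithmetic (my own spot-check, not anything from Sun’s docs): for any slice, the last sector should equal the first sector plus the sector count minus one.

```shell
# Spot-check the VTOC arithmetic for slice 0 of the table above:
# last sector = first sector + sector count - 1.
first=8395200
count=22529664
last=$((first + count - 1))
echo "slice 0 last sector: $last"   # should print 30924863
```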
I decided to put the state database replicas on c1t0d0s7 since /export/home was empty and I could afford to reformat it. So I unmounted it and tried:
# metadb -a -c 2 -f c1t0d0s7
# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c1t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c1t0d0s7
Okay, that seemed to work. Now to re-newfs(1M) the disk:
# newfs /dev/rdsk/c1t0d0s7
[newfs output snipped]
# mount /dev/dsk/c1t0d0s7 /export/home
I rebooted the machine at this point, and tried to do a metadb:
# metadb
        flags           first blk       block count
      F   p  luo        ?               ?               /dev/dsk/c1t0d0s7
     a m  p  luo        8208            8192            /dev/dsk/c1t0d0s7
Uh-oh. Somehow newfs(1M)-ing the partition wiped out the first database replica. (In hindsight this isn’t so surprising: the first replica started at block 16, which is right where newfs(1M) lays down the primary UFS superblock.) But I thought Sun said you could have a state database replica on a partition that will be part of a volume? (which I intended to do by mirroring /export/home)
Anyway, I didn’t have time to delve into this further, so I decided to just carve out a piece of the swap space for myself to use as a dedicated local partition for the state database replica. Here’s how the prtvtoc(1M) output looked after that:
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00     8395200  22529664  30924863  /
       1      7    00    30924864  12292608  43217471  /var
       2      5    00           0 143349312 143349311
       3      3    01           0   8242560   8242559  swap
       4      0    00     8242560    152640   8395199
       5      0    00    43217472  40968576  84186047  /opt
       6      4    00    84186048  40968576 125154623  /usr
       7      8    00   125154624  18194688 143349311  /export/home
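A quick check (again mine, not from the docs) that the carve-up is lossless: the shrunken swap slice plus the new slice 4 should add back up to the original swap slice’s 8395200 sectors.

```shell
# The original swap slice was 8395200 sectors; after carving out a
# dedicated slice for the metadb replicas, swap + slice 4 should
# still cover exactly the same range.
new_swap=8242560
slice4=152640
total=$((new_swap + slice4))
echo "swap + slice 4 = $total sectors"   # should print 8395200
```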
Then I went and created the state database replicas on /dev/rdsk/c1t0d0s4. All is well:
# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c1t0d0s4
     a    p  luo        8208            8192            /dev/dsk/c1t0d0s4
It was now time to recreate the partition table on the second disk, c1t1d0. I didn’t realize you could pipe the output from prtvtoc(1M) into fmthard(1M), so I ended up trying to recreate the partition table from scratch, interactively, using format(1M). Unfortunately this meant that the slices’ start and end sectors were all different, and most critically, /dev/dsk/c1t1d0s0 started on sector 0. You’ll see what happened in the next section.
Now that I had two disks partitioned (so I thought) with the same partitioning scheme, it was time to start setting up volumes. I thought I would start with the (slightly) hardest task: mirroring the root filesystem. Fortunately, Sun’s documentation for doing this is very good. You first create the components of the mirror as “RAID-0” submirrors, attach the first submirror to the mirror, change the root device to the new mirror device node, reboot the machine, and then attach the second submirror to the mirror. LVM then syncs the first submirror onto the second. The sequence of commands looks like this:
# metainit -f d11 1 1 c1t0d0s0
# metainit d12 1 1 c1t1d0s0
# metainit d10 -m d11
# metaroot d10
Reboot the machine at this point, and when it comes back up, attach the second submirror to the mirror:
# metattach d10 d12
Unfortunately, at this point, I got the error <tt>can't attach labeled submirror to unlabeled mirror</tt>. The LVM user guide gives a very dry explanation:
An attempt was made to attach a labeled submirror to an unlabeled mirror. A labeled metadevice is a device whose first component starts at cylinder 0. To prevent the submirror’s label from being corrupted, DiskSuite does not allow labeled submirrors to be attached to unlabeled mirrors.
It seems that my hand-creation of the partition table on c1t1d0 was the root cause (no pun intended) of the problem: the slice c1t1d0s0 started at cylinder 0, and so was “labeled” with the disklabel for the entire drive! Now I understand why the Solaris install program never allocates slice 0 starting on cylinder 0.
Okay, so I had to repartition c1t1d0 with exactly the same partition table as c1t0d0. I found this handy command online to duplicate partition tables:
# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
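The pipe works because, as I understand it, fmthard(1M) treats lines beginning with an asterisk as comments and reads only the slice lines of the prtvtoc(1M) output. A small illustration, filtering a stub of the output from earlier in this post the same way (the stub text and the grep are mine, just to show which lines carry data):

```shell
# Filter a stub of prtvtoc output: comment lines (starting with '*')
# carry no slice data and are ignored by fmthard(1M).
vtoc_stub='* First Sector Last
* Partition Tag Flags Sector Count Sector
0 2 00 8395200 22529664 30924863'
printf '%s\n' "$vtoc_stub" | grep -v '^\*'
```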
Then it was just a matter of recreating the database replicas on the drive, recreating the bad submirror, and metattach-ing it to the d10 mirror.
I managed to convert all of the other filesystems over to LVM volumes using Sun’s instructions, so now all filesystems are LVM-managed.
Next episode: setting up zones on the machine.
Julian, is there any special reason not to use ZFS instead of LVM? What's your opinion about it?
- Feature Story: ZFS - the last word in file systems. The breakthrough file system in Solaris 10 delivers virtually unlimited capacity, provable data integrity, and near-zero administration.
- OpenSolaris Community: ZFS
- 100 Mirrored Filesystems in 5 minutes
Actually, I didn't even know about ZFS… but thanks for the tip! I'm looking into it now, and I might play with it in a test environment. For production I'm obviously still most comfortable with LVM because it has a long history and therefore I assume more people know about it if I have to ask the community for support.
(Amusingly enough, the fellow that runs the <a rel="nofollow" href="http://www.sunmanagers.org/">sunmanagers</a> mailing list I just joined is someone that I interviewed with not so long ago!)
Edit: Apparently ZFS is still beta quality and not yet released — someone asked about it on <tt>sunmanagers</tt> and here is some information on its current status.
You used the wrong format to attach the 2nd submirror to the mirrored disk pair. You should have used <tt>metattach d10 d12</tt>. You transposed the d10 and d12. The first should always be the mirror, while the second is the submirror.
Dave – you're right. I've fixed the typo.