Abstract
This document walks you through building a RAID-5 volume using four disks (three is the minimum, and performance can degrade when more than eight are used) on Solaris 8, using the command-line interface to Solstice Disk Suite. Our demonstration system contains five hot-swappable 9GB SCSI disks, one of which is the system disk.
Contents
Introduction
Step One: Partitioning the Disks
Step Two: Creating the Metadevice Database
Step Three: Configure the RAID-5 Metadevice
Step Four: Configure a Hot Spare Pool
Step Five: Start the Logical Volume Manager
Step Six: Create a Filesystem on the Metadevice
Conclusion and Additional Advice
Further Information
Notes
Introduction
Solaris 8 provides a facility for building “software” RAID volumes; i.e., the arrangement of bits on the disks is handled by a device driver rather than by hardware. The advantage is that it is very cheap to build RAID volumes in this manner, and their performance scales with the CPU of the machine that serves them. The disadvantage is that such volumes will perform poorly on older machines that have slow CPUs, and also that they will not have the acceleration features that good, dedicated RAID hardware chassis systems have. If you’re working with a small budget, however, you can’t beat software RAID.
The sections that follow assume that you have already installed Solstice Disk Suite (it’s on disk 2 of the CD set), and that you have at least three disks of the same rotational and data transfer speed.
Terminology
Solaris 8 refers to RAID volumes as “metadevices”, and to associated journaling devices or the like as “trans metadevices”. This is the terminology that will be used from here on in the document, so that you can cross-reference with the man pages and not have to translate terminology in your head.
Step One: Partitioning the Disks
Metadevices are made up of disk slices. If you have not already partitioned the disks that you intend to use to build your metadevice, do so now. I tend to make slice 0 100MB, in case I want to attach a journaling trans metadevice later. I also store the metadb replicas in slice 0 on each of the disks. I then use slice 7 for the entire rest of the disk. For my example 9GB disks, my partition table looks like this:
partition> print
Volume: raid5d3
Current partition table (unnamed):
Total disk cylinders available: 4924 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0 unassigned    wm       0 -   57      101.70MB    (58/0/0)      208278
  1 unassigned    wm       0               0         (0/0/0)            0
  2     backup    wm       0 - 4923        8.43GB    (4924/0/0)  17682084
  3 unassigned    wm       0               0         (0/0/0)            0
  4 unassigned    wm       0               0         (0/0/0)            0
  5 unassigned    wm       0               0         (0/0/0)            0
  6 unassigned    wm       0               0         (0/0/0)            0
  7 unassigned    wm      58 - 4923        8.33GB    (4866/0/0)  17473806
Use the format(1M) command to edit the partition table and to label the disk (including setting the volume name).
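Because every slice in the metadevice should be the same size, one common shortcut is to partition the first disk by hand and then copy its label to the others with prtvtoc(1M) and fmthard(1M). The following is only a sketch and not part of the original procedure: it prints the commands for review rather than running them, and it assumes that c0t2d0 is the disk already partitioned and that slice 2 is the conventional whole-disk slice.

```shell
#!/bin/sh
# Print (do not run) the commands that would copy the partition table
# of one already-partitioned disk to each of the remaining disks.
# Assumption: slice 2 is the whole-disk "backup" slice on every disk.
copy_vtoc_cmds() {
    src=$1
    shift
    for dst in "$@"; do
        echo "prtvtoc /dev/rdsk/${src}s2 | fmthard -s - /dev/rdsk/${dst}s2"
    done
}

# Source disk first, then the disks that should receive the same layout.
copy_vtoc_cmds c0t2d0 c0t3d0 c2t0d0 c2t1d0
```

Review the printed lines, then run them as root once you are sure that the target disks hold nothing you need.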
Step Two: Creating the Metadevice Database
We create the metadevice database, which holds information about which disks participate in the various metadevices, with the following command:
# metadb -a -f -c2 c0t2d0s0 c0t3d0s0 c2t0d0s0 c2t1d0s0
Explanation
metadb(1M) creates RAID database replicas on the slices listed on the command line. These database replicas contain state and configuration information, and use a majority consensus algorithm to determine who has the current, correct information. The -a switch tells metadb to attach a new database device, and modifies /etc/system to tell the system to reattach the devices at boot-time. The -f switch is used to create the initial state database. The -c switch is used to determine the number of database replicas that will be created on each of the specified slices. In our case, we’re creating two replicas per slice, just to be paranoid.
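To make the replica arithmetic concrete, here is a small sketch. It assumes that “majority” means strictly more than half of all replicas; the function names are made up for illustration and are not part of Solstice Disk Suite.

```shell
#!/bin/sh
# total_replicas SLICES PER_SLICE
#   Number of replicas created by "metadb -c PER_SLICE" on SLICES slices.
total_replicas() {
    echo $(( $1 * $2 ))
}

# majority N
#   Smallest replica count that is strictly more than half of N.
majority() {
    echo $(( $1 / 2 + 1 ))
}

n=$(total_replicas 4 2)
echo "replicas: $n, majority: $(majority "$n")"   # replicas: 8, majority: 5
```

With two replicas on each of four disks, losing one disk removes two of the eight replicas and leaves six, which is still more than half.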
Step Three: Configure the RAID-5 Metadevice
Now we need to define the RAID-5 metadevice by name, slices participating in the metadevice, and stripe width. We do that with the following command:
# metainit d5 -r c0t2d0s7 c0t3d0s7 c2t0d0s7 c2t1d0s7 -i 65k
Explanation
In this example, d5 is the name of the volume, -r designates this metadevice as a RAID-5 metadevice, and the parameters are the slices that participate in the metadevice. In this case, we’ve chosen slice 7 on each disk, which is traditionally the entire disk. One could choose other slices, so long as each slice is the same size. -i 65k defines a stripe interlace size of 65KB (the default is 16KB), which will give us a stripe size of 260KB (4 disks * 65KB per disk). Others have empirically determined that 256KB-512KB is the optimum stripe width for a general-purpose RAID-5 volume, because most writes will fit into a single stripe, minimizing the numbers of reads and writes that must occur for parity calculations[1]. If you know your average file size, then you should tailor the stripe interlace size accordingly.
Note that if your stripe size is a power of 2, there’s a good chance that all of your superblocks and inodes will end up on the same physical disk, which will negatively impact performance. That’s why the example uses 65KB as an interlace size instead of 64KB.
Finally, note that Solaris 8 has the annoying limitation that metadevice names must match the following regular expression: d[0-9]+
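If you generate metadevice names from a script, it can help to check them against that pattern before calling metainit. This is a minimal sketch in plain sh; the function name is_valid_md_name is made up for illustration.

```shell
#!/bin/sh
# is_valid_md_name NAME
#   Succeeds only if NAME matches d[0-9]+ in full: a "d" followed by
#   one or more digits and nothing else.
is_valid_md_name() {
    case $1 in
        d*) rest=${1#d} ;;
        *)  return 1 ;;
    esac
    case $rest in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *)           return 0 ;;
    esac
}

for name in d5 d100 raid5 d; do
    if is_valid_md_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected"
    fi
done
```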
Step Four: Configure a Hot Spare Pool
Solstice Disk Suite allows you to associate one or more pools of hot spare disks with a metadevice, so that if one of the disks in the metadevice fails, one of the hot spares is transparently substituted for the failed device, and the volume is rebuilt. The nice thing is that you can configure an empty hot spare pool and add disks to it later, which is exactly what we’re going to do here.
# metainit hsp001
# metaparam -h hsp001 d5
Explanation
The metainit(1M) command creates the hot spare pool, named hsp001. The metaparam(1M) command associates the hot spare pool with our RAID-5 metadevice. Later, we can add devices to the hot spare pool with the metahs(1M) command.
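For illustration, adding a spare later might look like the sketch below. The slice c2t2d0s7 is hypothetical (our demonstration system has no free disk), and the run wrapper is a made-up helper that only prints each command unless DRY_RUN is set to 0. A spare is only useful if it is at least as large as the slices it may have to replace.

```shell
#!/bin/sh
# Dry-run wrapper: print the command by default, execute it when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Add a (hypothetical) spare slice to the pool, then display the pool status.
run metahs -a hsp001 c2t2d0s7
run metahs -i hsp001
```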
Step Five: Start the Logical Volume Manager
To be able to use our metadevice as if it were a physical device, we have to start the logical volume manager (our RAID-5 metadevice is a logical volume):
# sh /etc/init.d/lvm.init start
Explanation
Self-explanatory.
Now, you must wait until the metadevice is initialized before proceeding to the next step. You can watch the progress of metadevice initialization via repeated invocations of the command metastat -i.
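Rather than re-running metastat by hand, you could poll it from a small script. The sketch below rests on an assumption about the output format: that metastat prints a “State:” line for the metadevice and that its value is “Initializing” while the volume is being built. Check this against the output on your own system before relying on it. Both function names are made up for illustration.

```shell
#!/bin/sh
# raid_state
#   Read metastat output on stdin and print the value of the first
#   "State:" line. ASSUMPTION: the metadevice state appears on a line
#   whose first field is "State:".
raid_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# wait_for_init MD [INTERVAL_SECONDS]
#   Poll until the metadevice no longer reports "Initializing".
#   Note: the loop also ends if metastat fails or prints no State line,
#   so check the final state yourself before running newfs.
wait_for_init() {
    md=$1
    interval=${2:-60}
    while [ "$(metastat "$md" | raid_state)" = "Initializing" ]; do
        sleep "$interval"
    done
}

# Usage (as root, on the Solaris machine):
#   wait_for_init d5 30
```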
Step Six: Create a Filesystem on the Metadevice
Now that we’ve created our metadevice and started the volume manager, we can pretend that the metadevice is a big partition on which we can do the usual filesystem things. The first thing, of course, is to create a filesystem. One must bear in mind that for good performance, the product of the maximum number of contiguous blocks (maxcontig) and the blocksize must be an integral multiple of the stripe width. In our example, we have a 65KB stripe interlace size on four disks, giving us a 260KB stripe width. We’re going to use an 8KB blocksize; therefore, we want to set maxcontig to 260KB/8KB = 65/2 = 32.5. That isn’t going to work, so let’s try for double the stripe width: 520KB/8KB = 65. This works beautifully, because our 8KB blocksize times 65 contiguous blocks works out to 520KB, exactly twice our stripe size (we couldn’t get one exact stripe size, because we’d have needed a non-integral number of contiguous blocks, which just isn’t possible).
Solaris 8 recommends as well that we designate 256 cylinders per group when creating filesystems larger than 8GB. Our metadevice is going to be 3 * 8.33GB = 24.99GB (the equivalent of one full slice is taken up with parity information, which is actually striped across all of the disks in the metadevice for resiliency); therefore, we’ll heed this recommendation. Our blocksize and cylinder group size choices will minimize fsck(1M) times after an ungraceful shutdown.
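The search for a workable maxcontig can be written as a short loop: take successive multiples of the stripe width until one divides evenly by the blocksize. This sketch follows the arithmetic used in this document (stripe width = interlace size times number of disks); the function name maxcontig_for is made up for illustration.

```shell
#!/bin/sh
# maxcontig_for INTERLACE_KB DISKS BLOCK_KB
#   Print the smallest maxcontig for which maxcontig * BLOCK_KB is a
#   whole multiple of the stripe width (INTERLACE_KB * DISKS).
maxcontig_for() {
    stripe=$(( $1 * $2 ))
    k=1
    # Try 1x, 2x, 3x ... the stripe width. The loop always ends, because
    # BLOCK_KB times the stripe width is divisible by BLOCK_KB.
    while [ $(( (k * stripe) % $3 )) -ne 0 ]; do
        k=$(( k + 1 ))
    done
    echo $(( k * stripe / $3 ))
}

maxcontig_for 65 4 8   # prints 65 (two stripes of 260KB = 520KB)
```

With a 64KB interlace the same function returns 32, since a single 256KB stripe already divides evenly into 8KB blocks.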
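The capacity figure can be checked against the block counts in the partition table from Step One. This is a sketch that assumes 512-byte blocks, which is consistent with slice 7 being listed as 17473806 blocks and 8.33GB; the function name usable_blocks is made up for illustration.

```shell
#!/bin/sh
# usable_blocks DISKS BLOCKS_PER_SLICE
#   Data capacity of a RAID-5 metadevice in blocks: the equivalent of
#   one slice is consumed by parity, so only DISKS - 1 slices hold data.
usable_blocks() {
    echo $(( ($1 - 1) * $2 ))
}

blocks=$(usable_blocks 4 17473806)
# ASSUMPTION: 512-byte blocks, i.e. 2048 blocks per megabyte.
echo "$blocks blocks, about $(( blocks / 2048 )) MB"
```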
Our newfs(1M) command, which creates a UFS filesystem on the RAID-5 metadevice, looks like this:
# newfs -c 256 -i 8192 -m 8 -C 65 /dev/md/rdsk/d5
Explanation
- -c 256
- specifies 256 cylinders per group, as discussed above.
- -i 8192
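- specifies 8192 bytes of data space per inode. Note that -i sets the inode density, not the blocksize; the 8KB blocksize used in the calculation above is the newfs default (it can be set explicitly with -b 8192). The fragment size likewise defaults to 1024 bytes, a default that we accept.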
- -m 8
- specifies that a minimum of 8% of the filesystem is held back as free space. The default value would have been 1%, but that would degrade throughput threefold over a value of 10%.[2] 8% strikes a nice balance between throughput and space. Once this threshold is reached, only the superuser can write to the partition.
- -C 65
- specifies a maximum of 65 contiguous blocks to be laid out before inserting a rotational delay.
- /dev/md/rdsk/d5
- the “raw” version of our RAID-5 metadevice.
Conclusion and Additional Advice
We’re done with this simple example. It isn’t difficult to implement RAID on a Solaris machine for very low cost. Certainly, more elaborate constructions can be made (e.g., journaling, mirrors of striped devices, etc.), and the basic idea is the same as we’ve seen above, except that in some cases, the sub-devices that we use in building a metadevice may be metadevices themselves (e.g., mirroring a striped metadevice).
IMPORTANT NOTE:
Although the RAID-5 setup we’ve demonstrated provides protection against the failure of any disk in the volume, it does not guard against corruption of the configuration information on the system disk. If you lose that, you will have to retrieve your data from the RAID volume with a hex editor (no fun). Make sure that you back up /etc/lvm regularly! (You should be backing up /etc anyway).
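A backup of the configuration directory can be as simple as a tar archive. The sketch below is one possible approach, not a prescribed one: the function name backup_lvm is made up, and the destination is deliberately left to you, since a copy kept only on the system disk would be lost along with it.

```shell
#!/bin/sh
# backup_lvm SOURCE_DIR DEST_TAR
#   Archive the volume manager configuration directory into DEST_TAR.
#   DEST_TAR should live somewhere other than the system disk.
backup_lvm() {
    src=$1
    dest=$2
    if [ -z "$src" ] || [ -z "$dest" ]; then
        echo "usage: backup_lvm SOURCE_DIR DEST_TAR" >&2
        return 2
    fi
    tar cf "$dest" "$src" && echo "saved $src to $dest"
}

# Usage (the destination path is hypothetical):
#   backup_lvm /etc/lvm /mnt/backup/lvm-config.tar
```

Run from cron, this keeps a recent copy of the configuration available if the system disk is lost.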