Tuesday, 20 October 2009

Building a ZFS filesystem using RAID60

We are starting to use ZFS as a production filesystem at $WORK. Our disk array of choice is the Sun StorageTek 2540 which provides hardware RAID capabilities. When building a ZFS environment, the decision has to be made on whether to use the hardware RAID and/or the software RAID capabilities of ZFS.

Having watched Ben Rockwood's excellent ZFS tutorial, my understanding of ZFS is much better than before. For our new fileserver, I've created the following:

On the StorageTek 2540, I've created two virtual disks in a RAID6 configuration. Each virtual disk comprises five physical disks (three for data, two for parity) and is assigned to a different controller. On top of each virtual disk, I've created a 100GB volume. These volumes are published as LUNs to the Solaris server and appear as c0d3 and c0d4.
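Before building the pool, it's worth confirming that Solaris can actually see the new LUNs. A rough check is to list the attached fabric devices and then list the disks the OS knows about; the exact device names will of course depend on your setup:

# cfgadm -al
# echo | format

Piping echo into format is just a convenient way to get a non-interactive disk listing; both LUNs should show up before going any further.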

Each LUN is then added to a zpool called "fileserver":

# zpool create fileserver c0d3 c0d4

By default, ZFS stripes data dynamically across the top-level devices in the pool, so the hardware and software combined result in a "RAID60" configuration: data is striped across the two RAID6 virtual disks for a total width of ten spindles.
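To see the two virtual disks being treated as separate top-level devices, zpool iostat with the verbose flag breaks the I/O statistics down per device, which makes the striping fairly obvious once there is some activity on the pool:

# zpool iostat -v fileserver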

Why RAID6 and not RAID10? Apart from the cost implications, the majority of operations on a fileserver are reads, and RAID6 is very good at reads (while being less good at writes because of the parity calculations).

Now, when I'm running out of space, I can create a new volume, publish the LUN and add it to the zpool:

# zpool add fileserver c0d5

A quick check of the zpool status shows the disk is added successfully:

# zpool status
  pool: fileserver
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        fileserver  ONLINE       0     0     0
          c0d3      ONLINE       0     0     0
          c0d4      ONLINE       0     0     0

errors: No known data errors

Running zpool list reports the newly added space is available:

# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
fileserver 199G 69.4G 130G 34% ONLINE -


All told, very simple to do and results in a pretty fast filesystem.
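For a very rough (and completely unscientific) feel for sequential write speed, writing a large file into the pool while watching zpool iostat in another terminal does the job; the file name below is just a placeholder:

# dd if=/dev/zero of=/fileserver/testfile bs=1024k count=10000
# zpool iostat fileserver 5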

Saturday, 10 October 2009

Debugging backup problems

We had some problems with our nightly backup recently. These are tricky to debug because you don't want to be playing around while users are on the system being backed up, so it becomes a case of trying to fix something during the day, waiting for the nightly backup run, and trying again the next day when it has failed.

The problem for us was that the backup would start, and then fail after about 10 minutes. The software we use is CA ARCserve for Unix and the message logged in the error log was a very unhelpful "unknown scsi error".
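When the application log is this vague, the Solaris system log is usually a better place to look for detail, since the SCSI drivers report transport and device errors there:

# grep -i scsi /var/adm/messages

Any messages from the st (tape) or qlc/fp (fibre channel) drivers around the time of the failure are the interesting ones.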

The backup server is a Sun V125 running Solaris 10. As noted above, the backup software is ARCserve, version 11.5. The server has a Qlogic fibre channel HBA for SAN connectivity and plugs into a SAN fabric built on Qlogic 5200 (2Gbit) and 5600 (4Gbit) switches. Also plugged into the SAN is an ATTO Fibrebridge. Our tape library, the Sun StorageTek SL48, connects via SCSI to the fibre bridge, which in turn publishes the internal tape drive and the library as LUNs on the SAN.

As you can see, there are a few things that could go wrong. The reason we have this somewhat complicated setup is that when we originally bought the SL48, although it was listed on CA's HCL, it only worked in a fibre attached configuration.

I first checked the obvious: that ARCserve could see the tape library and could load, unload and scan the media; that there were sufficient "scratch" tapes available for writing; and that the Ingres database that holds all the backup records was consistent and working properly. Having ruled those out, I turned my attention to the hardware.

The first step was to eliminate ARCserve from the list of suspects. When it starts, ARCserve loads the "cha" device driver that controls the tape library. I rebooted the backup server to ensure that the drivers were definitely not running, and observed that the tape drive could be seen as /dev/rmt/0. Using the SL48 web interface, I loaded a tape and tried to perform a ufsdump of a local filesystem to the tape drive.
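The test looked roughly like this; mt confirms the drive is responding before anything is written, and the filesystem path is just an example of a convenient local one:

# mt -f /dev/rmt/0 status
# ufsdump 0f /dev/rmt/0 /export/home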

This worked for a while and then failed with a write error.

Okay, so it's not ARCserve, and it looks like an error on the tape drive. Perhaps a clean would help. ARCserve is meant to run automatic drive cleans, but perhaps it hadn't. Again, the SL48 web interface provides functionality to do this.

The SL48 complained that the cleaning tape had expired. Unfortunately, I didn't have any spare, but it looked like this might be the issue. I immediately ordered three more tapes, and configured the SL48 to email me when it had warnings as well as errors. This should mean I get notified when the next cleaning tape expires.

A dirty drive was the most likely suspect, but in order to rule out the SAN completely, I tried directly attaching the SL48 to the V125's internal SCSI port. This hadn't worked with the original ARCserve 11.5 release, but we had since applied a service pack.
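Once the library is cabled up, getting Solaris to pick up the drive is just a case of rebuilding the tape device links (or doing a reconfiguration boot) and checking that the drive responds; roughly:

# devfsadm -c tape
# mt -f /dev/rmt/0 status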

The service pack had updated the device driver list and the SL48 was now detected as a local SCSI drive. I tried the same ufsdump, against the same local filesystem, using the same tape and expecting the same error, but was surprised to see that the backup completed without any problems.

Hmm, perhaps it's the SAN.

Last week, prior to the problems starting, we added a new fibre switch (the 5600) and this required that our existing switches (5200) have a firmware upgrade so they were all at the same level. It's possible that there is something in this latest firmware release that is causing the tape library (or fibre bridge) to choke.

I fired up ARCserve again and kicked off a backup job. It was during the day, but we hadn't had a working backup for several days, so I was content to take the performance hit (not that anyone appeared to notice).

The backup ran for 13 hours and completed successfully.

I've not actually spent more time trying to determine whether the problem is with the switches or the [now redundant] fibre bridge. The backup is now a lot simpler in its configuration. My belief is that of all the systems we run, the backup should be the simplest, especially when there is a restore to be done!

Subsequent backups have been successful, but in the event of future problems, it's useful to have a template to work from when debugging any issues.