Saturday, 10 October 2009

Debugging backup problems

We had some problems with our nightly backup recently. These are tricky to debug because you don't want to be experimenting while users are on the system being backed up, so it can turn into a slow cycle: try a fix during the day, wait for the nightly backup run, and try again the next day if it has failed.

The problem for us was that the backup would start, and then fail after about 10 minutes. The software we use is CA ARCserve for Unix and the message logged in the error log was a very unhelpful "unknown scsi error".

The backup server is a Sun V125 running Solaris 10. As noted above, the backup software is ARCserve, version 11.5. The server has a Qlogic fibre channel HBA for SAN connectivity and plugs into a SAN fabric built on Qlogic 5200 (2Gbit) and 5600 (4Gbit) switches. Also plugged into the SAN is an ATTO Fibrebridge. Our tape library, the Sun StorageTek SL48, connects via SCSI to the fibre bridge, which in turn publishes the internal tape drive and the library as LUNs on the SAN.

As you can see, there are a few things that could go wrong. The reason we have this somewhat complicated setup is that when we originally bought the SL48, although it was listed on CA's HCL, it only worked in a fibre attached configuration.

Having checked the obvious (that ARCserve could see the tape library and load, unload and scan the media; that there were sufficient "scratch" tapes available for writing; and that the Ingres database that holds all the backup records was consistent and working properly), I turned my attention to the hardware.

The first step was to eliminate ARCserve from the list of suspects. When it starts, ARCserve loads the "cha" device driver that controls the tape library. I rebooted the backup server to ensure that the driver was definitely not loaded, and observed that the tape drive could be seen as /dev/rmt/0. Using the SL48 web interface, I loaded a tape and tried to perform a ufsdump of a local filesystem to the tape drive.

This worked for a while and then failed with a write error.
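
For anyone wanting to repeat this kind of standalone write test, here is a minimal sketch in Python. It is not what I ran at the time (I typed the command by hand), and it makes some assumptions: Python 3.5 or later for subprocess.run, a plain level 0 dump with the "0f" options, and hypothetical function names. The tape device is the one mentioned above; the filesystem is whatever local filesystem you choose.

```python
import subprocess


def build_ufsdump_command(tape_device, filesystem):
    """Return the argument list for a full dump of `filesystem` to `tape_device`."""
    # "0" requests a level 0 (full) dump; "f" says the next argument is the
    # dump device. These options are an assumption, not taken from the post.
    return ["ufsdump", "0f", tape_device, filesystem]


def run_write_test(tape_device, filesystem, runner=subprocess.run):
    """Run the dump and return True only if ufsdump exits with status 0.

    `runner` is injectable so the logic can be exercised without a tape drive.
    """
    result = runner(build_ufsdump_command(tape_device, filesystem))
    return result.returncode == 0
```

Calling run_write_test("/dev/rmt/0", "/export/home") would dump the (hypothetical) /export/home filesystem to the drive and report whether the write completed. Any non-zero exit status is treated as a failure, which is all this test needs to know.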

Okay, so it's not ARCserve, and it looks like an error on the tape drive. Perhaps a clean would help. ARCserve is meant to run automatic drive cleans, but perhaps it hadn't. Again, the SL48 web interface provides functionality to do this.

The SL48 complained that the cleaning tape had expired. Unfortunately, I didn't have any spare, but it looked like this might be the issue. I immediately ordered three more tapes, and configured the SL48 to email me when it had warnings as well as errors. This should mean I get notified when the next cleaning tape expires.

A dirty drive was the most likely suspect, but in order to definitely rule out the SAN, I tried directly attaching the SL48 to the V125's internal SCSI port. This hadn't worked with the original ARCserve 11.5, but we had since applied a service pack.

The service pack had updated the device driver list and the SL48 was now detected as a local SCSI drive. I tried the same ufsdump, against the same local filesystem, using the same tape and expecting the same error, but was surprised to see that the backup completed without any problems.

Hmm, perhaps it's the SAN.
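
The reasoning here is simply a comparison of the same test over two different paths to the same drive. As a rough sketch (the function and constant names are hypothetical, and the groupings are my own reading of our setup), it could be written down like this:

```python
# Which group of components to suspect, given the result of the same
# write test over each path to the tape drive.
SAN_PATH = "SAN path (HBA, switches or fibre bridge)"
SCSI_PATH = "direct SCSI path (cable or host SCSI port)"
COMMON = "components common to both paths (tape drive, media, library)"
NOT_HARDWARE = "neither path fails this test; look at the backup software or job"


def suspect_from_path_tests(san_path_ok, direct_scsi_ok):
    """Return the group of components to investigate next."""
    if san_path_ok and direct_scsi_ok:
        return NOT_HARDWARE
    if direct_scsi_ok:
        # Fails over the SAN, works when directly attached.
        return SAN_PATH
    if san_path_ok:
        # Works over the SAN, fails when directly attached.
        return SCSI_PATH
    # Fails both ways, so the fault is in something both paths share.
    return COMMON
```

In our case the test failed over the SAN and succeeded over direct SCSI, so suspect_from_path_tests(False, True) points at the SAN path. Note that this only narrows things down to a group of components; it does not say which one within the group is at fault.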

Last week, prior to the problems starting, we added a new fibre switch (the 5600), and this required a firmware upgrade on our existing switches (the 5200s) so that they were all at the same level. It's possible that there is something in this latest firmware release that is causing the tape library (or fibre bridge) to choke.

I fired up ARCserve again and kicked off a backup job. It was during the day, but we hadn't had a working backup for several days, so I was content to take the performance hit (not that anyone appeared to notice).

The backup ran for 13 hours and completed successfully.

I've not actually spent more time trying to determine whether the problem is with the switches or the [now redundant] fibre bridge. The backup is now a lot simpler in its configuration. My belief is that of all the systems we run, the backup should be the simplest, especially when there is a restore to be done!

Subsequent backups have been successful, but in the event of future problems, it's useful to have a template to work from when debugging any issues.
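
As a starting point for such a template, the steps from this post could be kept as an ordered checklist. The following is only a sketch: the step wording is a summary of what I did, the names are hypothetical, and the results are filled in by hand as you work through each check.

```python
# The checks from this post, in the order they were carried out.
STEPS = [
    "ARCserve can see the library and load, unload and scan media",
    "Sufficient scratch tapes are available",
    "Ingres backup database is consistent",
    "Standalone ufsdump to the drive with the ARCserve driver unloaded",
    "Drive cleaned (cleaning tape present and not expired)",
    "Same ufsdump with the library attached directly over SCSI",
]


def next_action(results):
    """Return (status, step) for the first step that failed or has not been run.

    `results` maps a step description to True (passed) or False (failed);
    steps missing from the mapping count as not yet run.
    """
    for step in STEPS:
        outcome = results.get(step)
        if outcome is None:
            return ("not run", step)
        if not outcome:
            return ("failed", step)
    return ("all passed", None)
```

Recording each result as you go means that when the backup fails at 2am six months from now, you can see at a glance which checks have been done and where to pick up.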
