The first step in troubleshooting is to get to a proper Unix prompt (remember that NexentaStor is built on the Solaris codebase). I opened an SSH session to the VSA and logged in as "admin". By default the admin shell is a bit special, and for real troubleshooting I needed the root account. To get this, run the "su" command:
admin@nexenta01:~$ su
Password:
root@nexenta01:/export/home/admin#
Note that I ran "su" and not "su -": a plain "su" keeps the current environment and working directory, whereas "su -" would start a full root login shell.
VMware ESX admins may be familiar with "esxtop", and Linux admins with "top". The Solaris equivalent is "prstat":
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
877 root 36M 33M sleep 59 0 0:00:35 0.7% python2.5/15
7312 root 4320K 3468K cpu0 59 0 0:00:00 0.4% prstat/1
855 root 58M 26M sleep 59 0 0:00:33 0.3% nms/1
196 root 7812K 4716K sleep 59 0 0:00:00 0.2% nscd/32
576 root 6952K 4676K sleep 59 0 0:00:05 0.2% vmtoolsd/1
3716 root 17M 15M sleep 44 5 0:00:04 0.1% volume-check/1
596 root 39M 9092K sleep 59 0 0:00:00 0.1% nmdtrace/1
7213 admin 7744K 5272K sleep 59 0 0:00:00 0.1% sshd/1
953 root 58M 54M sleep 59 0 0:00:13 0.1% nms/1
3560 root 16M 15M sleep 44 5 0:00:01 0.0% disk-check/1
3564 root 18M 17M sleep 59 0 0:00:02 0.0% hosts-check/1
434 root 3472K 2104K sleep 59 0 0:00:03 0.0% dbus-daemon/1
324 root 0K 0K sleep 99 -20 0:00:02 0.0% zpool-testpool/136
5 root 0K 0K sleep 99 -20 0:00:01 0.0% zpool-syspool/136
509 root 18M 10M sleep 59 0 0:00:01 0.0% fmd/21
1273 root 58M 54M sleep 59 0 0:00:05 0.0% nms/1
392 root 13M 8400K sleep 59 0 0:00:00 0.0% smbd/18
515 www-data 17M 6716K sleep 59 0 0:00:00 0.0% apache2/28
234 root 2604K 1584K sleep 100 - 0:00:00 0.0% xntpd/1
7231 root 4628K 2568K sleep 59 0 0:00:00 0.0% bash/1
519 www-data 17M 6564K sleep 59 0 0:00:00 0.0% apache2/28
Total: 91 processes, 732 lwps, load averages: 0.55, 0.53, 0.55
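As an aside, prstat takes options broadly similar to top if you want to narrow the output down. For example, the following (not needed here, just for reference) sorts by CPU usage, shows the top five processes and refreshes every five seconds:
root@nexenta01:/export/home/admin# prstat -s cpu -n 5 5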
When troubleshooting, I noticed that the process using the most CPU was "nms", a custom process provided by Nexenta. Curious as to what this process was doing, I ran the truss command against its process ID:
root@nexenta01:/export/home/admin# truss -f -p 855
855: pollsys(0x08047AA0, 1, 0x08047B58, 0x00000000) (sleeping...)
855: pollsys(0x08047AA0, 1, 0x08047B58, 0x00000000) = 1
855: read(4, " l01\00118\0\0\08D\0\0\0".., 2048) = 176
855: read(4, 0x0AB8DB78, 2048) Err#11 EAGAIN
855: stat64("/tmp/.nza", 0x0813E078) = 0
855: stat64("/tmp/.nza", 0x0813E078) = 0
855: stat64("/tmp/.nza/.appliance", 0x0813E078) = 0
855: open64("/tmp/.nza/.appliance", O_RDWR) = 9
855: fstat64(9, 0x0813DFE8) = 0
855: fcntl(9, F_SETFD, 0x00000001) = 0
855: llseek(9, 0, SEEK_CUR) = 0
855: fcntl(9, F_SETLKW64, 0x08047410) = 0
855: llseek(9, 0, SEEK_CUR) = 0
The truss command traces system calls, and although the output appears quite scary, you can learn a lot about what a process is doing without needing to know exactly what each system call does. Useful calls to look for are:
- open() - opens a file for reading/writing. The number returned (on the right after the = sign) is the file descriptor.
- close() - closes a file.
- read() and write() - the first number after the "(" is the file descriptor being read from or written to. Cross-reference it with the corresponding open() call.
- stat() and stat64() - tests to see if a file exists. Don't worry if you get errors returned here, as it might be the process looking for a file that could exist in multiple places (e.g., when scanning the PATH for an executable).
That was a really quick intro to truss for the purposes of explaining how I debugged the problem. Truss is capable of a lot more than I've just described. See the man page ("man truss") for more details.
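For example, two variations I find handy (not used in this session, and worth checking against your own truss man page) are restricting the trace to particular system calls, and simply counting calls to get a summary when you interrupt it:
root@nexenta01:/# truss -t open,close,read,write,stat64 -p 855
root@nexenta01:/# truss -c -p 855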
Back to the performance problem...
The truss output showed me that the nms process was scanning through a large number of ZFS snapshots, seemingly far more than there should have been. I obtained a list of the snapshots on the system:
# zfs list -t snapshot
...and got hundreds back! Something was creating a large number of snapshots. On closer inspection, it appeared I was getting a new snapshot of some filesystems every 7 minutes:
filestore/Shared@snap-daily-1-2011-01-21-2122 1.23M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2129 1.22M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2136 1.39M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2143 1.22M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2150 1.12M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2157 1.06M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2201 1.04M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2208 1.04M - 50.5G -
filestore/Shared@snap-daily-1-2011-01-21-2215 0 - 50.5G -
filestore/Shared@snap-daily-1-latest 0 - 50.5G -
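A quick way to see how many snapshots each filesystem has accumulated is to count them; this one-liner is my own addition rather than part of the original session:
root@nexenta01:/# zfs list -H -t snapshot -o name | cut -d@ -f1 | sort | uniq -c | sort -rn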
It appeared that the filesystems with the large number of snapshots were also the filesystems that I had set to replicate to my second VSA using the auto-tier service. As a test, I listed all the auto-tier services and disabled the suspects:
root@nexenta01:/# svcs -a | grep auto-tier
online 20:33:40 svc:/system/filesystem/zfs/auto-tier:filestore-Software-000
online 20:33:41 svc:/system/filesystem/zfs/auto-tier:filestore-ISOs-000
online 20:33:42 svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000
online 21:33:24 svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
root@nexenta01:/# svcadm disable svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
root@nexenta01:/# svcadm disable svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000
The snapshots stopped.
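A quick way to confirm that (again, my own addition rather than part of the original session) is to list the most recent snapshots by creation time and check that nothing new appears after a few minutes:
root@nexenta01:/# zfs list -t snapshot -o name,creation -s creation | tail -5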
To determine if the number of snapshots was the problem (I'd seen similar problems before), I destroyed all the snapshots for that filesystem:
root@nexenta01:/# for snapshot in $(zfs list -t snapshot | grep Home | grep snap-daily-1-2011-01 | awk '{ print $1 }'); do zfs destroy $snapshot; echo "Destroyed $snapshot"; done
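If you try something like this yourself, it is worth doing a dry run first by echoing the commands instead of executing them; a sketch of the same loop:
root@nexenta01:/# for snapshot in $(zfs list -H -t snapshot -o name | grep Home | grep snap-daily-1-2011-01); do echo zfs destroy $snapshot; done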
With the snapshots gone, the web interface was fast again.
Okay, so that was the problem, but why was it happening? At first, I couldn't work it out, so I deleted and recreated the auto-tier jobs. Everything then worked fine... for a couple of days. Then the number of snapshots started climbing again.
This time I was able to pin down what had changed: I had had to reboot the second VSA because it had run out of memory (I had assigned too little). This appears to have broken the link between the two VSAs, leaving the auto-tier jobs on the primary running out of control.
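When the link between two appliances is suspect, a cheap first check (not part of my original notes) is to test the ssh hop that auto-tier relies on by hand from the primary:
root@nexenta01:/# ssh nexenta02.local.zone uname -a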
Knowing that a new snapshot would fire every 7 minutes, I waited and ran the "ptree" command (which displays processes in a tree view, showing parent/child relationships) until I spotted the auto-tier job:
2266 /usr/bin/perl /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tie
2315 rsync -e ssh --delete --exclude-from=/var/lib/nza/rsync_excl.txt --inplace --ig
2316 ssh nexenta02.local.zone rsync --server -lHogDtpre.isf --delete --ignore-errors
The problem here was the zfs-auto-tier service (process 2266). Although the full command is truncated, I compared it with the output from svcs (see above) and guessed it to be:
sh -c /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
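As an aside, rather than re-running ptree by hand until the job shows up, a small loop like this (my own addition, assuming pgrep is available, which it is on Solaris-derived systems) would catch it as soon as it starts:
root@nexenta01:/# until pgrep -f zfs-auto-tier > /dev/null; do sleep 10; done; ptree $(pgrep -f zfs-auto-tier)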
To examine the properties of this service, I ran:
root@nexenta01:/# svccfg -s svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000 listprop
zfs application
zfs/action astring
zfs/day astring 1
zfs/depth astring 1
zfs/dircontent astring 0
zfs/direction astring 1
zfs/exclude astring
zfs/from-fs astring /volumes/filestore/Shared
zfs/from-host astring localhost
zfs/from-snapshot astring
zfs/fs-name astring filestore/Shared
zfs/keep_days astring 7
zfs/method astring tier
zfs/minute astring 0
zfs/options astring "--delete --exclude-from=/var/lib/nza/rsync_excl.txt --inplace --ignore-errors -HlptgoD"
zfs/proto astring rsync+ssh
zfs/rate_limit astring 0
zfs/to-fs astring /volumes/backup
zfs/to-host astring nexenta02.local.zone
zfs/trace_level astring 1
zfs/type astring daily
zfs/retry-timestamp astring 1295492474
zfs/period astring 1
zfs/hour astring 2
zfs/last_replic_time astring 12
zfs/time_started astring 21:43:18,Jan21
zfs/retry astring 1
startd framework
startd/duration astring transient
general framework
general/enabled boolean true
start method
start/exec astring "/lib/svc/method/zfs-auto-tier start"
start/timeout_seconds count 0
start/type astring method
stop method
stop/exec astring "/lib/svc/method/zfs-auto-tier stop"
stop/timeout_seconds count 0
stop/type astring method
refresh method
refresh/exec astring "/lib/svc/method/zfs-auto-tier refresh"
refresh/timeout_seconds count 0
refresh/type astring method
restarter framework NONPERSISTENT
restarter/auxiliary_state astring none
restarter/logfile astring /var/svc/log/system-filesystem-zfs-auto-tier:filestore-Shared-000.log
restarter/start_pid count 867
restarter/start_method_timestamp time 1295642022.247594000
restarter/start_method_waitstatus integer 0
restarter/transient_contract count
restarter/next_state astring none
restarter/state astring online
restarter/state_timestamp time 1295642022.255217000
The property that stood out was zfs/retry-timestamp, and I guessed that the value was a Unix timestamp counting seconds since the epoch. Converting the value turned it into a human-readable date:
Thu Jan 20 2011 03:01:14 GMT+0000 (BST)
This date was in the past, so was this what was causing the script to keep running?
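For reference, you can pull a single property with svcprop rather than listing them all, and do the epoch conversion on the appliance itself with perl (the auto-tier method script is written in it, so it's present). Neither command is from my original session:
root@nexenta01:/# svcprop -p zfs/retry-timestamp svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000
1295492474
root@nexenta01:/# perl -e 'print scalar gmtime(1295492474), "\n"'
Thu Jan 20 03:01:14 2011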
I edited the value:
svccfg -s svc:/system/filesystem/zfs/auto-tier:filestore-Home-000 setprop zfs/retry-timestamp=0
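One note if you try this: SMF services normally need a refresh before a running instance picks up changed properties. I'm not certain whether the zfs-auto-tier script reads the repository directly, but a refresh does no harm:
svcadm refresh svc:/system/filesystem/zfs/auto-tier:filestore-Home-000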
And waited...
No new snapshot was created!
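With the runaway snapshots stopped, the auto-tier services I had disabled earlier could be re-enabled, using the counterpart of the disable commands above:
root@nexenta01:/# svcadm enable svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000
root@nexenta01:/# svcadm enable svc:/system/filesystem/zfs/auto-tier:filestore-Home-000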
I assume this is a bug. The temporary failure of the second device should not cause the primary VSA to run amok! Fortunately, the fix appears to have worked and the auto-tier service is now running correctly. The web interface is also performing as expected!