Saturday, 22 January 2011

NexentaStor Community Edition: Troubleshooting the slow web interface

Although my experience with the NexentaStor Community Edition VSA has been largely positive, I found the web interface to be slow at times. I thought I'd do a bit of troubleshooting to see what was wrong...

The first step to troubleshooting is to get to a proper Unix prompt (remember that NexentaStor is built on the Solaris codebase). I opened an SSH session to the VSA and logged in as "admin". By default the admin shell is a bit special, and for real troubleshooting, we needed the root account. To get this, run the "su" command:

admin@nexenta01:~$ su

Note that I ran "su" and not "su -".

VMware ESX admins may be familiar with "esxtop", and Linux admins with "top". The Solaris equivalent is "prstat":

   877 root       36M   33M sleep   59    0   0:00:35 0.7% python2.5/15
  7312 root     4320K 3468K cpu0    59    0   0:00:00 0.4% prstat/1
   855 root       58M   26M sleep   59    0   0:00:33 0.3% nms/1
   196 root     7812K 4716K sleep   59    0   0:00:00 0.2% nscd/32
   576 root     6952K 4676K sleep   59    0   0:00:05 0.2% vmtoolsd/1
  3716 root       17M   15M sleep   44    5   0:00:04 0.1% volume-check/1
   596 root       39M 9092K sleep   59    0   0:00:00 0.1% nmdtrace/1
  7213 admin    7744K 5272K sleep   59    0   0:00:00 0.1% sshd/1
   953 root       58M   54M sleep   59    0   0:00:13 0.1% nms/1
  3560 root       16M   15M sleep   44    5   0:00:01 0.0% disk-check/1
  3564 root       18M   17M sleep   59    0   0:00:02 0.0% hosts-check/1
   434 root     3472K 2104K sleep   59    0   0:00:03 0.0% dbus-daemon/1
   324 root        0K    0K sleep   99  -20   0:00:02 0.0% zpool-testpool/136
     5 root        0K    0K sleep   99  -20   0:00:01 0.0% zpool-syspool/136
   509 root       18M   10M sleep   59    0   0:00:01 0.0% fmd/21
  1273 root       58M   54M sleep   59    0   0:00:05 0.0% nms/1
   392 root       13M 8400K sleep   59    0   0:00:00 0.0% smbd/18
   515 www-data   17M 6716K sleep   59    0   0:00:00 0.0% apache2/28
   234 root     2604K 1584K sleep  100    -   0:00:00 0.0% xntpd/1
  7231 root     4628K 2568K sleep   59    0   0:00:00 0.0% bash/1
   519 www-data   17M 6564K sleep   59    0   0:00:00 0.0% apache2/28
Total: 91 processes, 732 lwps, load averages: 0.55, 0.53, 0.55

When troubleshooting, I noticed that the process using the most CPU was "nms". This is a custom command provided by Nexenta. Curious to what this command was doing, I ran the truss command against the process id:

root@nexenta01:/export/home/admin# truss -f -p 855
855:    pollsys(0x08047AA0, 1, 0x08047B58, 0x00000000) (sleeping...)
855:    pollsys(0x08047AA0, 1, 0x08047B58, 0x00000000)    = 1
855:    read(4, " l01\00118\0\0\08D\0\0\0".., 2048)    = 176
855:    read(4, 0x0AB8DB78, 2048)            Err#11 EAGAIN
855:    stat64("/tmp/.nza", 0x0813E078)            = 0
855:    stat64("/tmp/.nza", 0x0813E078)            = 0
855:    stat64("/tmp/.nza/.appliance", 0x0813E078)    = 0
855:    open64("/tmp/.nza/.appliance", O_RDWR)        = 9
855:    fstat64(9, 0x0813DFE8)                = 0
855:    fcntl(9, F_SETFD, 0x00000001)            = 0
855:    llseek(9, 0, SEEK_CUR)                = 0
855:    fcntl(9, F_SETLKW64, 0x08047410)        = 0
855:    llseek(9, 0, SEEK_CUR)                = 0

The truss command traces system calls and although the output appear quite scary, you can learn a lot about what a process is doing without needing to know what the system calls are doing. Useful calls to look for are:
  • open() - opens a file for reading/writing. The number returned (on the right after the = sign) is the file descriptor.
  • close() - closes a file.
  • read() and write() - the first number after the "(" is the file that is being read from or written to. Cross reference it with the open() call.
  • stat() and stat64() - tests to see if a file exists. Don't worry if you get errors returned here as it might be the process looking for a file that may existing in multiple places (e.g., when scanning the PATH for an executable).
The -f option in truss means that child processes will be "followed". So if the process you are tracing forks another process, you will get the data on the child process as well. The -p tells truss to trace the numbered process (obtained from ps or prstat output).

That was a really quick intro to truss for the purposes of explaining how I debugged the problem. Truss is capable of a lot more than I've just described. See the man page ("man truss") for more details.

Back to the performance problem...

The truss output showed me that the nms process was scanning through a lot of ZFS snapshots. There seemed to be a lot of these snapshots. I obtained a list of snapshots on the system:

# zfs list -t snapshot

...and got hundreds back! Something was creating a large number of snapshots. On closer inspection, it appeared I was getting a new snapshot of some filesystems every 7 minutes:

filestore/Shared@snap-daily-1-2011-01-21-2122     1.23M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2129     1.22M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2136     1.39M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2143     1.22M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2150     1.12M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2157     1.06M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2201     1.04M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2208     1.04M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2215         0      -  50.5G  -
filestore/Shared@snap-daily-1-latest                  0      -  50.5G  -

It appeared that the filesystems with the large number of snapshots were also the filesystems that I had set to replicate to my second VSA using the auto-tier service. As a test, I listed all the auto-tier services and disabled the suspects:

root@nexenta01:/# svcs -a | grep auto-tier
online         20:33:40 svc:/system/filesystem/zfs/auto-tier:filestore-Software-000
online         20:33:41 svc:/system/filesystem/zfs/auto-tier:filestore-ISOs-000
online         20:33:42 svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000
online         21:33:24 svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
root@nexenta01:/# svcadm disable svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
root@nexenta01:/# svcadm disable svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000

The snapshots stopped.

To determine if the number of snapshots was the problem (I'd seen similar problems before), I destroyed all the snapshots for that filesystem:

root@nexenta01:/# for snapshot in $(zfs list -t snapshot | grep Home | grep snap-daily-1-2011-01 | awk '{ print $1 }'); do zfs destroy $snapshot; echo "Destroyed $snapshot"; done

The web interface performance was fast.

Okay, so that was the problem, but why was it happening? At first, I couldn't work it out and deleted and recreated the auto-tier jobs. Everything then worked fine... for a couple of days. Then the number of snapshots increased again.

This time I was able to identify a change in the configuration. I had had to reboot the second VSA because it had run out of memory (I assigned too little). This appears to have caused the link between the two to have been broken and the auto-tier jobs were running out of control.

Knowing that a new snapshot would fire every 7 minutes, I waited and ran the "ptree" command (shows the list of processes in a tree view showing the parent/child relationships) until I spotted the auto-tier job:

  2260  sh -c /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tier:filest
    2266  /usr/bin/perl /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tie
      2315  rsync -e ssh --delete --exclude-from=/var/lib/nza/rsync_excl.txt --inplace --ig
        2316  ssh rsync --server -lHogDtpre.isf --delete --ignore-errors

The problem here was the zfs-auto-tier service (process 2260). Although the full command is truncated, I compared it with the output from svcs (see above) and guessed it to be:

sh -c /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tier:filestore-Home-000

To examine the properties of this service, I ran:

root@nexenta01:/# svccfg -s svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000 listprop
zfs                                application
zfs/action                         astring 
zfs/day                            astring  1
zfs/depth                          astring  1
zfs/dircontent                     astring  0
zfs/direction                      astring  1
zfs/exclude                        astring 
zfs/from-fs                        astring  /volumes/filestore/Shared
zfs/from-host                      astring  localhost
zfs/from-snapshot                  astring 
zfs/fs-name                        astring  filestore/Shared
zfs/keep_days                      astring  7
zfs/method                         astring  tier
zfs/minute                         astring  0
zfs/options                        astring  "--delete --exclude-from=/var/lib/nza/rsync_excl.txt --inplace --ignore-errors -HlptgoD"
zfs/proto                          astring  rsync+ssh
zfs/rate_limit                     astring  0
zfs/to-fs                          astring  /volumes/backup
zfs/to-host                        astring
zfs/trace_level                    astring  1
zfs/type                           astring  daily
zfs/retry-timestamp                astring  1295492474
zfs/period                         astring  1
zfs/hour                           astring  2
zfs/last_replic_time               astring  12
zfs/time_started                   astring  21:43:18,Jan21
zfs/retry                          astring  1
startd                             framework
startd/duration                    astring  transient
general                            framework
general/enabled                    boolean  true
start                              method
start/exec                         astring  "/lib/svc/method/zfs-auto-tier start"
start/timeout_seconds              count    0
start/type                         astring  method
stop                               method
stop/exec                          astring  "/lib/svc/method/zfs-auto-tier stop"
stop/timeout_seconds               count    0
stop/type                          astring  method
refresh                            method
refresh/exec                       astring  "/lib/svc/method/zfs-auto-tier refresh"
refresh/timeout_seconds            count    0
refresh/type                       astring  method
restarter                          framework    NONPERSISTENT
restarter/auxiliary_state          astring  none
restarter/logfile                  astring  /var/svc/log/system-filesystem-zfs-auto-tier:filestore-Shared-000.log
restarter/start_pid                count    867
restarter/start_method_timestamp   time     1295642022.247594000
restarter/start_method_waitstatus  integer  0
restarter/transient_contract       count  
restarter/next_state               astring  none
restarter/state                    astring  online
restarter/state_timestamp          time     1295642022.255217000

The property that stood out was zfs/retry-timestamp and I guessed the value was a timestamp counting in seconds since the epoch. Converting the value turned it into a human-readable date:

Thu Jan 20 2011 03:01:14 GMT+0000 (BST)

This date was in the past, so was the script running because of this?

I editing the value:

svccfg -s svc:/system/filesystem/zfs/auto-tier:filestore-Home-000 setprop zfs/retry-timestamp=0

And waited...

No new snapshot was created!

I assume this a bug. The temporary failure of the second device should not cause the primary VSA to run amok! Fortunately, this fix appears to have worked and the auto-tier service is now working correctly. The web interface is also performing as expected!


THo aka "bonzay" tafka "DJ Mouse" said...

Thank you so much for this post, i expierenced the exakt same problems, I hope they will be fixed now. Thx again

Bethwaite said...

I had a similar issue with an auto-tier service continually rerunning every ~10 mins after it failed once, despite all subsequent rsyncs succeeding. I tried disabling and re-enabling the service, and even altered the schedule, neither had any effect.

Thanks to your post I looked at the service properties. My other auto-tiers also had their retry-timestamps in the past, but they were not exhibiting the buggy behaviour. I noticed the zfs/retry property was 2 on the buggy service, where on the others it was 0. The auto-tier config "Service retry" was false. Setting the zfs/retry service property to 0 fixed the issue for me (for the moment).

JR said...

Thanks for the comment Bethwaite.

I've not had any problems recently to test your property, but will give it a try if I have any further issues.

Thanks for the comment!