Saturday, 22 January 2011

NexentaStor Community Edition: Troubleshooting the slow web interface

Although my experience with the NexentaStor Community Edition VSA has been largely positive, I found the web interface to be slow at times. I thought I'd do a bit of troubleshooting to see what was wrong...

The first step in troubleshooting is to get to a proper Unix prompt (remember that NexentaStor is built on the Solaris codebase). I opened an SSH session to the VSA and logged in as "admin". By default the admin shell is a bit special, so for real troubleshooting the root account is needed. To get this, run the "su" command:

admin@nexenta01:~$ su
Password:
root@nexenta01:/export/home/admin#

Note that I ran "su" and not "su -" (plain "su" keeps the current environment rather than starting a full root login shell).

VMware ESX admins may be familiar with "esxtop", and Linux admins with "top". The Solaris equivalent is "prstat":

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP      
   877 root       36M   33M sleep   59    0   0:00:35 0.7% python2.5/15
  7312 root     4320K 3468K cpu0    59    0   0:00:00 0.4% prstat/1
   855 root       58M   26M sleep   59    0   0:00:33 0.3% nms/1
   196 root     7812K 4716K sleep   59    0   0:00:00 0.2% nscd/32
   576 root     6952K 4676K sleep   59    0   0:00:05 0.2% vmtoolsd/1
  3716 root       17M   15M sleep   44    5   0:00:04 0.1% volume-check/1
   596 root       39M 9092K sleep   59    0   0:00:00 0.1% nmdtrace/1
  7213 admin    7744K 5272K sleep   59    0   0:00:00 0.1% sshd/1
   953 root       58M   54M sleep   59    0   0:00:13 0.1% nms/1
  3560 root       16M   15M sleep   44    5   0:00:01 0.0% disk-check/1
  3564 root       18M   17M sleep   59    0   0:00:02 0.0% hosts-check/1
   434 root     3472K 2104K sleep   59    0   0:00:03 0.0% dbus-daemon/1
   324 root        0K    0K sleep   99  -20   0:00:02 0.0% zpool-testpool/136
     5 root        0K    0K sleep   99  -20   0:00:01 0.0% zpool-syspool/136
   509 root       18M   10M sleep   59    0   0:00:01 0.0% fmd/21
  1273 root       58M   54M sleep   59    0   0:00:05 0.0% nms/1
   392 root       13M 8400K sleep   59    0   0:00:00 0.0% smbd/18
   515 www-data   17M 6716K sleep   59    0   0:00:00 0.0% apache2/28
   234 root     2604K 1584K sleep  100    -   0:00:00 0.0% xntpd/1
  7231 root     4628K 2568K sleep   59    0   0:00:00 0.0% bash/1
   519 www-data   17M 6564K sleep   59    0   0:00:00 0.0% apache2/28
Total: 91 processes, 732 lwps, load averages: 0.55, 0.53, 0.55
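
If you want to narrow the view down, prstat takes a few handy options. I don't remember exactly what I ran here, but something like the following (sorted by CPU, top 20 processes, refreshing every 5 seconds) gives a similar picture:

# prstat -s cpu -n 20 5    # sort by CPU, show the top 20 processes, refresh every 5 seconds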


When troubleshooting, I noticed that the process using the most CPU was "nms". This is a custom command provided by Nexenta. Curious as to what it was doing, I ran the truss command against its process ID:

root@nexenta01:/export/home/admin# truss -f -p 855
855:    pollsys(0x08047AA0, 1, 0x08047B58, 0x00000000) (sleeping...)
855:    pollsys(0x08047AA0, 1, 0x08047B58, 0x00000000)    = 1
855:    read(4, " l01\00118\0\0\08D\0\0\0".., 2048)    = 176
855:    read(4, 0x0AB8DB78, 2048)            Err#11 EAGAIN
855:    stat64("/tmp/.nza", 0x0813E078)            = 0
855:    stat64("/tmp/.nza", 0x0813E078)            = 0
855:    stat64("/tmp/.nza/.appliance", 0x0813E078)    = 0
855:    open64("/tmp/.nza/.appliance", O_RDWR)        = 9
855:    fstat64(9, 0x0813DFE8)                = 0
855:    fcntl(9, F_SETFD, 0x00000001)            = 0
855:    llseek(9, 0, SEEK_CUR)                = 0
855:    fcntl(9, F_SETLKW64, 0x08047410)        = 0
855:    llseek(9, 0, SEEK_CUR)                = 0


The truss command traces system calls, and although the output appears quite scary, you can learn a lot about what a process is doing without needing to know exactly what each system call does. Useful calls to look for are:
  • open() - opens a file for reading/writing. The number returned (on the right after the = sign) is the file descriptor.
  • close() - closes a file.
  • read() and write() - the first number inside the "(" is the file descriptor being read from or written to. Cross-reference it with the descriptor returned by the earlier open() call.
  • stat() and stat64() - test whether a file exists. Don't worry if you see errors returned here, as it may just be the process looking for a file that could exist in multiple places (e.g., when scanning the PATH for an executable).
The -f option in truss means that child processes will be "followed". So if the process you are tracing forks another process, you will get the data on the child process as well. The -p tells truss to trace the numbered process (obtained from ps or prstat output).

That was a really quick intro to truss for the purposes of explaining how I debugged the problem. Truss is capable of a lot more than I've just described. See the man page ("man truss") for more details.
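
A couple of other truss options are worth knowing about. These aren't necessarily ones I used here, but they are in the standard Solaris truss:

# truss -c -p 855                      # count system calls and print a summary when interrupted
# truss -f -o /tmp/nms.truss -p 855    # follow children and write the trace to a file for later reading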

Back to the performance problem...

The truss output showed me that the nms process was scanning through the ZFS snapshots, and there seemed to be an awful lot of them. I obtained a list of snapshots on the system:

# zfs list -t snapshot

...and got hundreds back! Something was creating a large number of snapshots. On closer inspection, it appeared I was getting a new snapshot of some filesystems every 7 minutes:

filestore/Shared@snap-daily-1-2011-01-21-2122     1.23M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2129     1.22M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2136     1.39M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2143     1.22M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2150     1.12M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2157     1.06M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2201     1.04M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2208     1.04M      -  50.5G  -
filestore/Shared@snap-daily-1-2011-01-21-2215         0      -  50.5G  -
filestore/Shared@snap-daily-1-latest                  0      -  50.5G  -
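
If you want a quick count rather than scrolling through pages of output, something like this does the job (the -H flag just suppresses the header line):

# zfs list -H -t snapshot | wc -l                               # total snapshots on the system
# zfs list -H -t snapshot | grep '^filestore/Shared@' | wc -l   # snapshots of a single filesystem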


It appeared that the filesystems with the large number of snapshots were also the filesystems that I had set to replicate to my second VSA using the auto-tier service. As a test, I listed all the auto-tier services and disabled the suspects:

root@nexenta01:/# svcs -a | grep auto-tier
online         20:33:40 svc:/system/filesystem/zfs/auto-tier:filestore-Software-000
online         20:33:41 svc:/system/filesystem/zfs/auto-tier:filestore-ISOs-000
online         20:33:42 svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000
online         21:33:24 svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
root@nexenta01:/# svcadm disable svc:/system/filesystem/zfs/auto-tier:filestore-Home-000
root@nexenta01:/# svcadm disable svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000


The snapshots stopped.

To determine if the number of snapshots was the problem (I'd seen similar problems before), I destroyed all the snapshots for that filesystem:

root@nexenta01:/# for snapshot in $(zfs list -t snapshot | grep Home | grep snap-daily-1-2011-01 | awk '{ print $1 }'); do zfs destroy $snapshot; echo "Destroyed $snapshot"; done
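
If you're nervous about a loop like that (and you probably should be), a sensible first step is a dry run that only prints what would be destroyed. This is just an illustrative variant, not something from my actual session:

# for snapshot in $(zfs list -H -t snapshot -o name | grep '^filestore/Home@snap-daily-1-2011-01'); do echo "Would destroy $snapshot"; done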


The web interface was fast again.

Okay, so that was the problem, but why was it happening? At first, I couldn't work it out and deleted and recreated the auto-tier jobs. Everything then worked fine... for a couple of days. Then the number of snapshots increased again.

This time I was able to identify a change in the configuration. I had had to reboot the second VSA because it had run out of memory (I had assigned too little). This appears to have broken the link between the two appliances, leaving the auto-tier jobs running out of control.

Knowing that a new snapshot would fire every 7 minutes, I waited and ran the "ptree" command (which shows processes in a tree view, revealing the parent/child relationships) until I spotted the auto-tier job:

  2260  sh -c /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tier:filest
    2266  /usr/bin/perl /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tie
      2315  rsync -e ssh --delete --exclude-from=/var/lib/nza/rsync_excl.txt --inplace --ig
        2316  ssh nexenta02.local.zone rsync --server -lHogDtpre.isf --delete --ignore-errors


The problem here was the zfs-auto-tier service (process 2260). Although the full command is truncated, I compared it with the output from svcs (see above) and guessed it to be:

sh -c /lib/svc/method/zfs-auto-tier svc:/system/filesystem/zfs/auto-tier:filestore-Home-000

To examine the properties of this service, I ran:

root@nexenta01:/# svccfg -s svc:/system/filesystem/zfs/auto-tier:filestore-Shared-000 listprop
zfs                                application
zfs/action                         astring 
zfs/day                            astring  1
zfs/depth                          astring  1
zfs/dircontent                     astring  0
zfs/direction                      astring  1
zfs/exclude                        astring 
zfs/from-fs                        astring  /volumes/filestore/Shared
zfs/from-host                      astring  localhost
zfs/from-snapshot                  astring 
zfs/fs-name                        astring  filestore/Shared
zfs/keep_days                      astring  7
zfs/method                         astring  tier
zfs/minute                         astring  0
zfs/options                        astring  "--delete --exclude-from=/var/lib/nza/rsync_excl.txt --inplace --ignore-errors -HlptgoD"
zfs/proto                          astring  rsync+ssh
zfs/rate_limit                     astring  0
zfs/to-fs                          astring  /volumes/backup
zfs/to-host                        astring  nexenta02.local.zone
zfs/trace_level                    astring  1
zfs/type                           astring  daily
zfs/retry-timestamp                astring  1295492474
zfs/period                         astring  1
zfs/hour                           astring  2
zfs/last_replic_time               astring  12
zfs/time_started                   astring  21:43:18,Jan21
zfs/retry                          astring  1
startd                             framework
startd/duration                    astring  transient
general                            framework
general/enabled                    boolean  true
start                              method
start/exec                         astring  "/lib/svc/method/zfs-auto-tier start"
start/timeout_seconds              count    0
start/type                         astring  method
stop                               method
stop/exec                          astring  "/lib/svc/method/zfs-auto-tier stop"
stop/timeout_seconds               count    0
stop/type                          astring  method
refresh                            method
refresh/exec                       astring  "/lib/svc/method/zfs-auto-tier refresh"
refresh/timeout_seconds            count    0
refresh/type                       astring  method
restarter                          framework    NONPERSISTENT
restarter/auxiliary_state          astring  none
restarter/logfile                  astring  /var/svc/log/system-filesystem-zfs-auto-tier:filestore-Shared-000.log
restarter/start_pid                count    867
restarter/start_method_timestamp   time     1295642022.247594000
restarter/start_method_waitstatus  integer  0
restarter/transient_contract       count  
restarter/next_state               astring  none
restarter/state                    astring  online
restarter/state_timestamp          time     1295642022.255217000


The property that stood out was zfs/retry-timestamp and I guessed the value was a timestamp counting in seconds since the epoch. Converting the value turned it into a human-readable date:

Thu Jan 20 2011 03:01:14 GMT+0000 (BST)

This date was in the past, so was the script retrying because of it?
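
For the record, you don't need a website to do the epoch conversion; the Perl that ships on the appliance can do it (a quick one-liner sketch):

# perl -le 'print scalar gmtime(1295492474)'    # prints Thu Jan 20 03:01:14 2011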

I edited the value:

svccfg -s svc:/system/filesystem/zfs/auto-tier:filestore-Home-000 setprop zfs/retry-timestamp=0

And waited...

No new snapshot was created!

I assume this is a bug. The temporary failure of the second device should not cause the primary VSA to run amok! Fortunately, the fix appears to have worked and the auto-tier service is now behaving correctly. The web interface is also performing as expected!

Tuesday, 18 January 2011

NexentaStor Community Edition: Compression and Deduplication Benchmarks

If you read my previous post on benchmarking the NexentaStor VSA and want even more benchmarking information, this post is for you!

ZFS filesystems can be configured to support compression and, in later releases, deduplication. While both of these features are useful for maximising the use of disk space, what is the impact on performance when running with these options enabled?

The base configuration of the appliance is the same as test 4 from the previous post: a mirrored pair of SATA disks with an SSD L2ARC. The filesystem is configured to use the ZFS Intent Log (ZIL) for synchronous operations (the default), but a separate log device is not configured.

The performance data for the base configuration is:
  • Sequential Block Reads: 59934K/sec (58.5MB/sec)
  • Sequential Block Writes: 28793K/sec (28.1MB/sec)
  • Rewrite: 17127K/sec (16.7MB/sec)
  • Random Seeks: 517.45/sec
The benchmark will use bonnie++ running on a Solaris 11 Express VM connecting to the NexentaStor appliance over an internal vSwitch. The bonnie++ command line is:


# /usr/local/sbin/bonnie++ -uroot -x 4 -f -s 4096 -d /mnt


See the previous blog post for an explanation of these options.

Test 1: Set compression=on

Enabling compression for a specific filesystem is very simple:

# zfs set compression=on testpool/testing
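
As an aside, it's worth checking how much compression is actually achieving on your data; the compressratio property reports this (shown here as a general example rather than a figure from this benchmark run):

# zfs get compression,compressratio testpool/testing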

The results of the test were:
  • Sequential Block Reads: 52830K/sec (51.5MB/sec)
  • Sequential Block Writes: 38811K/sec (37.9MB/sec)
  • Rewrite: 18659K/sec (18.2MB/sec)
  • Random Seeks: 1188.95/sec

Reads were lower with compression enabled, but writes and rewrites were faster. Random seeks were much faster too, which I cannot fully explain, although I suspect that if the bonnie++ test data is highly compressible, it could cause "odd" results such as this.

Test 2: Set deduplication=on

For this test, compression was turned off and de-duplication turned on:

# zfs set compression=off testpool/testing
# zfs set dedup=on testpool/testing

The results of the test were:

  • Sequential Block Reads: 45806K/sec (44.7MB/sec)
  • Sequential Block Writes: 27550K/sec (26.9MB/sec)
  • Rewrite: 15179K/sec (14.8MB/sec)
  • Random Seeks: 464/sec
This shows that there is a performance penalty for enabling data deduplication. There is also a RAM overhead as the operating system needs to store the dedupe table in memory (not measured as part of this test).
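
For completeness, the space saving from deduplication can be checked at the pool level. I didn't record these figures as part of the benchmark, but the commands are straightforward:

# zpool get dedupratio testpool    # overall dedup ratio for the pool
# zdb -DD testpool                 # histogram of the dedup table (DDT)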


Test 3: Set compression=on, deduplication=on


For this test, both compression and de-duplication were turned on:

# zfs set compression=on testpool/testing
# zfs set dedup=on testpool/testing

The results of the test were:
  • Sequential Block Reads: 53844K/sec (52.5MB/sec)
  • Sequential Block Writes: 34315K/sec (33.5MB/sec)
  • Rewrite: 17654K/sec (17.2MB/sec)
  • Random Seeks: 1331/sec
These results suggest that if deduplication is required (to save space), then enabling compression as well actually improves both read and write performance, despite the extra overhead. As with compression alone, random seeks improved significantly.

Conclusion

In conclusion, for maximum read performance, do not turn on compression or deduplication. For maximum write and rewrite performance, turn on compression. If deduplication is required, consider turning on compression as well as this improves dedupe performance. There is a CPU and memory overhead using these features, but as with most things, it's a case of balancing the cost vs the benefit.

Friday, 14 January 2011

NexentaStor Community Edition: Benchmarking

In a previous post I discussed how I implemented a Virtual Storage Appliance (VSA) on my VMware home server running the NexentaStor Community Edition operating system. While getting everything working was fairly straightforward, knowing how well it was running required some benchmarking.

I've used the bonnie++ benchmark program before and generally like the way it works. Although I suspect most of this testing could be done through the web interface (setting up the disks etc.), I found it easier and quicker to use the command line and the native Solaris commands.

For the test, I created a new VMDK (20GB) on my primary SATA drive and published it to the VSA. I then created a new pool and added the disk:

# zpool create testpool c1t5d0

I then created a filesystem in the pool:

# zfs create testpool/testing

For this testing, I did not enable compression or deduplication (perhaps a topic for another day...).

I ran the bonnie++ benchmark with the following command line:

#  /usr/local/sbin/bonnie++ -uroot -s 8192 -d /testpool/testing

The size (8192) tells bonnie++ to create test data that is 8GB in size. This is twice the RAM allocated to the VM, which prevents the results from being skewed by data cached in memory. I then ran each test 4 times and averaged the results. No other significant activity was taking place while the tests were running. To provide a consistent environment, I used CPU and memory reservations for the VM. I opted to focus on sequential block reads, sequential block writes (ZFS buffers random writes and writes them sequentially), rewrite and random seek performance. A good guide to understanding bonnie++ output in a ZFS context can be found here.
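
For anyone wanting to script the repeated runs, something along these lines works; bonnie++'s -q option cuts the output down to the machine-readable CSV summary, which can then be averaged. This is an illustrative sketch rather than the exact commands I used:

# for i in 1 2 3 4; do /usr/local/sbin/bonnie++ -u root -s 8192 -d /testpool/testing -q >> /tmp/bonnie_results.csv; done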

Test 1: One SATA based disk

This is the basic starting point: One disk in the pool:
  • Sequential Block Reads: 75492K/sec (73.5MB/sec)
  • Sequential Block Writes: 61966K/sec (60.5MB/sec)
  • Rewrite: 26873K/sec (26.2MB/sec)
  • Random Seeks: 278.8/sec

Test 2: Mirrored SATA disks

I added a second VMDK to the NexentaStor VM locating it on the second SATA disk. The new disk was then added to the test pool as a mirror. Once the resilvering was complete, the test was re-run:
  • Sequential Block Reads: 76141K/sec (74.3MB/sec)
  • Sequential Block Writes: 52331K/sec (51.1MB/sec)
  • Rewrite: 31525K/sec (30.7MB/sec)
  • Random Seeks: 292.6/sec
So we can see that in a mirrored configuration, block reads are marginally faster, block writes are slower, rewrites of existing blocks are faster and the number of random seeks has increased. I was a bit surprised that the block read figure was not much higher, given that running a "zpool iostat" on the test pool shows the read load is balanced across both disks. The slower writes are no surprise as the kernel has to write the same data to two separate devices.
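
For reference, turning a single-disk pool into a mirror is a one-liner with zpool attach. The second device name below is illustrative, as I don't have a note of the exact one:

# zpool attach testpool c1t5d0 c1t6d0    # attach a new device (c1t6d0 is illustrative) to mirror c1t5d0
# zpool status testpool                  # watch the resilver progress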

Test 3: Mirrored SATA disks with SSD L2ARC

I added another VMDK to the NexentaStor VM locating it on the SSD datastore. The new disk was added to the test pool as a cache device, implementing a L2ARC:
  • Sequential Block Reads: 116837K/sec (114MB/sec)
  • Sequential Block Writes: 65454K/sec (63.9MB/sec)
  • Rewrite: 37598K/sec (36.7MB/sec)
  • Random Seeks: 440/sec
The L2ARC has improved the read performance significantly, and surprisingly the write performance is faster too (I'm not sure why, as the L2ARC is a read-only cache). Rewrites are a bit faster and random seeks are much higher (to be expected with an SSD).
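
Again for reference, adding a cache device is simple (device name illustrative):

# zpool add testpool cache c1t3d0    # c1t3d0 is an illustrative name for the SSD-backed VMDK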

So at this point, we have a pretty good idea of the NexentaStor appliance's performance when writing to local disk. The next test is to see what the performance is like over NFS...

Test 4: NFS test from Solaris 11 Express VM

The Solaris 11 Express VM is running on the same host and is connected to the VSA by the same vSwitch.

The mount operation was performed by running:

# mount -F nfs nexenta01:/testpool/testing /mnt

The bonnie++ command was:

# /usr/local/sbin/bonnie++ -uroot -x 4 -f -s 4096 -d /mnt

The size of the testing dataset was reduced from the 8192MB on the VSA because the Solaris 11 Express VM only has 2GB of RAM and 4096MB is enough to ensure the VM isn't caching the data in its RAM. It's less than the RAM in the VSA, but at this point we're interested in the performance over the network to clients, not the speed of the appliance itself.

The default behaviour for Solaris NFS is to perform synchronous writes (see my last blog post for a quick primer on NFS/ZFS interactions). Using the zilstat script, I was able to confirm that the ZIL was written to during the benchmark run, proving that the write operations were indeed synchronous. As expected, performance was much worse:
  • Sequential Block Reads: 59934K/sec (58.5MB/sec)
  • Sequential Block Writes: 28793K/sec (28.1MB/sec)
  • Rewrite: 17127K/sec (16.7MB/sec)
  • Random Seeks: 517.45/sec
Of course, the network stack will be an overhead, but it's worth seeing if we can improve on these times...

Test 5: NFS test from Solaris 11 Express VM, sync=disabled on NexentaStor NFS server

The ZIL is used in ZFS to log synchronous writes to a secure place before writing the data to the pool. The NFS client will wait until the server has confirmed the write to the ZIL before continuing processing. Very good for data safety, but it does slow things down. The ZFS sync=disabled option bypasses the ZIL and buffers the request in the server's RAM until it is committed to disk. In real-world terms it's less reliable, but it's about the same as other non-ZFS-based NFS servers such as Linux.

The command to disable synchronous writes (on a per-filesystem basis), is:

# zfs set sync=disabled testpool/testing

The tests were then re-run:

  • Sequential Block Reads: 69438K/sec (67.8MB/sec)
  • Sequential Block Writes: 49177K/sec (48MB/sec)
  • Rewrite: 20737K/sec (20.2MB/sec)
  • Random Seeks: 520.9/sec
As the test ran, I monitored the ZIL utilisation using zilstat and confirmed that the ZIL was not being used. The results show a significant improvement in writes of approximately 20MB/sec and a smaller improvement in rewrites.


Test 6: NFS test from Solaris 11 Express VM, sync=standard, separate slog on NexentaStor NFS server

Disabling the ZIL improved NFS performance, but what would happen if the ZIL was placed on a separate SSD disk? To do this, I created a new disk from the SSD datastore and attached it to the pool:

# zpool add testpool log c1t4d0

I changed the sync property back to standard (re-enabling synchronous writes) and ran the tests, using zilstat to confirm that the ZIL was being written to:
  • Sequential Block Reads: 55142K/sec (53.8MB/sec)
  • Sequential Block Writes: 23931K/sec (23.3MB/sec)
  • Rewrite: 16518.5K/sec (16.1MB/sec)
  • Random Seeks: 628.1/sec
Well, this was unexpected! The separate ZIL has produced worse results than using the pool's SATA disks and writing the data twice! The surprising drop in performance may be due to the type of SSD I'm using (OCZ Vertex 2). This is a Multi Level Cell (MLC) device, which is optimised for read operations (most consumer SSDs are MLC). For high-performance writes, Single Level Cell (SLC) SSDs are recommended, but they are far more expensive.

Conclusion

To wrap this up then, there are two options to consider when running NFS on ZFS:

  1. Enable the ZIL, experience slower performance but know the data is secure
  2. Disable the ZIL, experience faster performance but understand the risks
The "best" option depends on the environment the VSA is serving. Fortunately the ZIL can be turned on or off on a per-filesystem basis. This means that non-critical test lab VMs can sit on a filesystem with no ZIL for maximum performance, while critical data (e.g., family photos/videos and the copy of your tax return) can be configured with end-to-end consistency.

If you are running a VMware home lab and are looking for a decent virtual storage appliance, NexentaStor CE is definitely worth a look, and as you can see, has plenty of features!

Thursday, 13 January 2011

Understanding NFS and ZFS interactions

This post was originally going to document the benchmarking of my NexentaStor VSA. Although most of this work has been done (and will be posted soon - promise!), the results were somewhat confusing and required me to dive into the guts of how Solaris (on which NexentaStor is based) handles filesystem operations and NFS. This might be useful if you are trying to debug some performance issues:

ZFS is designed so that all writes are transactional. A write is either successful and the data is written to disk, or it fails and no data gets written. This means that the data on-disk is always consistent.

From an application's perspective, there are two types of write operations to a POSIX compliant filesystem: asynchronous and synchronous.

An asynchronous write passes the data to the filesystem and then continues, effectively assuming that the data is safe. In comparison, a synchronous write passes the data to the filesystem and then waits until the filesystem acknowledges that the data is safe.

Asynchronous Writes



Synchronous Writes


The difference can be seen in what happens if the filesystem or server has a problem (e.g., a power failure). With an asynchronous write, the data may still only be buffered in RAM and is therefore lost, even though the application believes it is safe. With synchronous writes, the data will definitely be safe because the filesystem only reports back to the application once the data is definitely written.

So while the ZFS on-disk data is consistent, it's possible, using asynchronous writes, to lose data in transit. For synchronous writes, ZFS uses a feature called the ZFS Intent Log (ZIL).

When a synchronous write request is made to the filesystem, the data is first written to the ZIL. This ensures the data is safe on disk and the application is free to continue. ZFS then flushes the outstanding data to the main pool with the next transaction group commit (roughly every 5 seconds).

By default, the ZIL is allocated disk blocks from the pool. This leads to a situation where data is written to disk twice: once to the ZIL and again to the actual filesystem. This double writing can slow things down.

One option to speed things up is to use a Separate Log (SLOG in ZFS terminology) on which to locate the ZIL. This is typically a flash/SSD drive. Synchronous writes can be logged quickly to the SLOG and then written to the slower disks later, speeding up the response to the application. When using a SLOG, the recommendation is to mirror the log devices so that a single device failure cannot lose the in-flight data.

Synchronous Writes with separate ZIL


How does this impact on NFS?

It is common for an NFS client to write synchronously to an NFS server in order to get an acknowledgement that the data is safely committed to disk. This means that an NFS server with a ZFS filesystem will be performing ZIL writes, with the associated overheads. It also means the NFS client will be blocked until it receives an acknowledgement back from the server. This can result in "poor performance" when running ZFS and NFS together.

NFS with synchronous writes

There are a couple of options that can speed things up:

The NFS client may opt to mount the filesystem asynchronously. This means that the data will sit in the server's RAM buffers until a transaction group commit and is therefore vulnerable to power failure. NFS clients mounting asynchronously may therefore lose data.



Another option is to disable the ZIL on the NFS server. This effectively makes all synchronous writes asynchronous from the ZFS perspective. Again, data in transit may be lost. Recent versions of ZFS control this through the sync property: sync=standard (the default; synchronous writes go to the ZIL), sync=always (paranoid; everything is written synchronously) and sync=disabled (which makes all writes asynchronous).
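
Checking and changing the setting is a one-liner per filesystem; for example (filesystem name illustrative):

# zfs get sync tank/data            # show the current setting
# zfs set sync=always tank/data     # paranoid: everything goes through the ZIL
# zfs set sync=disabled tank/data   # fast but risky: all writes become asynchronous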

NFS with async or ZIL disabled


Whether NFS async or disabling the ZIL is a risk worth taking depends on the nature of your data. VMware vSphere appears (based on my reading, which I'm taking on trust) to use synchronous writes for NFS datastores, which can impact performance. Applying some of the tuning detailed above may help improve performance in a virtual environment.

Wednesday, 5 January 2011

NexentaStor Community Edition - first impressions

During the Christmas break I took the opportunity to upgrade my HP ML110 G5 from the sadly future-less OpenSolaris to another platform. I opted to turn it into a VMware ESXi 4.1 install to run alongside my existing HP ML115 G5 lab server.

The ML110 G5 was fitted with 2 x 1TB SATA drives and a 60GB SSD drive. All three were presented as datastores to ESXi.



For file and block level storage, I opted to use NexentaStor Community Edition. This operating system is derived from the OpenSolaris code base and builds on many Solaris technologies, including ZFS. The enterprise version is pay-for, but the free Community Edition supports datasets up to 18TB, which is easily enough for a home lab environment.



I installed NexentaStor CE on a fairly small volume and created a larger (400GB) VMDK which I then added to the ZFS pool. I assigned 4GB of RAM to the VM, the majority of which will be used as the ARC cache (see below for details).

A (brief) ZFS Primer

In ZFS, physical disks are grouped together in pools. Writes to a pool are striped across all disks in the pool by default, but disks within the pool can be mirrored to each other, or configured in parity RAID comprising one, two or three parity disks (called RAIDZ, RAIDZ2 and RAIDZ3 respectively) to provide additional resilience.
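
At the command line, the different pool layouts look something like this (pool and device names are illustrative):

# zpool create tank mirror c1t1d0 c1t2d0                  # two-way mirror
# zpool create tank raidz c1t1d0 c1t2d0 c1t3d0            # single-parity RAIDZ
# zpool create tank raidz2 c1t1d0 c1t2d0 c1t3d0 c1t4d0    # double-parity RAIDZ2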

ZFS filesystems are created from space in the pool and can have many properties applied including size reservations, quotas, compression and deduplication. Filesystems can be shared over NFS, CIFS, or both concurrently.

In addition to ZFS filesystems, Zpools can also contain Zvols. These are essentially chunks of pool space presented as raw block devices, with no filesystem layered on top. Zvols provide many of the same properties as a ZFS filesystem, including compression and deduplication. Zvols can be shared over iSCSI and formatted by the initiator to hold a server's native filesystem (such as VMFS, NTFS, Ext3, HFS+ etc.).
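
Creating a Zvol is a one-liner; the name and size below are purely illustrative (in NexentaStor the iSCSI sharing itself is typically set up through the web interface):

# zfs create -V 100G tank/lun0        # create a 100GB zvol (name and size illustrative)
# zfs set compression=on tank/lun0    # zvols take the same properties as filesystems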

NexentaStor CE VSA data integrity

With a single 400GB VMDK created and assigned to the VM, I created a new zpool (called Datasets by Nexenta and configured through the web interface - command line mojo not required) and started creating new ZFS filesystems (called Shares; one to hold software installers, another for ISO images, a third for documents etc.).

Obviously a single disk is no good if there is a problem with the underlying drive, so I created a second 400GB VMDK on the other physical disk and presented it to the appliance (all disk rescanning is done without needing a reboot). The second 400GB disk was then added to the zpool as a mirror. The process of copying data from the original disk to the mirror is called resilvering and can take some time.

This mirroring is within the VSA and will not help if the primary disk fails as the VM configuration files and boot VMDK are not mirrored. So why mirror the data?

ZFS stores a checksum for the data it writes and when configured in a mirror or RAID-Z, the filesystem is able to reconstruct the data in the event of disk write errors using the redundant data. See here for more information on the end-to-end checksumming and data integrity.

This means that while the VSA will not survive the primary disk physically dying, any corruptions that occur as a disk starts to die will be caught and corrected. A scheduled housekeeping job called a scrub runs weekly to ensure the checksums and data are correct.
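
Scrubs can also be kicked off and monitored by hand if you're impatient (pool name illustrative):

# zpool scrub datapool       # start a scrub of the pool
# zpool status -v datapool   # check progress and see any errors found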

NexentaStor CE VSA performance tuning

SATA disks are slow and SSD is fast. Unfortunately SSD is much more expensive than SATA. While one option is to put performance critical data on the SSDs and less important VMs on SATA, the alternative is to use flash disk as cache.

ZFS utilises an in-memory cache called the "Adaptive Replacement Cache" (ARC). This is very fast (being in RAM) and speeds up disk reads, but is limited to the physical memory in the machine (approximately 3GB in a 4GB VM). However, ZFS can also make use of two additional devices: the L2ARC (Level 2 Adaptive Replacement Cache) and a separate ZIL (ZFS Intent Log). The L2ARC is designed to speed up reads, while a ZIL on fast storage speeds up synchronous writes. The best practice for a separate ZIL is to use mirrored flash drives on devices separate from the L2ARC, but as I only had one SSD, I opted to create a single L2ARC.

The L2ARC was created as a 20GB VMDK disk on the SSD datastore and added to the VM. The new volume was then added to the zpool as a cache device. While 20GB is not huge in terms of disk, it represents a significant amount of cache memory.



The performance advantages of the cache are not immediately obvious given that it takes time for the cache to populate. However, once data has been read, future reads will be taken from SSD instead of SATA. I've not had the chance to do meaningful benchmarks yet, but plan to do so soon.
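
If you're curious whether the cache is being used, a couple of things are worth watching (pool name illustrative; I'm going from memory on the kstat name, so treat this as a pointer rather than gospel):

# zpool iostat -v datapool 5           # per-device I/O, including the cache device
# kstat -m zfs -n arcstats | grep l2   # L2ARC hit/miss/size counters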

NexentaStor CE VSA snapshots and replication

On top of the data resilience provided by the checksums, ZFS supports copy-on-write snapshots. These can be automatically scheduled on a per-filesystem basis to provide point-in-time copies. This can be configured so that document data is snapshotted daily (or hourly), while more static data, such as the ISO store, is snapshotted weekly or monthly.
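
Snapshots can also be taken and rolled back manually from the command line; a quick illustrative example (filesystem and snapshot names made up):

# zfs snapshot filestore/docs@before-cleanup    # take a snapshot
# zfs list -t snapshot                          # list snapshots
# zfs rollback filestore/docs@before-cleanup    # roll the filesystem back to it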



The final step was to add even more resilience to the configuration. For this, I created a second NexentaStor CE VM on my HP ML115 G5 lab machine. This VM is smaller with only 1GB RAM. I created a 400GB disk but did not bother with mirroring. Using the NexentaStor web interface, I paired the machines and configured some scheduled jobs to replicate specific filesystems from the primary VSA to this secondary VSA (using snapshot copies over SSH). Nexenta refers to this as a "tiering service". This means that in the event the original server dies, the important data will still be available.



Overkill? Perhaps, but part of this work was to see what could be done with ZFS and the result is a very powerful storage setup.

There are a couple of concerns. One surrounds the long-term viability of ZFS given the Oracle takeover. Although NetApp have settled with Oracle, I don't know if the agreement covers other users of ZFS. Secondly, there will be a performance overhead from running NexentaStor CE as a VSA on top of the ESXi storage subsystem. While it might be possible to squeeze a bit more performance out by running NexentaStor CE directly on the bare metal, ESXi allows me to run a few other VMs alongside the VSA. The trade-off is worth it in my mind.

In summary, NexentaStor Community Edition is a very powerful piece of software (and this post only scratches the surface - no mention of its AD integration, iSCSI functionality etc.) that gives some high-end functionality *for free* and is certainly worth considering for your home lab.