Thursday 13 January 2011

Understanding NFS and ZFS interactions

This post was originally going to document the benchmarking of my NexentaStor VSA. Although most of this work has been done (and will be posted soon - promise!), the results were somewhat confusing and required me to dive into the guts of how Solaris (on which NexentaStor is based) handles filesystem operations and NFS. This might be useful if you are trying to debug some performance issues:

ZFS is designed so that all writes are transactional. A write is either successful and the data is written to disk, or it fails and no data gets written. This means that the data on-disk is always consistent.

From an application's perspective, there are two types of write operations to a POSIX compliant filesystem: asynchronous and synchronous.

An asynchronous write passes the data to the filesystem and then continues, effectively assuming that the data is safe. In comparison, a synchronous write passes the data to the filesystem and then waits until the filesystem acknowledges that the data is safe.

Asynchronous Writes

Synchronous Writes

The difference can be seen in what happens if the filesystem or server has a problem (e.g., a power failure). With an asynchronous write, the data may only be buffered in RAM and is therefore lost, although the application believes it is safe. With a synchronous write, the data will definitely be safe, because the filesystem only reports back to the application once the data has actually been written.
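To make the distinction concrete, here is a minimal Python sketch of the two styles as seen by an application. The function names are mine and purely illustrative; the only real difference between them is the os.fsync() call, which blocks until the operating system reports that the data has reached stable storage.

```python
import os
import tempfile


def _write_all(fd: int, data: bytes) -> None:
    """Keep calling os.write until every byte has been handed to the OS."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_async(path: str, data: bytes) -> None:
    """Hand the data to the OS and return; it may still be sitting in RAM."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)


def write_sync(path: str, data: bytes) -> None:
    """Hand the data to the OS, then wait until it confirms the data is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # blocks until the filesystem acknowledges the data is safe
    finally:
        os.close(fd)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    write_async(os.path.join(workdir, "async.dat"), b"may be lost on power failure")
    write_sync(os.path.join(workdir, "sync.dat"), b"acknowledged as safe")
```

Both calls leave the same bytes in the file if nothing goes wrong; the difference only shows up if the machine fails shortly after the function returns.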

So while the ZFS on-disk data is consistent, it's possible, using asynchronous writes, to lose data in transit. For synchronous writes, ZFS uses a feature called the ZFS Intent Log (ZIL).

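When a synchronous write request is made to the filesystem, the data is first written to the ZIL. This ensures the data is safe on disk and the application is free to continue. ZFS then writes the data to its final place in the filesystem with the next transaction group commit, which happens at a specified interval (roughly every 5 seconds). The ZIL itself is only read back if the system crashes before that commit completes.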

By default, the ZIL is allocated disk blocks from the pool. This leads to a situation where data is written to disk twice: once to the ZIL and again to the actual filesystem. This double writing can slow things down.
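The following toy model in Python sketches that behaviour. It is not how ZFS is implemented: the class, its fields and the byte counter are all invented for illustration, and it assumes the periodic commit is fed from the copy held in RAM while the log is only replayed after a crash.

```python
class ToyPool:
    """A deliberately simplified stand-in for a pool with an intent log."""

    def __init__(self) -> None:
        self.intent_log = []          # stands in for the on-disk ZIL
        self.pending = {}             # writes held in RAM until the next commit
        self.committed = {}           # stands in for the filesystem on disk
        self.disk_bytes_written = 0   # total bytes that hit "disk"

    def write(self, key: str, data: bytes, sync: bool) -> None:
        """Accept a write; returning from this method is the acknowledgement."""
        self.pending[key] = data
        if sync:
            # Synchronous writes are logged to disk before being acknowledged.
            self.intent_log.append((key, data))
            self.disk_bytes_written += len(data)

    def commit(self) -> None:
        """Periodic commit: pending data reaches the filesystem, the log is freed."""
        for key, data in self.pending.items():
            self.committed[key] = data
            self.disk_bytes_written += len(data)
        self.pending.clear()
        self.intent_log.clear()

    def crash_and_recover(self) -> None:
        """Lose everything in RAM, then replay whatever was logged."""
        self.pending.clear()
        for key, data in self.intent_log:
            self.committed[key] = data
            self.disk_bytes_written += len(data)
        self.intent_log.clear()


if __name__ == "__main__":
    pool = ToyPool()
    pool.write("a", b"1234", sync=True)
    pool.write("b", b"5678", sync=False)
    pool.crash_and_recover()
    print(pool.committed)            # only the synchronous write survives

    pool = ToyPool()
    pool.write("c", b"1234", sync=True)
    pool.commit()
    print(pool.disk_bytes_written)   # 4 bytes of data cost 8 bytes of disk writes
```

In this model the asynchronous write is lost in the crash, while the synchronous write pays for its safety by being written twice.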

One option to speed things up is to use a Separate Log (SLOG in ZFS terminology) on which to locate the ZIL. This is typically a flash/SSD drive. Synchronous writes can be logged quickly to the SLOG and then written to slower disk later, speeding up the response to the application. When using a SLOG, the recommendation is to mirror the log devices so that logged data survives the failure of a single device before it has been committed to the pool.
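As a rough illustration, a mirrored log device is attached with a command of the form "zpool add <pool> log mirror <dev1> <dev2>". The small Python helper below only builds that argument list and does not run anything. The helper name, the pool name "tank" and the device names are hypothetical placeholders, so check the zpool documentation for your platform before running the resulting command.

```python
from typing import List, Sequence


def slog_add_command(pool: str, devices: Sequence[str], mirrored: bool = True) -> List[str]:
    """Build (but do not run) a zpool command that adds log devices to a pool."""
    if not devices:
        raise ValueError("at least one log device is required")
    if mirrored and len(devices) < 2:
        raise ValueError("a mirrored log needs at least two devices")
    argv = ["zpool", "add", pool, "log"]
    if mirrored:
        argv.append("mirror")
    argv.extend(devices)
    return argv


if __name__ == "__main__":
    # "tank", "c4t0d0" and "c5t0d0" are made-up names for illustration only.
    print(" ".join(slog_add_command("tank", ["c4t0d0", "c5t0d0"])))
```

Building the command as a list keeps it easy to inspect first and, if you are happy with it, to pass to subprocess.run() afterwards.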

Synchronous Writes with separate ZIL

How does this impact on NFS?

It is common for an NFS client to write synchronously to an NFS server in order to get an acknowledgement that the data is safely committed to disk. This means that an NFS server with a ZFS filesystem will be performing ZIL writes, with all the associated overheads. It also means the NFS client will be blocked until it receives an acknowledgement back from the server. This can result in "poor performance" when running ZFS and NFS together.

NFS with synchronous writes

There are a couple of options that can speed things up:

The NFS client may opt to mount the filesystem asynchronously. This means that the data will sit in the server's RAM buffers until a transaction group commit and is therefore vulnerable to power failure. NFS clients mounting asynchronously may therefore lose data.

Another option is to disable the ZIL on the NFS server. This effectively makes all synchronous writes asynchronous from the ZFS perspective. Again, data in transit may be lost. Recent versions of ZFS expose this as the sync property, which can be set to sync=standard (the default: synchronous writes are written to the ZIL), sync=always (paranoid: everything is written synchronously) or sync=disabled (all writes are treated as asynchronous).
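On versions of ZFS that have the sync property, it is read and changed with "zfs get" and "zfs set". The Python sketch below only builds the command lines and checks that the value is one of the three listed above. The helper names and the dataset name "tank/nfs" are hypothetical.

```python
from typing import List

# The three values of the sync property described above.
SYNC_VALUES = ("standard", "always", "disabled")


def zfs_get_sync_command(dataset: str) -> List[str]:
    """Build (but do not run) the command that shows a dataset's sync setting."""
    return ["zfs", "get", "sync", dataset]


def zfs_set_sync_command(dataset: str, value: str) -> List[str]:
    """Build (but do not run) the command that changes a dataset's sync setting."""
    if value not in SYNC_VALUES:
        raise ValueError(f"sync must be one of {SYNC_VALUES}, got {value!r}")
    return ["zfs", "set", f"sync={value}", dataset]


if __name__ == "__main__":
    # "tank/nfs" is a made-up dataset name for illustration only.
    print(" ".join(zfs_get_sync_command("tank/nfs")))
    print(" ".join(zfs_set_sync_command("tank/nfs", "disabled")))
```

Because the property is set per dataset, the risk can be limited to the filesystems where losing a few seconds of writes is acceptable.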

NFS with async or ZIL disabled

Whether NFS async or disabling the ZIL is a risk worth taking depends on the nature of your data. VMware vSphere appears (based on my reading, which I'm assuming to be true) to use synchronous writes when using NFS datastores, which can hurt performance. Applying some of the tuning detailed above may help improve performance in a virtual environment.
