Wednesday, 5 January 2011

NexentaStor Community Edition - first impressions

During the Christmas break I took the opportunity to upgrade my HP ML110 G5 from the sadly future-less OpenSolaris to another platform. I opted to turn it into a VMware ESXi 4.1 install to run alongside my existing HP ML115 G5 lab server.

The ML110 G5 was fitted with 2 x 1TB SATA drives and a 60GB SSD drive. All three were presented as datastores to ESXi.



For file and block level storage, I opted to use NexentaStor Community Edition. This operating system is derived from the OpenSolaris code base and builds on many Solaris technologies, including ZFS. The enterprise version is pay-for, but the free Community Edition supports datasets up to 18TB, which is easily enough for a home lab environment.



I installed NexentaStor CE on a fairly small volume and created a larger (400GB) VMDK which I then added to the ZFS pool. I assigned 4GB of RAM to the VM, the majority of which will be used as the ARC cache (see below for details).

A (brief) ZFS Primer

In ZFS, physical disks are grouped together in pools. Writes to a pool are striped across all disks in the pool by default, but disks within the pool can be mirrored to each other, or configured in parity RAID comprising of one, two or three parity disks (called RAIDZ, RAIDZ2 and RAIDZ3 respectively) to provide additional resilience.

ZFS filesystems are created from space in the pool and can have many properties applied including size reservations, quotas, compression and deduplication. Filesystems can be shared over NFS, CIFS, or both concurrently.

In addition to ZFS filesystems, Zpools can also contain Zvols. These are basically ZFS filesystems without the filesystem formatted. Zvols provide many of the same properties as a ZFS filesystem including compression and deduplication. Zvols can be shared over iSCSI and formatted by the initiator to hold a server's native filesystem (such as VMFS, NTFS, Ext3, HFS+ etc.).

NexentaStor CE VSA data integrity

With a single 400GB VMDK created and assigned to the VM, I create a new zpool (called Datasets by Nexenta and configured through the web interface - command line mojo not required) and started creating new ZFS filesystems (called Shares, one to hold software installers, another for ISO images, a third for documents etc.).

Obviously a single disk is no good if there is a problem with the underlying drive, so I created a second 400GB VMDK on the other physical disk and presented it to the appliance (all disk rescanning is done without a reboot necessary). The second 400GB was then added to the zpool as a mirror. The process of copying data from the original disk to the mirror is called resilvering and can take some time.

This mirroring is within the VSA and will not help if the primary disk fails as the VM configuration files and boot VMDK are not mirrored. So why mirror the data?

ZFS stores a checksum for the data it writes and when configured in a mirror or RAID-Z, the filesystem is able to reconstruct the data in the event of disk write errors using the redundant data. See here for more information on the end-to-end checksumming and data integrity.

This means that while the VSA will not survive the primary disk physically dying, any corruptions that occur as a disk starts to die will be caught and corrected. A scheduled housekeeping job called a scrub runs weekly to ensure the checksums and data are correct.

NexentaStor CE VSA performance tuning

SATA disks are slow and SSD is fast. Unfortunately SSD is much more expensive than SATA. While one option is to put performance critical data on the SSDs and less important VMs on SATA, the alternative is to use flash disk as cache.

ZFS utilises an in-memory cache called the "Adaptive Replacement Cache" (ARC). This is very fast (being in RAM) and speeds up disk reads, but is limited to the physical memory in the machine (approximately 3GB in a 4GB VM). However, ZFS can support two additional caches: The L2ARC (Level 2 Adaptive Replacement Cache) and ZIL (ZFS Intent Log). The L2ARC is designed to speed up reads, while the ZIL speeds up metadata writes. The best practice for creating a ZIL is to use mirrored flash drives on devices separate from the L2ARC, but as I only had one SSD, I opted to create a single L2ARC.

The L2ARC was created as a 20GB VMDK disk on the SSD datastore and added to the VM. The new volume was then added to the zpool as a cache device. While 20GB is not huge in terms of disk, it represents a significant amount of cache memory.



The performance advantages of the cache are not immediately obvious given that it takes time for the cache to populate. However, once data has been read, future reads will be taken from SSD instead of SATA. I've not had the chance to do meaningful benchmarks yet, but plan to do so soon.

NexentaStor CE VSA snapshots and replication

On top of the data resilience provided by the checksum, ZFS supports copy-on-write snapshots. These can be automatically scheduled on a per-filesystem basis to provide a point in time snapshot. This can be configured so document data is snapshotted daily (or hourly), while more static data such as the ISO store taken weekly or monthly.



The final step was to add even more resilience to the configuration. For this, I created a second NexentaStor CE VM on my HP ML115 G5 lab machine. This VM is smaller with only 1GB RAM. I created a 400GB disk but did not bother with mirroring. Using the NexentaStor web interface, I paired the machines and configured some scheduled jobs to replicate specific filesystems from the primary VSA to this secondary VSA (using snapshot copies over SSH). Nexenta refers to this as a "tiering service". This means that in the event the original server dies, the important data will still be available.



Overkill? Perhaps, but part of this work was to see what could be done with ZFS and the result is a very powerful storage setup.

There are a couple of concerns. One surrounds the longterm viability of ZFS given the Oracle takeover. Although NetApp have settled with Oracle, I don't know if the agreement covers other users of ZFS. Secondly, there will be a performance overhead by running the NexentaStor CE as a VSA on top of the ESXi storage subsystem. While it might be possible to squeeze a bit more performance by running NexentaStor CE directly on the bare-metal, ESXi allows me to run a few other VMs alongside the VSA. The trade-off is worth it in my mind.

In summary, NexentaStor Community Edition is a very powerful piece of software (and this post only scratches the surface - no mention of its AD integration, iSCSI functionality etc.) that gives some high-end functionality *for free* and is certainly worth considering for your home lab.

4 comments:

Bill said...

Thanks for posting this. I did the exact same thing over break, but with Solaris 11 Express. I'm a little concerned over the licensing issues, so I've thought about using Nexenta Community. Your post made me decide to give it a shot.

One thing you may want to look into is using your drives in a bare-metal fashion (Raw Device Mapping) with ESXi. That way if the OS drive blows up, you can do a simple Nexenta/Solaris install to retrieve the data on the disks, you don't need to worry about the additional VMFS layer. I have to believe it will also help performance as well.

Here's some sites explaining various ways to do it:
http://www.vm-help.com/esx40i/SATA_RDMs.php


http://blog.davidwarburton.net/2010/10/25/rdm-mapping-of-local-sata-storage-for-esxi/

Thanks again!

JR said...

Hi Bill

Thanks for the comment. Using RDMs was something I considered, but it would mean dedicating the entire disk to the NexentaStor VM. I wanted to run a few other VMs alongside NexentaStor, so RDMs wouldn't have worked for me.

JR

Christian said...

Hi,

just came across, many thanks for the information! I have a HP Microserver and as Esxi works quite well I also had a look at nexenta for working as a file server.

I am puzzled as I am unsure if it is more secure to attach the SATAs directly and not as virtual volumes. What if the VMware filesystem gets corrupt? How likely is that?

JR said...

Hi Christian

If you have enough drives to dedicate to the Nexenta fileserver, then you could attach them as RDMs. This means the disks are accessed directly and do not exist as VMDK files on the VMFS filesystem.

This would be more secure than my approach, but means you cannot use those drives to store other virtual machines (unless you exported the Nexenta filesystems as NFS/iSCSI and put VMs on that).

Thanks for your comment.

JR