Friday 30 March 2012

EMC VNXe - diving under the hood (Part 2: FLARE)

In part one we looked at how the EMC midrange CLARiiON and Celerra combination (now replaced by the VNX) presented physical disks to a host as a filesystem. The VNXe is a new architecture physically, but dive beneath the hood and bits of FLARE and DART from the CLARiiON and Celerra are still visible.

The best way to get under the hood and see how the VNXe works is to enable SSH through the Unisphere web interface (it's under Settings > Service System). Then open an SSH session to the VNXe as user "service". First piece of information: It's running SUSE Linux Enterprise Server 11 (64bit).

The most interesting command to find out information about the VNXe storage system is "svc_storagecheck". This takes a number of parameters.

If we start at the bottom of the stack with the "svc_storagecheck -b" (for backend information), we get a lot of information about the array. Note that the word "Neo" crops up a lot and was the codename for the VNXe.

Some of the useful details revealed include the supported RAID types:

RAID Type Min Disks Max Disks

It's worth noting that not all of the above are directly accessible by the user (there is no way to manually create a RAID3 group to my knowledge). Also, the limit of 16 disks per RAID Group is also found on the CLARiiON.

The svc_storagecheck command also outputs details on "Flare's Memory Info" which shows two objects (one per SP?), each having a total capacity of 900 (MB?) with 128 (MB?) Read Cache and 768 (MB?) Write Cache. This might be a surprise if you were expecting the entire 8GB of memory to be available. A lot of this 8GB is used by the FLARE and DART subsystems, along with the Linux operating system itself.

There is also information on "2 Internal Disks" which presumably refer to the internal SSD on each SP which is used to store the operating environment.

Eight Ethernet ports are listed, as are two Link Aggregation Ports that I have setup within Unisphere.

Each of the 12 disks in the array are also detailed, including manufacturer (Seagate), capacity, speed, the slot number etc. Each disk is assigned a type of "NEO_DISK_TYPE_SAS" and also contains a reference to the RAID Group that it belongs to.

There are 2 Private RAID Groups, but no LUNs are presented from it and I cannot determine what this is for. I assume it's used by the operating system.

On my VNXe, there are an additional 3 RAID Groups:

Number RAID Type Number of Disks Number of LUNs

The first of these RAID Groups is the RAID5 (4+1) of the 300GB SAS disks, the second isn't really a RAID Group and is the SAS hot spare disk. The final RAID Group comprises the six 2TB NL-SAS disks in a RAID6 (4+2) configuration.

Physical disks in RAID Groups

The 2 LUNs are presented from the SAS disks in a way that looks similar to that done on CLARiiON (except on the CLARiiON, each LUN would typically be assigned to a different SP, whereas on the VNXe, this isn't the case and both LUNs are on the same SP).

I'm not sure why the NL-SAS RAID Group presents 16 LUNs, possibly due to the size of each disk. Each of these LUNs is striped across the RAID Group as follows:

FLARE LUNs carved from RAID Group

The next part of the "svc_storagecheck -b" command details the 19 LUNs that have been defined above.

Each LUN has a name, prefixed with "flare_lun_" which gives a big clue to its history. The default owning and current controller is also defined.

The final part of the "svc_storagecheck -b" command details the Storage Groups used by the array. A Storage Group is an intersection of LUNs and servers. For example. LUN0, LUN1 and LUN2 could be added to a SG with ServerA and ServerB. In this example, both ServerA and ServerB can see LUNs 0, 1 and 2.

There are some in-built Storage Groups (~management, ~physical, ~filestorage and control_lun4) as well as dart0, dart1, dart2 and dart3. The 2 LUNs from the 300GB SAS RAID Group belong to Storage Group "dart0" along with the IDs of the Storage Processors. Similarly, the 16 LUNs from the 2TB NL-SAS RAID Group are mapped to "dart1" along with the two Storage Processors. Storage Groups dart2 and dart3 are unused and presumably for future use.

We can also get some more disk information by using the "svc_neo_map" command, specifying a LUN number:

# svc_neo_map --lun=0

This command can be used to help map the FLARE side to the DART side of the VNXe.

And this pretty much concludes the FLARE side of the VNXe. The resulting LUNs have been carved from the RAID Groups and presented to the Storage Processors in a configuration that the DART side of the VNXe will be able to understand. We'll look in more detail at the DART side in the next post.

Tuesday 27 March 2012

EMC VNXe - diving under the hood (Part 1: Intro)

Recently we took delivery of an EMC VNXe 3100. The purpose of this unit is to provide a low cost alternative to our NetApp FAS2040, primarily for backups but also as a VMware datastore for some smaller workloads. Here is is:

If you want a decent primer on the VNXe, check out Henriwithani's blog.

The VNXe is an interesting unit and this mini series will look at what the VNXe 3100 provides and how it works. In order to understand some of the thinking behind the design, we need some background.

First, some history...

(It's worth stating here at the start that I'm not an expert on CLARiiON or Celerra and the below is my understanding based on reading the documentation. If I've made errors, please correct me!)

Go back a couple of years and EMC had two product lines for the midrange: CLARiiON and Celerra.

CLARiiON was their block storage array capable of supporting Fibre Channel and iSCSI. Each controller in a CLARiiON ran an operating environment called FLARE which actually ran on top of a Windows XP Embedded kernel.

Celerra took a CLARiiON, added one or more “X-Blades” (aka “Datamovers”) and a 1U, Linux-based “Control Station”. The Datamovers were basically NAS head units and added support for NFS and CIFS/SMB. They could also do iSCSI, albeit in a somewhat more abstracted way than running native block on CLARiiON. The operating system running on the Datamovers was a UNIX-lilke derivative called DART.

More information on Celerra’s architecture here.

In January 2011, EMC announced the successor to both CLARiiON and Celerra: The VNX. At the same time, the VNXe was announced as an entry-level sibling to the VNX. The VNX appears to be a fairly straightforward upgrade to CLARiiON and Celerra: Faster, bigger, more features etc.

However, the VNXe sports a new software architecture running on an “execution environment” called CSX. From what I can tell by looking at the VNXe support logs and poking around the command line, CSX runs on a Linux kernel. Relevant parts of the FLARE and DART stacks have been ported to run on CSX and each Storage Processor (SP) (aka controller) in the VNXe runs an instance of the Linux/CSX/FLARE/DART stack.

(More information on VNXe here.)

So what you get in a VNXe is a fusion of bits of FLARE and DART, but mixed up in a new way.

The VNXe configuration

Because it is aimed at the non-storage administrator, a lot of the technical details in the VNXe are hidden. This is a bit frustrating for those of us who like to know how things work, so I’ve tried to dive under the hood a bit.

The base VNXe 3100 is a 2U unit and has 12 drive bays and up to two controllers. Each controller (Storage Processor, or “SP” in EMC language) has a dedicated management Ethernet port, and two ports for data called Eth2 and Eth3. There is an option to install a SLIC module in each controller that adds an additional 4 x 10/100/1000Mbit Ethernet ports (copper). We didn’t buy this expansion, so I can’t comment further on those.

With two controllers, the VNXe 3100 has the ability to support up to 96 drives through the addition of Disk Array Enclosures (DAEs). The DAEs connect to the base unit via 6Gbps SAS.

As an entry level system, the VNXe has some limitations in terms of disk configuration options.  Disks are sold in packs and are designed to be added in one of the following ways:

  • 300GB SAS: 5 pack in RAID5 (4+1)
  • 300GB SAS: 6 pack in RAID10 (3+3)
  • 600GB SAS: 5 pack in RAID5 (4+1)
  • 600GB SAS: 6 pack in RAID10 (3+3)
  • 1TB NL-SAS: 6 pack in RAID6 (4+2)
  • 2TB NL-SAS: 6 pack in RAID6 (4+2)

The version we purchased has two controllers and 12 disks: 6x300GB SAS and 6x2TB NL-SAS.

Once installed in the system, the 6x300GB disks can be either configured as “High Performance” which is RAID10 (3+3), or as “Balanced Performance/Capacity” which is RAID5 (4+1) leaving 1 disk as a hot spare. We opted for the latter.

The 6x2TB NL-SAS can only be configured as RAID6 (4+2). Fine for our purposes (backups), but worth knowing as some use cases may require different configurations.

However, it's not as simple as popping in some disks and off you go (well, it is that simple, but there's a lot going on underneath!).

To understand what the VNXe does with the disks requires another background history lesson because there is a bunch of terminology carried over from both CLARiiON and Celerra.

CLARiiON terminology

The CLARiiON works on block storage in the following way:

Physical disks are added to a CLARiiON and a RAID set is created (such as a RAID5 4+1 configuration). The result is call a "RAID Group". Traditionally, up to 16 disks could belong to a single RAID Group.

From within the RAID Group, FLARE (the CLARiiON operating environment) would create LUNs. The most basic type of LUN is the cunningly named "FLARE LUN". Some FLARE LUNs have special purposes and are used internally (for snapshots, logging etc.). These are called Private LUNs.

The problem with a FLARE LUN is that you are limited to the maximum size of the RAID Group (16 disks worth of capacity minus overhead for parity or mirroring). To overcome this, EMC invented the MetaLUN. This construct combines multiple FLARE LUNs together either through striping or concatenation.

For the sake of completeness, later releases of FLARE introduced the concept of a Storage Pool, out of which Thick LUNs (with reserved space) and Thin LUNs (with no reserved space, also known as Virtual (aka Thin) Provisioning) could be created.

FLARE provides some additional features such as compression (which moves the LUN to a pool, converts it to a Thin LUN and compresses it) and Fully Automated Storage Tiering Virtual Provisioning (FAST VP) which combines multiple disk types (such as Flash, Fibre Channel and SATA) into a single pool and dynamically moves hot data to the fast disks). Pretty clever stuff.

Regardless of the type, each LUN is assigned to a preferred owning Storage Processor (SP), allowing the storage administrator the ability to manually balance LUNs for maximum performance. In the event of a controller failure, the peer controller would take ownership. The LUNs are then mapped to be visible to hosts.

Okay, so that's CLARiiON, what about Celerra?

Celerra terminology

To the CLARiiON, a Celerra is a host to which LUNs are presented. These LUNs must be configured in a specific way to be understood by the Celerra operating system, DART.

The LUNs that CLARiiON present are seen as single disks by the Celerra (regardless of the number of underlying physical disks that comprise the LUN). These disks are sometimes referred to as "dvols". You should assume a 1:1 mapping between CLARiiON LUNs and Celerra dvols.

In addition to the disk (dvol), Celerra offers some additional I/O constructs: A "slice" is a part of a dvol, and a single disk can be partitioned into many slices. Similarly, a "stripe" is an interlace of multiple dvols or multiple slices. The "meta" (not to be confused with a CLARiiON MetaLUN) is a concatenation of multiple slices, stripes, disks or other metas.

Finally, a "volume" is created. This is a block I/O construct into which a filesystem is created. And it's these filesystems that are made visible to hosts. The default filesystem is uxfs and is used by both NFS and CIFS fileservers.

As the above illustrates, the path between physical disk and filesystem is pretty complicated. How the VNXe is derived from this messy lineage will be the subject of part 2 (coming soon).