Saturday, 21 April 2012

EMC VNXe - diving under the hood (Part 5: Networking)

A quick look at the EMC Community Forum for the VNXe will show a lot of questions around the best way to configure networking. This is partly due to the way that networking is handled by the VNXe.

Put simply, it's different from the CLARiiON and Celerra.

As the previous post illustrated, the VNXe operating system is based on Linux with CSX Containers hosting FLARE and DART environments. To understand how networking is handled in the VNXe requires looking at various parts of the stack.

The physical perspective

Each SP has, by default, four network interfaces:
  • 1 x Network Management
  • 1 x Service Laptop (not discussed here)
  • 2 x LAN ports (for iSCSI/CIFS/NFS traffic)
Additional ports can be added in the form of a SLIC module, but I don't have access to any of these, so won't discuss them here.

The Linux perspective

Running ifconfig in an SSH session reveals a number of different devices:
  • bond0
  • cmin0
  • eth2
  • eth3
  • eth_int
  • lab (???)
  • lo (loopback)
  • mgmt
  • mgmt:0

The Linux "mgmt" device maps to the physical management NIC port, but does not have an IP address assigned to it. A virtual interface, "mgmt:0" is created on top of "mgmt" and this is assigned the Unisphere management interface IP address. This is almost certainly due to the HA capabilities built into the VNXe. In the event of a SP failure, the virtual interface will be failed over to the peer SP.

End user data is transferred over "eth2" and "eth3". The first thing to note from the output of ifconfig is that the MAC addresses of both interfaces are the same.

Another device, "bond0", is created on top of eth2. If link aggregation is configured, eth3 is also joined to bond0. This provides load balancing of network traffic into the VNXe.

There is also a "cmin0" device which connects to the internal "CLARiiON Messaging Interface" (CMI). The CMI is a fast PCIe connection to the peer SP and is used for failover traffic and cache mirroring. The cmin0 device does not have an IP address. It's possible the CMI communicates using layer 2 only and therefore doesn't require an IP address, but that's only speculation.

Finally, there is an "eth_int" device that has an IP address on an internal subnet. This is used to communicate with the peer SP and either uses the CMI or has an internal network connection of some kind.

A quick check of a client machine's ARP cache reveals that the IP addresses of Shared Folder Servers on the VNXe do not map to a MAC address belonging to any of the Linux devices. So how does IP traffic reach the Shared Folder Servers if the network ports are not listening for those MAC addresses? Running the "dmesg" command shows the kernel log, including boot information, and the answer can be found there:

[  659.638845] device eth_int entered promiscuous mode
[  659.797273] device bond0 entered promiscuous mode
[  659.797283] device eth2 entered promiscuous mode
[  659.799994] device eth3 entered promiscuous mode
[  659.805381] device cmin0 entered promiscuous mode

On start up, all the Linux network ports are put into promiscuous mode. This means that the ports listen to all traffic passed to them regardless of the destination MAC address and can therefore pass traffic up the stack to the DART container.
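On a live system, the same check is just `dmesg | grep promiscuous`; as a quick sketch, here is the equivalent filter in Python applied to the kernel log lines shown above (the sample text is the output quoted above, nothing more):

```python
import re

# Sample kernel log lines, taken from the dmesg output shown above.
dmesg = """\
[  659.638845] device eth_int entered promiscuous mode
[  659.797273] device bond0 entered promiscuous mode
[  659.797283] device eth2 entered promiscuous mode
[  659.799994] device eth3 entered promiscuous mode
[  659.805381] device cmin0 entered promiscuous mode
"""

# Extract the device names that were switched into promiscuous mode.
promisc = re.findall(r"device (\S+) entered promiscuous mode", dmesg)
print(promisc)  # all five data-path devices
```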

The DART perspective

The CSX DART Container sits on top of the Linux operating system and provides its own network devices:
  • DART vnic0 maps to Linux bond0
  • DART vnic1 maps to Linux eth3 (presumably only when eth3 is not joined to bond0)
  • DART vnic0-b maps to the Linux cmin0 device
  • DART vnic-int maps to the Linux eth_int device

DART also creates some "Fail Safe Network" devices on top of the vnics:
  • fsn0 maps to vnic0
  • fsn1 maps to vnic1
While ifconfig provides information from within Linux, getting the DART view requires the "svc_networkcheck -i" command, which reveals some useful information.

The DART fsnX devices are virtual devices that map to the underlying DART devices in an active/standby configuration:
  • fsn1 active=vnic1 primary=vnic1 standby=vnic1-b
  • fsn0 active=vnic0 primary=vnic0 standby=vnic0-b
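The active/standby behaviour of the fsnX devices can be modelled in a few lines. This is only an illustrative sketch of the idea; the class and method names are mine, not DART's:

```python
class FailSafeNetwork:
    """Toy model of a DART fsnX device: traffic follows the active
    member, and a link failure flips active to the other member."""

    def __init__(self, name, primary, standby):
        self.name = name
        self.primary = primary
        self.standby = standby
        self.active = primary  # normally the primary carries traffic

    def link_down(self, device):
        # If the active member loses link, fail over to the other member.
        if device == self.active:
            self.active = self.standby if self.active == self.primary else self.primary

fsn0 = FailSafeNetwork("fsn0", primary="vnic0", standby="vnic0-b")
fsn0.link_down("vnic0")   # eth2 fails -> bond0 -> vnic0 goes down
print(fsn0.active)        # traffic now flows via vnic0-b, i.e. the CMI path
```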
If that wasn't enough layers of indirection, DART creates some additional interfaces on top of the fsnX devices:
  • rep30 - IP on the internal network and maps to vnic-int
  • el30 - IP address of DART instance "server_2" on the vnic-int interface, network
  • if_12 - maps to device fsn0 and contains the user configured IP address of the Shared Folder Server
This appears to suggest that a Shared Folder Server (basically the equivalent of a Celerra Data Mover, albeit now implemented in software rather than hardware) has a front end external network interface and a back end internal network interface. I'm not sure what the significance of "el30" or "if_12" is, but the naming seems to be carried over from the Celerra.

I don't know what the rep30 interface is, but guess it could be an address for replication to use if licensed and configured.

This might be more clearly explained with a diagram:


So how does failover work?

It would appear that different failover technologies are used at different layers. The Linux-HA software is used within Linux to provide management interface failover to the peer SP.

It's also likely that DART is doing some form of HA clustering as well. On a Celerra, DART redundancy was handled by setting up a (physical) standby Data Mover. Given that DART runs as a CSX Container, does each SP actually run two instances of the DART CSX Container, one active for itself and a second on standby for the peer? I don't know, but it would make some sense, and would also help explain what the 8GB of RAM in each SP is being used for.

Network Configuration

This document has a good overview on how to best configure networking for a VNXe. The following hopefully explains "why" networks should be configured in a particular way.

The "best" approach does depend on whether you are using NFS or iSCSI. The important thing to understand is that they make use of multiple links in different ways:

Stacking switch pair with link aggregation

If both eth2 and eth3 are aggregated into an Etherchannel, then the failure of one link should not cause a problem. Redundancy is handled at the network layer (through the bond0 device) and DART should not even notice that the physical link is down.

With an aggregation, traffic is load balanced based on a MAC or IP hash. With multiple hosts accessing the VNXe, the load should be balanced pretty evenly across both links. However, if you only have a single host accessing the VNXe, chances are you will be limited to the throughput of a single link. Despite this limitation, you will still have the additional redundancy of the second link.
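The single-host limitation follows directly from the hash. A simplified sketch of a layer-2 transmit hash, similar in spirit (but not identical) to the Linux bonding driver's default policy, which XORs the MAC addresses and takes the result modulo the number of slave links:

```python
def pick_slave(src_mac: str, dst_mac: str, n_slaves: int = 2) -> int:
    """Simplified layer-2 transmit hash: XOR the last octet of each
    MAC address, modulo the number of slave links in the bond."""
    src = int(src_mac.split(":")[-1], 16)
    dst = int(dst_mac.split(":")[-1], 16)
    return (src ^ dst) % n_slaves

# One client talking to one VNXe address always hashes to the same link,
# so a single host is limited to a single link's throughput...
link = pick_slave("00:1b:21:aa:bb:01", "00:60:16:cc:dd:10")

# ...while traffic from many clients spreads across both links.
clients = [f"00:1b:21:aa:bb:{i:02x}" for i in range(8)]
links = {pick_slave(c, "00:60:16:cc:dd:10") for c in clients}
print(links)  # both links in use: {0, 1}
```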

Separate switches (no link aggregation)

If eth2 and eth3 are connected to separate non-stacking switches then eth3 would not be joined to bond0, but would connect directly to vnic1 and be used by a different Shared Folder Server or iSCSI Server from the one using eth2.

Therefore, the SP's only external connection is via eth2, and if it fails, DART will detect that vnic0 has failed and the fsn0 device will fail over from vnic0 to vnic0-b. Traffic is then routed via the peer SP's physical Ethernet ports over the cmin0 device. Presumably a gratuitous ARP is sent from the peer SP to notify the upstream network of the new route.
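For illustration, a gratuitous ARP is just an ARP request in which the sender and target protocol addresses are both the advertised IP, broadcast so that switches and hosts update their tables. A sketch of the frame layout (this is how gratuitous ARP works in general, not a claim about how DART builds it; the MAC and IP are invented):

```python
import socket
import struct

def gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Build a gratuitous ARP frame: an ARP request whose sender and
    target IP are both the advertised address, sent to broadcast."""
    bcast = b"\xff" * 6
    eth_hdr = bcast + mac + struct.pack("!H", 0x0806)   # EtherType = ARP
    ip_b = socket.inet_aton(ip)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)     # htype, ptype, hlen, plen, op=request
    arp += mac + ip_b                                   # sender hw / proto address
    arp += b"\x00" * 6 + ip_b                           # target hw unknown, target IP = sender IP
    return eth_hdr + arp

frame = gratuitous_arp(b"\x00\x60\x16\xaa\xbb\xcc", "192.168.1.50")
print(len(frame))  # 42 bytes: 14-byte Ethernet header + 28-byte ARP payload
```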

This is why eth2 on SPA must be on the same subnet (and VLAN) as eth2 on SPB. If a failover occurs, the peer SP must be able to impersonate the failed link.

This is a big difference from the CLARiiON, which passes LUN ownership to the peer SP, or the Celerra, which relies on standby data movers to pick up the load if the active one fails (although as noted above, the VNXe might still do this in software). In contrast, there should be little performance hit if traffic is directed across the CMI to the peer SP (although the peer SP's network links may become overloaded as a result).


This pretty much concludes this mini-series into the VNXe!

It goes to show that even the simplest of storage devices has a fair amount of complexity under the hood, and despite its limitations in some areas, the VNXe is a very good entry level array. It is impressive how EMC have managed to virtualise the CLARiiON and Celerra stacks, and it makes sense that this approach will be used in other products in the future.

Thanks for reading! Any comments and/or corrections welcome.

Thursday, 5 April 2012

EMC VNXe - diving under the hood (Part 4: CSX)

After the last post, I was pointed in the direction of the "VNXe Theory of Operations" online training (just do a search for it). This free course provides some interesting details about the VNXe architecture.

With knowledge gained from the course in mind, let's see if we can get a better understanding of what's happening under the hood...

C4LX and CSX

When the VNXe was announced, Chad Sakac at EMC referred to it as "using a completely homegrown EMC innovation called C4LX and CSX to virtualize, encapsulate whole kernels and other multiple high performance storage services into a tight, integrated package."

In the same blog post, Chad also illustrated the operating system stack which showed the C4LX and CSX components are built on a 64bit Linux kernel.

CSX (short for "Common Software eXecution") is designed to provide a common API layer for EMC software that is not tied to the underlying operating system kernel. As a portable execution environment, CSX can run on many platforms in either kernel or user space. So when some functionality is written within the CSX framework (e.g., data compression), it can be easily ported to all CSX supporting platforms, regardless of whether the underlying operating system is DART, FLARE or something else. Steve Todd has some more details about CSX on his blog.

So if you read that CSX instances are similar to Virtual Machines, think of something more like a Java Virtual Machine than a VMware Virtual Machine: it's an API abstraction and runtime environment, not a virtualisation of physical resources such as CPU and memory.

There aren't many details on what C4LX is, but here's my conjecture: There are some functions that CSX needs the underlying operating system to perform that may not be easily possible "out of the box". If that's true, then C4LX is the Linux kernel along with a bunch of kernel modules and additional software that provide this functionality. Or another way to describe it might be to call it EMC's own internal Linux distribution...

Data Path

Like the data plane and control plane in a network switch, software in the VNXe appears to be designed to operate on the "Data Path" or the "Control Path".

CSX creates various "Containers" that are populated with "CSX Modules". A Container is either a user space application or a kernel module. CSX Containers implement functionality within the Data Path.

The FLARE functionality described in part 2 of this series is implemented as a CSX Module, as is the DART functionality described in part 3. Both these modules run in the Linux user space. There is a degree of isolation between Containers, in that they can be terminated and restarted without interfering with other Containers. However, some Containers have dependencies on others (DART, for example, depends on FLARE).

In addition to the FLARE and DART Containers, a Global Memory Services (GMS) Container provides memory management functionality and services other Containers. As an example, the FLARE Container takes 500MB of memory, while the DART Container takes 2.5GB, all allocated by the GMS.
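Conceptually, GMS behaves like a memory broker handing out fixed reservations from the SP's physical RAM. A toy model using the figures above (8GB per SP, 500MB for FLARE, 2.5GB for DART); the class itself is my own illustration, not EMC code:

```python
class GlobalMemoryServices:
    """Toy model of a memory broker: containers reserve fixed
    allocations from a single physical pool and cannot overcommit."""

    def __init__(self, total_mb: int):
        self.free_mb = total_mb
        self.allocations = {}

    def allocate(self, container: str, mb: int) -> None:
        if mb > self.free_mb:
            raise MemoryError(f"cannot give {container} {mb}MB")
        self.free_mb -= mb
        self.allocations[container] = mb

gms = GlobalMemoryServices(total_mb=8192)   # 8GB of RAM per SP
gms.allocate("FLARE", 500)                  # figure quoted above
gms.allocate("DART", 2560)                  # 2.5GB, quoted above
print(gms.free_mb)  # 5132 - what's left for Linux and the Control Path
```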

A kernel space Container is responsible for allocating resources on behalf of user space Containers. The Linux Upstart software provides a means to start, stop and restart Containers.

Control Path

The Control Path is implemented using technology derived from the Celerra Control Station (itself a Linux-based server) and the CLARiiON NaviSphere software. The Control Path is also where the Common Security Toolkit (CST) is found; the CST appears to be RSA technology and is used in multiple EMC products for security-related functions. In contrast to the Data Path, which consists of functionality directly related to transferring data, the Control Path is concerned with management functionality.

Within the Control Path of the VNXe is the EMC CIM (Common Information Model) Object Manager (ECOM) management server. ECOM interfaces with "Providers", which are essentially plug-ins. Within the VNXe, ECOM runs on the master SP.

There are a number of different Providers. These include Application Providers for Exchange, iSCSI, Shared Folders and VMware software provision, a Virtual Server Provider, Pools Provider, CLARiiON Provider and Celerra Provider. There are also providers for Registration, Scripting, Scheduling, Replication etc. As plug-ins to the ECOM server, additional services can be written to extend the functionality within the VNXe.

With the use of Providers, ECOM implements a middleware subsystem that can be called by front end applications such as Unisphere or the VNXe command line.

In addition to running Providers, ECOM also provides basic web server functionality used by the Unisphere GUI and CLI via the Apache web server.

Pulling it together

The VNXe uses some additional Linux software along with the custom CSX and ECOM components. High availability is implemented through the open source Pacemaker cluster resource manager, using the Softdog software watchdog kernel driver. CSX components are resource managed using the cgroups feature of the Linux kernel. The logging system uses the Postgres database. Although this is covered in the EMC training, it's also possible to see this by checking the output of "ps" from an SSH session.

To understand how the various components hang together, the boot sequence looks a bit like this:

  1. Linux boots and initiates run level 3
  2. The "C4" stack is loaded by the Linux Upstart software:
    • CSX infra
    • Log daemon
    • GMS Container
    • FLARE Container
    • admin
  3. Pacemaker is loaded and automatically starts:
    • Logging
    • DART Container
    • Control Path software (ECOM on the master SP based on mgmt network status)
At which point, all the components are initialised and ready to go. Obviously there are additional details that I'm not covering (check the training if you want some more information on functionality such as replication, vault-to-flash, more on HA, licensing, logging and deduplication).
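The ordering above is essentially a dependency graph (GMS before FLARE, FLARE before DART, and so on), so the start-up sequence can be sketched as a topological sort. The dependency map below is my reading of the sequence, not an EMC artefact:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Start-up dependencies as I interpret the boot sequence above.
deps = {
    "CSX infra": set(),
    "Log daemon": {"CSX infra"},
    "GMS Container": {"CSX infra"},
    "FLARE Container": {"GMS Container"},
    "admin": {"FLARE Container"},
    "DART Container": {"FLARE Container"},   # DART depends on FLARE
    "ECOM": {"DART Container"},              # Control Path comes up last
}

# static_order() yields a start order that respects every dependency.
order = list(TopologicalSorter(deps).static_order())
print(order)
```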

Hopefully this gives some insight into the complexity that underpins the VNXe. We're going to look at one more topic to conclude this mini series, and it's a subject that is the source of many questions on the EMC VNXe Community forum. In the next post we'll have a look at VNXe networking...

Wednesday, 4 April 2012

EMC VNXe - diving under the hood (Part 3: DART)

In the previous post, we looked at the parts of the VNXe that are derived from the FLARE (CLARiiON) code. The result is a number of LUNs that are presented up the stack to the DART (Celerra) part of the system.

Using the "svc_storagecheck -l" command, we can see that a total of 20 disks are found. These map to the two FLARE LUNs from the 300GB SAS RAID5 RAID Group and the sixteen FLARE LUNs from the 2TB NL-SAS RAID6 RAID Group, plus two other disks: root_disk and root_ldisk.

root_disk and root_ldisk appear to map to the internal SSD on the Service Processors and are not visible to the end user for configuration. These disks appear to have root filesystems, panic reservation and UFS log filesystems.

The FLARE LUNs are seen as disks to DART and are commonly referred to as "dvols".

The dvols are grouped into Storage Pools. The following are defined by the system, along with a subset of their parameters:

Name               Description             In use   Members           Volume Profile
clarsas_archive    CLARiiON RAID5 on SAS   False    -                 -
clarsas_r6         CLARiiON RAID6 on SAS   False    -                 -
clar_r1_3d_sas     3 disk RAID-1           False    -                 -
clar_r3_3P1_SAS    RAID-3 (3+1)            False    -                 -
performance_dart0  performance             True     d18,d19           N/A
capacity_dart1     capacity                True     d23,d24,d25,d26   -

As the above table shows, the LUNs presented from the FLARE side of the VNXe are assigned to the performance_dart0 and capacity_dart1 pools.

The Volume Profile should be familiar to Celerra administrators and is the set of rules that define how a set of disks should be configured.

On a Celerra, disks could be configured manually (if you know exactly what you want) or automatically using the "Automatic Volume Manager" (AVM). Because the VNXe is designed to be simple, AVM does all the work.

An AVM group called "root_avm_vol_group_63" (the svc_neo_map command refers to this as the "Internal FS name") has been created, consisting of two dvols, d18 and d19, that correspond to the performance_dart0 storage pool. These two dvols map to the two LUNs presented from the 300GB SAS disk RAID Group. It appears that when a filesystem is created, the first disk is partitioned into a number of slices (sixteen on d18). Each slice then has a volume created on it and finally, another volume is created that spans all the other volumes. It's this top level volume, called v139 in the diagram below, on which the filesystem is created:

Note that d19 in the above diagram isn't used. If the filesystem is expanded beyond the capacity of the single disk, then presumably the next disk is used. For some reason, slice 68 doesn't have a corresponding volume. I would welcome any explanation as to why this is.

The configuration for the capacity_dart1 pool is very similar, albeit with many more disks (sixteen instead of two) and many more slices. Unfortunately it's too big to show here. As an example, the first disk, d23, has 40 slices of its own that form part of the pool.

The use of all these smaller slices presumably means that a filesystem can grow incrementally from the pool (and possibly shrink?).
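The incremental-growth idea can be sketched as follows: a filesystem is a concatenation of fixed-size slices, and growing it just appends more slices from the pool. This is purely illustrative; the slice size, dvol sizes and naming here are invented, not taken from the VNXe:

```python
class SlicePool:
    """Toy model of AVM-style slicing: dvols are cut into fixed-size
    slices, and a filesystem grows by concatenating more slices."""

    def __init__(self, dvols, slice_gb=2):
        self.slice_gb = slice_gb
        # Cut every dvol (name, size_gb) into slices up front.
        self.free = [f"{name}_s{i}" for name, gb in dvols
                     for i in range(gb // slice_gb)]

    def grow(self, fs_slices, add_gb):
        # Append enough free slices to cover the requested growth.
        needed = -(-add_gb // self.slice_gb)   # ceiling division
        if needed > len(self.free):
            raise RuntimeError("pool exhausted")
        fs_slices.extend(self.free.pop(0) for _ in range(needed))
        return fs_slices

pool = SlicePool([("d18", 16), ("d19", 16)])   # invented sizes
fs = pool.grow([], add_gb=10)                  # initial create: 5 slices
fs = pool.grow(fs, add_gb=6)                   # later expansion: 3 more
print(len(fs), fs[:3])
```

Shrinking would, in principle, just be returning slices to the free list, which may be why the author wonders whether filesystems can shrink.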

When the filesystem is created, it isn't visible to an external host. On a Celerra or VNX, this functionality would be handled by a physical data mover. The VNXe uses a software "Shared Folder Server" (SFS) which acts as the server to the other hosts on the network.

Multiple Shared Folder Servers can be created (apparently up to 12 Shared Folder Servers (file) and/or iSCSI Servers (block) are supported), each with its own network settings and sharing its own filesystems out over NFS or CIFS. Note that while an SFS can handle both NFS and CIFS, a single filesystem within an SFS can support either NFS or CIFS, but not both at the same time.

From a disk perspective, EMC have done well to hide a lot of legacy cruft away from the user. The encapsulation of FLARE and DART, along with the software implementation of the data mover concept, is a neat evolution of an aging architecture.

There is more to look into, such as networking (which has provoked a significant number of questions on the EMC forums), and I'd like to find out more about the CSX "execution environment" that underpins much of the new design. I'll be sure to post more if/when I get more information, but hopefully you've found this a useful dive under the hood of the VNXe.