Saturday 21 April 2012

EMC VNXe - diving under the hood (Part 5: Networking)

A quick look at the EMC Community Forum for the VNXe will show a lot of questions around the best way to configure networking. This is partly due to the way that networking is handled by the VNXe.

Put simply, it's different from the CLARiiON and Celerra.

As the previous post illustrated, the VNXe operating system is based on Linux with CSX Containers hosting FLARE and DART environments. Understanding how networking is handled in the VNXe requires looking at various parts of the stack.

The physical perspective

Each SP has, by default, four network interfaces:
  • 1 x Network Management
  • 1 x Service Laptop (not discussed here)
  • 2 x LAN ports (for iSCSI/CIFS/NFS traffic)
Additional ports can be added in the form of a SLIC module, but I don't have access to any of these, so I won't discuss them here.

The Linux perspective

Running ifconfig in an SSH session reveals a number of different devices:
  • bond0
  • cmin0
  • eth2
  • eth3
  • eth_int
  • lab (???)
  • lo (loopback)
  • mgmt
  • mgmt:0

The Linux "mgmt" device maps to the physical management NIC port, but does not have an IP address assigned to it. A virtual interface, "mgmt:0", is created on top of "mgmt" and this is assigned the Unisphere management interface IP address. This is almost certainly due to the HA capabilities built into the VNXe. In the event of an SP failure, the virtual interface will be failed over to the peer SP.

End user data is transferred over "eth2" and "eth3". The first thing to note from the output of ifconfig is that the MAC addresses of both interfaces are the same.

Another device, "bond0", is created on top of eth2. If link aggregation is configured, eth3 is also joined to bond0. This provides load balancing of network traffic into the VNXe.

There is also a "cmin0" device which connects to the internal "CLARiiON Messaging Interface" (CMI). The CMI is a fast PCIe connection to the peer SP and is used for failover traffic and cache mirroring. The cmin0 device does not have an IP address. It's possible the CMI communicates using layer 2 only and therefore doesn't require an IP address, but that's only speculation.

Finally, there is an eth_int device that has an IP address in the subnet. This is used to communicate with the peer SP and either uses the CMI or has an internal network connection of some kind.

A quick check of a client machine's ARP cache reveals that the IP addresses of Shared Folder Servers in the VNXe do not have a MAC address that is mapped to any of the Linux devices. So how does IP traffic reach the Shared Folder Servers if the network ports are not listening for those MAC addresses? Running the "dmesg" command shows the kernel log, including boot information. The answer to the question can be found here:

[  659.638845] device eth_int entered promiscuous mode
[  659.797273] device bond0 entered promiscuous mode
[  659.797283] device eth2 entered promiscuous mode
[  659.799994] device eth3 entered promiscuous mode
[  659.805381] device cmin0 entered promiscuous mode

On start up, all the Linux network ports are put into promiscuous mode. This means that the ports listen to all traffic passed to them regardless of the destination MAC address and can therefore pass traffic up the stack to the DART container.
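As a small illustration, the kernel log lines above can be scanned to list which devices were switched into promiscuous mode. This is only a sketch: the sample text is copied from the dmesg output quoted in this post, the function name is my own, and it does not track devices that later leave promiscuous mode.

```python
import re

# Sample lines copied from the dmesg output quoted in this post.
DMESG_SAMPLE = """\
[  659.638845] device eth_int entered promiscuous mode
[  659.797273] device bond0 entered promiscuous mode
[  659.797283] device eth2 entered promiscuous mode
[  659.799994] device eth3 entered promiscuous mode
[  659.805381] device cmin0 entered promiscuous mode
"""

# Matches the kernel message and captures the device name.
PROMISC_RE = re.compile(r"device (\S+) entered promiscuous mode")


def promiscuous_devices(dmesg_text):
    """Return the device names, in log order, that entered promiscuous mode."""
    return PROMISC_RE.findall(dmesg_text)


if __name__ == "__main__":
    for name in promiscuous_devices(DMESG_SAMPLE):
        print(name)
```

On a real system the input would be the full dmesg text rather than the five-line sample.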

The DART perspective

The CSX DART Container sits on top of the Linux operating system and provides its own network devices:
  • DART vnic0 maps to Linux bond0
  • DART vnic1 maps to Linux eth3 (presumably unless eth3 is joined to bond0)
  • DART vnic0-b maps to the Linux cmin0 device
  • DART vnic-int maps to the Linux eth_int device

DART also creates some "Fail Safe Network" devices on top of the vnics:
  • fsn0 maps to vnic0
  • fsn1 maps to vnic1
While ifconfig provides information from within Linux, getting the DART information requires the "svc_networkcheck -i" command. This reveals some useful information.

The DART fsnX devices are virtual devices that map to the underlying DART devices in an active/standby configuration:
  • fsn1 active=vnic1 primary=vnic1 standby=vnic1-b
  • fsn0 active=vnic0 primary=vnic0 standby=vnic0-b
If that wasn't enough layers of indirection, DART creates some additional interfaces on top of the fsnX devices:
  • rep30 - IP on the internal network and maps to vnic-int
  • el30 - IP address of DART instance "server_2" on the vnic-int interface, network
  • if_12 - maps to device fsn0 and contains the user configured IP address of the Shared Folder Server
This appears to suggest that a Shared Folder Server (which is basically equivalent to a Data Mover in the Celerra, albeit now implemented in software and not hardware) has a front end external network interface and a back end internal network interface. I'm not sure what the significance of the names "el30" or "if_12" is, but they seem to be carried over from the Celerra.

I don't know what the rep30 interface is, but guess it could be an address for replication to use if licensed and configured.
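To keep the layers straight, the mappings described so far can be written down as a small lookup table and walked from top to bottom. This is just my reading of the ifconfig and svc_networkcheck output expressed as data, not anything the VNXe exposes; the table and function names are made up for the sketch. Nothing above says what vnic1-b maps to, so it is left without a lower layer.

```python
# Each key sits on top of the listed lower-level devices, as described in this post.
LAYERS = {
    "if_12": ["fsn0"],
    "fsn0": ["vnic0", "vnic0-b"],   # primary first, standby second
    "fsn1": ["vnic1", "vnic1-b"],   # vnic1-b's lower layer is not known
    "rep30": ["vnic-int"],
    "el30": ["vnic-int"],
    "vnic0": ["bond0"],
    "vnic1": ["eth3"],
    "vnic0-b": ["cmin0"],
    "vnic-int": ["eth_int"],
    "bond0": ["eth2"],              # eth3 also joins when link aggregation is configured
}


def paths_to_bottom(device):
    """Return every chain of devices from `device` down to the lowest known layer."""
    lower = LAYERS.get(device)
    if not lower:
        return [[device]]
    return [[device] + rest for child in lower for rest in paths_to_bottom(child)]


if __name__ == "__main__":
    for path in paths_to_bottom("if_12"):
        print(" -> ".join(path))
```

Walking "if_12" gives two chains: the normal one down through vnic0 and bond0 to eth2, and the standby one through vnic0-b to cmin0.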

This might be more clearly explained with a diagram:


So how does failover work?

It would appear that there are different failover technologies used. The Linux-HA software is used within Linux to provide management interface failover to the peer SP.

It's also likely that DART is doing some form of HA clustering as well. On a Celerra, DART redundancy was handled by setting up a (physical) standby Data Mover. Given that DART is running as a CSX Container, does the peer SP actually run two instances of the DART CSX Container, one active for the SP, the second running standby for the peer? I don't know, but it would make some sense if it did, and would also help explain what the 8GB of RAM in each SP is being used for.

Network Configuration

This document has a good overview on how to best configure networking for a VNXe. The following hopefully explains "why" networks should be configured in a particular way.

The "best" approach does depend on whether you are using NFS or iSCSI. The important thing to understand is that they make use of multiple links in different ways:

Stacking switch pair with link aggregation

If both eth2 and eth3 are aggregated into an Etherchannel, then the failure of one link should not cause a problem. Redundancy is handled at the network layer (through the bond0 device) and DART should not even notice that the physical link is down.

With an aggregation, traffic is load balanced based on a MAC or IP hash. With multiple hosts accessing the VNXe, the load should be balanced pretty evenly across both links. However, if you only have a single host accessing the VNXe, chances are you will be limited to the throughput of a single link. Despite this limitation, you will still have the additional redundancy of the second link.
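The single-host limitation is easier to see with a toy version of a MAC hash. The sketch below XORs the source and destination MAC addresses and takes the result modulo the number of links. This is a simplified layer-2 style hash chosen for illustration; the hash actually used depends on the switch and bonding configuration, and all MAC addresses here are invented.

```python
def parse_mac(mac):
    """Convert a colon-separated MAC address string to an integer."""
    return int(mac.replace(":", ""), 16)


def pick_link(src_mac, dst_mac, link_count=2):
    """Pick a link index from a simple XOR hash of the two MAC addresses."""
    return (parse_mac(src_mac) ^ parse_mac(dst_mac)) % link_count


if __name__ == "__main__":
    array_mac = "00:60:16:00:00:01"    # hypothetical VNXe port MAC
    hosts = [
        "00:50:56:00:00:0a",           # hypothetical host MACs
        "00:50:56:00:00:0b",
        "00:50:56:00:00:0c",
        "00:50:56:00:00:0d",
    ]
    for host in hosts:
        print(host, "-> link", pick_link(host, array_mac))
```

A given host/array pair always hashes to the same link, so one host never uses more than one link's worth of bandwidth, while several hosts tend to spread across both.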

Separate switches (no link aggregation)

If eth2 and eth3 are connected to separate non-stacking switches then eth3 would not be joined to bond0, but would connect directly to vnic1 and be used by a different Shared Folder Server or iSCSI Server from the one using eth2.

Therefore, the only connection on the SP is via eth2 and if it fails, DART would detect that vnic0 has failed and the fsn0 device will fail over from vnic0 to vnic0-b. This will then route traffic via the peer SP's physical Ethernet ports via the cmin0 device. Presumably a gratuitous ARP request is sent from the peer SP to notify the upstream network of the new route.

This is why eth2 on SPA must be on the same subnet (and VLAN) as eth2 on SPB. If a failover occurs, then the peer SP must be able to impersonate the failed link.
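The active/standby behaviour of an fsn device can be modelled in a few lines. This is a toy model of what I think is happening, not DART code: the class and method names are invented, and the automatic return to the primary once it recovers is an assumption of the sketch.

```python
class FailSafeNetwork:
    """Toy model of an fsn device with one primary and one standby vnic."""

    def __init__(self, name, primary, standby):
        self.name = name
        self.primary = primary
        self.standby = standby
        self.active = primary
        self._down = set()

    def link_down(self, device):
        """Record a failed device and re-select the active one."""
        self._down.add(device)
        self._select()

    def link_up(self, device):
        """Record a recovered device and re-select the active one."""
        self._down.discard(device)
        self._select()

    def _select(self):
        # Assumption: prefer the primary whenever it is healthy.
        if self.primary not in self._down:
            self.active = self.primary
        elif self.standby not in self._down:
            self.active = self.standby
        else:
            self.active = None


if __name__ == "__main__":
    fsn0 = FailSafeNetwork("fsn0", primary="vnic0", standby="vnic0-b")
    print("active:", fsn0.active)
    fsn0.link_down("vnic0")    # eth2 fails, so vnic0 goes down
    print("active:", fsn0.active)
```

After vnic0 goes down, the active device becomes vnic0-b, which is the path across cmin0 to the peer SP.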

This is a big difference from the CLARiiON which passes LUN ownership to the peer SP, or Celerra which relies on standby data movers to pick up the load if the active fails (although as noted above, it might still do this in software). In contrast, there should be little performance hit if traffic is directed across the CMI to the peer SP (although the peer SP network links may be overloaded as a result).


This pretty much concludes this mini-series into the VNXe!

It goes to show that even the simplest of storage devices have a fair amount of complexity under the hood and despite its limitations in some areas, the VNXe is a very good entry level array. It is impressive how EMC have managed to virtualise the CLARiiON and Celerra stacks and it makes sense that this approach will be used in other products in the future.

Thanks for reading! Any comments and/or corrections welcome.


habibalby said...

Does the VNXe Management IP need to reach the ESX/vCenter in order to be added?

My vCenter/ESX management network is behind a router/firewall and is reachable only by the Administrator from the other vLAN.

The VNXe SAN Storage is on a different vLAN. I don't understand how the communication will happen between the vCenter/ESX and the VNXe.

Could you please elaborate a bit about it?

JR said...

Hi Habibalby

I'm not 100% sure but I don't think you need management access.

In order to create an NFS or iSCSI volume on the VNXe, you first need to create a "Shared Folder Server". The VNXe can have multiple Shared Folder Servers and they each have a unique IP address.

My understanding is that it is the Shared Folder Server IP address and not the Management IP address that is used by vCenter (certainly, when mounted under vSphere, the NFS server IP is that of the Shared Folder Server).

Hope that helps. If you need more information, please let me know.


habibalby said...

Hi JR,
Thanks for your reply.
I have managed to configure it as follows:
SPA iSCSI_ServerA eth2
SPA iSCSI_ServerA eth3
SPB iSCSI_ServerB eth2
SPB iSCSI_ServerB eth3

From ESXi, I have created a vSwitch with 2 pNICs bound to it.
iSCSI-01 used by vmnic3 and unused by vmnic6
iSCSI-02 used by vmnic6 and unused by vmnic3

Bound both iSCSI vmkernel groups to the vmhba37 "iSCSI Initiator" and it became active.

vmkping -I vmk1 success
vmkping -I vmk1 success
vmkping -I vmk2 success
vmkping -I vmk2 success

Tested failover of SPA while I had one Exchange/DC & File Server VM connected to the Datastore and running Exchange JetStress. I noticed a small disturbance in IOPS when I rebooted SPA1, which was handling the Datastore IOPS. After the Datastore failed over to SPB, the connections resumed. I was continually pinging the SPA1 eth ports and running esxtop at the network level, and I saw the vmhba37 adapter drop the IOPS and then resume.

I checked the event viewer of the Exchange/DC server looking for errors but found nothing.

The issue now is that the LUN will not fail back to its original SP, even after manually initiating the failback process in the Management console.