This blog originally started because I was looking for ways to move much of my online life to cloud services. It's since grown to encompass some of my technical projects, but I still have an interest in what is now referred to as "personal cloud" services. So, as of May 2012, these are the services I use:
Email
I use Google Mail for my main, personal email. For signing up to web sites etc., I also have a Yahoo email address which supports disposable addresses, very useful for creating per-domain, unique addresses that can be removed if I start getting spam. Just for completeness, I also have a Hotmail account which gets limited use and is used primarily for logging into Microsoft services.
Calendar
Google Calendar synchronises with my iPhone/iPad and my Outlook calendar using Google Calendar Sync.
News
Google Reader remains my main application for RSS feed aggregation. I use it with Reeder on the iPhone and iPad for mobile reading.
Notes
I absolutely love Evernote. I use it to store multiple notebooks containing all my research, notes and web clippings (especially useful now that Evernote Clearly has been released), and it acts as a single "dumping ground" in which to throw my thoughts and anything I find interesting. I run the client on Windows, Mac, iPhone and iPad. It's so good I pay for it as a premium user.
Backups
I have a CrashPlan+ subscription and use it to back up my Windows PC, T's Windows PC and my Mac. Data is encrypted before leaving my home network, so it is secure online. There is real peace of mind in knowing that a copy of all our family documents - and photos - is backed up online. It's worth paying for.
Passwords
I've recently started using LastPass to manage all my website passwords as well as provide a place to store other private data (such as computer account credentials). LastPass encrypts data locally, but stores the results in the cloud. There is an app for the iPhone, but you need to be a premium user to get it. I'm only using the free version at the moment, but I like what I've seen with it so far and may subscribe.
File sharing
I've had a Box account for a few years (with 5GB of free space) and it's recently changed its focus to become more of a SharePoint alternative. It's pretty good, but this space is getting crowded with 25GB free with Microsoft SkyDrive, 5GB with Google Drive and 2GB with Dropbox. I wouldn't use any of these services to store my important, personal data (especially since it's unencrypted), but for non-sensitive data, it's good to have options, especially if you need to collaborate with someone. It's too early to determine what service I'll end up focusing on, so watch this space as the products mature.
Living on the Cloud
Technology-related ramblings from the edge of the cloud
Friday, 18 May 2012
Saturday, 21 April 2012
EMC VNXe - diving under the hood (Part 5: Networking)
A quick look at the EMC Community Forum for the VNXe will show a lot of questions around the best way to configure networking. This is partly due to the way that networking is handled by the VNXe.
Put simply, it's different from the CLARiiON and Celerra.
As the previous post illustrated, the VNXe operating system is based on Linux, with CSX Containers hosting the FLARE and DART environments. Understanding how networking is handled in the VNXe requires looking at several layers of the stack.
The physical perspective
Each SP has, by default, four network interfaces:
- 1 x Network Management
- 1 x Service Laptop (not discussed here)
- 2 x LAN ports (for iSCSI/CIFS/NFS traffic)
The Linux perspective
Running ifconfig in an SSH session reveals a number of different devices:
- bond0
- cmin0
- eth2
- eth3
- eth_int
- lab (???)
- lo (loopback)
- mgmt
- mgmt:0
The Linux "mgmt" device maps to the physical management NIC port, but does not have an IP address assigned to it. A virtual interface, "mgmt:0", is created on top of "mgmt" and is assigned the Unisphere management interface IP address. This is almost certainly due to the HA capabilities built into the VNXe: in the event of an SP failure, the virtual interface is failed over to the peer SP.
End user data is transferred over "eth2" and "eth3". The first thing to note from the output of ifconfig is that the MAC addresses of both interfaces are the same.
Another device, "bond0", is created on top of eth2. If link aggregation is configured, eth3 is also joined to bond0. This provides load balancing of network traffic into the VNXe.
There is also a "cmin0" device which connects to the internal "CLARiiON Messaging Interface" (CMI). The CMI is a fast PCIe connection to the peer SP and is used for failover traffic and cache mirroring. The cmin0 device does not have an IP address. It's possible the CMI communicates using layer 2 only and therefore doesn't require an IP address, but that's only speculation.
Finally, there is an eth_int device that has an IP address in the 128.221.255.0 subnet. This is used to communicate with the peer SP and either uses the CMI or has an internal network connection of some kind.
A quick check of a client machine's ARP cache reveals that the IP addresses of Shared Folder Servers in the VNXe do not have a MAC address that maps to any of the Linux devices. So how does IP traffic reach the Shared Folder Servers if the network ports are not listening for those MAC addresses? Running the "dmesg" command shows the kernel log, including boot information. The answer to the question can be found here:
[ 659.638845] device eth_int entered promiscuous mode
[ 659.797273] device bond0 entered promiscuous mode
[ 659.797283] device eth2 entered promiscuous mode
[ 659.799994] device eth3 entered promiscuous mode
[ 659.805381] device cmin0 entered promiscuous mode
On start up, all the Linux network ports are put into promiscuous mode. This means that the ports listen to all traffic passed to them regardless of the destination MAC address and can therefore pass traffic up the stack to the DART container.
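As a rough illustration, the same check can be scripted. The sketch below is not anything that ships on the VNXe; it simply parses the kernel log lines quoted above and lists the devices that reported entering promiscuous mode. The function name and the regular expression are my own.

```python
import re

# Kernel log lines exactly as quoted in the post.
DMESG_SAMPLE = """\
[ 659.638845] device eth_int entered promiscuous mode
[ 659.797273] device bond0 entered promiscuous mode
[ 659.797283] device eth2 entered promiscuous mode
[ 659.799994] device eth3 entered promiscuous mode
[ 659.805381] device cmin0 entered promiscuous mode
"""

# Matches "[ <seconds>] device <name> entered promiscuous mode".
PROMISC_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+device (?P<dev>\S+) entered promiscuous mode$"
)


def promiscuous_devices(log_text):
    """Return (timestamp, device) pairs for each matching kernel log line."""
    found = []
    for line in log_text.splitlines():
        match = PROMISC_RE.match(line.strip())
        if match:
            found.append((float(match.group("ts")), match.group("dev")))
    return found


if __name__ == "__main__":
    for ts, dev in promiscuous_devices(DMESG_SAMPLE):
        print(f"{ts:>12.6f}  {dev}")
```

On a live system the sample text would be replaced with the real output of dmesg, captured however is convenient.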
The DART perspective
The CSX DART Container sits on top of the Linux operating system and provides its own network devices:
- DART vnic0 maps to Linux bond0
- DART vnic1 maps to Linux eth3 (presumably unless eth3 is joined to bond0)
- DART vnic0-b maps to the Linux cmin0 device
- DART vnic-int maps to the Linux eth_int device
DART also creates some "Fail Safe Network" devices on top of the vnics:
- fsn0 maps to vnic0
- fsn1 maps to vnic1
The DART fsnX devices are virtual devices that map to the underlying DART devices in an active/standby configuration:
- fsn1 active=vnic1 primary=vnic1 standby=vnic1-b
- fsn0 active=vnic0 primary=vnic0 standby=vnic0-b
- rep30 - IP on the internal 128.221.255.0 network and maps to vnic-int
- el30 - IP address of DART instance "server_2" on the vnic-int interface, 128.221.255.0 network
- if_12 - maps to device fsn0 and contains the user configured IP address of the Shared Folder Server
I don't know what the rep30 interface is, but I guess it could be an address for replication to use if licensed and configured.
This might be more clearly explained with a diagram:
Failover
So how does failover work?
It would appear that there are different failover technologies used. The Linux-HA software is used within Linux to provide management interface failover to the peer SP.
It's also likely that DART is doing some form of HA clustering as well. On a Celerra, DART redundancy was handled by setting up a (physical) standby Data Mover. Given that DART is running as a CSX Container, does the peer SP actually run two instances of the DART CSX Container, one active for the SP, the second running standby for the peer? I don't know, but it would make some sense if it did, and would also help explain what the 8GB of RAM in each SP is being used for.
Network Configuration
This document has a good overview on how to best configure networking for a VNXe. The following hopefully explains "why" networks should be configured in a particular way.
The "best" approach does depend on whether you are using NFS or iSCSI. The important thing to understand is that they make use of multiple links in different ways:
Stacking switch pair with link aggregation
If both eth2 and eth3 are aggregated into an EtherChannel, then the failure of one link should not cause a problem. Redundancy is handled at the network layer (through the bond0 device) and DART should not even notice that the physical link is down.
With an aggregation, traffic is load balanced based on a MAC or IP hash. With multiple hosts accessing the VNXe, the load should be balanced pretty evenly across both links. However, if you only have a single host accessing the VNXe, chances are you will be limited to the throughput of a single link. Despite this limitation, you will still have the additional redundancy of the second link.
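To see why a single host tends to stick to one link, here is a deliberately simplified sketch of a MAC-based hash. It is an assumption for illustration only: real switches and the Linux bonding driver use their own hash policies, and the MAC addresses below are made up. The point is that the hash is a pure function of the address pair, so the same pair always lands on the same link.

```python
def pick_link(src_mac: str, dst_mac: str, link_count: int = 2) -> int:
    """Simplified layer 2 hash: XOR the last byte of each MAC, modulo link count."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % link_count


# Hypothetical addresses, not taken from a real array or host.
VNXE_MAC = "00:60:16:aa:bb:10"
HOST_MACS = {
    "host-a": "00:50:56:00:00:01",
    "host-b": "00:50:56:00:00:02",
    "host-c": "00:50:56:00:00:03",
    "host-d": "00:50:56:00:00:04",
}

if __name__ == "__main__":
    for name, mac in HOST_MACS.items():
        # Every frame between this host and the array hashes to the same link.
        print(f"{name} -> link {pick_link(mac, VNXE_MAC)}")
```

With several hosts the results spread across both links, while any one host keeps getting the same answer however much traffic it sends. Each end of the aggregate makes this choice for the frames it transmits, so the switch decides for traffic going into the VNXe and the array decides for traffic coming out.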
Separate switches (no link aggregation)
If eth2 and eth3 are connected to separate non-stacking switches, then eth3 would not be joined to bond0. Instead it would connect directly to vnic1 and be used by a different Shared Folder Server or iSCSI Server from the one using eth2.
In that case, a given server's only connection on the SP is via eth2. If that link fails, DART would detect that vnic0 has failed and the fsn0 device would fail over from vnic0 to vnic0-b. Traffic is then routed across the cmin0 device and out through the peer SP's physical Ethernet ports. Presumably a gratuitous ARP is sent from the peer SP to notify the upstream network of the new route.
This is why eth0 on SPA must be on the same subnet (and VLAN) as the corresponding port on SPB. If a failover occurs, the peer SP must be able to impersonate the failed link.
This is a big difference from the CLARiiON which passes LUN ownership to the peer SP, or Celerra which relies on standby data movers to pick up the load if the active fails (although as noted above, it might still do this in software). In contrast, there should be little performance hit if traffic is directed across the CMI to the peer SP (although the peer SP network links may be overloaded as a result).
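The active/standby behaviour of the fsn devices can be modelled in a few lines. This is only a toy model of what I believe is happening, not EMC's implementation: the class and method names are invented, and the vnic to Linux device mapping is the one listed in the DART section.

```python
from dataclasses import dataclass


@dataclass
class FailSafeNetwork:
    """Toy model of a DART fsn device with a primary and a standby vnic."""
    name: str
    primary: str
    standby: str
    active: str = ""

    def __post_init__(self) -> None:
        # A freshly created fsn device starts out on its primary vnic.
        if not self.active:
            self.active = self.primary

    def link_down(self, device: str) -> None:
        """Switch to the other vnic if the currently active one loses its link."""
        if device == self.active:
            self.active = self.standby if device == self.primary else self.primary


# Linux device underneath each DART vnic, as observed in the post.
VNIC_TO_LINUX = {"vnic0": "bond0", "vnic0-b": "cmin0"}

if __name__ == "__main__":
    fsn0 = FailSafeNetwork("fsn0", primary="vnic0", standby="vnic0-b")
    print(fsn0.active, "via", VNIC_TO_LINUX[fsn0.active])
    fsn0.link_down("vnic0")  # eth2 (and therefore bond0/vnic0) fails
    print(fsn0.active, "via", VNIC_TO_LINUX[fsn0.active])
```

After the simulated failure the active path is vnic0-b, which sits on cmin0, so traffic for that server would cross the CMI and leave through the peer SP.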
Conclusion
This pretty much concludes this mini-series into the VNXe!
It goes to show that even the simplest of storage devices have a fair amount of complexity under the hood and despite its limitations in some areas, the VNXe is a very good entry level array. It is impressive how EMC have managed to virtualise the CLARiiON and Celerra stacks and it makes sense that this approach will be used in other products in the future.
Thanks for reading! Any comments and/or corrections welcome.
Thursday, 5 April 2012
EMC VNXe - diving under the hood (Part 4: CSX)
After the last post, I was pointed in the direction of the "VNXe Theory of Operations" online training available from education.emc.com (just do a search for it). This free course provides some interesting details into the VNXe architecture.
With knowledge gained from the course in mind, let's see if we can get a better understanding of what's happening under the hood...
C4LX and CSX
When the VNXe was announced, Chad Sakac at EMC referred to it as "using a completely homegrown EMC innovation called C4LX and CSX to virtualize, encapsulate whole kernels and other multiple high performance storage services into a tight, integrated package."
In the same blog post, Chad also illustrated the operating system stack, which showed that the C4LX and CSX components are built on a 64-bit Linux kernel.
CSX (short for "Common Software eXecution") is designed to provide a common API layer for EMC software that is not tied to the underlying operating system kernel. As a portable execution environment, CSX can run on many platforms in either kernel or user space. So when some functionality is written within the CSX framework (e.g., data compression), it can be easily ported to all CSX supporting platforms, regardless of whether the underlying operating system is DART, FLARE or something else. Steve Todd has some more details about CSX on his blog.
So if you read that CSX instances are similar to Virtual Machines, think of it in terms more like a Java Virtual Machine rather than a VMware Virtual Machine. It's an API abstraction and runtime environment, not a virtualisation of physical resources such as CPU and memory.
There aren't many details on what C4LX is, but here's my conjecture: There are some functions that CSX needs the underlying operating system to perform that may not be easily possible "out of the box". If that's true, then C4LX is the Linux kernel along with a bunch of kernel modules and additional software that provide this functionality. Or another way to describe it might be to call it EMC's own internal Linux distribution...
Data Path
Like the data plane and control plane in a network switch, software in the VNXe appears to be designed to operate on the "Data Path" or the "Control Path".
CSX creates various "Containers" that are populated with "CSX Modules". A Container is either a user space application or a kernel module. CSX Containers implement functionality within the Data Path.
The FLARE functionality described in part 2 of this series is implemented as a CSX Module, as is the DART functionality described in part 3. Both these modules run in the Linux user space. There is a degree of isolation between Containers in that they can be terminated and restarted without interfering with other Containers. However, some Containers (such as DART) have dependencies on other Containers (FLARE).
In addition to the FLARE and DART Containers, a Global Memory Services (GMS) Container provides memory management functionality and services other Containers. As an example, the FLARE Container takes 500MB memory, while the DART Container takes 2.5GB, all allocated by the GMS.
A kernel space Container is responsible for allocating resources on behalf of user space Containers. The Linux Upstart software provides a means to start, stop and restart Containers.
Control Path
The Control Path is implemented using technology derived from the Celerra Control Station (itself a Linux-based server) and the CLARiiON NaviSphere software. The Control Path is also where the Common Security Toolkit (CST) is found. The CST appears to be RSA technology and is used in multiple EMC products for security-related functions. In contrast to the Data Path, which consists of functionality directly relating to the transfer of data, the Control Path is concerned with management functionality.
Within the Control Path of the VNXe is the EMC CIM (Common Interface Module) Object Manager (ECOM) management server. ECOM interfaces with "Providers" which are essentially plug-ins. Within the VNXe, ECOM runs on the master SP.
There are a number of different Providers. These include Application Providers for Exchange, iSCSI, Shared Folders and VMware software provision, a Virtual Server Provider, Pools Provider, CLARiiON Provider and Celerra Provider. There are also providers for Registration, Scripting, Scheduling, Replication etc. As plug-ins to the ECOM server, additional services can be written to extend the functionality within the VNXe.
With the use of Providers, ECOM implements a middleware subsystem that can be called by front end applications such as Unisphere or the VNXe command line.
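The plug-in idea is easy to picture with a small registry. The sketch below is purely illustrative and assumes nothing about ECOM's real interfaces: the class, the method names and the request strings are all hypothetical. It only shows the shape of a middleware layer that front ends call and that dispatches to named providers.

```python
from typing import Callable, Dict


class ObjectManager:
    """Toy stand-in for a management server that dispatches requests to plug-ins."""

    def __init__(self) -> None:
        self._providers: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        """Add a provider; new functionality arrives without changing the server."""
        self._providers[name] = handler

    def call(self, provider: str, request: str) -> str:
        """Route a front-end request (GUI or CLI) to the named provider."""
        if provider not in self._providers:
            raise KeyError(f"no provider registered as {provider!r}")
        return self._providers[provider](request)


if __name__ == "__main__":
    server = ObjectManager()
    # Hypothetical providers, loosely named after the ones listed above.
    server.register("pools", lambda req: f"pools provider handled: {req}")
    server.register("scheduling", lambda req: f"scheduling provider handled: {req}")
    print(server.call("pools", "list"))
```

A GUI and a command line tool could both sit in front of the same call method, which is the middleware role described above.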
In addition to running Providers, ECOM also provides basic web server functionality used by the Unisphere GUI and CLI via the Apache web server.
Pulling it together
The VNXe uses some additional Linux software along with the custom CSX and ECOM components. High availability is implemented through the open source Pacemaker cluster resource manager together with the softdog software watchdog kernel driver. CSX components are resource managed using the cgroups feature of the Linux kernel. The logging system uses the Postgres database. Although this is covered in the EMC training, it's also possible to see this by checking the output of "ps" from an SSH session.
To understand how the various components hang together, the boot sequence looks a bit like this:
- BIOS/POST
- Linux boots and initiates run level 3
- The "C4" stack is loaded by the Linux Upstart software:
  - CSX infra
  - Log daemon
  - GMS Container
  - FLARE Container
  - admin
- Pacemaker is loaded and automatically starts:
  - Logging
  - DART Container
  - Control Path software (ECOM on the master SP based on mgmt network status)
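The sequence above can also be written down as data, which makes the ordering easy to check. This is a sketch under one assumption: that the items following each line ending in a colon are started by that step (Upstart or Pacemaker). The structure and function name are mine.

```python
# Each entry is (stage, components started by that stage), in boot order.
BOOT_STAGES = [
    ("BIOS/POST", []),
    ("Linux (run level 3)", []),
    ("Upstart: C4 stack", [
        "CSX infra",
        "Log daemon",
        "GMS Container",
        "FLARE Container",
        "admin",
    ]),
    ("Pacemaker", [
        "Logging",
        "DART Container",
        "Control Path software",
    ]),
]


def start_order(stages):
    """Flatten the staged boot sequence into a single ordered list."""
    order = []
    for stage, components in stages:
        order.append(stage)
        order.extend(components)
    return order


if __name__ == "__main__":
    order = start_order(BOOT_STAGES)
    # DART depends on FLARE, so FLARE has to come up first.
    assert order.index("FLARE Container") < order.index("DART Container")
    for step, name in enumerate(order, start=1):
        print(f"{step:2d}. {name}")
```

Laid out like this, the dependency mentioned earlier is visible: the FLARE Container is started by Upstart before Pacemaker brings up the DART Container.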
Hopefully this gives some insight into the complexity that underpins the VNXe. We're going to look at one more topic to conclude this mini series, and it's a subject that is the source of many questions on the EMC VNXe Community forum. In the next post we'll have a look at VNXe networking...
Wednesday, 4 April 2012
EMC VNXe - diving under the hood (Part 3: DART)
In the previous post, we looked at the parts of the VNXe that are derived from the FLARE (CLARiiON) code. The result is a number of LUNs that are presented up the stack to the DART (Celerra) part of the system.
Using the "svc_storagecheck -l" command, we can see that a total of 20 disks are found. These map to the two FLARE LUNs from the 300GB SAS RAID5 RAID Group and the sixteen FLARE LUNs from the 2TB NL-SAS RAID6 RAID Group, plus two other disks: root_disk and root_ldisk.
root_disk and root_ldisk appear to map to the internal SSD on the Storage Processors and are not visible to the end user for configuration. These disks appear to hold root filesystems, a panic reservation and UFS log filesystems.
The FLARE LUNs are seen as disks to DART and are commonly referred to as "dvols".
The dvols are grouped into Storage Pools. The following are defined by the system, along with a subset of their parameters:
As the above table shows, the LUNs presented from the FLARE side of the VNXe are assigned to the performance_dart0 and capacity_dart1 pools.
The Volume Profile should be familiar to Celerra administrators and is the set of rules that define how a set of disks should be configured.
Using the "svc_storagecheck -l" command, we can see that a total of 20 disks are found. These map to the two FLARE LUNs from the 300GB SAS RAID5 RAID Group and the sixteen FLARE LUNs from the 2TB NL-SAS RAID6 RAID Group, plus two other disks: root_disk and root_ldisk.
root_disk and root_ldisk appear to map to the internal SSD on each Storage Processor and are not visible to the end user for configuration. These disks appear to hold the root filesystems, panic reservation and UFS log filesystems.
The FLARE LUNs are seen as disks to DART and are commonly referred to as "dvols".
The dvols are grouped into Storage Pools. The following are defined by the system, along with a subset of their parameters:
| Name | Description | In use | Members | Volume Profile |
|---|---|---|---|---|
| clarsas_archive | CLARiiON RAID5 on SAS | False | | clarsas_archive_vp |
| clarsas_r6 | CLARiiON RAID6 on SAS | False | | clarsas_r6_vp |
| clar_r1_3d_sas | 3 disk RAID-1 | False | | clar_r1_3d_sas_vp |
| clar_r3_3P1_SAS | RAID-3 (3+1) | False | | clar_r3_3P1_SAS_vp |
| performance_dart0 | performance | True | d18,d19 | N/A |
| capacity_dart1 | capacity | True | d23,d24,d25,d26,d27,d28,d29,d30,d31,d32,d33,d34,d35,d36,d37,d38 | N/A |
As the above table shows, the LUNs presented from the FLARE side of the VNXe are assigned to the performance_dart0 and capacity_dart1 pools.
The Volume Profile should be familiar to Celerra administrators and is the set of rules that define how a set of disks should be configured.
On a Celerra, disks could be configured manually (if you know exactly what you want) or automatically using the "Automatic Volume Manager" (AVM). Because the VNXe is designed to be simple, AVM does all the work.
An AVM group called "root_avm_vol_group_63" (the svc_neo_map command refers to this as the "Internal FS name") has been created and consists of two dvols, d18 and d19, which correspond to the performance_dart0 storage pool. These two dvols map to the two LUNs presented from the 300GB SAS disk RAID Group. It appears that when a filesystem is created, the first disk is partitioned into a number of slices (sixteen on d18). Each slice then has a volume created on it and finally, another volume is created that spans all the other volumes. It's this top level volume, called v139 in the diagram below, on which a filesystem is created:
Note that d19 in the above diagram isn't used. If the filesystem is expanded beyond the capacity of the single disk, then presumably the next disk is used. For some reason, slice 68 doesn't have a corresponding volume. I would welcome any explanation as to why this is.
The configuration for the capacity_dart1 pool is very similar, albeit with many more disks (sixteen instead of two) and many more slices. Unfortunately it's too big to show here. As an example, the first disk, d23, has 40 slices of its own that form part of the pool.
The use of all these smaller slices presumably means that a filesystem can grow incrementally from the pool (and possibly shrink?).
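To illustrate the idea of growing a filesystem slice by slice, here is a rough Python sketch. It is only my mental model, not how AVM is actually implemented: the class name `SlicePool` is made up, and I have assumed d19 would be cut into sixteen slices like d18 (the VNXe output doesn't confirm this, since d19 is unused).

```python
class SlicePool:
    """Hypothetical model of a pool that hands out slices in dvol order."""

    def __init__(self, slices_per_dvol):
        # slices_per_dvol: dict of dvol name -> slice count (insertion order kept)
        self.free = [
            (dvol, index)
            for dvol, count in slices_per_dvol.items()
            for index in range(count)
        ]

    def allocate(self, count):
        """Take the next `count` free slices, first dvol first."""
        if count > len(self.free):
            raise ValueError("not enough free slices in the pool")
        taken, self.free = self.free[:count], self.free[count:]
        return taken


# d18 has sixteen slices (from the post); sixteen for d19 is an assumption.
pool = SlicePool({"d18": 16, "d19": 16})

filesystem = pool.allocate(16)   # initial filesystem uses only d18
filesystem += pool.allocate(2)   # growing it spills over onto d19

print(sorted({dvol for dvol, _ in filesystem}))  # ['d18', 'd19']
```

In this model, growth just means appending more slices to the top level volume, which would explain why the pool is carved into so many small pieces.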
When the filesystem is created, it isn't visible to an external host. On a Celerra or VNX, this functionality would be handled by a physical data mover. The VNXe uses a software "Shared Folder Server" (SFS) which acts as the server to the other hosts on the network.
Multiple Shared Folder Servers can be created (apparently up to 12 Shared Folder Servers (file) and/or iSCSI Servers (block) are supported), each with its own network settings and sharing its own filesystems out over NFS or CIFS. Note that while an SFS can handle both NFS and CIFS, a single filesystem within an SFS can support either NFS or CIFS, but not both at the same time.
From a disk perspective, EMC have done well to hide a lot of legacy cruft from the user, and the encapsulation of FLARE and DART, along with the software implementation of the data mover idea, is a neat evolution of an aging architecture.
There is more to look into, such as networking (which has provoked a significant number of questions on the EMC forums), and I'd like to find out more about the CSX "execution environment" that underpins much of the new design. I'll be sure to post more if/when I get more information, but hopefully you've found this a useful dive under the hood of the VNXe.
Friday, 30 March 2012
EMC VNXe - diving under the hood (Part 2: FLARE)
In part one we looked at how the EMC midrange CLARiiON and Celerra combination (now replaced by the VNX) presented physical disks to a host as a filesystem. The VNXe is a new architecture physically, but dive beneath the hood and bits of FLARE and DART from the CLARiiON and Celerra are still visible.
The best way to get under the hood and see how the VNXe works is to enable SSH through the Unisphere web interface (it's under Settings > Service System). Then open an SSH session to the VNXe as user "service". First piece of information: It's running SUSE Linux Enterprise Server 11 (64bit).
The most interesting command to find out information about the VNXe storage system is "svc_storagecheck". This takes a number of parameters.
If we start at the bottom of the stack with the "svc_storagecheck -b" (for backend information), we get a lot of information about the array. Note that the word "Neo" crops up a lot and was the codename for the VNXe.
Some of the useful details revealed include the supported RAID types:
| RAID Type | Min Disks | Max Disks |
|---|---|---|
| NEO_RG_TYPE_RAID_5 | 3 | 16 |
| NEO_RG_TYPE_DISK | 1 | 1 |
| NEO_RG_TYPE_RAID_1 | 2 | 16 |
| NEO_RG_TYPE_RAID_0 | 3 | 16 |
| NEO_RG_TYPE_RAID_3 | 5 | 9 |
| NEO_RG_TYPE_HOTSPARE | 1 | 1 |
| NEO_RG_TYPE_RAID_1_0 | 2 | 16 |
| NEO_RG_TYPE_RAID_6 | 4 | 16 |
It's worth noting that not all of the above are directly accessible by the user (to my knowledge, there is no way to manually create a RAID3 group). The limit of 16 disks per RAID Group is also found on the CLARiiON.
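The limits in the table can be captured as a simple lookup. This is just a Python sketch for illustration: the type names and disk counts are copied from the svc_storagecheck output above, but the helper `is_valid_group` is my own invention and not part of any EMC tool.

```python
# Min/max disks per RAID Group type, copied from the table above.
RAID_LIMITS = {
    "NEO_RG_TYPE_RAID_5": (3, 16),
    "NEO_RG_TYPE_DISK": (1, 1),
    "NEO_RG_TYPE_RAID_1": (2, 16),
    "NEO_RG_TYPE_RAID_0": (3, 16),
    "NEO_RG_TYPE_RAID_3": (5, 9),
    "NEO_RG_TYPE_HOTSPARE": (1, 1),
    "NEO_RG_TYPE_RAID_1_0": (2, 16),
    "NEO_RG_TYPE_RAID_6": (4, 16),
}


def is_valid_group(raid_type: str, disks: int) -> bool:
    """Hypothetical helper: is the disk count within the reported limits?"""
    low, high = RAID_LIMITS[raid_type]
    return low <= disks <= high


print(is_valid_group("NEO_RG_TYPE_RAID_5", 5))  # True: a five-disk RAID5 group
print(is_valid_group("NEO_RG_TYPE_RAID_6", 6))  # True: a six-disk RAID6 group
print(is_valid_group("NEO_RG_TYPE_RAID_3", 4))  # False: below the minimum of 5
```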
The svc_storagecheck command also outputs details on "Flare's Memory Info" which shows two objects (one per SP?), each having a total capacity of 900 (MB?) with 128 (MB?) Read Cache and 768 (MB?) Write Cache. This might be a surprise if you were expecting the entire 8GB of memory to be available. A lot of this 8GB is used by the FLARE and DART subsystems, along with the Linux operating system itself.
There is also information on "2 Internal Disks", which presumably refers to the internal SSD on each SP that is used to store the operating environment.
Eight Ethernet ports are listed, as are two Link Aggregation Ports that I have set up within Unisphere.
Each of the 12 disks in the array is also detailed, including manufacturer (Seagate), capacity, speed, slot number etc. Each disk is assigned a type of "NEO_DISK_TYPE_SAS" and also contains a reference to the RAID Group that it belongs to.
There are 2 Private RAID Groups, but no LUNs are presented from them and I cannot determine what they are for. I assume they're used by the operating system.
On my VNXe, there are an additional 3 RAID Groups:
| Number | RAID Type | Number of Disks | Number of LUNs |
|---|---|---|---|
| 0 | NEO_RG_TYPE_RAID_5 | 5 | 2 |
| 1 | NEO_RG_TYPE_HOTSPARE | 1 | 1 |
| 2 | NEO_RG_TYPE_RAID_6 | 6 | 16 |
The first of these RAID Groups is the RAID5 (4+1) of the 300GB SAS disks. The second isn't really a RAID Group at all; it's the SAS hot spare disk. The final RAID Group comprises the six 2TB NL-SAS disks in a RAID6 (4+2) configuration.
The 2 LUNs are presented from the SAS disks in a way that looks similar to that done on CLARiiON (except on the CLARiiON, each LUN would typically be assigned to a different SP, whereas on the VNXe, this isn't the case and both LUNs are on the same SP).
I'm not sure why the NL-SAS RAID Group presents 16 LUNs, possibly due to the size of each disk. Each of these LUNs is striped across the RAID Group as follows:
The next part of the "svc_storagecheck -b" command details the 19 LUNs that have been defined above.
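As a quick sanity check, the figures in the RAID Group table add up to the 12 physical disks and the 19 LUNs. A few lines of Python, using only the numbers from the table above:

```python
# (number, RAID type, disks, LUNs) rows from the RAID Group table above.
raid_groups = [
    (0, "NEO_RG_TYPE_RAID_5", 5, 2),
    (1, "NEO_RG_TYPE_HOTSPARE", 1, 1),
    (2, "NEO_RG_TYPE_RAID_6", 6, 16),
]

total_disks = sum(disks for _, _, disks, _ in raid_groups)
total_luns = sum(luns for _, _, _, luns in raid_groups)

print(total_disks, total_luns)  # 12 19
```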
Each LUN has a name, prefixed with "flare_lun_", which gives a big clue to its history. The default owning controller and the current controller are also defined.
The final part of the "svc_storagecheck -b" command details the Storage Groups used by the array. A Storage Group is an intersection of LUNs and servers. For example, LUN0, LUN1 and LUN2 could be added to an SG with ServerA and ServerB. In this example, both ServerA and ServerB can see LUNs 0, 1 and 2.
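The Storage Group idea is easy to model. The sketch below is only an illustration of the concept in Python; the class and the group name "example_sg" are made up, and it says nothing about how FLARE stores this internally.

```python
from dataclasses import dataclass, field


@dataclass
class StorageGroup:
    """Illustrative model: a set of LUNs made visible to a set of servers."""
    name: str
    luns: set = field(default_factory=set)
    servers: set = field(default_factory=set)

    def visible_luns(self, server: str) -> set:
        # A server sees every LUN in the group, or nothing if it isn't a member.
        return set(self.luns) if server in self.servers else set()


sg = StorageGroup("example_sg", luns={0, 1, 2}, servers={"ServerA", "ServerB"})

print(sorted(sg.visible_luns("ServerA")))  # [0, 1, 2]
print(sorted(sg.visible_luns("ServerC")))  # []
```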
There are some in-built Storage Groups (~management, ~physical, ~filestorage and control_lun4) as well as dart0, dart1, dart2 and dart3. The 2 LUNs from the 300GB SAS RAID Group belong to Storage Group "dart0" along with the IDs of the Storage Processors. Similarly, the 16 LUNs from the 2TB NL-SAS RAID Group are mapped to "dart1" along with the two Storage Processors. Storage Groups dart2 and dart3 are unused and presumably for future use.
We can also get some more disk information by using the "svc_neo_map" command, specifying a LUN number:
# svc_neo_map --lun=0
This command can be used to help map the FLARE side to the DART side of the VNXe.
And this pretty much concludes the FLARE side of the VNXe. The resulting LUNs have been carved from the RAID Groups and presented to the Storage Processors in a configuration that the DART side of the VNXe will be able to understand. We'll look in more detail at the DART side in the next post.
Tuesday, 27 March 2012
EMC VNXe - diving under the hood (Part 1: Intro)
If you want a decent primer on the VNXe, check out Henriwithani's blog.
The VNXe is an interesting unit and this mini series will look at what the VNXe 3100 provides and how it works. In order to understand some of the thinking behind the design, we need some background.
First, some history...
(It's worth stating here at the start that I'm not an expert on CLARiiON or Celerra and the below is my understanding based on reading the documentation. If I've made errors, please correct me!)
Go back a couple of years and EMC had two product lines for the midrange: CLARiiON and Celerra.
CLARiiON was their block storage array capable of supporting Fibre Channel and iSCSI. Each controller in a CLARiiON ran an operating environment called FLARE which actually ran on top of a Windows XP Embedded kernel.
Celerra took a CLARiiON, added one or more “X-Blades” (aka “Datamovers”) and a 1U, Linux-based “Control Station”. The Datamovers were basically NAS head units and added support for NFS and CIFS/SMB. They could also do iSCSI, albeit in a somewhat more abstracted way than running native block on CLARiiON. The operating system running on the Datamovers was a UNIX-like derivative called DART.
More information on Celerra’s architecture here.
In January 2011, EMC announced the successor to both CLARiiON and Celerra: The VNX. At the same time, the VNXe was announced as an entry-level sibling to the VNX. The VNX appears to be a fairly straightforward upgrade to CLARiiON and Celerra: Faster, bigger, more features etc.
However, the VNXe sports a new software architecture running on an “execution environment” called CSX. From what I can tell by looking at the VNXe support logs and poking around the command line, CSX runs on a Linux kernel. Relevant parts of the FLARE and DART stacks have been ported to run on CSX and each Storage Processor (SP) (aka controller) in the VNXe runs an instance of the Linux/CSX/FLARE/DART stack.
(More information on VNXe here.)
So what you get in a VNXe is a fusion of bits of FLARE and DART, but mixed up in a new way.
The VNXe configuration
Because it is aimed at the non-storage administrator, a lot of the technical details in the VNXe are hidden. This is a bit frustrating for those of us who like to know how things work, so I’ve tried to dive under the hood a bit.
The base VNXe 3100 is a 2U unit and has 12 drive bays and up to two controllers. Each controller (Storage Processor, or “SP” in EMC language) has a dedicated management Ethernet port, and two ports for data called Eth2 and Eth3. There is an option to install a SLIC module in each controller that adds an additional 4 x 10/100/1000Mbit Ethernet ports (copper). We didn’t buy this expansion, so I can’t comment further on those.
With two controllers, the VNXe 3100 has the ability to support up to 96 drives through the addition of Disk Array Enclosures (DAEs). The DAEs connect to the base unit via 6Gbps SAS.
As an entry level system, the VNXe has some limitations in terms of disk configuration options. Disks are sold in packs and are designed to be added in one of the following ways:
- 300GB SAS: 5 pack in RAID5 (4+1)
- 300GB SAS: 6 pack in RAID10 (3+3)
- 600GB SAS: 5 pack in RAID5 (4+1)
- 600GB SAS: 6 pack in RAID10 (3+3)
- 1TB NL-SAS: 6 pack in RAID6 (4+2)
- 2TB NL-SAS: 6 pack in RAID6 (4+2)
The version we purchased has two controllers and 12 disks: 6x300GB SAS and 6x2TB NL-SAS.
Once installed in the system, the 6x300GB disks can be either configured as “High Performance” which is RAID10 (3+3), or as “Balanced Performance/Capacity” which is RAID5 (4+1) leaving 1 disk as a hot spare. We opted for the latter.
The 6x2TB NL-SAS can only be configured as RAID6 (4+2). Fine for our purposes (backups), but worth knowing as some use cases may require different configurations.
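To put some rough numbers on those choices, here is a small Python sketch comparing the nominal usable capacity of each configuration. It only counts the data disks at their marketing size (decimal GB); real usable space will be lower once formatting and system overhead are taken off, and the function name is just for illustration.

```python
def usable_gb(data_disks: int, disk_gb: int) -> int:
    """Nominal usable capacity: parity, mirror and spare disks don't count."""
    return data_disks * disk_gb


options = {
    "6x300GB SAS, RAID10 (3+3)": usable_gb(3, 300),
    "6x300GB SAS, RAID5 (4+1) + hot spare": usable_gb(4, 300),
    "6x2TB NL-SAS, RAID6 (4+2)": usable_gb(4, 2000),
}

for name, gb in options.items():
    print(f"{name}: {gb} GB")
# 6x300GB SAS, RAID10 (3+3): 900 GB
# 6x300GB SAS, RAID5 (4+1) + hot spare: 1200 GB
# 6x2TB NL-SAS, RAID6 (4+2): 8000 GB
```

So the "Balanced" option gives a third more nominal space than "High Performance" from the same six disks, plus a hot spare.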
However, it's not as simple as popping in some disks and off you go (well, it is that simple, but there's a lot going on underneath!).
To understand what the VNXe does with the disks requires another background history lesson because there is a bunch of terminology carried over from both CLARiiON and Celerra.
CLARiiON terminology
The CLARiiON works on block storage in the following way:
Physical disks are added to a CLARiiON and a RAID set is created (such as a RAID5 4+1 configuration). The result is called a "RAID Group". Traditionally, up to 16 disks could belong to a single RAID Group.
From within the RAID Group, FLARE (the CLARiiON operating environment) would create LUNs. The most basic type of LUN is the cunningly named "FLARE LUN". Some FLARE LUNs have special purposes and are used internally (for snapshots, logging etc.). These are called Private LUNs.
The problem with a FLARE LUN is that you are limited to the maximum size of the RAID Group (16 disks worth of capacity minus overhead for parity or mirroring). To overcome this, EMC invented the MetaLUN. This construct combines multiple FLARE LUNs together either through striping or concatenation.
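The difference between the two MetaLUN styles is easiest to see by working out where a given logical block lands. The following Python sketch is a simplification based on my understanding, not FLARE's actual on-disk layout; the function names, the block-sized units and the stripe element size are all assumptions.

```python
from typing import List, Tuple


def concat_map(block: int, lun_sizes: List[int]) -> Tuple[int, int]:
    """Concatenation: fill the first component LUN, then move on to the next."""
    for index, size in enumerate(lun_sizes):
        if block < size:
            return index, block
        block -= size
    raise ValueError("block is beyond the end of the MetaLUN")


def stripe_map(block: int, lun_count: int, element: int) -> Tuple[int, int]:
    """Striping: rotate across component LUNs in chunks of `element` blocks."""
    chunk, within = divmod(block, element)
    lun_index = chunk % lun_count
    offset = (chunk // lun_count) * element + within
    return lun_index, offset


# Returns (component LUN index, block offset within that LUN).
print(concat_map(150, [100, 100]))               # (1, 50)
print(stripe_map(150, lun_count=2, element=64))  # (0, 86)
```

With concatenation, block 150 sits wholly on the second LUN; with striping, consecutive chunks alternate between the LUNs so the I/O is spread across both.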
For the sake of completeness, later releases of FLARE introduced the concept of a Storage Pool, out of which Thick LUNs (with reserved space) and Thin LUNs (with no reserved space, also known as Virtual (aka Thin) Provisioning) could be created.
FLARE provides some additional features such as compression (which moves the LUN to a pool, converts it to a Thin LUN and compresses it) and Fully Automated Storage Tiering Virtual Provisioning (FAST VP), which combines multiple disk types (such as Flash, Fibre Channel and SATA) into a single pool and dynamically moves hot data to the fast disks. Pretty clever stuff.
Regardless of the type, each LUN is assigned to a preferred owning Storage Processor (SP), allowing the storage administrator the ability to manually balance LUNs for maximum performance. In the event of a controller failure, the peer controller would take ownership. The LUNs are then mapped to be visible to hosts.
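The ownership model can be sketched in a few lines of Python. This is purely illustrative: the class, the SP names "SPA" and "SPB" and the `fail_sp` helper are my own, and the real failover logic is obviously far more involved.

```python
from dataclasses import dataclass


@dataclass
class Lun:
    name: str
    default_owner: str   # preferred SP chosen by the administrator
    current_owner: str   # SP actually serving the LUN right now


def fail_sp(luns, failed: str, peer: str) -> None:
    """Hypothetical failover: the peer SP takes over the failed SP's LUNs."""
    for lun in luns:
        if lun.current_owner == failed:
            lun.current_owner = peer


# Two LUNs balanced across the two SPs.
luns = [Lun("lun0", "SPA", "SPA"), Lun("lun1", "SPB", "SPB")]

fail_sp(luns, failed="SPA", peer="SPB")
print([(lun.name, lun.current_owner) for lun in luns])
# [('lun0', 'SPB'), ('lun1', 'SPB')]
```

Keeping the default owner separate from the current owner is what lets a LUN be handed back to its preferred SP once the failed controller returns.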
Okay, so that's CLARiiON, what about Celerra?
Celerra terminology
To the CLARiiON, a Celerra is a host to which LUNs are presented. These LUNs must be configured in a specific way to be understood by the Celerra operating system, DART.
The LUNs that the CLARiiON presents are seen as single disks by the Celerra (regardless of the number of underlying physical disks that comprise the LUN). These disks are sometimes referred to as "dvols". You should assume a 1:1 mapping between CLARiiON LUNs and Celerra dvols.
In addition to the disk (dvol), Celerra offers some additional I/O constructs: A "slice" is a part of a dvol, and a single disk can be partitioned into many slices. Similarly, a "stripe" is an interlace of multiple dvols or multiple slices. The "meta" (not to be confused with a CLARiiON MetaLUN) is a concatenation of multiple slices, stripes, disks or other metas.
Finally, a "volume" is created. This is a block I/O construct into which a filesystem is created. And it's these filesystems that are made visible to hosts. The default filesystem is uxfs and is used by both NFS and CIFS fileservers.
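To keep the constructs straight in my head, I find it helps to think of them as nested objects. The Python sketch below is my own simplification: the class names mirror the Celerra terms, but the disk names, sizes and the rule that a stripe is sized by its smallest member are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Dvol:
    """A CLARiiON LUN as seen by DART (1:1 mapping assumed)."""
    name: str
    size_gb: int


@dataclass
class Slice:
    """A partition of a single dvol."""
    parent: Dvol
    size_gb: int


@dataclass
class Stripe:
    """An interlace of dvols or slices, sized by the smallest member (assumption)."""
    members: List[Union[Dvol, Slice]]

    @property
    def size_gb(self) -> int:
        return min(m.size_gb for m in self.members) * len(self.members)


@dataclass
class Meta:
    """A concatenation of slices, stripes, dvols or other metas."""
    members: list

    @property
    def size_gb(self) -> int:
        return sum(m.size_gb for m in self.members)


d1, d2 = Dvol("d1", 100), Dvol("d2", 100)
stripe = Stripe([Slice(d1, 25), Slice(d2, 25)])  # interlaced across two dvols
volume = Meta([stripe, Slice(d1, 25)])           # concatenation of stripe + slice

print(stripe.size_gb, volume.size_gb)  # 50 75
```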
As the above illustrates, the path between physical disk and filesystem is pretty complicated. How the VNXe is derived from this messy lineage will be the subject of part 2 (coming soon).
Monday, 13 February 2012
Passing the VCP for vSphere 5
With time running out on being able to certify as a VCP5 without having to take a course, I've spent the evenings of the last few weeks revising and testing in the home lab. As with previous exams, the scoring is emphatically not percentage based and is graded between 100 and 500 with a passing mark of 300(!).
Well, I passed, with a score of 378 (higher than when I did the VCP4) which I was very pleased with, but thought I'd share a few thoughts on the process:
Despite a lot of experience with vSphere and reading up on all the topics in the exam blueprint (this is a must if you want to pass!), I still found myself having to make guesses in a few of the questions. This is because some questions seemed to be more "trivia" oriented and you would only know the answer if you had done that exact operation outside of the exam.
On the plus side, you will note that the exam blueprint doesn't require you to remember a list of configuration maximums anymore!
There are 85 questions to complete within 90 minutes and I found that I answered them all with about 15 minutes to spare, but because I had marked quite a few for review, I used all the time available.
For revision, I used the aforementioned exam blueprint which contains pointers to various VMware documents. This is a huge amount to digest, but it's worth a skim read of all these docs.
I also used Scott Lowe's Mastering vSphere 5 book, which currently seems to be the de facto book on the subject, although I did refer to some parts of Mike Laverick's VMware vSphere 4 Implementation (yes, the previous release) as I feel it gave a better explanation of some sections.
Finally, there is no substitute for hands-on experience, and the home lab was fully utilised with 3 nested ESXi servers, both the Windows vCenter Server and the vSphere Appliance tested, along with the VSA, Update Manager, Distributed vSwitches etc.
Special mention must go to the following two sites which have written notes on each of the blueprint sections, both are excellent and really helped in my revision:
It's worth noting that as a regular user of the "Enterprise" version, getting experience with the "Enterprise Plus" features in the lab (thanks to the 60 day trial) is very important, as there are many things that I would never otherwise see. Hopefully the community effort to reignite the VMTN program is successful and we can all benefit from using the full feature set in our home labs.
So that's the VCP out the way until the next big release. More certs later this year...