Removing a disk group from a VSAN host

I have been playing around with my VSAN cluster for the last couple of weeks and it has become quite messy. I created many VMs and many snapshots and removed many of those again, all of this while pulling cables out of servers and pulling disks. Basically I was stress testing VSAN while injecting faults to see how it responds. It was time to clean up and upgrade to a later build, as the beta refresh had just been released. After deleting a bunch of VMs I noticed that not everything was removed; I had also uploaded ISOs and some other random stuff which I probably should not have. Anyway, I needed to clean up one of my hosts.

I figured I would use RVC for the exercise, just to get a bit more familiar with it. First I wanted to check the current state of my cluster, so I used the “vsan.disks_stats” command:
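Roughly like this (I will leave the actual output out; the datacenter and cluster names in the sketch below are made up, so adjust the paths to your own inventory):

    # Navigate to the VSAN cluster object in RVC (hypothetical inventory path)
    > cd /localhost/MyDC/computers/VSAN-Cluster
    # Show per-disk statistics (SSD/HDD, capacity used, number of components) for all hosts
    > vsan.disks_stats .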

Then I figured I would simply remove the disk group for server “prmb-esx08” using “vsan.host_wipe_vsan_disks”:
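Again, roughly something like this; the host name is the only real bit here, and since this wipes all disks claimed by VSAN on that host, double check the command’s --help output before running it:

    # From within the cluster object: wipe the VSAN disk group(s) of a single host
    # This is destructive, so expect to have to confirm / force the operation
    > vsan.host_wipe_vsan_disks hosts/prmb-esx08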

Note that you can also do this using the UI:

  • Go to your cluster
  • Click “Manage” and “Virtual SAN” -> “Disk Management”
  • Select the “Disk Group” and click the “Remove the Disk Group” icon

If only the world of applications was as simple as that…

In the last 6-12 months I have heard many people making statements about how the application landscape should change. Whether it is a VC writing on GigaOm about how server CPU utilization is still low and how Linux containers and new application architectures can solve this (there are various reasons CPU utilization is low, by the way, ranging from memory constraints to storage performance constraints, or it simply being by design), or whether it is network administrators telling the business that their application will need to cope with a new IP address after a disaster recovery fail-over. Although I can see where they are coming from, I don’t think the world of applications is as simple as that.

Sometimes it seems that we forget something which is fundamental to what we do. IT as it exists today provides a service to its customers. It provides an infrastructure which “our / your customers” can consume. This infrastructure, to a certain extent, should be architected based on the requirements of your customers. Why? What is the point of creating a foundation which doesn’t meet the requirements of what will be put on top of it? That is like building a skyscraper on top of the foundation of a house; bad things are bound to happen! Although I can understand why we feel our customers should change, I do not think it is realistic to expect this will happen overnight. I even wonder whether it is realistic to ask them to change at all.

Just to give an example, and I am not trying to pick on anyone here, let’s take this quote from the GigaOm article:

Server virtualization was supposed to make utilization rates go up. But utilization is still low and solutions to solve that will change the way the data center operates.

I agree that server virtualization promised to make utilization rates go up, and it did indeed. And overall utilization may still be low, although I guess that depends on who you talk to and what you include in your numbers. Many of the customers I talk to are up to around 40-50% utilized from a CPU perspective, do not want to go higher than that, and have their reasons for it. Was utilization the only reason for them to start virtualizing? I would strongly argue that it was not the only reason; there are many others! Reducing the number of physical servers to manage, availability (HA) of their workloads, transportability of their workloads, automation of deployments, disaster recovery, maintenance… the reasons are almost countless.

I guess you will need to ask yourself what all of these reasons have in common. They are non-disruptive to the application architecture! Yes, there is the odd application that cannot be virtualized, but the majority of x86 workloads can be, without the need to make any changes to the application. Clearly you would have to talk to the app owner as their app is being migrated to a different platform, but there will be hardly any work for them associated with this migration.

Oh I agree, everything would be a lot better if the application landscape were completely overhauled and all applications magically used a highly available and scalable distributed application framework. Everything would be a lot better if all applications were magically optimized for the infrastructure they consume, could handle instant IP address changes, and could deal with random physical servers disappearing. Reality unfortunately is that this is not the case today, and it will not be for many of our customers in the upcoming years. Re-architecting an application, which for most app owners often comes from a third party, is not something that happens overnight. Projects like those take years, if they even successfully finish.

Although I agree with the conclusion drawn by the author of the article, I think there is a twist to it:

It’s a dynamic time in the data center. New hardware and infrastructure software options are coming to market in the next few years which are poised to shake up the existing technology stack. Should be an exciting ride.

Reality is that we deliver a service, a service that caters for the needs of our customers. If our customers are not ready, or even willing, to adapt, this will not just be a hurdle but a showstopper for many of these new technologies. Being a disruptive technology (I’m not a fan of the word) is one thing; causing disruption is another.

Startup News Flash part 10

There we are, part 10 of the Startup News Flash. Someone asked me on Twitter last week why Company XYZ was never included in the news flash. Let it be clear that I am not leaving anyone out (unless I feel they aren’t relevant to this newsletter or my audience); I have limited time, so I typically do not do briefings… Which means that if the marketing team doesn’t send me the details via email and I haven’t somehow stumbled across the announcement, it will not appear on here. If you want your company to be listed, make sure they send their press releases over.

Some new models announced by Nutanix. Funny to see how they’ve been pushing hard from a marketing perspective to remove the “pure VDI play” label they had, and now launch a VDI-focused model called the 7000 series. (Do not get me wrong, I think this is a brilliant move!) The 7000 series offers you the option to include NVIDIA GRID K1 or K2 cards. These are primarily intended to accelerate graphics, so if you are for instance doing a lot of 3D rendering or just are a heavy graphical VDI user, these could really provide a benefit over their (and other vendors’) normal offerings. On top of that, the 3000 and 6000 series have been overhauled. The NX-3061 and NX-3061 with 10-core (2.8GHz) Ivy Bridge CPUs have been introduced, as have the NX-6060 and NX-6080 with 10-core CPUs (2.8GHz and 3.0GHz respectively). I haven’t seen anything around pricing, so I can’t comment on that.

No clue what it is exactly these guys do, to be honest, but I find their teaser video very intriguing. There is not much detail to be found around what they are doing other than “re-imagine enterprise computing”. Hoping to hear more from these guys in the future, as their teaser did make me curious.

I don’t care much about benchmarks, but it is always nice to see a smaller company (or the underdog) beat the big players. Kaminario managed to outperform Oracle, IBM and Fujitsu on the SPC-2 performance benchmark using their scale-out all-flash array K2 v4, just a couple of weeks after breaking the SPC-1 benchmark world record again. Like I said, I don’t care much about benchmarks as they don’t typically say much about operational efficiency etc. Still, it is a nice indication of what can be achieved, though your results may vary depending on your IO pattern of course.

VSAN VDI Benchmarking and Beta refresh!

I was reading this blog post on VSAN VDI benchmarking today on VROOM!, the VMware performance blog. You see a lot of people doing synthetic tests (max IOPS with sequential reads) on all sorts of storage devices, but lately more and more vendors are doing these more “real world” performance tests. While reading this article about VDI benchmarking, and I suggest you check out all parts (part 1, part 2, part 3), one thing stood out to me: the comparison between VSAN and an all-flash array.

The following quotes show the strength of VSAN if you ask me:

we see that VSAN can consolidate 677 heavy users (VDImark) for 7-node and 767 heavy users for 8-node cluster. When compared to the all flash array, we don’t see more than 5% difference in the user consolidation.

Believe me when I say that 5% is not a lot. If you are actively looking at various solutions, I would highly recommend including the “overhead costs” in your criteria list, as depending on the solution chosen this could make a substantial difference. I have seen other solutions requiring a lot more resources. But what about response time? Because that is where the typical all-flash array shines: ultra-low latency. How about VSAN?

Similar to the user consolidation, the response time of Group-A operations in VSAN is similar to what we saw with the all flash array.

Both are very interesting results if you ask me. The < 5% difference in user consolidation is what stood out to me the most! Once again, for more details on these tests read the VDI Benchmarking blog: part 1, part 2, part 3!

Beta Refresh

For those who are testing VSAN, there is a beta refresh available as of today. This release has a fix for the AHCI driver issue… and it increases the number of HDDs per disk group from 6 to 7. This will come in handy, as many servers have 8, 16 or 24 disk slots, allowing you to do 7 HDDs + 1 SSD per disk group. Also, some additional RVC commands have been added in the storage policy space; I am sure those will prove useful!

A nice side effect of the number of HDDs going up is an increase in maximum capacity:

(8 hosts * (5 diskgroups * 7 HDDs)) * Size of HDD = Total capacity

With 2 TB disks this would result in:

(8 * (5 * 7)) * 2TB = 560TB

Now keep on testing with VSAN and don't forget to report feedback through the community forums or your VMware rep.

Virtual SAN and maintenance windows…

After writing the article stating that “4 is the minimum number of hosts for VSAN” I received a lot of questions via email, Twitter etc. about the cost associated with it and whether this was a must. Let me start by saying that I wrote that article to get people thinking about sizing their VSAN environment. When it comes down to it, Virtual SAN and maintenance windows can be a difficult topic.

I guess there are a couple of things to consider here. Even in a regular storage environment you typically do upgrades in a rolling fashion, meaning that if you have two controllers, one will be upgraded while the other handles IO. In that case you are also at risk. The thing is though, as a virtualization administrator you have a bit more flexibility, and you expect certain features, like for instance vSphere HA, to work as expected. You need to ask yourself: what is the level of risk I am willing to take, and what is the level of risk I can take?

When it comes to placing a host into maintenance mode, from a VSAN point of view you will need to ask yourself:

  • Do I want to move data from one host to another to maintain availability levels?
  • Do I just want to ensure data accessibility and take the risk of potential downtime during maintenance?

I guess there is something to say for either. When you move data from one node to another to maintain availability levels, your “maintenance window” could stretch out for an extremely long time. As you would potentially be copying TBs over the network from host to host, it could take hours to complete. If your ESXi upgrade, including a host reboot, takes about 20 minutes, is it acceptable to wait hours for the data to be migrated? Or do you inform your users about the potential downtime and do the maintenance with a higher risk, but complete it in minutes rather than hours? After those 20 minutes VSAN would sync up again automatically, so no data loss etc.
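For completeness, both approaches can also be driven from RVC. A minimal sketch only, using the same example host as before, and I am quoting the evacuation mode names from memory here, so verify them against the command’s --help output in your build:

    # Full data migration: evacuate all components from the host first (safe, but can take hours)
    > vsan.enter_maintenance_mode hosts/prmb-esx08 -e evacuateAllData
    # Ensure accessibility only: objects stay accessible but run with reduced redundancy (fast, more risk)
    > vsan.enter_maintenance_mode hosts/prmb-esx08 -e ensureObjectAccessibility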

It is impossible for me to give you advice on this one to be honest. I would highly recommend sitting down with your storage team as well: look at what their current procedures are today, what they have included in their SLA to the business (if there is one), and how they handle upgrades / periodic maintenance.