I was on vacation for the past two weeks. Yesterday I got a message from Frank Denneman and Pete Flecha asking if I had some time available. I was working in my backyard, so I dropped my tools and hopped on. Apparently John was sick, so I took his spot, and here’s the result: an interesting conversation with Frank on the topic of VMware Cloud on AWS. I can’t wait for it to be generally available. Enjoy the show!
I have been having discussions with various customers about all sorts of highly available vSAN environments. Now that vSAN has been available for a couple of years, customers are becoming more and more comfortable designing these infrastructures, which also leads to some interesting discussions. Many discussions these days are on the subject of multi-room or multi-site infrastructures. A lot of customers seem to have multiple datacenter rooms in the same building, or multiple datacenter rooms across a campus. When going through these different designs one thing stands out: in many cases customers have a dual datacenter configuration, and the question is whether they can use stretched clustering across two rooms or whether they can do fault domains across two rooms.
Of course theoretically this is possible (not supported, but you can do it). Just look at the diagram below: we have 2 clusters across 2 rooms and we cross-host the witnesses, protecting each cluster’s witness by hosting it on the other vSAN cluster:
The challenge with these types of configurations is what happens when a datacenter room goes down. What a lot of people tend to forget is that the impact will vary depending on what fails. In the scenario above, where you cross-host a witness, the failure of “Site A”, which is the left part of the diagram, results in the full environment being unavailable. Really? Yeah, really:
- Site A is down
- Hosts-1a / 2a / 1b / 2b are unavailable
- Witness B for Cluster B is down >> as such Cluster B is down as majority is lost
- As Cluster B is down (temporarily), Cluster A is also impacted as Witness A is hosted on Cluster B
- So we now have a circular dependency
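To make the chain of events above concrete, here is a minimal Python sketch. It is a toy model, not how vSAN itself evaluates quorum: it assumes each stretched cluster has three votes (Site A hosts, Site B hosts, witness), needs two of them to stay available, and that a witness VM only counts when both its site and the cluster hosting it are up. The function name and the data layout are made up for illustration.

```python
# Hypothetical model: each stretched cluster has three votes
# (Site A hosts, Site B hosts, witness) and needs two of them to stay up.
# A witness VM only counts if the site it runs in is up AND the cluster
# hosting it is still available.

def surviving_clusters(failed_site, witnesses):
    """witnesses maps cluster -> (cluster hosting its witness VM, site of that VM)."""
    up = {cluster: True for cluster in witnesses}  # start optimistic
    changed = True
    while changed:  # repeat until no further cluster drops out
        changed = False
        for cluster, (hosting_cluster, witness_site) in witnesses.items():
            witness_ok = witness_site != failed_site and up[hosting_cluster]
            votes = 1 + int(witness_ok)  # one data site survives, plus the witness
            if up[cluster] and votes < 2:
                up[cluster] = False
                changed = True
    return up

CROSS_HOSTED = {
    "Cluster A": ("Cluster B", "Site B"),  # Witness A runs on Cluster B, in Site B
    "Cluster B": ("Cluster A", "Site A"),  # Witness B runs on Cluster A, in Site A
}

print(surviving_clusters("Site A", CROSS_HOSTED))
print(surviving_clusters("Site B", CROSS_HOSTED))
```

In this model either site failure takes both clusters down: one cluster loses its witness directly, and the other loses its witness because the cluster hosting it is gone. The same function can be fed other witness placements to run through alternative designs.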
Some may say: well, you can move Witness B to the same side as Witness A, meaning in Site B. But now if Site B fails the witness VMs are gone as well, impacting all clusters directly. That would only work if only Site A is ever expected to go down, and who can give that guarantee? Of course the same applies to using “fault domains”, just look at the diagram below:
In this scenario we have the “orange fault domain” in Room A, “yellow” in Room B and “green” across rooms as there is no other option at that point. If Room A fails, VMs that have components in “Orange” and on “Host-3” will be impacted directly: as more than 50% of their components will be lost, the VMs cannot be restarted in Room B. Only when their components in “fault domain green” happen to be on “Host-6” can the VMs be restarted. Yes, in terms of setting up your fault domains this is possible and it is supported, but it isn’t recommended. No guarantees can be given that your VMs will be restarted when either of the rooms fails. My tip of the day: when you start working on your design, overlay the virtual world with the physical world and run through failure scenarios step by step. What happens if Host 1 fails? What happens if Site 1 fails? What happens if Room A fails?
Now so far I have been talking about fault domains and stretched clusters. These are all logical / virtual constructs which are not necessarily tied to physical constructs. In reality, however, when you design for availability/failure and try to prevent any type of failure from impacting your environment, the physical aspect should be considered at all times. Fault Domains are not random logical constructs: there’s a requirement for 3 fault domains at a minimum, so make sure you have 3 fault domains physically as well. Just to be clear, in a stretched cluster the witness acts as the 3rd fault domain. If you do not have 3 physical locations (or rooms), look for alternatives! One of those for instance could be vCloud Air, you can host your Stretched Cluster witness there if needed!
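Running through a room failure like this can also be sketched in a few lines of Python. This is only an illustration under assumptions: the host-to-room mapping is my reading of the diagram (hosts 1 to 3 in Room A, hosts 4 to 6 in Room B, with the green fault domain made up of the third host in each room), every component is assumed to carry one vote, and `vm_survives` is a made-up helper.

```python
# Assumed layout: orange = Host-1/2 (Room A), yellow = Host-4/5 (Room B),
# green = Host-3 (Room A) + Host-6 (Room B).
ROOM_OF_HOST = {
    "Host-1": "Room A", "Host-2": "Room A", "Host-3": "Room A",
    "Host-4": "Room B", "Host-5": "Room B", "Host-6": "Room B",
}

def vm_survives(component_hosts, failed_room):
    """True if more than 50% of the VM's components are outside the failed room."""
    surviving = [h for h in component_hosts if ROOM_OF_HOST[h] != failed_room]
    return len(surviving) * 2 > len(component_hosts)

# One component per fault domain: orange, yellow, green.
vm_green_on_host3 = ["Host-1", "Host-4", "Host-3"]
vm_green_on_host6 = ["Host-1", "Host-4", "Host-6"]

print(vm_survives(vm_green_on_host3, "Room A"))  # loses 2 of 3 components
print(vm_survives(vm_green_on_host6, "Room A"))  # keeps 2 of 3 components
print(vm_survives(vm_green_on_host6, "Room B"))  # same VM, other room fails
```

The VM that is lucky when Room A fails is the unlucky one when Room B fails, which is exactly why no guarantee can be given for either room.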
A while ago (2014) I wrote an article on TPS being disabled by default in future releases. (Read KB 2080735 and 2097593 for more info.) I described why VMware made this change from a security perspective and what the impact could be. Even today, two years later, I am still getting questions about this and about what, for instance, the impact is on swap files. With vSAN you have the ability to thin provision swap files; with TPS being disabled, is this something that brings a risk?
Let’s break it down. First of all, what is the risk of having TPS enabled and where does TPS come into play?
With large pages enabled by default most customers aren’t actually using TPS to the level they think they are. Unless you are using old CPUs which don’t have EPT or RVI capabilities, which I doubt at this point, TPS (usually) only kicks in under memory pressure: large pages get broken into small pages and only then will they be TPS’ed. If you have severe memory pressure that usually means you will go straight to ballooning or swapping.
Having said that, let’s assume a hacker has managed to find his way into your virtual machine’s guest operating system. Only when memory pages are collapsed, which as described above only happens under memory pressure, will the hacker be able to attack the system. Note that the VM/data he wants to attack will need to be located on the same host (actually, even the same NUMA node), and the memory pages/data he needs to breach the system will need to be collapsed. Many would argue that if a hacker gets all the way into your VM and is capable of exploiting this gap, you have far bigger problems. On top of that, what is the likelihood of pulling this off? Personally, and I know the VMware security team probably doesn’t agree, I think it is unlikely. I understand why VMware changed the default, but there are a lot of “IFs” in play here.
Anyway, let’s assume you assessed the risk, feel you need to protect yourself against it and keep the default setting (intra-VM TPS only). What is the impact on your swap file capacity allocation? As stated, when there is memory pressure, ballooning cannot free up sufficient memory and intra-VM TPS is not providing the needed memory space either, the next step after compressing memory pages is swapping! And in order for ESXi to swap memory to disk you will need disk capacity. If and when the swap file is thin provisioned (vSAN Sparse Swap), those blocks will need to be allocated on vSAN before swapping out. (This also applies to NFS, where files are thin provisioned by default, by the way.)
What does that mean in terms of design? Well, in your design you will need to ensure you allocate capacity on vSAN (or any other storage platform) for your swap files. This doesn’t need to be 100% of the swap file capacity, but it should be more than the level of expected overcommitment. If you expect that during maintenance (or an HA event), for instance, you will have memory overcommitment of about 25%, then you should ensure you have at least 25% of the capacity needed for swap files available. This avoids a VM being stunned because new blocks for the swap file cannot be allocated when you run out of vSAN datastore space.
To be clear, I don’t know many customers running their storage systems at 95% of capacity or more, but if you are, and you have thin swap files, and you are overcommitting, and TPS is disabled, you may want to rethink your strategy.
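As a back-of-the-envelope illustration of that sizing rule, here is a small Python sketch. It relies on the fact that a VM’s swap file is sized as configured memory minus memory reservation; the VM list, the function name and the 25% figure as a parameter are just example values, and any storage policy overhead on top of the raw swap size is deliberately ignored.

```python
def swap_capacity_to_keep_free_gb(vms, expected_overcommit):
    """Capacity to keep available for thin swap files.

    vms: list of (configured_memory_gb, memory_reservation_gb) tuples.
    expected_overcommit: fraction of memory you expect to overcommit, e.g. 0.25.
    """
    # A swap file is sized as configured memory minus the memory reservation.
    total_swap_gb = sum(max(mem - res, 0) for mem, res in vms)
    return total_swap_gb * expected_overcommit

# Example inventory (made-up numbers): three VMs, one with a 4 GB reservation.
vms = [(16, 0), (8, 4), (32, 0)]
print(swap_capacity_to_keep_free_gb(vms, 0.25))  # 52 GB of swap in total, 25% of it
```

Treat the result as a floor rather than a target: the point is that the free capacity should be at least the expected overcommitment, preferably more.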
Most of us have been using DRS for the longest time. To be honest, not much has changed over the past years; sure, there were some tweaks and minor changes, but nothing huge. In 6.5, however, a big feature is introduced, but let’s just list all the changes for completeness’ sake:
- Predictive DRS
- Network-Aware DRS enhancements
- DRS profiles
First of all, Predictive DRS. This is a feature that the DRS team has been working on for a while. It integrates DRS with vROps to provide placement and balancing decisions. Note that this feature will be in Tech Preview until a version of vRealize Operations (vROps) is released which is fully compatible with vSphere 6.5, hopefully sometime in the first half of next year. Brian Graf has some additional details around this feature here by the way.
Note that of course DRS will continue to use the data provided by vCenter Server. On top of that, however, it will also leverage vROps to predict what resource usage will look like, all of this based on historic data. You can imagine a VM currently using 4GB of memory (demand), while every day around the same time a SQL job runs which makes the memory demand spike up to 8GB. This data is now available through vROps, and as such this predicted resource spike can now be taken into consideration when making placement/balancing recommendations. If for whatever reason the prediction is that the resource consumption will be lower, then DRS will ignore the prediction and simply take current resource usage into account, just to be safe. (Which makes sense if you ask me.) Oh, and before I forget, DRS will look ahead for 60 minutes (3600 seconds).
How do you configure this? Well, that is fairly straightforward when you have vROps running: go to your DRS cluster, click edit settings and enable the “Predictive DRS” option. Easy, right? (See screenshot below) You can also change that look-ahead value by the way. I wouldn’t recommend it, but if you like you can add an advanced setting called ProactiveDrsLookaheadIntervalSecs.
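The “only trust the prediction when it is higher” behavior can be summarized in a few lines of Python. This is a sketch of the rule as described above, not the actual DRS algorithm; the function name and the shape of the forecast data are assumptions made for the example.

```python
LOOKAHEAD_SECS = 3600  # DRS looks ahead 60 minutes

def demand_for_balancing(current_demand, forecast):
    """Pick the demand value to balance on.

    forecast: list of (seconds_from_now, predicted_demand) tuples,
    a hypothetical stand-in for the historic-data predictions.
    """
    in_window = [d for t, d in forecast if 0 <= t <= LOOKAHEAD_SECS]
    predicted_peak = max(in_window, default=current_demand)
    # A lower prediction is ignored: current usage wins, just to be safe.
    return max(current_demand, predicted_peak)

# VM at 4 GB now, SQL job expected to push it to 8 GB in 30 minutes.
print(demand_for_balancing(4, [(1800, 8), (7200, 12)]))
# Prediction says usage will drop to 2 GB: stick with the current 4 GB.
print(demand_for_balancing(4, [(600, 2)]))
```

Note that the 12 GB spike two hours out is ignored in the first call because it falls outside the look-ahead window.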
One of the other features that people have asked about is the consideration of additional metrics during placement/load balancing. This is what Network-Aware DRS brings. Within Network IO Control (v3) it is possible to set a reservation for a VM in terms of network bandwidth and have DRS consider this. This was introduced in vSphere 6.0 and has now been improved in 6.5. With 6.5, DRS also takes physical NIC utilization into consideration: when a host has higher than 80% network utilization, DRS will consider this host to be saturated and will not consider it for placing new VMs.
And lastly, DRS Profiles. So what are these? In the past we’ve seen many new advanced settings introduced which allowed you to tweak the way DRS balanced your cluster. In 6.5 several additional options have been added to the UI to make it easier for you to tweak DRS balancing, if and when needed that is, as I would expect that the majority of DRS users will not need to. Let’s look at each of the new options:
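The saturation check described above boils down to a simple filter. The Python below is only a sketch of that idea: the 80% threshold comes from the text, while the host names, the utilization numbers and the function name are invented for the example, and real DRS placement weighs far more than this one metric.

```python
SATURATION_THRESHOLD = 0.80  # above 80% pNIC utilization a host counts as saturated

def placement_candidates(host_network_utilization):
    """Return the hosts that are not network-saturated.

    host_network_utilization: dict of host name -> utilization as a fraction (0.0-1.0).
    """
    return [
        host
        for host, utilization in host_network_utilization.items()
        if utilization <= SATURATION_THRESHOLD
    ]

hosts = {"esx01": 0.45, "esx02": 0.85, "esx03": 0.80}
print(placement_candidates(hosts))  # esx02 is above the threshold and is skipped
```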
So there are 3 options here:
- VM Distribution
- Memory Metric for Load Balancing
- CPU Over-Commitment
If you look at the descriptions then I think they make a lot of sense. The first two options in particular are ones I get asked about every once in a while. Some people prefer to have a more equally balanced cluster in terms of number of VMs per host, which can be done by enabling “VM Distribution”. And for those who would much rather load balance on “consumed” instead of “active” memory, you can also enable this. Now “consumed” vs “active” is almost a religious debate. Personally I don’t see too much value, especially not in a world where memory pages are zeroed when a VM boots and consumed is always high for all VMs, but nevertheless, if you prefer, you can balance on consumed instead. Last is CPU Over-Commitment. This is one that could be useful when you want to limit the number of vCPUs per pCPU; apparently this is something that many VDI customers have asked for.
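To illustrate what limiting the number of vCPUs per pCPU means in practice, here is a tiny Python sketch. It is not how the DRS option itself is configured or enforced; the function name, the 4:1 ratio and the host numbers are assumptions chosen purely for the example.

```python
def fits_overcommit_limit(host_pcpus, vcpus_in_use, new_vm_vcpus, max_vcpus_per_pcpu):
    """True if powering on the new VM keeps the host within the vCPU:pCPU limit."""
    return vcpus_in_use + new_vm_vcpus <= host_pcpus * max_vcpus_per_pcpu

# A 16-core host with a hypothetical 4:1 limit allows 64 vCPUs in total.
print(fits_overcommit_limit(16, 60, 4, 4))  # 64 vCPUs: exactly at the limit
print(fits_overcommit_limit(16, 60, 8, 4))  # 68 vCPUs: over the limit
```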
I hope that was useful, we are aiming to update the vSphere Clustering Deepdive at some point as well to include some of these details…
Last week I was talking to a customer and they posed some interesting questions. What excites me in IT (why I work for VMware) and what is next for hyper-converged? I thought they were interesting questions and very relevant. I am guessing many customers have that same question (what is next for hyper-converged, that is). They see this shiny thing out there called hyper-converged and wonder: if I take those steps, where does the journey end? I truly believe that those who went the hyper-converged route simply took the first steps on an SDDC journey.
Hyper-converged I think is a term which was hyped and over-used, just like “cloud” a couple of years ago. Let’s break down what it truly is: hardware + software. Nothing really groundbreaking. It is different in terms of how it is delivered. Sure, it is a different architectural approach, as you utilize a software-based / server-side scale-out storage solution which sits within the hypervisor (or on top for that matter). Still, that hypervisor is something you were already using (most likely), and I am sure that “hardware” isn’t new either. Then the storage aspect must be the big differentiator, right? Wrong. The fundamental difference, in my opinion, is how you manage the environment and the way it is delivered and supported. But does it really need to stop there or is there more?
There definitely is much more if you ask me. That is one thing that has always surprised me. Many see hyper-converged as a complete solution; the reality, though, is that in many cases essential parts are missing. Networking, security, automation/orchestration engines, logging/analytic engines, BC/DR (and orchestration of it) etc. Many different aspects and components which seem to be overlooked. Just look at networking: even including a switch is not something you see too often, and what about the configuration of a switch, or overlay networks, firewalls / load-balancers? It all appears not to be a part of hyper-converged systems. Funny thing is though, if you are going on a software-defined journey, if you want an enterprise-grade private cloud that allows you to scale in a secure but agile manner, these components are a requirement; you cannot go without them. You cannot extend your private cloud to the public cloud without any type of security in place, and one would assume that you would like to orchestrate everything from that same platform and have the same networking / security capabilities at your disposal, both private and public.
That is why I was so excited about the VMworld US keynote. Cross Cloud Services on top of hyper-converged, leveraging all the tools VMware provides today (vSphere, vSAN, NSX), will allow you to do exactly what I describe above. Whether that is to IBM, vCloud Air or any other of the mega clouds listed in the slide below is even beside the point. Extending your datacenter services into public clouds is what we have been talking about for a while, this hybrid approach which could bring (dare I say) elasticity. This is a fundamental aspect of SDDC, of which a hyper-converged architecture is simply a key pillar.
Hyper-converged by itself does not make a private cloud. Hyper-converged does not deliver a full SDDC stack; it is a great step in the right direction, however. But before you take that (necessary) hyper-converged step, ask yourself what is next on the journey to SDDC. Networking? Security? Automation/Orchestration? Logging? Monitoring? Analytics? Hybridity? Who can help you reach full potential, who can help you take those next steps? That’s what excites me, that is why I work for VMware. I believe we have a great opportunity here as we are the only company who holds all the pieces to the SDDC puzzle. And with regards to what is next? Deliver all of that in an easy to consume manner, that is what is next!