
Yellow Bricks

by Duncan Epping



Storage capacity for swap files and TPS disabled

Duncan Epping · Dec 8, 2016 ·

A while ago (2014) I wrote an article on TPS being disabled by default in future releases. (Read KB 2080735 and 2097593 for more info.) I described why VMware made this change from a security perspective and what the impact could be. Even today, two years later, I am still getting questions about this, for instance about the impact on swap files. With vSAN you have the ability to thin provision swap files; with TPS disabled, does that introduce a risk?

Let's break it down. First of all, what is the risk of having TPS enabled, and where does TPS come into play?

With large pages enabled by default, most customers aren't actually using TPS to the level they think they are. Unless you are using old CPUs without EPT or RVI capabilities, which I doubt at this point, TPS usually only kicks in under memory pressure: large pages get broken into small pages, and only then are they shared by TPS. And if you have severe memory pressure, that usually means you will go straight to ballooning or swapping.

Having said that, let's assume a hacker has managed to find his way into your virtual machine's guest operating system. Only when memory pages are collapsed, which as described above only happens under memory pressure, will the hacker be able to attack the system. Note that the VM/data he wants to attack will need to be located on the same host (actually, even on the same NUMA node), and the memory pages/data he needs to breach the system will need to be collapsed. Many would argue that if a hacker gets that far, all the way into your VM and in a position to exploit this gap, you have far bigger problems. On top of that, what is the likelihood of pulling this off? Personally, and I know the VMware security team probably doesn't agree, I think it is unlikely. I understand why VMware changed the default, but there are a lot of "ifs" in play here.

Anyway, let's assume you have assessed the risk, feel you need to protect yourself against it, and keep the default setting (intra-VM TPS only). What is the impact on your swap file capacity allocation? As stated, when there is memory pressure and ballooning cannot free up sufficient memory, and intra-VM TPS is not providing the needed memory space either, the next step after compressing memory pages is swapping! And in order for ESXi to swap memory to disk you will need disk capacity. If the swap file is thin provisioned (vSAN Sparse Swap), then those blocks on vSAN will need to be allocated before pages can be swapped out. (This also applies to NFS, where files are thin provisioned by default, by the way.)

What does that mean in terms of design? Well, in your design you will need to ensure you allocate capacity on vSAN (or any other storage platform) for your swap files. This doesn't need to be 100% of the swap file capacity, but it should cover more than the expected level of overcommitment. If you expect that during maintenance, for instance, or during an HA event, you will have memory overcommitment of about 25%, then you should ensure that at least 25% of the capacity needed for swap files is available, to avoid a VM being stunned because new blocks for the swap file cannot be allocated when you run out of vSAN datastore space.
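To make that a bit more concrete, here is a minimal sizing sketch, assuming you size the free space against the total provisioned VM memory and your expected overcommitment percentage. The function name and example numbers are my own illustration, not an official VMware formula.

# Illustrative sizing sketch, not an official VMware formula: estimate how much
# datastore capacity to keep free for thin provisioned (vSAN Sparse Swap) swap
# files when you expect a certain level of memory overcommitment.

def swap_headroom_gb(provisioned_vm_memory_gb, expected_overcommit_fraction):
    """Capacity (GB) to keep free on the datastore for swap file growth.

    expected_overcommit_fraction: the fraction of provisioned memory you expect
    to overcommit during maintenance or an HA event, e.g. 0.25 for 25%.
    """
    return provisioned_vm_memory_gb * expected_overcommit_fraction

# Example: 2048 GB of provisioned VM memory and ~25% expected overcommitment
# means keeping at least ~512 GB of datastore capacity free for swap file blocks.
print(swap_headroom_gb(2048, 0.25))  # 512.0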

Let it be clear: I don't know many customers running their storage systems at 95% capacity or more, but if you are, and you have thin swap files, and you are overcommitting, and TPS is disabled, you may want to rethink your strategy.

Benchmarking an HCI solution with legacy tools

Duncan Epping · Nov 17, 2016 ·

I was driving back home from Germany on the autobahn this week, thinking about the 5-6 conversations I have had over the past couple of weeks about performance tests for HCI systems. (Hence the pic on the right side being very appropriate ;-)) What stood out during these conversations is that many folks repeat the tests they once conducted on their legacy array and then compare the results 1:1 to their HCI system. Fairly often people even use a legacy tool like Atto Disk Benchmark. Atto is a great tool for testing the speed of the drive in your laptop, or maybe even a RAID configuration, but the name already more or less reveals its limitation: "disk benchmark". It wasn't designed to show the capabilities and strengths of a distributed / hyper-converged platform.

Now, I am not trying to pick on Atto, as similar problems exist with tools like IOMeter. I see people doing a single-VM IOMeter test with a single disk. In most hyper-converged offerings that doesn't result in a spectacular outcome. Why? Simply because that is not what the solution is designed for. Sure, there are ways to demonstrate what your system is capable of with legacy tools: simply create multiple VMs with multiple disks. Even with a single VM you can produce better results by picking the right policy, as vSAN allows you to stripe data across 12 devices for instance (which can be spread across hosts, disk groups, etc.). Without selecting the right policy or running multiple VMs, you may not be hitting the limits of your system, but simply the limits of your VM's virtual disk controller, the host disk controller, the capabilities of a single device, and so on.
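To illustrate why that single-VM, single-disk test caps out so quickly, here is a rough sketch based on Little's law: achievable IOPS is roughly bounded by the number of outstanding I/Os divided by latency. The queue depth of 32, the 1 ms latency, and the eight VMs below are assumed example values, not measurements of any particular platform.

# Rough sketch of why a single-VM, single-disk test understates a scale-out
# system: per Little's law, IOPS is roughly bounded by outstanding I/Os
# divided by latency. All numbers below are assumed example values.

def iops_ceiling(outstanding_ios, latency_ms):
    """Approximate IOPS ceiling for a given queue depth and latency."""
    return outstanding_ios / (latency_ms / 1000.0)

single_vm = iops_ceiling(outstanding_ios=32, latency_ms=1.0)      # one virtual disk queue
eight_vms = 8 * iops_ceiling(outstanding_ios=32, latency_ms=1.0)  # load spread over 8 VMs

print(f"single VM ceiling : {single_vm:,.0f} IOPS")   # ~32,000
print(f"8 VMs in parallel : {eight_vms:,.0f} IOPS")   # ~256,000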

But there is an even better option: pick the right toolset and select the right workload (surely doing only 4K blocks isn't representative of your production environment). VMware has developed a benchmarking solution that works with both traditional and hyper-converged offerings, called HCIBench. HCIBench can be downloaded and used for free through the VMware Flings website. Instead of that single-VM, single-disk test, you will now be able to test many VMs with multiple disks to show how a scale-out storage system behaves. It will provide you with great insight into the capabilities of your storage system, whether that is vSAN, any other HCI solution, or even a legacy storage system for that matter. Just like the world of storage has evolved, so has the world of benchmarking.

Disk format version 4.0 update to 2.0 suggested

Duncan Epping · Jun 15, 2016 ·

I've seen some people reporting a strange message in the Virtual SAN UI. The UI states the following: "Disk format version 4.0 (update to 2.0 suggested)". This is what that looks like (I stole the pic from VMTN, thanks Phillip).

[Screenshot: Disk format version 4.0 (update to 2.0 suggested)]

A bit strange: considering you apparently have 4.0, why would you go to 2.0? Well, you are actually on 2.0 and are supposed to go to 3.0. The reason this happens is, most likely, that not all hosts within your cluster are on the same version of Virtual SAN, or that vCenter Server was not updated to the latest version while ESXi is on a higher version. So far I have seen this reported when people upgrade to vSphere 6.0 Update 2. If you are upgrading, make sure to upgrade all hosts to ESXi 6.0 Update 2, but before you do, upgrade vCenter Server to 6.0 Update 2 first!

Getting a 404 error with the Host Client

Duncan Epping · Jun 10, 2016 ·

Just a short post. I was getting a 404 error with the Host Client when hitting https://<ip of esxi host>/ui. No clue what caused it. I re-installed the latest version of the Host Client, but that didn't solve it. Then I noticed that my endpoints.conf was missing the "/ local" entry. You can check that as follows when logged in through SSH:

cat /etc/vmware/rhttpproxy/endpoints.conf

I did the following (edit + restarted the HTTP reverse proxy) to get it working again:

Edit the config file:

vi /etc/vmware/rhttpproxy/endpoints.conf

Add the following line:

/ local 8309 redirect allow 

Restart the service:

/etc/init.d/rhttpproxy restart

600GB write buffer limit for VSAN?

Duncan Epping · May 17, 2016 ·

[Image: Write Buffer Limit for VSAN]

I get this question on a regular basis, and it has been explained many, many times, so I figured I would dedicate a blog to it. Now, Cormac has written a very lengthy blog on the topic and I am not going to repeat it; I will simply point you to the math he has provided around it. I do, however, want to provide a quick summary:

When you have an all-flash VSAN configuration, the current write buffer limit is 600GB (this limit only applies to all-flash). As a result, many seem to think that when an 800GB device is used for the write buffer, 200GB will go unused. This simply is not the case. We have a rule of thumb of a 10% cache-to-capacity ratio. This rule of thumb was developed with both performance and endurance in mind, as described by Cormac in the link above. The 200GB above the 600GB write buffer limit is actively used by the flash device for endurance. Note that an SSD is usually over-provisioned by default; most of them have extra cells for endurance and write performance, which makes the experience more predictable and at the same time more reliable. The same applies in this case to the Virtual SAN write buffer.

The image at the top right side shows how this works. This SSD has 800GB of advertised capacity. The "write buffer" is limited to 600GB; however, the white space is considered "dynamic over-provisioning" capacity, as it will be actively used by the SSD automatically (SSDs do this by default). Then there is an additional x% of over-provisioning by default on all SSDs, which in the example is 28% (typical for enterprise grade), and even after that there usually is an extra 7% for garbage collection and other SSD internals. If you want to know more about why this is and how this works, Seagate has a nice blog.
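A quick back-of-the-envelope sketch of those numbers, using the 800GB device, the 600GB write buffer limit, and the 10% cache-to-capacity rule of thumb from this post; the calculation itself is just my illustration.

# Back-of-the-envelope sketch using the numbers from the post: an 800 GB
# cache device, the 600 GB all-flash write buffer limit, and the 10%
# cache-to-capacity rule of thumb.

device_gb = 800         # advertised capacity of the cache-tier SSD
write_buffer_gb = 600   # all-flash write buffer limit

# The capacity above the limit is not wasted; the SSD uses it as extra
# ("dynamic") over-provisioning for endurance and write performance.
dynamic_overprovisioning_gb = device_gb - write_buffer_gb
print(f"dynamic over-provisioning: {dynamic_overprovisioning_gb} GB")  # 200 GB

# 10% cache-to-capacity rule of thumb: an 800 GB cache device pairs with
# roughly 8 TB of capacity-tier storage in the same disk group.
capacity_tier_gb = device_gb / 0.10
print(f"suggested capacity tier  : {capacity_tier_gb:,.0f} GB")  # 8,000 GB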

So let's recap: as a consumer/admin, the 600GB write buffer limit should not be a concern. Although the write buffer is limited in terms of buffer capacity, the flash cells will not go unused, and the rule of thumb as such remains unchanged: a 10% cache-to-capacity ratio. Let's hope this puts this (non) discussion finally to rest.
