(Inter-VM) TPS Disabled by default, what should you do?

We’ve all probably seen the announcement about inter-VM(!!) TPS (transparent page sharing) being disabled by default in future releases of vSphere, and the recommendation to disable it in current versions. The reason is that a research paper was published demonstrating how it is possible to get access to data under certain highly controlled conditions. As the KB article describes:

Published academic papers have demonstrated that by forcing a flush and reload of cache memory, it is possible to measure memory timings to determine an AES encryption key in use on another virtual machine running on the same physical processor of the host server if Transparent Page Sharing is enabled. This technique works only in a highly controlled environment using a non-standard configuration.

Many people blogged about the potential impact on their environments or designs. In the past, people would typically take 20 to 30% memory sharing into account when sizing their environment. With inter-VM TPS disabled, that of course goes out the window. Frank described this nicely in this post. However, as Frank also described and as I mentioned in previous articles, when large pages are used (usually the case) TPS does not kick in by default, only under memory pressure…
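To make the sizing impact concrete, here is a minimal Python sketch of the arithmetic. The 20 to 30% sharing figure comes from the text above; the VM count, the 8 GB per VM and the function name `host_memory_needed_gb` are hypothetical, and the model deliberately ignores per-VM overhead memory and headroom.

```python
def host_memory_needed_gb(vm_count: int, gb_per_vm: float, sharing_ratio: float = 0.0) -> float:
    """Physical memory needed if `sharing_ratio` of configured guest memory is
    assumed to be reclaimed by TPS. Overhead memory and headroom are ignored."""
    if not 0.0 <= sharing_ratio < 1.0:
        raise ValueError("sharing_ratio must be in the range [0, 1)")
    configured_gb = vm_count * gb_per_vm
    return configured_gb * (1.0 - sharing_ratio)


if __name__ == "__main__":
    # Hypothetical environment: 100 VMs with 8 GB each (800 GB configured).
    for ratio in (0.0, 0.20, 0.30):
        needed = host_memory_needed_gb(100, 8, ratio)
        print(f"{ratio:.0%} sharing assumed -> {needed:.0f} GB physical memory")
```

In this hypothetical example, a design that counted on 30% sharing (560 GB) needs 800 GB again once inter-VM sharing is gone.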

The “under pressure” part is important if you ask me, as TPS is the first memory reclamation technique used when a host is under pressure. If TPS cannot sufficiently reduce the memory pressure, ballooning is leveraged, followed by compression and ultimately swapping. Personally, I would like to avoid swapping at all costs, and preferably compression as well. Ballooning typically doesn’t result in a huge performance degradation, so it could be acceptable, but I prefer TPS, as it simply breaks up large pages into small pages and collapses those where possible. The performance loss is hardly measurable in that case. Of course, TPS is far more effective when pages can be shared between VMs rather than just within a VM.
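The reclamation order described above can be sketched as a simple greedy cascade. This is a simplified model, not how ESXi actually schedules reclamation (ESXi reacts to free-memory thresholds rather than to a single shortfall figure); the function name `reclaim` and all megabyte figures are hypothetical.

```python
from typing import Dict, List, Tuple

# Order from the text: TPS first, then ballooning, compression and finally swapping.
RECLAIM_ORDER = ("tps", "ballooning", "compression", "swapping")


def reclaim(shortfall_mb: int, available_mb: Dict[str, int]) -> List[Tuple[str, int]]:
    """Walk the techniques in order, taking as much as each can provide,
    until the shortfall is covered. Returns (technique, MB reclaimed) pairs."""
    plan: List[Tuple[str, int]] = []
    remaining = shortfall_mb
    for technique in RECLAIM_ORDER:
        if remaining <= 0:
            break
        taken = min(remaining, available_mb.get(technique, 0))
        if taken > 0:
            plan.append((technique, taken))
            remaining -= taken
    return plan


if __name__ == "__main__":
    # Hypothetical host that must free 4 GB. With inter-VM sharing, TPS finds
    # more duplicate pages; with intra-VM sharing only, it finds fewer.
    with_inter_vm = {"tps": 3072, "ballooning": 2048, "compression": 2048, "swapping": 8192}
    intra_vm_only = {"tps": 1024, "ballooning": 2048, "compression": 2048, "swapping": 8192}
    print(reclaim(4096, with_inter_vm))  # [('tps', 3072), ('ballooning', 1024)]
    print(reclaim(4096, intra_vm_only))  # [('tps', 1024), ('ballooning', 2048), ('compression', 1024)]
```

With fewer shareable pages, the cascade reaches compression sooner, which is exactly what I would prefer to avoid.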

Anyway, the question remains: should you have (inter-VM) TPS disabled or not? When you assess the risk, first ask yourself who has access to your virtual machines, as the technique requires you to log in to a virtual machine. Before we look at the scenarios, note that I have mentioned “inter-VM” a couple of times now: TPS is not completely disabled in future versions. It will be disabled for inter-VM sharing by default, but it can be enabled. More on that can be found in this article on the vSphere blog.
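To illustrate the difference between inter-VM and intra-VM sharing, here is a toy Python model of content-based page sharing. It is an assumption-laden sketch, not VMware's implementation: a real implementation hashes page contents and compares pages before collapsing them, whereas this model simply uses the page bytes as the lookup key. Restricting sharing to within a VM is modelled by adding the VM name to the key; the function name is hypothetical.

```python
from typing import Dict, List

PAGE_SIZE = 4096  # TPS works on small (4 KB) pages


def physical_pages_needed(vms: Dict[str, List[bytes]], inter_vm: bool) -> int:
    """Count the distinct physical pages left after collapsing identical pages."""
    backing: set = set()
    for vm_name, pages in vms.items():
        for page in pages:
            # Inter-VM: identical content anywhere on the host maps to one page.
            # Intra-VM only: the VM name is part of the key, so identical pages
            # in different VMs are never collapsed.
            backing.add(page if inter_vm else (vm_name, page))
    return len(backing)


if __name__ == "__main__":
    zero_page = bytes(PAGE_SIZE)
    common_os_page = b"\x01" * PAGE_SIZE
    vms = {
        "vm1": [zero_page, zero_page, common_os_page],
        "vm2": [zero_page, common_os_page, b"\x02" * PAGE_SIZE],
    }
    print(physical_pages_needed(vms, inter_vm=True))   # 3
    print(physical_pages_needed(vms, inter_vm=False))  # 5
```

Six guest pages need three physical pages with inter-VM sharing, but five when sharing is restricted to each VM.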

Let’s explore three scenarios:

  1. Server virtualisation (private)
  2. Public cloud
  3. Virtual Desktops

In the case of “Server virtualisation”, in most scenarios I would expect that only the system administrators and/or application owners have access to the virtual machines. The question then is: why would they go to such lengths when they have access to the virtual machines anyway? So when server virtualisation is your use case, and access to your virtual machines is restricted to a limited number of people, I would definitely consider enabling inter-VM TPS.

In a public cloud environment, however, this is of course different. You can imagine that a hacker could buy a virtual machine and try to retrieve the AES encryption key. What the hacker would do with it next is still an open question. Hopefully the cloud provider ensures that tenants are isolated from each other from a security/networking point of view; if so, there shouldn’t be much an attacker could do with it. Then again, it could be just one of the many steps needed to break into a system, so I would probably not want to take the risk, even though the risk is low. This is one of the scenarios where I would leave inter-VM TPS disabled.

The third and last scenario is Virtual Desktops. With virtual desktops, many different users have access to virtual machines… The question, though, is whether you are running or accessing any applications that leverage AES encryption. I cannot answer that for you, so I will leave that up in the air… you will need to assess that risk.

I guess the answer to whether you should or should not disable (inter-VM) TPS is as always: it depends. I understand why inter-VM TPS was disabled, but if the risk is low I would definitely consider enabling it.
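The reasoning across the three scenarios can be condensed into a small Python helper. This is only a condensed reading of the scenarios above, not an official VMware decision tree, and the names `Scenario` and `inter_vm_tps_advice` are hypothetical; it is no substitute for an actual risk assessment.

```python
from enum import Enum


class Scenario(Enum):
    SERVER_VIRTUALISATION = "server virtualisation (private)"
    PUBLIC_CLOUD = "public cloud"
    VIRTUAL_DESKTOPS = "virtual desktops"


def inter_vm_tps_advice(scenario: Scenario, access_restricted: bool) -> str:
    """Condensed reading of the three scenarios discussed in the text."""
    if scenario is Scenario.PUBLIC_CLOUD:
        # Anyone can buy a VM next to yours: keep inter-VM TPS disabled.
        return "leave disabled"
    if scenario is Scenario.SERVER_VIRTUALISATION and access_restricted:
        # Only admins/application owners log in, and they have access anyway.
        return "consider enabling"
    if scenario is Scenario.VIRTUAL_DESKTOPS:
        # Many users log in; the deciding question is AES usage.
        return "assess the risk: are applications using AES encryption?"
    # E.g. server virtualisation where many people can log in to the VMs.
    return "assess the risk"


if __name__ == "__main__":
    print(inter_vm_tps_advice(Scenario.SERVER_VIRTUALISATION, access_restricted=True))
```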

VMware / ecosystem / industry news flash… part 3

It has been a couple of weeks since the last VMware / ecosystem / industry news flash… but we have a couple of items which I felt are worth sharing. As with the previous two parts, I will share the link and my thoughts on it, and I hope you will leave a comment with your thoughts on a specific announcement. If you work for a vendor, I would like to ask you to add a disclaimer mentioning this so that all the cards are on the table.

  • PernixData FVP 2.0 available! New features and also new pricing / packaging!
    Frank Denneman has a whole slew of articles describing the new functionality of FVP 2.0 in depth. If you ask me, the resilient memory caching is an especially cool feature, and failure domains are something I very much appreciate, as they allow you to build smarter clusters! The change in pricing/packaging kind of surprised me: an “Enterprise” edition was announced and the current version was renamed to “Standard”. The SMB package was renamed to “Essentials Plus”; from a naming point of view this aligns more closely with the VMware naming scheme, which I guess makes life easier for customers. I have not seen details of the pricing itself yet, so I don’t know what the impact actually is. PernixData has upped the game again, and it keeps amazing me how fast they are growing and at what pace they are releasing new functionality. It makes you wonder what is next for these guys?!
  • Nutanix Unveils Industry’s First All-Flash Hyper-Converged Platform and Only Stretch Clustering Capability!
    I guess the “all-flash” part was just a matter of time considering the price point flash devices have reached. I have looked at these configurations many times, and if you consider that SAS drives are now as expensive as decent SSDs, it only makes sense. Note that “all-flash” also means a new model, the NX-9000, which comes in a 2U / 2-node form factor. List price is $110,000 per node… That is $220K per block, and with a 3-node minimum $330K, which feels like a steep price, but then again we all know that the street price will be very different. The NX-9000 comes with either 6x 800GB or 6x 1.6TB flash devices for capacity, and I am guessing that the other models will also get “all-flash” options in the future… it only makes sense. What about that stretched clustering? Well, this is what excited me most in yesterday’s announcement. In version 4.1, Nutanix will allow up to 400 km of distance between sites for a stretched cluster. Considering their platform is “VM aware”, it should be very easy to select which VMs you want to protect (and which you do not). On top of that, they allow you to use a different hardware platform in each site. In other words, you can run a top-of-the-line block in your primary site while having a lower-end block in your recovery site. From a TCO/ROI point of view this can be very beneficial if you have no requirement for a uniform environment. Judging by the answers on Twitter, the platform has not gone through VMware vSphere Metro Storage Cluster certification yet, but this is likely to happen soon. SRM integration is also being looked at. All in all, nice announcements if you ask me!
  • SolidFire announces two new models and new round of funding (82 million!)
    What is there to say about the funding that hasn’t been said yet? 82 million in a Series D round says enough if you ask me. SolidFire is one of those startups that have impressed me from the very beginning. They have a strong scale-out storage system that offers excellent quality of service functionality, primarily aimed at the Service Provider market. That seems to be slowly changing with the introduction of these new models, as their smallest model now brings a 100K entry point. Note that the smallest SolidFire configuration is 4 nodes; spec details can be found here. As stated, what excites me most about SolidFire are the services the system brings: QoS, data reduction and replication / SRM integration.
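For those who want to check the Nutanix NX-9000 numbers mentioned above, here is the list-price and raw-capacity arithmetic as a short Python sketch. The per-node list price, the 2-node block and the 3-node minimum come from the announcement as quoted; reading “6x 800GB or 1.6TB” as six devices per node is an assumption, and the capacity is raw, before any replication or data reduction. Street prices will differ.

```python
LIST_PRICE_PER_NODE_USD = 110_000  # NX-9000 list price per node (from the announcement)
NODES_PER_BLOCK = 2                # 2U / 2-node block
MIN_NODES = 3                      # minimum cluster size
DEVICES_PER_NODE = 6               # assumption: 6 flash devices per node


def list_price_usd(nodes: int) -> int:
    return nodes * LIST_PRICE_PER_NODE_USD


def raw_flash_tb(nodes: int, device_tb: float) -> float:
    """Raw flash capacity before replication or data reduction."""
    return nodes * DEVICES_PER_NODE * device_tb


if __name__ == "__main__":
    print(f"Per block:       ${list_price_usd(NODES_PER_BLOCK):,}")  # $220,000
    print(f"Minimum cluster: ${list_price_usd(MIN_NODES):,}")        # $330,000
    for device_tb in (0.8, 1.6):
        print(f"{MIN_NODES} nodes x {DEVICES_PER_NODE} x {device_tb} TB = "
              f"{raw_flash_tb(MIN_NODES, device_tb):.1f} TB raw")
```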

Thanks, and again feel free to drop a comment / leave your thoughts!

RE: The VCDX candidate’s advantage over the panellists

I was reading Josh Odgers’ post on the VCDX Defense. Josh’s article can be summarised by the following passage:

As a result, the candidate should be an expert in the design being presented and answering questions from the panel about the design should not be intimidating.

Having gone through the process myself, knowing many of the VCDXs and having been on countless panels, I completely disagree with Josh. Sure, you do need to know your design inside out… but it is not about “who has an advantage”. The panel members are not there to fail or pass the candidate… they are there to assess your skills as an architect!

If you look at the defense day there are three parts:

  1. Defend your design
  2. Design scenario
  3. Troubleshooting scenario

For the design and troubleshooting scenarios you get a random exercise, so you have no prior knowledge of what will be asked. When it comes to defending your design, you will of course know your design (hopefully) better than anyone else. However, the questions you get will not necessarily be about the specifics or details of your design. The VCDX panel is there to assess your skills as an architect, not your “fact cramming skills”. A good panel will ask a lot of hypothetical questions like:

  • Your design uses NFS-based storage; how would FC-connected storage have changed your design?
  • Your design is based on capacity requirements for 80 virtual machines; what would you have done differently if the requirement had been 8000 virtual machines?
  • Your design …

So when you do mock exams, prepare for these types of hypothetical questions. That is when you really start to understand the impact decisions can have. And when you get one of these questions during your defense and do not know the answer, make sure you guide the panel through your thought process. That is what differentiates someone who can learn facts (VCP exam) from someone who can digest them, understand them and apply them in different scenarios (VCDX exam).

As I stated, it may sound as though knowing your design inside out gives you a big advantage over the panel members, but it probably doesn’t… that is not what they are testing you on! Your ability to assess and adapt is put through the wringer; your skills as an architect are tested thoroughly, and that is where you will need to do well.

Good luck!

VMware patches for #shellshock

Last night a whole bunch of patches for the shellshock security issue were released. Although I hope that all of you have secured your datacenter against outside and inside threats by isolating networks, using firewalls etc., it would be wise to install these patches ASAP. The majority of Linux-based VMware appliances were impacted, but luckily patching them is not a big deal. Below you can find a list of the patches and links to the downloads for your convenience.

Note that the downloads are in the middle of the list, so you need to scroll down before you see them. There are also patches for products like the VMware VSA, vSphere Replication, VC Ops etc. Make sure to download those as well!
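If you want to quickly check whether the bash on a given Linux host or appliance still responds to the original shellshock trick, the classic test sets an environment variable that looks like a function definition followed by a trailing command. Below is a minimal Python sketch of that test, assuming Python 3.7 or later is available where you run it; the function name is hypothetical. It only covers the original CVE-2014-6271 variant, not follow-up issues such as CVE-2014-7169, and it is no replacement for applying the VMware patches.

```python
import os
import shutil
import subprocess


def bash_vulnerable_to_cve_2014_6271() -> bool:
    """Run the classic shellshock test against the local bash."""
    bash = shutil.which("bash")
    if bash is None:
        raise FileNotFoundError("bash not found on PATH")
    env = dict(os.environ)
    # Vulnerable bash versions execute the trailing command ("echo vulnerable")
    # while importing this function definition from the environment.
    env["x"] = "() { :;}; echo vulnerable"
    result = subprocess.run(
        [bash, "-c", "echo test"],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )
    return "vulnerable" in result.stdout.splitlines()


if __name__ == "__main__":
    if bash_vulnerable_to_cve_2014_6271():
        print("VULNERABLE: patch this system")
    else:
        print("not vulnerable to CVE-2014-6271")
```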

Changes – Joining Office of CTO

Almost 2 years ago I joined Integration Engineering (R&D) within VMware. As part of that role I was very fortunate to work on a very exciting project called “MARVIN”. As most of you know, MARVIN became EVO:RAIL, which has been my primary focus for the last 18 months or so. EVO:RAIL evolved into a team after a successful prototype and came “out of stealth” at VMworld when it was announced by Pat Gelsinger. A very exciting project, a great opportunity and an experience I would not have wanted to miss out on. It was truly unique to be one of the three founding members and see it grow from a couple of sketches and ideas into a solution. I want to thank Mornay for giving me the opportunity to be part of the MARVIN rollercoaster ride, and the EVO:RAIL team for the ride / experience / discussions etc!

Over the last few months I have been thinking about where I wanted to go next, and today I am humbled and proud to announce that I am joining VMware’s Office of CTO (OCTO as they refer to it within VMware) as a Chief Technologist. I’ve been with VMware a little over 6 years, and started out as a Senior Consultant within PSO… I never imagined, not even in my wildest dreams, that one day I would have the opportunity to join a team like this. Very honoured, and looking forward to what is ahead. I am sure I will be placed in many uncomfortable situations, but I know from experience that that is needed in order to grow. I don’t expect much to change on my blog; I will keep writing about products / features / vendors / solutions I am passionate about. That definitely was Virtual SAN in 2014, and could be Virtual Volumes or NSX in 2015… who knows!

Go OCTO!