
Yellow Bricks

by Duncan Epping


design

VMware vSphere Design Workshop

Duncan Epping · Mar 24, 2010 ·

A week ago I posted about an upcoming class titled “managing for performance”. As many of you know, Scott Drummonds, VMware’s performance guru, was involved in crafting this excellent course. This is VMware’s new approach to developing training material: courses are developed with the help of the Subject Matter Experts within the company.

I was fortunate enough to be part of the development team for the VMware vSphere Design Workshop. Notice the “Workshop” part, as this course is different from any other VMware course you have done so far. It is a real workshop: it is all about discussing design considerations with your peers and crafting a design based on specific requirements and constraints. Although it is not an official prerequisite, I would highly recommend it to anyone who wants to become a VCDX. I hope you will enjoy this workshop as much as we enjoyed creating it.

Again, a great job by the VMware Education Team; I just love this new concept. For those who can’t wait, check this section of the VMware website to see where the workshops are scheduled. And for the Benelux readers: Eric Sloof is running three of these workshops soon… be there!

Source: VMware vSphere: Design Workshop

Module 1: Course Introduction

  • Provide a general overview of the course

Module 2: Design Process Overview

  • Discuss the design methodology, criteria, and approach
  • Introduce an example five-step design process

Module 3: ESX/ESXi Host Design

  • Identify useful information for making host design decisions
  • Analyze best practices and host design alternatives

Module 4: vSphere Virtual Datacenter Design

  • Identify useful information for making vCenter Server, database, cluster, and resource pool design decisions
  • Analyze best practices and vCenter Server, database, cluster, and resource pool design alternatives

Module 5: vSphere Network Design

  • Identify useful information for making network design decisions
  • Analyze best practices and network design alternatives

Module 6: vSphere Storage Design

  • Identify useful information for making storage design decisions
  • Analyze best practices and storage design alternatives

Module 7: Virtual Machine Design

  • Identify useful information for making virtual machine design decisions
  • Analyze best practices and virtual machine design alternatives

Module 8: Management and Monitoring Design

  • Identify useful information for making management and monitoring design decisions
  • Analyze best practices and management and monitoring design alternatives

VCDX – Design and Troubleshooting scenarios

Duncan Epping · Mar 22, 2010 ·

After the VCDX defenses last week in Munich and the defense sessions we had during VMware PEX, I want to stress the following from my VCDX Defense blog article:

The next two are role-play based. The panel is the customer and you are the architect. By asking questions, whiteboarding, and discussion you will need to solve an issue or come to a specific solution for the customer. This is something you cannot really prepare for. Although you may think you will have more than enough time, you will not have enough time. Time flies when the pressure is on. Keep in mind that it’s not the end result that counts for these scenarios, it’s your thought process! (source)

Please read that section several times: it is about showing your skills as an architect/consultant. We do not expect you to craft a full design in 30 minutes; we expect you to gather requirements and, based on that information, start crafting a high-level overview/design. Or, as John stated in his VCDX tips:

  • Scenarios for VCDX defenses test journey to solution, not necessarily the final answer. Whiteboard, talk and ask questions.
  • Troubleshooting scenarios – think of the architecture and implementation approach to resolution. Logs, design, SC commands.

Keep in mind that during the scenarios the panellists are working through a scoring rubric; they have pre-screened your application and have specific questions that they need answered in order to score effectively. So ask questions and LISTEN to the answers!

Scale UP!

Duncan Epping · Mar 17, 2010 ·

Lately I have been having a lot of discussions with customers around the sizing of their hosts. Especially Cisco UCS (with the 384GB option) and the upcoming Intel Xeon 5600 series, with six cores per CPU, take the “Scale Up” discussion to a new level.

I guess we had this discussion in the past as well, when 32GB became a commodity. The question I always have is: how many eggs do you want to have in one basket? Basically, do you want to scale up (larger hosts) or scale out (more hosts)?

It’s a common discussion, and a lot of people don’t see the impact of how they size their hosts. Think about this environment: 250 VMs in total, needing roughly 480GB of memory:

  • 10 Hosts, each having 48GB and 8 Cores, 25 VMs each.
  • 5 Hosts, each having 96GB and 16 Cores, 50 VMs each.

If you look at it from an uptime perspective: should a failure occur in scenario 1, you lose 10% of your environment; in scenario 2 this is 20%. Clearly the cost associated with downtime for 20% of your estate is higher than for 10% of your estate.

It’s not only the cost associated with the impact of a host failure; it is also, for instance, the ability of DRS to load balance the environment. The fewer hosts you have, the smaller the chance that DRS will be able to balance the load. Keep in mind that DRS uses a deviation to calculate the imbalance and simulates moves to see if they result in a balanced cluster.

Another thing to keep in mind is HA. When you design for N+1 redundancy and need to buy an extra host, the cost associated with redundancy is high in a scale-up scenario. Not only is the cost high, the load on the remaining hosts when a fail-over occurs will also increase immensely. If you only have 4 hosts and 1 host fails, the added load on the remaining 3 hosts will have a far higher impact than it would on, for instance, 9 remaining hosts in a scale-out scenario. The sketch below illustrates the difference.
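
To make the trade-off concrete, here is a minimal Python sketch of my own (not from the original post) that uses the 10-host and 5-host scenarios above and compares the size of the failure domain with the extra load each surviving host absorbs after a single host failure.

```python
# Rough scale-up vs scale-out comparison for the 250 VM / 480GB example above.
# Illustrative only; real sizing also has to account for HA admission control,
# memory overhead, and licensing.

def host_failure_impact(total_hosts: int, vms_per_host: int) -> dict:
    """Impact of losing a single host in a uniformly loaded cluster."""
    return {
        "environment_lost_pct": 100 / total_hosts,             # share of VMs hit by the failure
        "vms_to_restart": vms_per_host,                        # VMs HA needs to restart
        "extra_load_per_survivor_pct": 100 / (total_hosts - 1) # extra load per surviving host
    }

scale_out = host_failure_impact(total_hosts=10, vms_per_host=25)  # 10 x 48GB / 8 cores
scale_up  = host_failure_impact(total_hosts=5,  vms_per_host=50)  # 5 x 96GB / 16 cores

for name, result in (("scale out", scale_out), ("scale up", scale_up)):
    print(f"{name}: lose {result['environment_lost_pct']:.0f}% of the environment, "
          f"restart {result['vms_to_restart']} VMs, "
          f"+{result['extra_load_per_survivor_pct']:.1f}% load per surviving host")

# scale out: lose 10% of the environment, restart 25 VMs, +11.1% load per surviving host
# scale up:  lose 20% of the environment, restart 50 VMs, +25.0% load per surviving host
```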

Licensing is another often-used argument for buying larger hosts, but for VMware licensing it usually will not make a difference. I’m not the “capacity management” or “capacity planning” guru, to be honest, but I can recommend VMware Capacity Planner as it can help you easily create several scenarios. (Or PlateSpin Recon, for that matter.) If you have never tried it and are a VMware partner, check it out: run the scenarios based on scale-up and scale-out principles and do the math.

Now, don’t get me wrong, I am not saying you should not buy hosts with 96GB, but think before you make this decision. Decide what an acceptable risk is and discuss the impact of that risk with your customer(s). As you can imagine, for any company there is a cost associated with downtime. Downtime for 20% of your estate will have a different financial impact than downtime for 10% of your estate, and this needs to be weighed against all the pros and cons of scale out versus scale up.

Impact of decisions…

Duncan Epping · Feb 15, 2010 ·

I’ve been conducting VCDX Defense Interviews for a while now. Last week in Las Vegas during PEX something struck me, and I guess this post by Frank Denneman is a good example…

On a regular basis I come across NFS based environments where the decision is made to store the virtual machine swap files on local VMFS datastores. Using host-local swap can affect DRS load balancing and HA failover in certain situations. So when designing an environment using host-local swap, some areas must be focused on to guarantee HA and DRS functionality.

Every decision you make has an impact on your design/environment. What exactly does a decision impact? In most cases a decision impacts the following:

  1. Cost
  2. Availability
  3. Performance

The example Frank wrote about (see the quote above) describes a decision which clearly had an impact on all three. Although at the time it might have been a best practice, the decision to go along with that best practice still had an impact on the environment. Because it was a best practice, this impact might not have been as obvious. But when listed as follows, I hope you understand why I am writing this article:

  • Costs – Reduced costs by moving the .vswp file to local disks.
  • Performance – VMotion performance is affected because .vswp files need to be copied from HOST-A to HOST-B.
  • Availability – Possibly less availability when the amount of free disk space on local VMFS isn’t sufficient to restart VMs in case of disaster.

As you can see, a simple decision can have a major impact. Even though it might be a best practice, you still need to think about the possible impact it has and whether this best practice fits your environment and meets your (customer’s) requirements. Another great example would be LUN sizing. So what if I were to randomly pick a LUN size? Let’s say 1TB:

  • Cost – The average VM size is 35GB, I want a maximum of 20 VMs on a datastore, and I need 20% overhead for .vswp files and snapshots, so I end up with a maximum usage of 840GB. That leaves 160GB of the 1TB LUN unused! (A worked calculation follows this list.)
  • Availability – Although the availability of the datastore itself will be unaffected, the uptime of your environment might change. When a single datastore fails, you lose 1TB worth of data. Not only will you lose more VMs, restoring will also take longer.
  • Performance – Normally I would restrict the LUN size to reduce the number of VMs on a single datastore. More VMs on a datastore means a higher chance of SCSI reservation conflicts.
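
As a quick check of the numbers in the cost bullet, here is a small Python sketch of my own using the assumptions stated above (35GB average VM size, at most 20 VMs per datastore, 20% overhead for .vswp files and snapshots, and 1TB taken as 1000GB):

```python
# Worked example for the "randomly picked" 1TB LUN from the cost bullet above.
avg_vm_size_gb  = 35
max_vms_per_lun = 20
overhead_factor = 1.20   # 20% extra for .vswp files and snapshots
lun_size_gb     = 1000   # 1TB, taken as 1000GB here

max_usage_gb = avg_vm_size_gb * max_vms_per_lun * overhead_factor
unused_gb    = lun_size_gb - max_usage_gb

print(f"Maximum expected usage: {max_usage_gb:.0f}GB")     # 840GB
print(f"Unused capacity on the LUN: {unused_gb:.0f}GB")    # 160GB
```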

The VCDX certification is not just about knowing all the technical details (of course that is an essential part of it); it is about understanding the impact of a decision. It is about justifying your decision based on the impact it has on the environment/design. Know the pros and cons. Even if something is a best practice, it might not necessarily apply to your situation.

Storage Masking?

Duncan Epping · Feb 5, 2010 ·

I received a bunch of questions around storage masking over the last couple of weeks. One of them was around VMware’s best practice to mask LUNs on a per-cluster basis. This best practice has been around for years and is basically there to reduce conflicts. More hosts accessing the same LUNs means more overhead; just to give you an example, every 5 minutes a rescan of both HBAs takes place automatically to check for dead storage paths. You can imagine that there is a difference between 64 hosts accessing your storage and limiting it to, for instance, 16 hosts. Also think about the failure domain you are introducing: what if an APD condition exists? That no longer impacts just 1 cluster… it could impact all of them.

For vSphere 5.1 read this revision…

The obvious next question is: won’t I lose a lot of flexibility? Well, in a way you do, as a simple VMotion to another cluster will not work anymore. But of course there is always a way to move a VM to a different cluster. In my designs I usually propose a so-called “Transfer Volume”. This volume (NFS or VMFS) is accessible from both clusters and can be used to transfer VMs to a different cluster. Yes, there is a slight operational overhead here, but it also reduces overhead in terms of traffic to a LUN and decreases the chance of SCSI reservation conflicts, etc.

Here’s the process (a scripted sketch follows the list):

  1. Storage VMotion the VM from LUN on Array 1 to Transfer LUN
  2. VMotion VM from Cluster A to Cluster B
  3. Storage VMotion the VM from Transfer LUN to LUN on Array 2
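
For those who prefer to script the three steps, here is a minimal pyVmomi sketch of my own; the VM, datastore, host, and cluster names are hypothetical, error handling is omitted, and the connection parameters may need adjusting for your pyVmomi version.

```python
# Sketch of the transfer-volume move using pyVmomi; all names are hypothetical.
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", disableSslCertValidation=True)
content = si.RetrieveContent()

def find_obj(content, vimtype, name):
    """Look up a managed object by name via a container view."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

vm          = find_obj(content, vim.VirtualMachine, "app01")
transfer_ds = find_obj(content, vim.Datastore, "transfer-volume")
target_ds   = find_obj(content, vim.Datastore, "clusterB-lun01")
target_host = find_obj(content, vim.HostSystem, "esx-b-01.example.local")
target_pool = find_obj(content, vim.ClusterComputeResource, "Cluster-B").resourcePool

# 1. Storage VMotion: move the disks to the transfer volume (visible to both clusters)
WaitForTask(vm.RelocateVM_Task(vim.vm.RelocateSpec(datastore=transfer_ds)))

# 2. VMotion: move the running VM to a host (and resource pool) in the other cluster
WaitForTask(vm.RelocateVM_Task(vim.vm.RelocateSpec(host=target_host, pool=target_pool)))

# 3. Storage VMotion: move the disks from the transfer volume to a LUN masked to that cluster
WaitForTask(vm.RelocateVM_Task(vim.vm.RelocateSpec(datastore=target_ds)))

Disconnect(si)
```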

Of course these don’t necessarily need to be two separate arrays; it could just as easily be a single array with a group of LUNs masked to a particular cluster. For the people who have a hard time visualizing it: [diagram]

