
Single initiator zoning

Duncan Epping · Oct 28, 2008 ·

I’ve been doing VMware Design Reviews lately, and so have my colleagues in the PSO department. A Design Review is a quick scan of your design documentation by a VMware consultant. The consultant will check your docs against best practices and propose changes to the design.

One of the things we encounter on a regular basis is that admins took the easy path for the zoning in their storage design. So what’s zoning? In short: a way to partition your fabric into smaller subsets. These smaller subsets provide you with better security and less interference.

You can do zoning in two ways, soft and hard. With “soft zoning” you use the device WWN in a zone, without any restriction on which port this WWN is attached to. With “hard zoning” you put the port into a specific zone. So what do I prefer? I prefer “hard zoning” because it forces you to know how your devices are connected, and that makes troubleshooting a lot easier.

So now that I’ve chosen a way to zone, I can just write down all my port numbers, create a zone, drop them in and I’m done… Well, not so fast, there’s another choice to make before you start. How am I going to zone: single initiator zoning or multi initiator zoning? A single initiator zone is a single HBA in a zone with the target device(s). A multi initiator zone holds all the initiators that need to communicate with one or more devices in one zone. As one can imagine, multi initiator zones are really easy to set up, but they are definitely not my first choice.

Single initiator zones are the way to go. If there’s no need for initiators to be able to communicate with each other, and for ESX there isn’t, then they shouldn’t be able to. Not only is this more secure, because initiators can’t communicate with each other, it also cuts out a lot of rubbish on your fibre. Rubbish such as “Registered State Change Notifications”. Although RSCN storms don’t occur as often as they used to, they are still a risk for contention and should be avoided when possible. So if you’re doing a design or preparing for one, keep this in mind: Single Initiator Zones are the way to go!
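To make the difference a bit more tangible, here is a minimal Python sketch of the two approaches. It is only a model for reasoning about a design, not a switch configuration. The HBA and array port names are made up, and so are the function names. It builds one zone per HBA versus one big zone, and then counts how many initiator pairs end up in the same zone.

```python
from itertools import combinations


def single_initiator_zones(initiators, targets):
    """Return one zone per initiator: that HBA plus the target device(s)."""
    return {f"zone_{hba}": [hba, *targets] for hba in initiators}


def multi_initiator_zone(initiators, targets):
    """Return a single zone holding every initiator and every target."""
    return {"zone_all": [*initiators, *targets]}


def initiator_pairs_sharing_a_zone(zones, initiators):
    """Count the initiator pairs that sit together in at least one zone."""
    known = set(initiators)
    pairs = set()
    for members in zones.values():
        hbas = sorted(m for m in members if m in known)
        pairs.update(combinations(hbas, 2))
    return len(pairs)


# Hypothetical names, for illustration only.
hbas = ["esx01_hba0", "esx02_hba0", "esx03_hba0"]
array_ports = ["array_spa0", "array_spb0"]

single = single_initiator_zones(hbas, array_ports)
multi = multi_initiator_zone(hbas, array_ports)

print(len(single), initiator_pairs_sharing_a_zone(single, hbas))  # 3 0
print(len(multi), initiator_pairs_sharing_a_zone(multi, hbas))    # 1 3
```

With single initiator zones you get more zones to maintain, but no two HBAs share a zone. The single big zone is less work, and every HBA can see every other HBA.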

There are a whole bunch of good articles on the net about zoning. Read them, you might learn a thing or two:

TechTarget.com: part1, part2, part3
Storage Networking 101: Understanding Fibre Channel Zones
Single HBA Zoning

Have fun,

Queuedepth, how and when

Duncan Epping · Oct 27, 2008 ·

So you’ve probably heard this from a few dozen people by now when you don’t hit the expected SAN performance: set your queuedepth to a larger size.

So how do you set this queuedepth? Find out for which module you’ll need to set this option:

vmkload_mod -l | grep qla

Now set it to a depth of 64 for module qla2300_707

esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
esxcfg-boot -b

So now you’ve set the queue depth to 64 for your HBA cards, but why? Well, I hope the answer is: “because I monitored my system with esxtop and I noticed that the “QUED” value was high”.

So there’s your when. You’ll need to set this setting if you notice a high “QUED” value in esxtop. Take a look at the following example I borrowed from a great blog on this subject:

As you can see in the example, “ACTV” has a value of 32. Indeed, 32 active commands, because that’s the default queue depth for QLogic cards. And there are 31 outstanding commands waiting. In other words, if we bump up the queue depth to 64 then all the commands should be processed instead of being queued in the VMkernel.
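The arithmetic behind that is simple enough to sketch in a few lines of Python. This is a deliberately simplified model of my own, not how the VMkernel actually schedules I/O, and the function name is made up. It assumes the queue depth is the only limit in play and uses the 32 active and 31 waiting commands from the example.

```python
def split_commands(outstanding, queue_depth):
    """Split outstanding commands into (active, queued) for a given queue depth.

    Simplified model: commands up to the queue depth are active,
    everything beyond that waits in the queue.
    """
    active = min(outstanding, queue_depth)
    queued = outstanding - active
    return active, queued


outstanding = 32 + 31  # ACTV plus the waiting commands from the example

print(split_commands(outstanding, 32))  # (32, 31)
print(split_commands(outstanding, 64))  # (63, 0)
```

In this model, the 63 commands fit within a depth of 64, so nothing is left waiting.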

What will this result in?

HA best practices

Duncan Epping · Oct 27, 2008 ·

So I’ve been collecting some HA best practices lately. I just wanted to have them all in one place so I can use them myself for the VMTN forum and/or customers. The first two are obvious in my opinion but still often overlooked:

Your ESX host names should be in lowercase and use FQDNs
Provide Service Console redundancy
If you add an isolation validation address with “das.isolationaddress”, add an additional 5000 to “das.failuredetectiontime”
If your Service Console network is set up with “active / standby” redundancy then your “das.failuredetectiontime” needs to be set to 60000
If you ensured Service Console redundancy by adding a secondary Service Console then “das.failuredetectiontime” needs to be set to 20000 and you need to set up an additional “das.isolationaddress”
If you set up a secondary Service Console, use a different subnet and vSwitch than your primary has
If you don’t want to use your default gateway as an isolation validation address or can’t use it because it’s a non-pingable device then disable its use by setting “das.usedefaultisolationaddress” to false and add a pingable “das.isolationaddress”
Change default isolation response to “power off vm” and set restart priorities for your AD/DNS/VC/SQL servers
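The two redundancy rules in the list can be written down as a small lookup. The Python below is just a sketch to keep the values straight. The function name, the redundancy labels and the example IP address are made up, and it does not talk to VirtualCenter in any way. It only returns the advanced options you would then enter yourself.

```python
def ha_advanced_options(redundancy, isolation_address=None):
    """Return the HA advanced options from the list above for one setup.

    redundancy: "active_standby" or "secondary_service_console"
    (hypothetical labels used only in this sketch).
    isolation_address: a pingable IP, required for a secondary console.
    """
    if redundancy == "active_standby":
        return {"das.failuredetectiontime": 60000}
    if redundancy == "secondary_service_console":
        if isolation_address is None:
            raise ValueError("an additional das.isolationaddress is required")
        return {
            "das.failuredetectiontime": 20000,
            "das.isolationaddress": isolation_address,
        }
    raise ValueError(f"unknown redundancy type: {redundancy}")


print(ha_advanced_options("active_standby"))
# 192.0.2.1 is a documentation address, replace it with a pingable one.
print(ha_advanced_options("secondary_service_console", "192.0.2.1"))
```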

So if you’ve got more, add them into the comments and I will update the list!

das.failuredetectiontime for active/standby COS vswitch

Duncan Epping · Oct 22, 2008 ·

It used to be a best practice to increase “das.failuredetectiontime” to 30000 for an active/standby setup. This way, when a failover to another NIC occurs, one would have at least 30 seconds to switch over before HA starts shutting down VMs. The default value is 15000, by the way.

If it’s not really clear, I’m talking about a setup like this:

vSwitch0 – 2 Physical nics(vmnic0 & vmnic2) – 2 Portgroups (Service Console & VMkernel)
Service Console active on vmnic0 and standby on vmnic2
VMkernel active on vmnic2 and standby on vmnic0
Each portgroup has a VLAN assigned and runs dedicated on its own NIC. Only in the case of a fault is it switched over to the standby NIC, and it returns to the original NIC when the connection is up again.
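If you want to reason about which NIC carries which portgroup in this setup, a tiny Python model helps. It is only a sketch of the active/standby behaviour described above. The function names are made up, and it knows nothing about the real vSwitch teaming logic.

```python
def carrying_nic(active, standby, links_up):
    """Return the NIC that carries a portgroup, or None if both links are down."""
    if active in links_up:
        return active   # normal operation, and failback once the link returns
    if standby in links_up:
        return standby  # failover to the standby NIC
    return None


# Portgroup name -> (active NIC, standby NIC), as in the setup above.
portgroups = {
    "Service Console": ("vmnic0", "vmnic2"),
    "VMkernel": ("vmnic2", "vmnic0"),
}


def placement(links_up):
    """Map every portgroup to the NIC that currently carries it."""
    return {
        name: carrying_nic(active, standby, links_up)
        for name, (active, standby) in portgroups.items()
    }


print(placement({"vmnic0", "vmnic2"}))  # each portgroup on its own NIC
print(placement({"vmnic2"}))            # vmnic0 down: both share vmnic2
```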

I just noticed in the Resource Management Guide PDF that the best practice is to increase it to 60000. In other words, it can take up to 60 seconds before HA starts restarting machines. For a secondary service console you only need to increase the default by 5 seconds, to 20000, because an additional isolation address needs to be checked. In other words, a secondary service console saves you 40 seconds (20 instead of 60) when isolation occurs, which can be a lot in a 7×24 environment.

So like I blogged three months ago, going for a secondary service console is definitely the best option you have for service console redundancy today! Keep in mind though that your secondary service console needs to be in a different subnet than the primary!
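A quick way to double check the subnet requirement is Python’s standard `ipaddress` module. This is a minimal sketch. The function name and the addresses are made up, so fill in your own service console IPs and masks.

```python
import ipaddress


def in_different_subnets(primary, secondary):
    """Return True if the two interfaces (given as 'ip/prefix') do not share a subnet."""
    primary_net = ipaddress.ip_interface(primary).network
    secondary_net = ipaddress.ip_interface(secondary).network
    return not primary_net.overlaps(secondary_net)


# Hypothetical service console addresses.
print(in_different_subnets("192.168.1.10/24", "192.168.2.10/24"))  # True
print(in_different_subnets("192.168.1.10/24", "192.168.1.20/24"))  # False
```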

Bluebear Kodiak beta invites

Duncan Epping · Oct 21, 2008 ·

If anyone needs an invite, Bluebear gave me 50 again. So just drop me an email, or leave a comment with your email address, and I’ll make sure you’ll be able to beta test this new version!

I don’t have any invites left… Sorry, you guys will need to help each other out!