I was thinking about one of the most challenging aspects of DR procedures: IP changes. This is a very common problem. Although changing the IP address of a VM is usually straightforward, that doesn't mean the change is propagated to the application layer. Many applications use hardcoded IP addresses, and changing these is usually a huge challenge.
But what about using vShield Edge? If you look at how vShield Edge is used in a vCloud Director environment, mainly for NAT'ing and firewall functionality, you could use it in exactly the same way for your VMs in a DR-enabled environment. I know there are many apps out there which don't use hardcoded IP addresses and which are simple to re-IP. But for those that are not, why not just leverage vShield Edge… NAT the VMs, and when there is a DR event just swap out the NAT pool and update DNS. On the "inside" nothing will change… and the application will continue to work fine. On the outside things will change, but this is an "easy" fix with a lot less risk than re-IP'ing that whole multi-tier application.
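To make the idea concrete, here is a minimal sketch of that swap in Python. The VM names, addresses, and the `fail_over` helper are all invented for illustration; this is not a vShield Edge API, just the logic of the approach:

```python
# Hypothetical sketch: the internal addresses never change, so the application
# tier is untouched; a DR event only swaps the external NAT pool and the DNS
# records that point at it.

# Internal IPs stay fixed across sites; all addresses here are invented.
INTERNAL_IPS = {"web01": "10.0.0.11", "db01": "10.0.0.21"}

# One external NAT pool per site (RFC 5737 documentation ranges).
NAT_POOLS = {
    "primary": {"web01": "203.0.113.11", "db01": "203.0.113.21"},
    "dr":      {"web01": "198.51.100.11", "db01": "198.51.100.21"},
}

def fail_over(dns_zone, site):
    """Swap in the NAT pool for `site` and update DNS to match."""
    for vm, external_ip in NAT_POOLS[site].items():
        dns_zone[vm] = external_ip   # in reality: a dynamic DNS update or API call
    return dns_zone

zone = fail_over({}, "primary")     # normal operation
zone = fail_over(zone, "dr")        # DR event: only the outside view changes
```

The application keeps talking to the internal addresses throughout; only the outside world sees the new mappings once DNS is updated.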
I wonder how some of you out in the field do this today.
Interesting insight about vShield Edge; I haven't used this product yet.
Our DR site is at a different location about 300km away from prod. We have different subnets there, so we have to change IP/mask/GW/DNS settings on our VMs and coordinate DNS regens with our network team. We have a regularly updated plan for this, and one of the prerequisites was that all apps should reference FQDN hostnames (usually aliases for load balancers in the case of a cluster). Until now we have relied on HP Storage Mirroring for Virtual Infrastructure, which at least maps the source VM network to one of the target VM networks during creation of a VM protection. Curious to see how we can ease this with the upcoming migration to vSphere 5 and the possible use of Veeam BKP 6.
Mike Brown says
I had thought about this recently too. I just haven’t had time to test it out. I can’t think of any reasons why it wouldn’t work and would make recovery a much simpler process, just need some lab time to prove it. : )
Brad Clarke says
I don't think whether it will or won't work is what's going to stop it from happening. vShield Edge is one of the many recent products from VMware where I just don't understand where the target market is. Smaller shops are held back by the upfront costs and terrified by anything that's licensed per-VM. Larger companies can buy hardware and hire someone to do the job exactly how they want more cheaply than paying the licensing and support costs (or, more likely, they already have the people and hardware and just need to pay a little overtime to set it up). I don't fault VMware for trying to automate common tasks, but that tends to only have value at the bottom of the market, and VMware seems to price it for the top.
Scott Bethke says
vShield Edge will NAT the IPs, and that is fine as long as the application is OK with NAT.
I think a much better solution could be to design your applications with internal IP addressing that does not change when the application moves to another site. By separating public IPs needing external connectivity from internal IPs that are there as application glue, you make changes only to the "real" IPs with the confidence that it won't break your application. You could (should) even automate your external-facing IP addresses using a DHCP/DNS/IPAM appliance/solution so that changes are registered when the move occurs, without having to manually go in and adjust these settings.
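A toy sketch of that registration flow in Python. Every name, class, and address range here is invented for illustration; a real setup would use an IPAM product pushing RFC 2136 dynamic DNS updates:

```python
# Invented example: external-facing addresses are leased and registered by an
# IPAM/DDNS hook on a site move, while internal "glue" addresses stay fixed.

SITE_PUBLIC_RANGES = {"prod": "192.0.2.", "dr": "198.51.100."}

class Ipam:
    """Stand-in for an IPAM/DDNS appliance; not a real product's API."""

    def __init__(self):
        self.records = {}                                 # fqdn -> public IP
        self.next_host = {s: 10 for s in SITE_PUBLIC_RANGES}

    def register(self, fqdn, site):
        """Lease a public address at `site` and register it in DNS."""
        ip = SITE_PUBLIC_RANGES[site] + str(self.next_host[site])
        self.next_host[site] += 1
        self.records[fqdn] = ip   # real IPAM would push a dynamic DNS update here
        return ip

ipam = Ipam()
ipam.register("app.example.com", "prod")          # normal operation
moved = ipam.register("app.example.com", "dr")    # site move: re-registered automatically
```

The point is that nobody logs in to adjust records by hand; the move itself triggers the registration.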
Dan Barr says
Our current plan does not involve changing IPs on the recovery side, but rather to begin routing the affected subnets from the DR site rather than from the (presumably down) primary site. The only manual change needed is to the layer 3 switch in the DR site to add the gateway IPs to the recovery VLANs and alter the BGP prefix-list to advertise out to the rest of the company via the MPLS network. For public-facing services, we maintain a NAT and access-list script to run into the DR site’s firewall should the time come. Since most everything public is run thru a reverse proxy server (which is SRM protected), the firewall set is actually pretty small and infrequently changed. Eventually I’ll probably just put a permanent reverse proxy into the DR site and keep it in sync with the primary so even the firewall changes wouldn’t be needed in a recovery.
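For illustration, the recovery-time switch changes Dan describes might look roughly like this in IOS-style syntax. The VLAN IDs, subnets, AS number, neighbor address, and names are all invented, not taken from his environment:

```
! Add the gateway IP for a recovered subnet on the DR-site L3 switch
interface Vlan210
 description recovered-app-subnet
 ip address 10.20.10.1 255.255.255.0
 no shutdown
!
! Start advertising the recovered prefix out to the MPLS network via BGP
ip prefix-list DR-ADVERTISE permit 10.20.10.0/24
router bgp 65010
 address-family ipv4
  neighbor 10.255.0.1 prefix-list DR-ADVERTISE out
```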
Trever Jackson says
In our environment we try to lessen the need to re-IP by deploying apps like Exchange 2010 with DAGs, so you already have the application running in the alternate site and don't require a failover. For the other VMs, especially VMs in the DMZ, using DNS is a must. We are unfortunate to have a lot of apps that aren't datacenter-ready and scale by adding 'islands' of themselves. We already use F5's GTM technology to direct traffic to web servers at both locations. In case of DR we only fail over database VMs, so the idea of using vShield Edge to NAT the database servers during a DR event is something I would definitely like to look into to make things even more streamlined.
Frank James says
F5 presented basically this solution at VMworld last year, using their route domains feature. F5 has the advantage of very smart application proxies for apps that don't like to be NATed.
Either way, with Edge or hardware, it's certainly more elegant and more likely to work than trying to stretch a VLAN between sites.
Attila Bognár says
Think also about IPv6: NAT is not (yet?) included in every implementation. If you don't have NAT today, maybe you shouldn't implement NAT just for this.
The other thing to consider is that the network design/topology would change in case of DR, which is far from optimal, as this can lead to untested use cases (or you have to always test your applications for both normal and DR operation).
Plus you have a new component in the infrastructure with which you may not be familiar (another risk).
I agree with Dan Barr's concept: design the network to support the DR scenario nearly transparently. There will be more than enough to do in case of DR to fulfil the SLA/RTO without also dealing with unexpected application problems due to networking changes.
That is, use NAT or dynamic routing for both normal and DR operation. By reconfiguring your L3 routing devices you won't need any IP changes.
The solution also depends on the network topology, DR concept, number of external services, and of course the project timeframe and possibilities.
Trever Jackson says
The L3 change concept sounds great for DR, but it complicates matters somewhat after the event has ended. Say you rebuilt your site and wanted to use the same IP scheme as before. You would need to reverse replication and take at least a small service outage to move some VMs back to the original site.
Dan, what are you using for a reverse proxy?
Attila Bognár says
This is a design question. If you have such problems/concerns, you can even do the IP change during the failback, when you may have more time for (pre)testing, and/or do the migration in several phases so you don't hit all the possible issues at once.
(You asked Dan, but one option for a reverse proxy is Apache httpd.)
Fons Biemans says
We have just revised our DR plan. In consultation with our network admins, we are going for DHCP with the option of reservations.
We have hardly any applications left which require a fixed IP. Some of them we run active/active, so we don't require an IP change. For the ones left over we have procedures in place in case of a failover, which are tested every year.
The main driver behind DHCP is that we can't change our DNS (appliance), but the DHCP server can. And one of the advantages is that in a failover we don't have to think about changing the IP.
(DHCP is also an appliance.)
We have migrated VMs from being in /24 subnets to their own /30s. We can then move VMs between two data centres (over 100km apart) easily, just by adding the /30 to the local layer 3 switch; it then gets announced via OSPF into our MPLS cloud and is visible from other sites.
This gives us lots of flexibility for moving services about. Currently the first IP of the /30 is on the L3 switch (gateway) and the second on the VM; we are trialling running the gateway for each /30 on a Vyatta VM instead, so only that speaks OSPF with the local Cisco L3 switch. Production staff can then move VMs about and just add/remove /30s from the Vyatta VMs at either site.
We support lots of different business apps, and many of them and their suppliers don't support NAT or reverse proxying. By allocating /30s we can shift load between our two diverse clusters easily; vReplicator keeps the storage in sync.
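The per-VM /30 scheme described above can be sketched with Python's `ipaddress` module; the supernet and VM names are invented for illustration:

```python
# Carve per-VM /30s out of a supernet: first usable IP goes on the L3 switch
# (gateway), second on the VM itself. Addresses never change on a site move;
# only which switch announces the /30 (via OSPF) changes.
import ipaddress

supernet = ipaddress.ip_network("10.50.0.0/24")
subnets = supernet.subnets(new_prefix=30)     # yields 64 /30s from a /24

allocations = {}
for vm in ["web01", "db01", "app01"]:
    net = next(subnets)
    hosts = list(net.hosts())                 # a /30 has exactly two usable hosts
    allocations[vm] = {
        "network": str(net),
        "gateway": str(hosts[0]),             # first usable IP: the L3 switch
        "vm_ip":   str(hosts[1]),             # second usable IP: the VM
    }
```

With 500 VMs this is 500 routes in the IGP, which is where the automation (and the per-VRF OSPF instances mentioned below) earns its keep.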
Trever J says
Wow, isn't that a huge amount of overhead for you to manage, a /30 for each VM? I have 500 VMs!
Using a hardcoded IP in an application is just bad application development. Never use an IP or the hostname; always use aliases (CNAMEs). DNS is your friend.
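For example, a zone might contain something like this (names and addresses invented), so a DR re-IP touches a single A record while every application keeps referencing the alias:

```
; applications are configured with the alias, never the A record or a raw IP
app-db.example.com.   IN CNAME  db01.example.com.
db01.example.com.     IN A      10.0.0.21   ; only this record changes in a DR event
```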
Yep, it's a bit of overhead, but decent automation can make it super simple. The big benefit is we can shift a VM 100km and it retains its IP within a minute. Our VMs are scattered between different VRFs too, so we have six security zones with OSPF instances in each of those.
I'm disappointed VMware hasn't included any support for /32s on VMs like you can do easily with Xen/KVM using tun/tap devices.
You can use stretched VLANs, so that you don't need to change any IP when you perform DR.
We have done this 5 times in the last two years.
Emaneul Ferguson says
How do you do stretched VLANs?
Iwan 'e1' Rahabok says
Based on multiple discussions with customers (mostly global banks), I created a document that gives details on how to achieve that elusive "single-click DR" (a term used by a regional bank with 3500 VMs).
It can be found at http://communities.vmware.com/docs/DOC-19992
Hope it’s useful. Feel free to use it.
Bala Raju says
It's something your network team can help you with: creating stretched VLANs/mobile VLANs. All they do is create the same VLANs in the production site and the DR site (the VLAN ID should be the same in both locations).
Bala Raju says
Also, I would suggest adding the DR site DNS IP to all production servers when you provision them, so that you don't have to worry about name resolution at the DR site.