I thought I wrote an article about this years ago, but it appears I wrote an article about doing maintenance mode with a 2-node configuration instead. As I’ve received some questions on this topic, I figured I would write a quick article that describes the concept of site maintenance. Note that in a future version of vSAN, we will have an option in the UI that helps with this, as described here.
First and foremost, you will need to validate whether all data is replicated across sites. In some cases, we see customers pinning data (VMs) to a single location without replication, and those VMs will be directly impacted if a whole site is placed in maintenance mode. Those VMs will need to be powered off, or, if they need to stay running, moved to the location that remains running. Do note, if you flip “Preferred / Secondary” and there are many VMs that are site local, this could lead to a huge amount of resync traffic. If those VMs need to stay running, you may also want to reconsider your decision not to replicate them!
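To make that first check concrete, here is a minimal sketch in Python. It does not talk to vCenter or vSAN; the `VM` record, its fields, and the `impacted_vms` helper are all hypothetical names I made up for illustration, and the inventory is sample data. It assumes you have already exported, per VM, which site holds the data and whether the storage policy replicates across sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical inventory record, not a vSphere API object."""
    name: str
    data_site: str    # site that holds the VM's data, e.g. "Preferred"
    replicated: bool  # True if the storage policy mirrors data across sites


def impacted_vms(vms, maintenance_site):
    """Return names of VMs that lose access to their data when the site goes down.

    A VM is impacted when it is not replicated across sites and its only
    copy of the data lives in the site that is about to enter maintenance.
    """
    return [
        vm.name
        for vm in vms
        if not vm.replicated and vm.data_site == maintenance_site
    ]


# Sample data, purely illustrative.
inventory = [
    VM("web-01", "Preferred", replicated=True),
    VM("batch-01", "Secondary", replicated=False),
    VM("db-01", "Preferred", replicated=False),
]

# VMs to power off (or re-home) before taking "Secondary" down.
needs_attention = impacted_vms(inventory, "Secondary")
```

Anything that ends up in `needs_attention` either gets powered off, or its data needs to move to the surviving site first, with the resync traffic that comes with it.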
These are the steps I would take when placing a site into maintenance mode:
- Verify the vSAN Witness is up and running and healthy (see health checks)
- Check compliance of VMs that are replicated
- Configure DRS to “Partially automated” or “Manual” instead of “Fully automated”
- Manually vMotion all VMs from Site X to Site Y
- Place each ESXi host in Site X into maintenance mode with the option “no data migration”
- Power Off all the ESXi hosts in Site X
- Set DRS back to “Fully automated” so that the environment stays balanced within Site Y
- Do whatever needs to be done in terms of maintenance
- Power On all the ESXi hosts in Site X
- Exit maintenance mode for each host
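If you want to turn the steps above into a checklist for a specific cluster, something like the following could work. This is only a sketch: it builds an ordered list of plain-text steps and does not call any vSphere or vSAN API. The function name `build_site_maintenance_plan` and the host names are assumptions for illustration.

```python
def build_site_maintenance_plan(site_x, site_y, hosts_in_x):
    """Return the ordered runbook for taking site_x down for maintenance.

    site_x is the site going into maintenance, site_y the site that keeps
    running, and hosts_in_x the ESXi hosts located in site_x.
    """
    plan = [
        "Verify the vSAN Witness is up, running, and healthy",
        "Check compliance of VMs that are replicated",
        "Set DRS to Partially automated or Manual",
        f"Manually vMotion all VMs from {site_x} to {site_y}",
    ]
    # Every host enters maintenance mode before any host is powered off.
    plan += [f"Enter maintenance mode on {h} (no data migration)" for h in hosts_in_x]
    plan += [f"Power off {h}" for h in hosts_in_x]
    plan.append(f"Set DRS back to Fully automated to keep {site_y} balanced")
    plan.append(f"Perform the maintenance in {site_x}")
    # Bring the site back: power on first, then exit maintenance mode.
    plan += [f"Power on {h}" for h in hosts_in_x]
    plan += [f"Exit maintenance mode on {h}" for h in hosts_in_x]
    return plan


# Hypothetical host names, purely illustrative.
plan = build_site_maintenance_plan("Site X", "Site Y", ["esxi-x1", "esxi-x2"])
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

Generating the list per site keeps the ordering fixed, so nobody powers off a host before the VMs have been moved and the host is in maintenance mode.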
Do note that VMs will not automatically migrate back until the resync for that given VM has fully completed. DRS and vSAN are aware of the replication state! Additionally, if VMs are actively doing IO while the hosts in Site X are going into maintenance mode one by one, the state of the data stored on those hosts will differ from host to host. This concern will be resolved in the future by providing a “site maintenance” feature, as discussed at the start of this article.