We’ve all seen those posts from people about worn-out SD/USB devices, or maybe even experienced it ourselves at some point. Most of you reading this probably also know there was an issue with 7.0 U2 which resulted in USB/SD devices wearing out a lot quicker. Those issues have been resolved with the latest patch for 7.0 U2. It has, however, resulted in a longer debate around whether SD/USB devices should still be used for booting ESXi, and it seems that the jury has reached a verdict.
On the 16th of September, VMware published a KB article which contains statements around the future of SD/USB devices. I can be short about it: if you are buying new hardware, make sure to have a proper persistent storage device; USB/SD is not the right choice going forward! Why? The volume of reads/writes to and from the OS-DATA partition continues to increase with every release, which means that lower-grade devices will simply wear out faster. I am not going to repeat word for word what is mentioned in the KB; I would just like to urge everyone to read the KB article and make sure to plan accordingly! Personally, I am a fan of M.2 flash devices for booting. They are not too expensive (for greenfield deployments), plus they can provide enterprise-grade persistent storage for all your ESXi-related data. Make sure to follow the requirements around endurance though!
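For anyone planning ahead, a quick way to check what a host currently boots from, and whether that device is USB/SD-based, is from the ESXi shell. This is just a minimal sketch; field names and volume labels can differ between releases, so verify against your own build:

# Show which device the host booted from and whether it is a USB/SD device
esxcli storage core device list | grep -iE "Display Name|Is USB|Is Boot Device"

# On 7.x, locate the OSDATA volume (VMFS-L) that holds the ESXi system data
esxcli storage filesystem list | grep -i OSDATA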
Nick van Ogtrop-de Ruiter says
Thanx for the heads-up, Duncan …!
In a previous response I had already indicated my preference to get off SD/USB storage a.s.a.p., but I’m wondering when will an ESXi upgrade allow for the subsequent removal of an SD/USB boot device? In other words, when will VMware facilitate an OS upgrade which will exclusively use a persistent device as a result?
Duncan Epping says
Don’t know to be honest.
tom says
It would be nice if you included the actual patch KB/documentation so people can find it more easily. 🙂 VMware does not make it easy to “Siri, find the SD/USB storage patch”.
olivierfeuillerat says
@tom : https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-vcenter-server-70u2c-release-notes.html
tom says
that’s a vCenter *Server* release notes item?? Not vSphere??
The problem, so far as I know, is in *vSphere*, not in vCenter.
Duncan Epping says
https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-esxi-70u2c-release-notes.html
Babak says
Thanks. But a USB/SD card is useful when we are using vSAN and install ESXi on the SD card, because we can then pass through the RAID controller and easily replace a failed disk in vSAN. If we don’t pass the controller through, it is hard to replace or change a disk. This is so bad. What should we do for vSAN?
Duncan Epping says
As mentioned in the post, M.2 devices could be an option. Some vendors, like Dell, also have add-on cards which come with persistent storage for booting, etc.
Babak says
Right now I am using this model: “HPE 32GB microSD Raid1 USB BOOT Drive”.
Is this not suitable?
Duncan Epping says
Please read the KB Babak.
Marco says
Hi Babak,
look at “Resolution” under https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00118986en_us.
There will be an exception for the 7.x releases with the “HPE 32GB microSD Raid1 USB BOOT Drive”, but you have to complete some steps as noted in the document.
Also look at the QuickSpecs for the “HPE 32GB microSD Raid1 USB BOOT Drive” (https://www.hpe.com/psnow/doc/c04123175?jumpid=in_lit-psnow-red):
“For any major release beyond VMware ESXi 7.x, VMware will require M.2 or another local persistent device as the standalone boot option.”
Jonathan Hope (@lillilblurkin) says
I’m going to be honest, Duncan, I don’t get this change. I spent a lot of time with VMware support after one of our brand-new hosts started locking up due to this bug, and it just doesn’t make sense. We have deployed vSAN with all of our VMware installs on dual RAID 1 mirrored SD cards, and we are running 6-year-old C220 M4s that have never failed or worn out. It seems like a decision was made by Engineering, with the intro of VMFS-L, to make some weird choices that force people to use actual SSDs, which are by far more expensive. From what I understood, they removed Storage IO to the boot device, and it sounded like they also introduced additional reads/writes where VMware Tools is constantly calling back to the central repo for each VM, overloading the boot device. It all seemed weird to me. Luckily the 7.0 U2c patch has resolved the issue on our Dell EMC hosts with dual SD cards, but I just don’t buy that they are quicker to fail. There shouldn’t be that much going to it. Again, we have had 7 hosts for going on 5-6 years now that to this day haven’t sustained a failure of the OS on an SD card.
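As an aside, if I recall correctly there is an advanced option that keeps the VMware Tools image in a ramdisk so those Tools reads stop hitting the SD/USB boot device. A minimal sketch, assuming the option is available on your patch level (verify against the relevant KB before relying on it):

# Assumed option name; availability depends on the ESXi 7.0 patch level, and a reboot is required
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1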
Jonathan Hope (@lillilblurkin) says
Now that I’m thinking about it, this is a huge issue for us as a client. We have a 10-node vSAN that is maxed out. We use every PCIe slot for cache drives and the entire drive bays for capacity. If we were to update these from 6.7 U3 to 7.0 U2, how would we use anything else? We don’t have a SAN, nor do we have the ability to add an additional drive. This just seems like a strange issue that Engineering caused in an effort to push the product forward, which is fine, but it’s not going to be sustainable for some of us, especially with vSAN deployments. Would be interested in your thoughts on this.
Babak says
I think this is a big issue for all admins, especially those who are using vSAN. As I understand it, the new layout that has been added in ESXi 7.x is ESX-OSData, and most of the data that resides on this partition is VMware Tools, log files, and core dumps. But what if we create a scratch partition on a SAN and redirect all of that to the scratch partition? I think that would solve the problem. Is that correct?
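Something like this is what I mean, as a rough sketch only (the datastore name and folder names are placeholders, and the scratch change requires a host reboot to take effect):

# Point the scratch location at a folder on a persistent/shared datastore (placeholder names)
esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation -s "/vmfs/volumes/datastore1/.locker-esx01"

# Redirect the syslog output to a persistent location as well, then reload syslog
esxcli system syslog config set --logdir="/vmfs/volumes/datastore1/logs-esx01"
esxcli system syslog reload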
Animus.Ac.orexis (@AnimusAcOrexis) says
It’s a big issue for vanilla clusters, especially when you have north of 1.2k viable M630 ESXi hosts, all booting from reliable SD, with no issues until we found that 7.0.2 decimated a cluster. Thankfully we caught the problem after only 64 of the hosts were updated. It prevents HA/DRS, and even getting a host into a state where you can migrate VMs to prevent downtime takes about 30 minutes. Once you can do that, you will be lucky if you don’t have to perform a clean install and then update to 2c, and that is assuming you don’t have to replace the SD cards that 7.0.2 destroyed.
M.2/SSD are not that expensive though, right?!?!
M.2 drives are not an option for M630s. (tap, tap)
That means purchasing SSDs or HDDs.
A small mixed-use 480GB SSD is about $540.64 (Dell’s website); we would need 2400 of those, so that would run us 1.3 mil before taxes and shipping.
Let’s assume we had that many equivalent R630s that could support a Dell BOSS controller card: using the pricing from the Dell site, a BOSS card with two M.2 drives in a RAID 1 config at 240GB is $553.23; that is $639,600 before shipping, handling, and taxes, which is about 1/2 the price.
But it’s not that expensive, right… Keep driving your customers to AWS and Azure.
Animus.Ac.orexis (@AnimusAcOrexis) says
That excludes the time and cost of installing the SSDs and installing/configuring ESXi, and let’s not forget it’s a good time to do firmware updates as well; don’t want to get called lazy.
Animus.Ac.orexis (@AnimusAcOrexis) says
Before I hear “you can still use SD cards”: we would still need to purchase all those SSDs to accommodate the recommendations from support, even after going to Update 2c.
Animus.Ac.orexis (@AnimusAcOrexis) says
Update 2c has not fixed a god damn thing; we are still experiencing loss of SD cards and HA issues, 22 days later! If you can, re-install back to 6.5 or 6.7. VMware has cost us thousands of dollars already; it’s time to cut bait and stop waiting for these clowns to provide an acceptable, stable solution. VMware has become a verb.
Duncan Epping says
No need for name calling. If you want the issue to be resolved, please file an SR, include all relevant logs. If the SR is not getting the traction you expect, please share the SR with me so I can escalate it for you.
Animus.Ac.orexis (@AnimusAcOrexis) says
Instead, go through all the customers’ SRs for this issue and escalate all of them. Mine will be one of many. VMware does not seem to understand the impact this destructive change has caused. I was being mild with that name, unlike what VMware has done to its customers. We have been patiently waiting since April for a fix, only to find out we are still having the exact same issue.
Duncan Epping says
If you are not able to share details then I can’t help you unfortunately. For any further feedback or complaints, I will refer you to your local VMware resource.
Jonathan Hope (@lillilblurkin) says
That’s aggressive. What issues are you seeing? We have applied the patch to a number of installs and can confirm the issue has been resolved on our end. I would take Duncan’s help if you could. They have always been great at escalating and getting the matter resolved. I agree it’s frustrating, but they are also clearly trying to help here.
Animus.Ac.orexis (@AnimusAcOrexis) says
We have also applied the patch to roughly 30 of 64 hosts, and it appeared that everything was trending well. Until today. Same issues as pre-patch.
Jonathan Hope (@lillilblurkin) says
Understood. It is possible that something unique to your environment is still happening. I get that it’s frustrating; I was equally pissed when things just started locking up and not responding. I would just state that Duncan has helped us in the past with issues like this. It may be worth getting the escalation and finding out what’s going on. I just spoke to the head of product support for the storage division today, and VMware is definitely discussing the impact this has had on their clients. I was delighted to be able to speak to them on this topic. I’m sorry you are having continued issues, but I hope you guys are able to resolve this. Have a great weekend.
Animus.Ac.orexis (@AnimusAcOrexis) says
The problem is the continued spread of bad information and VMware pushing a false narrative.
The only thing unique in our environment is the introduction of the destroyer 7.0.2 on extremely vanilla servers, which, while dated, are still on the compatibility list.
I did not provide a case number due to company policy about public posts.
The choice to not even look at the tickets is another issue but a typical one when someone hears news outside the bubble of a company narrative.
The ticket I do have open, in an ever-growing email thread, includes TAM/SAM support, meetings with additional management types, etc. They were informed of the discovery today.
We are now going to roll back to a stable version of ESXi, but because 7.x was so destructive and offered no rollback solution, we literally have the same amount of work to correct and stabilize.
I.e., clean install a stable older version, re-configure each host, and repeat for each one, all while keeping 1k VMs up and serving customers. The reason patching to U2c has taken so long is having to replace the SD cards that 7.x has caused to fail prematurely. In 5 years, we have maybe replaced 10 or 20 SD cards on servers that have seen 4 different ESXi versions, I think.
Now SD replacements happen on almost every server we are trying to fix. Keep in mind that the other 192 servers, identical to this deployment except for location, remain on 6.5: happy, healthy, productive, and stable, because we noticed the problem with 7.0.2 and stopped our upgrades where they were.
You have a great weekend; mine is already shot. Sad lol.
Jonathan Hope (@lillilblurkin) says
Appreciate you doing these posts and getting this information out there, Duncan. I have since been in contact with VMware Product Support and am discussing, as a partner, and more specifically a partner who adopted vSAN early on, some of the use cases and issues that we are now facing because of this decision, and I hope that maybe we can find a middle ground for some time without having to go crazy on spending. They mentioned this post when we discussed the issues.