I will see you at VMworld 2015!

Just got the word that I am going to be part of three sessions at VMworld this year, and still waiting on news about the quick talks I submitted. So if you are going to VMworld 2015 make sure to note down these session IDs:

  • SDDC5027 – VCDX Unwrapped – Everything You Wanted to Know About VCDX (US only)
    The VMware Certified Design Expert (VCDX) program is growing every year. More and more people are interested in what it takes to become a VCDX. This moderated talk-show style panel session, made up of a VCDX from each of the four tracks (DCV, NV, DTM, CMA), will help potential candidates understand the value of getting their VCDX. It will also be a no-holds-barred open discussion on what it takes to achieve this premier VMware certification. Hear from these experts on their journey and the incredible value that comes with becoming a VCDX yourself. This session will also feature live Q&A from the audience, so come prepared with your own questions! This is the place to find out everything you ever wanted to know about becoming a VCDX, from live VCDX holders in a lively interactive session. Featuring: Chris Colotti, Simon Long, Jason Nash and Matt Vandenbeld.
  • SDDC4593 – Ask the Expert vBloggers (US only)
    Back on stage with Rick Scherer, Chad Sakac, Scott Lowe and, for me the first time, with Chris Wahl! 8th year at VMworld, awesome panel of the industry’s top bloggers. In this session there are no PowerPoints, no sales pitches and no rules! Experts in the industry are here to answer the audience’s questions while having some fun in the process. Bring your topic, anything from Software-Defined Data Center, End-User Computing to Hybrid Cloud… Storage, Networking, Security. No questions are off limits.
  • INF4535 – 5 Functions of Software Defined Availability (US and EMEA)
    Together with my friend Frank Denneman… It has been a long time since I’ve been up on stage with Frank, and this VMworld we will be looking at Software Defined Availability. We will discuss 5 functions of Software Defined Availability, which are part of vSphere 6.0. For each of these functions certain scenarios will be discussed to explain how vSphere can help improve the availability of your workloads. This ranges from “how Site Recovery Manager and Storage DRS are loosely coupled but tightly integrated” with vSphere 6.0 to “how vSphere HA responds in the case of a certain failure”. Be prepared to get into the trenches of workload availability.

Synchronet leverages Virtual SAN to provide scale, agility and reduced costs to their customers

This week I had the pleasure of talking to John Nicholson, who works for one of our partners (Synchronet out of Houston). John has been involved with various Virtual SAN implementations and designs and I felt that it would make for an interesting conversation. John is, in my opinion, a true datacenter architect: he has a good understanding of all aspects and definitely has a lot of experience with different storage platforms (both traditional and hyper-converged). Something I did not mention during our conversation: the answers John gave to some of my questions were most definitely VCDX-level. (If you can find the time, do it John :-)) Below is John’s bio, make sure to follow him on Twitter:

John Nicholson vExpert (2013-2015) is the manager of client services for Synchronet.  He oversees the professional services who deploy cutting edge virtualization, VDI, and storage solutions for customers as well as the managed services who keep these environments running smoothly.  He enjoys a deep dive into the syslog, and can telepathically sense slow and undersized storage.

The first customer project we discussed was a Virtual SAN environment for a construction company. The environment was built on top of Dell R720s, each node having 400GB of flash capacity and 7x 1.2TB 10K RPM drives. In this environment MS SQL and Exchange are running on top of Virtual SAN. The SQL database is used for ~1000 customers as part of a real-time bidding and tracking solution. As you can imagine, reliability and predictable performance are key in this environment. Also hosted on Virtual SAN are their ERP system and the development environment for their end-customer applications.

What was interesting with this particular project is that there were some strange performance anomalies. As you can imagine, Virtual SAN, being a new product, was a suspect, but after troubleshooting the environment they found out that there was a driver/firmware mismatch for the 10GbE Intel NICs they were using. Further investigation revealed that all types of traffic were impacted. John wrote about it on their corporate blog here, worth reading if you are using the Intel X540 10GbE NICs.

Key take away: Always verify driver / firmware combination and compatibility as it can have an impact.

What pleased John and the customer the most is probably the performance Virtual SAN is providing, especially when it comes to latency, or should I say the lack of latency, as they are hitting sub-millisecond numbers. They’ve been so happy running Virtual SAN in their environment that they’ve just purchased new hosts, and a DR site with VSAN is being implemented this week. The DR site will be used at first to test VSAN 6.0; when it has proven stable and reliable, the production environment will be upgraded to 6.0 and the DR site will be configured for DR purposes leveraging vSphere Replication. I asked John how they went about advising the customer to leverage a virtual replication technology which is asynchronous. John mentioned that as part of their advisory/consultancy services they have business analysts on board who assess what the cost of downtime is, map out the cost of availability and decide on a solution based on that outcome. The same applies to de-duplication by the way: what is the price of disk, what is the dedupe ratio, does it make sense in your environment?

While discussing this project John mentioned that he has worked with customers in the past who had two or three IT folks, one of whom was a dedicated storage admin, primarily because of the complexity of the environment and the storage system. In today’s world, with solutions like Virtual SAN, that isn’t needed any longer and the focus of IT people should be on enabling the business.

During our discussion about networking John mentioned that Synchronet has a long history with IP-based storage solutions (primarily iSCSI), and based on their experience top-grade switches were an absolute must when deploying these types of storage. While talking to some of the Virtual SAN engineers John asked how Virtual SAN would handle switches which have a lower “PPS” (packets per second). The Virtual SAN team mentioned that VSAN was less prone to the common issues faced in iSCSI/NFS environments. John, being the techie that he is, was of course skeptical and wanted to test this for himself. The results were published in this white paper; it is fair to say that John and his team were impressed with how Virtual SAN handled itself, even with relatively cheap switches. For the majority of Virtual SAN deployments their typical customer setup leverages 2 VMkernel interfaces, each connected to a different switch so that traffic isn’t going outside of the switch. This is what it would look like for those interested:

Host 1 / NIC-A —> Switch-A
Host 2 / NIC-A —> Switch-A
Host 3 / NIC-A —> Switch-A
Host 1 / NIC-B —> Switch-B
Host 2 / NIC-B —> Switch-B
Host 3 / NIC-B —> Switch-B
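
As a rough illustration of the layout above, the following Python sketch models each host's two uplinks and checks two things: that all same-named NICs land on a single switch, and that every host reaches two different switches. The data structure and the function name are made up for this example; this is not output from, or input to, any VMware tool.

```python
# Hypothetical model of the cabling listed above: (host, nic) -> switch.
uplinks = {
    ("Host 1", "NIC-A"): "Switch-A",
    ("Host 2", "NIC-A"): "Switch-A",
    ("Host 3", "NIC-A"): "Switch-A",
    ("Host 1", "NIC-B"): "Switch-B",
    ("Host 2", "NIC-B"): "Switch-B",
    ("Host 3", "NIC-B"): "Switch-B",
}


def check_layout(uplinks):
    """Return a list of problems; an empty list means the layout matches the pattern."""
    problems = []
    switches_per_nic = {}
    switches_per_host = {}
    for (host, nic), switch in uplinks.items():
        switches_per_nic.setdefault(nic, set()).add(switch)
        switches_per_host.setdefault(host, set()).add(switch)
    # Same-named NICs should share one switch so their traffic stays inside it.
    for nic, switches in sorted(switches_per_nic.items()):
        if len(switches) != 1:
            problems.append(f"{nic} is spread over {sorted(switches)}")
    # Every host should reach two different switches for redundancy.
    for host, switches in sorted(switches_per_host.items()):
        if len(switches) < 2:
            problems.append(f"{host} only reaches {sorted(switches)}")
    return problems


print(check_layout(uplinks))
```

For the six uplinks listed above the check returns an empty list. Moving one NIC-A cable to Switch-B would be reported twice: once because NIC-A is then spread over two switches, and once because that host only reaches a single switch.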

The second project John mentioned was for a software startup in the healthcare space. They’ve been doing a lot of mergers and acquisitions. Initially they wanted to get 6 VMs up and running, but with the ability to scale up and scale out when needed. I found the “scale-up” comment interesting, so I asked what John was referring to. In this scenario the server configuration used was SuperMicro Fat Twin, initially deployed with 3 hosts, each using a single socket, 800GB of flash capacity and two NL-SAS drives. As the company started growing, the number of virtual machines of course increased and they have over 70 VMs running currently, simply achieved by adding an additional CPU and some drives to each box. The question I had was: what about flash capacity compared to disk capacity? John said that they started out with flash overprovisioned, simply to allow them to scale up when required. Especially in the merger and acquisition space, where the growth pattern is unknown, this is a huge advantage of a solution like Virtual SAN, which allows you to both scale out and scale up when required. Compared to traditional storage systems this model worked very well, as they avoided the huge upfront cost (50K USD – 100K USD is not uncommon). On top of that, John said that with the majority of storage systems a big discount is given during the initial purchase, but when it is time to add a disk shelf that discount has magically disappeared. Also, with traditional storage systems you can fairly easily reach the limits of a storage controller and be stuck with a system which can’t scale to the size you need. Another problem that disappears when leveraging VSAN.

Key take away: Large upfront costs can be avoided while offering flexibility in terms of scaling and sizing

Synchronet isn’t just an implementation partner by the way, they also do managed services, and one of the things they are doing for instance is monitoring customer environments leveraging Log Insight. This includes monitoring Virtual SAN, and they’ve created custom dashboards so that they can respond to issues, for instance when a snapshot removal has failed, and solve the problem before a bigger issue arises as a result of it. They can go as far as monitoring the raw syslog feeds if needed, and each time a problem occurs in any environment it is recorded and custom dashboards and warnings are created so that every customer immediately benefits from it. For some customers they even do full management of the vSphere environment.

We had some small talk about VDI. John mentioned that VSAN is great for PoCs and small test environments because it is easy to get in there, use it and then grow it as soon as the PoC / test has completed. Especially the price-per-desktop licensing is really handy, as it keeps the cost down initially and at the same time the customer knows what it is paying and getting. From an architectural point of view John mentioned that the majority of their customers use non-persistent desktops, and as such the Virtual SAN environment looks different than the traditional server VM environments: typically less disk capacity and higher flash capacity to ensure performance.

Before we wrapped up there was one thing I was interested in knowing: whether they tweaked any of the Virtual SAN related settings (within the storage policy or, for instance, advanced settings). John mentioned that by default they would tweak the number of stripes per VM from 1 to 3. This is primarily to speed up backups with Virtual SAN 5.5; preliminary tests are showing though that with Virtual SAN 6.0 and the new snapshotting mechanism this isn’t needed any longer. While talking about striping John also mentioned that for their hosting services the one thing that stood out to him is that Virtual SAN was performing so well that the customers paying for a lower tier of storage were actually getting a lot more storage performance resources than they paid for. Storage policies were used to ensure that a tier 2 VM wouldn’t receive more resources than a tier 1 VM. Pretty neat problem to have I guess.

Key take away: Increasing stripe-width with Virtual SAN 5.5 can have a positive impact on performance. With 6.0 this appears no longer needed.

The last thing John wanted to mention was the VIP Tool (https://vip.vmware.com/). He said it helped them immensely in figuring out how much data was active and in designing / sizing Virtual SAN environments for customers. I think it is fair to say that John (and Synchronet) has had huge success introducing Virtual SAN to their customers and deploying it where applicable. Thanks John for taking the time, and thanks for being a great VMware and Virtual SAN advocate!

VSAN and large VMDKs on relatively small disks?

I received the same question last week and again this week, and as it was the second time in a short period I figured I would share it. The question was around how VSAN places a VMDK which is larger than the disks. Let’s look at a diagram first, as that will make it obvious instantly.

If you look at the diagram you see these stripes. You can define the number of stripes in a policy if you want. In the example above, the stripe width is 2. This is not the only time when you can see objects being striped though. If an object (a VMDK for instance) is larger than 256GB, VSAN will create multiple stripes for this object. Also, if a physical disk is smaller than the size of the VMDK, VSAN will create multiple stripes for that VMDK. These stripes can be located on the same host, as you can see in the diagram, but can also be spread across hosts. Pretty cool, right?
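
To make the splitting rule a bit more concrete, here is a small back-of-the-envelope sketch in Python. It only models the conditions described above (the policy stripe width, the 256GB figure and a disk that is smaller than the VMDK). Real placement also depends on free space, replicas and witnesses, so the function and its parameters are illustrative assumptions rather than how VSAN actually decides.

```python
import math


def min_stripes(vmdk_gb, smallest_disk_gb, policy_stripe_width=1, max_component_gb=256):
    """Estimate the minimum number of stripes for one replica of a VMDK.

    max_component_gb defaults to the 256GB figure quoted in the text;
    everything else here is a simplification for illustration.
    """
    by_component_limit = math.ceil(vmdk_gb / max_component_gb)
    by_disk_size = math.ceil(vmdk_gb / smallest_disk_gb)
    return max(policy_stripe_width, by_component_limit, by_disk_size)


print(min_stripes(100, 600))                          # small VMDK, large disks
print(min_stripes(600, 600))                          # VMDK above the 256GB figure
print(min_stripes(500, 200, policy_stripe_width=2))   # disks smaller than the VMDK
```

Under these assumptions the three calls print 1, 3 and 3: the first VMDK fits in a single stripe, the second is split because of its size, and the third is split because no single 200GB disk can hold it, even though the policy only asks for a stripe width of 2.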

How Virtual SAN enables IndonesianCloud to remain competitive!

Last week I had the chance to catch up with one of our Virtual SAN customers. I connected to Neil Cresswell through twitter and after going back and forth we got on a conference call. Neil showed me what they had created for the company he works for, a public cloud provider called IndonesianCloud. No need to tell you where they are located as the name kind of reveals it. Neil is the CEO of IndonesianCloud by the way, and very very passionate about IT / Technology and VMware. It was great talking to him, and before I forget I want to say thanks for taking time out of your busy schedule Neil, I very much appreciate it!

IndonesianCloud is a 3-year-old cloud service provider, part of the vCloud Air Network, which focuses on the delivery of enterprise-class hosting services to their customers. Their customers primarily run mission-critical workloads in IndonesianCloud’s three DC environment, which means that stability, reliability and predictability are really important.

Having operated a “traditional” environment (Servers + Legacy Storage) for a long time, Neil and his team felt it was time for a change. They needed something which was much more fit for purpose, was robust / reliable and was capable of providing capacity as well as great performance. On top of that, from a cost perspective it needed to be significantly cheaper. The traditional environment they were maintaining just wasn’t allowing them to remain competitive in their dynamic and price-sensitive market. Several different hyperconverged and software-based offerings were considered, but finally they settled on Virtual SAN.

Since the Virtual SAN platform was placed into production two months ago, they have deployed over 450 new virtual machines onto their initial 12 node cluster. In addition, migration of another 600 virtual machines from one of their legacy storage platforms to their Virtual SAN environment is underway. While talking to Neil I was mostly interested in some of the design considerations, some of the benefits but also potential challenges.

From a design stance Neil explained how they decided to go with SuperMicro Fat Twin hardware, 5 x NL-SAS drives (4TB) and Intel S3700 SSDs (800GB) per host. Unfortunately no affordable bigger SSDs were available, and as such the environment has a lower cache to capacity ratio than preferred. Still, when looking at the cache hit rate for reads it is more or less steady around 98-99%. PCIe flash was also looked at, but didn’t fit within the budget. These SuperMicro systems were on the VSAN Ready Node list, and this was one of the main reasons for Neil and the team to pick them. Having a pre-validated configuration, which is guaranteed to be supported by all parties, was seen as a much lower risk than building their own nodes. Then there is the network; IndonesianCloud decided to go with HP networking gear after having tested various products. The reasons for this were the better overall throughput, better multicast performance, and lower price per port. The network is 10GbE end to end of course.

Key take away: There can be a substantial performance difference between the various 10GbE switches, do your homework!
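
To put a number on the “cache to capacity ratio” from the host configuration described above, here is a quick calculation in Python. It is a simplification: it compares flash against raw disk capacity per host (using decimal units, 1TB = 1000GB) and ignores how much of that capacity is actually consumed. The function name is made up for this example.

```python
def cache_to_capacity_ratio(flash_gb, drive_count, drive_tb):
    """Flash capacity as a fraction of raw disk capacity for a single host."""
    capacity_gb = drive_count * drive_tb * 1000  # decimal TB -> GB
    return flash_gb / capacity_gb


# IndonesianCloud host: 1 x 800GB SSD, 5 x 4TB NL-SAS
indonesiancloud = cache_to_capacity_ratio(800, 5, 4)
print(f"{indonesiancloud:.1%}")
```

For this host the result is 4.0% of raw capacity, which is the figure Neil would have liked to be higher if larger SSDs had been affordable.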

The choice to deploy 4TB NL-SAS drives was a little risky; IndonesianCloud needed to balance the performance, capacity, and price ratios. Luckily, having already run their existing cloud platform for 3 years, there was a history of IO information readily available. Using this GB/IOPS historical information meant that IndonesianCloud were able to make a calculated decision that 4TB drives with 800GB SSD would provide the perfect combination of performance and capacity. With very good cache hit rates, Neil would like to deploy larger SSD drives when they become available, as he believes that cache is a great way to minimise the impact of the slower drives. Equally, the write performance of the 4TB drives was also concerning. Using the default VSAN stripe size configuration of 1 meant that, at most, only 2 drives were able to service write de-stage requests for a given VM, and due to the slow speed of the 4TB drives this could have an impact on performance. To mitigate this, IndonesianCloud performed a series of internal tests that baselined different stripe sizes to get a good balance of performance. In the end a stripe size of 5 was selected, and it is now being used for all workloads. This also helps in situations where reads are coming from disk by the way, a great side effect. BTW, the best way to think about Stripe Size and Failures to Tolerate is like RAID 1E (mirrored stripes).

Key take away: Write performance of large NL-SAS drives is low, striping can help improve performance.
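
The “only 2 drives” remark and the mirrored-stripes analogy can be sketched with a tiny calculation. This assumes failures to tolerate is set to 1 (which is what the 2-drive figure implies, but it is not stated explicitly) and that every stripe of every replica lands on a different drive, which is the best case. The function name is hypothetical.

```python
def drives_servicing_writes(stripe_width, failures_to_tolerate=1):
    """Upper bound on the drives that can de-stage writes for one object.

    Each replica is striped across `stripe_width` drives, and there are
    failures_to_tolerate + 1 replicas (mirrored stripes).
    """
    replicas = failures_to_tolerate + 1
    return stripe_width * replicas


print(drives_servicing_writes(1))  # default stripe size of 1
print(drives_servicing_writes(5))  # the stripe size IndonesianCloud selected
```

Under these assumptions the default gives 2 drives and a stripe size of 5 gives up to 10, which is why striping helps when the individual NL-SAS drives are slow.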

IndonesianCloud has standardised on a 12 node Virtual SAN cluster, and I asked why, given that Virtual SAN 5.5 U1 supports up to 32 nodes (64 with 6.0 even). Neil’s response was that 12 nodes is what comprises an internal “zone”, and that customers can balance their workloads across zones to provide higher levels of availability. Having all nodes in a single cluster, whilst possible, was not considered the best fit for a service provider that is all about containing risk. 12 nodes also maps to approximately 1000 VMs, which is what they have modelled the financial costs against, so 1000 VMs deployed on the 12 node cluster would consume CPU/Memory/Disk at the same ratio, effectively ensuring maximum utilisation of the asset.

If you look at the workloads IndonesianCloud customers run, they range from large databases, time-sensitive ERP systems and webservers to streaming TV CDN services, and they are even running airline ERP operations for a local carrier… All of these VMs are from external paying customers by the way, and all of them are mission critical for those customers. On top of Virtual SAN some customers even have other storage services running. One of them for instance is running SoftNAS on top of Virtual SAN to offer shared file services to other VMs. Vast ranges of different applications, with different IO profiles and different needs, but all satisfied by Virtual SAN. One thing that Neil stressed was that the ability to change the characteristics (failures to tolerate) specified in a profile was key for them, as it allows for a lot of flexibility / agility.

I did wonder, with VSAN being relatively new to the market, if they had concerns in terms of stability and recoverability. Neil actually showed me their comprehensive UAT Testing Plan and the results. They were very impressed by how VSAN handled these tests without any problem. Tests ranged from pulling drives and failing network interfaces and switches through to removing full nodes from the cluster, and all of these were performed whilst simultaneously running various burn-in benchmarks. No problems whatsoever were experienced, and as a matter of fact the environment has been running great in production (don’t curse it!!).

Key take away: Testing, Testing, Testing… Until you feel comfortable with what you designed and implemented!

When it comes to monitoring though, the team did want to see more details than what is provided out of the box. Especially because it is a new platform, they felt that this gave them a bit more assurance that things were indeed going well and it wasn’t just their perception. They worked with one of VMware’s rock stars when it comes to VR Ops (Iwan Rahabok) on creating custom dashboards with all sorts of data, ranging from cache hit ratio to latency per spindle to ANY type of detail you want on a per VM level. Of course they start with generic dashboards which then allow you to drill down; any outlier is noted immediately, and leveraging VR Ops and these custom dashboards they can drill deep whenever they need. What I loved most is how relatively easy it is for them to extend their monitoring capabilities. During our WebEx Iwan felt he needed some more specifics on a per VM basis and added these details literally within minutes to VR Ops. IndonesianCloud has been kind enough to share a custom dashboard they created, where they can catch a rogue VM easily. In this dashboard, when a single VM, and it can be any VM, generates excessive IOPS it will trigger a spike right away in the overall dashboard.
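
The idea behind catching a rogue VM can be sketched in a few lines of Python. This is not how the VR Ops dashboard is built; it is a minimal illustration that assumes you already have a per-VM IOPS sample from somewhere, and it flags any VM whose IOPS are far above the median. The VM names, the numbers and the factor of 10 are all made up for this example.

```python
from statistics import median


def rogue_vms(iops_by_vm, factor=10):
    """Flag VMs whose IOPS exceed `factor` times the median across all VMs."""
    if not iops_by_vm:
        return []
    baseline = median(iops_by_vm.values())
    return sorted(vm for vm, iops in iops_by_vm.items() if iops > factor * baseline)


# Hypothetical sample: one VM is generating far more IO than the rest.
sample = {"web-01": 120, "web-02": 150, "db-01": 900, "erp-01": 400, "batch-07": 25000}
print(rogue_vms(sample))
```

With this sample the median is 400 IOPS, so only batch-07 crosses the threshold of 4000 and is flagged. Comparing against the median rather than the average keeps a single extreme VM from dragging the baseline up with it.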

I know I am heavily biased, but I was impressed. Not just with Virtual SAN, but even more so with how IndonesianCloud has implemented it. How it is changing the way IndonesianCloud manages their virtual estate and how it enables them to compete in today’s global market.

Rubrik follow up, GA and funding announcement

Two months ago I published an introduction post on Rubrik. Yesterday Rubrik announced that their platform went GA and they announced a funding round (series B) of 41 million dollars led by Greylock. I want to congratulate Rubrik on this new milestone; it is a major achievement and I am sure we will hear much more from them in the months to come. For those who don’t recall, here is what Rubrik is all about:

Rubrik is building a hyperconverged backup solution and it will scale from 3 to 1000s of nodes. Note that this solution will be up and running in 15 minutes and includes the option to age out data to the public cloud. What impressed me most is that Rubrik can discover your datacenter without any agents, it scales out in a fully automated fashion and will be capable of deduplicating / compressing data, but also offers the ability to mount data instantly. All of this through a slick UI, or you can leverage the REST APIs, fully programmable end-to-end.

When I published the article some people commented that you can do the above with various other solutions, and people asked why I was so excited about their solution. Well, first of all because you can do all of that from a single platform and don’t need a backup solution plus a storage solution, with multiple pieces to manage and without scale-out capabilities. I like the model, the combination of what is being offered, the fact that it is a single package designed for this purpose and not glued together… But of course there is more, I just couldn’t talk about it yet. I am not gonna go into an extreme amount of detail, as Cormac wrote an excellent piece here and there is this great blog from Chris, who is a user of the product, which explains the value of the solution. (Always nice to see, by the way, that people read your article and share their experience as well in return…)

I do want to touch on a couple of things which I feel set Rubrik apart. (And there may be others who do this / offer this, but I haven’t been briefed by them.)

  • Global search across all data
    • “Google-alike” search, which means you start typing the name of a file in the UI of any VM and while typing the UI already presents a list of potential files you are looking for. Then when it shows the right file you click it and it presents a list of options. The file with this name could of course be on one or many VMs, you can pick which one you want and select from which point in time. When I was an admin I was often challenged with this problem “I deleted a file, I know the name… but no clue where I stored it, can you recover it?”. Well that is no problem any longer with global search, just type the name and restore it.
  • True Scale Out
    • I’d already highlighted this, but I agree with Scott Lowe that there is “scale-out” and there is “Scale-Out”. In the case of Rubrik we are talking scale out with capital S and capital O. Not just from a capacity stance, but also when it comes to (as Scott points out) task management and the ability to run any task anywhere in the cluster. So with each node you add you aren’t just scaling capacity, but also performance on all fronts. No single choking point with Rubrik as far as I can tell.
  • Miscellaneous, stuff that people take for granted… but does matter
    • API-Driven – Not something you would expect I would get excited about. And it seems such an obvious thing, but Rubrik’s solution can be configured and managed through the API they expose. Note that every single thing you see in the UI can be done through the API, the UI is simply an API client.
    • Well performing instant mount through the use of flash and serving the cluster up as a scale-out NFS solution to any vSphere host in your environment. Want to access a VM that was backed-up? Mount it!
    • Cloud archiving… Yes others offer this functionality I know. I still feel it is valuable enough to mention that Rubrik does offer the option to archive data to S3 for instance.

Of course there is more to Rubrik than what I just listed, read the articles by Scott, Cormac and Chris to get a good overview… Or just contact Rubrik and ask for a demo.