
Why I moved from NFS to vSAN… and why it went wrong

This posting is ~2 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

I wanted to retire my Synology DS414slim and switch completely to vSAN. Okay, no big deal. Many folks use vSAN in their lab. But I’d like to explain why I moved to vSAN and why this move failed. I think some of my thoughts are also applicable to customer environments.

So far, I had used a Synology DS414slim with three Crucial M550 480 GB SSDs (RAID 5) as my main lab storage. The Synology was connected with two 1 GbE uplinks (LAG) to my network, and each host was connected with 4x 1 GbE uplinks (single distributed vSwitch). The Synology was okay from the capacity perspective, but the performance was horrible. RAID 5, SSDs and NFS were not the best team, or to be precise, the CPU of the Synology was the main bottleneck.

[Image: nas_ds414slim. Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0]

1.2 GHz is not enough if you want to use NFS or iSCSI. I never got more than 60 MB/s (sequential). The random IO performance was okay, but as soon as the IO increased, the latencies went through the roof. Not because the SSDs were too slow, but because the CPU of the Synology was not powerful enough to handle the NFS requests.

Workaround: Add more flash storage

The workaround for the poor random IO performance was adding more flash storage. This time, the flash storage was added to the hosts: I used PernixData FVP to boost my lab. FVP was a quite cool product (unfortunately, “was” is the operative word). PernixData granted me, as a PernixPro, some licenses for my lab.

End of an era

The acquisition of PernixData by Nutanix, the missing support for vSphere 6.5, and the end of availability of all PernixData products led to the decision to remove PernixData FVP from my lab. Without PernixData FVP, my lab was again a slow train crawling up a hill. Four HPE ProLiant servers, with enough CPU (40 cores) and memory resources (384 GB RAM), were tied down by slow IO.

Redistribution of resources

I had

  • three 480 GB SSDs, and
  • three 40 GB SSDs

in stock. The 40 GB SSDs were too small and too slow, so I replaced them with 120 GB SSDs. I was able to equip three of my four hosts with SSDs. Three hosts with flash storage were enough to try VMware vSAN.

Fortunately, not every host has to contribute capacity to a vSAN cluster. Hosts can also simply consume storage from the vSAN cluster. With this in mind, vSAN appeared to be a way out of my IO dilemma. In addition, with the 480 GB SSDs as the capacity tier, an all-flash vSAN configuration was possible.
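
A quick side note on why three hosts are enough (my own back-of-the-envelope sketch, not a VMware tool; it assumes the default storage policy with failures to tolerate = 1 and RAID-1 mirroring): each object needs FTT + 1 data replicas plus FTT witness components, each on a separate capacity-contributing host.

    # Minimum capacity-contributing hosts for vSAN RAID-1 mirroring
    # (assumption: default policy, no dedicated fault domains)
    def vsan_min_hosts(failures_to_tolerate: int) -> int:
        # FTT + 1 data replicas plus FTT witness components, one host each
        return 2 * failures_to_tolerate + 1

    for ftt in range(3):
        print(f"FTT={ftt}: {vsan_min_hosts(ftt)} hosts contributing capacity")

    # FTT=1 (the default) needs 3 hosts, which matches my three SSD-equipped hosts.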

Migration

It took me a little time to move around VMs to temporary locations, while keeping my DC and my VCSA available. I had to remove my datastore on the Synology to free up the 480 GB SSDs. The necessary vSAN licenses were granted by VMware (vExpert licenses).

The creation of the vSAN cluster itself was easy. Wiping the partitions from the disks was no problem either; you can do this directly from the vSphere Web Client.

[Image: vsan_wipe_partitions. Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0]

The initial performance was quite good, much better than expected and much better than the NFS performance of the old Synology NAS. I enabled deduplication and compression, but as soon as I moved VMs to the vSAN datastore, the throughput dropped and latencies went through the roof. It was totally unusable. Furthermore, I got health alarms:

[Image: vsan_congestion_error. Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0]

As the load increased, the errors became more severe.

[Image: vsan_performance_error. Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0]


I was able to solve this with the help of a blog post by Cormac Hogan (VSAN 6.1 New Feature – Handling of Problematic Disks). But even without compression and deduplication, the performance was not as expected, and most of the time too low to work with. At this point, I got an idea of what was causing my vSAN problems.

Do not use consumer-grade hardware with vSAN

To be honest: the budget was the problem. I had to use consumer-grade SSDs.

This is a screenshot from the vSAN Observer. esx1 to esx3 are equipped with SSDs, esx4 is only consuming storage from the vSAN cluster.

[Image: vsan_observer_perf. Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0]

Red is not the color to highlight good things…

An explanation attempt

This blog post by Duncan Epping (Why Queue Depth matters!) is a bit older, but still valid in my case. The controller I use (HPE Smart Array P410i) has a deep queue (1011), and the RAID device has a queue depth of 1024, but the SATA SSDs only have a queue depth of 32. Here’s the disk adapter and disk device view in ESXTOP.
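
To put some rough numbers on it (my own Little’s Law illustration, not taken from Duncan’s post; the 0.5 ms service time is purely an assumption): the sustainable IOPS of a device is capped by its queue depth divided by the per-IO service time, so the 32-slot SATA queue saturates long before the controller or RAID device queue does.

    # Rough Little's Law illustration: sustainable IOPS <= outstanding IOs / service time
    def iops_ceiling(queue_depth: int, service_time_ms: float) -> float:
        return queue_depth / (service_time_ms / 1000.0)

    service_time_ms = 0.5  # assumed per-IO service time under load, not a measured value
    for name, qd in [("SATA SSD (NCQ)", 32), ("P410i controller", 1011), ("RAID device", 1024)]:
        print(f"{name}: QD {qd} -> ~{iops_ceiling(qd, service_time_ms):,.0f} IOPS ceiling")

    # At this latency the SSDs top out around 64,000 IOPS, while the controller and RAID
    # device could keep far more IOs in flight; the surplus just queues up and latency climbs.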

The consumer-grade SSDs drowned in IOs, unable to handle parallel read and write operations. There’s not much that I can do about it. Currently, there are two options:

  • Replace the SSDs with devices that have a deeper queue depth
  • Replace the Synology NAS with a more powerful NAS and move back to NFS

I don’t know yet which way I will go. To make this clear:

  • This is my lab, not a customer environment
  • It is not a vSAN-related problem
  • It is caused by consumer-grade hardware

Do not try this in production, kids. Go vSAN, but please use the right hardware.

VMware jumps on the fast moving hyper-converged train

This posting is ~4 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

The whole story began with a tweet and a picture:

This picture, in combination with rumors about Project Mystic, motivated Christian Mohn to publish an interesting blog post. Today, two and a half months later, “Marvin” or Project Mystic got its final name: EVO:RAIL.

What is EVO:RAIL?

Firstly, we have to learn a new acronym: Hyper-Converged Infrastructure Appliance (HCIA). EVO:RAIL will be exactly this: an HCIA. IMHO, EVO:RAIL is VMware’s attempt to jump on the fast-moving hyper-converged train. EVO:RAIL combines different VMware products (vSphere Enterprise Plus, vCenter Server, Virtual SAN and vCenter Log Insight) with the EVO:RAIL deployment, configuration and management engine into a hyper-converged infrastructure appliance. Appliance? Yes, an appliance. A single stock keeping unit (SKU) including hardware, software and support. To be honest: VMware will not try to sell hardware. The hardware will be provided by partners (currently Dell, EMC, Fujitsu, Inspur, NetOne and SuperMicro).

VMware Chief Technologist Duncan Epping described four advantages of EVO:RAIL in a blog post published today:

EVO:RAIL is software-defined: Based on well-known VMware products, the EVO:RAIL engine simplifies the deployment, management and configuration of the building blocks.

EVO:RAIL is simple: The EVO:RAIL engine allows you to reduce the time from rack & stack until you can power on your first VM. You need less time for basic tasks, like the creation of VMs or the patch management of the hosts. If you need more compute or storage capacity, simply add additional 2U blocks (currently a maximum of 4 blocks, i.e. 16 nodes).

EVO:RAIL is highly resilient: A 2U block consists of four nodes. This results in a single four-host vSphere cluster, with a single VSAN datastore and full support for VMware HA, DRS, FT etc. This ensures no downtime for VMs during planned maintenance or node failures.

EVO:RAIL allows customers to choose: Customers can obtain EVO:RAIL using a single SKU from their preferred EVO:RAIL partner. The partner provides hardware, software and support for the EVO:RAIL HCIA.

Each HCIA node will provide at least:

  • 2x Intel Xeon E5-2620 v2 six-core CPUs
  • at least 192GB of memory
  • 1x SLC SATADOM or SAS HDD as boot device
  • 3x SAS 10K RPM 1.2TB HDD for the VMware Virtual SAN datastore
  • 1x 400GB MLC enterprise-grade SSD for read/ write cache
  • 1x Virtual SAN-certified pass-through disk controller
  • 2x 10GbE NIC ports (either 10GBase-T or SFP+)
  • 1x 1GbE IPMI port for out-of-band management

This results in a four-node vSphere cluster with 48 cores, 768 GB RAM and 14.4 TB raw disk space in just 2U. A single block allows you to run 100 average-sized (2 vCPU, 4 GB RAM, 60 GB with redundancy) general-purpose VMs, or 250 View VMs (2 vCPU, 2 GB RAM, 32 GB linked clones).
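
For reference, the per-block numbers follow directly from the per-node minimum spec listed above (just a quick check of the arithmetic):

    # Per-block arithmetic for a 4-node EVO:RAIL 2U block (per-node minimum spec)
    nodes = 4
    cores = nodes * 2 * 6      # 2x six-core E5-2620 v2 per node -> 48 cores
    memory_gb = nodes * 192    # 192 GB per node -> 768 GB
    raw_tb = nodes * 3 * 1.2   # 3x 1.2 TB capacity HDDs per node -> 14.4 TB raw
    print(cores, memory_gb, round(raw_tb, 1))  # 48 768 14.4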

My thoughts

Looks like a Nutanix clone, doesn’t it? Yes, it’s an HCIA like a Nutanix block. But it’s focused on VMware (you can’t run Microsoft Hyper-V or KVM on it) and it will be sold by EVO:RAIL partners. This allows VMware to use a much wider distribution channel. It will be fun to see how other hyper-converged companies will react to this announcement. Unfortunately, HP isn’t listed as an HCIA partner company. But DELL is listed. Fun fact: DELL and Nutanix signed a contract in June 2014.

Strategic Relationship Significantly Expands Access and Distribution of Nutanix Solutions with Dell’s World-Class Hardware, Services and Marketing to Accelerate Adoption of Web-scale Converged Infrastructure in the Enterprise

Take a look at the “Introduction to VMware EVO: RAIL” whitepaper. There are other great blog posts about EVO:RAIL:

Duncan Epping: Meet VMware EVO:RAIL™ – A New Building Block for your SDDC
Christian Mohn: NO MORE SPECULATION: IT’S OFFICIAL EVO:RAIL IT IS
Chris Wahl: VMware Announces Software Defined Infrastructure with EVO:RAIL
Marcel van den Berg: VMware announces EVO:RAIL, a turnkey appliance offering SDDC in a box featuring vSphere and Virtual SAN
Marco Broeken: VMworld 2014: Introducing VMware EVO: RAIL
Vladan SEGET: VMware EVO:RAIL – New Hyper-Converged Solution By VMware
Eric Sloof: VMware EVO: RAIL Hyper-Converged Infrastructure Appliance