Introduction to VMware FT (Fault Tolerance)

FT is a new feature which VMware introduced to the world during VMworld 2008.  The feature is a continuous availability solution for use with some virtual machines.  The following notes were compiled from session BC2621 at VMworld which introduced the forthcoming FT feature from VMware’s Application vServices.  The feature should be available sometime in 2009.

FT will enable a VM to be protected with zero downtime and zero data loss due to a hardware failure.  FT allows for a VM to have a secondary copy running simultaneously on a second ESX host which is executing every instruction and every input in lockstep with the primary VM.  In the event of a failure, the secondary VM becomes the primary within a matter of seconds, while preserving state and without disconnecting any connections to the virtual machine.  All traffic is redirected to the secondary.  In addition, once the secondary assumes the role as primary, it spawns a new secondary instance on another ESX host and brings full fault tolerance back to the virtual server.
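
To make the lockstep idea concrete, here is a toy Python sketch. It is not how VMware implements FT: the `ToyVM` class, its `apply` method, the host names and the failover steps are all invented for illustration, and rebuilding the new secondary by replaying an input log is a simplification I chose, not something the session described. It only shows the principle: if the secondary receives the same inputs in the same order as the primary, it holds the same state and can take over, after which a fresh secondary is brought up on another host.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM whose state is a pure function of its inputs."""
    host: str
    state: int = 0
    log: list = field(default_factory=list)

    def apply(self, value: int) -> None:
        # Deterministic update: same inputs in the same order give the same state.
        self.state = self.state * 31 + value
        self.log.append(value)


def replicate(primary: ToyVM, secondary: ToyVM, value: int) -> None:
    # Every input delivered to the primary is also delivered to the secondary.
    primary.apply(value)
    secondary.apply(value)


def fail_over(secondary: ToyVM, spare_host: str) -> tuple:
    # The surviving secondary becomes the new primary...
    new_primary = secondary
    # ...and a new secondary is built on another host by replaying the input log.
    new_secondary = ToyVM(host=spare_host)
    for value in new_primary.log:
        new_secondary.apply(value)
    return new_primary, new_secondary


primary, secondary = ToyVM("esx-a"), ToyVM("esx-b")
for v in (3, 1, 4):
    replicate(primary, secondary, v)

# Host esx-a fails: esx-b takes over and esx-c becomes the new secondary.
primary, secondary = fail_over(secondary, "esx-c")
replicate(primary, secondary, 1)
print(primary.host, secondary.host, primary.state == secondary.state)
```

The final line prints the two host names and whether the states still match after the failover.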

The idea of continuous access is very impressive and, like the HA feature, should make continuous access easier to adopt within the enterprise.  One of the things VMware touted during the introduction is that they believe this will really increase the number of applications which can be protected this way.  Other solutions on the market are usually cost-prohibitive for all but the most mission-critical applications.

This particular solution is also very appealing because it requires no modification to the application to support this technology.  The application does not need to be made cluster aware or altered in any way, because the fault tolerance is all accomplished at the virtualization layer.  In the same way that applications don’t have to be altered to run in a VM, applications will not need to be altered for FT.

There are some requirements for running FT in its first incarnation.  First, FT will only support uni-processor VMs, at least in the beginning.  It requires an HA/DRS cluster and VMotion to work (more on this later).  One good thing to note is that there isn’t an extra storage cost for the secondary VM.  It uses the same virtual hard disks as the primary on the shared VMFS, so that too is a requirement: shared VMFS of some sort.

The FT feature doesn’t come without some cost.  In addition to these requirements, the secondary VM will be alive and running, so it will consume the same amount of resources on a second ESX server as the primary VM.  This means you probably will not want to protect every VM.  FT will also require a dedicated NIC on the cluster for FT logging, running at 1 Gbit/s or faster.  Lastly, the FT feature will increase latency on the VM, albeit very slightly, because the two VMs have to be kept in lockstep.
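
A back-of-the-envelope sketch of that resource cost, assuming (as the session described) that each protected VM runs a full second copy on another host. The VM names and sizes are made up, and the function is only an illustration, not a VMware sizing tool.

```python
def cluster_demand(vms, protected):
    """Return (cpu_mhz, mem_mb) demand, counting FT-protected VMs twice.

    vms: dict of name -> (cpu_mhz, mem_mb); protected: set of names with FT on.
    """
    cpu = mem = 0
    for name, (vm_cpu, vm_mem) in vms.items():
        copies = 2 if name in protected else 1  # primary plus running secondary
        cpu += vm_cpu * copies
        mem += vm_mem * copies
    return cpu, mem


# Hypothetical inventory, for illustration only.
vms = {"db01": (2000, 4096), "web01": (1000, 2048), "web02": (1000, 2048)}

print(cluster_demand(vms, protected=set()))      # nothing protected
print(cluster_demand(vms, protected={"db01"}))   # only the database protected
print(cluster_demand(vms, protected=set(vms)))   # everything protected
```

Protecting every VM doubles the demand on the cluster, which is why you would normally pick only a few.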

To support this technology, VMware is working closely with processor makers.  Processors will need to be HV-compatible (I think this is not the same as HVM; somebody tell me for sure…), a capability introduced with Intel’s Harpertown and AMD’s Barcelona chips.  So the feature will only be available on the newest processors, which rules out the quad-core Xeons in the blades we purchased at the end of last year.  Bummer.

FT will not support thin provisioning.  In fact, per the demo, if a VM is thin provisioned (that is, if it is created from a linked clone rather than a full virtual HD), then enabling FT will automatically force the disk to be converted to a full virtual hard disk.
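
Pulling together the requirements mentioned so far, a pre-flight check might look like the sketch below. The dictionary keys and the `ft_preflight` function are hypothetical names of my own, not anything from a VMware API or tool; the rules are just the ones listed in these notes.

```python
def ft_preflight(vm: dict, cluster: dict) -> list:
    """Return a list of findings; an empty list means nothing stood out."""
    findings = []
    if vm["vcpus"] != 1:
        findings.append("VM must be uni-processor")
    if not (cluster["ha_enabled"] and cluster["drs_enabled"]):
        findings.append("cluster must have HA and DRS enabled")
    if not cluster["vmotion_enabled"]:
        findings.append("VMotion is required")
    if not vm["on_shared_vmfs"]:
        findings.append("disks must live on shared VMFS")
    if cluster["ft_logging_nic_gbit"] < 1:
        findings.append("FT logging needs a dedicated NIC of at least 1 Gbit")
    if not cluster["cpu_hv_compatible"]:
        findings.append("hosts need HV-compatible processors")
    if vm["thin_provisioned"]:
        # Not a blocker per the demo, but worth flagging before enabling FT.
        findings.append("thin disk will be converted to a full disk")
    return findings


# Made-up example values.
vm = {"vcpus": 2, "on_shared_vmfs": True, "thin_provisioned": True}
cluster = {
    "ha_enabled": True,
    "drs_enabled": True,
    "vmotion_enabled": True,
    "ft_logging_nic_gbit": 1,
    "cpu_hv_compatible": True,
}

for finding in ft_preflight(vm, cluster):
    print(finding)
```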

VMotion is supported with FT virtual machines, though we were warned that you probably don’t want to do this often.  You may want to set affinity to a particular ESX host for the primary FT VM to prevent this, but it will work if enacted.  Storage VMotion, on the other hand, is not supported with FT virtual machines.  This is because both VMs are accessing the same physical files on the storage.

The speaker gave us a few key points for picking candidates for FT:

  • Applications that run well on a single processor.
  • Applications which can tolerate higher latency.
  • Applications with medium bandwidth requirements (< 600 Mbit/s).
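
Those three rules of thumb are easy to turn into a quick filter over an inventory. This is only a sketch: the `Workload` fields and the sample workloads are invented, and the 600 threshold is simply the speaker's figure taken as megabits per second.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are hypothetical inventory data, not values read from any VMware API.
    name: str
    vcpus_needed: int
    latency_sensitive: bool
    peak_mbits: float


def is_ft_candidate(w: Workload, max_mbits: float = 600.0) -> bool:
    """Apply the speaker's three rules of thumb."""
    return (
        w.vcpus_needed <= 1            # runs well on a single processor
        and not w.latency_sensitive    # can tolerate some added latency
        and w.peak_mbits < max_mbits   # medium bandwidth requirements
    )


workloads = [
    Workload("license-server", 1, False, 20),
    Workload("smp-database", 4, False, 300),
    Workload("trading-feed", 1, True, 100),
    Workload("backup-proxy", 1, False, 900),
]

candidates = [w.name for w in workloads if is_ft_candidate(w)]
print(candidates)
```

In this made-up inventory only the license server passes all three tests.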

About the Post

Author Information

Philip is an IT solutions engineer working for AmWINS Group, Inc., an insurance brokerage firm in Charlotte, NC. With a focus on data center technologies, he has built a career helping his customers and his employers deploy better IT solutions to solve their problems. Philip holds certifications in VMware and Microsoft technologies, and he is a technical jack of all trades who is passionate about IT infrastructure and all things Apple. He's a part-time blogger and author here at Techazine.com.

12 Responses to “Introduction to VMware FT (Fault Tolerance)”

  1. rick #

    There is far more to be considered here than meets the eye. You do mention that “FT will only support uni-processor VM’s”. That is a bit misleading. Today’s “processors” have 2 to 4 cores, and this software can only take advantage of one core per application. So if you have an application that requires more “horsepower” than a single core can provide, VMware FT cannot scale to meet this need.

    You also state that “Other solutions on the market are usually cost prohibitive…”, but we have no idea what this feature will cost. What VMware FT will cost is only one unanswered question related to ROI; here are a few others to consider.
    • VMware recommends “over configuring” your server for this feature. What will be the incremental hardware, software, and infrastructure cost of VMware FT?
    • If you need to ‘over configure’ for performance reasons, will you be willing to do that twice? If so, what additional cost is incurred? If not, what performance penalties are incurred?
    • Are you willing to over-configure your system/network/storage for this feature?
    • Do you know what the costs are, both hardware and licenses, associated with ‘over-configuring’ your system/network/storage for this feature?
    • Does this over configuring defeat the purpose of your consolidation/reduction plans?
    • Can your application run on a single core at peak load? Do you even know the answer to that question?
    • Will your new systems work with your installed systems or do you need to upgrade your hardware?

    Once I took a close look at this, it seems to be as costly as any other solution I’ve looked at, maybe more so, and with a lot more unanswered questions. A few more that come to mind right off the top.
    • Will VMware FT prevent a problem from propagating to the secondary server after a failover? Or does the error get replicated?
    • What was the cause of the crash? Is the problem root-caused to ensure it does not occur again? Or did it just move somewhere else?
    • What happens to my environment during the time before the secondary server recognizes that the primary server is no longer working and shifts into active mode?

    Not sure I’d ever use a version 1.0 of anything for my more critical apps, and this leaves way too many unanswered questions.

    September 22, 2008 at 9:44 am
  2. Philip #

    You definitely bring up good points, Rick. I certainly don’t have all the answers you’re seeking and you share some of my same questions. There were some things that I can relay from the session I attended:

    FT is not a solution that is application aware. It is a solution to protect from hardware failure. It is HA taken one step forward, giving you the ability to avoid downtime and lost connections through seamless failover of services to the standby VM. FT does not address application errors, blue screens, kernel panics, etc. Those types of errors will be transferred to the secondary and will cause downtime.

    There are additional resources needed to enable FT for a VM, in that you’ll be executing the same thing on two ESX hosts. As far as ‘over configuring’, not sure what you’re asking here. VMware advertises that FT isn’t something you’re probably going to do for every VM. It is intended and positioned for the most mission critical of virtual machines.

    During the demo we saw, the interim between the failover from primary to secondary shows a brief (and I mean sub-1 second) pause while duties are transferred. No sessions were broken and the client application continued communicating and running.

    While we don’t know the pricing, my point was that because of how this feature is created, it will lower the bar for implementing a continuous access solution in most datacenters. The bigger point here isn’t so much cost. It’s more that you can implement this solution without needing to change the application, which itself can be expensive. The datacenter in which I work is a medium-sized center, but we don’t have anything comparable currently implemented because we can’t justify the cost when we have so many applications that need this sort of protection. All of our current solutions are active/standby and require a certain amount of time to start up and begin processing again. That would interrupt any established sessions.

    I appreciate your feedback and I hope maybe these answers can shed a little more light on the features and targeted use of FT.

    September 22, 2008 at 1:45 pm
  3. Denny #

    I too was in the BC2621 session at VMworld, a packed room that reflects the need to increase availability in a virtual world.

    Philip, I agree with your assessment that (VMware FT) “will lower the bar for implementing a continuous access solution in most datacenters”. It will allow more users to think about moving their more important apps over to their VI environment – a natural evolution.

    I think Rick was questioning if VMware was overselling the capability. Discussing the limitation as “1-way” or “Uni-processor” misses the fact that your typical 2 socket/4 core server (8 cores) can only use 1/8th of its power against a given workload. Since most databases are designed as SMP – it is hard to imagine someone being pleased with the performance when it is throttled down to a single core.

    Further, the performance hit of “up to 20%” seems like another big strike against having a positive experience with a mission critical application. Most S/W companies give a rosy view when it comes to negative performance hits. Everyone will have to judge for themselves when the product arrives sometime in 2009.

    I agree with Dr Scales’ suggestion that there are many low to mid applications that will benefit from this technology – a natural evolution of HA. And for the mid to higher end apps (often running on FT hardware today) they can naturally move over to VMware running on fault tolerant hardware platforms.

    When VMware announced that VI3 was shipping for fault tolerant servers in Jan ’08 (from NEC and Stratus) it addressed the top of the availability pyramid.

    * Full SMP
    * Managed as a single image (server)
    * No performance impact
    * No application modification
    * Even runs with ESX Foundations, Standard, or Enterprise

    Now VMware FT is filling in the middle of that pyramid for the more numerous but less demanding applications.

    Oh and about the comment of “Other solutions on the market are usually cost prohibitive” it is hard to make a comparison since VMware will not disclose pricing of VMware FT. There were some web rumors of additional prices up to $8000. But just consider a real fault tolerant server can use one ESX Foundations license (only $995) today vs at least 2 copies of Enterprise edition for $11,500. Hmmmmmm I know which VMware would prefer to sell you.

    September 22, 2008 at 4:34 pm
  4. Philip #

    Good points. I hadn’t heard anything on price, so if those costs are reality, that would be difficult. I hope it’s not that cost per ESX host. I can’t see VMware demanding that price point, especially considering the amount of competition they are facing, but I am eager to find out for sure.

    I do think that they’ve been upfront (at least during VMworld) about the single virtual processor, but that is a major restriction for virtual database servers.

    I also wonder how FT will work with RDMs. That is something that is only experimentally supported in SRM, and I wonder if it would be supported for FT.

    September 22, 2008 at 7:45 pm
  5. Brian #

    Unlike some of the other posters, I wasn’t able to attend VMworld, but I have viewed the online presentations that covered the VMware FT introduction. First: kudos to VMware for recognizing the growing need for availability. Having said that, VMware is entering a segment of the IT market that is unlike any other, for the customers who deploy a fault tolerant solution to protect their mission critical applications are among the most demanding in the industry, and rightfully so. Those applications are called mission critical for a reason. If for any reason they become unavailable, the consequences are measured in money, reputation, careers, and sometimes even in lives. Ultimately, VMware had better deliver what they’re advertising, for the consequences of failure will boomerang straight back to Palo Alto.
    This brings me to my second point: while viewing the video demonstration of the casino application, there was a short, but very noticeable pause between the time Dr. Herrod pulled the active server blade from the chassis and when the secondary server recognized that something bad had occurred and shifted from a passive to an active state. While one could argue that no data was lost, this ignores the reality of many environments that deploy fault tolerant solutions. For example, in a manufacturing environment that depends on a series of valves opening or closing at very specific intervals, a pause – no matter how short in duration – could have dire consequences to the batch of drugs or chemicals that are being mixed. In contrast to what was shown in the demo, readers should know that a true fault tolerant solution is engineered expressly to handle any type of outage and incurs no failover time.
    Finally, one of the hallmarks of a true fault tolerant solution is its simplicity. While VMware’s “2 clicks and you’re all set” demo was brilliant marketing, it ignores all the behind the scenes planning that will have to go into a real-world deployment given the product’s well-discussed performance limitations and overhead issues.
    Will VMware FT be a success? That depends on what you define as a success. I suppose VMware will sell a number of licenses and garner some incremental revenue. But for those vendors who live and breathe true fault tolerance, success is measured by the endorsement of customers who continue to put their trust in such products year-after-year, decade-after-decade. From what I’ve seen so far, VMware has a long way to go.

    September 23, 2008 at 10:54 am
  6. rick #

    Some great dialog going on here. I was not aware of some of the things the other posters have brought up. If the pricing is close to the $8K mentioned above, that would be somewhat prohibitive in my case. Also, I was not aware that Stratus ran VMware. I can see that being an interesting solution for the most critical applications that one might not be comfortable in virtualizing.

    The poster above (Brian) makes a good point around the vendors who “live and breathe” fault tolerance. The Tandems and Stratuses of the world have made a living providing “the nines” and know what happens if you don’t. It’s not a mistake you get a second chance on. And what used to cost 6 figures a few years ago (for the hardware) is now running on Intel with Windows.

    It will be interesting to watch this unfold, and to your point (Philip) the competition is only going to get tougher.

    September 23, 2008 at 11:47 am
  7. saqp@nnit.com #

    I’d like to add my two cents’ worth.

    One thing that amazes me is that no one really talks about site redundancy when discussing VMware FT.
    Unlike Stratus, VMware FT (and Marathon FT/Xen, for that matter) actually has the ability to span two separate sites (provided, of course, that the infrastructure can handle it, blah blah blah).

    Now please correct me if I’m wrong but… to the best of my knowledge no one else offers this ability for “regular” Windows or Linux boxes.

    Why would you want to run VMware FT on hosts in the same physical site when we have several other players in that market with a whole lot more experience???

    October 23, 2008 at 10:03 am
  8. Hi,

    If you want to see a sneak preview and a sneak video of VMware FT before it hits the market, then you might want to check out http://www.virtualizationteam.com/virtualization-vmware/vmware-esx-40-ft-fault-tolerant-sneak-peek.html

    And I do agree that VMware FT is going to be a great feature, and I disagree with people putting it down for the benefit of their own company. The video above shows how beneficial FT can be, and that should get the point across better than words.

    Enjoy,
    Virtualization Master

    November 7, 2008 at 11:59 am
  9. Did you hear about VMware FT? I just read a bit about it on . Do you think it would replace VMware HA? I even saw a video of FT at that link. Is it available yet?

    December 16, 2008 at 2:40 am
  10. I’m wondering if anyone has heard about an additional price yet. I was reading a few posts about it, and one mentioned that FT was a component or “part” of HA. The way it read was basically that if you had HA, then FT would be there in the new version. I would love it if this were true, but as many others have mentioned, I doubt it. I also believe that they need to increase the number of processors supported, as well as the speed of failover; I have several MS clusters and the typical failover is 2-3 seconds after it’s failed. Also, I would think that they should be able to come up with a version that tracks what is happening, realizes if there is a BSOD or similar and, if so, fails over to the other node without running those last transactions. Granted, depending on the environment/OS that could be a problem, but if done correctly, and if it were more OS/application aware, it shouldn’t be impossible.

    April 16, 2009 at 12:01 pm
    • Philip #

      Kyle – those announcements will be made today with the introduction of vSphere. VMware has already published their pricing, packaging and licensing overview on their website. The short version is that FT will be included for users with current enterprise level agreements. VMware is also adding a licensing level in between Standard and Enterprise called Advanced which will include the feature. Check out http://www.vmware.com/files/pdf/vsphere_pricing.pdf/

      April 21, 2009 at 11:41 am
  11. Does anyone know what the HCL is for this feature?
    After a quick check I saw nothing but N/A on most of the recent server models.

    How do you think this competes with Stratus?

    Stratus is an FT server in-a-box

    http://www.stratus.com/

    What kinds of I/O do you think VMware can handle?

    I’m looking at something like 1000 transactions per second for a real time database software.

    Would VMware FT do?

    June 4, 2009 at 9:24 am
