Provisioning Services Failover: Myth Busted

For quite some time, if you asked experts in the field how long a Provisioning Services (PVS) server failover takes, many would cite an old case study in which roughly 1,500 VMs took about eight minutes to fail over. That was long ago and much has changed over the years, yet failover time still plays a part in PVS server sizing. In this post I will share lessons learned from recent tests, which show that scaling out PVS servers horizontally out of fear of long downtime is a thing of the past. Throughout this post I assume you have a working understanding of PVS; if you do not, check out eDocs and read the Virtual Desktop Handbook.

To test failover times, an environment of 1,000 Windows 8.1 x64 target devices running the PVS 7.1 target device software was used. In accordance with Citrix best practices, the Windows 8.1 image was optimized per CTX140375. Environment details are summarized at the end of the post.

Test Methodology

Three scenarios were run to determine failover time: first as a function of the number of target devices, and then as a function of the registry changes specified in CTX119223. The three scenarios are listed below:

  1. 500 target device failover with the default settings
  2. 1000 target device failover with the default settings
  3. 1000 target device failover with registry changes

Before I get to the results, let's look at how the failover time was actually measured, or rather how a failover was simulated. I needed a consistent way to measure the failover time, or more importantly, the time the desktop is unresponsive for the user. To do this, I ran the following test process:

  1. All target devices were booted from a single PVS server.
  2. A second PVS server was brought online, but was not streaming to any target devices.
  3. All target devices were loaded with a simulated workload using LoginVSI 4.0 to reflect real-world use. The "light" workload was used so that a higher volume of targets could be tested, although the nature of the workload should not affect the failover time.
  4. Once all devices were steadily running the workload, a script was launched on each target that continuously wrote the current time to a remote file. This acted as a failover timer (a minimal sketch of such a timer appears after this list).
  5. The NIC used by the first PVS server to stream to the target devices was then disabled. This simulates an ungraceful failover, as opposed to the graceful failover caused by disabling the streaming service from the console.
  6. During the failover, the targets became temporarily unresponsive and stopped writing the time to the remote file. Once they failed over to the second PVS server, they resumed writing. The length of this outage was recorded as the failover time for that target device.
    Note: Stopping the streaming service through the PVS console causes a graceful failover and does not simulate a failure scenario. That is a product feature and a good thing when you need to perform maintenance, but not when you want an unplanned failover. :)
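For illustration, here is a minimal sketch of the kind of failover timer described in step 4. The log path, interval, and log format are my own assumptions rather than the exact tooling used in the test; the idea is simply that each target writes timestamps to a share that is not served by PVS, so any gap in the log marks the window in which the target was stalled.

```python
import socket
import time

# Hypothetical UNC path on a file server that is NOT streamed by PVS,
# so the log keeps collecting data while the target's vDisk stream is interrupted.
LOG_PATH = r"\\fileserver\failover-logs\{0}.log".format(socket.gethostname())

WRITE_INTERVAL = 0.5  # seconds between writes; gaps much larger than this indicate a stall

def run_failover_timer():
    """Append a timestamp to the remote log in a loop.

    While the target fails over to the second PVS server, the OS stalls on
    vDisk reads and this loop stops writing; the length of the resulting gap
    in the log is the observed failover time for this target.
    """
    while True:
        with open(LOG_PATH, "a") as log:
            log.write("{0:.3f}\n".format(time.time()))
        time.sleep(WRITE_INTERVAL)

if __name__ == "__main__":
    run_failover_timer()
```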
Test Results

With this methodology in place, what kind of failover times did we get? The maximum downtime for each scenario was as follows:

  1. 500 target device failover with the default settings = 76 seconds
  2. 1000 target device failover with the default settings = 76 seconds
  3. 1000 target device failover with registry changes = 35 seconds

Definitely not more than eight minutes: the maximum time recorded was 76 seconds. Note that in the run with the registry changes, many sessions never became unresponsive at all because of the Windows standby cache used by the PVS target device. If the target does not need to read anything from the vDisk when the failover occurs (in other words, everything it needs is already cached locally), the target never appears unresponsive.
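As a side note on the measurement itself, turning the per-target logs into the failover numbers above only requires finding the largest gap between consecutive timestamps in each file. Here is a minimal sketch of that step, assuming the hypothetical log layout from the timer script earlier (one log file per target, one timestamp per line); it is not the exact tooling used in the test.

```python
import glob

def max_gap_seconds(log_path):
    """Return the largest gap, in seconds, between consecutive timestamps in one target's log."""
    with open(log_path) as f:
        stamps = [float(line) for line in f if line.strip()]
    return max(later - earlier for earlier, later in zip(stamps, stamps[1:]))

# One log per target device, written by the failover timer sketched above.
gaps = {path: max_gap_seconds(path) for path in glob.glob(r"\\fileserver\failover-logs\*.log")}

print("Targets measured: {0}".format(len(gaps)))
print("Maximum failover time observed: {0:.1f} seconds".format(max(gaps.values())))
```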

Key Takeaways:

  • There was no change in failover time between 500 and 1,000 targets. For this reason, failover time should not be the determining factor in how many PVS servers are needed to support an environment. Instead, the number of PVS servers should be based on whether the remaining servers can support all target devices in the event that one PVS server fails; in other words, use the N+1 rule for PVS server redundancy. For other design decisions, refer to the PVS guidance in the Virtual Desktop Handbook.
  • It is possible to modify the registry to change the failover behavior. Proceed with caution! As CTX119223 notes, setting these values very low can result in constant failovers triggered by ordinary network hiccups, and is therefore not recommended. In this test I modified two settings listed in CTX119223. The first is the timeout for a response to each packet sent to the PVS server; I set it to 1 second, which is very low. The second is the number of retries before a failover is initiated; I reduced it from the default of 10 to 4 to cut the potential downtime.
  • If the default downtime is too long for your requirements, I suggest setting the minimum timeout no lower than 2 seconds, to help limit the risk of constant failovers, and reducing the number of retries to 4 instead of the default of 10. The registry values changed for this tuning live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BNIStack\Parameters (a short scripted example follows this list):
    IosPacketMinTimeoutms = 000000
    IosRetryLimit = 00000004
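If you prefer to script these changes rather than edit the registry by hand, here is a minimal sketch. It assumes both values are REG_DWORD entries (as the zero-padded values above suggest) and uses the 2-second / 4-retry recommendation from this post rather than the 1-second value from the test; as CTX119223 warns, validate the behavior before rolling anything out, and note that the script must run elevated on the master target device.

```python
import winreg

# BNIStack parameters key referenced in this post and CTX119223.
BNISTACK_PARAMS = r"SYSTEM\CurrentControlSet\Services\BNIStack\Parameters"

def set_failover_tuning(timeout_ms=2000, retry_limit=4):
    """Write IosPacketMinTimeoutms and IosRetryLimit as REG_DWORD values (assumed types)."""
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, BNISTACK_PARAMS, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "IosPacketMinTimeoutms", 0, winreg.REG_DWORD, timeout_ms)
        winreg.SetValueEx(key, "IosRetryLimit", 0, winreg.REG_DWORD, retry_limit)

if __name__ == "__main__":
    set_failover_tuning()
```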

Environment Overview

An 8-host cluster with 125 virtual machines per host was used, with the write cache drives located on an EMC SAN presented to the hypervisor as Cluster Shared Volumes (CSV). I also had a XenDesktop 7.1 site deployed, along with LoginVSI 4.0 to simulate user workload while my tests were running. The environment details are summarized below. Note that the tests were run with the Cache on device hard drive write cache option, as I ran these tests a little while ago, before the recent push toward 1 IOPS (see several recent blog posts on the new RAM cache with overflow option). However, the write cache option should not affect the failover time.

Server:      HP ProLiant BL460c Gen8
CPU:         16 cores, Intel Xeon @ 2.60 GHz
Memory:      192 GB
Storage:     EMC VNX5500 storage array
Hypervisor:  Hyper-V 3 on Server 2012 R2
PVS specs:   2 servers, each with 4 vCPU and 16 GB vRAM

Thank You

I would like to give a big thank you to the Citrix Solutions Lab team for standing up all the hardware that made these tests possible, and to EMC for the VNX array that was used as the primary storage device. I also want to recognize Carl Fallis, Gregorz Iskra, Martin Zugec and Andy Baker for their various contributions and input on the testing.

Thanks for reading,

Amit Ben-Chanoch
Worldwide Consulting
Desktop & Apps Team
Project Accelerator
Virtual Desktop Handbook
