
Load Testing and Failover
Without load testing in a production-comparable environment, you won’t know your overall service capacity. Many organisations don’t have a second full-size environment, so load-testing in live it is. Remember — at quiet times, you can just test to degradation, rather than destruction. It’s not a sin to test your live environment.
If you have redundant systems, you will naturally want to test failover. Firewall pairs, load balancers, even just redundant servers — you should do failover testing of all of them. It’s easy enough — present a sample load, and turn the primary one off. What happens to the sample load? If it carries on, you’re golden.
Well… walk with me to a slightly more rigorous place for a moment.
Major causes of sudden failovers, beyond actual hardware failures, are subtle memory leaks and file handle leaks. Appliances can suffer from these just as much as your own code. Another source, especially for devices like firewalls and load balancers, which by their very nature handle a lot of simultaneity, is rare race conditions in code.
Leaks can trigger failure at any time, but you should be cautious about that “any”. They are typically driven by the number of visits to the offending code path, accumulating over time. So let’s imagine your daily user activity and break it into two halves, two windows of time.
One is centred on your peak minute, with 25% of the day’s usage on either side of it, so it covers half of your daily activity. The other is the rest of the day, which covers the other half. If you have any kind of daily busy period, there are good odds that the first window is maybe 4 hours long (your busy period) and the other is the remaining 20 hours. YMMV.
Each window carries half of the visits, so it’s as likely as not that a leak-based failure will land in the busy one, close to the worst possible time.
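To put numbers on that split, here is a minimal sketch in Python. The hourly profile is invented for illustration (20 quiet hours and a 4-hour evening peak), and `shortest_window_with_share` is a hypothetical helper, not part of any tool. Rather than centring on the peak minute, it simply finds the shortest contiguous run of hours that carries at least half of the day’s requests, which is close enough for a back-of-envelope check.

```python
# Hypothetical traffic profile: requests per hour, hours 0..23.
# 20 quiet hours at 25 requests, 4 busy hours (18:00-22:00) at 125.
HOURLY_REQUESTS = [25] * 18 + [125] * 4 + [25] * 2


def shortest_window_with_share(hourly, share=0.5):
    """Return (start_hour, length_hours, actual_share) for the shortest
    contiguous run of hours carrying at least `share` of the daily total.

    The window does not wrap around midnight. Among windows of the same
    length, the one carrying the most traffic wins.
    """
    total = sum(hourly)
    n = len(hourly)
    for length in range(1, n + 1):
        best = None
        for start in range(n - length + 1):
            carried = sum(hourly[start:start + length])
            if carried >= share * total and (best is None or carried > best[2]):
                best = (start, length, carried)
        if best is not None:
            return best[0], best[1], best[2] / total
    raise ValueError("no window carries that share of the traffic")


if __name__ == "__main__":
    start, length, actual = shortest_window_with_share(HOURLY_REQUESTS)
    print(f"Half the traffic fits in {length} hours starting at {start}:00")
    print(f"Share of daily requests in that window: {actual:.0%}")
```

If leak-driven failures track the number of requests, the share of traffic in a window is also a fair first guess at the share of failures that will land in it. Swap in your own hourly counts to see how narrow your busy window really is.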
The second set of causes — concurrency, race conditions, livelocks and deadlocks — well those are straight up biased towards your busy times. A numerical analysis is complex, but suffice it to say, most of those nanosecond-long risky windows will happen at peak load.
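As a rough illustration only: assume a race needs two requests inside the same tiny window, and that requests arrive independently of each other. Under that assumption, the chance of an overlap in any given hour grows roughly with the square of that hour’s request rate. This is a deliberately crude model, not the full analysis, and the traffic profile and function name below are made up.

```python
# Hypothetical traffic profile: requests per hour, hours 0..23.
# 20 quiet hours at 25 requests, 4 busy hours (18:00-22:00) at 125.
HOURLY_REQUESTS = [25] * 18 + [125] * 4 + [25] * 2


def overlap_risk_share(hourly, busy_hours):
    """Share of modelled overlap risk that falls within `busy_hours`.

    Assumption: risk per hour is proportional to (request rate) squared,
    which is what independent arrivals give when the risky window is
    very short compared with the gap between requests.
    """
    weights = [rate * rate for rate in hourly]
    busy = sum(weights[hour] for hour in busy_hours)
    return busy / sum(weights)


if __name__ == "__main__":
    busy = range(18, 22)
    traffic_share = sum(HOURLY_REQUESTS[h] for h in busy) / sum(HOURLY_REQUESTS)
    risk_share = overlap_risk_share(HOURLY_REQUESTS, busy)
    print(f"Busy hours carry {traffic_share:.0%} of requests")
    print(f"Busy hours carry {risk_share:.0%} of modelled overlap risk")
```

With this made-up profile, the busy four hours carry half of the requests but about 83% of the modelled overlap risk. The exact figure means little; the direction of the bias is the point.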
So when you tested that failover… was it at peak load?
When you load tested, did you go straight in at peak load from zero, or did you ramp up slowly “to avoid breaking everything”?
This is where those two things come together. There is a significant chance that one of your redundant devices will fail at peak times, and the other will go from zero to slightly-on-fire in less than a second. That’s what you should be testing.
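The difference between the two test shapes can be sketched as target request rates, second by second. The function names and the numbers (1,000 requests per second, failover at second 50) are placeholders for illustration, not recommendations, and the sketch only describes the load shape; it does not generate any traffic.

```python
def ramp_profile(peak_rps, duration_s):
    """Gentle ramp: target rate rises linearly until it reaches peak."""
    return [peak_rps * (t + 1) / duration_s for t in range(duration_s)]


def failover_step_profile(peak_rps, duration_s, failover_at_s):
    """Load as the standby sees it: nothing, then full peak in one step."""
    return [0 if t < failover_at_s else peak_rps for t in range(duration_s)]


def largest_jump(profile):
    """Biggest second-to-second increase in target rate."""
    return max(later - earlier for earlier, later in zip(profile, profile[1:]))


if __name__ == "__main__":
    ramp = ramp_profile(peak_rps=1000, duration_s=100)
    step = failover_step_profile(peak_rps=1000, duration_s=100, failover_at_s=50)
    print("Largest jump in the ramp test:", largest_jump(ramp))
    print("Largest jump in the failover test:", largest_jump(step))
```

A ramp never asks the device to absorb more than a small increase at a time, whereas the step asks it to absorb the whole peak at once. If your load tool accepts a per-second schedule, the second shape is the one that resembles a real peak-time failover.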
If you need more guidance, why not contact us to see if we can help?