Poor Software Load Testing Can Cause Many People Misery
A software professional should have responsibility and integrity. Do software functional and load testing properly and professionally.
Chengdu, a Chinese city with 21 million residents, was put in lockdown on 2022–09–01, and the residents were required to do mandatory daily PCR tests due to the country’s zero-COVID policy. However, the software system used to assist in recording PCR tests “seriously failed” the next day.
Table of Contents:
· The Pain of a Failed Sofware System Caused to millions of already stressful people
· The Solution is actually Simple, Quick and Low-Cost
· My Specific Solution to load test Chengdu PCR system
The Pain of a Failed Sofware System Brought to Millions of already highly-stressful people
The severely delayed testing process caused millions of people to line up for hours on a rainy night. Chengdu people have recently suffered a long period of extreme-high temperatures, severe drought, electricity shortage, and water shortage. This COVID outbreak and mandatory PCR every 24 hours with a bad software system only made matters worse.
Can you imagine the stress level of those people (I am talking about tens of millions of them) when they were told to go home after lining up for 4 hours on a rainy night? All these were due to a simple software system that didn’t get load tested properly.
The company behind it is Neusoft, the largest software company in China, with 20,000+ employees. By the way, this is not the first time Neusoft’s PCR testing software failed, the last major one was during the Xi’an lockdown time last year.
Not surprisingly, Neusoft blamed others, “Neusoft said that according to the technical experts, the malfunction had nothing to do with the nucleic acid detection system software.” They claimed it was mainly a network issue. Apparently, it was a lie. Many pictures as below were posted:
The fact that the residents could browse web/social media properly meant the network was fine that night.
On September 3, Chengdu Pandemic Prevention Command apologized and stated that the abnormal situation was due to insufficient short-term large-capacity estimates, which caused the system to become stuck.
To simplify the cause of the above incident: the software system couldn’t handle the load. In other words, the software was not properly load tested. This, unfortunately, is very common. Here I will name a few (news headlines):
2007–11, “Beijing Olympic ticketing system crashes on the first day”
2010–01, “My School website bombs on launch”
2012–01 “More (London) Olympic ticket chaos as website crashes causing re-sale to be suspended”
2016–08, “Australian Census website collapses”
There are a lot more of these kinds of events, the above is just a subset that had an impact on me and my family members.
Why? Some non-tech people might ask. These days, the network and hardware are much faster and far more reliable than before, along with the advancement of cloud computing (scalable). Therefore, this kind of situation shouldn’t happen, especially not for high-profile systems by large companies such as Neusoft and IBM!
All these incidents are caused by the fact that most software projects lack a systematic process for conducting load testing. In comparison, the airline industry does a much better job. If there is a plane accident, there are processes to investigate and find out the technical/human reasons, then there will be a new/updated process to prevent that. That is why we feel safe with air travel. And if there are cases of not following the protocol, such as the Boeing 737 MAX issue, the punishment is severe ($20 billion, according to this CNN report).
The reality is that the software industry generally has no regulation and virtually no punishment for failures due to bad practices. Consequently, software Testing has always been under-appreciated and often neglected. This is unfortunate, as software (apps) are an essential part of most people’s life nowadays. Sometimes, software failures can lead to losses of lives, such as London’s computer-aided ambulance-dispatch system.
“Computerized systems can help, or hurt. In the case of the London Ambulance Service, it definitely hurt.” — The Wired.
The Solution is actually Simple, Quick and Low-Cost
First of all, Money is not the cause of software system failures due to load issues. All the failed systems (listed earlier) had been allocated an abundant budget for load testing. This is a human issue as those IT managers/senior tech leads simply do not understand load testing (they thought so, though).
In the article, “Simple Solution to Embarrassing Collapses of Census Website due to Poor Load Testing”, I showed a simple solution that can prevent that big embarrassment in Australia’s Census 2016. The tech giant IBM and Revolution IT (claimed Australia’s leading testing service provider) failed badly on load testing despite spending millions on staffing and software.
Those load testing engineers/consultants at IBM and Revolution IT (Revolution IT has changed its name to Ampion since that event) significantly lack of load testing knowledge and experience. From what I see, they only knew basic load testing with a specific tool. This might sound harsh. Considering a different perspective, load testing is exactly the same for 99+% of websites. If a claimed ‘bicycle’ rider fell on your bicycle ten times in a minute and used the bicycle brand difference as the excuse. Do you believe he/she actually could ride a bicycle?
The following aspects were where they failed:
Modern apps (in particular web apps) are dynamic, and traditional protocol-based (and expensive) load testing tools, such as Load Runner, are no longer applicable for most operations.
Software changes often (under the banner of Agile). If a severe performance/loading issue was not detected earlier, the software might pass the so-called “point-of-no-return”. The IT managers usually don’t really care much about the load issues. Hit the deadline is the top priority.
Certain load issues can happen with different activities on the system. Some fake load testers will hit the server in LoadRunner with thousands of threads (virtual users) on the login screen, which has little value as static pages are often cached.
Software managers/Tech Leads bet on their luck: just purchase more computing power (this sometimes worked at considerable cost, though sometimes not) if performance issues do appear.
Over my 20+ years of work in the software industry, I have found there hasn’t been a single software company/project that had a good load-testing process. Typically, the project manager just allocates a week or two to do performance/load testing by a separate team towards the end of development. Imagine if a core library/module has severe performance/load issues, what will the team do? It is simply too late to change. A “logical step” is to release it anyway.
The solution: In-Sprint Performance/Load Testing, a part of “In-Sprint Test Automation”.
My Specific Solution to load test Chengdu PCR system
The implementation shall only take one or two days. Besides human power costs, the other cost might be ~$30 for each load test execution.
First of all, the users of the PCR system are not residents but rather the dedicated & trained staff at the collection centres. In other words, the load is limited and predictable (by the number of collection centres). For the same reason, there won’t be highly unusual spikes (residents are lining up for it, they were getting used to waiting). In summary, this is an easy load-testing scenario.
Assume that we need to complete 20 million tests in 12 hours. Then, on average, there will be 462 hits per second, which is not a lot.
Here is how I will do it.
Set up the free BuildWise Server on an IAAS platform (AWS, Azure of Vultr)
Set up one BuildWise Agent machine, then clone 500 of them. (This can be easily achieved with VM or IAAS deployment scripts)
Develop a set of load tests in Selenium WebDriver (YES, real UI tests, driving the real UI in Chrome) or raw API tests in Ruby. This can be done quickly using TestWise, within 2 hours or less.
Create several load projects on the BuildWise server to run those automated tests, for 10 minutes or so.
Destroy all the build agent machines (except one).
The cost:
Two long-running cloud instances: BuildWise Server and the first Build Agent, cost: $40 — $80 per month.
One BuildWise Agent license: $30/month (for free version, relaunch after 40-min free up time)
500 Build Agent Cloud Instances, running for 1 hour. ~$0.027 /hr * 500 = $13.5. Here I quoted a Vultr VM instance I used, AWS/Azure probably double that amount.
some seasoned engineers would argue that 500 ‘real browser sessions’ does not equal 500 hits/second. That’s correct. How many hits the agent machines can generate depends on many factors: test scripts and execution. Here, 500 is just a starting number. If we can do 500, effort-wisely, there will be no difference to go for 1000 or 2000. We just need to pay more to IAAS providers.
Assuming you do two load test runs per day for a 30-day month. The total cost of a month of Continuous Load Testing is $920, about $15 per round of load testing. Every software project can afford this.
Final note, this setup can be used for load testing any web apps.
If you are interested in this Continuous Load Testing approach, check out my new book (80% complete, available to purchase now): Practical Performance and Load Testing on Leapub.
Related reading: