Making things reliable in a software-dominated world
Before the advent of the software-powered reality, things used to work in a predictable way. You could call a doctor’s office and be sure your appointment was booked. A sure-fire way of turning off your electric oven was to simply press the button.
These days, in an ever-increasing number of cases, we depend on software to carry out these tasks. And since fragility grows with complexity, the websites, programs, and IoT devices we interact with may fail us for various reasons.
There is a lot of concern about IoT security and about AI and automation spreading too quickly. At the same time, there is hardly any escape from “software eating the world” at this point.
In Victorian times, people feared that riding a train could drive you mad, because they believed that vibrations produced by railroad carriages affected your brain. This did not stop railroads from being constructed, and they quickly proliferated.
Likewise, concerns about software getting out of hand are not likely to stop its expansion into our lives. However, just as Victorian-era railroad administrators took steps to reduce the number of “train madness” incidents (e.g. by putting see-through doors between cars), the IT community should take steps to make software more reliable and fail-safe.
Reliability as a quality parameter
Quality assurance is a science, and if you treat it as such, you’ll realize that “reliability” is a quality parameter. It’s also quite complex.
Reliability testing shows how frequently a defect reveals itself. Keep in mind that nobody can tell you the precise number of errors in your software. An application failure can be due to a combination of errors. Errors can also compensate for one another, and I’ve seen situations where, after a partial bug fix, the software started to work even worse than before.
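To make "how frequently a defect reveals itself" a bit more concrete, here is a minimal sketch in Python. It assumes you already have a pass/fail record for a series of test runs. The names `estimate_reliability` and `ReliabilityEstimate` are made up for this illustration and do not come from any particular testing tool.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityEstimate:
    runs: int
    failures: int
    failure_rate: float                 # observed failures per run
    mean_runs_between_failures: float   # infinity if no failure was observed


def estimate_reliability(outcomes):
    """Summarize a series of test runs.

    `outcomes` is an iterable of booleans (True = the run succeeded).
    This only describes what was observed; it cannot reveal how many
    defects are actually hiding in the software.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one observed run")
    runs = len(outcomes)
    failures = sum(1 for ok in outcomes if not ok)
    failure_rate = failures / runs
    mean_between = runs / failures if failures else float("inf")
    return ReliabilityEstimate(runs, failures, failure_rate, mean_between)


# Hypothetical record: 100 runs of the same scenario, 2 of which failed.
record = [True] * 98 + [False] * 2
print(estimate_reliability(record))
```

A number like this is only as good as the scenarios behind it, so it is worth tracking from release to release rather than treating any single value as proof of reliability.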
Reliability further incorporates such characteristics as:
- Security
- Fault tolerance
- Maintainability
- Flexibility
- Others.
Some developers overlook reliability testing in favor of functional testing, because they think the latter is the more immediate necessity. As a result, they may be gambling with everything their business has.
Your app is either dead or alive, no third option
“Epic” software failures are rare, but when they occur, they draw a lot of attention and deal a serious blow to the company’s reputation. For an individual business, the consequences can be severe.
When an app fails, there is often no middle ground. The app either works or it doesn’t. Situations where it works very incorrectly can be even worse than when the app doesn’t work at all.
At the same time, why is it that some app failures go unnoticed, while others cause a massive outcry? Usually it’s because of the following factors (which are good to consider in the context of reliability).
- Vital interests
If a malfunction in your software can cause deaths, significant damage to the health of living beings, or otherwise endanger someone’s vital interests, then it had better be reliable.
Example
In 2016, “smart” thermostat maker Nest released a new version of their app, which contained a bug. The bug drained the batteries of some devices in the middle of the night. To make things worse, that was the night when a cold cyclone hit much of the US, and many people were woken up by crying babies or by the freezing cold in the room. Others tweeted that they were concerned about pipes bursting in their unoccupied homes.
- Large user base
How many users do you have? If you offer a niche product and your app has some 10,000 users, an outage may not have serious consequences. But if your software is used by 70% of your country’s population for some critical activity, it has to be tested thoroughly.
Example
In 2014, a computer failure wreaked havoc at UK airports near London. A glitch in the software at the main air traffic control center resulted in many flights being cancelled, passengers missing their flights, and carriers scrambling to find hotels for out-of-town travelers. Transport Secretary Patrick McLoughlin called a service disruption on this scale “simply unacceptable” and asked NATS, the organization responsible, to explain what had happened so that future problems could be prevented.
- Financial interests
Some software mistakes can cost people a lot of money and endanger their financial interests. As a rule, banking and trading software undergoes rigorous testing, but that does not always prevent failures.
Example
In 2012, Knight Capital Group, a stock trading firm, lost $440 million in about 45 minutes due to a software error. A flawed trading algorithm started buying and selling stocks completely on its own, and its irrational orders caused erratic swings in stock prices. By the time someone took note and stopped the frenzy, the damage was done.
Get an emergency plan
Apart from testing for reliability, it’s a good idea to work out an emergency plan. What do you do if there’s a major problem with the software?
As PetNet’s case (when lots of pets missed their scheduled meals because of a server outage) demonstrates, a poor emergency response during a crisis can anger users even more than the problem itself. Some users complained that support lines were busy and that information about the outage was shared only via email, with many of those messages landing in spam folders. At the same time, no notifications were pushed to the app itself or posted on social media.
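One piece of such a plan is to announce an outage on every channel you have, so that a single broken or ignored channel does not leave users in the dark. Below is a minimal sketch of that idea in Python. It is an illustration under stated assumptions, not a description of any real notification service: `broadcast_outage_notice` and the channel functions are hypothetical placeholders, and a real system would call the actual email, push, and social media integrations instead.

```python
def broadcast_outage_notice(message, channels):
    """Try to send `message` over every channel.

    `channels` maps a channel name to a function that takes the message
    and raises an exception if delivery fails. A failure on one channel
    must not prevent the others from being tried.
    Returns a dict {channel name: True if sending succeeded, else False}.
    """
    results = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = True
        except Exception:
            # Record the failure and keep going with the remaining channels.
            results[name] = False
    return results


# Hypothetical channels, for illustration only.
def send_email(message):
    raise RuntimeError("mail server unreachable")


def send_push(message):
    print(f"[push] {message}")


def post_to_social_media(message):
    print(f"[social] {message}")


status = broadcast_outage_notice(
    "We are investigating a server outage. Scheduled actions may be delayed.",
    {"email": send_email, "push": send_push, "social": post_to_social_media},
)
print(status)
```

The returned status makes it easy to see which channels failed, so the support team can follow up manually where automation fell short.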
It boils down to this
Again, there is no single indicator that will tell you whether your software is reliable or not. Just like in a house of cards, there can be many individual weak points, any of which can take the whole system down.
Poor test coverage and poor maintainability increase the risk that a new release will break existing functionality. At the same time, basic XP practices such as code reviews can help you avoid silly (yet potentially costly) mistakes.
Hence, it’s better to leave reliability testing to professional QA managers, since they are aware of the common pitfalls.