The New Year is traditionally a time for reflections as well as resolutions. With Throwback Thursday we are happy to offer you both, as we look back to June 2008, when Freeform Dynamics published a report on risk and resilience in IT systems, especially servers. Since we are looking back more than a decade, you might be thinking that surely, infrastructure resilience is far better today than then. And you’d be right in many ways, but of course the bigger picture is rather more complex than that.
Indeed, the survey of 1,223 respondents, mainly IT professionals, did start by reporting that system failures occurred frequently, with over half stating they happened at least once a month. Today, hardware and software improvements have made such incidents less common, and also made them and their impact much less visible to end users.
All well and good, but the findings went on to mention other factors that impact resilience which are still prevalent today. Close to the top of the list was that funding needed to build in resilience was often not included when projects were conceived and costed.
In many cases, this reflected a parallel challenge, which was that project planners did not even think about resilience at the outset, only planning for it much later in the process, if at all. One reported consequence of this was that by the time risk and resilience were finally looked at, it was often too late to do anything about them without causing delays to go-live dates. This is a problem we still see happening, 13 years later.
The need for resilience has broadened
Another issue reported back then was the gap between the levels of risk and resilience that the business demanded and what was actually in place. This is just as much a problem today, and maybe even more so, as business users now demand resilience in a much broader range of systems, not just business-critical ones.
One significant challenge reported was “software that was not ready for routine business operation” or that had just not been adequately tested. Today this still sounds very familiar. Indeed, some ISVs almost appear to want users to do the testing for them! Sadly, other issues reported back then that still turn up when we run surveys today include very basic problems, such as a lack of effective monitoring tools and simply not having failover solutions ready.
Recovery testing remains an issue
Lastly for now, something else that is certainly still a major challenge for system resiliency is how little time is given to making sure that recovery systems work. Without testing the ability to recover when something goes wrong, attempting to do so for the first time when any system fails can be fraught with problems. The result, too often, is slow recovery and huge stress.
If you fancy wandering back over a decade to see what, if anything, has changed in systems risk and resilience, please take a look at the report here.