Getting to the Root of the Problem
One reason many software development companies spend so much effort and money fixing and supporting their software is the lack of proper Root Cause Analysis (RCA). People often mistake a root cause for a mere categorization of where a software defect or issue occurred, then do nothing with that information and never identify what can be done to prevent the issue from arising in future.
RCA is about the 'Why' instead of the 'What', 'When', 'Who' or 'How'
A true root cause analysis should be an integral part of handling any defect or issue that arises in the software development process. It entails not just categorizing where in the development process a particular incident arose, but also stating clearly what caused the incident in the first place (the true root cause) and what will be done to address it. What you are hoping to get out of every Root Cause Analysis a team conducts is the essential 'why' of the issue, rather than the 'what'. Most teams focus on what type of defect was found or what category of root cause it falls under, and do not invest enough time in explaining why incidents occur in the first place.
RCA is also a good way of gauging how well you understand what is happening in your system: if you cannot readily match an incident in the development process with an action item that would address it, there is a chance you don’t understand your system or the incident well enough. The truth is that everything can be improved upon, even if it is not always feasible to make the improvement.
It's Not About The Process
Now, on the face of it, I can understand how any team that has to deal with many issues on a daily basis might see this as process overkill, or as unnecessary given how many of these incidents are a result of our testing and quality processes actually working. However, although it might take a lot of energy to get a team started with true RCA, done correctly it should, over time, greatly reduce the number of defects picked up at any phase of the development cycle.
Additionally, RCA should not be confined to big production issues or end-to-end defects; it should extend even to something as minor as a failing unit test. Yes, it sounds like overkill and perhaps a little heavy-handed for the simple human error that can creep into code, but there is still merit in identifying things that can be done to improve code review processes or coding standards, or in identifying areas for improved training. I’m not saying that every incident will result in something that can be addressed, but it should at least be looked at from that perspective.
It's Not About The Metrics
How do you prevent RCA from becoming just another process that teams have to adhere to? Well, it sounds a little counterintuitive, but I think one of the things that helps is not to rely too heavily on measuring teams by the metrics gathered through the RCA process, as that could lead to metrics manipulation. Instead, simply challenge teams to constantly reduce the number of defects logged in production and the hours spent on maintenance. These are things teams should be actively doing anyway, and they are often best accomplished through proper RCA. Too often teams rely on capturing RCA metrics to identify a big, overarching gap in their organization, and while there is value in this, I wouldn’t make it a metrics target; I would treat it as something that can be used to identify prevailing trends in the organization. If there is one metric from any form of RCA that I would encourage a team to be measured against, it would be RCA solutions implemented. This ensures that teams are taking the process seriously and using it not just for tracking, but for actually making a change in the way they work.
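To make the idea of measuring "RCA solutions implemented" a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `RcaRecord` structure, its field names and the `solutions_implemented_ratio` helper are hypothetical and not taken from any particular tool. The point is that each record carries the category, the 'why' and an action item, and the metric looks only at whether those action items were actually carried out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RcaRecord:
    """Hypothetical shape of one RCA entry (field names are assumptions)."""
    incident: str                      # short description of the defect or issue
    category: str                      # the 'what'/'where': a mere categorization
    why: str                           # the true root cause
    action_item: Optional[str] = None  # what will be done to prevent a repeat
    action_implemented: bool = False   # has the action item been carried out?


def solutions_implemented_ratio(records: List[RcaRecord]) -> float:
    """Share of RCA action items that have actually been implemented.

    Records without an action item are left out of the denominator, since
    not every incident results in something that can be addressed.
    """
    with_action = [r for r in records if r.action_item]
    if not with_action:
        return 0.0
    implemented = sum(1 for r in with_action if r.action_implemented)
    return implemented / len(with_action)
```

A team could recompute this ratio at the end of each sprint or release. A low value would suggest that RCA is being used only for tracking, not for changing the way the team works.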
I would also ease the team into the process by focusing on high-priority customer-facing issues first, and only moving on to some of the smaller development incidents once you have these under control. Companies and teams often move away from performing true RCA because they throw themselves into it too readily and it becomes a massive time sink, rather than easing it into their teams organically through culture and habit. Again, it’s about helping teams understand the why of what they are doing, rather than what or how they are doing it or how they are being measured.
It's About Prevention rather than Mitigation
RCA is a big undertaking for any software development team, but an imperative one if you hope to get drastically better at what you do. After all, the whole point of software quality is not finding defects but preventing them, and this principle is exactly what RCA is all about. I’m not promising that a well-executed RCA process will make your defect count bottom out, because that is unlikely when we’re always improving our software, but it should significantly reduce the incidents that creep into your development and allow the team to focus on adding features to the software rather than maintaining the existing ones.