9 ways to reduce IT incidents

Every time there’s an IT incident, no matter how big or small, there’s an impact on business productivity and on IT’s reputation. I’m therefore a big believer in the mantra that the best way to improve IT support is to reduce the need for support in the first place. By proactively reducing incident volumes, you increase customer satisfaction with IT services, reduce support costs, and free up IT staff to do work that’s more rewarding than firefighting.

So, what can you do to reduce IT incident volumes?

1. Implement ITIL Change Management

There’s an oft-quoted ‘fact’ that 80% of incidents are caused by changes made to the IT environment. I’ve never been able to find the source of this statistic (yes, it’s quoted in one of the ITIL books, but the authors didn’t say where they got it from) but it does pretty much line up with our experience. As with all ITIL processes, you don’t have to adopt it lock, stock and barrel, but you definitely do want to stop IT staff from making changes without there being a degree of scrutiny that’s commensurate with the risk of the change.

2. Improve software release processes

This one is so obvious that I almost didn’t include it in the list. However, we’re seeing more and more clients making great improvements in the area of testing and release. The use of testing tools is an obvious starting point, but continuous integration, automated testing and automated deployment are becoming (have become?) mainstream and are worthy of consideration as part of your improvement plans. Martin Fowler provides a great introduction to this topic here: http://martinfowler.com/articles/continuousIntegration.html.

3. Proactively identify incident trends (As part of ITIL Problem Management)

A periodic trend analysis of incident records will help you identify and eliminate the source of reoccurring incidents.

Depending on how you categorise incidents, one way of doing this is to look at incident volumes by incident category and gradually drill down. For example:

Over 6 months, which Level 1 Incident Category had the highest volume of incidents? Communication Services.
For the Communication Services category, which Level 2 Incident Category had the highest volume of incidents? Email.
For the Email category, which Level 3 Incident Category had the highest volume of incidents? Password.
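The drill-down above can be sketched in a few lines of Python. This is only an illustration: the incident records, the category names beyond those mentioned above, and the `drill_down` function are made up for the example, and a real Service Management tool would normally supply the data via a report or export.

```python
from collections import Counter

# Hypothetical incident records: (level 1, level 2, level 3) category per incident.
incidents = [
    ("Communication Services", "Email", "Password"),
    ("Communication Services", "Email", "Password"),
    ("Communication Services", "Email", "Mailbox full"),
    ("Communication Services", "Telephony", "Handset fault"),
    ("Hardware", "Printer", "Paper jam"),
    ("Hardware", "Laptop", "Battery"),
]


def drill_down(records):
    """Return the highest-volume category at each level, narrowing as we go."""
    path = []
    remaining = list(records)
    depth = len(remaining[0]) if remaining else 0
    for level in range(depth):
        counts = Counter(record[level] for record in remaining)
        top_category, volume = counts.most_common(1)[0]
        path.append((top_category, volume))
        # Keep only incidents in the top category before going one level deeper.
        remaining = [r for r in remaining if r[level] == top_category]
    return path


top_path = drill_down(incidents)
for level, (category, volume) in enumerate(top_path, start=1):
    print(f"Level {level}: {category} ({volume} incidents)")
```

Each pass counts only the incidents that fell into the previous level's top category, which mirrors the way the questions above narrow from Communication Services to Email to Password.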

If your incident data or reporting tool doesn’t support meaningful analysis, an alternative approach is to ask the technical domain experts one simple question: “What can we do to reduce the number of incidents that come to your team?”. The domain experts usually know what recurring incidents they keep having to resolve and have a good idea of what can be done to eliminate them.

Once a particular area has been found to be creating a disproportionately high number of incidents, you should determine what can be done to eliminate that source of incidents, e.g. by automation, a software fix, or replacing a piece of infrastructure. Sometimes you’ll be amazed at how many incidents can be eliminated by changing a business process, providing customer training or updating an FAQ.

Problem Records in your IT Service Management tool provide a means to capture, prioritise and manage incident-reduction work.

4. Identify recurring incidents ‘on the fly’ (As part of ITIL Problem Management)

Get the Service Desk and 2nd level support teams to create a Problem Record in your IT Service Management tool when they notice that an incident keeps reoccurring, e.g. when an engineer notices that this is the third time this month that a server has needed to be rebooted.

Problem Records can then be assigned to the appropriate staff (a Problem Management team in larger IT departments, technical domain experts in smaller teams) so they can investigate the root cause and determine how it can be eliminated.
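If your tool lets you export resolved incidents, the "third time this month" check can also be run as a simple safety net alongside the engineers' own judgement. The sketch below is an assumption-laden illustration: the server names, symptoms, the threshold of three and the `problem_candidates` function are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical resolved incidents: (date resolved, configuration item, symptom).
resolved = [
    (date(2024, 3, 4), "srv-app-01", "reboot required"),
    (date(2024, 3, 12), "srv-app-01", "reboot required"),
    (date(2024, 3, 19), "srv-db-02", "disk full"),
    (date(2024, 3, 27), "srv-app-01", "reboot required"),
]

RECURRENCE_THRESHOLD = 3  # assumed: three occurrences in a calendar month


def problem_candidates(incidents, year, month, threshold=RECURRENCE_THRESHOLD):
    """List (item, symptom) pairs seen at least `threshold` times in the month."""
    counts = Counter(
        (item, symptom)
        for resolved_on, item, symptom in incidents
        if resolved_on.year == year and resolved_on.month == month
    )
    return sorted(key for key, count in counts.items() if count >= threshold)


candidates = problem_candidates(resolved, 2024, 3)
for item, symptom in candidates:
    print(f"Raise a Problem Record: {item} - {symptom}")
```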

5. Prevent the recurrence of major incidents (As part of ITIL Problem Management)

After a major incident, always conduct a root cause analysis to understand what caused the incident and how it can be prevented from happening again. This is usually done as part of the Major Incident Review. Obviously, the important bit is to make sure that whatever actions were identified to stop the major incident from reoccurring are actually carried out. This sounds simple but many organisations seem to stop at minuting the findings of the review.

Once again, Problem Records provide a means to capture, prioritise and manage the work required to prevent major incidents from reoccurring.

6. Monitor for events

Monitoring tools should be used to proactively monitor the network and servers and generate alerts that warn IT operations of looming problems, e.g. when disks start filling up or servers start becoming overloaded. Ideally, these tools should be configured to generate an Incident Record when certain thresholds are breached so that steps can be taken to return things to an acceptable state before service is impacted. The automatic creation of Incident Records means that there is a record of events that have required intervention, and these records can be analysed as part of the proactive trend analysis we discussed in #3.
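To make the threshold idea concrete, here is a minimal sketch of a disk check that produces a record when usage crosses a limit. The 80% threshold, the `IncidentRecord` class and the `check_disk` function are assumptions for illustration only; a real monitoring tool would raise the record through its own integration with your Service Management tool.

```python
import shutil
from dataclasses import dataclass

DISK_WARNING_THRESHOLD = 0.80  # assumed: raise an incident at 80% full


@dataclass
class IncidentRecord:
    """Hypothetical stand-in for a record in an IT Service Management tool."""
    summary: str
    source: str = "monitoring"


def disk_used_fraction(path="."):
    """Return the fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(name, used_fraction, threshold=DISK_WARNING_THRESHOLD):
    """Return an IncidentRecord if usage breaches the threshold, else None."""
    if used_fraction >= threshold:
        return IncidentRecord(
            f"{name}: disk {used_fraction:.0%} full (threshold {threshold:.0%})"
        )
    return None


record = check_disk("local disk", disk_used_fraction("."))
print(record if record else "Disk usage is within the threshold")
```

Setting the threshold below the point of failure is what buys the time to act before service is impacted.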

7. Communicate proactively to your customers

A good Service Desk will proactively notify customers when there is an incident that affects them, e.g. by a recorded message on the phone system, text messages, notifications on an incident logging portal, or a handwritten notice on the broken printer. Proactively notifying customers of an incident will prevent multiple customers from reporting the same incident. While technically this does not reduce the number of incidents, it does reduce the overhead associated with managing all those extra calls.

8. Seek opportunities to use FAQs & self-help

Encouraging your customers to log their incidents via an online portal (rather than by telephone or email) provides you with an opportunity to ‘intercept’ the customer before they log an incident. Not only can you let them know if there is a major incident underway that you are already aware of and working on, but you can also give them access to FAQs or knowledge articles that will enable them to help themselves.
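As a rough sketch of what that ‘intercept’ step might look like, the snippet below matches the words a customer types against a small set of FAQ keywords and also returns any major incidents already underway. The article titles, keywords, incident text and the `suggest_before_logging` function are all hypothetical; real portals typically use their own search rather than simple keyword matching.

```python
# Hypothetical knowledge base: article title -> keywords that suggest it.
FAQ_ARTICLES = {
    "How to reset your email password": {"password", "reset", "locked"},
    "Connecting to the office printer": {"printer", "print", "printing"},
    "Setting up VPN access": {"vpn", "remote"},
}

# Hypothetical list of major incidents already being worked on.
ACTIVE_MAJOR_INCIDENTS = ["Email service outage (example)"]


def suggest_before_logging(description):
    """Return known major incidents and FAQ articles to show before submission."""
    words = {word.strip(".,!?").lower() for word in description.split()}
    articles = [
        title for title, keywords in FAQ_ARTICLES.items() if words & keywords
    ]
    return {"major_incidents": list(ACTIVE_MAJOR_INCIDENTS), "articles": articles}


suggestions = suggest_before_logging("I forgot my password and cannot log in.")
print(suggestions)
```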

9. Promote the right attitudes in IT staff

Having IT staff who have the right attitude to quality will go a long way to compensating for not having good processes and quality controls in place. Conversely, good processes will not compensate for cowboys who don’t give a stuff about what happens when their change is made to the production environment. Promoting the right behaviours is about having some good people management practices in place, e.g. hiring on attitude not just aptitude, clearly communicating IT’s values and behavioural expectations, embedding quality-related targets in your performance management/reward system, and having regular one-on-ones.

Are you doing all of these things? Are you doing something not on this list that has had a big impact on your incident volumes?

This Post Has 2 Comments
  1. You commented that the ITIL books quote this figure of 80%. Do you by any chance have that reference in the ITIL books please?

    1. I’ve had a quick look in the ITIL V3 and 2011 manuals and couldn’t find the reference to the quote “80% of unplanned downtime is due to failure caused by people and processes”. It may be in the V2 manuals but I no longer have a copy of those. However, the original figure comes from a 1999 Gartner report titled “Making Smart Investments to Reduce Unplanned Downtime”. A Google search found a copy here: http://www.maoz.com/~dmm/complexity_and_the_internet/downtime.pdf
