Armageddon

30 Oct, 2018

“The entire Internet is down,” the CFO’s admin assistant tells you as you walk past his office.

You stop and turn towards him. “The entire Internet? What’s wrong?”

“I can’t get to any websites and the phones are down and my email is disconnected.”

Crisis mode – time to leap into action. As an IT person, you’re familiar with the feeling. Your mind immediately rips through all the things it could be: the network, the servers, the storage system, an asteroid.

You quickly rule out the asteroid – there doesn’t appear to be any smoke, fire, or smell of sulfur. But as you hurry to your desk, you see everyone is standing up and freaking out. You hurriedly tell people you’re looking into it as you zoom past.

Your first inclination is to get to your computer and start testing. But resolving this situation quickly will likely take more than your technical skills.

What are the things you can do to improve this situation?

Technical knowledge alone won’t get you through this in the best and most efficient manner. Of course, you could put on your noise-canceling headphones, hunker down at your computer, and figure it out, maybe even fix it right away. But that will likely have some unintended negative consequences.

Here’s a list of important considerations to prepare for the next major outage:

1. Communication is key. Immediately appoint a point person to communicate the current status of the issue. Ensure they keep key stakeholders apprised frequently. Ask the stakeholders how often they want to be updated – it will likely be more often than you think. Stick to that schedule and be early with it.
2. Establish communication between all IT disciplines (servers, network, database, applications) immediately. It’s good to have a protocol established ahead of time for notification of an outage. For example, have a conference bridge line dedicated to major outages. Keep this call to technical staff only. The point person in step 1 should relay updates to the stakeholders. This lets the techs be techs and tackle the problem head on, without fear that executives or other managers will step in and distract from the resolution.
3. Don’t allow the blame game. Everyone should be checking the health of the systems and processes within their control. Stick to the facts of the situation. Defer any discussion of blame until the post-incident review (see step 6).
4. Focus on what has changed, if anything. If there is a change management system, have someone review the change log for any changes made around the time of the incident.
5. Be positive and keep your perspective. This very likely isn’t a life-or-death scenario, although some will act like it is. Getting fired up will cloud your view of potential solutions.
6. Always follow through with a post-crisis review meeting with all parties involved. Openly and honestly discuss potential improvements in the crisis management process as well as steps to help prevent future outages.
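
If your change log can be exported, the “focus on what has changed” review can be partly scripted so nobody has to scroll through it by hand mid-crisis. Here is a minimal sketch in Python. The `Change` record, its field names, the `changes_near_incident` helper, and the 24-hour lookback window are all assumptions made up for illustration; a real change management system will have its own export format and you would adapt the fields to match.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical record shape; adapt to whatever your change system exports.
    change_id: str
    system: str
    applied_at: datetime
    summary: str


def changes_near_incident(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes applied within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


if __name__ == "__main__":
    # Made-up sample entries, only to show the shape of the output.
    incident = datetime(2018, 10, 30, 9, 15)
    log = [
        Change("CHG-1", "firewall", datetime(2018, 10, 30, 8, 50), "Rule update"),
        Change("CHG-2", "dns", datetime(2018, 10, 29, 22, 0), "Zone file edit"),
        Change("CHG-3", "storage", datetime(2018, 10, 20, 12, 0), "Firmware patch"),
    ]
    for change in changes_near_incident(log, incident):
        print(change.applied_at, change.change_id, change.system, change.summary)
```

Sorting newest first puts the changes closest to the outage at the top, which is usually where you want to start looking. Treat the result as a list of leads to check, not as proof of the cause.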

Following these steps and being disciplined about them will help you run better IT services for your organization. Open communication with those you serve will continue to build trust and rapport, which are essential to a great organization.