Friday, November 27, 2009

Practice makes perfect (but perfect practice makes perfect perfect!)

I'm back with another in the series of IT missteps.  Your feedback and stories are welcome. 

If you asked me where my most painful operational missteps came from, I'd have to tell you they were the times when we did something in production without properly rehearsing it elsewhere.  So today's rule is this:

Don't do it for real until you've done it somewhere safe first.

Yes, yes, I know there are exceptions.  I'll discuss those, too.  First, let me enter the IT Confessional and describe a few of these painful experiences.

Examples and lessons learned

Now, I've known this as a generality for a long time, but there are always nuances.  A few years ago we were doing an upgrade to an application, and at the same time we were upgrading the underlying database.  On the positive side, we'd done several dry run upgrades - we knew what we needed to do, and we'd tested the functionality as completely as we could.  So we'd practiced - and we'd perfected the thing we practiced.  Yay team!  Here's the problem: we weren't practicing it the same way we were going to do it "for real."

Basically, in test we were installing a new copy of the database and then moving the data over, while in production we were upgrading the existing database in place.  In the end, our functionality was failing in production for a small (but critical) corner of the application because of a parameter that determined how database audit log files should be written.  Now, I'd like to argue that there's no way that parameter should matter - it was a fundamental bug in our database.  And that's wonderful, except that after we backed out all the changes and tried to figure out what happened, that's exactly what the evidence showed - clearly and repeatably.

Another painful example came during another deployment.  Part of the work involved running a program which performed an update to each user's profile.  Again, we'd tested this several times.  This time the problem was simple and foreseeable.  In our tests, we didn't need to lock out users during the execution of this program - they weren't going to log in to our test environment.  But in production we did lock them out, because we wanted to be sure they weren't in during the time the changes were being made.  The catch?  Well, the program which updated the profiles ignored all the locked-out users.  Oops.  Again, we'd done a great job of perfecting what we practiced, but not a good job of practicing perfection.
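To make the misstep concrete, here's a minimal sketch in Python.  The names, fields, and logic are invented for illustration - the original program isn't described in that much detail - but the shape of the bug is the same: a filter that conflates "locked out for the maintenance window" with "inactive, skip this one."

```python
# Hypothetical sketch of the misstep; field names are invented.

def update_profiles(profiles):
    """Apply a migration step to every user profile.

    In rehearsal no account was ever locked, so this filter looked
    harmless.  In production, every account was locked -- so every
    account was skipped.
    """
    updated = []
    for p in profiles:
        if p.get("locked"):      # the fatal assumption: locked == inactive
            continue
        p["schema_version"] = 2  # the actual migration work
        updated.append(p)
    return updated

# During the production window, all users are locked out:
profiles = [
    {"user": "alice", "locked": True},
    {"user": "bob",   "locked": True},
]
print(len(update_profiles(profiles)))  # 0 -- nothing gets migrated
```

The rehearsal passed precisely because no test account was ever locked.  The fix is to keep "may not log in right now" and "should not be migrated" as separate concepts - and, per today's rule, to rehearse with the lock-out in place.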


Exceptions: how do they change the game?

Let's talk about the exceptions.  Sometimes you simply can't have your rehearsal be exactly the same.  The laboratory of real life is difficult to duplicate.  But there are different facets to rehearsals and we can look at those.
  • People - who is doing the work
  • Processes - how is the work being done
  • Technology - what hardware and software is being used
  • Schedule - when does the work happen
For me, people are the most critical element and often the easiest to control.  I want the person doing the work to have done it at least once in a dry run.  Watching someone else do it just isn't the same.  I've watched Tiger Woods play golf, and I assure you that hasn't raised my game to his level.  Whether it's a large implementation go-live or a new staff member working on a help desk, there's usually some way to simulate the work to be done.

Processes sometimes have to differ between a rehearsal and real execution.  The key is to understand where they're different and understand the risks.  For example, we do periodic tests to ensure we can bring up our disaster recovery alternate site.  We don't actually fail production over, so we have to do a few steps differently.  We understand the risks - perhaps that we're not testing a required DNS change - but we've weighed that against the risks of test-induced failure and decided it was the right approach.

Technology is usually going to differ based on some sort of economic/funding issue.  Some systems are large and expensive enough that you can't fully duplicate them in a practical way.  Perhaps your production and test environments share the same SAN behind the scenes.  Maybe you've got faster, more expensive hardware for production, but older/slower/cheaper hardware in test.  So when you make the decision to have a 2-node cluster in test and an 8-node cluster in production, just remember that some of that savings is offset by the cost of analysis to understand the differences, and the occasional "ooops" that happens when they behave differently.

Schedule is tricky.  For example, some tasks may require going 24x7 to complete, but while rehearsing them that isn't practical.  Also, there are some things which just can't be duplicated.  My favorite example was during a large deployment we did earlier this year.  We purposefully chose a weekend in January because the following Monday was a holiday and the Tuesday was Inauguration Day in the US, and since we're based near DC, about half our customers had the day off.  That seemed like excellent timing.  We could have the time to do the work properly while minimizing the impact to our customers.  Which was all dandy until we asked some of our remote customers to perform validations.  "The system is too slow to use - we can't validate."  Just as panic was setting in, we realized it was a few minutes before noon and their office network was flooded with people streaming video of the ceremonies.  Schedule is probably the most vulnerable facet in terms of unexpected external forces.

Recommendations and summary


Decide what "right" is and then rehearse it as you're planning to actually do it.  If you practice something other than what you plan to do, you'll get really good at doing something other than what you want.

Set the rules, and then understand the exceptions.  Of course there will be exceptions.  Have a process for describing, analyzing, and accepting (or rejecting) them.  We don't want to live in the Wild West, but by the same token, a strict, purist approach is usually not practical.  Do what you do with forethought and intent.

In fact, this leads me to the set of rules I have posted in my office...which I'll cover in an upcoming post.

I'd love to open the door to the IT Confessional to hear what stories you have.  Please feel free to add comments with your best (and worst) related stories.
Saturday, November 21, 2009

There Are No Temporary Solutions

In my first series of posts, I'll look at common missteps in the IT world. These are things which I've seen over and over again. For each, I'll give some examples and share some ways to see them coming - oh, and I'll throw out some self-defense techniques for avoiding them, too.

So - the first item on the list of IT Missteps is the temporary solution.

Here's the rule - say it with me: There are no temporary solutions.

What is a temporary solution?

Call it a one-off, a work-around, or a temporary solution - these are the things which were meant to be in place as a stop-gap, but end up becoming calcified into the environment. Sure, sometimes they really are temporary, but you should look at them with skepticism and with an eye towards the long-run.

Did you ever get into a situation like this:

You are trying to get data from one system to another. You've heard that someone is working to build out a new architecture for data transfer but that's just a start-up project. But hey, you need to get this data moving now. So what happens? You end up with some horked up batch file transfer to feed the data across. Later, you find out that the data transfer you need is outside the scope of that architecture project. Welcome to the not-so-temporary Temporary Solution.
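For illustration only - the paths, file names, and schedule below are invented, not from the original story - such a stop-gap typically looks like a few lines of scripting with no retries, no validation, and no named owner:

```python
# Hypothetical stop-gap: shove last night's exports into the other
# system's drop directory.  Note everything that's missing: error
# handling, checksums, deduplication, alerting, and any record of
# who owns this job.
import shutil
from pathlib import Path

def nightly_push(export_dir: str, drop_dir: str) -> int:
    """Copy every CSV export into the drop directory; return the count."""
    pushed = 0
    for f in sorted(Path(export_dir).glob("*.csv")):
        shutil.copy2(f, Path(drop_dir) / f.name)  # silently overwrites
        pushed += 1
    return pushed
```

It works - which is exactly the problem.  Because it works, nobody revisits it, and two years later it's still running.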

Or maybe you run into a bug in some internally-supported software. You can't run the report...but it works fine if you exit and restart the application. So everyone who needs to run the report does just that.

Why does it happen and why is it a problem?

These sorts of solutions are often a matter of risk transfer or effort transfer. It's easier for me to have you do extra work. Other times it's a matter of not having a choice at all - you have to do something now, but in doing that, the future isn't considered.

Temporary solutions become permanent for a few reasons. Sometimes they're just good enough to live with. It's like having a minor chronic injury - you can live with it, you can function - just not as well as you should, and it's hard to know when you should stop living with it and seek treatment. And sometimes there is hand-waving about future fixes, but that cavalry doesn't make it over the hill.

What makes these one-offs a problem? Well, if it really were temporary it wouldn't be so bad. But in the long run, one-off solutions are expensive and difficult to support. There's typically a lack of thought about the workaround since it's only temporary anyway.

These one-offs are like stray animals looking for a (support) home. An ex-boss of mine once put it this way: "shoot it, or adopt it." Okay - maybe that's a harsh way to say it - but the point remains the same - you either need to make it part of your business or you need to end it. The in-between state where you feed the stray and it stays constantly on the edge of starving doesn't help anyone.


What's a good self-defense against the temporary solution?

First, realize that having a workaround shouldn't let anyone off the hook for a root cause. In ITIL, this can be a matter of the graduation from incident management to problem management.

Second, be conscious of when something is about to become the way you do business.  Do the processes and support mechanisms exist for it?  Do the resources exist?  What costs does this impose?

More generally, there needs to be a connection between the temporary solution and the long-term solution.  We can set up the batch file method while also adding this data feed to the scope of the architecture project.  I'll restart the application in parallel with your search for the root cause of the report issue.

This is a problem which demands a holistic approach to get the optimal results.  You need a good problem management process in place.  You need an analysis which informs a gating decision on adding workarounds and partial solutions to your business.  And finally, you need the governance to ensure you don't get trapped in the next misstep: this is too important to wait.  We'll take a look at that in an upcoming post.