If you asked me where my most painful operational missteps came from, I'd have to tell you they were the times when we did something in production without properly rehearsing it elsewhere. So today's rule is this:
Don't do it for real until you've done it somewhere safe first.
Yes, yes, I know there are exceptions. I'll discuss those, too. But first, let me step into the IT Confessional and describe a few of these painful experiences.
Examples and lessons learned
Now, I've known this as a generality for a long time, but there are always nuances. A few years ago we were doing an upgrade to an application, and at the same time we were upgrading the underlying database. On the positive side, we'd done several dry run upgrades - we knew what we needed to do, and we'd tested the functionality as completely as we could. So we'd practiced - and we'd perfected the thing we practiced. Yay team! Here's the problem: we weren't practicing it the same way we were going to do it "for real."
Basically, in test we were installing a fresh copy of the database and migrating the data into it, while in production we were upgrading the existing database in place. In the end, our functionality failed in production for a small (but critical) corner of the application because of a parameter that determined how database audit log files should be written. Now, I'd like to argue that there's no way that parameter should have mattered - it was a fundamental bug in our database. And that's wonderful, except that after we backed out all the changes and tried to figure out what happened, that's exactly what the evidence showed - clearly and repeatably.
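A cheap guard against this class of surprise is to diff the relevant configuration between the rehearsed environment and production before the real run. Here's a minimal sketch of the idea; the parameter names and values are hypothetical stand-ins, not our actual database settings:

```python
# Sketch: compare configuration parameters between the rehearsed
# environment and production, and flag anything that differs.
# The parameters below are hypothetical examples.

def diff_params(test_params, prod_params):
    """Return {name: (test_value, prod_value)} for every parameter
    that differs, or that exists in only one environment."""
    diffs = {}
    for name in sorted(set(test_params) | set(prod_params)):
        t = test_params.get(name, "<missing>")
        p = prod_params.get(name, "<missing>")
        if t != p:
            diffs[name] = (t, p)
    return diffs

test_db = {"audit_log_mode": "single_file", "block_size": 8192}
prod_db = {"audit_log_mode": "rotating", "block_size": 8192}

for name, (t, p) in diff_params(test_db, prod_db).items():
    print(f"MISMATCH {name}: test={t!r} prod={p!r}")
```

It won't tell you which differences matter - that still takes analysis - but it turns "we didn't know the environments differed" into "we knew and accepted the difference."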
Another painful example came during a different deployment. Part of the work involved running a program that performed an update to each user's profile. Again, we'd tested this several times. This time the problem was simple and foreseeable. In our tests, we didn't need to lock out users while the program ran - they weren't going to log in to our test environment. But in production we did lock them out, because we wanted to be sure nobody was in the system while the changes were being made. The catch? The program that updated the profiles ignored all the locked-out users. Oops. Again, we'd done a great job of perfecting what we practiced, but not a good job of practicing perfection.
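The failure mode is easy to see in miniature. This sketch (user records and field names are hypothetical, not our actual system) shows an update pass that silently skips locked accounts - exactly the path that never fired in test, because no test user was locked:

```python
# Sketch of the failure mode: an update pass that silently skips
# locked accounts. Records and field names are hypothetical.

users = [
    {"name": "alice", "locked": False, "profile_version": 1},
    {"name": "bob",   "locked": True,  "profile_version": 1},
]

def update_profiles(users, include_locked=False):
    """Bump each profile version; return the names of skipped users
    so the skip is visible instead of silent."""
    skipped = []
    for u in users:
        if u["locked"] and not include_locked:
            skipped.append(u["name"])  # the silent gap we hit
            continue
        u["profile_version"] += 1
    return skipped

# In test, nobody was locked, so the skip path never executed.
# In production, everyone was locked - so this list was everyone.
skipped = update_profiles(users)
print("skipped:", skipped)
```

Even just reporting (or failing loudly on) the skipped users would have surfaced the problem immediately instead of after the fact.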
Exceptions: how do they change the game?
Let's talk about the exceptions. Sometimes you simply can't have your rehearsal be exactly the same. The laboratory of real life is difficult to duplicate. But there are different facets to rehearsals and we can look at those.
- People - who is doing the work
- Processes - how is the work being done
- Technology - what hardware and software is being used
- Schedule - when does the work happen
Processes sometimes have to differ between a rehearsal and the real execution. The key is to understand where they differ and what the risks are. For example, we do periodic tests to ensure we can bring up our disaster recovery alternate site. We don't actually fail production over, so we have to do a few steps differently. We understand the risks - perhaps we're not testing a required DNS change - but we've weighed that against the risk of test-induced failure and decided it was the right approach.
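One way to keep those differences explicit rather than tribal knowledge is to tag each runbook step with the environments it actually runs in, so the untested steps fall out mechanically. A sketch of the idea, with hypothetical step names:

```python
# Sketch: tag runbook steps with the environments they run in,
# so rehearsal/production divergence is explicit. Step names
# are hypothetical examples.

RUNBOOK = [
    {"step": "replicate data to alternate site",
     "envs": {"rehearsal", "production"}},
    {"step": "start application tier",
     "envs": {"rehearsal", "production"}},
    {"step": "repoint DNS to alternate site",
     "envs": {"production"}},  # accepted risk: never rehearsed
]

def plan(env):
    """Steps executed in the given environment."""
    return [s["step"] for s in RUNBOOK if env in s["envs"]]

def untested_in(rehearsal_env, real_env):
    """Steps that will run for real without ever being rehearsed."""
    return [s for s in plan(real_env) if s not in plan(rehearsal_env)]

print(untested_in("rehearsal", "production"))
```

The output of that last call is your list of accepted risks - each one should have been consciously weighed, not discovered during the real event.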
Technology usually differs for some economic or funding reason. Some systems are large and expensive enough that you can't fully duplicate them in a practical way. Perhaps your production and test environments share the same SAN behind the scenes. Maybe you've got faster, more expensive hardware in production, but older/slower/cheaper hardware in test. So when you decide to run a 2-node cluster in test and an 8-node cluster in production, just remember that some of that savings is offset by the cost of analyzing the differences, and by the occasional "oops" when the two behave differently.
Schedule is tricky. Some tasks may require working around the clock to complete, but running a rehearsal that way isn't practical. And some things just can't be duplicated. My favorite example came during a large deployment we did earlier this year. We purposely chose a weekend in January because the following Monday was a holiday and the Tuesday was Inauguration Day in the US; since we're based near DC, about half our customers had the day off. That seemed like excellent timing: we'd have the time to do the work properly while minimizing the impact to our customers. Which was all dandy until we asked some of our remote customers to perform validations. "The system is too slow to use - we can't validate." Just as panic was setting in, we realized it was a few minutes before noon and their office network was flooded with people streaming video of the ceremonies. Schedule is probably the facet most vulnerable to unexpected external forces.
Recommendations and summary
Decide what "right" is, and then rehearse it the way you're planning to actually do it. If you practice something other than what you plan to do, you'll get really good at doing something other than what you want.
Set the rules, and then understand the exceptions. Of course there will be exceptions. Have a process for describing, analyzing, and accepting (or rejecting) them. We don't want to live in the Wild West, but by the same token, a strict, purist approach is usually not practical. Do what you do with forethought and intent.
In fact, this leads me to the set of rules I have posted in my office...which I'll cover in an upcoming post.
I'd love to open the door to the IT Confessional to hear what stories you have. Please feel free to add comments with your best (and worst) related stories.