Ask Slashdot: Unattended Maintenance Windows?
grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).
I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
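For the simple window the submitter describes (upgrade, swap a package, reboot), a sketch might look like the following. This is an illustration, not a tested procedure: the package names, log path, and dry-run default are all assumptions, and with DRY_RUN=1 the script only prints what it would run.

```shell
#!/bin/sh
# Hypothetical unattended-window script for the case described above:
# upgrade CentOS, remove one package, install another, reboot.
# OLD_PKG/NEW_PKG are placeholders; DRY_RUN=1 prints instead of executing.
set -eu

DRY_RUN="${DRY_RUN:-1}"
OLD_PKG="${OLD_PKG:-oldpkg}"            # package to remove (assumption)
NEW_PKG="${NEW_PKG:-newpkg}"            # package to install (assumption)
LOG="${LOG:-/tmp/maint-window.log}"     # log path (assumption)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY: $*"
    else
        echo "RUN: $*" >>"$LOG"
        "$@" >>"$LOG" 2>&1
    fi
}

run yum -y update
run yum -y remove "$OLD_PKG"
run yum -y install "$NEW_PKG"
run shutdown -r +1 "Automated maintenance reboot"
```

Run with DRY_RUN=0 only after the script has been through the kind of review and testing the submitter mentions, and schedule it via cron(8) or at(1) for the window.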
Puppet. (Score:4, Informative)
Learn and use Puppet.
Re:Puppet. (Score:4, Interesting)
Puppet is a great tool for automation but does not solve problems like patching and rebooting systems without downtime.
Re: (Score:3)
Just having a proper IT infrastructure works even better.
Patch and reboot the secondary server at 11am. Everything checks out? Put it online and promote it to primary. All done. Now migrate the changes to the backup, pack up the laptop, and head home at 5pm... not a problem. Our SQL setup has 3 servers: we upgrade one and promote it, while #2 and #3 stay at the previous revision until 5 days have passed so we have a rollback. Yes, data is synced across all three; worst case, even if TWO servers were to explode we'd still have one good copy.
Re: (Score:3)
Re: (Score:2)
That's a failure to test* your code-as-infrastructure, not a puppet failure.
*: Exempting a small subset of physical device issues, though even those can be ignored if you're talking about a VM, so that the physical hardware is never actually in a not-live state.
Re: (Score:2)
So... you didn't test... and you have only yourself to blame?
Especially with VMs, it is so easy to snapshot and test things.
Re: (Score:3)
So it's someone else's fault your test environment doesn't match production?
Re:Puppet. (Score:4, Interesting)
So it's someone else's fault your test environment doesn't match production?
People often fail to try hard enough to make the test environment (assuming they even have one) match the production environment, but for some problems test never matches production, and essentially never can: some problems only reveal themselves under production *conditions*. For example, I recently spent a significant amount of time troubleshooting a kernel bug that only arose under a very specific (and still not fully characterized) set of disk loads. Test loads, including some several times higher than the production load, did not uncover the bug, which caused kernel faults; the faults randomly started occurring about a week after the software patch went live.
You should try to keep test as close as possible to production so that testing on it has any validity at all, but you should never assume that testing on the test environment *guarantees* success on production. It's for that reason that, responding to the OP, I have never attempted any serious production upgrade in an automated and unattended fashion, and not while I'm alive will any such thing happen on any system I have authority over. As far as I'm concerned, if you decide to automate and go to sleep, make sure your resume is up to date before you do, because you might not have a job when you wake up if you guess wrong.
Even if you guess right, I might decide to fire you anyway if anyone working for me decided to do that without authorization.
Re:Puppet. (Score:4, Informative)
How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?
Clone it. Upgrade the clone and make sure it works. If so, wipe the clone, snapshot the production VM, and upgrade it. If it fails, roll back. Make sure your infrastructure is set up so the clone CAN be properly tested. Yes, sometimes you will have to do that rollback, but with an adequate test setup, frequently you won't.
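The clone-then-snapshot workflow above could be sketched roughly like this, using libvirt's virt-clone and virsh as one example hypervisor CLI (VMware and others have their own equivalents). The VM name and snapshot naming scheme are assumptions, and DRY_RUN=1 just prints the commands.

```shell
#!/bin/sh
# Sketch of: clone and test offline, then snapshot production and
# upgrade it, with the snapshot as the rollback point.
set -eu

DRY_RUN="${DRY_RUN:-1}"
VM="${VM:-prod-vm}"                     # production VM name (assumption)
SNAP="pre-maint-$(date +%Y%m%d)"        # snapshot name (assumption)

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "DRY: $*"; else "$@"; fi
}

# 1. Clone production and test the upgrade against the clone,
#    on an isolated network so it can't touch live traffic.
run virt-clone --original "$VM" --name "${VM}-test" --auto-clone
# ... run the upgrade against ${VM}-test and verify it here ...

# 2. If the clone passed, snapshot production and upgrade it for real.
run virsh snapshot-create-as "$VM" "$SNAP" "before maintenance"
# ... upgrade $VM ...

# 3. On failure, roll back:
# run virsh snapshot-revert "$VM" "$SNAP"
```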
Re: (Score:3)
But the solution will be just a more complex variant on this theme. Consider also that you might have allowed complex to become Rube Goldberg.
And if it doesn't work? (Score:5, Insightful)
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
And if it doesn't work? (Score:2, Insightful)
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Re: (Score:2, Insightful)
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Use of monitoring and alerting can alleviate this - access to the system through VPN can provide near-immediate access. It also helps if critical services can be made not to be single points of failure.
Re: (Score:2)
I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.
Yeah, but this way, you won't be the one who has to fix it :).
Of course, you might have to start looking at job ads the next day...
Re: (Score:2)
Exactly, and when it comes to maintenance windows one should never forget Murphy. If something can go wrong it will, and being there with a console cable and a laptop or tablet to get into a problem device is a good thing.
Re: (Score:2)
Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.
He might just need a better boss--it sounds like this one expects the guy to stay up all night for maintenance, then come in at 9am sharp, as if he didn't just do a full day's work in the middle of the night.
Rather than automating, he should be lobbying for the right to sleep on maintenance days by shifting his work schedule so that his "maintenance time" IS his workday. "Off-hour work" doesn't mean "Work all day Monday, all night Monday night into Tuesday morning, and all day Tuesday." Or, at least, it shouldn't.
Murphy says no. (Score:5, Insightful)
Re:Murphy says no. (Score:5, Interesting)
The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.
Re: (Score:2)
Yup. Very dependent on the business, the application, the usage patterns, etc.
Re: Murphy says no. (Score:5, Insightful)
Re: (Score:3, Insightful)
...Better yet, use Amazon EC2 for your infrastructure so you can spool up as many redundant systems as necessary.
Exactly. Because if Amazon screws up, they won't blame you. That fantasy and a couple bucks will get you a Starbucks latte.
Using someone else's servers is always a bad idea for critical systems. Virtualization is definitely the way to go, but use your own hardware. Yes, that means you need to maintain that hardware, but that's a small (or not so small, in a large environment -- but worth it) price to pay because Murphy was an optimist.
Re:Murphy says no. (Score:4, Insightful)
In general, don't do anything that isn't your core business. Or another way of saying it, Do What Only You Can Do.
If you are an insurance company, is building and maintaining hardware your business? No, not in the slightest. You have no more business maintaining computer hardware than you have maintaining printing presses to print your own claims forms.
Maintaining hardware and the rest of the infrastructure stack however, is the business of Amazon AWS, Windows Azure, etc. The "fantasy" you're referring to is the crazy idea that you, as some kind of God SysAdmin, can out-perform the world's top infrastructure providers at maintaining infrastructure. Even if you were the best SysAdmin alive on the planet, you can't scale very far.
Sure, any of those providers can (and do, frequently) fail. Still, they are better than you can ever hope to be, especially once you scale past a handful of servers. If you are concerned that they still fail, that's good, yet it's still a problem worst addressed by taking the hardware in house. A much better solution is to build your deployments to be cloud vendor agnostic: Be able to run on AWS or Azure (or both, and maybe a few other friends too) either all the time by default or at the flip of a (frequently tested) switch.
Even building in multi-cloud redundancy is far easier, cheaper, and more reliable than you could ever hope to build from scratch on your own. That's just the reality of modern computing.
There are reasons to build on premises still, but they are few and far between. Especially now that cloud providers are becoming PCI, SOX, and even HIPAA capable and certified.
Re:Murphy says no. (Score:5, Informative)
Here is what I have done in the past with network gear:
1. Make sure that you have a test environment that is as close to your production environment as possible. In the case of network gear, I test on the exact same switches with the exact same firmware and configuration. For servers, VMWare is your friend....
2. Build your script, test, and document the process as many times as necessary to ensure that there are no gotchas. This is easier for network gear as there are fewer prompts and options.
3. Build in a backup job in your script, schedule a backup with enough time to complete before your script runs, or make your script dependent on the backup job completing successfully. A good backup is your friend. Make a local backup if you have the space.
4. Schedule your job.
5. Get up and check that the job completed successfully, either when the job is scheduled to finish or before the first user is expected to start using the system. Leave enough time to perform a restore, if necessary.
As you can probably tell, doing this in an automated fashion would take more time and effort than babysitting the process yourself. However, it is worth it if you can apply the same process to a bunch of systems (e.g. you have a bunch of UNIX boxes on the same version and you want to upgrade them all). In our environment we have a large number of switches, etc. that are all on the same version. Automation is pretty much the only option given our scope.
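One hedged way to implement step 3, making the maintenance job depend on the backup having completed: have the backup job drop a marker file when it finishes, and have the maintenance script refuse to start without a fresh one. The marker path and the six-hour freshness window are made-up examples, and the mktemp default exists only so the sketch runs standalone.

```shell
#!/bin/sh
set -eu

# Demo default: a freshly created temp file stands in for the real
# marker; in production, point MARKER at whatever your backup job writes.
MARKER="${MARKER:-$(mktemp)}"
MAX_AGE_MIN=360   # marker must be newer than 6 hours (assumption)

backup_is_fresh() {
    # find prints the path only if the marker exists and is recent enough
    [ -n "$(find "$MARKER" -mmin "-$MAX_AGE_MIN" 2>/dev/null)" ]
}

if backup_is_fresh; then
    echo "backup OK, proceeding with maintenance"
    # ... real maintenance commands would go here ...
else
    echo "no fresh backup marker at $MARKER, aborting" >&2
    exit 1
fi
```

Scheduled after the backup window via cron, this gives you the "dependent on the backup job completing successfully" behavior without coupling the two jobs directly.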
Re: (Score:2)
Yes.
Also, this is one of those scenarios where virtualization pays. You can simply spin up a new set of boxes (ideally via Puppet, Chef, whatever) and cut over to it once the new cluster has been thoroughly tested and tested some more. A human eye watching/managing the cutover is still recommended, if not required.
Re: (Score:2)
say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen
Yes, this happens all the time. And really it's a case for doing the upgrade when people are actually using the system. If the patch happens at 2am (chosen because nobody is using it at 2am), nobody is going to notice the breakage until the morning. The morning, when the guy who put in the patch is still trying to recover from having to work at 2am. At the very least...
Re: (Score:2)
Re: (Score:2)
Yup. Although, that said, if you have a proper test environment, like say, a snap-clone of your live environment and an isolated test VLAN, you can do significant testing on copies of live systems and be pretty confident it will work. You can figure out your back-out plan, which may be as simple as rolling back to a snapshot (or possibly not).
Way too many environments have no test environment, but these days, with the mass deployment of FAS/SAN and virtualization, you owe it to your team to get that shit set up.
Re: (Score:2)
say the patch unexpectedly breaks another critical function of the server.
When this happens, it usually takes a lot longer to fix than it takes to drive in to work, because the way it breaks is unexpected. The proper method is to have an identical server get upgraded with this automatic maintenance-window method the day before, while you're at work, or at least hours before the primary system, so that you can halt the automatic method remotely before it screws up the primary system. If the service isn't important enough, let your monitoring software wake you up if there's a failure.
Re: Murphy says no. (Score:5, Insightful)
This guy probably is the tech but is wanting to spend more time with his family or something.
Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.
OR, if you want to have a family life, don't take a job that requires you to do stuff that's not family-life-oriented.
That's the route I've taken - no on-call phone, no midnight maintenance, no work-80-hours-get-paid-for-40 bullshit. Pay doesn't seem that great, until you factor in the wage dilution of those guys working more hours than they get paid for. Turns out, hour-for-hour I make just as much as a lot of the managers around here, and don't have to deal with half the crap they do.
The rivers sure have been nice this year... and the barbecues, the lazy evenings relaxing on the porch, the weekends to myself... yea. I dig it.
Re: (Score:2)
you've just described my life. amen brother.
Re: (Score:2)
At least where I work, maintenance is a once-a-month thing; I'm led to believe this is normal by anecdotal evidence on the internet.
Your average work week ends up at like 42 hours if you factor that in; it's really not that onerous.
Re: (Score:2)
Yup, same here. While my yearly salary isn't great, I work 35-hour weeks, 4 weeks vacation, 10 sick days, multiple breaks per day, and rarely ever any OT (and while we are salaried and don't get OT pay, we instead get time off in lieu at time and a half). Hour-for-hour I probably make more than a lot of managers as well. Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.
Re: (Score:3)
Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.
Not for these fuckers, anyway.
Were I to strike out on my own, I don't think I'd mind all the extra hours, but it's easy to see things differently when you're your own boss.
Re: (Score:3)
So once a week you have to get up early and do some work.
Big deal.
The benefit is that you get to go home early too - and that means you're there to pick up little Johnny from school instead of seeing him when you drag your sorry arse in from a full day of meetings and emails and stuff.
Frankly, I wouldn't want to do it every day, but I can't see how the occasional early start is anything but a good thing for family life.
Re: (Score:2)
I have no idea if once a week is realistic, it sounds far too high. I have around 5-10 such windows a year, some are stuff I can do from home (with support from the guys on shift) and some entail me being physically there, so there have been none of the second kind this year.
Major outages of one of our production systems have been featured on national news and Slashdot before, although it requires an outage of several hours to cross that threshold. Our windows are at around 02:00 to 03:00 depending on which...
Re: Murphy says no. (Score:5, Funny)
This guy probably is the tech but is wanting to spend more time with his family or something.
Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.
Congratulations! You're management material!
Re: (Score:2)
It's even more fun when the CEO stops by, in person, to see how long it is going to take to get things working again. Though it was not my fault either time I've been there actually fixing the problem, it certainly is attention-getting. Neither CEO was being a jerk; he just really needed to know what was going on without any b/s filters from intermediate management. Try imagining that visit if you had just been running an automated script to apply the patch.
So yeah, if it is important, you need to be there, and if...
I've toyed with this concept.. (Score:5, Interesting)
...and while I'm reasonably sure I can execute automated maintenance windows with little to no impact to business operations, I'm not sure. So I don't do it.
If there were more at stake, if the risk vs benefits were tipped more in my company's favor, I might test implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.
Re:I've toyed with this concept.. (Score:4, Insightful)
Even on fairly simple things (yum updates from mirrors, AIX PTFs, Solaris patches, or Windows patches released from WSUS), I like babysitting the job.
There is a lot that can happen. A backup can fail, then the update can fail. Something relatively simple can go ka-boom. A kernel update doesn't "take" and the box falls back to the wrong kernel.
Even something as stupid as having a bootable CD in the drive and the server deciding it wants to run the OS from that rather than from the FCA or onboard drives. Being physically there so one can rectify that mistake is a lot easier when planned, as opposed to having to get up and drive to work at a moment's notice... and by that time, someone else has likely discovered it and is sending scathing e-mails to you, CC'ing 5 tiers of management.
Re: (Score:2)
I always test in advance, have a roll back plan, only automate low risk maintenance, test the results remotely, and have a warm body on back up should the need arise. Saves a little sleep since I don't babysit the entire process just the result. I don't have physical access to most of the equipment since it's scattered across multiple data centers so I do most of my work remotely anyway.
Comment removed (Score:5, Insightful)
Re: (Score:3)
Re: (Score:2)
There are now two distinct phrases in the English language:
There is the logical fallacy of begging the question.
Sometimes, an event happens which begs (for) the question of why nobody planned for it.
You might think you sound all clever and stuff, but you're wrong. They sound similar, but they aren't the same. The second one has been in common usage for decades now, and has nothing to do with the logical fallacy.
Raises the question (Score:2)
Sometimes, an event happens which begs (for) the question of why nobody planned for it.
This raises the question of why people don't just avoid the pedantic bickering by saying "raises the question".
Re: (Score:3)
Because, generally speaking, pedants are tedious and annoying, and nobody else cares about the trivial minutia they like to get bogged down in because it's irrelevant to the topic at hand.
At least, that's what my wife tells me. ;-)
Re: (Score:2)
Because, generally speaking, pedants are tedious and annoying, and no one else cares about the trivial minutiae in which pedants like to get bogged down. It's irrelevant to the topic at hand.
At least, that's what my wife tells me. ;-)
There. FTFY. Pedantry and grammar nazism all in one pretty package. You're welcome.
Re: (Score:2)
Attended automation (Score:3, Interesting)
Attended automation is the way to go. You gain all the advantages of documentation, testing etc. If the automation goes smooth, you only have to watch it for 5 mins. If it doesn't, then you can fix it immediately.
Schedule some days as offset days (Score:2)
You just need to schedule some of your days as offset days. Work from 4pm to midnight some days so that you can get some work done when others aren't around. Some days require you being around people, some days command you be alone.
Or you can just work 16-hour days like the rest of us and wear it as a badge of honor.
If you are your own boss and do this, you can earn enough money to take random weeks off from work with little to no notice so that you can travel the world, and do some recruiting while doing
Re: (Score:2)
Pretty much this. If your company is big enough, or derives enough revenue from IT systems that require routine off-hours maintenance, they should staff for that.
That is not to say they need to for just Patch Tuesdays, or the occasional major internal code deployment that happens a couple of times a year or so. For that, you as the admin should suck it up and roll out of bed early once in a while. Hopefully your bosses are nice and let you have some flextime for it. Knock off at 3pm on Fridays those weeks...
Re: (Score:2)
Or you can just work 16-hour days like the rest of us and wear it as a badge of honor.
IMO, there is no honor in working more hours than you're actually being paid to work. Not only are you hurting yourself, you're keeping someone else from being able to take that job.
If you've got 80 hours worth of work to do at your company, and one guy with a 40-hour-a-week contract, you need to hire another person, not convince the existing guy that he should be proud to be enslaved. Morally speaking.
Re: (Score:2)
Not only that, but a company that lets someone do that is shooting itself in the foot. Sooner or later the 80-hour-a-week guy is going to leave, and good luck getting a replacement who is:
A: willing to do it coming in, and
B: not just taking the job until something better comes along.
It's not a badge of honor, just an example of rationalization for a crappy job.
Re: (Score:2)
Re: (Score:3)
FTFY
Offshore (Score:5, Insightful)
Offshore your maintenance jobs to someone in the correct timezone!
Re: (Score:2)
Sounds like a bad idea ... (Score:5, Insightful)
You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.
In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".
I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.
I know lots of people who work off-hours shifts to cover maintenance windows. My advice to you: suck it up, princess, that's part of the job.
This just sounds like risk taking in the name of being lazy.
This is why you need.. (Score:4, Insightful)
Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
Re:This is why you need.. (Score:5, Insightful)
Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
First off, you can't mirror everything. Lots of infrastructure and applications are either prohibitively expensive to do in a High Availability (HA) configuration or don't support one. Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing, and sometimes it's not worth the cost to have a failover just to reach a shorter RTO target that isn't needed by the business in the first place. As for load balancing, it normally doesn't do what you think it does...with virtual machine farms, sure, you can have N+X configurations and take machines offline for maintenance. But for most load balancing, the machines operate as a single entity...maintenance on one requires taking them all down because that's how the balancing logic works and/or because load has grown to require all of the systems online to prevent an outage. So HA is the only thing that actually supports the kind of maintenance activity you propose.
Second, doing this adds a lot of work. Failing from primary to secondary on a high availability system is simple for some things (especially embedded devices like firewalls, switches and routers) but very complicated for others. It's cheaper and more effective to bump the pay rate a bit and do what everyone does, for good reason...hold maintenance windows in the middle of the night.
Third, guess what happens when you spend the excess money to make everything HA, go through all the trouble of doing failovers as part of your maintenance... and then something goes wrong during that maintenance? You've just gone from HA to single-instance, during business hours. And if that application or device is one that warrants being in a HA configuration in the first place, you're now in a bit of danger. Roll the dice like that one too many times, and someday there will be an outage... of that application/device, followed immediately after by an outage of your job. It does happen, it has happened, I've seen it happen, and nobody experienced who runs a data center will let it happen to them.
Re: (Score:2)
Re: (Score:2)
There is also the fact that some failure modes will take both sides down. I've seen disk controllers overwrite shared LUNs, hosing both sides of the HA cluster (which is why I try to at least quiesce the DB or application so RTO/RPO in case of that failure mode is acceptable.)
HA can also be located at different points in the stack. For example, an Oracle DB server: it can be clustered at the Oracle application level (active/active or active/passive), or it can be sitting in a VMware instance, clustered under...
Re: (Score:2)
Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.
Having someone with little or no sleep doing critical updates is not really the best strategy.
Oh my $deity, this!
I've worked in environments with test-to-live setups, and ones without, and the former is always, always a smoother running system than the latter.
Re: (Score:2)
Yeah, don't get me wrong (I've been posting about setting up a test lab using vSphere, vFilters and VLANs) - you can't replace the need to have someone on call or watching in case it all fucks up. But you can generally reduce the outage window and risk significantly by actually testing (both the roll out and roll back) first. And if you've got it to the point where you can reliably test, you can work on your automation scripts, test the shit out of them, and having been tested with a copy of live using a...
Immutable Servers (Score:3)
Automate Out (Score:2)
Re: (Score:3)
This is why you move the fuck on and adapt. If your job is relying on stuff that can be done by a shell script, you need to up-skill and find another job. Because if you don't do it, someone like myself will.
And we'll be getting paid more due to being able to work at scale (same shit for 10 machines or 10,000 machines), doing less work and being much happier, doing it.
Slashdot is a Bad Place to Ask This (Score:5, Interesting)
Everyone here is going to tell you that a human needs to be there, because that is their livelihood. Any task can be automated, at a cost. I am guessing that it is not your current task to automate maintenance, otherwise you wouldn't be asking. Somewhere up your chain they decided that for the uptime / quality of service they need, it is more cost effective to have a human do it. That does not mean that you cannot present a case showing otherwise. I highly suggest that you win approval and backing before taking time to try to automate anything.
Out of curiosity, are they VMs?
Re: (Score:2)
No, many of us will tell you a human needs to be there because we've been in the IT industry long enough to have seen stuff go horribly wrong, and have learned to plan for the worst because it makes good sense.
I had the misfortune of working with a guy once who would make major changes to live systems in the middle of the day because he was a lazy idiot. He once took several servers offline for a few days because...
Re: (Score:2)
Alternatively, perhaps somewhere up the chain they have no idea what can be done (this IT shit isn't their area of expertise), and are not being told by their IT department how to actually fix the problem properly. Rather, they are just applying band-aid after band-aid for breakage that happens.
It is my experience that if you outline the risks, the costs, and the possible mitigation strategies to eliminate the risk, most sensible businesses are all ears. At the very least, if they don't agree on the spot...
Re: (Score:2)
Snapshots are great, for some things (Score:2)
Snapshots are great, but they assume all your data is on the snapshot. It's harder to roll back if your new version goes ahead and corrupts some database or something on the NAS.
It's even harder to roll back if your data stores are on some multi-clustered beast that wasn't designed to be rolled back.
Of course, you should have caught that in test, right?
I have automated maintenances in the form of ... (Score:2)
Nature of the beast (Score:2)
It depends on the size of your operation... (Score:5, Interesting)
If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely setup scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But, if you are dealing with individual servers that aren't part of a redundancy plan then you should babysit your maintenance. Now, I say babysit because you should test and automate the actual maintenance with a script to prevent typos and other human errors when you are doing the maintenance on production machines. The human is just there in case something goes haywire with your well-tested script.
Fully automating these sorts of things is out of reach for many small to medium-sized firms because they don't want to, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation, and how much your company is willing to invest to "do it the right way", is the limiting factor in how much you are going to be able to effectively automate this sort of task.
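If the maintenance itself is scripted, the post-window verification can be scripted too, so the human's job reduces to reading a summary. A minimal health-check sketch, assuming TCP reachability is a good-enough proxy for "the service came back" (the host:port list is invented):

```shell
#!/bin/sh
# Probe each service port after maintenance and print a pass/fail
# summary. CHECKS is a made-up example list; replace with your own.
set -u

CHECKS="${CHECKS:-localhost:22 localhost:80}"   # host:port pairs (assumption)
FAILED=0

for c in $CHECKS; do
    host="${c%:*}"
    port="${c#*:}"
    # nc -z only tests that the TCP port accepts a connection
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
        echo "OK   $c"
    else
        echo "FAIL $c"
        FAILED=$((FAILED + 1))
    fi
done

echo "$FAILED check(s) failed"
```

A nonzero failure count is what you would feed to monitoring or a pager; a port check says nothing about application-level correctness, so real deployments would add service-specific probes.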
Similar experiences ... (Score:5, Insightful)
A friend of mine lost his job over a similar "automation" task on Windows.
The upgrade script was tested on a lab environment that was supposed to be exactly like production (it turned out it wasn't: someone had tested something there earlier without telling anyone and never reverted it). The upgrade script was scheduled to run on production during the night.
The result: the \windows\system32 directory deleted from all the "upgraded" machines. Hundreds of them.
On the Linux side, I personally had Red Hat make some "small" changes on the storage side, and PowerPath got disabled at the next boot after patching. An unfortunate event, since all the Volume Groups were using /dev/emcpower devices. Or Red Hat making some "small" changes in the clustering software from one month to the next. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfully, in both cases the admins were up and online at 1AM when the patching started, and we were able to fix everything in time.
Then you can have glitchy hardware/software deciding not to come back up after a reboot. RHEL GFS clusters are known to randomly hang or crash at reboot. HP blades sometimes have to be physically removed and reinserted before they'll boot.
Get the business side to tell you how much downtime will cost the company for the interval until:
- Monitoring software detects that something is wrong;
- Alert reaches sleeping admin;
- Admin wakes up and is able to reach the servers.
Then see if you can risk it.
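The checklist above turns into a back-of-envelope number pretty quickly. All the figures below are placeholder assumptions; plug in your own detection, paging, and response times:

```shell
# Worst-case cost of an unattended failure, from the checklist above.
# Every figure here is a placeholder assumption.
detect_min=10      # monitoring notices something is wrong
page_min=5         # alert reaches the sleeping admin
respond_min=20     # admin wakes up and can reach the servers
cost_per_min=500   # business-side downtime cost per minute

outage_min=$((detect_min + page_min + respond_min))
echo "worst-case unattended outage: ${outage_min} min, \$$((outage_min * cost_per_min))"
# → worst-case unattended outage: 35 min, $17500
```

If that number dwarfs the cost of one engineer's early morning, the answer to "can you risk it" writes itself.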
Set alarms (Score:2)
Can't you build some kind of setup that triggers if the update fails and alerts you / wakes you up with noise from your smartphone, etc.?
Or, like the other poster who beat me to it suggested, off-load your work to someone in a country where your 5AM is their midday.
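A minimal alert-on-failure wrapper along the lines suggested above might look like this; `notify()` is a stand-in to be wired to whatever actually wakes you (a paging webhook, an SMS gateway), and the example maintenance steps in the comments are placeholders:

```shell
#!/bin/sh
# Sketch: run maintenance steps unattended, page someone on failure.
# notify() is a stand-in for your real paging/SMS integration.

notify() {
  echo "ALERT: $1" >&2   # e.g. POST to a paging webhook instead
}

run_step() {
  if "$@"; then
    echo "OK: $*"
  else
    notify "maintenance step failed: $*"
    return 1
  fi
}

# Example (placeholder steps):
# run_step yum -y update || exit 1
# run_step shutdown -r now
```

It doesn't remove the risk the other posters describe; it just shortens the window between "it broke" and "a human knows".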
Perception of Necessity (Score:2)
Sure, we all know it's a bad idea to set things on autopilot because eventually something will break badly. But do your managers know that?
Re: (Score:2)
Automating shit that can be automated, so that you can actually do things that benefit the business instead of simply maintaining the status quo, is not a bad thing. Doing automatable drudge work when it could be automated is just stupid. Muppets who can click Next through a Windows installer or run apt-get, etc. are a dime a dozen. IT staff who can get rid of that shit so they can actually help people get their own jobs done better are way more valuable.
The job of IT is to enable the business to conti
Testing. Validation. (Score:3)
Do you plan on automating the end-user testing and validation as well?
Countless system administrators have confirmed a system was operational after a change without throwing it to real live testers, only to find that, well, it wasn't.
Nope. (Score:2)
Every second you save by automating the task will be taken out of your backside when it goes wrong (see the recent article where a university SCCM server formatted itself and EVERY OTHER MACHINE on campus) and you're not around to stop it or fix it.
Honestly? It's not worth it.
Work out of normal hours, or schedule downtime windows in the middle of the day.
Re: (Score:3)
Think of it a slightly different way (Score:4, Informative)
First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.
The problem is that maintenance as simple as updating a package requires downtime at all. With better redundancy, you can do 99% of normal boring maintenance with zero downtime. I say if you're in this situation, you need to think about two questions:
1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything* you can easily run through your puppet code as you like in non-production, then in a maintenance window you merge your Puppet code, and simply watch it propagate to your servers. I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.
Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.
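The merge-then-watch workflow described above, reduced to commands, might look something like this. Branch names, the remote, and the canary hostname are all assumptions; `puppet agent --test` is Puppet's flag for a single verbose on-demand run:

```shell
# Sketch of the "merge Puppet code, watch it propagate" workflow above.
# Branch names and the canary host are illustrative assumptions.
deploy_puppet_change() {
  git checkout production &&
  git merge --no-ff staging &&     # code already exercised in pre-prod
  git push origin production &&
  # Force one verbose run on a canary box before the rest of the
  # fleet picks the change up on its normal agent schedule.
  ssh canary01 'sudo puppet agent --test'
}
```

The human is still watching the canary run, but the 5AM typing is gone; the risky thinking happened days earlier, in review.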
Re: (Score:2)
Re: (Score:2)
1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
True. Last year we upgraded all our servers to a new OS with a wipe and reinstall, and the only people who noticed were the ones who could see the server monitoring screens. The standby servers took over and handled all customer traffic while we upgraded the others.
Convenience in place of Caution (Score:2)
You're trading caution for convenience.
I have automated things like overnight patch installation, only to wake up to a broken server, despite the patches having been heavily tested and having worked 100% of the time before; they failed precisely when nobody was watching.
I urge you to only consider unattended automation overnight when it's for a system that can reasonably incur unexpected downtime without jeopardizing your job and/or the organization. If it's critical -- DO NOT AUTOMATE.
You've been w
Perl (Score:2)
No single points of failure (Score:2)
I don't get paid for things that work right (Score:2)
3 am is better (Score:2)
Automation is necessary (Score:3)
If you want to progress in your IT career, you need to figure out how to automate basic system operations like maintenance and patching. Having to actually be awake at 2:00am to apply patches is rookie status. Sometimes it is unavoidable, but it should not be the default stance.
My environment is virtual, so our workflow is basically snapshot VM, patch, test. If the test fails, rollback the snapshot and try again (if time is available) or delay until later. If the test is successful, we hold onto the snapshot for three days just in case users find something that we missed. If everything is good after three days, we delete the snapshot.
We have a dev environment that mirrors production that we can use for patch testing, upgrade testing, etc. Due to testing, we rarely have problems with production changes. If we do, the junior guys escalate to someone who can sort it out. Our SLAs are defined to give us plenty of time to resolve issues that occur within the allocated window. (Typically ~4 hours)
In the grand scheme of things, my environment is pretty small. We have ~1500 VMs. We manage it with three people and a lot of automation.
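The snapshot / patch / test / rollback cycle described above can be sketched with libvirt's `virsh` snapshot commands. The VM name, the patch command, and the smoke-test path are placeholders, and this assumes a snapshot taken with memory state so a revert restores a running system:

```shell
# Sketch of the snapshot -> patch -> test -> rollback cycle above.
# VM name, patch command, and smoke-test path are placeholders.
patch_with_snapshot() {
  vm=$1
  virsh snapshot-create-as "$vm" pre-patch || return 1
  ssh "$vm" 'sudo yum -y update'
  if ssh "$vm" '/usr/local/bin/smoke-test'; then
    echo "patched OK; keep snapshot for 3 days, then:"
    echo "  virsh snapshot-delete $vm pre-patch"
  else
    virsh snapshot-revert "$vm" pre-patch   # roll back, retry later
    return 1
  fi
}
```

The three-day retention in the parent comment maps to deferring the `snapshot-delete` until users have had a chance to complain.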
Reboot? - Load Balancers and multiple systems (Score:4, Insightful)
The better way to go about it has already been pointed out above. Have several systems, load-balance them in a pool, take one node out of the pool, work on it, return it to the pool, then repeat for each remaining system. No outage, and users are none the wiser to the update.
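As a rough sketch of that rolling pattern using HAProxy's runtime API over its stats socket (`disable server` / `enable server` are real runtime commands; the backend and node names, socket path, fixed sleep, and patch command are assumptions — a real version would poll a health check instead of sleeping):

```shell
# Rolling zero-downtime update over a load-balanced pool, as described
# above. Backend/node names, socket path, and patch step are placeholders.
rolling_update() {
  sock=/var/run/haproxy.sock
  for node in web1 web2 web3; do
    echo "disable server app/$node" | socat stdio "$sock"   # drain node
    ssh "$node" 'sudo yum -y update && sudo reboot'
    sleep 120   # placeholder: poll the node's health check instead
    echo "enable server app/$node" | socat stdio "$sock"    # back in pool
  done
}
```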
Missing the point (Score:3)
The OP is missing the point. Of *course* you can automate updates. You don't even need an automation system. It can be as simple as writing a bash script.
The point is... what happens when something goes wrong? If all goes well, then there's no problem. But if something does go wrong, you no longer have anyone able to respond because nobody's paying attention. So you come in the next morning with a down server and a clusterf__k on your hands.
Thanks for the feedback (OP response) (Score:3)
A couple clarifications: we do have redundant systems, on multiple physical machines with redundant power and network connections. If a VM (or even an entire hypervisor) dies, we're generally OK. Unfortunately, some things are very hard to make HA. If a primary database server needs to be rebooted, generally downtime is required. We do have a pretty good monitoring setup, and we also have support staff that work all shifts, so there's always someone around who could be tasked with 'call me if this breaks'. We also have a senior engineer on call at all times. Lately it's been pretty quiet because stuff mostly just works.
Basically, up to this point we haven't automated anything that will / could be done during a maintenance window that causes downtime on a public facing service, and I can understand the reasoning behind that, but we also have lab and QA environments that are getting closer to what we have in production. They're not quite there yet, but when we get there, automating something like this could be an interesting way to go. We're already starting to use Ansible, but that's not completely baked in yet and will probably take several months.
My interest in doing this is partly that sleep is nice, but really, if I'm doing maintenance at 5:30 AM for a window that has to be announced weeks ahead of time, I'm a single point of failure, and I don't really like that. Plus, considering the number of systems we have, the benefits of automating this particular scenario are significant. Proper testing is required, but proper testing (which can also be automated) can be used to ensure that our lab environments do actually match production (unit tests can be baked in). Initially it will take more time, but in the long run anything that can eliminate human error is good, particularly at odd hours.
Somewhat related: about a year ago, my cat redeployed a service. I was up for an early morning window and had pre-staged a few commands chained with &&'s, went downstairs to make coffee, and came back to find that the work had been done. Too early. My cat was hanging out on the desk. The first key he hit was Enter, followed by a bunch of garbage, so my commands were faithfully executed. It didn't cause any serious trouble, but it could have under different circumstances. Anyway, thanks for the useful feedback.
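The Ansible direction the OP mentions lends itself to staged dry runs: `--check` (with `--diff`) is Ansible's built-in no-op mode, so the same playbook can be rehearsed against the lab before touching production. Playbook, inventory, and group names below are assumptions:

```shell
# Sketch: stage an Ansible maintenance run lab-first, prod-last.
# Playbook/inventory/group names are illustrative assumptions;
# --check/--diff are Ansible's dry-run flags.
staged_maintenance() {
  ansible-playbook -i inventory maintenance.yml --check --diff --limit lab &&
  ansible-playbook -i inventory maintenance.yml --limit lab &&
  # prod last; pacing one host at a time is done with `serial: 1`
  # in the play itself
  ansible-playbook -i inventory maintenance.yml --limit prod
}
```

This also addresses the OP's "lab should match production" point: the check-mode diff against the lab is itself a test that the environments agree.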
Re: (Score:2)
Some of us would argue that doing maintenance unattended is preparing for failure -- or at least giving yourself the best possible chance of failure.
I work in an industry where if we did our maintenance badly and there was an outage, it would literally cost millions of dollars per hour.
If what you're doing is so unimportant that you can leave the maintenance unattended, there's probably no reason you couldn't take the outage in the middle of the day.
If it is important, you don't leave it to chance.
Re: (Score:2)
Then you're maintaining trivial, boring, and unimportant systems that nobody will notice. If your job is to do that ... well, your job is trivial and unimportant.
The stuff that I maintain, if it was down or under-performing during normal business hours ... we would immediately start getting howls from the users, and the company would literally be losing vast sum
Re: (Score:3)
OS choice is irrelevant. I've seen plenty of critical Linux fuck-ups in my day, and OS choice doesn't account for human error. And, being human, you WILL make human errors. You need a test environment and a back-out plan. If you don't at least have a back-out plan and an estimate of how much the fuckup will cost BEFORE proceeding (balanced against the cost/risk of leaving it the fuck alone), you should not be carrying out the work.
Sure, that sounds like management speak, but seriously... cov
Re: (Score:2)
In my experience (personal and professional), those people do a half assed job of building those systems, have no concept of what will be required to maintain them, and are then subsequently unavailable
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)