
Ask Slashdot: Unattended Maintenance Windows?

Soulskill posted about two weeks ago | from the wake-me-if-there's-fire dept.


grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).

I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so that I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong, having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
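
As a concrete illustration, a minimal sketch of what such an automated window might look like on a CentOS/yum host follows; the package names, alert address, and local SMTP relay are placeholders rather than details from the submission, and any failed or hung step aborts and alerts the on-call address instead of continuing.

<ecode>
#!/usr/bin/env python3
"""Sketch of an unattended maintenance window on a CentOS host.

Package names, the alert address, and the SMTP relay are illustrative
placeholders, not the submitter's actual environment.
"""
import smtplib
import subprocess
import sys
from email.message import EmailMessage

STEPS = [
    ["yum", "-y", "update"],              # upgrade the OS
    ["yum", "-y", "remove", "oldpkg"],    # hypothetical package to drop
    ["yum", "-y", "install", "newpkg"],   # hypothetical replacement
]

def alert(subject: str, body: str) -> None:
    """Mail the on-call address; assumes a local SMTP relay on the host."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "maint@localhost", "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def main() -> None:
    for step in STEPS:
        try:
            result = subprocess.run(step, capture_output=True, text=True, timeout=1800)
        except subprocess.TimeoutExpired:
            alert("Maintenance HUNG: %s" % " ".join(step), "Step exceeded 30 minutes; aborting.")
            sys.exit(1)
        if result.returncode != 0:
            # Stop immediately and wake someone up rather than continuing blind.
            alert("Maintenance FAILED: %s" % " ".join(step), result.stdout + result.stderr)
            sys.exit(1)
    alert("Maintenance OK, rebooting", "All steps succeeded; rebooting now.")
    subprocess.run(["shutdown", "-r", "+1"])  # reboot one minute after the all-clear mail

if __name__ == "__main__":
    main()
</ecode>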


265 comments

Puppet. (4, Informative)

Anonymous Coward | about two weeks ago | (#47432007)

Learn and use Puppet.

Re:Puppet. (-1, Flamebait)

Anonymous Coward | about two weeks ago | (#47432059)

Yes puppet solves all of your failed reboot issues...

Fucking puppet bullshit, written by ruby retards who can't figure out that sysadmins have been automating this shit for years, with shell scripts.

But no, these ruby-tards think they need to reinvent the world written in ruby, because 90% of them couldn't hack their way out of a paper bag.

Re:Puppet. (1)

Rhys (96510) | about two weeks ago | (#47432103)

That's a failure to test* your code-as-infrastructure, not a puppet failure.

*: Exempting a small subset of physical device issues, though even those can be ignored if you're talking about a VM, so that the physical hardware is never actually in a not-live state.

Re:Puppet. (1)

Anonymous Coward | about two weeks ago | (#47432159)

And a kernel update has never blown up a grub install on a VM...

Nope..never has happened.

And if it doesn't work? (5, Insightful)

Anonymous Coward | about two weeks ago | (#47432021)

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

And if it doesn't work? (2, Insightful)

Anonymous Coward | about two weeks ago | (#47432097)

Support for off-hour work is part of the job. Don't like it? Find another job where you don't have to do that. Can't find another job? Improve yourself so you can.

This is the correct answer. I promise you that at some point, something will fail, and you will have failed by not being there to fix it immediately.

just do your job. (0)

Anonymous Coward | about two weeks ago | (#47432027)

quit complaining.

Murphy says no. (5, Insightful)

wbr1 (2538558) | about two weeks ago | (#47432033)

You should always have a competent tech on hand for maintenance tasks. Period. If you do not, Murphy will bite you, and then, instead of having it back up by peak hours, you are scrambling and looking dumb. In your current scenario, say the patch unexpectedly breaks another critical function of the server. It happens; if you have been in IT any length of time, you have seen it happen. Bite the bullet and have a tech on hand to roll back the patch. Give them time off at another point, or pay them extra for night hours, but them's the breaks when dealing with critical services.

Re: Murphy says no. (0, Troll)

ModernGeek (601932) | about two weeks ago | (#47432081)

This guy probably is the tech but is wanting to spend more time with his family or something.

Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

Re: Murphy says no. (4, Insightful)

CanHasDIY (1672858) | about two weeks ago | (#47432209)

This guy probably is the tech but is wanting to spend more time with his family or something.

Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

OR, if you want to have a family life, don't take a job that requires you to do stuff that's not family-life-oriented.

That's the route I've taken - no on-call phone, no midnight maintenance, no work-80-hours-get-paid-for-40 bullshit. Pay doesn't seem that great, until you factor in the wage dilution of those guys working more hours than they get paid for. Turns out, hour-for-hour I make just as much as a lot of the managers around here, and don't have to deal with half the crap they do.

The rivers sure have been nice this year... and the barbecues, the lazy evenings relaxing on the porch, the weekends to myself... yea. I dig it.

Re: Murphy says no. (1)

hodet (620484) | about two weeks ago | (#47432375)

you've just described my life. amen brother.

Re: Murphy says no. (1)

LordLimecat (1103839) | about two weeks ago | (#47432551)

At least where I work, maintenance is a once-a-month thing; I'm led to believe this is normal by anecdotal evidence on the internet.

Your average work week ends up at like 42 hours if you factor that in; it's really not that onerous.

Re: Murphy says no. (1)

master_kaos (1027308) | about two weeks ago | (#47432559)

yup, same here. While my yearly salary isn't great, I work 35-hour weeks, 4 weeks vacation, 10 sick days, multiple breaks per day, rarely ever any OT (and while we are salaried and don't get OT pay, we instead get time off at time and a half). Hour-for-hour I probably make more than a lot of managers as well. Would I like to make double what I am making? Sure, but I would NOT be willing to put in double the work.

Re: Murphy says no. (2)

gbjbaanb (229885) | about two weeks ago | (#47432249)

so once a week you have to get up early and do some work.

big deal.

The benefit is that you get to go home early too - and that means you're there to pick up little johnny from school instead of seeing him when you drag your sorry arse in from a full day of meetings and emails and stuff.

Frankly, I wouldn't want to do it every day, but I can't see how the occasional early start is anything but a good thing for family life.

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432513)

In a lot of places you don't get to go home early. I watch the devops team where I work: they have to stay up late for releases and still be in for 8, 9, or 10 AM, depending on each person's shift. There is no extra time off. The team also has extremely high turnover. It's unfortunate, because I would love to work on that team for a bit; I think it would be fascinating.

Re: Murphy says no. (5, Funny)

PvtVoid (1252388) | about two weeks ago | (#47432307)

This guy probably is the tech but is wanting to spend more time with his family or something.

Probably settled down too fast and can't get a better job now. My advice: don't settle down and quit using your wife and children as excuses for your career failures because they'll grow to hate you for it.

Congratulations! You're management material!

Re: Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432485)

lol

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432127)

As someone who has spent many an all-nighter on hardware and software patches and upgrades for critical telecom and network systems, allow me to introduce you to your co-tech, Mr. Murphy. He will always be by your side, helping you find the tiniest potential for failure in your plans. Do not leave anything to chance. Pre-test and automate to your heart's content, but be there watching and confirming and double-checking everything. It will fail at some point. The difference will be whether or not any users notice when they start rolling in. Your precious sleep lost is better than your job lost.

Re:Murphy says no. (5, Interesting)

bwhaley (410361) | about two weeks ago | (#47432267)

The right answer to this is to have redundant systems so you can do the work during the day without impacting business operations.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432429)

Yep. Redundancy is the only way to fly with services that have any sort of business significance. Two of everything is good; sometimes three are better. It allows you to do two things: implement a change with something to fall back on if it screws up or has unanticipated side effects, and let the change happen during normal working hours -- provided adequate capacity was provisioned. Unattended patching may be OK for your PC at home that just runs Facebook, but for any sort of business service, Murphy is really in charge.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432431)

Thank you. Dare I say it: Windows Server in a cluster, or any number of other things, can allow this to be done during the day with no downtime.

Re:Murphy says no. (1)

mshieh (222547) | about two weeks ago | (#47432459)

Can't agree enough; regular downtime is the root of the problem.

Usually you still want to do it off-peak just in case you're caught with reduced capacity.

Re:Murphy says no. (1)

bwhaley (410361) | about two weeks ago | (#47432493)

Yup. Very dependent on the business, the application, the usage patterns, etc.

Re: Murphy says no. (1)

ranelen (2386) | about two weeks ago | (#47432475)

exactly. it doesn't really matter if you are there or not. eventually something is going to break in a new and interesting way that can't be fixed without a significant amount of work.

generally we try to have at least three systems for any production service so that we can still have redundancy while doing maintenance.

that said, I rarely come in for patching anymore. I just make sure I'm available in case something doesn't come up afterwards. (no binge drinking on patch nights!)

redundancy and proper monitoring make life much, much nicer.

Re:Murphy says no. (1)

Anonymous Coward | about two weeks ago | (#47432519)

Yeah, no shit. I love how everyone is all like "quit whining and babysit that shit," when babysitting isn't necessary if you have a redundant system in place to recover from failure edge cases. Get a steady-state redundant system, then maintain the nodes on alternating weeks. If either one fails during automated maintenance, you can just switch over to the backup until you've had breakfast. Better yet, use Amazon EC2 for your infrastructure so you can spool up as many redundant systems as necessary.

Re:Murphy says no. (0)

Anonymous Coward | about two weeks ago | (#47432355)

I modded in this thread so I'm posting AC. +1 for "Score:6, Insightful"

Re:Murphy says no. (4, Informative)

David_Hart (1184661) | about two weeks ago | (#47432357)

Here is what I have done in the past with network gear:

1. Make sure that you have a test environment that is as close to your production environment as possible. In the case of network gear, I test on the exact same switches with the exact same firmware and configuration. For servers, VMware is your friend...

2. Build your script, test, and document the process as many times as necessary to ensure that there are no gotchas. This is easier for network gear, as there are fewer prompts and options.

3. Build in a backup job in your script, schedule a backup with enough time to complete before your script runs, or make your script dependent on the backup job completing successfully. A good backup is your friend. Make a local backup if you have the space.

4. Schedule your job.

5. Get up and check that the job completed successfully, either when the job is scheduled to be completed or before the first user is expected to start using the system. Leave enough time to perform a restore, if necessary.

As you can probably tell, doing this in an automated fashion takes more time and effort than babysitting the process yourself. However, it is worth it if you can apply the same process to a bunch of systems (e.g. you have a bunch of UNIX boxes on the same version and you want to upgrade them all). In our environment we have a large number of switches, etc. that are all on the same version. Automation is pretty much the only option given our scope.
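
A rough sketch of the step-3 idea (refuse to run the change unless a fresh backup exists) might look like this; the marker file, its maximum age, and the change-script path are assumptions for illustration:

<ecode>
#!/usr/bin/env python3
"""Sketch of step 3: don't start the change unless tonight's backup succeeded.

The success-marker file, its allowed age, and the change-script path are
placeholders; the backup job itself is assumed to touch the marker on success.
"""
import datetime
import pathlib
import subprocess
import sys

MARKER = pathlib.Path("/var/run/backup-ok")   # touched by the backup job on success
MAX_AGE = datetime.timedelta(hours=6)         # backup must be tonight's, not last week's

def backup_is_fresh() -> bool:
    if not MARKER.exists():
        return False
    mtime = datetime.datetime.fromtimestamp(MARKER.stat().st_mtime)
    return datetime.datetime.now() - mtime < MAX_AGE

def main() -> None:
    if not backup_is_fresh():
        print("No fresh backup marker found; refusing to run the maintenance.", file=sys.stderr)
        sys.exit(2)
    # Only now run the actual change script (itself tested per steps 1 and 2).
    sys.exit(subprocess.run(["/usr/local/sbin/apply-change.sh"]).returncode)

if __name__ == "__main__":
    main()
</ecode>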

Re:Murphy says no. (1)

Vellmont (569020) | about two weeks ago | (#47432433)


  say the patch unexpectedly breaks another critical function of the server. It happens, if you have been in IT any time you have seen it happen

Yes, this happens all the time. And really it's a case for doing the upgrade when people are actually using the system. If the patch happens at 2am (chosen because nobody is using it at 2am), nobody is going to notice it until the morning. The morning, when the guy who put in the patch is still trying to recover from having to work at 2am. At the very least groggy, and not performing at his/her best.

Re:Murphy says no. (1)

wisnoskij (1206448) | about two weeks ago | (#47432449)

This. No matter what you do, this maintenance and downtime is hundreds of times more likely to go wrong than normal running time. What is the point of even employing IT if they are not around for this window?

Re:Murphy says no. (1)

smash (1351) | about two weeks ago | (#47432541)

Yup. Although, that said, if you have a proper test environment, like say, a snap-clone of your live environment and an isolated test VLAN, you can do significant testing on copies of live systems and be pretty confident it will work. You can figure out your back-out plan, which may be as simple as rolling back to a snapshot (or possibly not).

Way too many environments have no test environment, but these days with the mass deployment of FAS/SAN and virtualization, you owe it to your team to get that shit set up.

Re:Murphy says no. (1)

Culture20 (968837) | about two weeks ago | (#47432565)

say the patch unexpectedly breaks another critical function of the server.

When this happens, it usually takes a lot longer to fix than it takes to drive in to work, because the way it breaks is unexpected. The proper method is to have an identical server get upgraded with this automatic maintenance window method the day before while you're at work or at least hours before the primary system so that you can halt the automatic method remotely before it screws up the primary system. If the service isn't important enough, let your monitoring software wake you up if there's a failure or ignore it until you get in at your normal time. Most of the time, having a regularly well-rested sysadmin is more important to a company than having "light-switch monitoring server three" running between 4AM and 8AM.

windows (0)

Anonymous Coward | about two weeks ago | (#47432037)

Phew, thought you were going to ask about it on Windows. Linux, go for it!

Re:windows (0)

Anonymous Coward | about two weeks ago | (#47432257)

I agree. Linux is much better than Windows.

I've toyed with this concept.. (5, Interesting)

grasshoppa (657393) | about two weeks ago | (#47432041)

...and while I'm reasonably sure I could execute automated maintenance windows with little to no impact to business operations, I'm not certain. So I don't do it.

If there were more at stake, if the risk-vs-benefit balance were tipped more in my company's favor, I might test-implement it. But just to catch an extra hour or two of sleep? Not worth it; I want a warm body watching the process in case it goes sideways. 9 times out of 10, that warm body is me.

Automated troubleshooting? (5, Insightful)

HBI (604924) | about two weeks ago | (#47432047)

Maintenance windows are at off-hours to accommodate real work happening. If every action was painless and produced the desired result, you could do it over lunch or something like that. But that's not the real world.

This begs the question of how the hell are you going to fix unexpected problems in an automated fashion? The answer is, you aren't. Therefore, you have to be up at 2am.

Re:Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47432223)

Well, there might not be ways of fixing unexpected problems in an automated fashion, but there are certainly a lot of ways of catching unexpected problems and sending out an automated text/email to wake you up. If you carefully handle return codes and set timeouts in your scripts, as well as monitoring the machine from the outside, you should be able to sleep most of the time. Do I do it? No. My clients refuse. Besides, there are DBAs and app admins on a phone bridge waiting for me to hand over the updated server so they can start their DBs and apps and have the regression tests begin.
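
For the "monitor it from the outside" half, a minimal external watchdog could be as simple as the sketch below; the host, port, deadline, and SMS-gateway address are placeholders:

<ecode>
#!/usr/bin/env python3
"""External watchdog sketch: give the box N minutes to come back after its window.

The host name, probe port, deadline, and alert address are placeholders.
"""
import smtplib
import socket
import time
from email.message import EmailMessage

HOST, PORT = "db01.example.com", 22       # assumed host; SSH used as the "is it back?" probe
DEADLINE_MINUTES = 20

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def page(text: str) -> None:
    """Mail an SMS gateway address; assumes a local SMTP relay."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = text, "watchdog@localhost", "5551234567@sms.example.com"
    msg.set_content(text)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

deadline = time.time() + DEADLINE_MINUTES * 60
while time.time() < deadline:
    if port_open(HOST, PORT):
        raise SystemExit(0)               # host came back; go back to sleep
    time.sleep(30)
page("%s did not come back within %d minutes after maintenance" % (HOST, DEADLINE_MINUTES))
</ecode>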

Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47432225)

I'm seeing this increasingly often......misuse of the phrase "begs the question". Why don't you look it up?

Re:Automated troubleshooting? (2)

HBI (604924) | about two weeks ago | (#47432293)

How about looking up "pedantry".

Re:Automated troubleshooting? (1)

gstoddart (321705) | about two weeks ago | (#47432381)

I'm seeing this increasingly often......misuse of the phrase "begs the question". Why don't you look it up?

There are now two distinct phrases in the English language:

There is the logical fallacy of begging the question.

Sometimes, an event happens which begs (for) the question of why nobody planned for it.

You might think you sound all clever and stuff, but you're wrong. They sound similar, but they aren't the same. The second one has been in common usage for decades now, and has nothing to do with the logical fallacy.

Automated troubleshooting? (0)

Anonymous Coward | about two weeks ago | (#47432241)

We do some automation for maintenance, but the end result has to be able to be tested thoroughly automatically. If the automated tests succeed, I stay asleep. If they fail, I get paged and wake up to deal with it. 90% of the time, it works and I get to sleep through the night. But we can only really do this for simple maintenance.

Re:Automated troubleshooting? (1)

mshieh (222547) | about two weeks ago | (#47432471)

If you have proper monitoring, you don't need to be up at 2am. You just need to be willing to answer the phone at 2am.

This is why n+2 and VMware are so useful. (1)

Anonymous Coward | about two weeks ago | (#47432051)

If you have a high availability system with more than one backup node then daytime maintenance becomes very doable.

Attended automation (3, Interesting)

Anonymous Coward | about two weeks ago | (#47432053)

Attended automation is the way to go. You gain all the advantages of documentation, testing, etc. If the automation goes smoothly, you only have to watch it for 5 minutes. If it doesn't, then you can fix it immediately.

Schedule some days as offset days (1)

ModernGeek (601932) | about two weeks ago | (#47432057)

You just need to schedule some of your days as offset days. Work from 4pm to midnight some days so that you can get some work done when others aren't around. Some days require you to be around people; some days demand you be alone.

Or you can just work 16-hour days like the rest of us and wear it with a badge of honor.

If you are your own boss and do this, you can earn enough money to take random weeks off from work with little to no notice so that you can travel the world, and do some recruiting while you're at it so that you can write the expenses off on the company.

Re:Schedule some days as offset days (1)

DarkOx (621550) | about two weeks ago | (#47432201)

Pretty much this. If your company is big enough, or derives enough revenue from IT systems that require routine off-hours maintenance, it should staff for that.

That is not to say they need to if it's just Patch Tuesday, or the occasional major internal code deployment that happens a couple of times a year. For that, you as the admin should suck it up and roll out of bed early once in a while. Hopefully your bosses are nice and let you have some flextime for it. Knock off at 3 PM on Fridays those weeks or something.

If there is a regular maintenance window that is frequently used, say at least twice a week, then they need to make it part of the regular scheduled working hours for some employee(s). Maybe a junior admin who can follow deployment instructions works 3a-10a Tuesdays and Wednesdays; but let's be fair to that person: they have a life outside of work and deserve a predictable schedule. They should still work those hours even if there is nothing going on that week, and just use the time to do whatever else they do: update documentation, test out new software versions, inventory, etc.

Re:Schedule some days as offset days (1)

CanHasDIY (1672858) | about two weeks ago | (#47432235)

Or you can just work 16-hour days like the rest of us and wear it with a badge of honor.

IMO, there is no honor in working more hours than you're actually being paid to work. Not only are you hurting yourself, you're keeping someone else from being able to take that job.

If you've got 80 hours worth of work to do at your company, and one guy with a 40-hour-a-week contract, you need to hire another person, not convince the existing guy that he should be proud to be enslaved. Morally speaking.

Re: Schedule some days as offset days (0)

Anonymous Coward | about two weeks ago | (#47432291)

Yeah, but what if you're making tried as much as the guys working 40 hours?

Re:Schedule some days as offset days (1)

rikkards (98006) | about two weeks ago | (#47432529)

Not only that, but a company that lets someone do that is shooting itself in the foot. Sooner or later the 80-hour-a-week guy is going to leave, and good luck getting a replacement who is
A: willing to do it coming in
B: not just taking the job until something better comes along.

It's not a badge of honor, just a rationalization for a crappy job.

Re:Schedule some days as offset days (2)

QRDeNameland (873957) | about two weeks ago | (#47432319)

Or you can just work 16-hour days like the rest of us and wear it with a badge of sucker.

FTFY

Depends on the Application layer / patch applied (1)

slacklinejoe (1784298) | about two weeks ago | (#47432065)

I do this for a lot of clients: Automatic Deployment Rules in Configuration Manager, scripts, cron jobs, etc. For test/dev it absolutely makes sense, as I usually have a monitoring system that goes into maintenance mode during the updates. If things take too long, or if services aren't restored post-update, the monitoring system gives me a shout that something needs remediation. For production, it varies with the expected impact. If it's something I tested in pilot with zero issues and the application isn't something with an insane SLA, sure, I'll use an automatic deployment. When I'm working on hospital equipment such as servers processing imaging or vitals monitoring for surgery, that gets nixed no matter what, due to the liability concerns. I usually suggest building up trust and experience by automating the less critical systems and phasing in more sensitive systems once you've gained a lot of experience with it and have more management support; when crap goes down, it's easier to say "this is a tested process we've been using for years" than "yeah, oops, new script, sorry that knocked down our ERP system..." Resume-generating event right there... So I guess it depends; it's just another tool for the toolbox, and it's up to the carpenter to know when to pull it out.

Ansible (0)

Anonymous Coward | about two weeks ago | (#47432071)

I like ansible... a lot. Chef, Salt, or something else if that is your preference. In any event, yes, an automated deployment framework allows you to test the maintenance procedure out, throttle the number of servers that get managed at one time, and bail (and/or text you) if there is a problem.

Done right it can be run continuously so that you are always confident about the state of your servers and their maintenance procedures.
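
The throttle-and-bail behaviour such frameworks give you can be sketched in plain Python over SSH, just to show the shape of it; host names, batch size, and the update command are placeholders, and a real tool like Ansible handles this natively:

<ecode>
#!/usr/bin/env python3
"""Rolling-update sketch: patch servers in small batches, stop if a batch fails.

Host names, batch size, and the remote command are placeholders; a real
framework (Ansible, Salt, ...) provides this via its own throttling options.
"""
import subprocess
import sys

HOSTS = ["web01", "web02", "web03", "web04", "web05", "web06"]
BATCH_SIZE = 2
UPDATE_CMD = "sudo yum -y update && sudo systemctl restart httpd"

def run_on(host: str) -> bool:
    """Run the update over SSH; assumes key-based auth is already in place."""
    result = subprocess.run(["ssh", host, UPDATE_CMD], capture_output=True, text=True)
    if result.returncode != 0:
        print("%s failed:\n%s" % (host, result.stderr), file=sys.stderr)
    return result.returncode == 0

for i in range(0, len(HOSTS), BATCH_SIZE):
    batch = HOSTS[i:i + BATCH_SIZE]
    if not all(run_on(h) for h in batch):
        # Bail out: the rest of the fleet is still on the old, known-good version.
        sys.exit("Stopping rollout after failed batch: %s" % batch)
    print("Batch OK:", batch)
</ecode>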

We do it all the time... (0)

Anonymous Coward | about two weeks ago | (#47432073)

We do it all the time... Schedule a snapshot, push patches, verify things are up, and if not throw an alarm... Using Shavlik on, horror of horrors, Windows...

I've rolled back 2 or 3 in the past 5 years, usually due to Microsoft's inability to consistently write a patch that doesn't break something, and once because the vendor's software seemed to be hypersensitive to .Net patch levels...

use some configuration management tools (0)

Anonymous Coward | about two weeks ago | (#47432075)

- cfengine
- puppet
- chef
- ansible
- salt

All should be able to do the work.

Offshore (4, Insightful)

pr0nbot (313417) | about two weeks ago | (#47432091)

Offshore your maintenance jobs to someone in the correct timezone!

Re:Offshore (0)

Anonymous Coward | about two weeks ago | (#47432177)

this is what my employer does

Re:Offshore (0)

Anonymous Coward | about two weeks ago | (#47432395)

This is a temporary solution if your business grows worldwide. It's 2014, time to start using virtualization/imaging and distributed services. You can keep everything running while you do maintenance on individual nodes. Some systems can even move a running virtual image to another physical server.

Sounds like a bad idea ... (4, Insightful)

gstoddart (321705) | about two weeks ago | (#47432095)

You don't monitor maintenance windows for when everything goes well and is all boring. You monitor them for when things go all to hell and someone needs to correct it.

In any organization I've worked in, if you suggested that, you'd be more or less told "too damned bad, this is what we do".

I'm sure your business users would love to know that you're leaving it to run unattended and hoping it works. No, wait, I'm pretty sure they wouldn't.

I know lots of people who work off-hours shifts to cover maintenance windows. My advice to you: suck it up, princess, that's part of the job.

This just sounds like risk taking in the name of being lazy.

This is why you need.. (3, Insightful)

arse maker (1058608) | about two weeks ago | (#47432101)

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

Re:This is why you need.. (5, Insightful)

Shoten (260439) | about two weeks ago | (#47432207)

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

First off, you can't mirror everything. Lots of infrastructure and applications are either prohibitively expensive to do in a High Availability (HA) configuration or don't support one. Go around a data center and look at all the Oracle database instances that are single-instance...that's because Oracle rapes you on licensing, and sometimes it's not worth the cost to have a failover just to reach a shorter RTO target that isn't needed by the business in the first place. As for load balancing, it normally doesn't do what you think it does...with virtual machine farms, sure, you can have N+X configurations and take machines offline for maintenance. But for most load balancing, the machines operate as a single entity...maintenance on one requires taking them all down because that's how the balancing logic works and/or because load has grown to require all of the systems online to prevent an outage. So HA is the only thing that actually supports the kind of maintenance activity you propose.

Second, doing this adds a lot of work. Failing from primary to secondary on a high availability system is simple for some things (especially embedded devices like firewalls, switches and routers) but very complicated for others. It's cheaper and more effective to bump the pay rate a bit and do what everyone does, for good reason...hold maintenance windows in the middle of the night.

Third, guess what happens when you spend the excess money to make everything HA, go through all the trouble of doing failovers as part of your maintenance... and then something goes wrong during that maintenance? You've just gone from HA to single-instance, during business hours. And if that application or device is one that warrants being in an HA configuration in the first place, you're now in a bit of danger. Roll the dice like that one too many times, and someday there will be an outage... of that application/device, followed immediately by an outage of your job. It does happen, it has happened, I've seen it happen, and nobody experienced who runs a data center will let it happen to them.

Re:This is why you need.. (1)

MondoGordo (2277808) | about two weeks ago | (#47432417)

In my experience, if your load-balancing solution requires all your nodes to be available, and you can't remove one or more nodes without affecting the remainder, it's a piss-poor load balancing solution. Good load balancing solutions are fault tolerant up to, and including, absent or non-responsive nodes and any load balanced system that suffers an outage due to removing a single node is seriously under-resourced.

Re:This is why you need.. (1)

CWCheese (729272) | about two weeks ago | (#47432287)

Several posts have alluded to high-availability, mirrored, load balanced, etc etc as being the solution to simply updating systems. The problem from a management point of view is to remain on guard when a patch or upgrade goes bad. Having turned into one of those 'old-guys', I'm quite sobered by the bad maintenance windows I've been a party to and will never consider unattended maintenance windows for my teams. It's better for me to schedule the work and let my folks adjust their work days to get to the maintenance fully alert and aware, and in full attendance for that time when things don't go as planned.

Re:This is why you need.. (1)

CanHasDIY (1672858) | about two weeks ago | (#47432327)

Load balanced or mirrored systems. You can upgrade part of it any time, validate it, then swap it over to the live system when you are happy.

Having someone with little or no sleep doing critical updates is not really the best strategy.

Oh my $deity, this!

I've worked in environments with test-to-live setups, and ones without, and the former is always, always a smoother running system than the latter.

Immutable Servers (2)

skydude_20 (307538) | about two weeks ago | (#47432105)

If these services are as critical as you say, I would assume you have some sort of redundancy, at least a 2nd server somewhere. If so, treat each as "throw away": build out what you need on the alternative server, swing DNS, and be done. Rinse and repeat for the next 'upgrade'. Then do your work in the middle of the day. See Immutable Servers: http://martinfowler.com/bliki/... [martinfowler.com]
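
If DNS is what points clients at the live box, the "swing DNS" step can itself be a small script. Purely for illustration, here is a sketch assuming the record happens to live in Route 53 (the zone ID, record name, and new address are made up):

<ecode>
#!/usr/bin/env python3
"""Sketch of the "swing DNS" cut-over, assuming Route 53 via boto3; the zone ID,
record name, and the new server's address are made-up placeholders."""
import boto3

ZONE_ID = "Z0000000EXAMPLE"
RECORD = "app.example.com."
NEW_IP = "203.0.113.42"        # the freshly built replacement server

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={
        "Comment": "cut over to rebuilt server",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD,
                "Type": "A",
                "TTL": 60,   # keep the TTL short so the swing takes effect quickly
                "ResourceRecords": [{"Value": NEW_IP}],
            },
        }],
    },
)
print("DNS now points %s at %s" % (RECORD, NEW_IP))
</ecode>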

Automate Out (2)

whipnet (1025686) | about two weeks ago | (#47432115)

Why would you want to automate someone or yourself out of a job? I realized years ago that Microsoft was working hard to automate me out of my contracts. It's almost done, why accelerate the inevitable?

Automate successful execution as well (1)

Boawk (525582) | about two weeks ago | (#47432123)

Setting aside the wisdom (or lack thereof) of automating maintenance, you should also have some process external to the maintained machines that confirms that the maintenance worked. That confirmation could be something like testing that a Web server continues to serve the expected pages, some port provides expected information, etc. If this external process notes a discrepancy, it would page/text/call you.
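
A minimal version of such an external confirmation process, run from a machine other than the one being maintained, might look like this; the URLs, marker strings, TCP check, and alert address are placeholders:

<ecode>
#!/usr/bin/env python3
"""External post-maintenance check, run from a *different* machine: does the web
server still serve the expected pages, and is the expected port answering?
URLs, marker strings, and the alert address are placeholders."""
import smtplib
import socket
import urllib.request
from email.message import EmailMessage

PAGE_CHECKS = [
    ("https://www.example.com/", b"Welcome"),    # page must contain this marker
    ("https://www.example.com/login", b"<form"),
]
TCP_CHECKS = [("db.example.com", 5432)]

def failures() -> list:
    bad = []
    for url, marker in PAGE_CHECKS:
        try:
            body = urllib.request.urlopen(url, timeout=10).read()
            if marker not in body:
                bad.append("%s missing %r" % (url, marker))
        except OSError as exc:            # URLError is a subclass of OSError
            bad.append("%s unreachable: %s" % (url, exc))
    for host, port in TCP_CHECKS:
        try:
            socket.create_connection((host, port), timeout=5).close()
        except OSError:
            bad.append("%s:%d not answering" % (host, port))
    return bad

problems = failures()
if problems:
    msg = EmailMessage()
    msg["Subject"] = "Post-maintenance checks FAILED"
    msg["From"], msg["To"] = "checks@localhost", "oncall@example.com"
    msg.set_content("\n".join(problems))
    with smtplib.SMTP("localhost") as smtp:   # assumes a local SMTP relay
        smtp.send_message(msg)
</ecode>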

Great idea! (0)

Anonymous Coward | about two weeks ago | (#47432137)

I am "the guy". The guy that your boss calls when your simple maintenance outage goes all sideways (and I like your idea). Positioning oneself so that any problem becomes a lingering outage that shakes your company's faith in your IT Director's ability to do their job competently is always a great idea. If you can chron it from work, why not chron it from an outsourced location? I mean, either it goes well and they don't need you or it goes sideways and they need me. Either way, you are screwed. PRO TIP: do not store your resume on any system that you chron after-hours updates to.

Fixing the wrong problem (1)

Zanthras (844625) | about two weeks ago | (#47432141)

By far the better solution is to figure out why that one specific server can't be offlined. It's far safer, regardless of the tests and validations, to work on a server that's not supposed to be running vs one that is. It obviously takes a lot of work, but all your critical/important services should be running in some sort of HA scenario. If you can't take a 5-minute outage just after normal business hours, you absolutely cannot take a failure in the service due to any sort of hardware failure (which will happen). This is coming from years of experience in a Software as a Service company.

Slashdot is a Bad Place to Ask This (4, Interesting)

terbeaux (2579575) | about two weeks ago | (#47432147)

Everyone here is going to tell you that a human needs to be there, because that is their livelihood. Any task can be automated at a cost. I am guessing that it is not your current task to automate maintenance, otherwise you wouldn't be asking. Somewhere up your chain they decided that for the uptime / quality of service it is more cost-effective to have a human do it. That does not mean that you cannot present a case showing otherwise. I highly suggest that you win approval and backing before taking time to try to automate anything.

Out of curiosity, are they VMs?

Re:Slashdot is a Bad Place to Ask This (0)

Anonymous Coward | about two weeks ago | (#47432505)

snapshots FTW!

Do a risk assessment. (0)

Anonymous Coward | about two weeks ago | (#47432149)

What's the impact if it all goes wrong and you're not there? If impact is huge and you're fired if it all goes bad, be there. If it doesn't matter and it can fail with no consequences, script it.

Disclaimer: Today (Friday) I found out my company is doing a DR exercise from 10PM tonight to 9AM tomorrow. I'm an ITSec manager, and they wanted to know if they could make a "few firewall changes if they need to". I said no, and told them I would stay up late to review and approve any emergency changes they want but they were NOT getting "carte blanche" with no ITSec oversight, as that would be really irresponsible and break SOX, etc... You do what you have to in order to get the job done properly!

(Posting Anonymous so !humblebragging.)

How about using a remote console? (0)

Anonymous Coward | about two weeks ago | (#47432155)

If you were to set up a hardware remote console, you could do it from home. So yeah, it's 15 minutes out of bed, but then it's right back to bed.

I have automated maintenances in the form of ... (2)

spads (1095039) | about two weeks ago | (#47432157)

...service bounces that are happening all the time. When one occurs, and/or if there are any other issues, I can send myself a mail. My BlackBerry has filters which allow an alarm to go off that can wake me during the night. That would seem to meet your needs.

Nature of the beast (1)

Danzigism (881294) | about two weeks ago | (#47432175)

Although I do feel this is the nature of the beast when working in a true IT position where businesses rely on their systems nearly 100% of the time, there are some smart ways to go about it. I'm not exactly sure what type of environment you're using, but if you use something like VMware's vSphere product, or Microsoft's Hyper-V, both allow for "live migrations". Why not virtualize all of your servers first, make a snapshot, perform the maintenance, and live migrate the VMs? You could do it right in the middle of the day and nobody would even know. This kind of setup takes a lot of planning, however. I personally wouldn't want any maintenance performed on my servers without manual approval. Unattended maintenance sounds a bit too scary for my liking, and in my experience with even small security updates for both Linux and Windows servers, there's bound to be a point where something fails, and you could potentially get in a lot of legal trouble if you fail to meet your SLA, or cause a loss of profit for a business due to downtime.

It depends on the size of your operation... (4, Interesting)

jwthompson2 (749521) | about two weeks ago | (#47432183)

If you really want to automate this sort of thing you should have redundant systems with working and routinely tested automatic fail-over and fallback behavior. With that in place you can more safely setup scheduled maintenance windows for routine stuff and/or pre-written maintenance scripts. But, if you are dealing with individual servers that aren't part of a redundancy plan then you should babysit your maintenance. Now, I say babysit because you should test and automate the actual maintenance with a script to prevent typos and other human errors when you are doing the maintenance on production machines. The human is just there in case something goes haywire with your well-tested script.

Fully automating these sorts of things is out of reach for many small to medium-sized firms because they won't, or can't, invest in the added hardware to build out redundant setups that can continue operating when one participant is offline for maintenance. So the size of your operation, and how much your company is willing to invest to "do it the right way", is the limiting factor in how much of this sort of task you are going to be able to automate effectively.

Similar experiences ... (4, Insightful)

psergiu (67614) | about two weeks ago | (#47432187)

A friend of mine lost his job over a similar "automation" task on Windows.

The upgrade script was tested in a lab environment that was supposed to be exactly like production (but it turned out it wasn't - someone had tested something there earlier without telling anyone and did not revert it). The upgrade script was scheduled to run on production during the night.

Result - the \windows\system32 dir deleted from all the "upgraded" machines. Hundreds of them.

On the Linux side, I personally had Red Hat make some "small" changes on the storage side, and PowerPath got disabled at the next boot after patching. An unfortunate event, since all Volume Groups were using /dev/emcpower devices. Or Red Hat making some "small" changes in the clustering software from one month to the next. No budget for test clusters. Production clusters refusing to mount shared filesystems after patching. Thankfully, in both cases the admins were up & online at 1AM when the patching started and we were able to fix everything in time.

Then you can have glitchy hardware/software deciding not to come back up after a reboot. RHEL GFS clusters are known to randomly hang/crash at reboot. HP blades sometimes have to be physically removed & reinserted to boot.

Get the business side to tell you how much is going to cost the company for the downtime until:
- Monitoring software detects that something is wrong;
- Alert reaches sleeping admin;
- Admin wakes up and is able to reach the servers.
Then see if you can risk it.
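
A back-of-the-envelope version of that calculation, with every figure a placeholder to be replaced by your own numbers:

<ecode>
# Back-of-the-envelope cost of an unattended failure; every figure is a placeholder.
detect_minutes = 10      # monitoring interval before the failure is even noticed
alert_minutes = 5        # paging delay until the alert reaches the sleeping admin
respond_minutes = 25     # wake up, log in, diagnose, start fixing
fix_minutes = 30         # actual remediation time

outage_hours = (detect_minutes + alert_minutes + respond_minutes + fix_minutes) / 60
cost_per_hour = 20000    # what the business says an hour of downtime costs

print("Expected outage: %.1f h, cost: $%.0f" % (outage_hours, outage_hours * cost_per_hour))
# Expected outage: 1.2 h, cost: $23333
</ecode>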

Re:Similar experiences ... (0)

Anonymous Coward | about two weeks ago | (#47432351)

I just modded you up on this, but I'd love to know more details on how an upgrade script can wipe out the core windows directory!

Re:Similar experiences ... (0)

Anonymous Coward | about two weeks ago | (#47432457)

I'd hate to have been the guy who wiped \windows\system32 on hundreds of servers...wow.

Prepare for failure (1)

davidwr (791652) | about two weeks ago | (#47432193)

One way to prepare for failure is to have someone there who can at least recognize the failure and wake someone up in time to fix it.

Another way to prepare for failure is to have a system that is redundant enough that a part could go down and it wouldn't be more than a minor annoyance to users or management.

There are other ways to prepare for failure, but these are two common ones.

Re:Prepare for failure (1)

gstoddart (321705) | about two weeks ago | (#47432325)

Some of us would argue that doing maintenance unattended is preparing for failure -- or at least giving yourself the best possible chance of failure.

I work in an industry where if we did our maintenance badly, and there was an outage it would literally cost millions of dollars/hour.

If what you're doing is so unimportant that you can leave the maintenance unattended, there's probably no reason you couldn't do the outage in the middle of the day.

If it is important, you don't leave it to chance.

Sometimes the reasons aren't technical (1)

davidwr (791652) | about two weeks ago | (#47432499)

Maybe back when the maintenance window was created it was created for a valid technical reason, BUT technology moved on and management didn't.

In other words, in some environments, the technical people won't have a sympathetic ear if they ask to cancel the off-hours maintenance window simply because of local politics or the local management, BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice.

Re:Sometimes the reasons aren't technical (1)

gstoddart (321705) | about two weeks ago | (#47432555)

BUT if the maintenance gets botched and services are still down or under-performing through normal business hours, nobody outside of IT will notice

Then you're maintaining trivial, boring, and unimportant systems that nobody will notice. If your job is to do that ... well, your job is trivial and unimportant.

The stuff that I maintain, if it was down or under-performing during normal business hours ... we would immediately start getting howls from the users, and the company would literally be losing vast sums of money every hour. Because our stuff is tied into every aspect of the business, and is deemed to be necessary for normal operations.

Sorry, but some of us actually maintain stuff which is mission critical to the core business, and people would definitely notice it.

As one of the technical people who does cover after hours maintenance ... if a technical person suggested we automate our changes and not monitor them, they wouldn't get a sympathetic ear from me either.

There may be systems like you describe. And, as I said before, if that's the case, do your maintenance windows in the middle of the day.

Set alarms (1)

MrL0G1C (867445) | about two weeks ago | (#47432203)

Can't you make some kind of setup that triggers if the update fails and alerts you / wakes you up with noise from your smartphone, etc.?

Or, like the other poster who beat me to it: off-load your work to someone in a country where your 5 AM is their midday.

Security Availability vs Availability Security (0)

Anonymous Coward | about two weeks ago | (#47432219)

While just about everybody puts availability over security, it depends on what you sell your clients. We at least apply all security updates fully automatically, without review or a maintenance window: whenever cron-apt picks up a new Debian security update, it gets installed. If something goes wrong, our customers understand, as that is what they want or need: security, even if availability suffers.

Perception of Necessity (1)

bengoerz (581218) | about two weeks ago | (#47432229)

By proving that your job can be largely automated, you are eroding the reasons to keep you employed.

Sure, we all know it's a bad idea to set things on autopilot because eventually something will break badly. But do your managers know that?

Re:Perception of Necessity (0)

Anonymous Coward | about two weeks ago | (#47432347)

Well, it could be that you'd actually spend a lot more time (though during regular hours) testing/tweaking the automated solution. So it might actually mean more work overall. Having said that, we feel happier having one or two people do most stuff like that, and that's partly due to not having the luxury of an exact replica as a test environment, nor the time to do it.

Depends; and not like the adult diaper (0)

Anonymous Coward | about two weeks ago | (#47432251)

This has always been a point of contention. Some systems can be automated through SMS, System Center, or even a VB script. However, I've had Windows updates corrupt IIS web servers before, requiring me to uninstall all .NET frameworks, reinstall IIS, and reinstall the .NET framework. This is one of those situations you don't want to wake up to on Monday morning with customers down. For critical systems, I always manually test on test systems, push to production, and test after the updates are applied to make sure everything is running as intended. For low-impact updates like CCleaner, automated pushes are much more viable because the impact to the system is relatively low. So, as the subject says, "Depends". Hope this helps with your inquiry.

No. Do your maintenance *in* working hours. (0)

Anonymous Coward | about two weeks ago | (#47432261)

If you do your maintenance out of hours and something goes wrong, who's going to fix it? Some bleary-eyed administrator who's had 2 hours of sleep? If they need to escalate it, who are they going to call at 2am? Also, are these guys being paid double time to work these hours, given time off to compensate, or just expected to suck it up and work a normal day shift after working half the night? Whichever way you look at it, it's full of problems.

Instead, rearchitect your solution. If you care about a service enough not to take planned downtime in working hours, you probably care enough that unplanned downtime in working hours should not be business-affecting either. So you should double up on servers (which should be pretty cheap if you're running everything in a virtualised environment) and arrange for services to fail over to a secondary if the primary is unavailable. If you're doing this on Windows (my condolences to you) it should mostly support this anyway. If you're doing it on something unix-like, you can use things like keepalived to fail a service over from one node to another.

Once you have a solution like this, maintenance is easy - you patch/upgrade/reboot your backup server, check that it's OK, then promote it to primary, then do the other server, and then promote it back again. You do it *all* in working hours so that (a) people get a decent night's sleep and (b) if something goes wrong you can call on your support provider without having to pay over the odds for 24x7 support.
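
The patch-standby / verify / promote / patch-primary sequence can be scripted in a few lines; in this sketch the host names, the health check, and the promote command are stand-ins for whatever keepalived or your cluster manager actually provides:

<ecode>
#!/usr/bin/env python3
"""Sketch of a failover-based, working-hours upgrade: patch the standby, verify,
promote it, then repeat for the old primary. Host names, the service name, and
the promote command are placeholders for your real cluster tooling."""
import subprocess
import sys

PRIMARY, STANDBY = "app01", "app02"

def ssh(host: str, cmd: str) -> None:
    """Run a command over SSH; abort the whole run on any failure."""
    if subprocess.run(["ssh", host, cmd]).returncode != 0:
        sys.exit("'%s' failed on %s -- stopping with the other node untouched" % (cmd, host))

def upgrade_and_verify(host: str) -> None:
    ssh(host, "sudo yum -y update")                      # placeholder patch step
    ssh(host, "sudo systemctl restart myservice")        # placeholder service restart
    ssh(host, "systemctl is-active --quiet myservice")   # crude health check; non-zero aborts

# 1. Patch the node that is NOT carrying traffic, and make sure it's healthy.
upgrade_and_verify(STANDBY)
# 2. Move the virtual IP / service role over to the freshly patched node.
ssh(STANDBY, "sudo promote-to-primary")                  # placeholder for the real promotion command
# 3. Now the old primary is idle; patch it the same way.
upgrade_and_verify(PRIMARY)
</ecode>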

Testing fails (0)

Anonymous Coward | about two weeks ago | (#47432275)

In my experience, testing, no matter how thorough you think it is, will fail to account for all possibilities. That one possibility you missed will bite you in the ass when you automate your maintenance.

Don't automate yourself out of a job. (0)

Anonymous Coward | about two weeks ago | (#47432283)

It's wise to keep a human on-site because it maintains the employer's idea that you are worth keeping, rather than outsourcing your position to someone who can do the job from a distance.

I work for money, not some blithering ideal of efficiency which may not include me.

When I trim my trees I don't cut off the branch I'm sitting on.

Good for the Goose (1)

Cigamit (200871) | about two weeks ago | (#47432317)

Simple.

You stipulate that for every maintenance window, there has to be full regression testing of any affected applications. You will require the application owner, QA folks, and any other affected personnel online during and after the maintenance to test and ensure everything is working. Bonus points: require them to be on a conference call, and breathe heavily into the mic the entire time (maybe occasionally say "Oops"). When you have enough other people complaining about the 2 am times instead of just you, they magically get moved to more sensible times in the late afternoon.

Your best bet is to get out of Managed Services and into Professional Services. You just build out new environments / servers / apps and hand them off to the MS guys. Once it's off your hands, you never have to worry about a server crashing, maintenance windows, or being on call. Plus, you are generally paid more.

Its your network (1)

sasquatch989 (2663479) | about two weeks ago | (#47432335)

I think automating maintenance is a smart move but still requires you be awake and available for it. The question is do you want to be awake at work for 10 minutes or 2 hours? Plan accordingly.

Testing. Validation. (2)

mythosaz (572040) | about two weeks ago | (#47432353)

Do you plan on automating the end-user testing and validation as well?

Countless system administrators have confirmed the system was operational after a change without throwing it to real live testers, only to find that, well, it wasn't.

Nope. (1)

ledow (319597) | about two weeks ago | (#47432365)

Every second you save automating the task will be taken out of your backside when it goes wrong (see the recent article where a university SCCM server formatted itself and EVERY OTHER MACHINE on campus) and you're not around to stop it or fix it.

Honestly? It's not worth it.

Work out of normal hours, or schedule downtime windows in the middle of the day.

Think of it a slightly different way (3, Informative)

thecombatwombat (571826) | about two weeks ago | (#47432369)

First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.

The problem is that your maintenance for simple things like updating a package requires downtime. If you have better redundancy, you can do 99% of normal, boring maintenance with zero downtime. I say if you're in this situation you need to think about two questions:

1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything* you can easily run through your puppet code as you like in non-production, then in a maintenance window you merge your Puppet code, and simply watch it propagate to your servers. I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.

Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.
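
The "dry run in pre-prod first" part can be mechanized with puppet agent's noop mode; with --detailed-exitcodes the agent reports whether the new code would change or break anything before you merge. A small wrapper, with placeholder host names, might look like:

<ecode>
#!/usr/bin/env python3
"""Dry-run new Puppet code across pre-prod before merging. With
--detailed-exitcodes, puppet agent exits 0 (no changes), 2 (changes would be
made), 4 (failures), or 6 (changes plus failures); host names are placeholders."""
import subprocess

PREPROD_HOSTS = ["staging-web01", "staging-db01"]

for host in PREPROD_HOSTS:
    result = subprocess.run(
        ["ssh", host, "sudo puppet agent --test --noop --detailed-exitcodes"],
        capture_output=True, text=True,
    )
    status = {0: "no changes", 2: "would change", 4: "FAILED", 6: "would change + FAILED"}
    print("%s: %s" % (host, status.get(result.returncode, "exit %d" % result.returncode)))
</ecode>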

Having problems with this on windows. (0)

Anonymous Coward | about two weeks ago | (#47432439)

We have many thousands of Linux and Windows desktop clients hosted in data centers and accessed by thin client protocols. With Linux, no problem: we have our update schedule and everything pretty much works.

However, on the Windows side we need to use a bunch of custom tools to try and beat the systems into line. We often have things blocked by pending reboots, Windows updates, and advertised software being pushed to the systems. We end up with such a mixed bag of systems to deal with that it doesn't always work well.

Please note we also do not have access to the SCCM backend (this has been outsourced). Any suggestions? Except for maybe having a monthly window where we disable WSUS and the SMS host agent, reboot, do our updates, re-enable, and reboot again. It's clunky.

GF

If the machine is virtual.... (0)

Anonymous Coward | about two weeks ago | (#47432453)

If the machine is a VM, why not bring it down, take a snapshot, boot it up, do your update, etc., and then reboot. If the machine is not up within 10 minutes or so, boot up the snapshot you made. You can do all of this from an external machine using the Perl API to VMware, or using the standard KVM/Xen virt tools. This way, if your maintenance fails, you can come in the next morning and figure out what went wrong. I think VMware actually provides a script called "snapshotmanager.pl" in its Perl SDK so you don't need to write your own. (If you're using VMware.)
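
The same snapshot-then-revert pattern, sketched with libvirt's virsh rather than the VMware Perl SDK mentioned above; the guest name, update command, and ten-minute grace period are placeholders:

<ecode>
#!/usr/bin/env python3
"""Snapshot-before-patch sketch using libvirt's virsh; the guest name, update
command, and the 10-minute grace period are placeholders."""
import subprocess
import sys
import time

GUEST = "app-vm01"   # libvirt domain name; assumed to also resolve as a host for SSH

def virsh(*args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["virsh", *args], capture_output=True, text=True)

# 1. Take a named snapshot we can revert to.
virsh("snapshot-create-as", GUEST, "pre-maintenance")

# 2. Kick off the update inside the guest (over SSH, assuming key-based auth).
subprocess.run(["ssh", GUEST, "sudo yum -y update && sudo shutdown -r now"])

# 3. Give it ten minutes to come back; if it doesn't, roll back to the snapshot.
time.sleep(600)
state = virsh("domstate", GUEST).stdout.strip()
if state != "running":
    virsh("snapshot-revert", GUEST, "pre-maintenance", "--running")
    sys.exit("%s did not come back (state: %s); reverted to snapshot" % (GUEST, state))
print("%s is back up; snapshot kept around for the morning just in case" % GUEST)
</ecode>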

Convenience in place of Caution (1)

div_2n (525075) | about two weeks ago | (#47432495)

You're trading caution for convenience.

I have automated some things, such as overnight patch installation, only to wake up to a broken server; the patches were heavily tested and known to work in 100% of cases beforehand, but they failed when nobody was watching.

I urge you to only consider unattended automation overnight when it's for a system that can reasonably incur unexpected downtime without jeopardizing your job and/or the organization. If it's critical -- DO NOT AUTOMATE.

You've been warned.

Perl (1)

Murdoch5 (1563847) | about two weeks ago | (#47432511)

Just write a simple Perl script to handle it; it would take about an hour to develop and test, and you'd be good to go.

Lean Six Sigma to the rescue (0)

Anonymous Coward | about two weeks ago | (#47432561)

Hello OP,

    First things first, we need to discuss some of the less fun stuff about maintenance windows in production environments. For starters, what is the process that your company follows, exactly? Do you test on a dev box beforehand? Do you have a rollback plan in case it goes to hell?

    Now for the less fun part... you will want to be on-site, or at least have someone on-site, whenever you are doing any type of maintenance work on a production system. Not so much a problem nowadays, but a good example from the olden days: what if someone left a disk in the A drive? That means your box most likely won't boot, because it's probably set to boot from A before C. Just an example, but it goes to show that having hands and eyes on-site is next to non-negotiable (there are some exceptions, like completely virtualized environments where you can use the controller to do whatever you need, but even then... like another poster said... Murphy will hunt you down and make you regret not having someone on site).

    Now moving on to the next part, Lean Six Sigma... if you have no training in it, ask your company to pay for your courses. It's a really fun course and it applies to EVERYTHING. I know of a major VoIP solution that uses LSS (Lean Six Sigma) approaches to update systems that are on the 5 9s level (99.999% up-time). You end up breaking down your maintenance into a couple of different steps. Step 1 is about 7 days before the update (moving any required files to the machine, putting them in the right spots, blah blah blah). Step 2 is the day before, generally making sure no updates were released for the files that you moved the previous week (most likely will not apply to you, but still a step nevertheless); this step normally also includes checking and documenting system health to make sure everything is up to snuff (for example, you wouldn't want to start an update if the RAID array is down and dirty, or with other issues like low disk space, high mem/cpu usage, or missing maintenance user accounts). Step 3 is the actual update and checking to make sure everything got the update and started back up. Step 4 is confirming the end result works as intended.

    Really, you can break this into as many steps as you want, but the goal is to have as much ready and done before the actual update as possible, so that your workload at 4 AM is as small as possible.

Somewhat (0)

Anonymous Coward | about two weeks ago | (#47432581)

While you can't quite afford to do it fully unattended, you can spin up another (presumably close to identical) machine with everything that needs to be on there, prepare a last sync, let it sit until maintenance, do the last sync, and swap out the boxes. Test; if it's not OK, swap back until next time. That way all the hard stuff gets moved out of the dark hours and the rest you can do whenever.

Of course, this requires extra hardware or at least allocated virtualised resources, but since these are regarded as close to free these days, and spun-down instances can be re-used next time or for some other task, well, you know.
