[Infrastructures] managing remote jobs called centrally...
David Ulevitch
davidu@everydns.net
Fri, 29 Jul 2005 23:58:29 -0700
Infratects,
I currently have a system (visualize a hub server with spoke nodes
for processing chunks of data) where the hub machine calls many jobs
on remote machines via ssh.
Much like: for i in $hosts; do ssh foo@$i "time /usr/bin/baz"; done;
When jobs fail in their logic the program is set to email the
operations team and we investigate. When a process exits abnormally
we have no clue. To fix this we've started doing things like:
for i in $hosts; do ssh foo@$i "time /usr/bin/baz" || mail -s "job:
baz on $i failed" ops@everydns.net < /dev/null; done;
This is sort of working but we've run into some cases where a few
jobs are still not doing what we expect and the above "fix" for
finding errors is not helpful enough.
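
One way the "fix" above could be made more informative is to keep the
exit status instead of only testing it. This is a rough sketch, not what
we run today: the function names and the REMOTE_RUN hook are made up for
illustration. It relies on ssh exiting with the remote command's status,
or 255 when ssh itself fails, and on the shell convention that a status
above 128 usually means the command was killed by a signal.

```shell
# Hypothetical hook: the command used to reach a host (ssh by default).
REMOTE_RUN=${REMOTE_RUN:-ssh}

# describe_status RC: turn an exit status into a short description.
describe_status() {
    if [ "$1" -eq 0 ]; then
        echo "ok"
    elif [ "$1" -eq 255 ]; then
        # ssh uses 255 for its own errors; a remote exit 255 looks the same.
        echo "ssh error (or remote exit 255)"
    elif [ "$1" -gt 128 ]; then
        # Shell convention: 128 + signal number. Not guaranteed.
        echo "probably killed by signal $(($1 - 128))"
    else
        echo "exited with status $1"
    fi
}

# run_job HOST CMD: run CMD on HOST, print "HOST: description",
# and return the same exit status the remote run produced.
run_job() {
    rc=0
    # stdin comes from /dev/null so the remote run cannot eat the caller's input.
    "$REMOTE_RUN" "$1" "$2" </dev/null || rc=$?
    echo "$1: $(describe_status "$rc")"
    return "$rc"
}
```

The loop could then call run_job "foo@$i" "time /usr/bin/baz" and put
the printed description in the body of the mail, so the report says how
the job died and not just that it did.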
I was wondering what tools or techniques people use to call jobs on
remote machines. We have some scripts that spawn a bunch of
processes on remote boxes in the background and then the script loops
while checking for a pid file or using ps to see if they are all done
before continuing (so jobs are sent out concurrently rather than in
batches). This also makes shell scripts complicated and
I prefer keeping them simple and doing real code in another
language. I know shell can be powerful so I'm asking you all, if you
use it, how so? If not, what do you use or how do you model your
setup differently?
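
For comparison, here is a minimal sketch of a different approach from the
pid-file polling described above: keep each ssh in the foreground on the
remote side, background it on the hub, and let the shell's wait builtin
collect each exit status. The names run_all, REMOTE_RUN and LOG_DIR are
assumptions made up for this example.

```shell
# Hypothetical settings: how to reach a host and where per-host logs go.
REMOTE_RUN=${REMOTE_RUN:-ssh}
LOG_DIR=${LOG_DIR:-/tmp}

# run_all CMD HOST...: start CMD on every host at once, then wait for
# each one and report its exit status. Returns non-zero if any failed.
run_all() {
    cmd=$1
    shift
    started=""
    for host in "$@"; do
        # Each run goes to the background; its output lands in its own log.
        "$REMOTE_RUN" "$host" "$cmd" </dev/null >"$LOG_DIR/$host.log" 2>&1 &
        started="$started $host:$!"
    done
    failed=0
    for entry in $started; do
        host=${entry%:*}    # everything before the last colon
        pid=${entry##*:}    # everything after the last colon
        rc=0
        # wait PID returns the exit status of that background job.
        wait "$pid" || rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "$host: ok"
        else
            echo "$host: failed with status $rc"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

Something like run_all "time /usr/bin/baz" foo@host1 foo@host2 would
then block until every host has finished, with no pid files or ps loops,
at the cost of holding one ssh connection open per host for the life of
the job.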
Thanks,
David Ulevitch
ps: I don't think I've seen infratects used before, seems mostly self-
explanatory. ;-)