[Infrastructures] managing remote jobs called centrally...
Michael Marziani
mdmarziani@yahoo.com
Mon, 1 Aug 2005 12:20:21 -0700 (PDT)
David,
I've had good luck with Sun Grid Engine (SGE). I set it up on Linux and the
installation was extremely easy and well-guided thanks to the excellent
installation program.
The long and the short of it is that you configure one or more hosts as "submit
hosts" and one or more hosts as "exec hosts". You then create a shell script
that runs the commands you want and submit the script to the grid engine. You
determine how many times to run your job, which machines to run it on, etc.
The nice part is that you can leave most of this to the grid engine itself, and
simply tell it to run jobs x, y, z, p, d, and q, in which case the grid engine
will automatically pick the server with the lowest load to spawn the next job
on.
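To make the submit step concrete, here is a minimal sketch. It assumes a job
script named baz_job.sh (my name, purely illustrative) wrapping the
/usr/bin/baz command from your example. The "#$" lines are options embedded
for qsub: -N names the job, -S picks the shell, and -cwd runs it from the
submission directory. The -t 1-6 option submits it as an array job of six
tasks. On a machine without qsub the script only prints the command it would
have run.

```shell
#!/bin/sh
# Sketch only: write a small SGE job script, then hand it to qsub.
# baz_job.sh and the helper names are illustrative assumptions;
# /usr/bin/baz is the command from the original question.

JOB_SCRIPT=${JOB_SCRIPT:-./baz_job.sh}

write_job_script() {
    # Lines starting with "#$" are read by qsub as embedded options.
    cat > "$1" <<'EOF'
#!/bin/sh
#$ -N baz
#$ -S /bin/sh
#$ -cwd
/usr/bin/baz
EOF
}

submit_job() {
    # $1 = job script, $2 = number of array tasks (-t 1-N).
    if command -v qsub >/dev/null 2>&1; then
        qsub -t "1-$2" "$1"
    else
        echo "qsub not found; would run: qsub -t 1-$2 $1"
    fi
}

write_job_script "$JOB_SCRIPT"
submit_job "$JOB_SCRIPT" 6
```

Which exec host each task lands on is left to the scheduler here, so nothing
in the script names a machine.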
As for output, everything the commands would have sent to STDOUT and STDERR is
written to files named with the job-id number in your submission or home
directory. That should help you trace any issues you have with failed jobs.
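As a rough sketch of how you might sweep those files, the helper below lists
every stderr file that is not empty. It assumes the default naming of
<job_name>.e<job_id> for the error file, and jobs_with_stderr is a name I made
up for illustration. A non-empty error file is only a hint that something
went wrong, not proof of a failed job.

```shell
#!/bin/sh
# Sketch only: list SGE stderr files that contain output.
# Assumes the default <job_name>.e<job_id> naming; jobs_with_stderr is an
# illustrative helper, not part of Grid Engine.

jobs_with_stderr() {
    dir=${1:-.}
    # -size +0c keeps only files holding at least one byte.
    find "$dir" -maxdepth 1 -type f -name '*.e[0-9]*' -size +0c | sort
}

# Default to the current (submission) directory.
jobs_with_stderr "${1:-.}"
```

Pass your home directory as the argument if you did not submit with -cwd.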
That's just the tip of the iceberg as far as configurability is concerned, but
I think this will do what you want it to. It also has an X-gui which makes
configuration even easier. And it's free:
http://gridengine.sunsource.net/
Best regards,
-Michael
P.S. Sorry for gushing a bit, but I had been searching high and low for a solid
batch submission engine when I stumbled on SGE, and I was very impressed with
it.
--- David Ulevitch <davidu@everydns.net> wrote:
> Infratects,
>
> I currently have a system (visualize a hub server with spoke nodes
> for processing chunks of data) where the hub machine calls many jobs
> on remote machines via ssh.
> Much like: for i in $hosts; do ssh foo@$i "time /usr/bin/baz"; done;
>
> When jobs fail in their logic the program is set to email the
> operations team and we investigate. When a process exits abnormally
> we have no clue. To fix this we've started doing things like:
> for i in $hosts; do ssh foo@$i "time /usr/bin/baz" ||
>   mail -s "job: baz on $i failed" ops@everydns.net; done;
>
> This is sort of working but we've run into some cases where a few
> jobs are still not doing what we expect and the above "fix" for
> finding errors is not helpful enough.
>
> I was wondering what tools or techniques people use to call jobs on
> remote machines. We have some scripts that spawn a bunch of
> processes on remote boxes in the background and then the script loops
> while checking for a pid file or using ps to see if they are all done
> before continuing (so it's sending jobs out "concurrent" rather than
> "batched" in method). This also makes shell scripts complicated and
> I prefer keeping them simple and doing real code in another
> language. I know shell can be powerful so I'm asking you all, if you
> use it, how so? If not, what do you use or how do you model your
> setup differently?
>
> Thanks,
> David Ulevitch
>
> ps: I don't think I've seen infratects used before, seems mostly self-
> explanatory. ;-)
> _______________________________________________
> Infrastructures mailing list
> Infrastructures@mailman.terraluna.org
> http://mailman.terraluna.org/mailman/listinfo/infrastructures
>