[Infrastructures] managing remote jobs called centrally...

David Ulevitch davidu@everydns.net
Fri, 29 Jul 2005 23:58:29 -0700


Infratects,

I currently have a system (visualize a hub server with spoke nodes  
for processing chunks of data) where the hub machine calls many jobs  
on remote machines via ssh.
Much like: for i in $hosts; do ssh foo@$i "time /usr/bin/baz"; done;

When jobs fail in their logic the program is set to email the  
operations team and we investigate.  When a process exits abnormally  
we have no clue.  To fix this we've started doing things like:
for i in $hosts; do ssh foo@$i "time /usr/bin/baz" || mail -s "job: baz on $i failed" ops@everydns.net < /dev/null; done

This is sort of working but we've run into some cases where a few  
jobs are still not doing what we expect and the above "fix" for  
finding errors is not helpful enough.
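
One way to make that check say more might be to look at the exit status itself rather than only "zero or not".  The sketch below is an illustration, not what we run: classify_status and check_hosts are made-up names.  It relies on ssh exiting with the remote command's status, or with 255 when ssh itself fails, and on the usual shell convention of 128 + signal number for a command killed by a signal (how that is reported can depend on the remote shell and the ssh version).

```shell
#!/bin/sh
# Hypothetical helper: map the exit status of `ssh host cmd` to a short reason.
classify_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    elif [ "$rc" -eq 255 ]; then
        # ssh uses 255 for its own errors (connection, authentication, ...)
        echo "ssh error"
    elif [ "$rc" -gt 128 ]; then
        # shell convention, assumed here: 128 + signal number
        echo "killed by signal $((rc - 128))"
    else
        echo "exited with status $rc"
    fi
}

# Hypothetical loop: run the job on each host given as an argument and
# mail a one-line reason for every host that did not finish cleanly.
check_hosts() {
    for i in "$@"; do
        ssh "foo@$i" "time /usr/bin/baz"
        rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "baz on $i: $(classify_status "$rc")" |
                mail -s "job: baz on $i failed" ops@everydns.net
        fi
    done
}
```

Calling it as `check_hosts $hosts` would behave like the one-liner above, except that the mail body says whether ssh failed, the job was killed, or it returned a nonzero status.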

I was wondering what tools or techniques people use to call jobs on  
remote machines.  We have some scripts that spawn a bunch of  
processes on remote boxes in the background and then the script loops  
while checking for a pid file or using ps to see if they are all done  
before continuing (so it sends jobs out concurrently rather than in  
batches).  This also makes the shell scripts complicated, and  
I prefer keeping them simple and doing real code in another  
language.  I know shell can be powerful so I'm asking you all, if you  
use it, how so?  If not, what do you use or how do you model your  
setup differently?
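
For reference, the concurrent case can be done in plain shell without pid files or ps polling, using background jobs and the `wait` builtin, which returns the exit status of the pid it waited for.  This is only a rough sketch under some assumptions: run_job and fan_out are made-up names, and host names are assumed to contain no whitespace.

```shell
#!/bin/sh
# Hypothetical wrapper around the remote call; swap in whatever the job is.
run_job() {
    ssh "foo@$1" "/usr/bin/baz"
}

# Start one background job per host, then wait for each pid and collect
# the hosts whose job returned a nonzero status.
fan_out() {
    pids=""
    for host in "$@"; do
        run_job "$host" &
        pids="$pids $!:$host"      # remember which pid belongs to which host
    done

    failed=""
    for entry in $pids; do
        pid=${entry%%:*}           # text before the first colon
        host=${entry#*:}           # text after the first colon
        if ! wait "$pid"; then
            failed="$failed $host"
        fi
    done

    echo "failed:$failed"
    [ -z "$failed" ]               # return nonzero if any host failed
}
```

Something like `fan_out $hosts || echo "some jobs failed"` would then block until every remote job has finished and report the failures in one place.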

Thanks,
David Ulevitch

ps: I don't think I've seen infratects used before, seems mostly  
self-explanatory. ;-)