[Infrastructures] How do you manage 1000+ systems

George Georgalis george@galis.org
Sun, 12 Jun 2005 22:51:05 -0400


On Sun, Jun 12, 2005 at 10:37:01PM -0400, Rodrick Brown wrote:
>On 6/12/05, Mark Ferlatte <ferlatte@cryptio.net> wrote:
>> Wesley Craig said on Sun, Jun 12, 2005 at 08:00:03PM -0400:
>> > If we're sharing, please check out:
>> >
>> >     http://radmind.org/
>> 
>> radmin is on the short list of apps to check out next, actually; at
>> high-level it appears like it would be a good replacement for
>> part/most/all of my home grown one that I'm currently using, since they
>> operate on similar principles (filesets).
>> 
>> It probably scales better than what I'm doing currently, which is built
>> on top of rsync.
>> 
>How does anyone here find using rcs for config file change documentation ? 

Well, I use cvs. I actually wrote what follows earlier, before I got
permission to send it.  I'd be especially interested in comparisons
people want to make between my system and a radmind enterprise
deployment -- I've not tried radmind yet.

On Sun, Jun 12, 2005 at 12:44:25PM -0700, Mark Ferlatte wrote:
>Bob Proulx said on Sun, Jun 12, 2005 at 10:16:12AM -0600:
>> There are actually many aspects of system management beyond the login
>> part.  I would say the login system for me is probably the simplest
>> part.  I have my own homebrew system for doing the rest unfortunately
>> works for me but not in a good state to share.
> 
>Don't we all have our own homebrew system for doing the rest that works
>for me but unfortunately is not in a good state to share?  I wonder if
>we shouldn't just start sharing anyway...

Well I use a pretty complex system that I'm not at liberty to publish
here but I can talk about it a bit...

Everything maintained is in a cvs repository. There are basically two
other directories, distfile and iso: distfile holds archives of source
compiled as part of a host install, and iso holds various iso images (I
generally don't use plurals in paths).

At the top of the cvs repository is a Makefile with various targets.
It can install documentation from the repository, make the scripts
"mktarget.sh" and "mkimager.sh", or push out host configurations to the
whole site. It can also make an iso of the repository, docs, distfiles
and links to the downloadable ISOs.

At some point we'll have a bootable PXE server cdrom, but for now, to
bootstrap the site, we do a (very) base OS install per local docs,
then mount the cdrom and run the make imager script from it. That
script changes the base install into a "site base install" (adds about
35 of my favorite admin programs), creates an archive of it all, puts
it on a webserver, and starts a dhcpd. That's the imager.

Next we boot a target with a live cdrom, on the imager network, set
the hostname, and download the make target script.  Through the magic
of the makefile that generated it, it knows where the web server is,
downloads the configuration, appropriately partitions and formats the
disk, downloads the "site base" archive (as pipe) and extracts it to
disk, installs the appropriate kernel and bootloader, extracts the cvs
archive and pushes the host configuration to the target root (disk).  It
also installs a firstboot script in an rc directory.
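
For illustration, the "download as pipe and extract to disk" step could
be sketched as below. This is Python, not the actual make target
script, and it assumes the "site base" archive is a gzip-compressed
tarball; the function names and the URL argument are hypothetical.

```python
import tarfile
import urllib.request


def extract_stream(fileobj, dest):
    """Unpack a gzip-compressed tar stream into dest without seeking."""
    # Mode "r|gz" reads fileobj as a forward-only stream, so the archive
    # never has to be saved to disk before extraction starts.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        archive.extractall(dest)


def fetch_and_extract(url, dest):
    """Download url (hypothetical imager address) and extract on the fly."""
    with urllib.request.urlopen(url) as response:
        extract_stream(response, dest)
```

Here dest would be wherever the freshly formatted target disk is
mounted, which matters on a live cdrom with no spare space to hold the
archive first.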

The admin is then instructed to plug the target into the appropriate
switch after it shuts down, then start it up.  That first boot script
does some final local setup and moves itself out of the rc directory.
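
The run-once behaviour of that first boot script might look roughly
like this in Python. The directory names and the setup callable are
placeholders, and leaving the script in place when setup fails is an
assumption of this sketch, not something the real script is known to
do.

```python
import shutil
from pathlib import Path


def run_once(script, done_dir, setup):
    """Run setup(), then move script out of the rc directory."""
    script = Path(script)
    done_dir = Path(done_dir)
    # If setup() raises, the move below is skipped, so the script stays
    # in the rc directory and runs again on the next boot.
    setup()
    done_dir.mkdir(parents=True, exist_ok=True)
    target = done_dir / script.name
    shutil.move(str(script), str(target))
    return target
```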

Repeat for each host. When the site is all up, changes can be made by
modifying the master repository and pushing the files out (there is a
special target for that). Typically that is done with a strict umask,
rsync, and ssh root@, immediately followed by a script which reads a
flat file based db containing the mode and user:group of each file and
dir in the repository for that host. There is a directory host/all and
a directory for each host called host/{HOSTNAME}.
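
A minimal sketch of the permission pass that follows the rsync, in
Python. The one-entry-per-line "MODE USER:GROUP PATH" layout is an
assumed format for the flat file db, and all function names are
hypothetical.

```python
import os
import shutil
from pathlib import Path


def parse_perm_line(line):
    """Parse 'MODE USER:GROUP PATH' (assumed format) into a tuple."""
    mode, owner, path = line.split(None, 2)
    user, group = owner.split(":", 1)
    return int(mode, 8), user, group, path


def load_perm_db(text):
    """Return parsed entries, skipping blank lines and # comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            entries.append(parse_perm_line(line))
    return entries


def apply_perms(root, entries, change_owner=False):
    """Set mode (and optionally owner) on each listed path under root."""
    for mode, user, group, rel in entries:
        target = Path(root) / rel
        if change_owner:
            # Needs root. Done before chmod because changing the owner
            # can clear setuid/setgid bits on some systems.
            shutil.chown(target, user=user, group=group)
        os.chmod(target, mode)
```

A db line such as "0600 root:wheel etc/secret" would then be applied
with apply_perms("/", load_perm_db(text), change_owner=True) on the
target host.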

Working the bugs out of the entire process was very time consuming. In
many places decisions were made to just make it work vs spend another
36 hrs to get one part perfect. There are also parts which are
unnecessarily complex because they were deployed when the entire system
was not so complex -- they could be made simpler now, but writing the
simpler code, testing and debugging it would not be brief...

If there is any lesson I've learned, it's that a self archiving,
portable site imaging system is very complex, and when it's almost
finished, each *little* bug takes 45 minutes to 3 hours to fix -- but
don't be surprised if a *big* bug shows up right when you thought you
were doing your last boot!

// George


-- 
George Georgalis, systems architect, administrator Linux BSD IXOYE
http://galis.org/george/ cell:646-331-2027 mailto:george@galis.org