[Infrastructures] using IA methodologies to build network element configuration
Andrew Fort
afort@choqolat.org
Fri, 01 Apr 2005 12:04:33 +1000
Steve Traugott wrote:
> On Tue, Mar 29, 2005 at 02:11:14PM +1000, Andrew Fort wrote:
>
> The best I've ever been able to think of so far is one of the following:
>
> - only use network hardware on which you can install your own agents
> (yeah, right)
I mentioned somewhere in the noise that some network elements (hey, even
Cisco) have an embedded agent. This has interesting implications for
your discussion below, since the agent is part of the firmware (at least
in the case of IOS).
Their marketing and documentation are very vague, and the product name
changes regularly, which I don't count as a good sign, but the code is
in pretty much every shipping device of theirs that we use. See:
http://www.cisco.com/en/US/products/sw/netmgtsw/ps4617/products_qanda_item09186a008033947d.shtml
> Once we're all done grokking ordered changes on a given UNIX host and
> what that means, I think the next problem is one which both systems and
> network devices have in common: distributed synchronization, the barrier
> problem, whatever you want to call it. In other words, how do you
> synchronize changes on two or more nodes such that they take place
> simultaneously?
If I/O latencies of a second or two here and there are not a concern,
then an external agent (a la Linux dongle) driving multiple serial ports
is suitable. If not, embedded agents seem to be the go. The deeper you
look, the more questions...
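
To make the barrier idea concrete, here is a minimal sketch in Python. It
is an illustration only, not barrierd: each thread stands in for one
management proxy or agent, and the function name, node names, and timeout
are all hypothetical. Every participant stages its change, waits at the
barrier, and applies only once all the others have arrived. If any
participant fails to arrive in time, the barrier breaks and everyone
aborts.

```python
import threading


def apply_change_at_barrier(node_names, timeout_s=2.0):
    """Stage a change on every node, then apply only when all are ready.

    Each thread stands in for one management proxy (hypothetical).
    Returns a list of (event, node) tuples in the order they happened.
    """
    barrier = threading.Barrier(len(node_names))
    log = []
    log_lock = threading.Lock()

    def record(event, name):
        with log_lock:
            log.append((event, name))

    def proxy(name):
        record("staged", name)  # change prepared but not yet active
        try:
            # Block until every proxy has staged, or until the timeout.
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            record("aborted", name)  # someone never arrived; do nothing
            return
        record("applied", name)  # all proxies were released together

    threads = [threading.Thread(target=proxy, args=(n,)) for n in node_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for event, node in apply_change_at_barrier(["rtr-a", "rtr-b", "rtr-c"]):
        print(event, node)
```

Across real devices the wait would be a network handshake rather than a
shared in-process object, so "simultaneously" is only as tight as the I/O
latency mentioned above.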
> The really hard problem is synchronization when the change taking place
> is a change of IP address, netmask, etc. -- again, with the IP address
> reconfigurations needed during those HA cluster builds I was able to use
> barrierd to toss handshakes over the wall; "see you on the other side",
> have both nodes change IP simultaneously and hope for the best; it
> worked more often than I thought it would.
In large-scale network architecture changes, this hope-for-the-best
approach is presently the rule (with asynchronous serial access for when
it doesn't go as planned) -- devices are reconfigured and reloaded to
match the expected bitstate presented by our known good configuration file.
> I think some sort of global transaction concept might be right here --
> if *all* of the above 'management proxy' hosts don't re-acquire network
> connectivity after a change to the network, then they *all* back out
> the transaction, returning the *entire* network to its previous
> configuration and then yelling for human help.
> I've always said rollbacks can't be made reliable for changes to UNIX
> hosts, because it's self-modifying code and you might have broken your
> rollback mechanism; i.e. the Turing paper. I think a transaction
> rollback might be able to be made safe for most classes of network
> device changes though, since it's the management proxy changing the
> network device, rather than the network device changing itself (hmmm...
> that's an argument *against* using network devices on which you can
> install your own agents...)
I'd agree for agents you install yourself (non-kernel bits), but not for
firmware-embedded agents. If you consider the embedded agent in your
virtual machine as your 'B' tape and your 'other' firmware as the 'A'
tape, and the firmware dictates the order AB, then I believe the agent is
not affected by the change in configuration (the configuration bits are
dealt with after the agent is already available). Obviously from there
you're compromised (without transactional capabilities). Simplifying
this, your firmware (with your agent) is really just 'A': since the
ordering never changes, you can represent both tapes as a single 'A'.
> What this all means is anyone's guess, but I'd bet for managing network
> devices we're going to wind up with some combination of management
> proxies, barrier mechanisms, global transactions, and generating
> configuration changes based on current firmware version.
Agreed, I'd include embedded agents in there with management proxies.
The two major router vendors are taking steps in this direction.
Considering you cannot even bootstrap many Cisco devices by TFTP
(neither firmware nor config) any longer, I guess this is progress.
With transactions, how do you determine whether the change was successful?
Oops, Gödel is knocking! We can at best assume (safely enough) that
things didn't work unless we are told they did, so is a watchdog timer
(kill unless success is ack'd) the only suitable solution to this problem?
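
A rough sketch of that watchdog, in Python, assuming the change and its
rollback can each be expressed as a callable. The class name
ConfirmedChange and the dict standing in for a device are hypothetical,
not any vendor's API. The change is applied, a timer starts, and unless
ack() arrives before the timer fires the rollback runs. Silence is
treated as failure.

```python
import threading


class ConfirmedChange:
    """Apply a change, then roll it back unless success is acknowledged.

    `apply` and `rollback` are caller-supplied callables (hypothetical
    stand-ins for whatever pushes configuration to a device).
    """

    def __init__(self, apply, rollback, timeout_s):
        self._rollback = rollback
        self._lock = threading.Lock()
        self.state = "pending"
        self._timer = threading.Timer(timeout_s, self._expire)
        self._timer.daemon = True
        apply()              # make the change first...
        self._timer.start()  # ...then start the watchdog

    def _expire(self):
        # Runs in the timer thread if nobody acknowledged in time.
        with self._lock:
            if self.state == "pending":
                self._rollback()
                self.state = "rolled_back"

    def ack(self):
        """Confirm success. Returns False if the watchdog already fired."""
        with self._lock:
            if self.state != "pending":
                return False
            self.state = "committed"
        self._timer.cancel()
        return True

    def wait(self):
        """Block until the watchdog has fired or been cancelled."""
        self._timer.join()


if __name__ == "__main__":
    device = {"ip": "10.0.0.1"}  # toy stand-in for device state
    change = ConfirmedChange(
        apply=lambda: device.update(ip="10.0.0.2"),
        rollback=lambda: device.update(ip="10.0.0.1"),
        timeout_s=0.1,
    )
    change.wait()  # no ack arrives, so the watchdog rolls back
    print(change.state, device)
```

The ack has to travel over the new configuration to mean anything: if the
management proxy can still reach the device after the change and says so,
the change is kept; otherwise the timer undoes it.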
> Brent and I have sort of divided Gaul over this, so I'd encourage anyone
> who's interested in working on this problem to go give him your brain
> cells at http://www.greatcircle.com/blog/network_automation/. That
> shouldn't stop anyone from discussing it here though; I just now
> understood a few new things about UNIX host management from writing
> this, and I betcha the whole barrier/transaction thing is going to pay
> off big for everyone.
>
> Steve
Thanks, very illuminating!
-Andrew