[Infrastructures] Isconf: Fetching blocks

Steve Traugott stevegt@TerraLuna.Org
Sat, 10 Dec 2005 23:58:13 -0800


Hi Jordan,

On Fri, Dec 09, 2005 at 08:54:43PM -0700, Jordan Curzon wrote:
> I have been getting the following error frequently. The error occurs
> after several other hosts have updated with no problems. The problem
> is that if I run isconf up again it starts up again but not from where
> it left off. Any ideas about debugging this?
> 
> 
> isconf: error: missing block:
> /var/is/fs/cache/internal.curzons.net/block/814/814338f5b4c910e35a55d101d972998f7b6bd949-eeb84b2ef12f9232f90d15457136d992-1:
> Operation not permitted

This means that the machine showing the error is not getting that file
from any other machine.  It *is*, however, getting the journal file
from some other machine, so we know they have seen each other at least
once on the net.  As a double check, you should be able to see the
journal at this path:

    /var/is/fs/cache/internal.curzons.net/volume/{branchname}/journal

Inside the journal, you should be able to find the 'snap' transaction in
question by looking for the entry with the 814338f5b4c91... block ID.  
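
If you'd rather script that lookup than eyeball it, here is a minimal
sketch.  It assumes only that the journal is a text file and that the
block ID appears literally somewhere in the transaction's entry; the
function name is made up for illustration and is not part of isconf.

```python
def find_block_entries(journal_path, block_prefix):
    """Return (line_number, line) pairs whose text mentions block_prefix.

    Assumption: the journal is plain text and the block ID shows up
    verbatim in the entry.  Nothing here depends on the journal's
    actual field layout.
    """
    hits = []
    # errors="replace" so a stray non-UTF-8 byte doesn't abort the scan
    with open(journal_path, encoding="utf-8", errors="replace") as journal:
        for lineno, line in enumerate(journal, start=1):
            if block_prefix in line:
                hits.append((lineno, line.rstrip("\n")))
    return hits
```

Call it with the journal path above (with your own branch name filled
in) and a prefix such as "814338f5b4c91"; an empty result would mean
the journal on that machine doesn't mention the block at all.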

When this transaction fails, the next 'isconf up' on the same machine
should retry the same (previously failed) transaction first.

Here's what I need to know:

- What isconf version are you running?

- Are these machines on the same subnet?  

- What does the network load look like?  (Since current versions
  still use UDP for the 'whohas' messages, it's *possible* that we're
  just dropping all of the 'whohas' packets when they hit the net.) 

- How big is the 81433... file?  (I've been concerned about some
  implied but fuzzy timeouts when transferring large files, but
  haven't prioritized this so far because they will go away with the
  TCP mesh code.)

- Is it always the same file?

- Is it always the same host?

- The next time this happens, can you send me the /tmp/isconf.* log
  files on the machine where this has happened?   

- You say "it starts up again but not from where it left off" -- are
  you sure it's not retrying the failed transaction, quickly
  succeeding this time, then continuing? (I hafta ask.)  Regardless,
  the next time you have a failed transfer, do this:
   
  - save a copy of the journal and of /var/is/conf/history
  - run 'isconf up' the second time
  - if it looks like it didn't restart where you think it should have,
    then copy the display contents, grab another copy of the journal
    and history, and send me the display contents, the "before" and
    "after" copies of the journal and history, and the /tmp/isconf.*
    log files.

  The reason I ask for the journal and history is that the history's
  whole purpose in life is to track what's been executed.  It would
  be, well, very strange for the journal replay to start from anywhere
  else, so now you've got me all paranoid and stuff.  ;-)
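
To make the "before" and "after" copies hard to mix up, something like
the sketch below would do.  The helper name, the label scheme, and the
destination directory are all hypothetical; the only paths taken from
above are the journal and /var/is/conf/history.

```python
import shutil
import time
from pathlib import Path


def snapshot(paths, dest_dir, label):
    """Copy each existing file in paths to dest_dir as <name>.<label>.<timestamp>.

    Returns the list of copies made.  Missing files are skipped so a
    wrong path doesn't stop the other copies from being taken.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    saved = []
    for src in map(Path, paths):
        if not src.is_file():
            continue
        target = dest / f"{src.name}.{label}.{stamp}"
        shutil.copy2(src, target)  # copy2 also preserves the mtime
        saved.append(target)
    return saved
```

Run it once with label "before" ahead of the second 'isconf up', and
once with "after" when that run finishes, passing the journal path and
/var/is/conf/history both times.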

As far as debugging this yourself, you can try 'tail -f /tmp/isconf.log',
matching the debug messages in there with the debug() calls in the
code, to see if you can divine the flow of what's happening while you
run 'isconf up' etc. on the victim machine.  You'll see debug messages
from several microtasks interleaved at the same time, but once you get
used to that it's pretty straightforward.  If this is a networking
issue, then you'll probably be spending some time in Cache.py, and
maybe HTTPServer.py.  See the comments in Kernel.py for more
information about the whole microtasks thing, and feel free to edit or
add pages in the wiki as you go.  There's some information in there
that I wrote while thinking through the architecture and so on, but we
really need to start a hacking howto.

The wire protocol flow may not be apparent from looking at the
Cache.py code; in general think "arp + http".  Host A, running 'isconf
up' sends a "whohas" broadcast asking for the file.  You will see the
complete text of these messages in the isconf.log file; I have full
debug logging on by default right now.  Any host which has the file
sends an "ihave" response, but host A will ignore all but the first.
Let's say host A hears the "ihave" from host B first; host A then
sends an HTTP GET to host B, and host B returns the file in the HTTP
response.  
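
As a toy model of that flow (not the Cache.py code, and not the real
message format), the sketch below keeps everything in memory: a method
call stands in for the UDP broadcast and another for the HTTP GET.  The
class and function names are invented for illustration, and the order
of the peers list stands in for the order in which "ihave" replies
arrive.

```python
class Host:
    """Toy stand-in for an isconf host; blocks live in a dict."""

    def __init__(self, name, blocks=None):
        self.name = name
        self.blocks = dict(blocks or {})

    def on_whohas(self, block_id):
        # Reply "ihave" only if we actually hold the block.
        return ("ihave", self.name) if block_id in self.blocks else None

    def on_get(self, block_id):
        # Stands in for returning the file in an HTTP response.
        return self.blocks[block_id]


def fetch_block(requester, peers, block_id):
    """Send whohas to every peer, act on the first ihave, then GET the block.

    Returns the name of the peer the block was fetched from.  If no
    peer answers, raises LookupError; that is this model's analogue of
    no other machine supplying the file.
    """
    for peer in peers:
        reply = peer.on_whohas(block_id)
        if reply is None:
            continue
        # First "ihave" wins; any later replies are never looked at.
        requester.blocks[block_id] = peer.on_get(block_id)
        return reply[1]
    raise LookupError(f"no ihave received for block {block_id}")
```

With host B and host C both holding a block and listed in that order,
fetch_block() on host A pulls it from B and never contacts C for the
data, which is the "ignore all but the first" behaviour described
above.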

In 4.3 we're deprecating HTTP in favor of "sendme/hereis" transfers
via the TCP mesh.  The same "whohas" and "ihave" messages will still
be around, carried by the mesh rather than by UDP; this gives us
reliable message passing and removes the need for the many retries
you'll see when you dig into the logs.  In later 4.3 releases we'll
add signatures and encryption.

Steve