
<h1>Named Pipe Mystery</h1>

<p>If you ask me, I'm pretty good at troubleshooting.  After all, I get lots of
practice (hey, I never said I was smart).  However, on July 27, 2006, I wrote a
bug so simple yet so baffling that I had no hope of finding a solution.  I put
up this page out of desperation.</p>


<h2>The Scientific Method</h2>

<p>Most bugs are easy to track down with a simple application of the
Scientific Method:</p>

<ol>

	<li>Run some tests to figure out the symptoms of the bug.  Make sure
	you really know what the problem is.</li>

	<li>Make a guess about what might be the cause.  Look at your code.
	Play the part of the computer--step through the procedure in your head
	and make sure that you're really saying what you thought you were
	saying.  If everything looks correct, find a place in your code where
	you don't quite trust your understanding of what's going on and where
	being wrong might cause the symptoms you're seeing.  Make a
	hypothesis.</li>
	
	<li>Design an experiment or experiments to test your hypothesis.  This
	may be as simple as printing out status or fixing something that looks
	wrong logically and seeing if the problem is fixed.  On the other
	hand, the experiment may be as complicated as writing a separate
	program that tests the hypothesis in controlled circumstances.</li>

</ol>


<h2>My Most Annoying Bug Ever</h2>

<p>The above method, with its numerous variants, usually works really well.
Sometimes, though, you just can't figure out what's going on.  It's extremely
irritating to whittle down your bug into a five-line program that looks
<b>perfect</b>.  You have absolutely no clue what could possibly be wrong
in there.</p>

<p>To understand the bug, you need some UNIX background.  I assume you know
all about named pipes (fifos) and shells.  Anyone who knows what's going on in
this test program certainly knows more about this stuff than I do.  Anyway,
here's the broken code (<a href="mystery.sh">mystery.sh</a>):</p>

<hr>

<pre>
#!/bin/sh

FIFO1=/tmp/testfifo1
FIFO2=/tmp/testfifo2

rm -f $FIFO1 $FIFO2
mkfifo $FIFO1 || exit
mkfifo $FIFO2 || exit

nc -v www.mcnabbs.org 80 &lt;$FIFO1 &gt;$FIFO2 &amp;
./httpget.sh &lt;$FIFO2 &gt;$FIFO1
</pre>

<hr>

<p>Where <a href="httpget.sh">httpget.sh</a> is as follows:</p>

<hr>

<pre>
#!/bin/sh

echo "GET / HTTP/1.1"
echo "Host: www.mcnabbs.org"
echo

while read line
do
        echo $line &gt;&amp;2
done
</pre>

<hr>

<p>The output of mystery.sh <b>should</b> be something like:</p>

<pre>
www.mcnabbs.org [64.62.190.91] 80 (http) open
HTTP/1.1 200 OK
Date: Fri, 28 Jul 2006 02:51:44 GMT
Server: Apache/2.0.52 (CentOS)
...
</pre>

<p>Instead, the script just hangs.  No output at all.  Netcat doesn't even say
that it's opened up a TCP connection.  I've tested this code on two different
Linux distributions and on Mac OS X, and I've tried both Bash and ZSH as the
interpreter.  The results are identical in all of my tests.</p>


<h2>The Twist</h2>

<p>Before you start making up stuff about blocking and deadlocks without
thinking things through, look at an example of a similar script that works.
Only one line has changed (<a href="mystery2.sh">mystery2.sh</a>):</p>

<hr>

<pre>
#!/bin/sh

FIFO1=/tmp/testfifo1
FIFO2=/tmp/testfifo2

rm -f $FIFO1 $FIFO2
mkfifo $FIFO1 || exit
mkfifo $FIFO2 || exit

nc -v www.mcnabbs.org 80 &lt;$FIFO1 &gt;$FIFO2 &amp;
./httpget.sh &gt;$FIFO1 &lt;$FIFO2
</pre>

<hr>

<p>What the heck?  It works if you redirect standard output first, but it
doesn't work if you redirect standard input first??  This is in the shell,
before httpget.sh is execed!  Everything I know about shells and file
descriptors says this shouldn't make any difference.</p>

<p>Something consistent is happening, but I have no clue what it is.  I've
tried several completely different kernels and completely different shells.
I've switched the order of executing Netcat and httpget.sh, and I've tried
rewriting httpget.sh in Python.  Something important is happening here.</p>


<h2>The Solution</h2>

<p>The first person to give me a satisfying explanation of the situation was...
Byron Clark.  I guess I'm not too surprised. :)  Don't read the explanation
until you've been thoroughly stumped by the problem.  If you cheat you won't
appreciate the solution.  Anyway, here are Byron's comments:</p>

<p>First, another example, then the explanation:</p>

<pre>
$ mkfifo foo.fifo
$ strace cat foo.fifo
</pre>

<p>Note that cat hangs while trying to open foo.fifo.  In another shell:</p>

<pre>
$ echo foo &gt; foo.fifo
</pre>

<p>Note that cat in the first shell finally succeeds in opening the fifo and
catting the contents.</p>

<p>It appears that open(2) for reading <b>or</b> writing on a pipe will block
until open(2) is called on the other end of the pipe.</p>
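
<p>A minimal sketch of the same experiment in a single script, with a
background subshell standing in for the second shell.  The temporary
directory, the demo.fifo name, and the two-second delay are arbitrary choices
for illustration.  It assumes that mktemp -d and date +%s are available, which
is common on Linux and Mac OS X but not guaranteed by POSIX:</p>

<pre>
#!/bin/sh

# Work in a private temporary directory (assumes mktemp -d exists).
dir=$(mktemp -d) || exit
fifo=$dir/demo.fifo
mkfifo "$fifo" || exit

# Stand-in for the second shell: wait two seconds, then open for writing.
( sleep 2; echo foo &gt;"$fifo" ) &amp;

start=$(date +%s)
# The open for reading blocks here until the writer above opens its end.
read line &lt;"$fifo"
end=$(date +%s)

echo "read '$line' after about $((end - start)) seconds"

wait
rm -rf "$dir"
</pre>

<p>The reported delay should roughly match the length of the sleep.</p>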

<p>So, here's what happens in the non-working version of mystery.sh and
httpget.sh on the webpage you linked to:</p>

<pre>
nc -v www.mcnabbs.org 80 &lt;$FIFO1 &gt;$FIFO2 &amp;
</pre>

<ul>
<li>open($FIFO1, O_RDONLY) -- $FIFO1 hasn't been opened for writing, this blocks</li>
</ul>

<pre>
./httpget.sh &lt;$FIFO2 &gt;$FIFO1
</pre>

<ul>
<li>open($FIFO2, O_RDONLY) -- $FIFO2 hasn't been opened for writing, this blocks</li>
</ul>

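<p>Hence, deadlock: each process is stuck on its first open, so neither ever
reaches the second redirection that would unblock the other.</p>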

<p>The version that works:</p>

<pre>
nc -v www.mcnabbs.org 80 &lt;$FIFO1 &gt;$FIFO2 &amp;
</pre>

<ul>
<li>open($FIFO1, O_RDONLY) -- $FIFO1 hasn't been opened for writing, this blocks</li>
</ul>

<pre>
./httpget.sh &gt;$FIFO1 &lt;$FIFO2
</pre>

<ul>
<li>open($FIFO1, O_WRONLY) -- success</li>
<li>nc now unblocks on the open($FIFO1, O_RDONLY)</li>
<li>$FIFO2 is opened in a similar manner.</li>
</ul>
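
<p>The ordering rule can be tried without a network connection.  In this
minimal sketch a shell loop stands in for nc (it merely echoes each line back
with a prefix).  The temporary directory and the "echo:" prefix are made up
for illustration.  Both sides open FIFO1 first and FIFO2 second, like the
working version above:</p>

<pre>
#!/bin/sh

dir=$(mktemp -d) || exit
FIFO1=$dir/testfifo1
FIFO2=$dir/testfifo2
mkfifo "$FIFO1" "$FIFO2" || exit

# Stand-in for nc: opens FIFO1 for reading, then FIFO2 for writing.
while read line
do
        echo "echo: $line"
done &lt;"$FIFO1" &gt;"$FIFO2" &amp;

# Client: opens FIFO1 for writing first, which unblocks the loop's first
# open, then FIFO2 for reading.  Swapping these two redirections hangs
# the script.
{
        echo "hello"
        read reply
} &gt;"$FIFO1" &lt;"$FIFO2"

wait
echo "client got: $reply"
rm -rf "$dir"
</pre>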

<p>Here's the relevant text from fifo(4):</p>

<blockquote>The kernel maintains exactly one pipe object for each FIFO
	special file that is opened by at least one process.  The FIFO must be
	opened on both ends (reading and writing) before data can be passed.
	Normally, opening the FIFO blocks until the other end is opened
	also.</blockquote>

<blockquote>A process can open a FIFO in non-blocking mode.  In this case,
	opening for read only will succeed even if no one has opened on the
	write side yet; opening for write only will fail with ENXIO (no such
	device or address) unless the other end has already been
	opened.</blockquote>
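
<p>The shell has no direct syntax for a non-blocking open.  A different
technique that is sometimes used instead is to open the FIFO read-write with
the &lt;&gt; redirection, so that a single process holds both ends.  The
sketch below assumes Linux, whose fifo man page says that opening a FIFO for
read and write succeeds without blocking.  POSIX leaves this behavior
undefined, so treat it as an experiment rather than a portable fix.  The
descriptor number 3 and the file names are arbitrary:</p>

<pre>
#!/bin/sh

dir=$(mktemp -d) || exit
fifo=$dir/demo.fifo
mkfifo "$fifo" || exit

# Open the FIFO read-write on descriptor 3.  On Linux this returns
# immediately, because this one process now holds both ends.
exec 3&lt;&gt;"$fifo"

# A short write fits in the pipe buffer, so it does not block either.
echo foo &gt;&amp;3
read line &lt;&amp;3
echo "got: $line"

exec 3&gt;&amp;-
rm -rf "$dir"
</pre>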

<h2>Closing Comments</h2>

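<p>Note that the issue isn't opening for reading before opening for writing.
The problem is that each process's first open blocks until another process
opens the other end of that same FIFO.  So the following is still wrong,
because nc blocks opening $FIFO2 for writing while httpget.sh blocks opening
$FIFO1 for writing:</p>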

<pre>
nc -v www.mcnabbs.org 80 &gt;$FIFO2 &lt;$FIFO1 &amp;
./httpget.sh &gt;$FIFO1 &lt;$FIFO2
</pre>

<p>Thanks again, Byron.</p>