Named Pipe Mystery

If you ask me, I'm pretty good at troubleshooting. After all, I get lots of practice (hey, I never said I was smart). However, on July 27, 2006, I wrote a bug so simple yet so baffling that I had no hope of finding a solution. I put up this page out of desperation.

The Scientific Method

Most bugs are easy to track down with a simple application of the Scientific Method:

  1. Run some tests to figure out the symptoms of the bug. Make sure you really know what the problem is.
  2. Make a guess about what might be the cause. Look at your code. Play the part of the computer--step through the procedure in your head and make sure that you're really saying what you thought you were saying. If everything looks correct, find a place in your code where you don't quite trust your understanding of what's going on and where being wrong might cause the symptoms you're seeing. Make a hypothesis.
  3. Design an experiment or experiments to test your hypothesis. This may be as simple as printing out status or fixing something that looks wrong logically and seeing if the problem is fixed. On the other hand, the experiment may be as complicated as writing a separate program that tests the hypothesis in controlled circumstances.

My Most Annoying Bug Ever

The above method, with its numerous variants, usually works really well. Sometimes, though, you just can't figure out what's going on. It's extremely irritating to whittle down your bug into a five-line program that looks perfect. You have absolutely no clue what could possibly be wrong in there.

To understand the bug, you need some UNIX background. I assume you know all about named pipes (FIFOs) and shells. Anyone who knows what's going on in this test program certainly knows more about this stuff than I do. Anyway, here's the broken code (mystery.sh):


#!/bin/sh

FIFO1=/tmp/testfifo1
FIFO2=/tmp/testfifo2

rm -f $FIFO1 $FIFO2
mkfifo $FIFO1 || exit
mkfifo $FIFO2 || exit

nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
./httpget.sh <$FIFO2 >$FIFO1

Where httpget.sh is as follows:


#!/bin/sh

echo "GET / HTTP/1.1"
echo "Host: www.mcnabbs.org"
echo

while read -r line
do
        echo "$line" >&2
done

The output of mystery.sh should be something like:

www.mcnabbs.org [64.62.190.91] 80 (http) open
HTTP/1.1 200 OK
Date: Fri, 28 Jul 2006 02:51:44 GMT
Server: Apache/2.0.52 (CentOS)
...

Instead, the script just hangs. No output at all. Netcat doesn't even say that it's opened up a TCP connection. I've tested this code on two different Linux distributions and on Mac OS X, and I've tried both Bash and ZSH as the interpreter. The results are identical in all of my tests.

The Twist

Before you start making up stuff about blocking and deadlocks without thinking things through, look at an example of a similar script that works. Only one line has changed (mystery2.sh):


#!/bin/sh

FIFO1=/tmp/testfifo1
FIFO2=/tmp/testfifo2

rm -f $FIFO1 $FIFO2
mkfifo $FIFO1 || exit
mkfifo $FIFO2 || exit

nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
./httpget.sh >$FIFO1 <$FIFO2

What the heck? It works if you redirect standard output first, but it doesn't work if you redirect standard input first?? This is in the shell, before httpget.sh is execed! Everything I know about shells and file descriptors says this shouldn't make any difference.

Something consistent is happening, but I have no clue what it is. I've tried several completely different kernels and completely different shells. I've switched the order of executing Netcat and httpget.sh, and I've tried rewriting httpget.sh in Python. Something important is happening here.

The Solution

The first person to give me a satisfying explanation of the situation was... Byron Clark. I guess I'm not too surprised. :) Don't read the explanation until you've been thoroughly stumped by the problem. If you cheat you won't appreciate the solution. Anyway, here are Byron's comments:

First, another example, then the explanation:

$ mkfifo foo.fifo
$ strace cat foo.fifo

Note that cat hangs while trying to open foo.fifo. In another shell:

$ echo foo > foo.fifo

Note that cat in the first shell finally succeeds in opening the fifo and catting the contents.

It appears that open(2) for reading or writing on a pipe will block until open(2) is called on the other end of the pipe.
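
The same behaviour can be seen without strace or a second terminal. The following is only a sketch: the temporary directory, the file names, and the one-second pause are arbitrary choices for illustration. The reader is started first, yet its open cannot return until the writer opens the FIFO, so the log always shows the writer's line first.

```shell
#!/bin/sh
# Sketch: opening a FIFO for reading blocks until a writer opens it.
dir=$(mktemp -d) || exit 1
fifo=$dir/demo.fifo
log=$dir/log
mkfifo "$fifo" || exit 1

# Reader: runs in a background subshell. The "exec 3<" line is the open
# that stalls; the log line is written only once that open has returned.
{
    exec 3<"$fifo"
    echo "reader: open returned" >>"$log"
    cat <&3 >/dev/null
} &

sleep 1                                 # give the reader a head start
echo "writer: about to open" >>"$log"
echo foo >"$fifo"                       # this open releases the reader
wait

first=$(sed -n 1p "$log")
second=$(sed -n 2p "$log")
echo "$first"
echo "$second"
rm -rf "$dir"
```

If opening for reading did not block, the reader would have a full second to log its line before the writer logs anything.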

So, here's what happens in the non-working version of mystery.sh and httpget.sh on the webpage you linked to:

nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
./httpget.sh <$FIFO2 >$FIFO1

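The shell sets up redirections left to right, before the command runs. The background command's first open is $FIFO1 for reading, which blocks until something opens $FIFO1 for writing. The foreground command's first open is $FIFO2 for reading, which blocks until something opens $FIFO2 for writing. Each process is waiting for an open that the other would only reach after its own first open returns. Hence, deadlock.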

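The version that works, because the first open for httpget.sh (writing to $FIFO1) pairs with the first open for nc (reading from $FIFO1), and their second opens then pair up on $FIFO2: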

nc -v www.mcnabbs.org 80 <$FIFO1 >$FIFO2 &
./httpget.sh >$FIFO1 <$FIFO2
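
Both cases depend on the shell performing redirections in the order they are written. A quick way to check that, sketched here with ordinary files rather than FIFOs (the file names are made up for the example): when an input redirection fails, an output file named later on the same line is never created, while one named earlier is.

```shell
#!/bin/sh
# Sketch: redirections on a command line are performed left to right.
dir=$(mktemp -d) || exit 1

# Input first: opening the missing file fails, so the shell never gets
# as far as creating a.out.
( true <"$dir/missing" >"$dir/a.out" ) 2>/dev/null

# Output first: b.out is created before the failing open is attempted.
( true >"$dir/b.out" <"$dir/missing" ) 2>/dev/null

if [ -e "$dir/a.out" ]; then a_created=yes; else a_created=no; fi
if [ -e "$dir/b.out" ]; then b_created=yes; else b_created=no; fi
echo "input first:  output file created? $a_created"
echo "output first: output file created? $b_created"
rm -rf "$dir"
```

The subshells are there only so that the shell's own error message about the missing file can be discarded.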

Here's the relevant text from fifo(4):

The kernel maintains exactly one pipe object for each FIFO special file that is opened by at least one process. The FIFO must be opened on both ends (reading and writing) before data can be passed. Normally, opening the FIFO blocks until the other end is opened also.
A process can open a FIFO in non-blocking mode. In this case, opening for read only will succeed even if no one has opened on the write side yet; opening for write only will fail with ENXIO (no such device or address) unless the other end has already been opened.
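
A shell redirection cannot request non-blocking mode, but there is a related trick that may be useful: opening the FIFO for both reading and writing with the "<>" operator. This is a sketch that assumes Linux, where such an open succeeds immediately. POSIX leaves the behaviour undefined, so don't count on it elsewhere. The file name and descriptor number are arbitrary.

```shell
#!/bin/sh
# Sketch (Linux behaviour; POSIX leaves read-write opens of a FIFO undefined).
dir=$(mktemp -d) || exit 1
fifo=$dir/demo.fifo
mkfifo "$fifo" || exit 1

# "<>" opens the FIFO read-write, so this process counts as both reader
# and writer and the open returns at once instead of waiting for a peer.
exec 3<>"$fifo"
echo "opened without a peer"

# With fd 3 held open, a write and a read in the same process both work
# (the message easily fits in the pipe buffer, so the write cannot block).
echo hello >&3
read -r reply <&3
echo "read back: $reply"

exec 3>&-
rm -rf "$dir"
```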

Closing Comments

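Note that the issue isn't opening for reading before opening for writing. The problem is that each process's first open blocks until some other process opens the other end of that same FIFO. If the two commands open different FIFOs first, neither open can ever complete. So the following is still wrong: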

nc -v www.mcnabbs.org 80 >$FIFO2 <$FIFO1 &
./httpget.sh >$FIFO1 <$FIFO2
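
The rule can be checked without a network connection. The script below is only a sketch with local stand-ins: cat plays the part of nc (it copies the first FIFO to the second), and a one-line sh -c command plays the part of httpget.sh. The try function, the one-second grace period, and the file names are all invented for the example.

```shell
#!/bin/sh
# Sketch: try each redirection order with local stand-ins for nc and httpget.sh.
dir=$(mktemp -d) || exit 1
f1=$dir/fifo1
f2=$dir/fifo2
reply=$dir/reply
mkfifo "$f1" "$f2" || exit 1

# Stand-in for httpget.sh: send one line, then record the line that comes back.
client='echo ping; read -r line; echo "$line" >"$1"'

# try NC_ORDER HTTPGET_ORDER, where each order is "in-first" or "out-first".
try() {
    rm -f "$reply"
    if [ "$1" = in-first ]; then
        cat <"$f1" >"$f2" &
    else
        cat >"$f2" <"$f1" &
    fi
    server=$!
    if [ "$2" = in-first ]; then
        sh -c "$client" sh "$reply" <"$f2" >"$f1" &
    else
        sh -c "$client" sh "$reply" >"$f1" <"$f2" &
    fi
    peer=$!
    sleep 1                               # arbitrary grace period
    if [ -s "$reply" ]; then outcome=works; else outcome=deadlock; fi
    kill "$server" "$peer" 2>/dev/null    # terminate anything stuck in open
    wait "$server" "$peer" 2>/dev/null
    echo "nc $1, httpget $2: $outcome"
}

try in-first  in-first;  a=$outcome   # same order as mystery.sh
try in-first  out-first; b=$outcome   # same order as mystery2.sh
try out-first out-first; c=$outcome   # same order as the pair above
try out-first in-first;  d=$outcome
rm -rf "$dir"
```

The first and third combinations should report a deadlock and the other two should work: what matters is that the two commands open the same FIFO first, from opposite ends. Some shells may also print a "Terminated" notice on standard error for the processes the script kills.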

Thanks again, Byron.