Wednesday, September 27, 2006

An idiosyncratic protocol

Something happened recently that once again got me thinking quite deeply. I actually tend to like these "little revelations", as I call them. I had always assumed that a write() or a send() on a connected TCP socket would return an error if the error occurred locally (permissions, insufficient buffers and so on). A write() would also be expected to report an error for a deceased connection; isn't that what ECONNRESET and EPIPE are for? Well... yes, but not always.

Let's imagine a normal scenario where one end of a TCP connection is always writing (A) and the other end is always reading (B). Now suddenly, out of the blue, B freaks out and kills the connection, thereby sending a FIN to A. So far, so good. But A's TCP stack does not treat this FIN as an error. It acknowledges it and simply queues an end-of-file indication on A's receive queue. This poor chap, having never gotten a decent education his entire life, never read()s. Thus the FIN is as good as ignored.

This leads us to the interesting part, which is that B is closed and A does not even know it. The next write call from A therefore happily succeeds and the data is sent over the wire. B's TCP stack realizes that the connection is already closed on its end and hence replies with an RST, indicating that the connection no longer exists. Now A's stack gets the RST and updates the socket's error status to indicate that the connection is closed. So all subsequent writes give us a SIGPIPE/EPIPE. Thus the first write always goes through successfully even though the other end is already closed, leading to some frantic nail-biting and hair-pulling...

The way to solve this? A select() on the socket before writing to it: if the socket is readable and a read returns 0, the other end has already closed.

But why does the kernel handle only an RST (and update the error status) and not a FIN? Because a FIN only means that no more writes will happen from that end, but any number of reads might still be done. This is the reason why shutdown(...SHUT_RD) will NOT send a FIN, but shutdown(...SHUT_WR) will issue a FIN. This is also why the kernel does not treat a FIN as an error: its own end might still want to write!
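
To make the sequence above concrete, here is a minimal sketch in Python, whose socket module is a thin wrapper over the same send(), recv() and select() calls. The helper names (peer_has_closed, connected_pair, demo) are made up for illustration, the sleeps are only there to let the FIN and the RST cross the loopback interface, and the exact errno on the second write may vary between systems, so treat this as a demonstration under those assumptions rather than a definitive recipe.

```python
import errno
import select
import socket
import time


def peer_has_closed(sock):
    """Return True if the peer's FIN is already queued on sock.

    Hypothetical helper: polls for readability without blocking, then
    peeks at one byte. An empty result from recv() means end-of-stream.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False  # nothing queued: neither data nor a FIN so far
    try:
        return sock.recv(1, socket.MSG_PEEK) == b""
    except ConnectionResetError:
        return True  # an RST has already arrived


def connected_pair():
    """Build a loopback TCP connection and return (a, b); A is the writer."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    a.connect(listener.getsockname())
    b, _ = listener.accept()
    listener.close()
    return a, b


def demo():
    a, b = connected_pair()
    b.close()          # B sends a FIN; A never reads, so it never notices
    time.sleep(0.2)    # give the FIN time to reach A

    closed_before_write = peer_has_closed(a)  # the select() check
    first = a.send(b"first")  # accepted by A's stack despite the FIN
    time.sleep(0.2)    # give B's RST time to come back

    second_errno = None
    try:
        a.send(b"second")  # now the error finally surfaces
    except OSError as exc:
        second_errno = exc.errno
    a.close()
    return closed_before_write, first, second_errno


if __name__ == "__main__":
    closed, first, err = demo()
    print("peer closed before first write:", closed)
    print("bytes accepted by first write:", first)
    print("errno on second write:", errno.errorcode.get(err, err))
```

Keep in mind that the check only says the other end has stopped writing. A peer that merely called shutdown(SHUT_WR) looks exactly the same, and the FIN can still arrive between the check and the write, so the error path on write has to stay in place.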
