Wednesday, September 23, 2009

The Case of the Replayed Queue

A number of years ago, I wrote a Linux device driver for a host bus adapter (HBA). The interface to the HBA consisted of two circular buffers, one to send information to the HBA (TxQ) and another to receive information from the HBA (RxQ). In this particular project, the HBAs were being used in a non-standard way to implement a bi-directional communication link between two Linux systems. As a result, the driver sent both commands and responses via the TxQ and received both commands and responses via the RxQ. In both cases, the driver communicated with the HBA using fixed-size data structures called IO control blocks (IOCBs).

Another aspect of this non-traditional use of the HBAs was that a proprietary upper layer protocol was used instead of the standard protocol used with the HBA. In this case, the protocol command packets did not fit within a single IOCB. As a result, two IOCBs were needed to send and receive command packets between the driver and HBA. These IOCB pairs were required to be adjacent in the circular buffers.

As the project reached the beta-testing phase, a perplexing driver bug began manifesting during QA testing. The failure was extremely rare happening only once a day while under heavy load. Unfortunately, the unit-test I had created to test the driver ex-situ continued to run without error so the problem was only occurring in-situ in QA'a full system configuration and under their heavy load. Oh bother.

Faced with a difficult bug and a nearing deadline I got straight to work on figuring out the problem. To start, I reviewed the QA logs to determine a pattern to the failures. The evidence suggested that the driver was re-processing protocol packets that had already been handled. To me this indicated a potential problem with the RxQ circular buffer but a code-review didn't reveal any obvious bugs.

I proceeded on to debugging the driver in-situ. Unfortunately, for reasons that I can't recall, I had little success using a kernel debugger. As a result, I initially tried to use print statements but the verbose system messages generated by the failure caused the information to immediately scroll off the screen. I tried sending debug information out a serial port but this slowed the system down enough to significantly increase the time between failures beyond practicality. Hmph.

After a week of frustration and an increasing pressure from the team to resolve the bug, I decided to take a fresh view of the problem. First, I considered what could cause the driver to replay the RxQ and quickly concluded that erroneously putting the tail-pointer past the head-pointer could cause the problem. A quick review of the head-tail pointer management code didn't reveal any mistakes. Hmph.

Next, I considered how the head-tail pointer code could be tricked into putting the tail pointer beyond the head. I reasoned that one way this could happen was if the driver thought the contents of the circular buffer were longer than its actual size. If this were the case, then when the driver updated the tail pointer it would put it past the head which would erroneously make the circular appear full when the driver re-examined it. But what could make the driver think that the buffer contents were longer than they actually were?

After some deep pondering the problem hit me, it must those multi-IOCB commands! My thought was that if for some reason the second of the two IOCBs wasn't posted to the queue when the driver processed it and if the driver didn't check for this condition, that the driver would adjust the tail pointer as if the second IOCB was present which would put it after the head pointer. The timing window for this seemed to be impossibly small but the failure was taking a day to manifest under heavy load so it had to be a rare sequence of events. If there is one thing that I have learned from my years working on large-scale parallel computer systems it's that highly improbably events occur far more frequently than one suspects.

I checked the code and confirmed that I had failed to account for incomplete multi-IOCBs. I added the necessary guards to the code, gave a new binary to QA, and crossed my fingers. A few days later success was declared and from then on the driver performed without error.

I never figured out why the HBA was delayed in posting the second half of the protocol command IOCBs to the RxQ. But you can be sure that I'll never forget to check for circular buffer fragments again!