Thursday, November 12, 2009

The Case of the Deadlocked Queues

Updated on December 4, 2009 with a diagram.

In a previous post I described my first industry job where I got to work on the coherency subsystem for an enterprise class ccNUMA server. For my "Bug Hunts" series of posts I thought I would describe a complex problem that I root-caused during pre-production testing.

During the alpha silicon testing, the systems started hanging under heavy load. Our initial investigations discovered that operations on the processor (FSB) and inter-ASIC (F2CBus) busses appeared to get stuck waiting for various pregrant signals to activate. The situation looked bad: it appeared that we had a multi-ASIC deadlock on our hands.

In simple terms, the coherency subsystem on each node had two responsibilities:

  • accept requests from the local processors, send requests to other nodes to perform the necessary coherency & memory operations on the remote processor busses, and return responses to the local processors.
  • accept requests from other nodes on behalf of remote processors, perform the necessary coherency & memory operations on the local processor bus, and return responses to the other nodes for the remote processors.

To guarantee forward progress, the coherency subsystem had one golden rule - requests shall never block responses. At the transport layer this rule was enforced through the use of two independent virtual channels with priority given to the response channel. Within the ASICs, the golden rule was maintained by carefully designed scheduling and priority logic. Theoretically the golden rule should have prevented deadlocks from occurring - so much for theory.
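
To make the golden rule concrete, here is a minimal sketch of the transport-layer scheme - Python with hypothetical names, purely for illustration, since the real priority logic lived in ASIC scheduling hardware rather than software. Two independent virtual channels share a link, and the response channel is always drained first, so a backlog of requests can never block a response:

```python
from collections import deque

class VirtualChannelArbiter:
    """Toy model of the transport-layer golden rule: responses never wait on
    requests.  Names and structure are illustrative assumptions only."""

    def __init__(self):
        # Independent buffering per virtual channel, so request traffic
        # cannot consume resources needed by responses.
        self.request_vc = deque()
        self.response_vc = deque()

    def enqueue(self, message, is_response):
        (self.response_vc if is_response else self.request_vc).append(message)

    def next_to_send(self):
        # Strict priority: drain the response channel first.  A request only
        # moves when no response is waiting, so requests can never block
        # responses - the golden rule.
        if self.response_vc:
            return self.response_vc.popleft()
        if self.request_vc:
            return self.request_vc.popleft()
        return None

arbiter = VirtualChannelArbiter()
arbiter.enqueue("remote read request", is_response=False)
arbiter.enqueue("local read response", is_response=True)
assert arbiter.next_to_send() == "local read response"
```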

Since this was a real system, our view was limited to the activity observed on the busses - we had no visibility into what was happening inside the ASICs. To get around the visibility problem, I worked closely with a friend on the verification team to create a script that reproduced, in simulation, the activity observed in the lab. Through these simulation tests I gained an insight into what was happening inside the ASICs. In the meantime, I consulted with the ASIC designers to better understand some of the unexpected behaviors discovered during the investigation.

I can still remember well the meeting when everything suddenly "clicked" in my head. We had brought together both ASIC teams (which were geographically distributed) to discuss the problem and attempt an explanation. To focus the conversation, I asked a series of very specific "what if" questions that finally led to an "ah ha" moment.

As with most complex failures, a number of smaller problems combined to cause the deadlock. The diagram included in my prior post may provide more context for the explanation that follows.

The first problem was that the FSB ASIC violated the golden rule by strictly ordering, in a single queue, requests from remote processors and responses to local processors. During the investigation it was discovered that this design change was made to resolve a race condition between evicting modified data from the L3 cache (accomplished via a remote processor request) and installing the replacement data into the cache (accomplished via a local processor response). Oops.

The second problem was much more subtle: there was no ordering rule between local and remote responses! The microsequencer inside the Coherency ASIC was responsible for processing all coherency requests, both from the local node and from other nodes, and generating the associated responses. The microsequencer treated all out-bound responses equally and used a common queue to buffer them for transmission. This single out-bound response queue essentially created a cycle in the path of queues between the Coherency and FSB ASICs - a response blocked at the head of this queue would prevent all other microsequencer responses from making forward progress.

The third problem was that the microsequencer could issue more requests than it had buffers to hold the resulting out-bound responses. This meant that the microsequencer relied completely on its out-bound responses making forward progress in order to be able to receive responses for the requests it had issued.

Together, these three problems resulted in a deadlock when:

  1. The coherency microsequencer sent a burst of local processor responses to the FSB ASIC. These actions freed coherency engine resources but consumed FSB ASIC resources, preventing it from accepting more responses.
  2. The coherency engine's response queue filled up with an FSB-ASIC-bound local processor response at the head of the queue. At this point, the resource consumption described in (1) prevented the response at the head of the queue from proceeding.
  3. Shortly thereafter, the FSB ASIC completed a number of Coherency ASIC requests and sent responses to the microsequencer. Because the microsequencer's out-bound response queue was blocked by (2), it couldn't send the remote responses generated by the FSB ASIC responses. This condition prevented the microsequencer from accepting more responses and caused the FSB ASIC's single request/response queue to stall.
  4. The local processor responses already in the FSB ASIC queue from (1) were unable to make progress due to the stall described in (3). As a result, the FSB ASIC could not accept more responses to alleviate the stall described in (2).

The result: neither ASIC could make progress until the other could - a classic deadlock.
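
To make the cycle easier to see, here is a toy simulation of the simplified scenario. The queue names, depths, and processing rules are assumptions chosen purely for illustration - the real ASICs were far more complicated - but it captures the essential loop: the FSB ASIC's single ordered queue mixes remote requests with local responses, the microsequencer funnels every out-bound response through one strictly ordered queue, and once each queue fills with traffic that needs the other to drain first, nothing can move.

```python
from collections import deque

# Toy model of the simplified deadlock (names and depths are made up).
FSB_DEPTH = 4    # FSB ASIC's single ordered request/response queue (problem 1)
USEQ_DEPTH = 4   # microsequencer's single out-bound response queue (problem 2)

# Step (1): a burst of local responses has filled the FSB ASIC queue behind a
# pending remote processor request.
fsb_queue = deque(["remote_request",
                   "local_response", "local_response", "local_response"])

# Step (2): the microsequencer's out-bound response queue is full, with an
# FSB-bound local response stuck at its head and remote responses behind it.
useq_out = deque(["local_response",
                  "remote_response", "remote_response", "remote_response"])

def step():
    """Attempt every legal move; return True if anything made progress."""
    progressed = False

    # The microsequencer drains its single out-bound queue strictly in order.
    if useq_out:
        if useq_out[0] == "remote_response":
            useq_out.popleft()                  # the inter-node link accepts it
            progressed = True
        elif useq_out[0] == "local_response" and len(fsb_queue) < FSB_DEPTH:
            fsb_queue.append(useq_out.popleft())
            progressed = True

    # The FSB ASIC services its single queue strictly in order.
    if fsb_queue:
        if fsb_queue[0] == "local_response":
            fsb_queue.popleft()                 # issue on the FSB and retire
            progressed = True
        elif fsb_queue[0] == "remote_request" and len(useq_out) < USEQ_DEPTH:
            fsb_queue.popleft()                 # complete on the FSB...
            useq_out.append("remote_response")  # ...which generates a response
            progressed = True

    return progressed

while step():
    pass

if fsb_queue or useq_out:
    print("Deadlock: neither queue can drain.")
    print("  FSB ASIC queue      :", list(fsb_queue))
    print("  microsequencer queue:", list(useq_out))
```

In this toy model, either giving the microsequencer one spare out-bound slot or routing remote responses around the blocked local response (the missing ordering rule) lets everything drain - which loosely mirrors the queue-depth workaround and the eventual redesign described below.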

In reality, the actual deadlock scenario was much more complicated than this - it required a complex sequence of events, specific event timings, and an interplay with non-coherent inter-node accesses (i.e. accesses to another node's PCI space). However, the simplified explanation above captures the salient points - in particular the need for an additional ordering rule between local and remote responses.

Once the problem was understood, the fix was relatively easy. Luckily, we had programmatic control over the depths of various ASIC queues, which allowed us to configure the system to avoid the deadlock condition without having to re-spin silicon - and then there was much rejoicing.

Although the deadlock had been solved, the story didn't end there for me. I was so fascinated by this problem that I spent a significant amount of time over the next few months writing a program capable of analyzing a system of queues to discover such deadlocks. 10K lines of Perl code later, I had a program that took as input a queuing model expressed in a domain-specific language that I designed for the task. The model language was rich enough to represent such things as different kinds of tokens, tokens spawning new tokens, global token limits, conditional routing rules for tokens, and the sharing of limited resources. The program used this information to find patterns of tokens in queues that resulted in deadlock conditions. Although I didn't know it at the time, I essentially used a constraint propagation method to reduce the search space to a reasonable size. Using the program, I was able to "predict" the deadlock condition found in the lab and validate the configuration changes made to avoid it. Although the tool never got used again, the sense of personal accomplishment from developing it made the effort worthwhile.
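
The tool itself is long gone, but the core idea can be sketched in a few lines - in Python here rather than the original Perl, and with a model far simpler than the DSL described above. Given a snapshot of each queue's capacity, contents, and a routing rule for its head token, the sketch builds a waits-for graph and reports any cycle of full, mutually blocked queues; the real tool went further and searched the space of token placements that could reach such a state.

```python
def find_deadlock(queues, capacity, routes):
    """Report a deadlock cycle in a snapshot of a queue system (toy sketch).

    queues:   {name: [tokens]}         current contents, head first
    capacity: {name: int}              maximum depth of each queue
    routes:   {(queue, token): queue}  where each head token needs space next
                                       (no entry means it can always retire)
    Returns a list of queue names forming a deadlock cycle, or None."""
    # A queue is blocked if its head token's destination queue is full.
    waits_for = {}
    for name, tokens in queues.items():
        if not tokens:
            continue
        target = routes.get((name, tokens[0]))
        if target is not None and len(queues[target]) >= capacity[target]:
            waits_for[name] = target

    # Walk the waits-for edges; revisiting a queue on the same walk means a
    # cycle of full, mutually blocked queues - nothing in it can ever move.
    for start in waits_for:
        seen, node = [], start
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None

# The simplified two-ASIC scenario from this post, expressed as a snapshot:
queues = {
    "useq_out":  ["local_resp", "remote_resp", "remote_resp"],
    "fsb_queue": ["remote_req", "local_resp", "local_resp"],
}
capacity = {"useq_out": 3, "fsb_queue": 3}
routes = {
    ("useq_out", "local_resp"): "fsb_queue",  # the head must enter the FSB queue
    ("fsb_queue", "remote_req"): "useq_out",  # completing it needs a useq slot
}
print(find_deadlock(queues, capacity, routes))  # -> ['useq_out', 'fsb_queue']
```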

My history with this deadlock finally came to an end when, for the third generation system, I redesigned the coherency engine to fix this and other troublesome bugs. The opportunity to fix this problem myself really brought a fulfilling sense of closure to the experience.

This deadlock bug is one of my favorite bug hunts of all time. It was a complex problem that required a deep understanding of the system, a holistic view of its behavior, and the ability to see how the many micro-behaviors interacted to cause the deadlock. Furthermore, it led to the simulation and ASIC re-design activities which were equally rewarding experiences.

This, however, wasn't the hardest bug that I ever root-caused. That will be the topic of the next Bug Hunt post once I figure out how to describe it.

UPDATE

To clarify this post I decided to add a simple diagram of the deadlock condition. The actual failure was much more complex but hopefully this diagram captures the salient points from the description above - the impact of the shared queue for requests+responses and the need to route local and remote responses differently.