Saturday, November 21, 2009

The Case of the Corrupted Cache Line

To culminate (for now) my "Bug Hunts" series of posts, I will attempt to describe the hardest problem I have solved to date. It was a very complicated problem, and I found it challenging to describe without going into excessive detail, but I think the discussion below does an adequate job.

In prior posts, I described the ccNUMA system that I worked on during my first industry job. After the (relative) success of the first system, a second generation system was developed that consisted of numerous incremental improvements - faster processors, better memory controllers, faster interconnect ASICs, a few coherency protocol optimizations, etc. In general, the pre-production testing went smoothly and the system started shipping with the fullest of confidence.

Shortly after going GA, however, a coherency protocol problem appeared in the field. The problem only seemed to occur on the largest (eight node) second generation systems and had a long mean time between failure even under very heavy workloads. After ruling out the obvious suspects (those coherency protocol optimizations I mentioned), we instrumented an eight node system in the lab and started the bug hunt.

To describe the failure, I need to further explain the SCI coherency protocol used in these systems. Like other distributed shared memory protocols, the SCI protocol segmented the system's memory into fixed-size chunks called "lines" (64 bytes in SCI's case). Being a ccNUMA machine, the lines were distributed across all of the nodes in the system. For each line of memory in the system there was a single "home-node" that contained the associated physical memory and served as the coordination point for all accesses to it. Mapping tables initialized at boot time identified the home-node for each line of memory, a mapping that remained constant while the machine was running. "Remote-nodes" could access the lines from home-nodes and cache them via the SCI protocol and interconnect.
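
To make the home-node mapping concrete, here is a minimal sketch in C of how a physical address might be resolved to its line number and home-node. The 64-byte line size comes from SCI; the interleaving scheme, constants, and function names are hypothetical illustrations, not the machine's actual boot-time tables.

```c
/* A minimal sketch (not the actual implementation) of mapping a physical
 * address to its SCI home-node and line number.  The 64-byte line size is
 * from the SCI spec; the simple contiguous-4GB-per-node layout shown here
 * is an assumption made for illustration. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE      64              /* SCI coherency line size in bytes */
#define NUM_NODES       8              /* largest second-generation config */
#define NODE_MEM_BYTES (4ULL << 30)    /* 4GB of main memory per node      */

static unsigned home_node(uint64_t paddr)
{
    return (unsigned)((paddr / NODE_MEM_BYTES) % NUM_NODES);
}

static uint64_t line_number(uint64_t paddr)
{
    return paddr / LINE_SIZE;          /* which 64-byte line the address falls in */
}

int main(void)
{
    uint64_t paddr = 0x1A0000040ULL;   /* arbitrary example address */
    printf("line %llu is homed on node %u\n",
           (unsigned long long)line_number(paddr), home_node(paddr));
    return 0;
}
```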

One big challenge in using the SCI protocol was that it assumed an architecture where the memory and processors were located on separate nodes. In this model, home-nodes were passive and simply responded to protocol requests from remote nodes - the protocol didn't provide a means for home-nodes to obtain access to their own lines cached elsewhere in the system. Unfortunately, our system used a different architecture with each node containing both processors and memory, which put it at odds with the standard SCI model. As a result, a workaround was needed in our system to allow home-nodes to obtain access to their own memory lines.

The chosen workaround was to essentially allow home-nodes to temporarily adopt multiple personalities. In order to regain access to a local memory line cached elsewhere in the system, a home-node would:

  1. Temporarily create a fictitious remote node persona
  2. Participate in the coherency protocol as a remote node to obtain the desired access rights
  3. Act as a remote-node performing an eviction to remove the fictitious-remote-persona from the coherency sharing-chain (SCI used a distributed linked list to identify the remote-nodes with cached copies of each line).

While this dual-personality solution worked, the need to simultaneously track both the home and fake-remote-persona protocol states significantly increased the complexity of our implementation (foreshadow alert!).
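
To illustrate the mechanics (and only the mechanics - this is not the real microcode), here is a toy sketch that models a line's sharing chain as an ordinary linked list, with the home-node's fake remote persona joining the chain and later evicting itself. In the real machine the chain was distributed across the nodes' coherency tags; the names and structures here are purely illustrative.

```c
/* Toy model of the dual-personality workaround: the SCI sharing chain is
 * shown as a local linked list of node IDs, whereas in the real machine it
 * was a distributed linked list spread across the nodes. */
#include <stdio.h>
#include <stdlib.h>

struct sharer {
    int node_id;                /* node holding a cached copy of the line */
    struct sharer *next;        /* next entry in the sharing chain        */
};

/* Prepend a sharer, the way a newly caching node joins the chain head. */
static struct sharer *chain_join(struct sharer *head, int node_id)
{
    struct sharer *s = malloc(sizeof *s);
    s->node_id = node_id;
    s->next = head;
    return s;
}

/* Unlink a sharer, the way an evicting node (or the home-node's
 * fake remote persona) leaves the chain. */
static struct sharer *chain_evict(struct sharer *head, int node_id)
{
    struct sharer **pp = &head;
    while (*pp) {
        if ((*pp)->node_id == node_id) {
            struct sharer *victim = *pp;
            *pp = victim->next;
            free(victim);
            break;
        }
        pp = &(*pp)->next;
    }
    return head;
}

int main(void)
{
    struct sharer *chain = NULL;
    chain = chain_join(chain, 1);      /* Node1 caches the line           */
    chain = chain_join(chain, 0);      /* home-node joins as fake remote  */
    chain = chain_evict(chain, 0);     /* ...and must later evict itself  */
    for (struct sharer *s = chain; s; s = s->next)
        printf("node %d still shares the line\n", s->node_id);
    return 0;
}
```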

Another important aspect of the system was the way that the per-line access-rights information was maintained. The Coherency ASIC was responsible for maintaining the SCI state (which indicated the access rights) for each line of local memory and remote memory cached in the L3 cache. Because of the amount of main-memory (4GB per node) and L3 cache (512MB per node), the SCI state information was stored in an external DRAM attached to the Coherency ASIC. Storing the SCI information in this way presented a problem: querying it for each transaction on the processor bus (which was necessary to maintain coherency) would have taken too long. To get around this problem, an SRAM attached to the FSB ASIC was used to cache the access-rights information for recently accessed lines (both local and remote). Using the SRAM, the FSB ASIC could quickly determine if a processor bus operation could be allowed to continue (node had sufficient access) or needed to be postponed (node didn't have sufficient access). Although the SRAM allowed the majority of processor bus operations to complete undelayed, it burdened the Coherency ASIC with the additional task of keeping the access-rights cached in SRAM consistent with the SCI tags (another foreshadow alert!).
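
As a rough illustration of that fast-path decision, the sketch below shows the kind of check the FSB ASIC performed against the SRAM-cached rights. The enum values and function names are my own inventions for the example, not the hardware's.

```c
/* Sketch of the fast-path check made on each processor-bus operation using
 * the SRAM-cached access rights.  Names and states are hypothetical. */
#include <stdio.h>

typedef enum {
    RIGHTS_NONE,        /* line cached modified elsewhere - no local access */
    RIGHTS_READ_ONLY,   /* local node may read, not write                   */
    RIGHTS_FULL         /* no remote copies - local node may read and write */
} access_rights_t;

typedef enum { OP_READ, OP_WRITE } bus_op_t;

/* Returns nonzero if the bus operation can proceed immediately; zero means
 * it must be postponed while the Coherency ASIC consults the authoritative
 * SCI state in DRAM and runs the coherency protocol. */
static int fast_path_allowed(access_rights_t sram_rights, bus_op_t op)
{
    switch (sram_rights) {
    case RIGHTS_FULL:      return 1;
    case RIGHTS_READ_ONLY: return op == OP_READ;
    case RIGHTS_NONE:      return 0;
    }
    return 0;
}

int main(void)
{
    printf("read with read-only rights allowed?  %d\n",
           fast_path_allowed(RIGHTS_READ_ONLY, OP_READ));
    printf("write with read-only rights allowed? %d\n",
           fast_path_allowed(RIGHTS_READ_ONLY, OP_WRITE));
    return 0;
}
```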

With those additional explanations out of the way…

Our initial investigation into the failure revealed that the primary symptom was an inconsistency between the SCI and SRAM states for a local memory line. Specifically:

  • The SCI state indicated that the line was not cached anywhere else in the system - the local node had full access rights to the line.
  • The SRAM state indicated that the line was cached in a modified state elsewhere in the system - the local node had no access rights to the line.

Oh bother. This inconsistency was obviously incorrect and the ASICs were designed to generate a panic when it occurred to prevent data corruption (which access-rights information was correct??).
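
For the sake of illustration only, a toy version of the check behind that panic might look like the following; the state names are simplified stand-ins for the real SCI states and SRAM encodings.

```c
/* Hypothetical sketch of the consistency check described above: the SCI
 * tag and the SRAM copy of the access rights disagree in a way that cannot
 * be reconciled safely, so the hardware panics rather than risk corruption. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { SCI_HOME_ONLY, SCI_SHARED_REMOTE, SCI_MODIFIED_REMOTE } sci_state_t;
typedef enum { SRAM_FULL, SRAM_READ_ONLY, SRAM_NONE } sram_rights_t;

static void check_consistency(sci_state_t sci, sram_rights_t sram)
{
    /* The failing combination: SCI says no remote copies exist (local node
     * should have full rights) while the SRAM says the line is modified
     * elsewhere (local node has no rights).  Neither can be trusted. */
    if (sci == SCI_HOME_ONLY && sram == SRAM_NONE) {
        fprintf(stderr, "PANIC: SCI/SRAM access-rights mismatch\n");
        abort();
    }
}

int main(void)
{
    check_consistency(SCI_HOME_ONLY, SRAM_NONE);   /* reproduces the panic */
    return 0;
}
```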

At first, we couldn't think of how such an inconsistency could occur. After much effort we managed to get a synthetic workload running in the lab that could induce the failure approximately once per day. However, because of the system's size (eight nodes), the large number of busses-of-interest (3 per node, 24 total), and a limitation of only having two logic analyzers, it was extremely difficult to get a clear picture of the failure. At best, all we could get (if we were lucky) was a trace of a couple of bus operations being performed against the failing line. It was clear that we stood little chance of capturing a definitive, complete trace of the failure - it was time to put on our thinking caps.

As luck would have it, this failure emerged at the end of a financial quarter and there was a substantial order being prepared for shipment. Management was unwilling to ship the order until the problem was resolved, which meant there was a large amount of revenue at risk. All of a sudden this problem got upper-management's full attention - not good.

Given the problem's complexity and the substantial amount of executive attention, my mentor (the MicroKid) and I isolated ourselves in his office and began brainstorming how the state inconsistency could occur. After a while, we identified a couple of areas in the microcode that could lead to such a problem, but neither of us could think of a sequence of events to manifest it. After a lot of joint discussion we decided to brainstorm separately to parallelize the effort.

For three days I sat in my cube listening to music, staring at trace fragments, and assembling in my mind all the possible interactions that could cause the state inconsistency we were observing. I don't think I have ever concentrated on a single thing that hard for that long at any other time in my life. Finally, everything "clicked" and I knew how the state inconsistency was occurring.

The failure involved a complex interaction between three nodes with one being the associated memory-line's home node. The following is a (simplified!) description of the failure:

  1. The initial condition was that the line was cached in a modified state in a remote node (Node1) and the home-node was attempting to get read access. To accomplish this task, the home-node created its fake-remote-persona and performed the necessary operations to obtain a read-copy of the line. Upon getting a copy of the line, the SRAM access rights were set to "read-only" and the local processor's request was completed.
  2. In order to remove the fake-remote-persona from the SCI sharing chain, the home-node began an eviction operation to get rid of the now unneeded second personality by sending a request to Node1.
  3. Just as the home-node began performing the eviction operation for its fake-remote-persona, Node1 also began an operation to evict the line from its L3 cache. According to the SCI protocol, such eviction races were resolved by allowing the "older" remote node (Node1) to proceed ahead of the "younger" remote node (which in this case was the home-node's fake-persona).
  4. While the eviction race condition between the home-node and Node1 was being resolved, a third node (Node2) started an operation to obtain read access to the same line which completed successfully.
  5. Upon completing the eviction operation with Node1, the home-node detected that Node2 had obtained read-access to the line (Step 4) which meant that it now had to perform an eviction operation with Node2 in order to finish removing the fake-remote-persona from the SCI sharing chain. This required the home-node to downgrade the SRAM access rights to "no-access" to guard against potential SCI operations that could invalidate the local copy. The home-node then resumed its eviction effort by informing Node2 that it was leaving the sharing chain.
  6. Node2 received the home-node's fake-remote-persona's eviction request and sent a response.
  7. Immediately after it sent the response in step 6, Node2 began an operation to evict the line from its L3 cache.
  8. Due to an extremely unlikely sequence of events, Node2's eviction request from step 7 was received and processed by the home-node while Node2's response from step 6 was still in-flight. In order for this to have happened, a request had to bypass a response in the interconnect, which was highly improbable due to the priority given to sending responses over requests.

At the end of this sequence, the SCI state was correct in indicating that the line was not cached anywhere else in the system. The problem was that the microcode didn't explicitly check for the request-bypassing-response order of events that occurred in step 8. As a result, the microcode failed to update the SRAM information with the correct access rights thus leading to the problem. Over a decade later, I still get a brain cramp from thinking about this sequence of events.
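
To give a feel for the shape of the problem (heavily simplified, and using hypothetical names rather than the real microcode or SCI states), the missed case amounted to something like the sketch below: when an incoming eviction request leaves no remote copies of the line, the SRAM access rights have to be restored along with the SCI tag, even if that request arrived ahead of a response the home-node was still waiting for.

```c
/* Hypothetical, heavily simplified sketch of the missed case.  The real
 * microcode and SCI state machine were far more involved than this. */
#include <stdio.h>

typedef enum { SRAM_FULL, SRAM_READ_ONLY, SRAM_NONE } sram_rights_t;

struct line_state {
    int remote_sharers;          /* copies cached on other nodes            */
    int awaiting_response;       /* fake-persona eviction response pending  */
    sram_rights_t sram_rights;   /* FSB ASIC's cached access rights         */
};

/* Handle an eviction request from a remote node for a local line. */
static void handle_remote_eviction_request(struct line_state *ls)
{
    if (ls->remote_sharers > 0)
        ls->remote_sharers--;    /* SCI tag update: one fewer sharer */

    /* The missed case: this request bypassed the still in-flight response
     * (awaiting_response is set), so handling only the expected response
     * path is not enough.  Once no remote copies remain, the SRAM rights
     * must be upgraded here as well. */
    if (ls->remote_sharers == 0)
        ls->sram_rights = SRAM_FULL;
}

int main(void)
{
    struct line_state ls = { 1, 1, SRAM_NONE };   /* the situation in step 8 */
    handle_remote_eviction_request(&ls);
    printf("sharers=%d sram=%s\n", ls.remote_sharers,
           ls.sram_rights == SRAM_FULL ? "full" : "not full");
    return 0;
}
```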

Once the microcode bug was identified, it took less than five minutes to fix. Luckily, I figured out and fixed the problem on the Friday afternoon before the end of the quarter, so the big order sitting on the docks was able to ship, the company got to recognize the revenue in the quarter, and the account's sales team got a hefty commission. For my efforts, my management compensated me with, wait for it, a cup of coffee. Seriously, after "saving" millions in revenue I got a $2.00 cup of coffee from my Director. Oh well, luckily I wasn't in the role for the money :-).

And that, to date, is the hardest problem that I have ever solved.

The following diagram summarizes the failure sequence described above. Note that to simplify the diagram I avoided using the actual SCI commands and states involved, as their obscure nature would have required too much explanation. Recall that each lab trace contained, at best, only one or two of the included operations. To solve the problem, I more or less had to imagine all of the possible interactions that could be occurring and then identify this sequence as a failing case.