To culminate (for now) my "Bug Hunts" series of posts, I will attempt to
describe the hardest problem that I have solved to date. It was a
complicated problem that I found challenging to describe without
going into excessive detail, but I think the discussion below does an
adequate job.
In prior posts, I described the ccNUMA system that I worked on during
my first industry job. After the (relative) success of the first
system, a second-generation system was developed that consisted of
numerous incremental improvements - faster processors, better memory
controllers, faster interconnect ASICs, a few coherency protocol
optimizations, etc. In general, the pre-production testing went
smoothly and the system started shipping with the fullest of
confidence.
Shortly after going GA, however, a coherency protocol problem appeared
in the field. The problem only seemed to occur on the largest
(eight-node) second-generation systems and had a long mean time between
failures, even under very heavy workloads. After ruling out the obvious
suspects (those coherency protocol optimizations I mentioned), we
instrumented an eight node system in the lab and started the bug hunt.
To describe the failure, I need to further explain the SCI coherency
protocol used in these systems. Like other distributed shared memory
protocols, the SCI protocol segmented the system's memory into
fixed-size chunks called "lines" (64 bytes in SCI's case). Being a ccNUMA
machine, the lines were distributed across all of the nodes in the
system. For each line of memory in the system there was a single
"home-node" that contained the associated physical memory and served
as the coordination point for all accesses to it. Mapping tables,
initialized at boot time and unchanged while the machine was running,
identified the home-node for each line of
memory. "Remote-nodes" could access the lines from home-nodes and
cache them via the SCI protocol and interconnect.
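To make the home-node mapping concrete, here is a minimal C sketch; it assumes, purely for illustration, that each node's 4GB of memory occupied one contiguous region of the physical address space (the post doesn't describe the actual table format or granularity).

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE      64                            /* SCI line size in bytes */
    #define NODE_MEM_BYTES (4ULL * 1024 * 1024 * 1024)   /* 4GB of memory per node */

    /* Hypothetical stand-in for the boot-time mapping tables: with one
     * contiguous 4GB region per node, the home-node is a pure function
     * of the physical address and never changes while the machine runs. */
    static unsigned home_node_for(uint64_t paddr)
    {
        return (unsigned)(paddr / NODE_MEM_BYTES);
    }

    int main(void)
    {
        uint64_t paddr = 5ULL * 1024 * 1024 * 1024;   /* falls in node 1's memory */
        printf("line %llu is homed on node %u\n",
               (unsigned long long)(paddr / LINE_SIZE), home_node_for(paddr));
        return 0;
    }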
One big challenge in using the SCI protocol was that it assumed an
architecture where the memory and processors were located on separate
nodes. In this model, home-nodes were passive and simply responded to
protocol requests from remote nodes - the protocol didn't provide a
means for home-nodes to obtain access to their own lines cached
elsewhere in the system. Unfortunately, our system used a different
architecture, with each node containing both processors and memory,
which put it at odds with the standard SCI model. As a result, a
workaround was needed in our system to allow home-nodes to obtain
access to their own memory lines.
The chosen workaround was essentially to allow home-nodes to
temporarily adopt multiple personalities. To regain access
to a local memory line cached elsewhere in the system, a home-node
would:
- Temporarily create a fictitious remote node persona
- Participate in the coherency protocol as a remote node to obtain the desired access rights
- Act as a remote-node performing an eviction to remove the fictitious-remote-persona from the coherency sharing-chain (SCI used a distributed linked list to identify the remote-nodes with cached copies of each line).
While this dual-personality solution worked, the need to
simultaneously track both the home and fake-remote-persona protocol
states significantly increased the complexity of our implementation
(foreshadow alert!).
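Here is a loose C sketch of that three-step dance; the function names are invented stand-ins for what were really microcoded protocol sequences on the Coherency ASIC, not C functions.

    #include <stdint.h>
    #include <stdio.h>

    /* Invented stubs for the SCI protocol operations involved. */
    static void sci_attach_as_remote(uint64_t line)
    {
        printf("line %llu: create fictitious remote persona\n", (unsigned long long)line);
    }

    static void sci_fetch_read_copy(uint64_t line)
    {
        printf("line %llu: obtain read copy as a 'remote' node\n", (unsigned long long)line);
    }

    static void sci_evict_from_chain(uint64_t line)
    {
        printf("line %llu: unlink persona from distributed sharing list\n", (unsigned long long)line);
    }

    /* The dual-personality workaround reduced to its three steps. */
    static void home_node_regain_access(uint64_t line)
    {
        sci_attach_as_remote(line);   /* 1. adopt a fake remote persona          */
        sci_fetch_read_copy(line);    /* 2. participate in the protocol as a
                                            remote node to gain access rights    */
        sci_evict_from_chain(line);   /* 3. evict the persona from the sharing
                                            chain once access is obtained        */
    }

    int main(void)
    {
        home_node_regain_access(42);
        return 0;
    }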
Another important aspect of the system was the way that the per-line
access-rights information was maintained. The Coherency ASIC was
responsible for maintaining the SCI state (which indicated the access
rights) for each line of local memory and remote memory cached in the
L3 cache. Because of the amount of main-memory (4GB per node) and L3
cache (512MB per node), the SCI state information was stored in an
external DRAM attached to the Coherency ASIC. Storing the SCI
information this way presented a problem: querying the DRAM for each
transaction on the processor bus (necessary to maintain coherency)
would have taken too long. To get around this
problem, an SRAM attached to the FSB ASIC was used to cache the
access-rights information for recently accessed lines (both local and
remote). Using the SRAM, the FSB ASIC could quickly determine if a
processor bus operation could be allowed to continue (node had
sufficient access) or needed to be postponed (node didn't have
sufficient access). Although the SRAM allowed the majority of
processor bus operations to complete without delay, it burdened the
Coherency ASIC with the additional task of keeping the access-rights
cached in SRAM consistent with the SCI tags (another foreshadow
alert!).
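In pseudo-C, the FSB ASIC's fast-path decision looked something like the following; the access-rights encoding and lookup function are invented for illustration, since the post doesn't give the actual SRAM format.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed access-rights encoding; the real SRAM format isn't described. */
    typedef enum { ACCESS_NONE, ACCESS_READ_ONLY, ACCESS_READ_WRITE } access_t;

    /* Stub for the SRAM lookup of a recently accessed line's rights. */
    static access_t sram_lookup(uint64_t line)
    {
        (void)line;
        return ACCESS_READ_ONLY;   /* placeholder; real hardware indexed the SRAM */
    }

    /* Let a processor bus operation continue only if the cached rights
     * are sufficient; otherwise it must be postponed while the Coherency
     * ASIC obtains access via the (much slower) SCI protocol path. */
    static bool bus_op_may_proceed(uint64_t line, bool is_write)
    {
        access_t rights = sram_lookup(line);
        return is_write ? (rights == ACCESS_READ_WRITE)
                        : (rights != ACCESS_NONE);
    }

    int main(void)
    {
        return bus_op_may_proceed(42, /*is_write=*/false) ? 0 : 1;
    }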
With those additional explanations out of the way…
Our initial investigation into the failure revealed that the primary
symptom was an inconsistency between the SCI and SRAM states for a
local memory line. Specifically:
- The SCI state indicated that the line was not cached anywhere else in the system - the local node had full access rights to the line.
- The SRAM state indicated that the line was cached in a modified state elsewhere in the system - the local node had no access rights to the line.
Oh bother. This inconsistency was obviously incorrect and the ASICs
were designed to generate a panic when it occurred to prevent data
corruption (which access-rights information was correct??).
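In code terms, the trap amounted to something like the following check (state names invented); with the two state stores contradicting each other, panicking was safer than guessing.

    #include <stdio.h>
    #include <stdlib.h>

    /* Invented encodings for the two contradictory pieces of state. */
    typedef enum { SCI_HOME_ONLY, SCI_MODIFIED_REMOTE } sci_state_t;
    typedef enum { SRAM_FULL_ACCESS, SRAM_NO_ACCESS } sram_state_t;

    static void panic(const char *msg)
    {
        fprintf(stderr, "PANIC: %s\n", msg);
        abort();
    }

    /* The observed failure: SCI says "nobody else has this line" while
     * the SRAM says "someone else has it modified". Neither can be
     * trusted, so stop before any data is corrupted. */
    static void check_consistency(sci_state_t sci, sram_state_t sram)
    {
        if (sci == SCI_HOME_ONLY && sram == SRAM_NO_ACCESS)
            panic("SCI/SRAM access-rights mismatch on local line");
    }

    int main(void)
    {
        check_consistency(SCI_HOME_ONLY, SRAM_NO_ACCESS);   /* triggers the panic */
        return 0;
    }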
At first, we couldn't think of how such an inconsistency could
occur. After much effort we managed to get a synthetic workload
running in the lab that could induce the failure approximately once
per day. However, because of the system's size (eight nodes), the large
number of buses of interest (3 per node, 24 total), and the limitation
of having only two logic analyzers, it was extremely difficult to get
a clear picture of the failure. At best, all we could get (if we were
lucky) was a trace of a couple of bus operations being performed
against the failing line. It was clear that we stood little chance of
capturing a definitive, complete trace of the failure - it was time to
put on our thinking caps.
As luck would have it, this failure emerged at the end of a financial
quarter and there was a substantial order being prepared for
shipment. Management was unwilling to ship the order until the problem
was resolved which meant there was a large amount of revenue at
risk. All of a sudden, this problem got upper-management's full
attention - not good.
Given the problem's complexity and the substantial amount of
executive attention, my mentor (the MicroKid) and I isolated ourselves
in his office and began brainstorming how the state inconsistency
could occur. After a while, we identified a couple of areas in the
microcode that could lead to such a problem but neither of us could
think of a sequence of events to manifest it. After a lot of joint
discussion we decided to brainstorm separately to parallelize the
effort.
For three days I sat in my cube listening to music, staring at trace
fragments, and assembling in my mind all the possible interactions
that could cause the state inconsistency we were observing. I don't
think I have ever concentrated on a single thing as hard for that long
at any other time in my life. Finally, everything "clicked" and I knew
how the state inconsistency was occurring.
The failure involved a complex interaction between three nodes with one
being the associated memory-line's home node. The following is a (simplified!)
description of the failure:
1. The initial condition was that the line was cached in a modified state in a remote node (Node1) and the home-node was attempting to get read access. To accomplish this, the home-node created its fake-remote-persona and performed the necessary operations to obtain a read copy of the line. Upon getting a copy of the line, the SRAM access rights were set to "read-only" and the local processor's request was completed.
2. In order to remove the fake-remote-persona from the SCI sharing chain, the home-node began an eviction operation to get rid of the now unneeded second personality by sending a request to Node1.
3. Just as the home-node began the eviction operation for its fake-remote-persona, Node1 also began an operation to evict the line from its L3 cache. According to the SCI protocol, such eviction races were resolved by allowing the "older" remote node (Node1) to proceed ahead of the "younger" remote node (in this case, the home-node's fake-persona).
4. While the eviction race between the home-node and Node1 was being resolved, a third node (Node2) started an operation to obtain read access to the same line, which completed successfully.
5. Upon completing the eviction operation with Node1, the home-node detected that Node2 had obtained read access to the line (step 4), which meant that it now had to perform an eviction operation with Node2 in order to finish removing the fake-remote-persona from the SCI sharing chain. This required the home-node to downgrade the SRAM access rights to "no-access" to guard against potential SCI operations that could invalidate the local copy. The home-node then resumed its eviction effort by informing Node2 that it was leaving the sharing chain.
6. Node2 received the eviction request from the home-node's fake-remote-persona and sent a response.
7. Immediately after sending the response in step 6, Node2 began an operation to evict the line from its L3 cache.
8. Due to an extremely unlikely sequence of events, Node2's eviction request from step 7 was received and processed by the home-node while Node2's response from step 6 was still in flight. For this to happen, a request had to bypass a response in the interconnect, which was highly improbable because responses were given priority over requests.
At the end of this sequence, the SCI state was correct in indicating
that the line was not cached anywhere else in the system. The problem
was that the microcode didn't explicitly check for the
request-bypassing-response ordering that occurred in step 8. As
a result, it failed to update the SRAM information with the
correct access rights, leading to the inconsistency. Over a decade
later, I still get a brain cramp from thinking about this sequence of
events.
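Conceptually, the fix boiled down to handling one more arrival order. Here is a minimal sketch of the missing case; every name below is invented, and the real change was in ASIC microcode, not C.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { ACCESS_NONE, ACCESS_READ_ONLY, ACCESS_READ_WRITE } access_t;

    /* Stubs standing in for microcode state queries and SRAM updates. */
    static bool eviction_request_already_processed(uint64_t line)
    {
        (void)line;
        return true;   /* the step-8 case: the request bypassed the response */
    }

    static void sram_set_rights(uint64_t line, access_t rights)
    {
        (void)line; (void)rights;
    }

    /* Handler for the in-flight response finally arriving: if the other
     * node's eviction request was already processed, the line is no
     * longer cached anywhere else, so the SRAM must be upgraded to match
     * the SCI state - the update the original microcode missed. */
    static void on_persona_eviction_response(uint64_t line)
    {
        if (eviction_request_already_processed(line))
            sram_set_rights(line, ACCESS_READ_WRITE);
    }

    int main(void)
    {
        on_persona_eviction_response(42);
        return 0;
    }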
Once the microcode bug was identified, it took less than five minutes
to fix. Luckily, I figured out and fixed the problem on the Friday
afternoon before the end of the quarter, so the big order sitting on
the docks was able to ship, the company got to recognize the revenue
in the quarter, and the account's sales team got a hefty
commission. For my efforts, my management compensated me with, wait
for it, a cup of coffee. Seriously, after "saving" millions in revenue
I got a $2.00 cup of coffee from my Director. Oh well, luckily I
wasn't in the role for the money :-).
And that, to date, is the hardest problem that I have ever solved.
The following diagram summarizes the failure sequence described
above. Note that to simplify the diagram I avoided using the actual SCI
commands and states involved as their obscure nature would have
required too much explanation. Recall that each lab trace contained,
at best, only one or two of the included operations. To solve the
problem, I more or less had to imagine all of the possible
interactions that could be occurring and then identify this sequence
as a failing case.