Saturday, September 19, 2009

The Case of the Errant DMA

The C++ puzzle that I posted about the other day got me thinking of various bugs that I've helped root-cause in the past. I thought it would be fun to post about some of the most memorable ones as a series of posts tagged as BugHunts.

About a year ago, a programmer I was mentoring ran into a problem with errant DMAs in an ATA device driver that he was writing for a proprietary operating system (mostly as a training exercise). He did his initial development in a virtualized environment (i.e., a virtual machine running on a hypervisor), where the device driver worked just fine. However, when run on a real platform, the second and subsequent DMA operations would access "random" memory locations instead of those specified in the supplied scatter-gather lists.

After confirming that the basic scatter-gather code was working properly, the developer asked if I could help identify the problem. I started off by asking him to walk me through how the DMA operations were supposed to occur. Next, we walked through the code to ensure that it matched his description. Not seeing any obvious bugs, we then proceeded to run the code, break into the debugger, and inspect the scatter-gather lists to make sure that they contained the correct source/destination addresses. We observed the first DMA operation, which always worked, and the subsequent operations, which always failed, but saw no differences between the two. Humph.

At this point, I suggested that we focus on the fact that the first operation always succeeded but the subsequent operations always failed. I asserted that there must be a difference between how the first and subsequent operations were set up. So again we inspected the code and watched the execution in the debugger but didn't see any differences. Double humph.

After some thought, I hypothesized that perhaps the first DMA operation was unexpectedly altering settings that were set up during the device driver's initialization. Since these setup operations were performed outside of the DMA setup code, we wouldn't observe any execution differences between the first and subsequent operations. So again we walked through the DMA operation, but this time we considered the additional configuration settings that it depended on. Bingo.

The chipset in question used a memory-resident scatter-gather list. Since only a single DMA operation was outstanding at any time (per ATA channel), the developer pre-allocated the memory region during the device driver's initialization and wrote the region's base address to the appropriate chipset register. Since the same memory region was reused to hold the scatter-gather list for each operation, the developer didn't rewrite the base address to the register for each operation.

It turned out that on the virtualized platform the emulated ATA chipset left this register unchanged between operations, hence the device driver worked fine. However, on the real platform the chipset modified the register to point to the active scatter-gather element as the DMA operation was performed. As a result, the first DMA operation completed OK while all subsequent DMA operations walked off the end of the scatter-gather table and misdirected the DMA accesses based on the random data present in memory above the scatter-gather table. Once the developer added the code to re-initialize the scatter-gather base address register for each operation everything worked fine.
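For readers who like to see the failure in code, here is a minimal simulation in C. It is only a sketch under stated assumptions: the register model, the element layout, and every name in it (`sg_entry`, `chipset`, `run_dma`, `second_op_stray_fetches`) are made up for illustration and do not correspond to the real chipset or to the developer's driver. Memory is modeled as an array of scatter-gather-sized slots, and a fake end marker stands in for whatever random data happened to sit above the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MEM_ENTRIES 16  /* simulated memory, measured in table-sized slots */
#define TABLE_BASE   2  /* where the driver placed its scatter-gather table */
#define TABLE_LEN    3  /* number of elements in the table */
#define GARBAGE_END  7  /* stray slot that happens to look like an end marker */

/* One scatter-gather element; this layout is an assumption for the sketch. */
struct sg_entry {
    unsigned addr;  /* buffer address */
    unsigned len;   /* byte count */
    bool last;      /* end-of-table marker */
};

/* Hypothetical chipset with one register that points at the current element. */
struct chipset {
    size_t sg_ptr;  /* index into simulated memory */
};

static struct sg_entry memory[MEM_ENTRIES];

/* Simulated hardware: walks the table and advances sg_ptr in place, reusing
 * the base register as its working pointer. Returns how many elements were
 * fetched from outside the driver's table. */
static int run_dma(struct chipset *hw)
{
    int stray = 0;
    while (hw->sg_ptr < MEM_ENTRIES) {
        size_t cur = hw->sg_ptr++;
        if (cur < TABLE_BASE || cur >= TABLE_BASE + TABLE_LEN)
            stray++;
        if (memory[cur].last)
            break;
    }
    return stray;
}

/* Runs two back-to-back operations and reports the stray fetches made by
 * the second one. The flag selects the buggy or the fixed driver behavior. */
int second_op_stray_fetches(bool rewrite_base_each_op)
{
    struct chipset hw;

    memset(memory, 0, sizeof memory);
    memory[TABLE_BASE + TABLE_LEN - 1].last = true;  /* real end of table */
    memory[GARBAGE_END].last = true;                 /* stand-in for random data */

    hw.sg_ptr = TABLE_BASE;       /* written once, at driver initialization */
    int first = run_dma(&hw);     /* the first operation always works */
    assert(first == 0);
    (void)first;

    if (rewrite_base_each_op)
        hw.sg_ptr = TABLE_BASE;   /* the fix: reprogram before every operation */
    return run_dma(&hw);
}
```

In this model, `second_op_stray_fetches(false)` reports stray fetches because the second operation starts where the first one left off, while `second_op_stray_fetches(true)` reports none. An emulated chipset that keeps a private working pointer and leaves the register alone would hide the difference entirely, which is consistent with what we saw in the virtual machine.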

After the bug was fixed, the developer stated that he found this behavior surprising. I explained that having worked on designing ASICs in the past, I wasn't at all surprised that the hardware reused the register to hold the address of each scatter-gather element as the operation progressed; all's fair in ASIC design when it means saving some gates!

He then wondered how he could find such bugs in the future without having similar low-level hardware experience. I told him that the important lesson from the exercise was that once the possible errors have been ruled out you need to suspend disbelief and begin considering the impossible errors. Often our understanding of complicated systems is incomplete or flawed, therefore things we assume to be impossible may in fact be possible and therefore must be considered. I explained that this is a skill that must be hard-won over time through solving progressively harder bugs. I encouraged him to seek out hard bugs, spend time trying to resolve them on his own, but if stuck to utilize the already hard-won experiences of his mentors to accelerate his own skill development. This is the advice that has worked for me during my career and I am confident that the developer in question will do just fine.

Anyway, this particular bug was fun because it required thinking about how the system affected the program rather than how the program affected the system. These kinds of puzzles are always fun… after you've figured out the problem, that is.