Thursday, December 31, 2009

Recovering Deleted JPEGs from a FAT File System - Part 6

Part 6 in a series of posts on recovering deleted JPEG files from a FAT file system.

In this post we'll tackle the last major FAT file system data structure - the file allocation table. Compared to the previous post on the root directory, this post is much less complex. Follow the "read more" link for an overview of the file allocation table and the code required to process it.

Tuesday, December 29, 2009

Book Review: The Planiverse

The Planiverse: Computer Contact with a Two Dimensional World by A.K. Dewdney

Written in 1984, this is a fictional story of a computer science professor and his students making contact with a two-dimensional universe - The Planiverse - through a simulation program developed as a class project. More specifically, a telepathic connection is established with one of the universe's inhabitants, YNDRD, otherwise known as "Yendred". The professor and students accompany Yendred on a journey and in the process discover the workings of the two-dimensional world.

Essentially, the book can be decomposed into two concurrent streams:

  • The story of Yendred's journey across his world in search of a mystic - Drabk - said to possess "knowledge of the beyond".
  • A series of expositions on the scientific, technological, and social aspects of Yendred's two-dimensional universe.

Basically, Yendred's journey provides the context and segues for the expositions. This structure reflects the book's derivation from two prior publications - the monograph Two-Dimensional Science and Technology and the book Symposium on Two-Dimensional Science and Technology. Material from these prior works formed the basis for the expositions, around which the fictional Yendred plot was wrapped to create a coherent, entertaining, and intellectually stimulating story.

Although I thought the Yendred story was only "OK", I really enjoyed the expositions - clearly a lot of thought was given to how a two-dimensional world would work. For example, some of the scientific exposition topics were:

  • Thermodynamics and the extremely low melting points that two dimensions lead to.
  • The mechanics of two-dimensional turbulence.
  • Electromagnetism and the lack of magnetic forces.
  • The "Global" weather patterns of a two-dimensional "planet"
  • Energy diminishing according to the inverse of distance (instead of the inverse-square as in three dimensions).
  • The biological structure of two-dimensional organisms.
  • Two-dimensional chemistry and the limited number of possible elements.

The technical expositions discussed the design of two-dimensional objects such as:

  • Buildings that avoided obstructing surface travel.
  • Doors that allowed privacy and structural integrity while remaining passable.
  • Functional steam engines, foundries, space rockets, and fishing boats.
  • Balloons for travel and the shipment of freight.
  • Gears capable of varying torque in the absence of axles.
  • Subterranean elevators.
  • Electronic circuits suitable for constructing rudimentary computers.
  • A wind powered generator for recharging batteries - the only viable source of electricity as wires create impassable obstacles.

The social expositions discussed topics such as:

  • How the world's inhabitants "passed" each other while traveling (they walk over each other according to social conventions).
  • How travelers deal with surface events like a sports game (they wait until it's done).
  • Two dimensional warfare and its influence on politics.

In summary, if you enjoy thinking differently about science and technology, then I think you'll enjoy reading this book. The ability to think about problems from different perspectives is an important and powerful skill - I think reading books like The Planiverse is one way to help develop that skill.

Thursday, December 24, 2009

Recovering Deleted JPEGs from a FAT File System - Part 5

Part 5 in a series of posts on recovering deleted JPEG files from a FAT file system.

UPDATE (2010/1/28): It's come to my attention that the long file name support code described in this post is broken - it only works for files that utilize a single LFN entry. For the time being I plan to work around this bug for the FATRecover series, but I do intend to post a fix eventually. I'll be sure to add another update with a link to the fix when it is available.

The topic of this installment is the root directory and the code required to process it. Although strictly speaking the file allocation table is the next major on-disk area, I think covering the root directory first will make the overall discussion clearer.

Note that this is a watershed post as the "hack" level of the example code increases dramatically. This was unfortunate but necessary to accommodate certain FAT functionality while limiting the discussion's length. That said, readers with sufficient skill should be able to use the example code to create more robust implementations.

Follow the "read more" link for a detailed description of how to read and interpret the root directory.

Monday, December 21, 2009

Oh Christmas Tree, Oh Christmas Tree

The other day my son, age four, noticed that a string of lights was out on our Christmas tree. Last year, we switched to the new LED lights so I wasn't sure how long they should last or how to fix them.

I assumed that, like incandescent tree lights, a single failing LED would prevent the others from lighting. Unfortunately, we discarded the original box and therefore did not have any replacement bulbs (assuming that they came with some).

I picked up a new set of lights, but apparently not all "white" LEDs are exactly the same - the new set had a bluer tint that didn't match the remaining sets on the tree. Oh bother, suddenly this became a "project". With the beauty of our tree at stake, failure was not an option.

A Google search led me to this extremely informative web page. It turns out these LED lights use a much simpler circuit than I first imagined. No rectifiers, no smoothing capacitors - just 30 LEDs in series with a current limiting resistor. That explains the blinking that I sometimes find irritating.

My plan was to replace the failing LED with one from the new set. First, however, I had to find it. I used a 9V battery and a resistor to build a test rig, removed each LED, and tested it. Eventually, I found the broken one and, of course, it was the first one in the string - the one with the current limiting resistor soldered to it. So much for a quick replacement.

After a quick trip to the workshop to solder the resistor onto the new LED, we once again had a working set of really-white lights - except for one bluish one. Oh well, good enough is good enough.

An unexpected bonus from having one bluish LED is that my son now considers it a game to "find" the different one - it doesn't move but that doesn't appear to matter to a four year old :-).

Saturday, December 19, 2009

Recovering Deleted JPEGs from a FAT File System - Part 4

Part 4 in a series of posts on recovering deleted JPEG files from a FAT file system.

Now that we've created a test data set and covered the high-level structure of a FAT file system, it's time to get our hands dirty with some code. Enough talk, more action!!!

Follow the "read more" link for a detailed description of how to read and interpret a FAT file system boot sector.

Tuesday, December 15, 2009

An Incredible Library

I have never been more jealous in my life than I am of Jay Walker's personal library.

If I had copious amounts of money, this is what I would do with it - books, interesting gadgets, and lots of wood. I don't know if I would ever leave the house :-).

Even just working in such an aesthetic environment would be wonderful.

Recovering Deleted JPEGs from a FAT File System - Part 3

Part 3 in a series of posts on recovering deleted JPEG files from a FAT file system.

The FAT file system was invented in 1980 by Tim Paterson while developing 86-DOS, the precursor of MS-DOS. Since its creation, FAT has gone through a number of evolutions to accommodate growing disk sizes and provide more capabilities. Although FAT is nearly 30 years old, it remains in common use due to its ubiquitous OS support and simplicity - both important factors for embedded consumer devices like digital cameras.

FAT's history is complex; a full retelling could consume an entire series of posts by itself. For this project, it is sufficient to know that there are three main variants - FAT12, FAT16, and FAT32 - each successively supporting larger maximum disk sizes. Generally speaking, the on-disk formats of all three variants are very similar, so much so that code developed to grok one format can, with reasonable effort, be hacked to grok another[1]. In this project, we'll be working with a FAT16 file system, which hdiutil can be explicitly requested to create by using the argument -fs "MS-DOS FAT16"[2].

The following diagram depicts the high-level, on-disk structure of a FAT16 file system (not drawn to scale).

+---------+-------+--------+-------+---------------------+
|   BOOT  | RESV  |   FAT  | ROOT  |  DATA ............. |
|  SECTOR | (OPT) |        | DIR   |  AREA ............. |
+---------+-------+--------+-------+---------------------+
0                                                        N

In computer storage, capacity is divided into fixed-size units called sectors - typically 512 bytes long[3]. The sector is the fundamental unit for addressing storage and the smallest amount of data that can be transferred to/from a disk drive. While older disk drives used an awkward addressing scheme based on physical geometry, newer drives use a linear series of sector addresses.

The first sector (offset 0) of a FAT16 file system contains the BOOT SECTOR which provides critical information about the file system's organization. Optionally, a number of sectors may be reserved after the boot sector (RESV).

The next major data structure is the File Allocation Table (FAT) from which the file system gets its name. The FAT is essentially an array of 16-bit elements[4], each representing a fixed-size portion of the DATA AREA. To keep the FAT's relative size small, each element represents a cluster[5], a power-of-two number of sectors. In addition to tracking the allocation status of each DATA AREA cluster, FAT elements are also used to store pointers forming the linked lists that describe the on-disk locations of files and sub-directories spanning multiple clusters.
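
To make the linked-list idea concrete, here is a minimal sketch of walking a FAT16 cluster chain. It assumes the FAT has already been loaded into memory as an array of 16-bit entries and uses the conventional FAT16 marker values (entries of 0xFFF8 and above mean end-of-chain, 0xFFF7 means bad cluster); the function and variable names are mine, not from the series code.

#include <stdint.h>
#include <stdio.h>

#define FAT16_BAD     0xFFF7  /* bad-cluster marker */
#define FAT16_EOC_MIN 0xFFF8  /* entries >= this value mark end-of-chain */

/* Print every cluster in the chain that starts at 'first'. 'fat' is the
   file allocation table as an array of 16-bit entries indexed by cluster
   number; 'fatEntries' is the number of entries it holds. */
static void walk_chain(const uint16_t *fat, uint32_t fatEntries, uint16_t first)
{
    uint16_t cluster = first;

    /* Data clusters start at 2; entries 0 and 1 are reserved. */
    while (cluster >= 2 && cluster < fatEntries &&
           cluster != FAT16_BAD && cluster < FAT16_EOC_MIN) {
        printf("cluster %u\n", (unsigned)cluster);
        cluster = fat[cluster];   /* each entry points to the next cluster */
    }
}

int main(void)
{
    /* Toy FAT describing a single file: cluster 2 -> 3 -> 5 -> end-of-chain. */
    uint16_t fat[8] = { 0xFFF8, 0xFFFF, 3, 5, 0x0000, 0xFFF8, 0x0000, 0x0000 };

    walk_chain(fat, 8, 2);
    return 0;
}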

After the FAT comes the root directory area (ROOT DIR) used to hold the information describing the files and sub-directories at the top of the file system namespace tree. In FAT12 and FAT16, the root directory area is a fixed size, which limits the number of files and sub-directories that can be stored in it.

The remainder of the disk capacity (DATA AREA) is essentially a heap allocated as needed to store file and sub-directory data. As mentioned previously, use of the DATA AREA is managed by the FAT.

In the next few posts, I'll describe each of these major areas in greater detail and present the code needed to interpret them. By the end, we should essentially have a read-only FAT file system implementation that will serve as the basis for the remainder of the project.

Footnotes:

[1] Here I am ignoring many, many subtle differences and complications. A thorough discussion of the technical details for all of FAT's variants is outside the scope of this series. The curious reader is encouraged to explore the bountiful information available via a Google search. Brian Carrier's book, File System Forensic Analysis, is another valuable resource (which sadly wasn't available when I first did this project in 2004).

[2] Note that FAT16 file systems have a minimum supported size - usually 16MB but some tools may allow smaller sizes.

[3] Some enterprise level disk drives use a 520 byte sector - the additional 8 bytes are often used to store error checking and recovery information for RAID systems.

[4] The size of the FAT elements is one of the principal differences between the file system variants. FAT12 uses 12-bit entries while FAT32 uses 32-bit entries. The FAT element size is one of the factors that determines each variant's maximum supported disk size.

[5] Other file systems may use the term block instead of cluster.

Monday, December 14, 2009

Recovering Deleted JPEGs from a FAT File System - Part 2

Part 2 in a series of posts on recovering deleted JPEG files from a FAT file system.

Whenever I develop an analysis program, I first create a dataset to test and experiment against. In part 1, I mentioned that I used my digital camera to create a test data set for the original project in 2004. For this series of posts, I thought it would be better to use a data set reproducible by others. Two things are required to accomplish this goal:

  • A publicly available collection of images to populate the file system with.
  • A way to create disk images without requiring a physical device.

A Google search for test images led me to this Wikipedia page, from which I selected the University of Southern California Signal & Image Processing Institute's data set - specifically the miscellaneous corpus. This collection consists of 44 TIFF images of various resolutions ranging in size from 64KB to 1MB (17MB in total). Since JPEG images are needed for this project, I used the following ImageMagick command to do the conversion.

mogrify -format jpg -quality 90 *.tiff

After the conversion, the file sizes ranged from 4.8KB to 350KB (3.4MB in total) - a conveniently sized collection for this project.

In the 2004 project, I used the UNIX dd command to make a disk image of my camera's memory card. However, for this project, I wanted a way to create disk images without requiring a physical device. One option considered was to use Linux loop devices, but I wanted to use my machine-of-choice, an Apple MacBook Pro, for this project - an artificial but important constraint since this is for fun.

After some research, I discovered that Apple's hdiutil utility can create a suitable image using the following command:

hdiutil create \
        -fs "MS-DOS" \
        -megabytes <SIZE> \
        -layout NONE \
        -volname <VOLNAME> <IMAGENAME>

Executing this command creates the file <IMAGENAME>.dmg that is an image of a FAT file system called <VOLNAME> with a total capacity of <SIZE> megabytes. The remaining argument, -layout NONE, prevents a partition table from being added to the image - an unnecessary complication for this project.

The following example illustrates the OS X terminal commands needed to create a 16MB disk image with a single JPEG file on it[1]:

$ hdiutil create \
          -fs "MS-DOS" \
          -megabytes 16 \
          -layout NONE \
          -volname FRTEST1 frtest1
..................................................
created: /<PATH>/frtest1.dmg

$ hdiutil attach frtest1.dmg 
/dev/disk1             /Volumes/FRTEST1

$ cp images/4.2.04.jpg /Volumes/FRTEST1/

$ hdiutil detach /Volumes/FRTEST1/

Using the selected image corpus and the above hdiutil terminal commands, a wide variety of test disk images can be created programmatically via shell scripts - just what is needed for this project.

In the next post, we'll start down the path of grokking a FAT file system layout.

Footnotes:

[1] Minor changes were made to the example output to remove machine-specific information (i.e. paths) and fit it within the limited size of the post area.

Friday, December 11, 2009

Recovering Deleted JPEGs from a FAT File System - Part 1

Part 1 in a series of posts on recovering deleted JPEG files from a FAT file system.

In 2004, my in-laws returned from a vacation and had all of the pictures accidentally deleted from their digital camera. When I heard the news my first thought was that the data may still be there!

Often, deleting a file only erases its associated file system metadata (i.e. name, owner, etc) - the file's data is left unchanged. As a result, a deleted file can sometimes be recovered by finding the unchanged data fragments and recombining them in the correct order. I reasoned that if their camera took this metadata-only approach then there was a good chance that some of the pictures could be recovered.
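
In a FAT file system, for instance, deleting a file typically just overwrites the first byte of the file's 32-byte directory entry with a marker value (0xE5); the rest of the entry, including the file's starting cluster and size, remains in place. A trivial, hypothetical check for such entries might look like this:

#include <stdint.h>

#define DIR_ENTRY_DELETED 0xE5  /* first byte of a deleted entry's name */

/* Return non-zero if a raw 32-byte directory entry describes a deleted
   file whose remaining metadata (and, with luck, data) may be recoverable. */
static int is_deleted_entry(const uint8_t entry[32])
{
    return entry[0] == DIR_ENTRY_DELETED;
}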

Unfortunately, my in-laws lived in another state so I couldn't examine their camera's memory card right away. I suggested that they remove the card from the camera and bring it with them on their next visit so that I could analyze it then.

My initial plan was to try one of the many file recovery tools that already existed. However, I soon began to wonder how hard it would be to write a program to recover the pictures myself. Finding the challenge exciting, I immediately began spending all of my spare time coding up a recovery program.

After one week, and a lot of caffeine, I had a C program capable of recovering deleted JPEG files from a FAT file system - the typical file system used by digital cameras. Using my own camera as a test bed, the program could reliably recover pictures after performing an "erase all" operation. Cool!

Unfortunately, the story didn't end as well for my in-laws. When I finally received their memory card I found that all of the file system data blocks had the value 0xFF - a clear sign that the entire FLASH memory had been erased[1]. The data was gone.

Despite the unfortunate ending, this project was one of my favorite spare-time hacks. So much so in fact that I thought it would be fun to recreate it as a series of blog posts. Over the next few weeks I plan to write a series of posts under the label "FATrecover" describing how to develop such a program. By the end of the series, hopefully anyone with sufficient coding experience will be able to write their own FAT file system recovery program.

Footnotes:

[1] FLASH memory, being a form of EEPROM, cannot be re-written directly. Instead, the FLASH memory cells must first be returned to an unwritten state - an operation typically called an "erase". After being erased, the affected memory cells have a value of 0xFF.

Tuesday, December 1, 2009

Behind the Code: Jim Gray

One of my engineering role models is the late Jim Gray. I happened across a Microsoft Channel9 interview with him and decided to watch it.

I most appreciated Jim's comments on the following three points:

  • The importance of first principles
  • The "risk" that research presents to established products
  • The role writing played in his getting recognition for what was actually joint work with others

One of the most influential papers that I've read was Jim's (and Shenoy's) Rules of Thumb in Data Engineering. This relatively short paper is packed with insights derived from simple trend data and the application of first principles. Since that time I've tried to rely on first-principles while researching new technologies and communicating the results. I've found this approach to be both highly effective and differentiating as, unfortunately, the practice doesn't seem to be as common as it should be.

Although my current position doesn't compare to the one Jim held at Microsoft, I do run a small research team inside of a large company. As a result, I found his comments on this topic interesting. In particular, I was pleased to hear that I have come to a point-of-view in common with Jim's - research innovations represent increased risk to an established line of business that is trying to minimize risk. It's a classic example of the "change vs. control" conflict that Kotter describes in his seminal paper, What Leaders Really Do. In realizing this, I have come to the conclusion that it is my group's responsibility to reduce the risk of research innovations to a level that allows the main-line managers to predict with confidence the cost/time/effort required to bring the new product/feature to market. As a result, creating new technologies is not sufficient - my team must also make those technologies consumable to an existing line of business. This has been perhaps one of the most important realizations of my career and it was encouraging to hear Jim make similar comments.

Towards the end of the interview, Jim made it a point to say that much of "his" work was actually done in collaboration with others. However, much of the credit has been given to Jim because he took the additional effort to document the work in the form of papers, articles, books, etc. This statement matches my own observations of authors/presenters earning outsized credit and highlights the importance of communicating effectively with regard to career growth.

Monday, November 30, 2009

400 Years of the Telescope

If you're interested in astronomy then you might enjoy this PBS documentary, 400 Years of the Telescope.

This beautifully filmed show summarizes the history of the telescope beginning with Galileo's first experiments with looking glasses up to the Hubble and radio astronomy. Along the way it (briefly) describes various recent technologies like adaptive optics and techniques for building very large mirrors.

While not a deeply technical resource, it is a nice way to spend an hour. In my case, I watched this on my iPhone while waiting for my son at a doctor's appointment and it was much better than reading out-of-date magazines - well worth the $1.99 cost.

Watching this made me want to work in an observatory someday, maybe I've seen the movie Contact too many times (if that is even possible) :-).

Sunday, November 29, 2009

N is a Number

Earlier this year I read the book "Linked: The New Science of Networks" which prominently featured the mathematician Paul Erdős. The book mentioned a documentary about Erdős entitled "N is a Number" that I was curious to see. Thanks to Netflix, I finally got the chance last weekend.

Most importantly, this movie is about Erdős "the person" and not his mathematics. While the film discusses his mathematical activities and accomplishments, it makes no material attempt to explain them.

With that focus in mind, I thought this was a nice documentary. The footage of Erdős was very entertaining and I suspect conveyed an accurate sense of his personality. I found the interviews of his various acquaintances equally entertaining - especially the elderly English women who recalled Erdős as a young man. Such focus on Erdős' human qualities really helped cast him as an "exceptional person" instead of an "un-relatable genius".

As is often the case with people of substance, Erdős suffered hardship and had a number of significant life experiences - especially during WWII. Perhaps this explains his generosity and thoughtfulness toward others that were recounted a number of times in the documentary.

I was most surprised to learn of Erdős' nomadic lifestyle - for most of his life he had no permanent home or job. Instead, he followed his interests around the world attending conferences and staying with (very) supportive friends. If an acquaintance on a different continent came up with an interesting problem, Erdős would hop on a plane with all of his worldly possessions in two small suitcases and often only a small amount of money in his pockets. I found his nomadic and minimalistic lifestyle both impressive and foreign as I can't imagine being without a "home" to return to. I wonder if this is an insight into the personal sacrifice needed to achieve greatness beyond having raw-talent alone.

Clearly Erdős was a genius, which I already understood from reading other references before watching this film. The depth and breadth of his work was unique and will likely be unmatched for a long time to come. But, putting his mathematical achievements aside, Erdős still seemed like an impressive, inspirational, and great man - definitely a role model worth reflecting on.

Tuesday, November 24, 2009

Bug Hunt Wrap-up

To wrap up my "Bug Hunts" series, I thought it worthwhile to quickly reflect on what made these particular bugs so special.

In the series I posted about five bugs:

Since these five represent only a small subset of all the bugs that I have diagnosed during my career, I have to ask myself - what makes these bugs so special?

To me, the common thread among them is that their resolution involved a leap of thought based on skill, imagination, concentration, and an intuitive understanding of the associated technologies. In each case the answer was not self-evident from the information available and became clear only after deep thought produced an incredibly rewarding "ah-ha" moment. The short-term elation from these "ah-ha" moments is one reason why these bugs hold a special place in my memories.

However, I think the deeper answer to my question is that these bugs were milestone events that, to varying degrees, demonstrated and validated the skills that I had worked hard to acquire. The longer-term feeling of "mastery" and confidence produced by these moments was the reward required to make my past efforts satisfying and to motivate further development of my skills.

In many ways, this conclusion correlates well with Malcolm Gladwell's comments on satisfying work in his book Outliers. From page 49:

Those three things - autonomy, complexity, and a connection between effort and reward - are, most people agree, the three qualities that work has to have if it is to be satisfying.

The lesson that I draw from this reflection is that it is important to periodically seek out validating experiences to demonstrate acquired skill, obtain satisfaction, and refuel a continuous self-development effort.

Saturday, November 21, 2009

The Case of the Corrupted Cache Line

To culminate (for now) my "Bug Hunts" series of posts, I will attempt to describe the hardest problem that I have solved to date. This was a very complicated problem which I found challenging to describe without going into excessive detail, but I think the discussion below does an adequate job.

In prior posts, I described the ccNUMA system that I worked on during my first industry job. After the (relative) success of the first system, a second generation system was developed that consisted of numerous incremental improvements - faster processors, better memory controllers, faster interconnect ASICs, a few coherency protocol optimizations, etc. In general, the pre-production testing went smoothly and the system started shipping with the fullest of confidence.

Shortly after going GA, however, a coherency protocol problem appeared in the field. The problem only seemed to occur on the largest (eight node) second generation systems and had a long mean time between failure even under very heavy workloads. After ruling out the obvious suspects (those coherency protocol optimizations I mentioned), we instrumented an eight node system in the lab and started the bug hunt.

To describe the failure, I need to further explain the SCI coherency protocol used in these systems. Like other distributed shared memory protocols, the SCI protocol segmented the system's memory into fixed-size chunks called "lines" (64 bytes in SCI's case). Being a ccNUMA machine, the lines were distributed across all of the nodes in the system. For each line of memory in the system there was a single "home-node" that contained the associated physical memory and served as the coordination point for all accesses to it. Mapping tables initialized at boot time identified the home-node for each line of memory, a mapping which remained constant while the machine was running. "Remote-nodes" could access the lines from home-nodes and cache them via the SCI protocol and interconnect.

One big challenge in using the SCI protocol was that it assumed an architecture where the memory and processors were located on separate nodes. In this model, home-nodes were passive and simply responded to protocol requests from remote nodes - the protocol didn't provide a means for home-nodes to obtain access to their own lines cached elsewhere in the system. Unfortunately, our system used a different architecture with each node containing both processors and memory which put it at odds with the standard SCI model. As a result, a workaround was needed in our system to allow home-nodes to obtain access to their own memory lines.

The chosen workaround was to essentially allow home-nodes to temporarily adopt multiple personalities. In order to re-gain access to a local memory line cached elsewhere in the system, a home-node would:

  1. Temporarily create a fictitious remote node persona
  2. Participate in the coherency protocol as a remote node to obtain the desired access rights
  3. Act as a remote-node performing an eviction to remove the fictitious-remote-persona from the coherency sharing-chain (SCI used a distributed linked list to identify the remote-nodes with cached copies of each line).

While this dual-personality solution worked, the need to simultaneously track both the home and fake-remote-persona protocol states significantly increased the complexity of our implementation (foreshadow alert!).

Another important aspect of the system was the way that the per-line access-rights information was maintained. The Coherency ASIC was responsible for maintaining the SCI state (which indicated the access rights) for each line of local memory and remote memory cached in the L3 cache. Because of the amount of main-memory (4GB per node) and L3 cache (512MB per node), the SCI state information was stored in an external DRAM attached to the Coherency ASIC. Storing the SCI information in this way presented a problem in that it would have taken too long to query it for each transaction on the processor bus (which was necessary to maintain coherency). To get around this problem, an SRAM attached to the FSB ASIC was used to cache the access-rights information for recently accessed lines (both local and remote). Using the SRAM, the FSB ASIC could quickly determine if a processor bus operation could be allowed to continue (node had sufficient access) or needed to be postponed (node didn't have sufficient access). Although the SRAM allowed the majority of processor bus operations to complete un-delayed, it burdened the Coherency ASIC with the additional task of keeping the access-rights cached in SRAM consistent with the SCI tags (another foreshadow alert!).

With those additional explanations out of the way…

Our initial investigation into the failure revealed that the primary symptom was an inconsistency between the SCI and SRAM states for a local memory line. Specifically:

  • The SCI state indicated that the line was not cached anywhere else in the system - the local node had full access rights to the line.
  • The SRAM state indicated that the line was cached in a modified state elsewhere in the system - the local node had no access rights to the line.

Oh bother. This inconsistency was obviously incorrect and the ASICs were designed to generate a panic when it occurred to prevent data corruption (which access-rights information was correct??).

At first, we couldn't think of how such an inconsistency could occur. After much effort we managed to get a synthetic workload running in the lab that could induce the failure approximately once per day. However, because of the system's size (eight nodes), large number of busses-of-interest (3 per node, 24 total), and a limitation of only having two logic analyzers it was extremely difficult to get a clear picture of the failure. At best, all we could get (if we were lucky) was a trace of a couple of bus operations being performed against the failing line. It was clear that we stood little chance of capturing a definitive, complete trace of the failure - it was time to put on our thinking caps.

As luck would have it, this failure emerged at the end of a financial quarter and there was a substantial order being prepared for shipment. Management was unwilling to ship the order until the problem was resolved, which meant there was a large amount of revenue at risk. All of a sudden this problem got upper-management's full attention - not good.

Given the problem's complexity and the substantial amount of executive attention, my mentor (the MicroKid) and I isolated ourselves in his office and began brainstorming how the state inconsistency could occur. After a while, we identified a couple of areas in the microcode that could lead to such a problem but neither of us could think of a sequence of events to manifest it. After a lot of joint discussion we decided to brainstorm separately to parallelize the effort.

For three days I sat in my cube listening to music, staring at trace fragments, and assembling in my mind all the possible interactions that could cause the state inconsistency we were observing. I don't think I have ever concentrated on a single thing as hard for that long at any other time in my life. Finally, everything "clicked" and I knew how the state inconsistency was occurring.

The failure involved a complex interaction between three nodes with one being the associated memory-line's home node. The following is a (simplified!) description of the failure:

  1. The initial condition was that the line was cached in a modified state in a remote node (Node1) and the home-node was attempting to get read access. To accomplish this task, the home-node created its fake-remote-persona and performed the necessary operations to obtain a read-copy of the line. Upon getting a copy of the line, the SRAM access rights were set to "read-only" and the local processor's request was completed.
  2. In order to remove the fake-remote-persona from the SCI sharing chain, the home-node began an eviction operation to get rid of the now unneeded second personality by sending a request to Node1.
  3. Just as the home-node began performing the eviction operation for its fake-remote-persona, Node1 also began an operation to evict the line from its L3 cache. According to the SCI protocol, such eviction races were resolved by allowing the "older" remote node (Node1) to proceed ahead of the "younger" remote node (which in this case was the home-node's fake-persona).
  4. While the eviction race condition between the home-node and Node1 was being resolved, a third node (Node2) started an operation to obtain read access to the same line which completed successfully.
  5. Upon completing the eviction operation with Node1, the home-node detected that Node2 had obtained read-access to the line (Step 4) which meant that it now had to perform an eviction operation with Node2 in order to finish removing the fake-remote-persona from the SCI sharing chain. This required the home-node to downgrade the SRAM access rights to "no-access" to guard against potential SCI operations that could invalidate the local copy. The home-node then resumed its eviction effort by informing Node2 that it was leaving the sharing chain.
  6. Node2 received the home-node's fake-remote-persona's eviction request and sent a response.
  7. Immediately after it sent the response in step 6, Node2 began an operation to evict the line from its L3 cache.
  8. Due to an extremely unlikely sequence of events, Node2's eviction request from step 7 got received and processed by the home-node while Node2's response from step 6 was still in-flight. In order for this to have happened, a request had to bypass a response in the interconnect, which was highly improbable due to the priority given to sending responses over requests.

At the end of this sequence, the SCI state was correct in indicating that the line was not cached anywhere else in the system. The problem was that the microcode didn't explicitly check for the request-bypassing-response order of events that occurred in step 8. As a result, the microcode failed to update the SRAM information with the correct access rights thus leading to the problem. Over a decade later, I still get a brain cramp from thinking about this sequence of events.

Once the microcode bug was identified it took less than five minutes to fix. Luckily, I figured out and fixed the problem on the Friday afternoon before the end of the quarter so the big order sitting on the docks was able to ship, the company got to recognize the revenue in the quarter, and the account's sales team got a hefty commission. For my efforts, my management compensated me with, wait for it, a cup of coffee. Seriously, after "saving" millions in revenue I got a $2.00 cup of coffee from my Director. Oh well, luckily I wasn't in the role for the money :-).

And that, to date, is the hardest problem that I have ever solved.

The following diagram summarizes the failure sequence described above. Note that to simplify the diagram I avoided using the actual SCI commands and states involved as their obscure nature would have required too much explanation. Recall that each lab trace contained, at-best, only one or two of the included operations. To solve the problem, I more or less had to imagine all of the possible interactions that could be occurring and then identify this sequence as a failing case.

Thursday, November 12, 2009

The Case of the Deadlocked Queues

Updated on 12.4.2009 with a diagram.

In a previous post I described my first industry job where I got to work on the coherency subsystem for an enterprise class ccNUMA server. For my "Bug Hunts" series of posts I thought I would describe a complex problem that I root-caused during pre-production testing.

During the alpha silicon testing, the systems started hanging under heavy load. Our initial investigations discovered that operations on the processor (FSB) and inter-ASIC (F2CBus) busses appeared to get stuck waiting for various pregrant signals to activate. The situation looked bad; it appeared that we had a multi-ASIC deadlock on our hands.

In simple terms, the coherency subsystem on each node had two responsibilities:

  • accept requests from the local processors, send requests to other nodes to perform the necessary coherency & memory operations on the remote processor busses, and return responses to the local processors.
  • accept requests from other nodes on behalf of remote processors, perform the necessary coherency & memory operations on the local processor bus, and return responses to the other nodes for the remote processors.

To guarantee forward progress, the coherency subsystem had one golden rule - requests shall never block responses. At the transport layer this rule was enforced through the use of two independent virtual channels with priority given to the response channel. Within the ASICs, the golden rule was maintained by carefully designed scheduling and priority logic. Theoretically the golden rule should have prevented deadlocks from occurring - so much for theory.

Since this was a real system, our view was limited to the activity observed on the busses - we had no visibility into what was happening inside the ASICs. To get around the visibility problem, I worked closely with a friend on the verification team to create a script that reproduced, in simulation, the activity observed in the lab. Through these simulation tests I gained an insight into what was happening inside the ASICs. In the meantime, I consulted with the ASIC designers to better understand some of the unexpected behaviors discovered during the investigation.

I can still remember well the meeting when everything suddenly "clicked" in my head. We had brought together both ASIC teams (which were geographically distributed) to discuss the problem and attempt an explanation. To focus the conversation, I asked a series of very specific "what if" questions that finally led to an "ah ha" moment.

As with most complex failures, a number of smaller problems combined to cause the deadlock. The diagram included in my prior post may provide more context for the explanation that follows.

The first problem was that the FSB ASIC violated the golden rule by strictly ordering, in a single queue, requests from remote processors and responses to local processors. During the investigation it was discovered that this design change was made to resolve a race condition between evicting modified data from the L3 cache (accomplished via a remote processor request) and installing the replacing data into the cache (accomplished via a local processor response). Oops.

The second problem was much more subtle: there was no ordering rule between local and remote responses! The microsequencer inside the Coherency ASIC was responsible for processing all coherency requests, both from this and other nodes, and generating the associated responses. The microsequencer treated all out-bound responses equally and used a common queue to buffer them for transmission. This single out-bound response queue essentially created a cycle in the path of queues between the Coherency and FSB ASICs - a response blocked at the head of this queue would prevent all other microsequencer responses from making forward progress.

The third problem was that the microsequencer could issue more requests than it had buffers to queue out-bound responses. This meant that the microsequencer relied completely on out-bound responses making forward progress for it to be able to receive responses for the requests it issued.

Together, these three problems resulted in a deadlock when:

  1. The coherency microsequencer sent a burst of local processor responses to the FSB ASIC. These actions freed coherency engine resources but consumed FSB ASIC resources preventing it from accepting more responses.
  2. The coherency engine's response queue got full with an FSB ASIC bound local processor response at the head of the queue. At this point, the resource consumption described in (1) prevented the response at the head of the queue from proceeding.
  3. Shortly thereafter, the FSB ASIC completed a number of Coherency ASIC requests and sent responses to the microsequencer. Because the microsequencer's out-bound response queue was blocked by (2), it couldn't send the remote responses generated by the FSB ASIC responses. This condition prevented the microsequencer from accepting more responses and caused the FSB ASIC's single request/response queue to stall.
  4. The local processor responses already in the FSB ASIC queue from (1) were unable to make progress due to the stall described in (3). As a result, the FSB ASIC could not accept more responses to alleviate the stall described in (2).

The result: neither ASIC could make progress until the other did - a classic deadlock.

In reality, the actual deadlock scenario was much more complicated than this - it required a complex sequence of events, specific event timings, and an interplay with non-coherent inter-node accesses (i.e. accesses to another node's PCI space). However, the simplified explanation above captures the salient points - in particular the need for an additional ordering rule between local and remote responses.

Once the problem was understood, the fix was relatively easy. Luckily we had programmatic control over the depth of various ASIC queues that allowed us to configure the system to avoid the deadlock condition without having to re-spin silicon - and then there was much rejoicing.

Although the deadlock had been solved the story didn't end there for me. I was so fascinated by this problem that I spent a significant amount of time over the next few months writing a program capable of analyzing a system of queues to discover such deadlocks. 10K lines of perl code later, I had a program that took as input a queuing model expressed in a domain specific language that I designed for the task. The model language was rich enough to represent such things as different kinds of tokens, tokens spawning new tokens, global token limits, conditional routing rules for tokens, and the sharing of limited resources. The program used this information to find patterns of tokens in queues that resulted in deadlock conditions. Although I didn't know it at the time, I essentially used a constraint propagation method to reduce the search space to a reasonable size. Using the program, I was able to "predict" the deadlock condition found in the lab and validate the configuration changes made to avoid it. Although the tool never got used again, the sense of personal accomplishment from developing it made the effort worthwhile.

My history with this deadlock finally came to an end when, for the third generation system, I redesigned the coherency engine to fix this and other troublesome bugs. The opportunity to fix this problem myself really brought a fulfilling sense of closure to the experience.

This deadlock bug is one of my favorite bug hunts of all time. It was a complex problem that required a deep understanding of the system, a holistic view of its behavior, and the ability to see how the many micro-behaviors interacted to cause the deadlock. Furthermore, it led to the simulation and ASIC re-design activities which were equally rewarding experiences.

This, however, wasn't the hardest bug that I ever root caused. That will be the topic of the next Bug Hunt post once I figure out how to describe it.

UPDATE

To clarify this post I decided to add a simple diagram of the deadlock condition. The actual failure was much more complex but hopefully this diagram captures the salient points from the description above - the impact of the shared queue for requests+responses and the need to route local and remote responses differently.

Friday, November 6, 2009

A Lucky Career Start

Updated on 12.4.2009 with a better diagram of the system's architecture

I have two more bugs in mind for my "bug hunt" series but before I can post about them I need to provide a little background information about my first real industry job.

After graduate school, I had the rare opportunity to help build a large-scale, enterprise-class ccNUMA server. At the time, 1996, ccNUMA architectures were leading-edge technology and ours was one of the first to utilize the Intel platform.

The system's architecture was based on "building block" nodes each containing four PentiumPro processors, 4GB of main memory, and 12 PCI slots. Up to eight nodes could be connected together using a high-speed interconnect to create a system with 32 processors, 32GB of main memory, and 96 PCI slots. The system ran a proprietary version of UNIX carefully tuned to work efficiently on its specific architecture. At its maximum size the system consumed two 19 inch racks - big iron indeed.

The high-speed interconnect was based on the IEEE Scalable Coherent Interconnect (SCI) standard and responsible for creating a single, cache coherent memory space out of the combined node resources. Through the interconnect, any processor could access the memory on any of the nodes - albeit at variable latency based on their relative locations in the system. In addition to the processor caches, each node contained a large capacity "L3" cache to store data fetched from other nodes to avoid, as much as possible, the significantly greater "remote" access latency. The SCI interconnect inter-operated with the processor and "L3" caches to ensure that accesses to cached data were handled correctly. Together, these capabilities are what made the system a cache-coherent non-uniform memory (ccNUMA) architecture. SCI was a ring based protocol and our system used two counter rotating rings to halve the average node-to-node latency, double the bandwidth, and provide some measure of redundancy (not fault-tolerance, just reboot level redundancy if a ring failed).

The ccNUMA subsystem consisted of four custom ASICs, two designed in-house and the others jointly developed with a partner company. One of those ASICs was dedicated to maintaining the SCI coherency protocol and used a microsequencer and microcode combination to provide a programmable, high-performance coherency engine. The microsequencer utilized a VLIW-like instruction set to maximize parallel operation and minimize the control store's size.

I joined the project just after its start and my initial responsibilities were to become an SCI expert, be able to program the coherency microsequencer, and optimize the SCI coherency protocol implementation to match our machine's specific architecture. After completing those tasks ahead of schedule, I went on to do a wide variety of things including writing verification scripts for the coherency ASIC, implementing the BIOS responsible for initializing the SCI subsystem, serving as a systems engineer for the coherency subsystem, root-causing complicated bugs in all four ASICs, and writing a variety of programs to automatically identify failure root causes from simulation, emulation, and production logs. In the subsequent two follow on projects, my responsibilities continued to grow and culminated in an architect role for the third generation system.

As a computer-obsessed young engineer just starting a career this job was a dream come true. On a daily basis, I had direct interaction with ASICs, logic board design, exotic computer architectures, complicated caching protocols, low-level coding, and advanced operating systems. Stuff that I had only read about in text books was now literally at my finger tips to be studied, understood, and played with.

It was the people that I got to work with, though, that really made this job special. Everyone on the project shared an all-hands-on-deck, add-value-where-you-can, succeed-at-all-costs attitude that created an invigorating environment to work in. My principal mentor was a MicroKid from The Soul of a New Machine from whom I learned a great, great deal (and still do!). Thanks to his guidance and support I got to "touch" nearly every part of the system and felt encouraged to continuously take on new challenges. To this day I still recall the feeling of endless possibilities and excitement of knowing that my skills were improving daily.

Truly a lucky way to start my career.

UPDATE

To clarify this post, I decided to add a simplified diagram of the system's architecture. The actual architecture was more complicated, for example some of the ASICs depicted were actually multiple devices, but the diagram should sufficiently convey the overall design.

Wednesday, October 28, 2009

C Bitfields and HW Registers

When I need a quick low-level programming "fix", I browse the archives at PageTable.com. Last week I read the post on Readable and Maintainable Bitfields in C, which argued the merits of using bitfields over bitmasks+macros. Although I agree with the post's points, I think it omitted one important detail - the danger of using bitfields with hardware registers.

Hardware registers can be mapped into a processor's memory space and accessed with standard memory read/write instructions. Therefore the temptation is to define a bitfield type representing a register's structure and set a pointer to its base memory address. For example, assuming register bar is four bytes wide, has five bit-fields, and is located at memory address 0xBAADF00D, one might be tempted to do the following:

typedef struct {
    unsigned int f1:8;
    unsigned int f2:4;
    unsigned int f3:8;
    unsigned int f4:4;
    unsigned int f5:8;
} registerBar_t;


// Set pointer to register's memory address
registerBar_t *pBar = (registerBar_t *)0xBAADF00D;

// Use pointer to access register
pBar->f1 = 0xFE;

One problem with this approach is that many registers are designed to be accessed only at their full size - all accesses must be aligned to the register's base address and read/write the whole thing. Unfortunately, the setting of f1 in the above example may produce a partial register write that can lead to unexpected and unintended results.

Another challenge when dealing with registers is that, unlike main memory, register accesses can have side effects. Even register reads can cause the hardware to initiate action or clear information. Consider the following example:

...
tmpf1 = pBar->f1;
tmpf2 = pBar->f2;
...

If reads of register bar are destructive, causing its contents to be cleared, then the read of f1 may clear the contents of f2 before it is read by the subsequent pointer dereference. The resulting loss of information could cause the driver, hardware, or both to behave unexpectedly.

Alternatively, if reads of register bar cause the hardware to initiate action then spurious activity may occur if the f1 and f2 accesses are done separately.

Because of these granularity and side effect issues, I was advised early in my career to avoid using bitfields with hardware registers. This is, I think, an important point that is captured in the PageTable.com post's comments but not in the post itself which is an unfortunate omission.

After reading the PageTable.com post, I realized that I always took this advice on faith and never looked at the instructions generated by bitfield accesses. So I decided to do a quick experiment; below is a short program that accesses a bitfield with fields of varying size and alignment.

#include <stdio.h>

typedef union {
  struct {
    unsigned int f1:8; // Bits 07:00
    unsigned int f2:4; // Bits 11:08
    unsigned int f3:8; // Bits 19:12
    unsigned int f4:4; // Bits 23:20
    unsigned int f5:8; // Bits 31:24
  };
  unsigned int raw;
} bitfield_t;


int main()
{
  bitfield_t bitfield;
  unsigned int tmp;

  bitfield.raw = 0x0;

  // Set bit field f1
  bitfield.f1 = 0xEF;
  tmp = bitfield.f1;
  printf(" After f1: F1(0x%02x) RAW(0x%08x)\n", 
         tmp,
         bitfield.raw);

  // Set bit field f2
  bitfield.f2 = 0xE;
  tmp = bitfield.f2;
  printf(" After f2: F2(0x%02x) RAW(0x%08x)\n", 
         tmp,
         bitfield.raw);

  // Set bit field f3
  bitfield.f3 = 0xDB;
  tmp = bitfield.f3;
  printf(" After f3: F3(0x%02x) RAW(0x%08x)\n", 
         tmp,
         bitfield.raw);

  // Set bit field f4
  bitfield.f4 = 0xA;
  tmp = bitfield.f4;
  printf(" After f4: F4(0x%02x) RAW(0x%08x)\n", 
         tmp,
         bitfield.raw);

  // Set bit field f5
  bitfield.f5 = 0xDE;
  tmp = bitfield.f5;
  printf(" After f5: F5(0x%02x) RAW(0x%08x)\n", 
         tmp,
         bitfield.raw);

  // Set with raw
  bitfield.raw = 0xDECAFBAD;
  tmp = bitfield.raw;
  printf("After raw: RAW(0x%08x)\n", 
         tmp);

  return 0;
}

Compiling and running this program on an Ubuntu system results in the expected output.

jcardent@ubuntu:~/tmp$ gcc -g -o foo foo.c 
jcardent@ubuntu:~/tmp$ ./foo 
 After f1: F1(0xef) RAW(0x000000ef)
 After f2: F2(0x0e) RAW(0x00000eef)
 After f3: F3(0xdb) RAW(0x000dbeef)
 After f4: F4(0x0a) RAW(0x00adbeef)
 After f5: F5(0xde) RAW(0xdeadbeef)
After raw: RAW(0xdecafbad)

Running the command

jcardent@ubuntu:~/tmp$ objdump -d -S foo 

reveals the instructions generated to access the bit-fields. Looking at the f1 write and read sequence shows:

  // Set bit field f1
  bitfield.f1 = 0xEF;
 80483dc:       c6 45 f8 ef        movb   $0xef,-0x8(%ebp)
  tmp = bitfield.f1;
 80483e0:       0f b6 45 f8        movzbl -0x8(%ebp),%eax
 80483e4:       0f b6 c0           movzbl %al,%eax
 80483e7:       89 45 f4           mov    %eax,-0xc(%ebp)

The first thing to note from this disassembly fragment is that bitfield is located on the stack eight bytes below %ebp. Likewise, tmp is located 0xC bytes below %ebp.

From this example it's clear that the write to f1 uses a single byte move instruction. If bitfield had been mapped to a hardware register, this would have resulted in an aligned but too short write access that could have produced unintended behavior.

The read of f1 is less clear until the movzbl instruction is understood to be a zero-extending move from a single byte into a 32-bit register. So here again, if bitfield had been mapped to a register the single-byte access may have resulted in unintended behavior like data loss (top three bytes cleared) or spurious action (if subsequent reads are done to other fields for the same operation).

Looking at the f2 write and read sequence shows:

 // Set bit field f2
  bitfield.f2 = 0xE;
 8048404:       0f b6 45 f9        movzbl -0x7(%ebp),%eax
 8048408:       83 e0 f0           and    $0xfffffff0,%eax
 804840b:       83 c8 0e           or     $0xe,%eax
 804840e:       88 45 f9           mov    %al,-0x7(%ebp)
  tmp = bitfield.f2;
 8048411:       0f b6 45 f9        movzbl -0x7(%ebp),%eax
 8048415:       83 e0 0f           and    $0xf,%eax
 8048418:       0f b6 c0           movzbl %al,%eax
 804841b:       89 45 f4           mov    %eax,-0xc(%ebp)

In this case, setting the four bit-wide field f2 results in a byte-wide read-modify-write sequence aligned with the second byte of bitfield, evidenced by the offset of 0x7 instead of 0x8. Similarly, reading f2 results in a byte-wide read aligned with the second byte of bitfield. Both accesses are too short and misaligned.

Since f3 spans bytes 2 and 3 of bitfield, its access sequence results in aligned, four byte-wide mov instructions.

  bitfield.f3 = 0xDB;
 8048438:       8b 45 f8           mov    -0x8(%ebp),%eax
 804843b:       25 ff 0f f0 ff     and    $0xfff00fff,%eax
 8048440:       0d 00 b0 0d 00     or     $0xdb000,%eax
 8048445:       89 45 f8           mov    %eax,-0x8(%ebp)
  tmp = bitfield.f3;
 8048448:       8b 45 f8           mov    -0x8(%ebp),%eax
 804844b:       c1 e8 0c           shr    $0xc,%eax
 804844e:       80 e4 ff           and    $0xff,%ah
 8048451:       0f b6 c0           movzbl %al,%eax
 8048454:       89 45 f4           mov    %eax,-0xc(%ebp)

Although the accesses themselves are well-formed, unintended behaviors can still result if f3 is only one of multiple fields that must be set for a single operation.
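For example, reusing the hypothetical registerFoo_t overlay sketched above: if the device expects f3 and f4 to change together, writing them through bit-fields still reaches the hardware as two separate read-modify-write cycles, with an inconsistent state visible in between.

/* Assumes the hypothetical registerFoo_t overlay from the earlier sketch.
   If the device requires f3 and f4 to be updated in a single write, this
   sequence leaves it momentarily inconsistent. */
void setF3F4(volatile registerFoo_t *pFoo)
{
  pFoo->f3 = 0xDB;  /* first read-modify-write hits the device */
  pFoo->f4 = 0xA;   /* second, independent read-modify-write   */
}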

Since the layout of bitfield is symmetrical, the accesses to fields f4 and f5 produce instructions similar to those for f2 and f1, respectively, albeit with different offsets.

Finally, the accesses to raw produce aligned, full-width instructions as expected.

  // Set with raw
  bitfield.raw = 0xDECAFBAD;
 80484cd:       c7 45 f8 ad fb ca de movl   $0xdecafbad,-0x8(%ebp)
  tmp = bitfield.raw;
 80484d4:       8b 45 f8           mov    -0x8(%ebp),%eax
 80484d7:       89 45 f4           mov    %eax,-0xc(%ebp)

This last example illustrates a tempting workaround for "safely" using bitfields to manage register accesses. Consider:

registerBar_t *pBar = (registerBar_t *)0xBAADF00D; /* placeholder register address */
registerBar_t tmpBar;

// Set field f1 to 0xff
tmpBar.raw = pBar->raw;
tmpBar.f1  = 0xff;
pBar->raw  = tmpBar.raw;

While this approach works, it runs the risk of an uninformed future maintainer "optimizing out" the temporary variable and using the bit-field members directly. In this regard, it may be more maintainable to use bitmasks and macros for register accesses.
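As a rough sketch of that alternative (the macro names and field layout below are assumptions, not a real register definition), the mask-and-shift style makes the single full-width read and write explicit:

#include <stdint.h>

/* Hypothetical field layout: F1 occupies bits 0-7, F2 occupies bits 8-11. */
#define BAR_F1_SHIFT  0
#define BAR_F1_MASK   (0xFFu << BAR_F1_SHIFT)
#define BAR_F2_SHIFT  8
#define BAR_F2_MASK   (0xFu  << BAR_F2_SHIFT)

#define BAR_SET_FIELD(reg, mask, shift, val) \
    ((reg) = ((reg) & ~(mask)) | (((uint32_t)(val) << (shift)) & (mask)))

/* Read the register once, update both fields in a local copy, then write
   it back with a single aligned 32-bit store. */
static void setBarFields(volatile uint32_t *pBar, uint32_t f1, uint32_t f2)
{
  uint32_t tmp = *pBar;
  BAR_SET_FIELD(tmp, BAR_F1_MASK, BAR_F1_SHIFT, f1);
  BAR_SET_FIELD(tmp, BAR_F2_MASK, BAR_F2_SHIFT, f2);
  *pBar = tmp;
}

Here the width and number of device accesses are visible at the call site, so a future maintainer is less likely to accidentally reintroduce narrow or repeated register accesses.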

Of course, problems can arise regardless of the method used if "uninformed" developers are allowed to change the code. The only prevention here is to make sure there is suitable training and disciplined code reviews.

Thursday, October 22, 2009

Book Review: Coders At Work

Coders At Work: Reflections on the Craft of Programming by Peter Seibel

I really enjoyed this book. Peter Seibel did an outstanding job in selecting a group of interviewees who are both legendary and represent a wide range of programming domains. In addition, Peter's questions were similar enough to allow comparisons between responses but maintained a conversational flow that prevented the interviews from feeling formulaic. Ehud Lamm's quote from the front cover perhaps says it best: "reading this book may be the next best thing to chatting with these illustrious programmers in person".

I read the interviews out of order based on my interests. All were good but my favorite ones were (in no particular order):

One striking observation was how many of the interviewees fit into Malcolm Gladwell's definition of an Outlier. Nearly all of them had opportunities to get a significant amount of programming experience at a young age and serendipitously found jobs that provided the right environment for them to succeed.

I also found interesting the "organic" approach most of the interviewees took to programming. None seemed to rely on formalized best practices; instead, they used their intuition to tailor their approach to the task at hand. To me this suggests that you just can't codify great programming, which matches my experience with the great programmers I have known.

Finally, it's notable how many of the people interviewed either started out hacking Lisp or were heavily influenced by it. Of course, this may be due to selection bias given that the author, Peter Seibel, is a Lisp hacker himself and wrote the outstanding book Practical Common Lisp. As a Lisp fan, I don't mind the potential bias; it's a feature, not a bug!

Wednesday, October 14, 2009

Favorite Innovation Resources

My company's annual innovation conference is today and I'll be participating in a panel of innovators. As a result, innovation is on my mind so I thought I would post my favorite innovation resources. Unfortunately, this is only a quick list of links as I don't have time to discuss in detail what I like about each resource. Perhaps this will be a theme for a future series of posts.

Innovation

  • The Innovator's Solution: sequel to The Innovator's Dilemma, I prefer this book as it discusses many secondary factors related to disruption and how to avoid it. Key points to me were transitions between tightly-integrated and modular architectures and avoiding products that are incremental innovations for incumbents.
  • Drucker's Innovation and Entrepreneurship: simply a great book that should be required reading for every innovator, entrepreneur, and executive manager. Based on my years of experience as an innovator within a large company I found Drucker's insights and advice to be relevant, pragmatic, and actionable. Given its publication date, it appears to me that this book served as the foundation for the "popular" innovation books of the 90's and beyond. Reading Drucker's book helped to provide more context for those later works. This book is well worth the time to read it.
  • The Doblin Group's Ten Types of Innovation: a great cheat-sheet summarizing the various forms of innovation. I have this hanging on my office wall at work.
  • Skate to Where the Money Will Be: perhaps one of the most important papers for any innovator to read, especially those involved in long term strategic planning for an established company.
  • Creating New Market Space: a great HBS article on systematic ways of identifying new market opportunities.
  • What Is Strategy?: to some degree, innovating is deeply tied to defining a strategy. As a result, it's important to understand what a strategy is, and Porter is a leading thinker in this area.
  • The Myths of Innovation: a thought provoking book that dispels some of the myths surrounding innovation.
  • Serious Play: a decent book that makes the important point that sometimes a model or prototype is needed to serve as the catalyst and focal point for exploratory innovation. I've spent most of my engineering career building prototypes so I found this book to be a thoughtful resource.
  • The Power of Product Platforms: written by one of my MBA professors, this book introduces the concept of using modular architectures to build derivative products to serve different market segments. A powerful concept in my experience.
  • Eureka! It Really Takes Years of Hard Work: a nice New York Times article discussing the fact that innovation is often the result of a lot of hard work and not an instantaneous eureka moment.

Execution

  • Crossing the Chasm: Geoffrey Moore's classic book about making the transition from the early to mainstream markets. A lot of great advice for innovators.
  • Dealing with Darwin: another Moore classic, this one discusses how to structure an organization to maintain a pipeline of innovations and sustain corporate growth. A good read for anyone involved in maintaining an established business.
  • The Art of the Start: a great and inspiring book on starting a new venture. I particularly appreciated the no-nonsense advice and strong focus on fundamentals. My bias towards the latter has always made me feel like a "pseudo-MBA" so Guy's advice increased my confidence in my own opinions in this area.

Creative Problem Solving

  • ThinkerToys: a great resource for creative thinking techniques.
  • Back of the Napkin: sometimes a whiteboard or pen+paper are the most powerful tools for communicating ideas and innovations. This book provides great tips on visual thinking and communication.
  • The Mind Map Book: I have found mind-mapping to be a very effective and powerful technique for creative thinking. Written by one of the original proponents of mind-mapping, this is a good resource for learning how to create and use mind-maps.
  • How to Think Like Einstein: despite the cheeky title, this book contains useful tips on creative thinking.
  • How to Think Like Leonardo Da Vinci: same as the above, cheeky title but good advice.

As I mentioned, this is just a short list of the resources that I've found most useful. In future posts I may discuss these and other resources in more detail.

Friday, October 9, 2009

Book Review: Fortune's Formula

Fortune's Formula: The Untold Story of the Scientific Betting System that Beat the Casinos and Wall Street by William Poundstone

[Read in August of 2009 but was delayed in writing the review.]

I absolutely loved this book, in fact I more or less devoured it as it involved three topics that I find very interesting:

How can a book involving these three topics be anything but good!

OK, enough gushing, the book is essentially about the following three people:

  • Claude Shannon: Bell Labs super-genius most famous for creating the field of Information Theory, but who also pioneered the use of Boolean logic in digital circuits and made material contributions to the fields of Mendelian genetics, cryptography, and computational logistics.
  • John L. Kelly: Bell Labs scientist famous among geeks for programming an IBM 704 to sing the song Daisy Bell, which inspired HAL-9000's song at the end of the movie 2001: A Space Odyssey. Kelly is also famous for using Shannon's Information Theory to create the Kelly Criterion, a method for optimal betting or investing given inside information.
  • Edward O. Thorp: M.I.T. professor famous for writing the book Beat the Dealer, which proved card counting could be used to win at blackjack. Thorp is also famous for pioneering the use of computers in hedge fund investing (a.k.a. quantitative finance). Thorp's fund, Princeton Newport Partners, achieved an impressive annualized return of 15.1% over its 19-year duration.

The story more or less goes thusly:

  • Shannon invents Information Theory.
  • With Shannon's help, Kelly uses Information Theory to create the Kelly Criterion for determining an optimal betting strategy given inside information.
  • After meeting at M.I.T., Shannon and Thorp research ways of using wearable computers to win at roulette. Thorp goes on to use the Kelly Criterion first to make optimal bets while card counting and later to make profitable investments for his hedge fund.

The book focuses on how the interaction between these three men gave rise to much wealth. Along the way Poundstone discusses related topics such as Efficient Market Theory, the Random Walk Hypothesis, and statistical arbitrage. Poundstone includes just enough mathematics to make the story interesting to technical readers without alienating other audiences.

It must be noted that the book offers a much richer story including the influences of organized crime at various stages of history.

All in all, a great book that I highly recommend to anyone interested in these topics.

Book Review: Labyrinths of Reason

[Read in August of 2009 but was delayed in writing the review.]

Labyrinths of Reason: Paradox, Puzzles, and the Frailty of Knowledge by William Poundstone.

Ouch, my head hurts. How do we really know we're not just brains-in-vats living in a simulated reality?

This is more or less the central theme of the book which guides the reader through various thought experiments designed to probe the depths of meaning, understanding, and knowing. Along the way Poundstone discusses various topics such as:

I could go on but doing so would still do the book an injustice. It's a great book that covers a lot of interesting topics. Well worth reading if you are interested in these kinds of things but be prepared to get a few headaches from excessive thinking. I'm planning to re-read it perhaps a year from now to try to better absorb the material.

Book Review: Prisoner's Dilemma

[Read in August of 2009 but was delayed in writing the review.]

Prisoner's Dilemma: John Von Neumann, Game Theory, and the Puzzle of the Bomb by William Poundstone.

Although the book is ostensibly about Game Theory, I felt it was more of a montage of Game Theory tutorials, a biography of John Von Neumann, and a history of the RAND corporation and the cold war.

The Game Theory content was fascinating and informative, covering various kinds of games, strategies, and the complications stemming from repeated games. Good stuff.

I found the Von Neumann biography captivating as it provided deeper insight into his personal nature than anything I had read before. He was clearly an incredibly intelligent man, but he also had a number of personal challenges. While this material makes Von Neumann appear more "human", it's not particularly pleasant to read about. Also, I thought the account of his death from cancer was too detailed and grim.

Personally, I have no interest in the history of the RAND corporation or the cold war so I found these parts of the book difficult to get through. But others may find this material more interesting.

All in all, not a bad book if you're interested in these topics but I wouldn't recommend this for "feel good" reading.

Sunday, September 27, 2009

Book Review: Complexity

Complexity: A Guided Tour by Melanie Mitchell

[Read in July of 2009 but was delayed in writing the review.]

I came across this book during a bookstore trip and decided to purchase it as it included topics related to a couple of recent posts and a previously read book, Linked: The New Science of Networks. I'm glad I did, as it made me realize that many of the things I am interested in fall under the domain of Complex Systems.

The book is structured as five parts and nineteen chapters as summarized below:

Part 1: Background and History
  1. What is Complexity?
  2. Dynamics, Chaos, and Prediction
  3. Information
  4. Computation
  5. Evolution
  6. Genetics, Simplified
  7. Defining and Measuring Complexity

Part 2: Life and Evolution in Computers
  8. Self-Reproducing Computer Programs
  9. Genetic Algorithms

Part 3: Computation Writ Large
  10. Cellular Automata, Life, and the Universe
  11. Computing with Particles
  12. Information Processing in Living Systems
  13. How to Make Analogies (if You Are a Computer)
  14. Prospects of Computer Modeling

Part 4: Network Thinking
  15. The Science of Networks
  16. Applying Network Science to Real-World Networks
  17. The Mystery of Scaling
  18. Evolution, Complexified

Part 5: Conclusion
  19. The Past and Future of the Sciences of Complexity

Part 1, nearly a third of the entire book, provides a thorough introduction to Complex Systems, which the author defines as having the following properties:

  1. Complex collective behavior (i.e. the whole is greater than the sum of the parts)
  2. Signaling and information processing
  3. Adaptation to environmental changes

Example complex systems cited were insect colonies, the brain, the immune system, economies, and the world wide web.

Chapter 2 was an enjoyable introduction to dynamical systems and chaos theory, which I had only a cursory awareness of. The simple "rabbit population" example using the logistic map was a great aid in understanding the concepts of fixed, periodic, and strange attractors, and sensitive dependence on initial conditions (a.k.a. the Butterfly Effect). New to me were the concepts of the period-doubling route to chaos and Feigenbaum's constant (which is just plain odd).
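As an aside, the rabbit-population model is easy to play with. Here's a minimal, self-contained sketch of the logistic map; the growth-rate values are just illustrative choices, not anything from the book.

#include <stdio.h>

/* A minimal sketch of the logistic map x' = r * x * (1 - x) behind the
   "rabbit population" example. The growth rates are illustrative: 2.8
   settles to a fixed point, 3.2 to a period-2 cycle, and 4.0 behaves
   chaotically (sensitive to initial conditions). */
int main(void)
{
  double rates[] = { 2.8, 3.2, 4.0 };
  int i, n;

  for (i = 0; i < 3; i++) {
    double r = rates[i];
    double x = 0.2;              /* arbitrary initial population */

    for (n = 0; n < 200; n++)    /* let transients die out */
      x = r * x * (1.0 - x);

    printf("r = %.1f:", r);
    for (n = 0; n < 4; n++) {    /* show the next few iterates */
      x = r * x * (1.0 - x);
      printf(" %.4f", x);
    }
    printf("\n");
  }
  return 0;
}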

Chapters 3 and 4 covered concepts related to information and computation, including entropy, Maxwell's Demon, statistical mechanics, Shannon's information theory, Hilbert's problems, Gödel's incompleteness theorem, and Turing machines. Overall the treatment was good and a sufficient introduction for readers unfamiliar with these concepts.

Chapters 5 and 6 discussed the topics of evolution and genetics. I found the discussion on evolution interesting as I don't recall learning about the feud between the early Darwinians, who believed in continuous small variations, and Mendelians, who believed in discrete large variations. I also don't recall learning about the Modern Synthesis or the continued challenges in the form of punctuated equilibrium. Interesting stuff.

Chapter 7 discussed various ways of defining and measuring complexity as:

  • size
  • entropy
  • algorithmic information content
  • logical depth
  • thermodynamic depth
  • computational capacity
  • statistical complexity
  • fractal dimension
  • degree of hierarchy

While no single measure was identified as the measure, all were interesting to consider.

Chapters 8 to 10 discussed cellular automata, genetic algorithms, and artificial life. While the treatment was good, it felt light-weight after having just read The Recursive Universe. That said, Chapter 10 did provide a nice overview of Wolfram's work on cellular automata that augmented the prior material I had read on this subject.

Chapter 11 discussed the use of cellular automata for majority-classification tasks. The specific task discussed was determining the dominant color in a one-dimensional vector of white and black pixels. The researchers used genetic algorithms to evolve a set of CA rules that reliably performed this task by creating diagonal vectors that carried local majority votes and eventually intersected to produce a majority vote for the entire original vector. At first the researchers seemed unable to understand how the evolved CA worked, but I thought it obvious: any solution must horizontally communicate local-majority votes in order to reach a global decision, and such communication would naturally manifest as diagonal vectors when the CA's evolution is graphed vertically. I thought the use of particle equations to explain the phenomena was overkill, but still interesting.
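To picture the kind of update loop involved, here's a toy sketch of a one-dimensional CA on a ring where each cell adopts the majority value of its three-cell neighborhood. To be clear, this is not the evolved rule from the book - simple local voting gets stuck on block boundaries and fails the global task in general - it just illustrates the lattice update and local-majority voting.

#include <stdio.h>
#include <string.h>

/* Toy illustration only: each cell takes the majority value of its
   three-cell neighborhood on a ring. This is NOT the evolved rule from
   the book; it just shows the basic CA update loop behind the
   density-classification task. */
#define WIDTH 21
#define STEPS 10

int main(void)
{
  int cur[WIDTH]  = {0,1,1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,0};
  int next[WIDTH];
  int t, i;

  for (t = 0; t < STEPS; t++) {
    for (i = 0; i < WIDTH; i++) {
      int left  = cur[(i - 1 + WIDTH) % WIDTH];  /* wrap around the ring */
      int right = cur[(i + 1) % WIDTH];
      next[i] = (left + cur[i] + right) >= 2;    /* local majority vote  */
    }
    memcpy(cur, next, sizeof(cur));
    for (i = 0; i < WIDTH; i++)
      putchar(cur[i] ? '#' : '.');
    putchar('\n');
  }
  return 0;
}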

I enjoyed the author's account in Chapter 13 of her work on the Copycat program, which was designed to process analogies. The problem appears hard and the program was quite clever. I must admit I am extremely jealous that she got to work so closely with Douglas Hofstadter.

Chapters 15 and 16 discussed real-world network theory, covering topics such as small-world phenomena, scale-free networks, clustering, preferential attachment, and network resilience. Overall the treatment was good, but it was mostly a review after having just read (actually, listened to) Barabasi's book Linked, which I highly recommend.

Chapters 17 and 18 discussed network theory in the contexts of metabolic scaling and evolution, which I found very interesting. Some of this material was a review of Stuart Kauffman's work documented in his book Origins of Order. I previously read another of Kauffman's books, At Home in the Universe, which touched on some of the same topics, but perhaps I'll add Origins to the list of potential future readings.

Chapter 19 concluded the book with an overview of the state of complexity science and the need for a unifying theory. While the field has accomplished many great things, it appears that a lot of difficult work remains which is perhaps why this is such an appealing topic.

In summary, Complexity is a great book that covers a lot of interesting topics. On a personal level, I am thankful to have read this book because, as I mentioned, it helped me realize that complexity science is the meta-topic that unifies many of my long-term interests and therefore serves as a guide for future research.

Thursday, September 24, 2009

The Case of the Bloated Firmware

A few years ago, a co-worker of mine was working on an embedded micro-controller for a proprietary enterprise-class motherboard. I don't recall the exact details, but my recollection is that the micro-controller was responsible for monitoring a group of sensors and reporting any anomalies.

My friend used a commercial C compiler to develop the micro-controller's code, which compiled fine but produced a binary that was too large for the device's EPROM. He tried a number of things to reduce the binary's size but was unable to get the code to fit. During a serendipitous hallway conversation he asked if I could take a look at the problem, and we returned to his cube on a mission to shrink the binary.

First, he walked me through the source code to explain the overall purpose and design. At the C level the code looked fine, so we next took a look at the disassembled binary and found our first clue - one of the for loops seemed to generate an excessive number of instructions. Strange.

Again, the exact details escape me but the code in question looked roughly as follows:

int value1[NUM_SENSORS];
int value2[NUM_SENSORS];
int value3[NUM_SENSORS];
int sensor_num;

for (sensor_num = 0; sensor_num < NUM_SENSORS; sensor_num++) {

  value1[sensor_num] = readSensorValue1(sensor_num);
  value2[sensor_num] = readSensorValue2(sensor_num);
  value3[sensor_num] = readSensorValue3(sensor_num);
}

What we observed was that each array reference generated a surprising number of instructions just to compute the specified element's memory address. With three such references in the loop body, the instructions added up quickly, and matters were made worse by the fact that there were multiple loops of this kind in the code. We had found our culprit.

To work around the problem, I suggested that he re-write the code as follows:

struct {
  int value1;
  int value2;
  int value3;
} sensors[NUM_SENSORS];
int sensor_num;

for (sensor_num = 0; sensor_num < NUM_SENSORS; sensor_num++) {

  sensors[sensor_num].value1 = readSensorValue1(sensor_num);
  sensors[sensor_num].value2 = readSensorValue2(sensor_num);
  sensors[sensor_num].value3 = readSensorValue3(sensor_num);
}

We changed the loops, recompiled the code, and presto chango, the binary was small enough to fit!

A quick look at the disassembly confirmed my hypothesis: with the rewritten code, each loop iteration needed only a single array-element address calculation - the address of each structure member was just a fixed offset from the element's starting address. The address calculations were still inefficient, but we had reduced their number enough for the binary to fit within the device's EPROM. Success!
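As a side note, the fixed-offset behavior is easy to demonstrate on any host compiler. Here's a small sketch (the struct mirrors the rewritten code above; NUM_SENSORS and the member names are just the same illustrative names) that prints the element size and member offsets:

#include <stdio.h>
#include <stddef.h>

/* Host-side illustration: with the array-of-structs layout, each member
   sits at a fixed, compile-time offset from its element's base address,
   so the loop body needs only one indexed address calculation. */
#define NUM_SENSORS 8

struct sensor {
  int value1;
  int value2;
  int value3;
};

int main(void)
{
  struct sensor sensors[NUM_SENSORS];

  printf("element size : %lu bytes\n", (unsigned long)sizeof(sensors[0]));
  printf("value1 offset: %lu\n", (unsigned long)offsetof(struct sensor, value1));
  printf("value2 offset: %lu\n", (unsigned long)offsetof(struct sensor, value2));
  printf("value3 offset: %lu\n", (unsigned long)offsetof(struct sensor, value3));
  printf("&sensors[3]        = %p\n", (void *)&sensors[3]);
  printf("&sensors[3].value2 = %p\n", (void *)&sensors[3].value2);
  return 0;
}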

Overall, this wasn't a particularly difficult problem and didn't take much time to resolve. However, having to imagine how the compiler would respond to restructuring the code made this a fun and memorable "bug" to work on.