Thursday, September 24, 2009

The Case of the Bloated Firmware

A few years ago, a co-worker of mine was working on an embedded micro-controller for a proprietary enterprise class motherboard. I don't recall the exact details but my recollection is that the micro-controller was responsible for monitoring a group of sensors and reporting any anomalies.

My friend used a commercial C compiler to develop the code for the micro-controller which compiled fine but resulted in a binary that was too large for the device's EPROM. He tried a number of things to reduce the binary's size but was unable to get the code to fit. During a serendipitous hallway conversation he asked if I could take a look at the problem and we returned to his cube on a mission to shrink the binary.

First, he walked me through the source code to explain the overall purpose and design. At the C level the code looked fine so we next took a look at the disassembled binary and found our first clue - one of the for loops seemed to generate an excessive amount of instructions. Strange.

Again, the exact details escape me but the code in question looked roughly as follows:

 1:  int value1[NUM_SENSORS];
 2:  int value2[NUM_SENSORS];
 3:  int value3[NUM_SENSORS];
 4:  
 5:  for(sensor_num= 0; sensor_num < NUM_SENSORS; sensor_num++) {
 6:  
 7:    value1[sensor_num] = readSensorValue1(sensor_num);
 8:    value2[sensor_num] = readSensorValue2(sensor_num);
 9:    value3[sensor_num] = readSensorValue3(sensor_num);
10:  }

What we observed was that each of the array references resulted in a lot of instructions to compute the specified element's memory address. Since the loop body included three references, it generated a lot of instructions. Matters were made worse by the fact that there were multiple loops of this kind in the code. We found our culprit.

To work around the problem, I suggested that he re-write the code as follows:

 1:  struct {
 2:    int value1;
 3:    int value2;
 4:    int value3;
 5:  } sensors[NUM_SENSORS];
 6:  
 7:  for(sensor_num= 0; sensor_num < NUM_SENSORS; sensor_num++) {
 8:  
 9:    sensors[sensor_num].value1 = readSensorValue1(sensor_num);
10:    sensors[sensor_num].value2 = readSensorValue2(sensor_num);
11:    sensors[sensor_num].value3 = readSensorValue3(sensor_num);
12:  }

We changed the loops, recompiled the code, and presto chango the binary was small enough to fit!

A quick look at the disassembly confirmed my hypothesis that writing the code as proposed would result in only a single array element address calculation in the loop body - the address of each structure member was just an offset from the array element's starting address. The array element address calculations were still inefficient but we had reduced their number enough for the binary to fit within the device's EPROM. Success!

Overall, this wasn't a particularly difficult problem and didn't take that much time to resolve. However, the necessity to imagine how the compiler would respond to restructuring the code made this a fun, and memorable "bug" to work on.