Thursday, December 24, 2009

Recovering Deleted JPEGs from a FAT File System - Part 5

Part 5 in a series of posts on recovering deleted JPEG files from a FAT file system.

UPDATE: 2010/1/28: it's come to my attention that the long file name support code described in this post is broken - it only works for files that utilize a single LFN entry. For the time being I plan to work around this bug for the FATRecover series but I do intend to post a fix eventually. I'll be sure to add another update with a link to the fix when it is available.

The topic of this installment is the root directory and the code required to process it. Although strictly speaking the file allocation table is the next major on-disk area, I think covering the root directory first will make the overall discussion clearer.

Note that this is a watershed post as the "hack" level of the example code increases dramatically. This was unfortunate but necessary to accommodate certain FAT functionality while limiting the discussion's length. That said, readers with sufficient skill should be able to use the example code to create more robust implementations.

Follow the "read more" link for a detailed description of how to read and interpret the root directory.

As discussed in parts 3 and 4, the root directory area is located on-disk immediately after the FAT area and consumes a fixed amount of pre-allocated space. The root directory area is essentially an array of 32 byte entries with the following structure:

#define FAT_DIRENTS_FNSZ (8)
#define FAT_DIRENTS_EXTSZ (3)

typedef struct fatDirectoryEntrySFN_s {     
  uint8_t   filename[FAT_DIRENTS_FNSZ];   // 07-00
  uint8_t   extension[FAT_DIRENTS_EXTSZ]; // 10-08
  uint8_t   attributes;                   // 11-11
  uint8_t   reserved1;                    // 12-12
  uint8_t   createTimeDsec;               // 13-13
  uint16_t  createTimeHMS;                // 15-14
  uint16_t  createTimeDay;                // 17-16
  uint16_t  accessedDay;                  // 19-18
  uint16_t  firstClusterHigh;             // 21-20
  uint16_t  writeTimeHMS;                 // 23-22
  uint16_t  writeDay;                     // 25-24
  uint16_t  firstCluster;                 // 26-27
  uint32_t  fileSize;                     // 31-28
} __attribute__ ((packed)) fatDirectoryEntrySFN_t;

The the first byte of the filename field specifies an entry's allocation status. Initially, all directory table entries are in the UNUSED state (0x00). Entries that are allocated and then later deallocated (due to file system deletions) are moved to the DELETED state (0xE5). The first byte of valid directory entries contains the first file name character which must have a value other than 0x00 or 0xE5 (if necessary the value 0x05 is used to encode the character value 0xE5).

File systems are expected to allocate directory entries sequentially starting with the first. As a result, partially allocated directory tables should consist of an inter-mixture of allocated and DELETED entries followed by one or more UNUSED entries. Directory tables that have been fully utilized will consist of only allocated and DELETED entries. When allocating entries, file systems can either re-use DELETED entries or allocate the next UNUSED entry (if one exists) - the preference is implementation specific.

The following code fragment provides #defines and a macro that can be used to determine a directory entry's allocation status.

#define FAT_DIRENTS_FNUNUSED  (0x00)
#define FAT_DIRENTS_FNDELETED (0xE5)

#define FAT_DIRENTS_FNVALID(pfatDE) \
  ((FAT_DIRENTS_FNUNUSED  != (pfatDE)->filename[0]) &&  \
   (FAT_DIRENTS_FNDELETED != (pfatDE)->filename[0]))

Valid directory entries contain the associated file system object's name in 8.3 format. The filename field contains the base file name while the extension field may contain an optional 3 character extension. As with the boot sector's text fields, the filename and extension fields are padded with spaces and lack null termination.

The attributes field contains flags describing the entry's type and characteristics. The following code fragment defines the various flags and provides macros for testing them.

#define FAT_DIRATT_READONLY     (0x01)
#define FAT_DIRATT_HIDDEN       (0x02)
#define FAT_DIRATT_SYSTEM       (0x04)
#define FAT_DIRATT_VOLUMELABEL  (0x08)
#define FAT_DIRATT_LFN          (0x0f)
#define FAT_DIRATT_DIRECTORY    (0x10)
#define FAT_DIRATT_ARCHIVE      (0x20)
#define FAT_DIRATT_MASK         (0x3f);

#define FAT_DIRATT_ISREADONLY(x) \
  (((x) & FAT_DIRATT_READONLY) == FAT_DIRATT_READONLY)
#define FAT_DIRATT_ISHIDDEN(x) \
  (((x) & FAT_DIRATT_HIDDEN) == FAT_DIRATT_HIDDEN)
#define FAT_DIRATT_ISSYSTEM(x) \
  (((x) & FAT_DIRATT_SYSTEM) == FAT_DIRATT_SYSTEM)
#define FAT_DIRATT_ISVOLUMELABEL(x) \
  (((x) & FAT_DIRATT_VOLUMELABEL) == FAT_DIRATT_VOLUMELABEL)
#define FAT_DIRATT_ISLFN(x) \
  (((x) & FAT_DIRATT_LFN) == FAT_DIRATT_LFN)
#define FAT_DIRATT_ISDIRECTORY(x) \
  (((x) & FAT_DIRATT_DIRECTORY) == FAT_DIRATT_DIRECTORY)
#define FAT_DIRATT_ISARCHIVE(x) \
  (((x) & FAT_DIRATT_ARCHIVE) == FAT_DIRATT_ARCHIVE)

Most of the attributes should be self-explanatory. Later in the post we'll discuss the LFN attribute and its dire implications.

For this project, the only other fields of interest are firstCluster and fileSize. The firstCluster field specifies the data area cluster containing the associated file system object's first chunk of data - the next post on the file allocation table will discuss this in greater depth. The fileSize field indicates a file's size in bytes - non-file entries should have a value of 0 in this field.

For the program I am re-developing alongside these posts, I took the following approach to processing the root directory:

  1. malloc a temporary buffer the same size as the root directory area.
  2. Read the on-disk root directory area into the buffer.
  3. Iterate over the directory table entries in the buffer and create an abstracted data structure for each valid entry.
  4. free the temporary buffer

Below is a high-level code fragment that implements the above steps:

  pRootDirBuff = malloc(rootdirSizeBytes);
  memset(pRootDirBuff, 0, rootdirSizeBytes);
  
  frSectorRead(pimageInfo, 
               pimageInfo->rootdirFirstSector,
               pimageInfo->rootdirSizeSectors,   
               pRootDirBuff);

  frDirProcessEntries(&(pimageInfo->rootDir), 
                      pRootDirBuff,
                      rootdirSizeBytes);

  free(pRootDirBuff);

pimageInfo is a housekeeping data structure my program uses to pass around context information like the disk image's open file descriptor, copies of the boot sector structures, and the information calculated in part 4. frSectorRead() is a utility routine that reads the specified sector extent from the disk image file into the memory buffer provided. All of the real action occurs in frDirProcessEntries().

My first implementation of frDirProcessEntries() looked something like the following:

int frDirProcessEntries(frDirEntry_t *pParentDir, 
                        char *pdirBuff, 
                        size_t sizeBuff)
{
fatDirectoryEntrySFN_t *fatDirectoryTable;
size_t numFatDirTableEntries;

fatDirectoryTable = (fatDirectoryEntrySFN_t *) pdirBuff;
numFatDirTableEntries = sizeBuff / 
                        sizeof(fatDirectoryEntrySFN_t);

for (entryIndex = 0;
     entryIndex < numFatDirTableEntries;
     entryIndex++) {

  if (!(FAT_DIRENTS_FNVALID(&(fatDirectoryTable[entryIndex])))) {
    continue;
  }
  entryCount++;

  pDirEntry = (frDirEntry_t * )malloc(sizeof(frDirEntry_t));
  memset(pDirEntry, 0, sizeof(frDirEntry_t));

  ptmp  = (char *) &(pDirEntry->filename);
  ptmp += 
      utilCopyNameChars((char *)fatDirectoryTable[entryIndex].filename,
                        ptmp,
                        (int) FAT_DIRENTS_FNSZ,
                        1);

  if (FAT_DIRENTS_FNVALID(&(fatDirectoryTable[entryIndex]))) {
     *ptmp = '.';
      ptmp++;
      ptmp += 
         utilCopyNameChars((char *)fatDirectoryTable[entryIndex].extension, 
                           ptmp,
                           (int) FAT_DIRENTS_EXTSZ,
                           1);
  }
  *ptmp = '\0';

  pDirEntry->attributes   = fatDirectoryTable[entryIndex].attributes;
  pDirEntry->fileSize     = fatDirectoryTable[entryIndex].fileSize;
  pDirEntry->firstCluster = fatDirectoryTable[entryIndex].firstCluster;

  pDirEntry->pParentDir    = pParentDir;
  pDirEntry->pNextSibling  = pParentDir->pDirEntries;
  pParentDir->pDirEntries  = pDirEntry;
  pDirEntry = NULL;
}

return entryCount;
}

Basically, the routine:

  1. Calculates the number of directory entries in the buffer provided.
  2. Casts it as an array of directory entries.
  3. Iterates over the entries via a for loop.
  4. Checks the first byte of each entry to see if it is valid.
  5. Allocates an abstracted data for each valid entry.
  6. Copies the file name and other metadata to the abstract structure.
  7. Adds the abstract structure to the parent directory's linked list of contents.

Technically speaking the for loop should terminate when a FAT_DIRENTS_FNUNUSED entry is encountered but the code above just skips over such entries and continues processing the table - this shouldn't be a problem so I didn't bother to fix it (this is a hobby endeavor after all).

As mentioned, frDirEntry_t is an abstracted data structure used to represent directory entries. Its structure can be inferred from the code above. The motivation for using a secondary structure is to avoid re-processing the raw entries for each listing (the importance of this will become clear later).

utilCopyNameChars() is a utility routine that copies characters from the raw directory entry text fields to a C string.

int utilCopyNameChars(char *srcString, 
                      char *destString, 
                      int maxChars, int srcSkip)
{
  // Copy characters from srcString to destString until 
  // a space character, null character, 0xff value, or 
  // maxChars is reached.
  //
  // Returns the number of characters copied

  char *psrcChar  = srcString;
  char *pdestChar = destString;
  int   count = 0;

  while ((*psrcChar != ' ')  &&
         (*psrcChar != '\0')  &&
         (((unsigned char)*psrcChar) != 0xff)  &&
         (count < maxChars)) {

    *pdestChar = *psrcChar;
    psrcChar += srcSkip;
    pdestChar++;
    count++; 
  }

  return count;
}

The while loop terminates upon reaching either a space character, NULL character, or the value 0xff (which will be explained later). The skip argument supports a hack that is explained below.

After adding a short routine to print the abstracted directory entry structures (which I won't show), I got the following output from running the improved fatrecover program against the test image, frtest1.dmg, from the prior two posts.

LISTING DIR: ROOT
 ATTR     SIZE            NAME          1ST CLUSTER
------  --------  --------------------  -----------
.....A     92889            4204~1.JPG  (0x00000a)
RHSV..        -1                   A4.  (00000000)
.H..D.         0             FSEVEN~1.  (0x000008)
RHSV..        -1                   A..  (00000000)
.H..D.         0             TRASHE~1.  (0x000002)
RHSV..        -1                   A..  (00000000)
.H...A      4096               _~1.TRA  (0x000003)
RHSV..        -1                   A..  (00000000)
...V.A         0              FRTEST1.  (00000000)

YUCK! The funky file names, -1 sizes, and simultaneous setting of the RHSV flags indicated that I had run into the dreaded (by me) long file name support.

When I first did this project in 2004 I punted on handling long file names. I considered doing the same for this re-implementation but the idea didn't sit well with me - I have a hard time avoiding the same problem twice. Besides, without long file name support I would have had to modify the test disk image to clean up the output and update the prior posts as needed. This felt too much like hiding dirty laundry so I decided to suck it up and hack together long file name support - this isn't going to be pretty.

Long file names are supported by using multiple directory entries with the following structure:

#define FAT_DIRENTL_FN1SZ (10)
#define FAT_DIRENTL_FN2SZ (12)
#define FAT_DIRENTL_FN3SZ (4)

typedef struct fatDirectoryEntryLFN_s {     
  uint8_t   sequenceNum;                  // 00-00
  uint8_t   filename1[FAT_DIRENTL_FN1SZ]; // 10-01
  uint8_t   attributes;                   // 11-11
  uint8_t   reserved;                     // 12-12
  uint8_t   checksum;                     // 13-13
  uint8_t   filename2[FAT_DIRENTL_FN2SZ]; // 25-14
  uint8_t   reserved2[2];                 // 27-26
  uint8_t   filename3[FAT_DIRENTL_FN3SZ]; // 31-28
}  __attribute__ ((packed)) fatDirectoryEntryLFN_t;

These long file name (LFN) structures are identified by the RHSV flags being simultaneously set and precede their associated 8.3 entry (SFN) in the directory table. The 8.3 directory entry includes a shortened version of the long file name and the object's associated metadata (i.e. size, starting cluster, etc).

Each LFN entry includes a sequence number indicating its order. The entries are stored in highest to lowest order in the directory table making it easier to determine their total number. The last LFN entry has a a sequence number of 1 bitwise OR'd with the value 0x40. Below is a macro that can be used to check for the 0x40 marker:

#define FAT_DIRENTL_LASTSEQ (0x40)
#define FAT_DIRENTL_ISLASTSEQ(x) \
  (((x) & FAT_DIRENTL_LASTSEQ) == FAT_DIRENTL_LASTSEQ)

The following is a simple example of an LFN and SFN directory table sequence:

| ENTRY# | TYPE | SEQ# |
|--------+------+------|
|      1 | LFN  | 0x04 |
|      2 | LFN  | 0x03 |
|      3 | LFN  | 0x02 |
|      4 | LFN  | 0x41 |
|      5 | SFN  |   -- |

Each LFN entry has three fields for storing file name characters. Some or all of an LFN entry's text fields may contain file name characters. Those that do will have a non-null value in the first byte. Below are macros that can be used to test each field for valid contents:

#define FAT_DIRENTL_FNUNUSED  (0x00)

#define FAT_DIRENTL_FN1VALID(pfatDEL) \
  (FAT_DIRENTL_FNUNUSED  != (pfatDEL)->filename1[0])
#define FAT_DIRENTL_FN2VALID(pfatDEL) \
  (FAT_DIRENTL_FNUNUSED  != (pfatDEL)->filename2[0])
#define FAT_DIRENTL_FN3VALID(pfatDEL) \
  (FAT_DIRENTL_FNUNUSED  != (pfatDEL)->filename3[0])

As with other FAT structures, these fields lack null termination so they must be handled similarly to other on-disk text fields.

Unfortunately, unlike the other on-disk structures, the LFN entry text fields use a UTF-16 UNICODE encoding. I prefer my tangents one at a time and since I don't have much experience in dealing with UTF-16 in C I decided to hack a shortcut (I warned you). Since I know my test images will only use UTF-8 characters I decided to only copy the low-order byte of the UTF-16 characters. I accomplished this by calling utilCopyNameChars() with a skip argument value of 2 and a maxChars argument value of the field length divided by 2. Not my proudest coding moment but it worked.

To handle LFN entries, I modified my original directory table processing for loop to include a sub while loop to iterate over the LFN entries. The code fragment below shows the modified loop along with comments documenting the (many) assumptions made (the remainder of frDirProcessEntries() remained the same):

fatDirectoryTable = (fatDirectoryEntrySFN_t *) pdirBuff;
numFatDirTableEntries = sizeBuff / 
                        sizeof(fatDirectoryEntrySFN_t);

for (entryIndex = 0;
     entryIndex < numFatDirTableEntries;
     entryIndex++) {

  if (!(FAT_DIRENTS_FNVALID(&(fatDirectoryTable[entryIndex])))) {
    continue;
  }
  entryCount++;

  pDirEntry = (frDirEntry_t * )malloc(sizeof(frDirEntry_t));
  memset(pDirEntry, 0, sizeof(frDirEntry_t));

  ptmp  = (char *) &(pDirEntry->filename);

  if (FAT_DIRATT_ISLFN(fatDirectoryTable[entryIndex].attributes)) {
    // N.B. - Hack alert. Rudimentary support for long file names.
    //        Assumes that LFN entries are valid (no checksum checking).
    //        Assumes that last LFN marked with end of sequence flag
    //        Assumes that file names are in UTF-16.
    //        Assumes that all LFN entries exist and are in proper order.
    //        Assumes that LFN and SFN entries are contiguous.
    //        Assumes that file name is less than 256 characters long.
    //
    //        Iterates over LFN entries, copies filename to pDirEntry.
    //        Leaves entryIndex at the associated SFN.

    fatDirectoryEntryLFN_t *pfatDirEntLFN = NULL;
    do {
      pfatDirEntLFN = 
        (fatDirectoryEntryLFN_t *)&(fatDirectoryTable[entryIndex]);

      assert(FAT_DIRATT_ISLFN(fatDirectoryTable[entryIndex].attributes));
      assert(entryIndex < numFatDirTableEntries);

      if (FAT_DIRENTL_FN1VALID(pfatDirEntLFN)) {      
        ptmp += 
               utilCopyNameChars((char *)pfatDirEntLFN->filename1,
                            ptmp,
                            (int) FAT_DIRENTL_FN1SZ/2,
                            2);
      }

      if (FAT_DIRENTL_FN2VALID(pfatDirEntLFN)) {      
        ptmp += 
              utilCopyNameChars((char *)pfatDirEntLFN->filename2,
                            ptmp,
                            (int) FAT_DIRENTL_FN2SZ/2,
                            2);
      }

      if (FAT_DIRENTL_FN3VALID(pfatDirEntLFN)) {      
        ptmp += 
              utilCopyNameChars((char *)pfatDirEntLFN->filename3,
                            ptmp,
                            (int) FAT_DIRENTL_FN3SZ/2,
                            2);
      }

      entryIndex++;

    } while (!(FAT_DIRENTL_ISLASTSEQ(pfatDirEntLFN->sequenceNum)));
  } else {

    // copy file name and extension   
    ptmp += 
       utilCopyNameChars((char *)fatDirectoryTable[entryIndex].filename,
                  ptmp,
                  (int) FAT_DIRENTS_FNSZ,
                  1);

    if (FAT_DIRENTS_FNVALID(&(fatDirectoryTable[entryIndex]))) {
      *ptmp = '.';
      ptmp++;
      ptmp += 
        utilCopyNameChars((char *)fatDirectoryTable[entryIndex].extension,
                          ptmp,
                          (int) FAT_DIRENTS_EXTSZ,
                          1);
    }
  }
  *ptmp = '\0';

  pDirEntry->attributes   = fatDirectoryTable[entryIndex].attributes;
  pDirEntry->fileSize     = fatDirectoryTable[entryIndex].fileSize;
  pDirEntry->firstCluster = fatDirectoryTable[entryIndex].firstCluster;

  pDirEntry->pParentDir    = pParentDir;
  pDirEntry->pNextSibling  = pParentDir->pDirEntries;
  pParentDir->pDirEntries  = pDirEntry;
  pDirEntry = NULL;
}

With this LFN hack in place, re-running the program against the test image produced:

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.6.2
BuildVersion:   10C540

$ wc -l fatrecover.c 
     632 fatrecover.c

$ make clean all
rm -r fatrecover fatrecover.dSYM 
gcc -g -Wall fatrecover.c -o fatrecover

$ fatrecover frtest1.dmg 

=== FAT RECOVER v0.0 ===

Opening file frtest1.dmg................OK
Reading boot sector.....................OK
Processing boot sector..................OK
Reading root dir........................OK

<BOOT SECTOR INFO REMOVED FOR BREVITY>

LISTING DIR: ROOT
 ATTR     SIZE            NAME          1ST CLUSTER
------  --------  --------------------  -----------
.....A     92889            4.2.04.jpg  (0x00000a)
.H..D.         0            .fseventsd  (0x000008)
.H..D.         0              .Trashes  (0x000002)
.H...A      4096            ._.Trashes  (0x000003)
...V.A         0              FRTEST1.  (00000000)

Ah, much better. The first entry is the JPG image file which is 92889 bytes long and begins in cluster 0xa. The .fseventsd and .Trashes entries are artifacts from mounting the disk image on an OSX system - they can be ignored. The last entry (which actually appears first in the on-disk table) is a volume label as indicated by the V attribute flag.

As a sanity check, mounting the disk image and listing the root directory from OSX produces:

$ hdiutil attach frtest1.dmg 
/dev/disk1                      /Volumes/FRTEST1

$ls -ao /Volumes/FRTEST1/
total 232
drwxrwxrwx  1 jcardent  16384 Dec 24 10:52 .
drwxrwxrwt@ 4 root        136 Dec 24 10:52 ..
drwxrwxrwx@ 1 jcardent   2048 Dec 24 10:52 .Trashes
-rwxrwxrwx  1 jcardent   4096 Dec 14 05:28 ._.Trashes
drwxrwxrwx  1 jcardent   2048 Dec 24 10:52 .fseventsd
-rwxrwxrwx  1 jcardent  92889 Dec 14 05:28 4.2.04.jpg

Other than being in opposite orders the two listings match. Woot!

Before declaring total victory, there is one major shortcoming with the LFN hack shown above - it requires an entire directory table to be read into an in-memory buffer. My original intent was to process directory tables one sector or cluster at a time to minimize the memory footprint. However, the hack above can't handle LFN entries spanning a sector or cluster boundary as this would involve separate calls to frDirProcessEntries(). I thought of a couple of ways to achieve both goals but decided that they would unnecessarily complicate the discussion as we'll only be dealing with small disk images in this series. If you plan to use this code to handle arbitrarily large file systems you will probably want to consider solving this problem.

In the next post we'll tackle the last file system area, the file allocation table.