Part 8 in a series of posts on recovering deleted JPEG files from a FAT file system.
In part 7, I demonstrated recovering deleted JPEG files by using their known pre-deletion locations in a FAT file system. In the real use case of recovering accidentally deleted files, the locations are unknown, making this approach impossible.
Recovering deleted files without knowing their location requires a method to find them within the unerased data. In this post, I'll show how the structure of a JPEG file can be used to do just that.
JPEG File Structure
Generally speaking, files meant to be processed programmatically employ some form of deterministic structure. Two common approaches are to partition the file into well-defined segments or to use an embedded catalog to record the file's contents (similar to a file system directory). Regardless of the approach, this deterministic structure can be used to reconstitute a file from its residual data.
In JPEG's case, the segmentation approach is used. The official JPEG format is specified in Annex B of the ISO/IEC International Standard 10918-1 - otherwise known as the JPEG Interchange Format (JIF). The specification defines:
- a variety of segment types to store metadata, compressed image data, etc.
- the combinations of segment types that form valid JPEG files.
- unique two-byte markers used to demarcate segments and other key aspects of a JPEG file's structure.
For the purposes of this discussion, the markers are particularly important.
Every two-byte marker consists of the value 0xFF followed by a non-zero value representing the marker's type. In some cases, markers may be preceded by a series of 0xFF "fill bytes" to meet alignment requirements - these fill bytes can be ignored. To avoid false markers in segment payloads (e.g. compressed image data), all non-marker related 0xFF values must be followed by 0x00 to escape them - the null bytes should be ignored during processing. This simple marker scheme allows a JPEG file to be parsed without having to interpret each constituent segment.
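The marker rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete parser: the function name `find_markers` is my own, and it assumes the escaping rule holds for every byte it scans, so in practice it is best applied to compressed image data and not to arbitrary segment payloads.

```python
def find_markers(data: bytes):
    """Yield (offset, marker_type) for each marker found in data.

    offset is the position of the 0xFF byte that immediately
    precedes the marker's type byte.
    """
    i = 0
    n = len(data)
    while i + 1 < n:
        if data[i] != 0xFF:
            i += 1
            continue
        # Skip a run of 0xFF fill bytes; the last 0xFF belongs to the marker.
        j = i
        while j + 1 < n and data[j + 1] == 0xFF:
            j += 1
        if j + 1 >= n:
            break  # trailing 0xFF with no type byte
        kind = data[j + 1]
        if kind != 0x00:  # 0xFF 0x00 is an escaped data byte, not a marker
            yield j, kind
        i = j + 2


# Made-up sample: SOI, a data byte, an escaped 0xFF, fill bytes, then EOI.
sample = bytes([0xFF, 0xD8, 0x12, 0xFF, 0x00, 0x34, 0xFF, 0xFF, 0xFF, 0xD9])
for offset, kind in find_markers(sample):
    print(f"0x{offset:04X}: 0xFF_{kind:02X}")
```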
Apparently, the JIF format is rarely used in practice due to its complexity. Instead, two simplified variants - JFIF and EXIF - are commonly used. Both JFIF and EXIF utilize JIF's built-in extension mechanism and marker values. This means that a program that understands JIF markers can determine the structure of both JFIF and EXIF files.
An Example
To further understand the structure of a JPEG file, let's examine one of the test image files. Inspecting the first 128 bytes of the file 4.1.01.jpg using hexdump results in:
    $ hexdump -n 128 -s 0 -C images/4.1.01.jpg
    00000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 |......JFIF.....H|
    00010 00 48 00 00 ff db 00 43 00 03 02 02 03 02 02 03 |.H.....C........|
    00020 03 03 03 04 03 03 04 05 08 05 05 04 04 05 0a 07 |................|
    00030 07 06 08 0c 0a 0c 0c 0b 0a 0b 0b 0d 0e 12 10 0d |................|
    00040 0e 11 0e 0b 0b 10 16 10 11 13 14 15 15 15 0c 0f |................|
    00050 17 18 16 14 18 12 14 15 14 ff db 00 43 01 03 04 |............C...|
    00060 04 05 04 05 09 05 05 09 14 0d 0b 0d 14 14 14 14 |................|
    00070 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 |................|
First, we see that the file starts off with the marker 0xFF_D8. This is the Start-of-Image marker (SOI) that must begin all JPEG files. SOI markers stand alone and do not have an associated segment.
Next comes the marker 0xFF_E0, which identifies an APP0 application-specific segment. Application segments are JIF's built-in extension mechanism for embedding application-specific information inside JPEG files. The JIF standard reserves 16 markers (0xFF_E0 to 0xFF_EF) for application segments, which are available for general use - the JIF standard makes no attempt to assign application segments to specific applications. By convention, JFIF files use an APP0 segment while EXIF files use an APP1 segment. To guard against other applications using the same application segments, both JFIF and EXIF identify themselves by including the ASCII string 'JFIF' or 'Exif' at byte offsets 4 through 7 of the segment (counting from zero at the marker's first byte). In this case, the APP0 marker sits at file offset 2, and we see that the segment indeed contains the string 'JFIF' at file offsets 6 through 9.
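As a rough sketch of that identification check, the snippet below tests the marker and the four identifier bytes that follow the length field. The function name `identify_app_segment` is hypothetical, and the identifier is compared case-insensitively on the assumption that EXIF files may store it as 'Exif'. The sample bytes are the first 20 bytes of the hexdump above.

```python
def identify_app_segment(data: bytes, marker_offset: int) -> str:
    """Classify the application segment whose marker starts at marker_offset."""
    marker = data[marker_offset:marker_offset + 2]
    # Identifier occupies offsets 4..7 counted from the marker's first byte
    # (2 marker bytes + 2 length bytes precede it).
    ident = data[marker_offset + 4:marker_offset + 8].upper()
    if marker == b"\xff\xe0" and ident == b"JFIF":
        return "JFIF"
    if marker == b"\xff\xe1" and ident == b"EXIF":
        return "EXIF"
    return "unknown"


header = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 00 48 00 00"
)
print(identify_app_segment(header, 2))  # the APP0 marker is at file offset 2
```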
After the marker, each segment begins with a two-byte, big-endian length parameter. The length counts the two length bytes themselves but excludes the marker. The output above indicates that the APP0 segment, whose marker is at offset 0x02, is 0x10 bytes long, and sure enough the next marker is found at offset 0x14 (0x02 + 2 + 0x10). This time the marker 0xFF_DB indicates the beginning of a quantization table segment (DQT) that is 0x43 bytes long. This is followed by another DQT segment at offset 0x59 (0x14 + 2 + 0x43).
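The offset arithmetic above generalizes to a simple loop: read a marker, read its length, and jump ahead by 2 + length. The following is a minimal sketch with hypothetical names (`walk_segments`, `STANDALONE`). It stops at the first SOS marker, and the sample data reuses the real APP0 bytes from the hexdump but pads the two DQT payloads with zeros as placeholders.

```python
# Markers with no segment: SOI, EOI (RSTn and TEM added from the JPEG
# standard; they do not appear in the example file's header).
STANDALONE = {0xD8, 0xD9, 0x01} | set(range(0xD0, 0xD8))


def walk_segments(data: bytes, offset: int = 0):
    """Yield (offset, marker_type, length) for each marker; length is None
    for stand-alone markers. Stops after yielding an SOS marker (0xDA)."""
    while offset + 2 <= len(data):
        if data[offset] != 0xFF:
            raise ValueError(f"expected marker at offset 0x{offset:X}")
        kind = data[offset + 1]
        if kind in STANDALONE:
            yield offset, kind, None
            offset += 2
            continue
        if offset + 4 > len(data):
            break  # truncated length field
        length = int.from_bytes(data[offset + 2:offset + 4], "big")
        yield offset, kind, length
        if kind == 0xDA:
            break  # entropy-coded data follows; its length is not stored
        offset += 2 + length  # marker (2 bytes) + length-prefixed segment


sample = (
    bytes.fromhex("ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 00 48 00 00")
    + b"\xff\xdb\x00\x43" + bytes(0x41)  # DQT with placeholder payload
    + b"\xff\xdb\x00\x43" + bytes(0x41)  # second DQT, placeholder payload
    + b"\xff\xd9"
)
for off, kind, length in walk_segments(sample):
    size = "-" if length is None else f"0x{length:02X}"
    print(f"0x{off:04X}  0xFF_{kind:02X}  {size}")
```

With this sample the loop visits offsets 0x00, 0x02, 0x14, 0x59 and 0x9E, matching the positions worked out by hand.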
Similarly analyzing the remainder of the file reveals the following markers and segments.
Offset | Marker | Segment? | Length (bytes) | Description |
---|---|---|---|---|
0x0000 | 0xFF_D8 (SOI) | N | - | Start of image |
0x0002 | 0xFF_E0 (APP0) | Y | 0x10 | Application segment |
0x0014 | 0xFF_DB (DQT) | Y | 0x43 | Quantization table |
0x0059 | 0xFF_DB (DQT) | Y | 0x43 | Quantization table |
0x009E | 0xFF_C0 (SOF0) | Y | 0x11 | Start of frame |
0x00B1 | 0xFF_C4 (DHT) | Y | 0x1D | Huffman table |
0x00D0 | 0xFF_C4 (DHT) | Y | 0x3E | Huffman table |
0x0110 | 0xFF_C4 (DHT) | Y | 0x1B | Huffman table |
0x012D | 0xFF_C4 (DHT) | Y | 0x37 | Huffman table |
0x0166 | 0xFF_DA (SOS) | Y | Unspecified | Start of scan |
0x5B7D | 0xFF_D9 (EOI) | N | - | End of image |
Houston, we have a problem
Notice in the table above that the SOS segment has an unspecified length. Based on my reading of the JIF specification and other references, the SOS header carries the usual two-byte length, but that length covers only the header; the length of the entropy-coded image data that follows is not explicitly specified. Instead, the data is terminated by an EOI or other marker. I suspect this was done to allow encoders to generate JPEG files as images are compressed - in this case the length of the compressed data is unknown when the SOS is started.
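Because the end of the compressed data can only be found by scanning, a reader has to look for the first real marker while stepping over escaped 0xFF bytes. The sketch below illustrates the idea under a few assumptions: `find_scan_end` is a hypothetical name, `start` is taken to be the offset of the first byte after the SOS header, and restart markers (0xFF_D0 to 0xFF_D7, defined in the JPEG standard but not discussed above) are treated as part of the scan.

```python
def find_scan_end(data: bytes, start: int) -> int:
    """Return the offset of the first marker that ends the entropy-coded
    data beginning at start, or -1 if no such marker is found."""
    i = start
    while True:
        i = data.find(b"\xff", i)
        if i == -1 or i + 1 >= len(data):
            return -1
        nxt = data[i + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            i += 2  # escaped 0xFF or restart marker: still inside the scan
        elif nxt == 0xFF:
            i += 1  # fill byte: the next 0xFF may start the real marker
        else:
            return i  # e.g. EOI (0xFF_D9) or the marker of a following segment


# Made-up scan data: escaped 0xFF, a restart marker, a fill byte, then EOI.
scan = bytes([0x12, 0xFF, 0x00, 0x34, 0xFF, 0xD0, 0x56, 0xFF, 0xFF, 0xD9])
print(find_scan_end(scan, 0))
```

The result is only meaningful if the bytes being scanned really are one file's data laid out contiguously.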
This presents a problem for recovering deleted JPEG files as it means their structure isn't completely deterministic. In the next post, I'll discuss the limitations this imposes and demonstrate how markers can be used to recover contiguous deleted files.