19 July 2016
Posted by Sami Tolvanen, Software Engineer
Android uses multiple layers of protection to keep users safe. One of these layers is verified boot, which improves security by using cryptographic integrity checking to detect changes to the operating system. Android has alerted about system integrity since Marshmallow, but starting with devices first shipping with Android 7.0, we require verified boot to be strictly enforcing. This means that a device with a corrupt boot image or verified partition will not boot or will boot in a limited capacity with user consent. Such strict checking, though, means that non-malicious data corruption, which previously would be less visible, could now start affecting process functionality more.
By default, Android verifies large partitions using the dm-verity kernel driver, which divides the partition into 4 KiB blocks and verifies each block when read, against a signed hash tree. A detected single byte corruption will therefore result in an entire block becoming inaccessible when dm-verity is in enforcing mode, leading to the kernel returning EIO errors to userspace on verified partition data access.
This post describes our work in improving dm-verity robustness by introducing forward error correction (FEC), and explains how this allowed us to make the operating system more resistant to data corruption. These improvements are available to any device running Android 7.0 and this post reflects the default implementation in AOSP that we ship on our Nexus devices.
Using forward error correction, we can detect and correct errors in source data by shipping redundant encoding data generated using an error-correcting code. The exact number of errors that can be corrected depends on the code used and the amount of space allocated for the encoding data.
Reed-Solomon is one of the most commonly used error-correcting code families, and is readily available in the Linux kernel, which makes it an obvious candidate for dm-verity. These codes can correct up to ⌊t/2⌋ unknown errors and up to t known errors, also called erasures, when t encoding symbols are added.
A typical RS(255, 223) code that generates 32 bytes of encoding data for every 223 bytes of source data can correct up to 16 unknown errors in each 255 byte block. However, using this code results in ~15% space overhead, which is unacceptable for mobile devices with limited storage. We can decrease the space overhead by sacrificing error correction capabilities. An RS(255, 253) code can correct only one unknown error, but also has an overhead of only 0.8%.
An additional complication is that block-based storage corruption often occurs for an entire block and sometimes spans multiple consecutive blocks. Because Reed-Solomon is only able to recover from a limited number of corrupted bytes within relatively short encoded blocks, a naive implementation is not going to be very effective without a huge space overhead.
In the changes we made to dm-verity for Android 7.0, we used a technique called interleaving to allow us to recover not only from a loss of an entire 4 KiB source block, but several consecutive blocks, while significantly reducing the space overhead required to achieve usable error correction capabilities compared to the naive implementation.
Efficient interleaving means mapping each byte in a block to a separate Reed-Solomon code, with each code covering N bytes across the corresponding N source blocks. A trivial interleaving where each code covers a consecutive sequence of N blocks already makes it possible for us to recover from the corruption of up to (255 - N) / 2 blocks, which for RS(255, 223) would mean 64 KiB, for example.
An even better solution is to maximize the distance between the bytes covered by the same code by spreading each code over the entire partition, thereby increasing the maximum number of consecutive corrupted blocks an RS(255, N) code can handle on a partition consisting of T blocks to ⌈T/N⌉ × (255 - N) / 2.
Interleaving with distance D and block size B.
An additional benefit of interleaving, when combined with the integrity verification already performed by dm-verity, is that we can tell exactly where the errors are in each code. Because each byte of the code covers a different source block—and we can verify the integrity of each block using the existing dm-verity metadata—we know which of the bytes contain errors. Being able to pinpoint erasure locations allows us to effectively double our error correction performance to at most ⌈T/N⌉ × (255 - N) consecutive blocks.
For a ~2 GiB partition with 524256 4 KiB blocks and RS(255, 253), the maximum distance between the bytes of a single code is 2073 blocks. Because each code can recover from two erasures, using this method of interleaving allows us to recover from up to 4146 consecutive corrupted blocks (~16 MiB). Of course, if the encoding data itself gets corrupted or we lose more than two of the blocks covered by any single code, we cannot recover anymore.
While making error correction feasible for block-based storage, interleaving does have the side effect of making decoding slower, because instead of reading a single block, we need to read multiple blocks spread across the partition to recover from an error. Fortunately, this is not a huge issue when combined with dm-verity and solid-state storage as we only need to resort to decoding if a block is actually corrupted, which still is rather rare, and random access reads are relatively fast even if we have to correct errors.
Strictly enforced verified boot improves security, but can also reduce reliability by increasing the impact of disk corruption that may occur on devices due to software bugs or hardware issues.
The new error correction feature we developed for dm-verity makes it possible for devices to recover from the loss of up to 16-24 MiB of consecutive blocks anywhere on a typical 2-3 GiB system partition with only 0.8% space overhead and no performance impact unless corruption is detected. This improves the security and reliability of devices running Android 7.0.