worm-blossom

The endeavours of Aljoscha Meyer and Sammy Gwilym.

Work in Progress

You are viewing unfinished draft material right now.

Sufficiently Good Media Formats

Modern media file formats are bloated, and there are too many. A from-scratch implementation of a web browser’s media support would be a ludicrous undertaking. The rampant complexity in this space results in unnecessary fragility and dependence on larger vendors.

The Sufficiently Good Media Formats are a tiny family of media file formats that cover the media capabilities of modern applications: images, audio, and videos, supporting both lossless and lossy compression. The designs are simple, small, and reuse the same concepts across different media types; and the open specifications do not assume familiarity with digital signal processing.

New system designs can represent media via our file formats exclusively, significantly reducing complexity. Example use-cases include peer-to-peer systems, where the possibility of independent implementations is a core value. Another use-case is in archival work, where simplicity directly translates into resilience and durability.

Overview and Concepts

This document introduces and then specifies the Sufficiently Good Media Formats (just “our formats” from here on). The designs are shaped by two opposing virtues: simplicity and compression. The simplest file formats store raw data, but this results in large file sizes. Compression techniques leverage redundancy in the data to reduce file size, but doing so makes things less simple.

Whereas most file formats attempt to maximise compression at the expense of simplicity, our designs follow the Pareto principle: we take on the 20% of complexity that enables 80% of the efficiency. Those numbers are made up, of course; balancing the tradeoffs is more of an art than a science. But you get the idea.

Image formats, audio formats, and video formats are typically developed independently from each other. While similar compression techniques are at play in all of them, each does things in a slightly different way. This results in unnecessary conceptual overhead, and prohibits code reuse. Our formats, in contrast, are consistent in the techniques they employ, drastically lowering the overall amount of stuff going on.

Hence, the specification starts with a section on general techniques (you are reading it right now), and the media-specific parts at the end are quite small. But before we can describe the commonalities between image, audio, and video data, we first give a short introduction to how each of these media types can be represented by a computer.

Media Types

Humans hear sound by reacting to changes in air pressure over time. These changes are continuous, but computers can only store discrete data. The most common way to approximate the continuous signal with a computer is to sample the signal at regular time intervals. This is called (linear) pulse-code modulation (PCM), and it is how our files store sound data. A mono recording stores only one value per sample, whereas a stereo recording stores multiple values at each sample time, to be played back from different positions in space (for example the left and the right speaker in headphones).

Human sight is based on light falling onto three different kinds of receptors in the eye, each reacting most strongly to a particular colour: red, green, or blue. In computers, we represent a two-dimensional image by first dividing it into a regular grid of pixels, and then assigning to each pixel an intensity for each of red, green, and blue light. We often store a fourth value for each pixel, indicating opacity. The two-dimensional grid of pixels is interpreted as a one-dimensional sequence by going through the pixels row by row (left-to-right, top-to-bottom).
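The row-by-row interpretation of the pixel grid can be sketched in a few lines of Python; the function name and the (r, g, b, a) tuples are illustrative only, not part of the formats:

```python
# Flatten a 2D grid of pixels into the 1D sequence used throughout this
# document: row by row, left-to-right, top-to-bottom.
def flatten_row_major(grid):
    return [pixel for row in grid for pixel in row]

grid = [
    [(255, 0, 0, 255), (0, 255, 0, 255)],  # top row
    [(0, 0, 255, 255), (0, 0, 0, 0)],      # bottom row
]
flat = flatten_row_major(grid)
# flat[0] is the top-left pixel, flat[3] the bottom-right pixel.
```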

Videos are sequences of images (called frames), each to be displayed for a uniform span of time. Displaying 60 frames per second suffices for humans to perceive these discrete images as a continuous stream. Optionally, a video can include audio data as well.

Channels and Samples

The different media formats have a lot in common. Each collects information in sequential channels: red, green, blue, and alpha intensities for images; pulse-code audio data for sound; frames and audio data for video. Each channel stores samples at discrete, uniform intervals: pixels are samples in space, audio data and video frames are samples in time. And each sample is stored in some fixed number of bits: 8 bits per channel for images, 24 bits per audio sample, fixed-size images for video frames.

These specifications are not concerned with how to convert the continuous signals of the real world into channels of discrete samples. We assume that data is provided in the form of channels of discrete samples, and merely specify how to efficiently serialise such data in a file.

The first question of serialisation arises around channels: do you encode each channel one after the other, or do you interleave their samples? Our specifications do the latter. If we write X[n] for the n-th sample of channel X, then our formats would serialise the samples of three-channel data as follows: A[0], B[0], C[0], A[1], B[1], C[1], A[2], and so on.
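The interleaving of equal-length channels can be sketched as follows (a minimal illustration; the function name is not part of the specification):

```python
def interleave(channels):
    # channels: a list of equal-length sample sequences A, B, C, ...
    # Produces A[0], B[0], C[0], A[1], B[1], C[1], ...
    out = []
    for samples in zip(*channels):
        out.extend(samples)
    return out

a = [10, 11, 12]
b = [20, 21, 22]
c = [30, 31, 32]
interleave([a, b, c])  # [10, 20, 30, 11, 21, 31, 12, 22, 32]
```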

Channels might provide samples at different frequencies: in video, for example, there are typically 60 frames per second, but 48000 audio samples per second. When interleaving samples from channels with different frequencies, samples from a low-frequency channel are simply placed when it is their turn: Frame[0], Audio[0], Audio[1], ..., Audio[799], Frame[1], Audio[800], Audio[801], ..., Audio[1599], Frame[2], Audio[1600], and so on (800 is 48000 divided by 60).
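One way to express this "place each sample when it is its turn" rule is to always emit the sample that is due earliest, breaking ties in channel order so that a frame precedes the audio samples of its time slot. This is a sketch under simplified frequencies (one frame per two audio samples), not a normative algorithm:

```python
from fractions import Fraction

def interleave_rates(channels):
    # channels: a list of (samples, frequency) pairs. The n-th sample of a
    # channel with frequency f is due at time n / f; ties are broken in
    # channel order, so a video frame precedes its slot's audio samples.
    indices = [0] * len(channels)
    out = []
    total = sum(len(samples) for samples, _ in channels)
    while len(out) < total:
        best = min(
            (i for i, (samples, _) in enumerate(channels) if indices[i] < len(samples)),
            key=lambda i: (Fraction(indices[i], channels[i][1]), i),
        )
        out.append(channels[best][0][indices[best]])
        indices[best] += 1
    return out

frames = ["F0", "F1"]
audio = ["A0", "A1", "A2", "A3"]  # two audio samples per frame
interleave_rates([(frames, 1), (audio, 2)])
# ['F0', 'A0', 'A1', 'F1', 'A2', 'A3']
```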

Lossless Compression

We could define very simple file formats now: a header of metadata, followed by all raw, interleaved samples. But we can do better than that: lossless compression exploits structure and redundancy among samples to reduce the amount of information we need to encode. This is a rather abstract notion; more concretely, we make use of two properties.

First, when multiple consecutive samples have the same value, we can encode the value only once and then encode how often it repeats. This technique is called run-length encoding. In images, repeating samples are quite common, so this technique yields good savings.
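A minimal sketch of run-length encoding (the function name and the (value, count) pair representation are illustrative, not the encoding the formats actually serialise):

```python
def run_length_encode(samples):
    # Collapse runs of equal consecutive samples into (value, count) pairs.
    runs = []
    for s in samples:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([s, 1])  # start a new run
    return [(value, count) for value, count in runs]

run_length_encode([7, 7, 7, 7, 0, 0, 3])  # [(7, 4), (0, 2), (3, 1)]
```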

Second, we can use that small numbers can be encoded in fewer bits than large numbers. All we need to do is to convert the sequences of samples into sequences of smaller numbers, from which the original samples can be reconstructed. For that, we employ two techniques: sample prediction and channel decorrelation.

In sample prediction, we try to predict the value of each sample based on the previous samples of the same channel. Instead of serialising actual sample values, we serialise the differences between the actual sample and the predicted value. If predictions are mostly accurate, then the resulting sequence consists of mostly small numbers. Even when simply always predicting the value of the n-th sample to be equal to that of the (n-1)-th sample, this technique performs quite well.
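The simplest predictor mentioned above (each sample is predicted to equal its predecessor) can be sketched like this; the choice of 0 as the prediction for the very first sample is an assumption for illustration:

```python
def prediction_residuals(samples):
    # Serialise the difference between each sample and the prediction,
    # where the prediction is simply the previous sample (0 initially).
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def reconstruct(residuals):
    # Decoding inverts the process by accumulating the residuals.
    prev = 0
    samples = []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

prediction_residuals([100, 101, 101, 99])  # [100, 1, 0, -2]
```

Note how slowly varying data turns into mostly small residuals, which is exactly what the later encoding stage rewards.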

Whereas sample prediction exploits redundancy within a single channel, channel decorrelation exploits redundancy across multiple channels. In general, the idea is to separate multiple channels into those parts that are similar, and those parts that differ. So instead of encoding the actual channels of the media, we encode derived sequences of samples. For example, we can expect the left and right audio channel of a stereo recording to be fairly correlated. So instead of encoding the two channels independently, we encode the sum of the two channels, and the difference between the left audio channel and the average of the two channels. This latter derived sequence will consist of mostly small and similar values if the two original channels are fairly similar.
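The stereo example can be sketched with integer arithmetic; flooring the average keeps the transform lossless. The function names are illustrative, and the exact derived channels the formats use may differ:

```python
def decorrelate_stereo(left, right):
    # Encode the sum of the two channels, plus the difference between the
    # left channel and their (floored) average. Lossless, because
    # left = sum // 2 + side and right = sum - left.
    sums = [l + r for l, r in zip(left, right)]
    sides = [l - (l + r) // 2 for l, r in zip(left, right)]
    return sums, sides

def recorrelate_stereo(sums, sides):
    left = [s // 2 + d for s, d in zip(sums, sides)]
    right = [s - l for s, l in zip(sums, left)]
    return left, right

left = [100, 102, 98]
right = [101, 100, 99]
decorrelate_stereo(left, right)  # ([201, 202, 197], [0, 1, 0])
```

The side channel consists of small values precisely when the two original channels are similar.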

Before we define precisely how we efficiently encode sequences of small numbers with many repetitions, there is one more concept to introduce: lossy compression.

Lossy Compression

Lossy compression reduces file sizes by irrevocably discarding information. Our formats support two simple techniques for lossy compression: reducing the sample rate, and quantising samples to lower bit depths.

Reducing the sample rate is as simple as it sounds: for each channel, the metadata at the start of our file formats indicates the sample rate of the channel. For audio data, a reduced sample rate means fewer samples per second. For image data, a reduced sample rate means that a single sample provides its value for multiple consecutive pixels.
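For a reduced-rate channel, decoding amounts to holding each sample over several consecutive positions. A minimal sketch (the helper name is illustrative):

```python
def upsample_hold(samples, factor):
    # Each sample of a reduced-rate channel provides its value for
    # `factor` consecutive pixels (or time steps).
    out = []
    for s in samples:
        out.extend([s] * factor)
    return out

upsample_hold([10, 20, 30], 2)  # [10, 10, 20, 20, 30, 30]
```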

Sample quantisation simply means mapping samples of a high bit depth to a lower bit depth. In our formats, this happens after obtaining the decorrelated channels but before sample prediction; all samples of a channel are simply shifted by the same amount via a logical right shift. The metadata at the start of our file formats stores the bit width used for each channel.

When decoding a shifted sample, the original value cannot be uniquely reconstructed, because many different samples result in the same value after bitshifting. Instead, we map the shrunk samples back into the original channel depth by spacing them evenly. More precisely, let original_max denote the largest possible original sample (i.e., as many 1 bits as the original channel depth), and let compressed_max denote the largest possible compressed sample (i.e., as many 1 bits as the compressed channel depth). Then a compressed sample s is mapped back into the original channel domain via (s * original_max) / compressed_max, where / rounds down to the nearest natural number.
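Both directions of the quantisation step for unsigned samples can be sketched directly from the definitions above (a minimal illustration, not normative code):

```python
def quantise(sample, original_bits, compressed_bits):
    # Lossy step: drop low-order bits via a logical right shift.
    return sample >> (original_bits - compressed_bits)

def dequantise(s, original_bits, compressed_bits):
    # Map back by spacing compressed values evenly across the original
    # range: (s * original_max) / compressed_max, rounding down.
    original_max = (1 << original_bits) - 1
    compressed_max = (1 << compressed_bits) - 1
    return (s * original_max) // compressed_max

quantise(255, 8, 4)                     # 15
dequantise(15, 8, 4)                    # 255
dequantise(quantise(128, 8, 4), 8, 4)   # 136
```

Note that the extremes of the range survive a round trip exactly, while intermediate values land on evenly spaced representatives.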

TODO: make this definition work for signed samples as well (does it suffice to simply treat the samples as unsigned integers for the reconstruction?)

Encoding Samples

TODO general grouping mechanism: residual count of group, bit-width per residual, the residuals; run-length encoding for free; concrete encoding (TODO: benchmark the different candidate designs to make a choice; candidates are exp-golomb code for the residual count vs basmati codes for various max counts, and basmati code of the bit-width vs basmati code of the inverse of the bit-width); also make sure everything works with negative residuals

Segments and Seeking

TODO (unified sound and images, segments and seeking, interaction with sample prediction, no streaming).

WIP Notes

Images

Samples are assumed to be sRGB colours (the file formats don't care and have no support for gamma correction, but for display this matters).

Metadata: width, height, bits per channel, channel frequencies (for chroma subsampling). Some special purpose stuff: grayscale images and color map images, possibly special-case an alpha channel of zero bit width to indicate 1 rather than 0? Or allow (or mandate) specifying the value to use for any zero-bit-width channel. Possibly even for 1-bit channels?

Channel decorrelation via https://en.wikipedia.org/wiki/YCbCr

Audio

An audio file is simply a video file with zero image channels. It should have its own file extension, but that's basically it.

Video

Other

Notes on lossy encoders (dithering and colour quantisation for images, dithering and noise shaping for audio)