Multimedia Container Formats & Codecs

End users typically interact with multimedia container formats. Common container formats include:

  • MP4
  • RM (RealMedia, played by RealPlayer)
  • Matroska (MKV)
  • Flash Video (FLV, F4V)
  • AVI (by Microsoft, based on RIFF)
  • MPEG-TS (MPEG transport stream, used in HLS streaming)
  • Ogg
  • 3GP (old mobile devices)

Depending on the exact spec, a container format can wrap video, audio, metadata, text (e.g. subtitles), still pictures and, in some cases, arbitrary data.

Video and audio streams are encoded in video coding formats and audio coding formats, which are commonly called codecs.

Common audio codecs:

  • Free Lossless Audio Codec (FLAC)
  • Monkey’s Audio (APE)
  • Dolby Digital (AC-3 etc.)
  • MP3 (MPEG-1/MPEG-2 Audio Layer III; Thomson (later Technicolor) and Fraunhofer IIS both claim to have developed the MP3 format and to hold MP3 patents, and the two partnered for licensing)
  • AAC (Advanced Audio Coding, by ISO)
  • WMA (Windows Media Audio)

Common video codecs:

  • H.264/MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), techniques used:
  • Theora (typically stored in an Ogg container)

There’s also the concept of profile and level:

A profile restricts which encoding techniques are allowed. For example, the H.264 format includes the profiles baseline, main and high (and others). While P-slices (which can be predicted based on preceding slices) are supported in all profiles, B-slices (which can be predicted based on both preceding and following slices) are supported in the main and high profiles but not in baseline.

A level is a restriction on parameters such as maximum resolution and data rates.

With the high-level concepts out of the way, we turn to the actual formats of MP4 and H.264/AVC, which are specified in these parts of the MPEG-4 standard (ISO/IEC 14496):

  • Part 10: Advanced Video Coding (AVC)
  • Part 12: ISO base media file format
  • Part 14: MP4 file format
  • Part 15: Advanced Video Coding (AVC) file format

Part 12

[Figure: mp4 format]

An object in this terminology is a box.

Boxes start with a header which gives both size and type. The header permits compact or extended size (32 or 64 bits) and compact or extended types (32 bits or full Universal Unique IDentifiers, i.e. UUIDs). The standard boxes all use compact types (32-bit) and most boxes will use the compact (32-bit) size. Typically only the Media Data Box(es) need the 64-bit size. The size is the entire size of the box, including the size and type header, fields, and all contained boxes. This facilitates general parsing of the file.
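The header layout described above can be sketched in Python as follows. The conventions that a 32-bit size of 1 signals a following 64-bit size, that a size of 0 means the box runs to the end of the file, and that the type uuid signals a following 16-byte extended type come from the ISO base media file format rather than from the text above. The function names and the sample bytes are made up for illustration.

```python
import struct

def parse_box_header(buf, offset=0):
    """Return (box_type, box_size, header_size) for the box starting at offset."""
    # Compact header: 32-bit big-endian size followed by a 4-byte type.
    size, box_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # Extended size: a 64-bit size follows the type field.
        (size,) = struct.unpack_from(">Q", buf, offset + header_size)
        header_size += 8
    elif size == 0:
        # The box extends to the end of the data.
        size = len(buf) - offset
    if box_type == b"uuid":
        # Extended type: a 16-byte UUID follows.
        box_type = bytes(buf[offset + header_size:offset + header_size + 16])
        header_size += 16
    return box_type, size, header_size

def iter_boxes(buf, start=0, end=None):
    """Yield (box_type, offset, size, header_size) for sibling boxes in buf[start:end]."""
    end = len(buf) if end is None else end
    offset = start
    while offset + 8 <= end:
        box_type, size, header_size = parse_box_header(buf, offset)
        if size < header_size:
            break  # malformed box, stop instead of looping forever
        yield box_type, offset, size, header_size
        # The size covers the header too, so this jumps to the next sibling.
        offset += size

# Hypothetical sample: a 16-byte ftyp box, then an mdat box using the 64-bit size.
sample = (struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 512)
          + struct.pack(">I4sQ4s", 1, b"mdat", 20, b"abcd"))

for box in iter_boxes(sample):
    print(box)
```

Because the size includes the header and all contained boxes, a parser can skip any box it does not understand by advancing offset by size, and can recurse into a container box by calling iter_boxes on the range after its header.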

The definitions of boxes are given in the syntax description language (SDL) defined in MPEG-4 (see reference in Clause 2). Comments in the code fragments in this specification indicate informative material. The fields in the objects are stored with the most significant byte first, commonly known as network byte order or big-endian format. When fields smaller than a byte are defined, or fields span a byte boundary, the bits are assigned from the most significant bits in each byte to the least significant. For example, a field of two bits followed by a field of six bits has the two bits in the high order bits of the byte.
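As a small illustration of the byte and bit ordering described above, the following sketch reads a big-endian 32-bit field and splits one byte into a two-bit field followed by a six-bit field. The function name and the example values are hypothetical.

```python
def split_2_6(byte):
    """Split a byte into a 2-bit field (high-order bits) and a 6-bit field."""
    two_bit = (byte >> 6) & 0x03   # first field occupies the most significant bits
    six_bit = byte & 0x3F          # second field occupies the remaining six bits
    return two_bit, six_bit

# Most significant byte first: these four bytes encode the integer 15360.
timescale = int.from_bytes(bytes([0x00, 0x00, 0x3C, 0x00]), "big")

print(timescale)
print(split_2_6(0b10000101))
```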

A typical MP4 file can look like this:

ftyp major_brand isom compatible_brands isom iso2 avc1 mp41
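The brands in this box are four-character codes stored back to back, which is why they appear run together in a raw dump. A minimal sketch of reading the payload, assuming the ISO base media file format layout of major_brand, minor_version, then a list of compatible brands; the minor_version value 512 and the function name are hypothetical:

```python
import struct

def parse_ftyp_payload(payload):
    """Parse the bytes that follow the ftyp box header."""
    major_brand, minor_version = struct.unpack_from(">4sI", payload, 0)
    # The rest of the payload is a sequence of 4-byte compatible brands.
    brands = [payload[i:i + 4].decode("ascii")
              for i in range(8, len(payload) - 3, 4)]
    return major_brand.decode("ascii"), minor_version, brands

# Hypothetical payload matching the brands listed above.
payload = b"isom" + struct.pack(">I", 512) + b"isomiso2avc1mp41"
print(parse_ftyp_payload(payload))
```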


mvhd duration


tkhd layer



segment_duration is an integer that specifies the duration of this edit segment in units of the timescale in the Movie Header Box.

media_time is an integer containing the starting time within the media of this edit segment (in media time scale units, in composition time). If this field is set to –1, it is an empty edit. The last edit in a track shall never be an empty edit. Any difference between the duration in the Movie Header Box, and the track’s duration is expressed as an implicit empty edit at the end.

media_rate specifies the relative rate at which to play the media corresponding to this edit segment. If this value is 0, then the edit is specifying a ‘dwell’: the media at media-time is presented for the segment-duration. Otherwise this field shall contain the value 1.
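The three fields above can be read together as follows. This is only a sketch of the rules just quoted; the function name, the labels it returns and the movie timescale of 1000 are hypothetical, since the mvhd timescale of the example file is not shown here.

```python
def classify_edit(segment_duration, media_time, media_rate, movie_timescale):
    """Return (kind, seconds) for one edit list entry."""
    # segment_duration is expressed in the Movie Header Box timescale.
    seconds = segment_duration / movie_timescale
    if media_time == -1:
        kind = "empty"    # nothing from the media is presented
    elif media_rate == 0:
        kind = "dwell"    # the media at media_time is held for the whole segment
    else:
        kind = "normal"   # media plays from media_time at rate 1
    return kind, seconds

# Hypothetical entries, assuming a movie timescale of 1000 units per second.
print(classify_edit(2000, -1, 1, 1000))
print(classify_edit(500, 1024, 0, 1000))
print(classify_edit(634000, 0, 1, 1000))
```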


mdhd duration 9747968 timescale 15360 (9747968 / 15360 ≈ 634.6 seconds) language und (undetermined, ISO 639-2 code)

language declares the language code for this media. See ISO 639-2/T for the set of three character codes. Each character is packed as the difference between its ASCII value and 0x60. Since the code is confined to being three lower-case letters, these values are strictly positive.
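The duration arithmetic and the language packing can be checked with a short sketch. It assumes, following the ISO base media file format, that each of the three characters takes 5 bits of a 16-bit field whose top bit is padding; the function names are hypothetical.

```python
def pack_language(code):
    """Pack a three-letter lower-case ISO 639-2 code into a 15-bit value."""
    value = 0
    for ch in code:
        # Each character is stored as (ASCII value - 0x60) in 5 bits.
        value = (value << 5) | (ord(ch) - 0x60)
    return value

def unpack_language(packed):
    """Recover the three-letter code from the packed 16-bit field."""
    packed &= 0x7FFF  # ignore the padding bit
    return "".join(chr(((packed >> shift) & 0x1F) + 0x60) for shift in (10, 5, 0))

duration, timescale = 9747968, 15360
seconds = duration / timescale  # duration is counted in timescale units

print(hex(pack_language("und")))
print(unpack_language(0x55C4))
print(round(seconds, 1))
```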

hdlr handler_type vide name VideoHandler


vmhd color composition mode - not in use


dref empty

url empty

stbl size 296001

If the track that the Sample Table Box is contained in does reference data, then the following sub-boxes are required: Sample Description, Sample Size, Sample To Chunk, and Chunk Offset. Further, the Sample Description Box shall contain at least one entry. A Sample Description Box is required because it contains the data reference index field which indicates which Data Reference Box to use to retrieve the media samples. Without the Sample Description, it is not possible to determine where the media samples are stored. The Sync Sample Box is optional. If the Sync Sample Box is not present, all samples are sync samples.

VisualSampleEntry coding_name avc1 width 640 height 360 horizontal_resolution vertical_resolution


stts Decoding Time to Sample Box sample_count 19039 sample_delta 512 (19039 × 512 = 9747968, the mdhd duration)

The DTS (decoding time stamp) of a frame/sample/picture (interchangeable terms in this context) is the time at which it should be decoded. When no reordering is involved (no offsets in the ctts box), it is also the time at which the frame is displayed.

The run-length entry above means that the first sample is at time unit 0, the next at 512, the next at 1024, and so on.

Which sample data goes with each timestamp is not yet known at this point; stts only provides the timing.

If, for instance, sample_delta is doubled to 1024, the video plays at half speed. The audio track is not affected.

The frame rate also follows from the information above. The timescale is 15360 units per second and one frame is played every 512 units, so 15360 / 512 = 30 fps.

Setting sample_delta to 15360 gives one frame per second (some players do not support this because the value is too large).
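The stts arithmetic above can be reproduced with a few lines of Python. This is a sketch using the values from this file; the function name is hypothetical.

```python
from itertools import islice

def decode_times(entries):
    """Yield the DTS of every sample from (sample_count, sample_delta) entries."""
    dts = 0
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            yield dts
            dts += sample_delta

stts_entries = [(19039, 512)]   # the single entry shown above
timescale = 15360               # from mdhd

first_three = list(islice(decode_times(stts_entries), 3))
total_duration = sum(count * delta for count, delta in stts_entries)
fps = timescale / stts_entries[0][1]

print(first_three)      # DTS of the first three samples
print(total_duration)   # should equal the mdhd duration
print(fps)
```

Doubling sample_delta in stts_entries doubles total_duration and halves fps, which is the half-speed effect described above.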

DTS - Decode Time Stamp

The Decode Time Stamp (DTS) indicates the time at which an access unit should be instantaneously removed from the receiver buffer and decoded. It differs from the Presentation Time Stamp (PTS) only when picture reordering is used for B pictures. If DTS is used, PTS must also be provided in the bit stream.

PTS (or DTS) is entered in the bitstream at intervals not exceeding 700 ms. ATSC further constrains PTS (or DTS) to be inserted at the beginning of each access unit.

stss Sync Sample Box (key frames) entry_count = 155 sample_number 1, 136, 386, …
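A player typically uses this table when seeking: it needs the closest key frame at or before the requested sample. A minimal sketch, using only the first three sample numbers from the listing (the remaining entries are elided above); the function names are hypothetical.

```python
from bisect import bisect_right

def is_sync_sample(sample_number, sync_samples=None):
    """Return True if the sample is a sync sample (key frame)."""
    if sync_samples is None:
        # No Sync Sample Box present: every sample is a sync sample.
        return True
    return sample_number in sync_samples

def previous_sync_sample(sample_number, sync_samples):
    """Return the last sync sample at or before sample_number (1-based)."""
    index = bisect_right(sync_samples, sample_number) - 1
    if index < 0:
        raise ValueError("no sync sample at or before this sample")
    return sync_samples[index]

sync_samples = [1, 136, 386]  # first three entries only; the list must be sorted

print(is_sync_sample(136, sync_samples))
print(previous_sync_sample(200, sync_samples))
```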

ctts Composition Time to Sample Box
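In the ISO base media file format, the composition (presentation) time of a sample is its decoding time plus the offset stored in ctts. The sketch below shows how that reorders frames. The four-frame I, P, B, B sequence and its offsets are hypothetical values chosen for illustration, not taken from this file, and the function names are made up.

```python
def expand_runs(entries):
    """Expand run-length (sample_count, value) entries into one value per sample."""
    values = []
    for sample_count, value in entries:
        values.extend([value] * sample_count)
    return values

def decode_and_composition_times(stts_entries, ctts_entries):
    """Return a list of (dts, pts) pairs, one per sample, in decode order."""
    deltas = expand_runs(stts_entries)
    offsets = expand_runs(ctts_entries)
    times = []
    dts = 0
    for delta, offset in zip(deltas, offsets):
        times.append((dts, dts + offset))  # pts = dts + composition offset
        dts += delta
    return times

# Hypothetical data: four frames decoded in the order I, P, B, B.
stts_entries = [(4, 512)]
ctts_entries = [(1, 512), (1, 1536), (2, 0)]

times = decode_and_composition_times(stts_entries, ctts_entries)
# Sorting sample indices by pts gives the display order.
display_order = sorted(range(len(times)), key=lambda i: times[i][1])

print(times)
print(display_order)
```

With these values the P frame (index 1) is decoded second but displayed last, after the two B frames that depend on it.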

Written on May 4, 2017