Part 1: Systems
Part 1 of the MPEG-1 standard covers systems, and is defined in ISO/IEC-11172-1.
MPEG-1 Systems specifies the logical layout and methods used to store the encoded audio, video, and other data into a standard bitstream, and to maintain synchronization between the different contents. This file format is specifically designed for storage on media, and transmission over data channels, that are considered relatively reliable. Only limited error protection is defined by the standard, and small errors in the bitstream may cause noticeable defects.
This structure was later named a program stream: "The MPEG-1 Systems design is essentially identical to the MPEG-2 Program Stream structure." This terminology is more popular, precise (differentiates it from a transport stream) and will be used here.
Elementary Streams
Elementary streams (ES) are the raw bitstreams of MPEG-1 audio and video, output by an encoder. These files can be distributed on their own, such as is the case with MP3 files.
Additionally, elementary streams can be made more robust by packetizing them, i.e. dividing them into independent chunks, and adding a cyclic redundancy check (CRC) checksum to each segment for error detection. This is the Packetized Elementary Stream (PES) structure.
System Clock Reference (SCR) is a timing value stored in a 33-bit header of each ES, at a frequency/precision of 90 kHz, with an extra 9-bit extension that stores additional timing data with a precision of 27 MHz. These are inserted by the encoder, derived from the system time clock (STC). Simultaneously encoded audio and video streams will not have identical SCR values, however, due to buffering, encoding, jitter, and other delay.
Program Streams
Program Streams (PS) are concerned with combining multiple packetized elementary streams (usually just one audio and video PES) into a single stream, ensuring simultaneous delivery, and maintaining synchronization. The PS structure is known as a multiplex, or a container format.
Program time stamps (PTS) exist in PS to correct the inevitable disparity between audio and video SCR values (time-base correction). 90 kHz PTS values in the PS header tell the decoder which video SCR values match which audio SCR values. PTS determines when to display a portion of an MPEG program, and is also used by the decoder to determine when data can be discarded from the buffer. Either video or audio will be delayed by the decoder until the corresponding segment of the other arrives and can be decoded.
PTS handling can be problematic. Decoders must accept multiple program streams that have been concatenated (joined sequentially). This causes PTS values in the middle of the video to reset to zero, which then begin incrementing again. Such PTS wraparound disparities can cause timing issues that must be specially handled by the decoder.
Display Time Stamps (DTS), additionally, are required because of B-frames. With B-frames in the video stream, adjacent frames have to be encoded and decoded out-of-order (re-ordered frames). DTS is quite similar to PTS, but instead of just handling sequential frames, it contains the proper time-stamps to tell the decoder when to decode and display the next B-frame (types of frames explained below), ahead of its anchor (P- or I-) frame. Without B-frames in the video, PTS and DTS values are identical.
Multiplexing
To generate the PS, the multiplexer will interleave the (two or more) packetized elementary streams. This is done so the packets of the simultaneous streams can be transferred over the same channel and are guaranteed to both arrive at the decoder at precisely the same time. This is a case of time-division multiplexing.
Determining how much data from each stream should be in each interleaved segment (the size of the interleave) is complicated, yet an important requirement. Improper interleaving will result in buffer underflows or overflows, as the receiver gets more of one stream than it can store (eg. audio), before it gets enough data to decode the other simultaneous stream (eg. video). The MPEG Video Buffer Verifier (VBV) assists in determining if a multiplexed PS can be decoded by a device with a specified data throughput rate and buffer size. This offers feedback to the muxer and the encoder, so that they can change the mux size or adjust bitrates as needed for compliance.
The PS, additionally, stores aspect ratio information which tells the decoder how much to stretch the height or width of a video when displaying it. Different display devices (such as computer monitors and televisions) have different pixel heights/widths, which will result in video encoded for one appearing "squished" when played on the other, unless the aspect ratio information in the PS is used to compensate.