Subject: MPEG-FAQ: multimedia compression [3/9]
This article was archived around: 9 Nov 1996 09:32:59 GMT
Version: v 4.1 96/06/02
1. a low-cost encoder which only possesses frame
motion estimation may use dct_type to decorrelate
the prediction error of a prediction which is
inherently field-like in character
2. an intelligent encoder realizes that it is more bit
efficient to signal frame prediction with field
dct_type for the prediction error, than it is to signal
a field prediction.
Field prediction, field dct_type: A typical scenario. A field
prediction tends to form a field-correlated prediction error.
Frame prediction, frame dct_type: A typical scenario. A frame
prediction tends to form a frame-correlated prediction error.
Field prediction, frame dct_type: Makes little sense. If the encoder
went through the trouble of finding a field prediction in the first
place, why select frame organization for the prediction error?
prediction modes now include field, frame, Dual Prime, and 16x8 MC.
The combinations for Main Profile and Simple Profile are shown below.

Frame pictures:

  Frame        same as MPEG-1, with possibly different
               treatment of prediction error via dct_type
  Field        Two independently coded predictions are
               made: one for the 8 lines which correspond
               to the top field, another for the 8 bottom
               field lines.
  Dual Prime   Two independently coded predictions are
               made: one for the 8 lines which correspond
               to the top field, another for the 8 bottom
               field lines. Uses averaging of two 16x8
               prediction blocks from fields of opposite
               parity to form a prediction for the top and
               bottom 8 lines. A second vector is derived
               from the first vector coded in the bitstream.

Field pictures:

  Field        same as MPEG-1, with possibly different
               treatment of prediction error via dct_type
  16x8 MC      Two independently coded predictions are
               made: one for the 8 lines which correspond
               to the top field, another for the 8 bottom
  Dual Prime   A single prediction is constructed from the
               average of two 16x16 predictions taken from
               fields of opposite parity.
concealment motion vectors can be transmitted in the headers of intra
macroblocks to help error recovery. When the macroblock data that the
concealment motion vectors are intended for becomes corrupt, these
vectors can be used to specify a concealment 16x16 area to be extracted
from the previous picture. These vectors do not affect the normal
decoding process, except for motion vector predictions.
Additional chroma_format for 4:2:2 and 4:4:4 pictures. Like MPEG-1,
Main Profile syntax is strictly limited to the 4:2:0 format; however,
the 4:2:2 format is the basis of the 4:2:2 Profile (aka Studio Profile).
In 4:2:2 mode, all syntax essentially remains the same except where
matters of block count are concerned. A coded_block_pattern extension
was added to handle signaling of the extra two prediction error
blocks. The 4:4:4 format is currently undefined in any Profile.
multiplex order within Macroblock:

  4:2:0 (6 blocks)    mainstream television, consumer entertainment.
  4:2:2 (8 blocks)    studio production environments, professional
                      editing equipment, distribution and servers
  4:4:4 (12 blocks)
Non-linear macroblock quantization was introduced in MPEG-2 to increase
the precision of quantization at high bit rates, while increasing the
dynamic range for low bit rate use where larger step size is needed.
The quantization_scale_code may be selected between a linear (MPEG-1
style) or non-linear scale on a picture (frame or field) basis. The new
non-linear range corresponds to a dynamic range of 0.5 to 56 with
respect to the linear (MPEG-1 style) range of 1 to 31.
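
The mapping from quantization_scale_code to step size can be sketched as
follows. This is only an illustration: the non-linear table is reproduced
from memory of the MPEG-2 video specification and should be checked against
the standard, and the function name quantiser_scale is made up for this
example. Halving the returned value gives the scale "with respect to the
MPEG-1 style range" used in the text above.

```python
# Non-linear quantiser scale values for codes 1..31 (assumed, from memory
# of the MPEG-2 video specification; verify before relying on them).
NONLINEAR_SCALE = [
    1, 2, 3, 4, 5, 6, 7, 8,
    10, 12, 14, 16, 18, 20, 22, 24,
    28, 32, 36, 40, 44, 48, 52, 56,
    64, 72, 80, 88, 96, 104, 112,
]


def quantiser_scale(code, nonlinear):
    """Return the quantiser scale for a 5-bit code (1..31).

    nonlinear=False gives the linear (MPEG-1 style) scale 2, 4, ..., 62.
    nonlinear=True looks the value up in the non-linear table.
    """
    if not 1 <= code <= 31:
        raise ValueError("quantization_scale_code must be in 1..31")
    if nonlinear:
        return NONLINEAR_SCALE[code - 1]
    return 2 * code


if __name__ == "__main__":
    for code in (1, 8, 16, 24, 31):
        print(code, quantiser_scale(code, False), quantiser_scale(code, True))
```

The non-linear scale is finer than the linear one for small codes and
coarser for large codes, which is the trade-off the paragraph describes.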
alternate scan introduced a new run-length entropy scanning pattern
generally more efficient for the statistics of interlaced video
signals. Zig-zag scan is the appropriate choice for progressive
intra_dc_precision: the MPEG-1 DC value is always quantized to a
precision of 8 bits. MPEG-2 introduced 9, 10, and 11 bit precision set
on a picture basis to increase the accuracy of the DC component, which
by its very nature has the most significant contribution towards picture
quality. Particularly useful at high bit rates to reduce
posterization. Main and Simple Profiles are limited to 8, 9, or 10 bits
of precision. The 4:2:2 High Profile, which is geared towards higher
bitrate applications (up to 50 Mbits/sec), permits all values (up to 11
bits).
separate quantization matrices for Y and C: luminance (Y) and
chrominance (Cb,Cr) share a common intra and non-intra DCT coefficient
quantization 8x8 matrix in MPEG-1 and MPEG-2 Main and Simple Profiles.
The 4:2:2 Profile permits separate quantization matrices to be
downloaded for the luminance and chrominance blocks. Cb and Cr still
share a common matrix.
intra_vlc_format: one of two tables may now be selected at the picture
layer for variable length codes (VLCs) of AC run-length symbols in
Intra blocks. The first table is identical to that specified for
MPEG-1 (dct_coef_next). The newer second table is more suited to the
statistics of Intra coded blocks, especially in I-frames. The best
illustration of the difference between Table 0 and Table 1 is the
length of the symbol which represents End of Block (EOB). In Table 0,
EOB is 2 bits. In Table 1, it is 4 bits. The implication is that the
EOB symbol is 2^-n probable within the block (n being the EOB code
length), or from an alternative perspective, there are an average of
3 to 4 non-zero AC coefficients in Non-intra blocks, and 9 to 16
coefficients in Intra blocks. The VLC tree of Table 1 was intended to
be a subset of Table 0, to aid hardware implementations. Both tables
have 113 VLC entries (or events).
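
The 2^-n argument can be made concrete with a small sketch. It assumes an
idealized entropy code in which a codeword of n bits corresponds to a
symbol of probability 2^-n; real VLC tables only approximate this, and the
function names are made up for illustration.

```python
def implied_probability(code_length_bits):
    """Probability an ideal entropy coder would assign to an n-bit symbol."""
    return 2.0 ** -code_length_bits


def implied_symbols_per_block(code_length_bits):
    """If EOB has probability p, roughly one symbol in 1/p is an EOB,
    so a block carries about 1/p symbols counting the EOB itself."""
    return 1.0 / implied_probability(code_length_bits)


if __name__ == "__main__":
    for name, bits in (("Table 0", 2), ("Table 1", 4)):
        print(name, implied_probability(bits), implied_symbols_per_block(bits))
```

With a 2-bit EOB this gives about 4 symbols per block, and with a 4-bit EOB
about 16, which are the upper ends of the ranges quoted above.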
escape: When no entry in the VLC table exists for an AC Run-Level
symbol, an escape code can be used to represent the symbol. Since there
are only 63 positions within an 8x8 block following the first
coefficient, and the dynamic range of the quantized DCT coefficients is
[-2047,+2047], there are (63*2047), or 128,961 possible combinations of
Run and Level (the sign bit of the Level follows the VLC). Only the 113
most common Run-Level symbols are represented in Table 0 or Table 1.
The length of the escape symbol (which is always 6 bits) plus the Run
and Level values in MPEG-1 could be 20 or 28 bits in length. The 20 bit
escape describes levels in the range [-127,+127]. The 28 bit double
escape has a range of [-255,+255]. MPEG-2 increased the span to the
full dynamic range of quantized DCT coefficients, [-2047,+2047], and
simplified the escape mechanism with a single representation for this
event. The total length of the MPEG-2 escape codeword is 24 bits (a
6-bit VLC followed by a 6-bit Run value and a 12-bit Level value). It
was an assumption by MPEG-1 designers that no quantized DCT coefficient
would need greater representation than 10 bits [-255,+255]. Note: the
MPEG-2 escape mechanism does not permit the value -2048 to be
represented.
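
A minimal sketch of how the 24-bit MPEG-2 escape codeword is laid out.
The function name pack_mpeg2_escape is hypothetical, and the sketch assumes
the escape prefix is the bit string 000001 and that the Level is stored as
a 12-bit two's complement number; check both against the standard before
use.

```python
ESCAPE_PREFIX = "000001"  # assumed 6-bit escape VLC


def pack_mpeg2_escape(run, level):
    """Return the escape codeword as a string of 24 '0'/'1' characters."""
    if not 0 <= run <= 63:
        raise ValueError("run must fit in 6 bits")
    if level == 0 or not -2047 <= level <= 2047:
        # A zero level is never coded, and -2048 is not permitted.
        raise ValueError("level must be non-zero and within [-2047, +2047]")
    run_bits = format(run, "06b")
    level_bits = format(level & 0xFFF, "012b")  # 12-bit two's complement
    return ESCAPE_PREFIX + run_bits + level_bits


if __name__ == "__main__":
    print(pack_mpeg2_escape(5, -3))
    print(63 * 2047)  # number of (Run, |Level|) combinations quoted above
```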
mismatch control: The arithmetic results of all stages are defined
exactly by the normative MPEG decoding process, with the single
exception of the Inverse Discrete Cosine Transform (IDCT). This stage
can be implemented with a wide variety of IDCT implementations. Some
are more suited for software, others for programmable hardware, and
others still for hardwired hardware designs. The IDCT reference formula
in the MPEG specification would, if directly implemented, consume at
least 1024 multiply and 1024 addition operations for every block. A
wide variety of fast algorithms exist which can reduce the count to
less than 200 multiplies and 500 adds per block by exploiting the
innate symmetry of the cosine basis functions. A typical fast IDCT
algorithm would be dwarfed by the cost of the other decoder stages
combined. Each fast IDCT algorithm has different quantization error
statistics (fingerprint), although subtle when the precision of the
arithmetic is, for example, at least 16-bits for the transform
coefficients and 24-bits for intermediate dot product values.
Therefore, MPEG cannot standardize a single fast IDCT algorithm. The
accuracy can be defined only statistically. The IEEE 1180
recommendation (December 1990) defines the error tolerance between an
ideal direct-matrix floating point implementation (a direct
implementation of the MPEG reference formula) and the test IDCT.
Mismatch control attempts to reduce the drift between different IDCT
algorithms by eliminating bit patterns which statistically have the
greatest contribution towards mismatches between the variety of
methods. The reconstructions of two decoders will begin to diverge over
time since their respective IDCT designs will reconstruct occasional,
slightly different 8x8 blocks.
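
The operation count quoted above for a direct implementation can be
checked with a small sketch. It assumes the usual separable form of the
8x8 IDCT (eight 8-point row transforms followed by eight 8-point column
transforms) in ordinary floating point. It is not the IEEE 1180 test
itself, and all names are made up for illustration.

```python
import math

N = 8
# BASIS[x][u] = (C(u) / 2) * cos((2x + 1) * u * pi / 16), with C(0) = 1/sqrt(2)
BASIS = [
    [(math.sqrt(0.5) if u == 0 else 1.0) / 2.0
     * math.cos((2 * x + 1) * u * math.pi / 16.0)
     for u in range(N)]
    for x in range(N)
]


def idct_1d(coeffs, counter):
    """Direct 8-point IDCT; counts one multiply and one add per term."""
    out = []
    for x in range(N):
        acc = 0.0
        for u in range(N):
            acc += BASIS[x][u] * coeffs[u]
            counter["mul"] += 1
            counter["add"] += 1
        out.append(acc)
    return out


def idct_2d(block):
    """block[v][u] holds coefficient F(u, v); returns (samples[y][x], counts)."""
    counter = {"mul": 0, "add": 0}
    rows = [idct_1d(row, counter) for row in block]
    cols = [idct_1d([rows[v][x] for v in range(N)], counter) for x in range(N)]
    samples = [[cols[x][y] for x in range(N)] for y in range(N)]
    return samples, counter


if __name__ == "__main__":
    dc_only = [[0.0] * N for _ in range(N)]
    dc_only[0][0] = 8.0
    samples, counts = idct_2d(dc_only)
    print(samples[0][0], counts)
```

A block holding only a DC coefficient of 8 reconstructs to a flat block of
ones, and the counters come out at 1024 multiplies and 1024 additions.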
MPEG-1's mismatch control method is known canonically as Oddification,
since it forces all quantized DCT coefficients to odd values. It
is a slight improvement over its predecessor in H.261. MPEG-2 adopted
a different method called, again canonically, LSB Toggling, further
reducing the likelihood of mismatch. Toggling affects only the Least
Significant Bit (LSB) of the 63rd AC DCT coefficient (the highest
frequency in the DCT matrix). Another significant difference between
MPEG-1 and MPEG-2 mismatch control is that, in MPEG-1, oddification is
performed on the quantized DCT coefficients, whereas in MPEG-2,
toggling is performed on the DCT coefficients after inverse
quantization. MPEG-1's mismatch control method favors programmable
implementation since a block of DCT coefficients when quantized.
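
LSB Toggling can be sketched as follows, assuming the rule as commonly
described for MPEG-2: sum all 64 inverse-quantized coefficients, and if the
sum is even, flip the least significant bit of the last (highest frequency)
coefficient so that the sum becomes odd. The function name is made up for
illustration.

```python
def mismatch_control(block):
    """block: 8x8 list of lists of ints, after inverse quantization.

    Returns a copy in which the sum of all coefficients is odd.
    """
    out = [row[:] for row in block]
    total = sum(sum(row) for row in out)
    if total % 2 == 0:
        if out[7][7] % 2:      # odd value: clear the LSB
            out[7][7] -= 1
        else:                  # even value: set the LSB
            out[7][7] += 1
    return out


if __name__ == "__main__":
    zeros = [[0] * 8 for _ in range(8)]
    print(mismatch_control(zeros)[7][7])
```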
The two chrominance pictures (Cb, Cr) possess only half the resolution
in both the horizontal and vertical direction as the luminance picture
(Y). This is the definition of the 4:2:0 chroma format. Most
television displays require that at least the vertical chrominance
resolution matches the luminance (4:2:2 chroma format). Computer
displays may further still demand that the horizontal resolution also
be equivalent (4:4:4 chroma format). There are a variety of filtering
methods for interpolating the chrominance samples to match the sample
density of luminance. However, the official location or center of the
lower resolution chrominance sample should influence the filter design
(relative tap weights), otherwise the chrominance plane can appear to
be shifted by a fractional sample in the wrong direction.
The subsampled MPEG-1 chroma position has a center exactly half way
between the four nearest neighboring luminance samples. To be
consistent with the subsampled chrominance positions of 4:2:2
television signals, MPEG-2 moved the center of the chrominance samples
to be co-located horizontally with the luminance samples.
copyright_id extension can identify whether a sequence or subset of
frames within the sequence is copyrighted, and provides a unique 64-bit
copyright_id_number registered with the ISO/IEC.
Syntax can now signal frame sizes as large as 16383 x 16383. Since
MPEG-1 employed a meager 12 bits to describe horizontal_size and
vertical_size, the range was limited to 4095x4095. However, MPEG's
Levels prescribe important interoperability points for practical
decoders. Constrained Parameters MPEG-1 and MPEG-2 Low Level limit the
sample rate to 352x240x30 Hz. MPEG-2's Main Level defines the limit at
720x480x30 Hz. Of course, this is simply a restriction on the
product of horizontal_size, vertical_size, and frame_rate. The Level
also places separate restrictions on each of these three parameters.
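
The sample rate limits above are easy to work out. The sketch below checks
only the product of the three parameters, using the figures quoted in the
text; the separate per-parameter restrictions of each Level are not
modelled, and the names are made up for illustration.

```python
# Luminance sample rate limits (samples/sec) taken from the figures above.
LEVEL_SAMPLE_RATE = {
    "Low": 352 * 240 * 30,
    "Main": 720 * 480 * 30,
}


def max_dimension(field_bits):
    """Largest size that a field of the given width in bits can signal."""
    return (1 << field_bits) - 1


def fits_sample_rate(width, height, frame_rate, level):
    """True if width * height * frame_rate is within the Level's limit."""
    return width * height * frame_rate <= LEVEL_SAMPLE_RATE[level]


if __name__ == "__main__":
    print(max_dimension(12), max_dimension(14))
    print(LEVEL_SAMPLE_RATE)
    print(fits_sample_rate(720, 576, 25, "Main"))
```

Note that 720x576 at 25 Hz has exactly the same sample rate as 720x480 at
30 Hz, so both fit the Main Level figure.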
Reflecting the more television oriented manner of MPEG-2, the optional
sequence_display_extension() header can specify the chromaticity of the
source video signal as it was prior to representation by MPEG syntax.
This information includes: whether the original video_format was
composite or component, the opto-electronic transfer_characteristics,
and RGB->YCbCr matrix_coefficients. The picture_display_extension()
provides more localized source composite video characteristics on a
frame by frame basis (not field-by-field), with the syntax elements:
field_sequence, sub_carrier_phase, and burst_amplitude. This
information can be used by the display's post-processing stage to
reproduce a more refined display sequence.
Optional pan & scan syntax was introduced which tells a decoder on a
frame-by-frame basis how to, for example, window a 4:3 image within the
wider 16:9 aspect ratio of the coded frame. The vertical pan offset
can be specified to within 1/16th pixel accuracy.
How does MPEG syntax facilitate parallelism ?
For MPEG-1, slices may consist of an arbitrary number of macroblocks.
They can be independently decoded once the picture header side
information is known. For parallelism below the slice level, the coded
bitstream must first be mapped into fixed-length elements. Further,
since macroblocks have coding dependencies on previous macroblocks
within the same slice, the data hierarchy must be pre-processed down to
the layer of DC DCT coefficients. After this, blocks may be
independently inverse quantized and transformed, temporally predicted,
and reconstructed to buffer memory. Parallelism is usually more of a
concern for encoders. In many encoders today, block matching (motion
estimation) and some rate control stages (such as activity and/or
complexity measures) are processed for macroblocks independently.
Finally, with the exception that all macroblock rows in Main Profile
MPEG-2 bitstreams must contain at least one slice, an encoder has the
freedom to choose the slice structure.
What is the MPEG color space and sample precision?
MPEG strictly specifies the YCbCr color space, not YUV or YIQ or YPbPr
or YDrDb or any of the many other fine varieties of color difference spaces.
Regardless of any bitstream parameters, MPEG-1 and MPEG-2 Video Main
Profile specify the 4:2:0 chroma_format, where the color difference
channels (Cb, Cr) have half the "resolution" or sample grid density in
both the horizontal and vertical direction with respect to luminance.
MPEG-2 High Profile includes an option for 4:2:2 chroma_format, as does
the MPEG 4:2:2 Profile (a.k.a. Studio Profile) naturally. Applications
for the 4:2:2 format can be found in professional broadcasting,
editing, and contribution-quality distribution environments. The
drawback of the 4:2:2 format is simply that it increases the size of
the macroblock from six 8x8 blocks (4:2:0) to eight, while increasing
the frame buffer size and decoding bandwidth by the same amount (33%).
This increase places the buffering memories well past the magic
16-Mbit limit for semiconductor DRAM devices, assuming the pictures are
stored with a maximum of 414,720 pixels (720 pixels/line x 576
lines/frame). The maximum allowable pixel resolution could be reduced
by about 1/4 to compensate (e.g. 544 x 576). However, if hardware
decoders operate on a macroblock basis in the pipeline, on-chip static
memories (SRAM) will increase by 1/3. The benefits offered by 1/3 more
pixels generally outweigh those of full vertical chrominance
resolution. Other arguments favoring 4:2:0 over 4:2:2 include:
Vertical decimation increases compression efficiency by reducing
syntax overhead posed in an 8 block (4:2:2) macroblock structure.
You're compressing the hell out of the video signal, so what possible
difference can the 0:0:2 chrominance high-pass make?
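
The buffer memory argument above can be put into numbers. The sketch
assumes 8-bit samples, three stored frames (a hypothetical figure for an
IPB decoder; the coded data buffer is ignored), and a 16-Mbit DRAM of
16 x 1024 x 1024 bits. The names are made up for illustration.

```python
BITS_PER_SAMPLE = 8
DRAM_BITS = 16 * 1024 * 1024
# Average samples per pixel: one luminance sample plus the chrominance share.
SAMPLES_PER_PIXEL = {"4:2:0": 1.5, "4:2:2": 2.0}


def frame_bits(width, height, chroma_format):
    """Bits needed to store one decoded frame."""
    return int(width * height * SAMPLES_PER_PIXEL[chroma_format] * BITS_PER_SAMPLE)


def fits_in_dram(width, height, chroma_format, num_frames=3):
    """True if num_frames decoded frames fit within the assumed DRAM size."""
    return num_frames * frame_bits(width, height, chroma_format) <= DRAM_BITS


if __name__ == "__main__":
    for fmt, (w, h) in (("4:2:0", (720, 576)), ("4:2:2", (720, 576)),
                        ("4:2:2", (544, 576))):
        print(fmt, w, h, 3 * frame_bits(w, h, fmt), fits_in_dram(w, h, fmt))
```

Under these assumptions 720x576 fits in 4:2:0 but not in 4:2:2, while the
narrower 544x576 picture fits again in 4:2:2.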
Is 4:2:0 the same as 4:1:1 ?
No, no, definitely no. The following table illustrates the nuances
between the different chroma formats for a frame with pixel dimensions
of 720 pixels/line x 480 lines/frame.
CCIR 601 (60 Hz) image Chroma sub-sampling factors
format Y Cb, Cr Vertical Horizontal
3:2:2, 3:1:1, and 3:1:0 are less common variations, but have been
documented. As shocking as it may seem, the 4:1:0 ratio was used by
Intel's DVI for several years.
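
The difference between the chroma formats comes down to the horizontal
and vertical sub-sampling factors. The sketch below works out the Cb/Cr
picture size for a 720x480 frame; the factors are the conventional
meanings of these ratios rather than values recovered from the table
above, and the names are made up for illustration.

```python
# (horizontal factor, vertical factor) applied to each chrominance picture.
SUBSAMPLING = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
    "4:1:1": (4, 1),
}


def chroma_size(width, height, chroma_format):
    """Return (width, height) of each chrominance picture (Cb or Cr)."""
    h_factor, v_factor = SUBSAMPLING[chroma_format]
    return width // h_factor, height // v_factor


if __name__ == "__main__":
    for fmt in SUBSAMPLING:
        print(fmt, (720, 480), chroma_size(720, 480, fmt))
```

4:2:0 and 4:1:1 carry the same number of chrominance samples, but 4:2:0
halves both directions while 4:1:1 quarters the horizontal direction only.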
The 130 microsecond gap between successive 4:2:0 lines in progressive
frames, and 260 microsecond gap in interlaced frames, can introduce
some difficult vertical frequencies, but most can be alleviated through
What is the sample precision of MPEG ? How many colors
can MPEG represent ?
By definition, MPEG samples have no more and no less than 8 bits of
uniform sample precision (256 quantization levels). For luminance
(which is unsigned) data, black corresponds to level 0, white is level
255. However, in CCIR Recommendation 601 chromaticity, luminance (Y)
levels 0 through 15 and 236 through 255 are reserved for blanking
signal excursions. MPEG currently has no such clipped excursion
restrictions, although decoders might take care to ensure active samples
do not exceed these limits. With three color components per pixel, the
total combination is roughly 16.8 million colors (i.e. 24 bits).
How are the subsampled chroma samples sited ?
It is moderately important to properly co-site chroma samples,
otherwise a sort of chroma shifting effect (exhibited as a halo) may
result when the reconstructed video is displayed. In MPEG-1 video, the
chroma samples are exactly centered between the 4 luminance samples
(Fig 1.) To maintain compatibility with the CCIR 601 horizontal
chroma locations and simplify implementation (eliminate need for phase
shift), MPEG-2 chroma samples are arranged as per Fig.2.
Y   Y   Y   Y       Y   Y   Y   Y       YC  Y   YC  Y
  C       C         C       C
Y   Y   Y   Y       Y   Y   Y   Y       YC  Y   YC  Y

Y   Y   Y   Y       Y   Y   Y   Y       YC  Y   YC  Y
  C       C         C       C
Y   Y   Y   Y       Y   Y   Y   Y       YC  Y   YC  Y

Fig.1 MPEG-1        Fig.2 MPEG-2        Fig.3 MPEG-2 and
4:2:0 organization  4:2:0 organization  CCIR Rec. 601
How do you tell an MPEG-1 bitstream from an MPEG-2 bitstream?
A. All MPEG-2 bitstreams must contain specific extension headers that
immediately follow MPEG-1 headers. At the highest layer, for example,
the MPEG-1 style sequence_header() is followed by sequence_extension().
Some extension headers are specific to MPEG-2 profiles. For example,
sequence_scalable_extension() is not allowed in Main Profile
A simple program need only scan the coded bitstream for byte-aligned
start codes to determine whether the stream is MPEG-1 or MPEG-2.
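
Such a program might look like the sketch below. It assumes the video
start code values 0x000001B3 for sequence_header_code and 0x000001B5 for
extension_start_code, and it treats a stream as MPEG-2 when the first
sequence header is immediately followed by an extension start code. The
function names are made up, and no header contents are parsed.

```python
SEQUENCE_HEADER_CODE = 0xB3   # assumed value of the byte after 00 00 01
EXTENSION_START_CODE = 0xB5   # assumed value of the byte after 00 00 01
PREFIX = b"\x00\x00\x01"


def start_codes(data):
    """Yield (offset, code byte) for every byte-aligned start code."""
    pos = data.find(PREFIX)
    while pos != -1 and pos + 3 < len(data):
        yield pos, data[pos + 3]
        pos = data.find(PREFIX, pos + 3)


def guess_video_version(data):
    """Return 1 or 2, or None if no sequence header is found."""
    codes = [code for _, code in start_codes(data)]
    for i, code in enumerate(codes):
        if code == SEQUENCE_HEADER_CODE:
            following = codes[i + 1] if i + 1 < len(codes) else None
            return 2 if following == EXTENSION_START_CODE else 1
    return None


if __name__ == "__main__":
    sample = b"\x00\x00\x01\xb3\x12\x34\x00\x00\x01\xb5\x14\x8a"
    print(guess_video_version(sample))
```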
What are start codes?
These 32-bit byte-aligned codes provide a mechanism for cheaply
searching coded bitstreams for commencement of various layers of video
without having to actually parse variable-length codes or perform any
decoder arithmetic. Start codes also provide a mechanism for
resynchronization in the presence of bit errors. A start code may be
preceded by an arbitrary number of zero bytes. The zero bytes can be
used to guarantee that a start code occurs within a certain location, or
by rate control to increase the bitrate of a coded bitstream.
Coded block pattern:
(CBP, not to be confused with Constrained Parameters!) When the frame
prediction is particularly good, the displaced frame difference (DFD, or
temporal macroblock prediction error) tends to be small, often with
entire block energy being reduced to zero after quantization. This
usually happens only at low bit rates. Coded block patterns prevent
the need for transmitting EOB symbols in those zero coded blocks.
Coded block patterns are transmitted in the macroblock header only if
the macroblock_type flag indicates so.
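
For a 4:2:0 macroblock the pattern is a 6-bit number with one bit per
block. The sketch below assumes the block order Y0, Y1, Y2, Y3, Cb, Cr
with the first block in the most significant bit; the function name is
made up for illustration.

```python
def coded_block_pattern(blocks):
    """blocks: six lists of 64 quantized coefficients (Y0, Y1, Y2, Y3, Cb, Cr).

    Returns a 6-bit integer whose bit (5 - i) is set when block i has at
    least one non-zero coefficient and therefore needs to be coded.
    """
    if len(blocks) != 6:
        raise ValueError("a 4:2:0 macroblock has six blocks")
    cbp = 0
    for i, block in enumerate(blocks):
        if any(block):
            cbp |= 1 << (5 - i)
    return cbp


if __name__ == "__main__":
    zero = [0] * 64
    coded = [3] + [0] * 63
    print(coded_block_pattern([coded, zero, zero, zero, zero, coded]))
```

A result of zero means that no block carries any data, which is the case
signalled through macroblock_type rather than through a pattern.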
Why is the DC value always divided by 8 ?
Clarification point: The DC value of Intra coded blocks is quantized by
a constant stepsize of 8 only in MPEG-1, rendering the 11-bit dynamic
range of the IDCT DC coefficient to 8-bits of accuracy. MPEG-2 allows
for DC precision of 8, 9, 10, or 11 bits. The quantization stepsize is
fixed for the duration of the picture, set by the intra_dc_precision
flag in the picture_coding_extension() header.
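
The relation between DC precision and stepsize can be sketched as below,
assuming the stepsize halves for each extra bit of precision (8, 4, 2, 1
for 8, 9, 10, 11 bits). The function takes the precision in bits rather
than the coded value of the syntax element, and its name is made up for
illustration.

```python
def intra_dc_stepsize(precision_bits):
    """Stepsize applied to the 11-bit DC coefficient for a given precision."""
    if precision_bits not in (8, 9, 10, 11):
        raise ValueError("precision must be 8, 9, 10, or 11 bits")
    return 1 << (11 - precision_bits)


def reconstruct_dc(quantized_dc, precision_bits):
    """Inverse quantize an Intra DC value."""
    return quantized_dc * intra_dc_stepsize(precision_bits)


if __name__ == "__main__":
    for bits in (8, 9, 10, 11):
        print(bits, intra_dc_stepsize(bits))
```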
Why is there a special VLC for dct_coef_first?
Since the coded_block_pattern in NON-INTRA macroblocks signals every
possible combination of all-zero valued and non-zero blocks, the
dct_coef_first mechanism assigns a different meaning to the VLC
codeword (run = 0, level =+/- 1) that would otherwise represent EOB
(10) as the first coefficient in the zig-zag ordered Run-Level token
What’s the deal with End of Block ?
Saves unnecessary run-length codes. At optimal bitrates, there tends
to be few AC coefficients concentrated in the early stages of the
zig-zag vector. In MPEG-1, the 2-bit length of EOB implies that there
is an average of only 3 or 4 non-zero AC coefficients per block. In
MPEG-2 Intra (I) pictures, with a 4-bit EOB code in Table 1, this
estimate is between 9 and 16 coefficients. Since EOB is required for
all coded blocks, its absence can signal that a syntax error has
occurred in the bitstream.
What’s this “Macroblock stuffing,” dammit ?
A genuine pain for VLSI implementations, macroblock stuffing was
included in MPEG-1 to maintain smoother, constant bitrate control for
encoders. However, with normalized complexity/activity measures and
buffer management performed a priori (before coding of the macroblock,
for example) and local monitoring of coded data buffer levels now a
common operation in encoders, (e.g. MPEG-2 encoder Test Model), the
need for such localized bitrate smoothing evaporated. Stuffing can be
achieved through slice start code padding if required. A good rule of
thumb is: if you often find yourself wishing for stuffing more than
once per slice, you probably don't have a very good rate control
algorithm. Nonetheless, to avoid any temptation, macroblock stuffing
is now illegal in MPEG-2 (A general syntax restriction brought to you
by the Implementation Studies Subgroup!)
What’s the deal with slice_vertical_position and
macroblock_address_increment ?
The absolute position of the first macroblock within a slice is known
by the combination of slice_vertical_position and the
macroblock_address_increment. Therefore, the proper place of a lost
slice found in a highly corrupt bitstream can be located exactly within
the picture. These two syntax elements are also the only known means
of detecting slice gaps, i.e. areas of the picture which are not
represented with any information (including skipped macroblocks). A
slice gap occurs when the current macroblock address of the first
macroblock in a slice is greater than the previous macroblock address
by more than 1 macroblock unit. A slice overlap occurs when the current
macroblock address is less than or equal to the previous macroblock's
address. The previous macroblock in both instances is the last known
macroblock within the previous slice. Because of the semantic
interpretation of slice gaps and overlaps, and because of the syntactic
restrictions for slice_vertical_position and
macroblock_address_increment, it is not syntactically possible for a
skipped macroblock to be represented in the first and last positions of
a slice. In the past, some (bad) encoders would attempt to signal a
run of skipped macroblocks to the end of the slice. These evil skipped
macroblocks should be interpreted by a compliant decoder as a gap, not
as a string of skipped macroblocks.
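
The gap and overlap rules above can be sketched as follows. The address
calculation assumes that slice_vertical_position counts macroblock rows
from 1 and that the first macroblock_address_increment in a slice counts
from the left edge of that row; the function names and the mb_width
parameter (picture width in macroblocks) are made up for illustration.

```python
def first_macroblock_address(slice_vertical_position,
                             macroblock_address_increment, mb_width):
    """Absolute address of the first macroblock of a slice."""
    row = slice_vertical_position - 1
    return row * mb_width + (macroblock_address_increment - 1)


def classify_slice_start(previous_address, first_address):
    """Compare against the last known macroblock of the previous slice."""
    delta = first_address - previous_address
    if delta > 1:
        return "gap"
    if delta <= 0:
        return "overlap"
    return "contiguous"


if __name__ == "__main__":
    mb_width = 720 // 16
    addr = first_macroblock_address(2, 1, mb_width)
    print(addr, classify_slice_start(44, addr), classify_slice_start(40, addr))
```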
What is meant by modified Huffman VLC tables:
The VLC tables in MPEG are not Huffman tables in the true sense of
Huffman coding, but are more like the tables used in Group 3 fax. They
are entropy constrained, that is, non-downloadable and optimized for a
limited range of bit rates (sweet spots). A better way would be to say
that the tables are optimized for a range of ratios of bit rate to
sample rate (e.g. 0.25 bits/pixel to 1.0 bits/pixel). With the
exception of a few codewords, the larger tables were carried over from
the H.261 standard drafted in the year 1990. This includes the AC
run-level symbols, coded_block_pattern, and macroblock_address_increment.
MPEG-2 added an "Intra table," also called "Table 1". Note that the
dct_coefficient tables assume positive/negative coefficient PMF
How does MPEG handle 3:2 pulldown?
MPEG-1 video decoders had to decide for themselves when to perform 3:2
pulldown if it was not indicated in the presentation time stamps (PTS)
of the Systems layer bitstream. MPEG-2 provides two flags
(repeat_first_field, and top_field_first) which explicitly describe
whether a frame or field is to be repeated. In progressive sequences,
frames can be repeated 2 or 3 times. Simple and Main Profile are
limited to repeated fields only. It is a general syntactic restriction
that repeat_first_field can only be signaled (value ==1) in a frame
structured picture. It makes little sense to repeat field pictures in
an interlaced video signal since the whole process of 3:2 pulldown
conversion was meant to convert progressive, film sequences to the
display frame rate of interlaced television.
In the most common scenario, a film sequence will contain 24 frames
every second. The frame_rate element in the sequence header will
indicate 30 frames/sec, however. On average, every other coded frame
will signal a repeat field (repeat_first_field==1) to pad the frame
rate from 24 Hz to 30 Hz:
(24 coded frames/sec)*(2 fields/coded frame)*(5 display fields/4 coded
fields) = 60 display fields/sec = 30 display frames/sec
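
The arithmetic above can be checked by expanding the two flags into a
display field sequence. The four-frame flag pattern used here is one
common 3:2 cadence chosen for illustration, not something mandated by the
text, and the function name is made up.

```python
# (top_field_first, repeat_first_field) for four consecutive film frames.
CADENCE = [(1, 1), (0, 0), (0, 1), (1, 0)]


def display_fields(frames):
    """Expand coded frames into a list of (frame index, field parity).

    Each frame shows two fields, starting with the top field when
    top_field_first is set; repeat_first_field shows the first field again.
    """
    fields = []
    for index, (top_field_first, repeat_first_field) in enumerate(frames):
        first, second = ("T", "B") if top_field_first else ("B", "T")
        fields.append((index, first))
        fields.append((index, second))
        if repeat_first_field:
            fields.append((index, first))
    return fields


if __name__ == "__main__":
    one_second = display_fields(CADENCE * 6)   # 24 coded frames
    print(len(one_second), len(one_second) // 2)
    print("".join(parity for _, parity in display_fields(CADENCE)))
```

Four coded frames expand to ten display fields with strictly alternating
parity, so 24 coded frames fill 60 fields, or 30 display frames.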
After all this standardization, what’s left for research?
A. Despite the fact that a comprehensive worldwide standard now exists
for digital video, many areas remain wide open for research: advanced
encoding and pre-processing, motion estimation, macroblock decision
models, rate control and buffer management in editing environments,
implementation complexity reduction, etc. Many areas have yet to be
solved ... (and discovered)..
Are some encoders better than others ?
A. Definitely. For example, the motion estimation search range of an
encoder has great influence over final picture quality. At a certain
point a very large range can actually become detrimental (it may
encourage large differential motion vectors). Practical ranges are
usually between +/- 15 and +/- 32. As the range doubles, for instance,
the search area quadruples (like the classic relationship between an
increase in linear dimension vs. area).
Rate control marks a second tell-tale area where some encoders perform
significantly better than others.
And finally, the degree of "pre-processing" (now a popular buzzword in
the business) signals that the encoder belongs to an elite marketing
Is the encoder standardized ?
A. The encoder rests just outside the normative scope of the standard,
as long as the bitstreams it produces are compliant. The decoder,
however, is almost deterministic: a given bitstream should reconstruct
to a unique set of pictures. However, since the IDCT function is the
ONLY non-normative stage in the decoder, an occasional error of a Least
Significant Bit per prediction iteration is permitted. The designer is
free to choose among many DCT algorithms and implementations. The IEEE
1180 test referenced in Annex A of the MPEG-1 (ISO/IEC 11172-2) and
MPEG-2 (ISO/IEC 13818-2) Video specifications spells out the
statistical mismatch tolerance between the Reference IDCT, which is a
separable 8x1 "Direct Matrix" DCT implemented with 64-bit floating
point accuracy, and the IDCT you are testing for compliance.
What is the TM (Test Model) ?
What is the TM rate control and adaptive quantization technique ?
A. The Test model (MPEG-2) and Simulation Model (MPEG-1) were not, by
any stretch of the imagination, meant to epitomize state-of-the art
encoding quality. They were, however, designed to exercise the syntax,
verify proposals, and test the relative compression performance of
proposals in a timely manner that could be duplicated by
co-experimenters. Without simplicity, there would have been no doubt
endless debates over model interpretation. Regardless of all else,
more advanced techniques would probably trespass into proprietary
The final test model for MPEG-2 is TM version 5b, a.k.a. TM version 6,
produced in March 1993 (the time when the MPEG-2 video syntax was
frozen). The final MPEG-1 simulation model is version 3 (SM-3). The
MPEG-2 TM rate control method offers a dramatic improvement over the SM
method. TM adds more accurate estimation of macroblock complexity
through use of limited a priori information. Macroblock quantization
adjustments are computed on a macroblock basis, instead of
once-per-macroblock row (which in the SM-3 case consisted of an entire
How does the TM work?
Rate control and adaptive quantization are divided into three steps:
Step One: Target Bit Allocation
In Complexity Estimation, the global complexity measures assign
relative weights to each picture type (I,P,B). These weights (Xi, Xp,
Xb) are reflected by the typical coded frame size of I, P, and B
pictures (see typical frame size discussion). I pictures are usually
assigned the largest weight since they have the greatest stability
factor in an image sequence and contain the most new information in a
sequence. B pictures are assigned the smallest weight since their energy
does not propagate into other pictures and they are usually more highly
correlated with neighboring P and I pictures than P pictures are.
The bit target for a frame is based on the frame type, the remaining
number of bits left in the Group of Pictures (GOP) allocation, and the
immediate statistical history of previously coded pictures (sort of a
moving average global rate control, if you will).
Step Two: Rate Control via Buffer Monitoring
Rate control attempts to adjust bit allocation if there is significant
difference between the target bits (anticipated bits) and actual coded
bits for a block of data. If the virtual buffer begins to overflow,
the macroblock quantization step size is increased, resulting in a
smaller yield of coded bits in subsequent macroblocks. Likewise, if
underflow begins, the step size is decreased. The Test Model
approximates that the target picture has spatially uniform distribution
of bits. This is a safe approximation since spatial activity and
perceived quantization noise are almost inversely proportional. Of
course, the user is free to design a custom distribution, perhaps
targeting more bits in areas that contain more complex yet highly
perceptible data such as text.
Step Three: Adaptive Quantization
The final step modulates the macroblock quantization step size obtained
in Step 2 by a local activity measure. The activity measure itself is
normalized against the most recently coded picture of the same type (I,
P, or B). The activity for a macroblock is chosen as the minimum among
the four 8x8 block luminance variances. Choosing the minimum block is
part of the concept that a macroblock is no better than the block of
highest visible distortion (weakest link in the chain).
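
Steps Two and Three can be sketched roughly as below. The formulas are
reproduced from memory of the Test Model 5 description and should be
checked against the TM document; the function and parameter names are
made up, and this is an illustration, not the Test Model software.

```python
def reaction_parameter(bit_rate, picture_rate):
    """Scales virtual buffer fullness to the quantizer range (assumed form)."""
    return 2.0 * bit_rate / picture_rate


def reference_quantizer(initial_fullness, bits_so_far, target_bits,
                        mb_index, mb_count, reaction):
    """Step Two: quantizer from virtual buffer fullness before macroblock
    mb_index (0-based), assuming bits are spread uniformly over the picture."""
    fullness = initial_fullness + bits_so_far - target_bits * mb_index / mb_count
    return fullness * 31.0 / reaction


def normalized_activity(block_variances, average_activity):
    """Step Three: activity from the minimum block variance, normalized
    against the average activity of the previous picture of the same type.
    The result lies between 0.5 and 2."""
    activity = 1.0 + min(block_variances)
    return (2.0 * activity + average_activity) / (activity + 2.0 * average_activity)


def mquant(reference, activity_factor):
    """Final macroblock quantizer, clipped to the 5-bit range 1..31."""
    return max(1, min(31, int(round(reference * activity_factor))))


if __name__ == "__main__":
    r = reaction_parameter(4000000, 30)
    q = reference_quantizer(r * 10 / 31, 60000, 100000, 660, 1320, r)
    print(q, mquant(q, normalized_activity([5.0, 9.0, 2.0, 40.0], 20.0)))
```

A flat macroblock in a busy picture gets a factor below 1 (finer
quantization), while a busy macroblock in a flat picture gets a factor
above 1 (coarser quantization).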
[deferred to later date]
Can motion vectors be used to determine object velocity?
Motion vector information cannot be reliably used as a means of
determining object velocity unless the encoder model specifically set
out to do so. First, encoder models that optimize picture quality
generate vectors that typically minimize prediction error and,
consequently, the vectors often do not represent true object
translation from picture-to-picture. Standards converters that
resample one frame rate to another (as in NTSC to PAL) use different
methods (motion vector field estimation, edge detection, et al) that
are not concerned with Rate-Distortion theory. Second, motion vectors
are not transmitted for all macroblocks anyway.
Is it possible to code interlaced video with MPEG-1 syntax?
A. Two methods can be applied to interlaced video that maintain
syntactic compatibility with MPEG-1 (which was originally designed for
progressive frames only). In the field concatenation method, the
encoder model can carefully construct predictions and prediction errors
that realize good compression but maintain field integrity (distinction
between adjacent fields of opposite parity). Some pre-processing
techniques can also be applied to the interlaced source video that
would, e.g., lessen sharp vertical frequencies.
This technique is not terribly efficient, of course. On the other hand,
if the original source was progressive (e.g. film), then it is simpler
to convert the interlaced source to a progressive format before
encoding. (MPEG-2 would then only offer slightly superior performance
through such MPEG-2 enhancements as greater DC coefficient precision,
non-linear mquant, intra VLC, etc.) Reconstructed frames are usually
re-interlaced in the Display process following the decoding stages.
The second syntactically compatible method codes fields as separate
pictures. Rumors have spread that this approach does not work nearly
as well as the "pretend it's really a frame" method.
Can MPEG be used to code still frames ?
Yes. MPEG Intra pictures are similar to baseline sequential JPEG pictures.
There are, of course, advantages and disadvantages to using MPEG over
JPEG to represent still pictures.
Disadvantages:
1. MPEG has only one color space (YCbCr)
2. MPEG-1 and MPEG-2 Main Profile luma and chroma share quantization
and VLC tables (4:2:0 chroma_format)
3. MPEG-1 is syntactically limited to 4k x 4k images, and MPEG-2 to
16k x 16k.
Advantages:
1. MPEG possesses adaptive quantization which permits better rate
control and spatial masking.
2. With its limited still image syntax, MPEG averts any temptation to
use unnecessary, expensive, and academic encoding methods that have
little impact on the overall picture quality (you know who you are).
3. Philips' CD-I spec. has a requirement for an MPEG still frame mode,
with double SIF image resolution. This is technically feasible mostly
thanks to the fact that only one picture buffer is needed to decode a
still image instead of the 2.5 to 3 buffers needed for IPB sequences.
Why was the 8x8 DCT size chosen?
A. Experiments showed little compaction gains could be achieved with
larger transform sizes, especially in light of the increased
implementation complexity. A fast DCT algorithm will require roughly
double the number of arithmetic operations per sample when the linear
transform point size is doubled. Naturally, the best compaction
efficiency has been demonstrated using locally adaptive block sizes
(e.g. 16x16, 16x8, 8x8, 8x4, and 4x4) [See Gary Sullivan and Rich
Baker "Efficient Quadtree Coding of Images and Video," ICASSP 91, pp
Inevitably, adaptive block transformation sizes introduce additional
side information overhead while forcing the decoder to implement
programmable or hardwired recursive DCT algorithms. If the DCT size
becomes too large, then more edges (local discontinuities) and the like
become absorbed into the transform block, resulting in wider
propagation of Gibbs (ringing) and other unpleasant phenomena.
Finally, with larger transform sizes, the DC term is even more
critically sensitive to quantization noise.
Why was the 16x16 prediction size chosen?
The 16x16 area corresponds to the Least Common Multiple (LCM) of 8x8
blocks, given the normative 4:2:0 chroma ratio. Starting with medium
size images, the 16x16 area provides a good balance between side
information overhead & complexity and motion compensated prediction
accuracy. In gist, experiments showed that the 16x16 was a good
trade-off between complexity and coding efficiency.
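The LCM remark can be checked by counting how many whole 8x8 blocks a macroblock holds for a given chroma format. The sketch below is illustrative only; the subsampling table lists the usual horizontal and vertical factors, and the function name is invented for the example.

```python
# (horizontal, vertical) chroma subsampling factors per chroma format.
CHROMA_SUBSAMPLING = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}


def blocks_per_macroblock(chroma_format, mb_size=16, block_size=8):
    """Return (luma blocks, chroma blocks) that fit whole in one macroblock."""
    luma_blocks = (mb_size // block_size) ** 2
    sub_h, sub_v = CHROMA_SUBSAMPLING[chroma_format]
    chroma_w = mb_size // sub_h   # width of each chroma area in samples
    chroma_h = mb_size // sub_v   # height of each chroma area in samples
    per_component = (chroma_w // block_size) * (chroma_h // block_size)
    return luma_blocks, 2 * per_component  # Cb and Cr


print(blocks_per_macroblock("4:2:0"))             # (4, 2): six blocks in all
print(blocks_per_macroblock("4:2:0", mb_size=8))  # (1, 0): chroma does not fill a block
```

With 4:2:0 sampling, 16x16 is the smallest luminance area whose chroma still fills a complete 8x8 block, which is the sense in which 16x16 is the "least common multiple".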
What do B-pictures buy you?
A. Since bi-directional macroblock predictions are an average of two
macroblock areas, noise is reduced at low bit rates (like a 3-D filter,
if you will). At nominal MPEG-1 video (352 x 240 x 30, 1.15 Mbit/sec)
rates, it is said that B-frames improve SNR by as much as 2 dB. (A 0.5
dB gain is usually considered worthwhile in MPEG.) However, at higher
bit rates, B-frames become less useful since they inherently do not
contribute to the progressive refinement of an image sequence (i.e.
not used as prediction by subsequent coded frames). Regardless,
B-frames are still politically controversial.
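The averaging behind the noise reduction can be sketched as follows. This is a toy illustration, not decoder code: the function name and the 2x2 sample values are invented, and rounding halves upward with (a + b + 1) >> 1 is assumed for non-negative samples.

```python
def bidirectional_prediction(forward, backward):
    """Average two prediction blocks sample by sample, rounding halves up."""
    return [[(f + b + 1) >> 1 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward, backward)]


# Hypothetical case: the true signal is 100 everywhere, and each reference
# picture carries its own noise.
forward = [[104, 97], [101, 95]]
backward = [[96, 102], [99, 106]]

print(bidirectional_prediction(forward, backward))  # [[100, 100], [100, 101]]
```

Noise that is uncorrelated between the two reference pictures partly cancels in the average, which is the "3-D filter" effect mentioned above.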
B pictures are interpolative in two ways: 1. predictions in the
bi-directional macroblocks are an average from block areas of two
pictures 2. B pictures "fill in" like a digital spackle the immediate
3-D video signal without contributing to the overall signal quality
beyond that immediate point in time. In other words, a B picture,
regardless of its internal make-up of macroblock types, has a life
limited only to itself. As mentioned before, B picture energy does not
propagate into other frames. In a sense, bits spent on B pictures are
Why do some people hate B-frames?
A. Computational complexity, bandwidth, end-to-end delay, and picture
buffer size are the four B-frame Pet Peeves. Computational complexity
in the decoder is increased since some macroblock modes require
averaging between two block predictions (macroblock_motion_forward==1
&& macroblock_motion_backward==1).
Worst case, memory bandwidth is increased an extra 15.2 MByte/s
(assuming 4:2:0 chroma_format at Main Level, not including any half-pel
or page-mode overhead) for this extra directional prediction. To
really rub it in, an extra picture buffer is needed to store the future
reference picture (backwards prediction frame). Finally, an extra
picture delay is introduced in the decoder since the frame used for
backwards prediction needs to be transmitted to the decoder and
reconstructed before the intermediate B-pictures in display order can
be decoded.
Cable television operators have been particularly averse to B-frames
since, for CCIR 601 rate video, the extra picture buffer pushes the
decoder DRAM memory requirements past the magic 8-Mbit (1 Mbyte)
threshold into the evil realm of 16 Mbit (2 Mbyte), although 8 Mbit is
fine for 352 x 480 B-picture sequences. However, cable often forgets
that DRAM does not come in convenient high-volume (low cost) 8-Mbit
packages as do the friendly 4-Mbit and 16-Mbit packages. In a few
years, the cost difference between 16 Mbit and 8 Mbit will become
insignificant compared to the bandwidth savings gained through higher
compression. For
the time being, some cable boxes will start with 8-Mbit and allow
future drop-in upgrades to the full 16-Mbit.
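A back-of-the-envelope check of the memory and bandwidth figures follows. It rests on several assumptions that are not stated in the text: 8-bit 4:2:0 pictures (1.5 bytes per pel), three picture buffers for an IPB decoder, a coded data (VBV) buffer of 1,835,008 bits as for Main Profile at Main Level, DRAM sizes counted in units of 2^20 bits, and a 704-sample active line for the bandwidth figure (chosen only because it reproduces the 15.2 MByte/s quoted above).

```python
MBIT = 2 ** 20           # DRAM capacities are powers of two
VBV_MP_ML = 1_835_008    # assumed coded data buffer size in bits


def picture_bytes(width, height, bytes_per_pel=1.5):
    """Bytes in one 8-bit 4:2:0 picture."""
    return int(width * height * bytes_per_pel)


def decoder_memory_bits(width, height, n_pictures=3, vbv_bits=VBV_MP_ML):
    """Picture buffers plus coded data buffer, in bits."""
    return n_pictures * picture_bytes(width, height) * 8 + vbv_bits


for label, width, height in (("CCIR 601", 720, 480), ("352 x 480", 352, 480)):
    bits = decoder_memory_bits(width, height)
    print(f"{label}: {bits / MBIT:.2f} Mbit with three picture buffers")

# One extra prediction read per sample for the second direction.
extra_bandwidth = picture_bytes(704, 480) * 30
print(f"extra bandwidth: {extra_bandwidth / 1e6:.1f} MByte/s")
```

Under these assumptions the full-width case lands between 8 and 16 Mbit, while the 352 x 480 case stays under 8 Mbit, in line with the paragraph above.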
How are interlaced and progressive pictures indicated in
The following tree may help illustrate the possible layers of
progressive and interlaced coding modes:
progressive        interlaced sequence
sequence              /          \
            Field picture      Frame picture
                                /          \
              Frame or field prediction   Frame MB prediction only
                    /        \
             Field dct     Frame dct
What does it mean to be compliant with MPEG ?
There are two areas of conformance/compliance in MPEG:
1. Compliant bitstreams
2. Compliant decoders
Technically speaking, video bitstreams consisting entirely of I-frames
are syntactically compliant with the MPEG specification. The I-frame
sequence simply utilizes a rather limited subset of the full syntax.
Compliant bitstreams must obey the range limits (e.g. motion vectors
ranges, bit rates, frame rates, buffer sizes) and permitted syntax
elements in the bitstream (e.g. chroma_format, B-pictures, etc).
Decoders, however, must be able to decode all combinations of legal
bitstreams. For example, a decoder which is incapable of decoding P or
B frames is definitely not a Main Profile or Constrained Parameters
decoder! Likewise, full arithmetic precision must be obeyed before any
decoder can be called "MPEG compliant." The IDCT, inverse quantizer,
and motion compensated predictor must meet the accuracy requirements
defined in the MPEG document. Real-time conformance is more complicated
to measure than arithmetic precision, but it is reasonable to expect that
decoders that skip frames on reasonable bitstreams are not likely to be
What are Profiles and Levels?
A. MPEG-2 Video Main Profile and Main Level is analogous to MPEG-1's
CPB, with sampling limits at CCIR 601 parameters (720x480x30 Hz or
720x576x25 Hz). "Profiles" limit syntax (i.e. algorithms), whereas
"Levels" limit coding parameters (sample rates, frame dimensions, coded
bitrates, etc.). Together, Video Main Profile and Main Level
(abbreviated as MP@ML) normalize complexity within feasible limits of
1994 VLSI technology (0.5 micron), yet still meet the needs of the
majority of applications. MP@ML is the conformance point for most cable
and satellite TV systems.
[insert a description of each Profiles and Levels here]
Can MPEG-1 encode higher sample rates than 352 x 240 x 30 Hz ?
A. Yes. The MPEG-1 syntax permits sampling dimensions as high as 4095 x
4095 x 60 frames per second. The MPEG most people think of as "MPEG-1"
is really a kind of subset known as Constrained Parameters Bitstreams
(CPB).
What are Constrained Parameters Bitstreams?
MPEG-1 CPB are a limited set of sampling and bitrate parameters
designed to normalize decoder computational complexity, buffer size,
and memory bandwidth while still addressing the widest possible range
of applications. The parameter limits were intentionally designed to
permit decoder implementations integrated with 4 Megabits (512 Kbytes)
The sampling limits of CPB are bounded at the ever popular SIF rate:
396 macroblocks (101,376 pixels) per picture if the picture rate is
less than or equal to 25 Hz, and 330 macroblocks (84,480 pixels) per
picture if the picture rate is 30 Hz. The MPEG nomenclature loosely
defines a pixel or "pel" as a unit vector containing a complete
luminance sample and one fractional (0.25 in 4:2:0 format) sample from
each of the two chrominance (Cb and Cr) channels. Thus, the
corresponding bandwidth figure can be computed as:
352 samples/line x 240 lines/picture x 30 pictures/sec x 1.5
or 3.8 Ms/s (million samples/sec) including chroma, but not including
blanking intervals. Since most decoders are capable of sustaining VLC
decoding at a faster rate than 1.8 Mbit/sec, the coded video bitrate
has become the most often waived parameter of CPB. An encoder which
intelligently employs the syntax tools should achieve SIF quality
saturation at about 2 Mbit/sec, whereas an encoder producing streams
containing only I (Intra) pictures might require as much as 8 Mbit/sec
to achieve the same video quality.
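The macroblock, pel, and sample-rate figures above can be reproduced with a few lines of arithmetic. The function names are invented for this sketch; the 1.5 factor is the one luminance sample plus two quarter-weight chrominance samples per pel described in the text.

```python
def macroblocks_per_picture(width, height, mb_size=16):
    """Number of 16x16 macroblocks needed to cover the picture."""
    mb_cols = -(-width // mb_size)   # ceiling division
    mb_rows = -(-height // mb_size)
    return mb_cols * mb_rows


def samples_per_second(width, height, picture_rate, samples_per_pel=1.5):
    """Luminance plus 4:2:0 chrominance samples per second, no blanking."""
    return width * height * picture_rate * samples_per_pel


for width, height, rate in ((352, 288, 25), (352, 240, 30)):
    mbs = macroblocks_per_picture(width, height)
    rate_ms = samples_per_second(width, height, rate) / 1e6
    print(f"{width} x {height} x {rate} Hz: {mbs} macroblocks, "
          f"{mbs * 256} pels, {rate_ms:.1f} Ms/s")
```

Both SIF variants come out at the same 3.8 Ms/s, which is why the limit is 396 macroblocks at 25 Hz but only 330 at 30 Hz.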
Why is Constrained Parameters so important?
A. It is an optimum point that allows (just barely) cost effective
VLSI implementations in 1992 technology (0.8 microns). It also
implies a nominal guarantee of interoperability for decoders and a
reasonable class of performance for encoders. Since CPB is the most
popular canonical MPEG-1 conformance point, MPEG devices which are not
capable of at least meeting SIF rates are usually not considered to be
true MPEG by industry.
Picture buffers (i.e. "frame stores") and coded data buffering
requirements for MPEG-1 CPB fit just snugly into 4 Mbit of memory
Who uses constrained parameters bitstreams?
A. Principal CPB applications are Compact Disc video (White Book or
CD-I) and desktop video. Set-top TV decoders fall into a higher
sampling rate category known as "CCIR 601" or "Broadcast rate," which
as a rule of thumb, has sampling dimensions and bandwidth 4 times
that of SIF (Constrained Parameter sample rate limit).
Are there ways of circumventing constrained parameters bitstreams for
SIF class applications and decoders ?
A. Yes, some. Remember that CPB limits pictures by macroblock count
(or pixels/frame). 416 x 240 x 24 Hz sampling rates are still within
these constraints. Deviating from 352 samples/line could throw off many
decoder implementations which possess limited horizontal sample rate
conversion abilities. Some decoders do in fact include a few rate
conversion modes, with a filter usually implemented via binary taps
(shifts and adds). Likewise, the target sample rates are usually
limited to a few fixed rates or ratios (e.g. 640, 540, 480 pixels/line,
etc.). Future MPEG
decoders will likely include on-chip arbitrary sample rate converters,
perhaps capable of operating in the vertical direction (although there
is little need of this in applications using standard TV monitors where
line count is constant, with the possible exception of windowing in
cable box graphical user interfaces).
Also, many CD videos are letterboxed at the 16:9 aspect ratio. The
actual coded and display sampling dimensions are 384 x 216 (note
384/216 = 16/9). These programs are typically movies coded at the more
manageable 24 frames/sec.
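Whether a non-standard picture size still fits can be tested against the macroblock-count rule quoted earlier (396 macroblocks at 25 Hz or less, 330 at 30 Hz). The sketch below checks only that rule; the other CPB limits, such as bit rate and buffer size, are deliberately left out, and the function names are invented.

```python
def macroblocks_per_picture(width, height, mb_size=16):
    """Number of 16x16 macroblocks needed to cover the picture."""
    return -(-width // mb_size) * -(-height // mb_size)


def within_cpb_picture_limits(width, height, picture_rate):
    """Apply only the picture-size rule: 396 MBs up to 25 Hz, else 330."""
    if picture_rate > 30:
        return False
    limit = 396 if picture_rate <= 25 else 330
    return macroblocks_per_picture(width, height) <= limit


for width, height, rate in ((416, 240, 24), (384, 216, 24), (416, 240, 30)):
    mbs = macroblocks_per_picture(width, height)
    ok = within_cpb_picture_limits(width, height, rate)
    print(f"{width} x {height} x {rate} Hz: {mbs} macroblocks, fits: {ok}")
```

Note that 216 lines is not a multiple of 16, so the letterboxed picture is counted as 14 macroblock rows here.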
Are there any other conformance points like CPB for MPEG-1?
A. Undocumented ones, yes. A second generation of decoder chips
emerged on the market about 1 year after the first wave of SIF-class
decoders. Both LSI Logic and SGS-Thomson introduced CCIR 601 class
MPEG-1 video decoders to fill in the gap between canonical MPEG-1 (SIF)
and the emergence of Main Profile at Main Level (CCIR 601) MPEG-2
decoders. Under non-disclosure agreement, C-Cube had the CL-950,
although since Q2'94, the CL-9100 is now the full MPEG-2 successor in
production. MPEG-1 decoders in the CCIR 601 class, or Main Level, were
all too often called MPEG-1.5 or MPEG-1++ decoders. For the first year
of operation, the Direct Broadcasting Satellite service in the United
States (Hughes DirecTV and Hubbard's USSB) called only upon MPEG-1
syntax to represent interlaced video before switching to full MPEG-2
What frame rates are permitted in MPEG?
A limited set is available for the choosing in MPEG-1 and the currently
defined set of Profiles and Levels of MPEG-2, although "tricks" could
be played with Systems-layer Time Stamps to convey non-standard picture
rates. The set is: 23.976 Hz (3-2 pulldown NTSC), 24 Hz (Film), 25 Hz
(PAL/SECAM or 625/50 video), 29.97 Hz (NTSC), 30 Hz (drop-frame NTSC or
component 525/60), 50 Hz (double-rate PAL), 59.94 Hz (double-rate
NTSC), and 60 Hz (double-rate, drop-frame NTSC/component 525/60).
Only 23.976, 24, 25, 29.97, and 30 Hz are within the conformance space
of Constrained Parameter Bitstreams and Main Level.
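For reference, the permitted rates can be written as a lookup table. The code numbers 1 to 8 are the frame rate code values of the sequence header as recalled here, so treat the mapping as an assumption to be checked against the specification; the NTSC-family rates are exact ratios over 1001 rather than round decimals.

```python
from fractions import Fraction

# Assumed frame rate code -> picture rate in Hz.
FRAME_RATES = {
    1: Fraction(24000, 1001),  # 23.976
    2: Fraction(24),
    3: Fraction(25),
    4: Fraction(30000, 1001),  # 29.97
    5: Fraction(30),
    6: Fraction(50),
    7: Fraction(60000, 1001),  # 59.94
    8: Fraction(60),
}

# Rates within Constrained Parameters and Main Level: 30 Hz or less.
low_rate_codes = [code for code, rate in FRAME_RATES.items() if rate <= 30]

for code, rate in FRAME_RATES.items():
    print(code, round(float(rate), 3))
print(low_rate_codes)  # [1, 2, 3, 4, 5]
```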
What areas can be improved upon to create a better syntax
Several improvements can be made to the MPEG syntax while remaining
within the framework of block based coding. As implementation
technology improves with time, the ratio of computation to sample rate
can be increased for the same implementation cost. With each
evolutionary stage in the shrinking of the semiconductor lithography
process (line width), more complex coding methods become economically
realizable. Some of the well-known or well-anticipated areas for
improvement are described below:
For intra pictures, subband methods such as wavelets combined with
improved quantization and entropy coders could gain as much as 2-4 dB
over MPEG Intra pictures. The problem becomes more complex when
considering the coding of Intra Macroblocks in mixed pictures, such as
P or B, since the extent of a subband must, in the simplest of
schemes, be limited to the dimensions of a macroblock.
Prediction error coding
One of the strongest gripes against MPEG is the use of the DCT for
decorrelation of prediction error blocks. One explanation is that the
DCT is suited for the statistical correlation of intra signals, but
less suited for the statistics of prediction error (Non-Intra) signals.
One common proposal is to replace the DCT with a Vector Quantizer.
Prediction error (Non-intra) blocks typically contain far fewer bits
than intra blocks. (The bits that comprise a Non-intra block can be
thought of as having been previously distributed over previous blocks
in previous pictures in the form of coefficients and side
information.)
Finer coding unit granularities:
The size of the transform block could be made smaller, larger, or both
(myriad of different sizes). Likewise, the size of the motion
compensation block can be made larger or smaller. The cost is more
complex semantics (more decoder complexity) and the overhead bits to
select the block size. Instead of sharing the same side information,
the blocks within the macroblock could be assigned their own motion
vectors, macroblock quantization scale factors, etc.
Many advanced techniques were investigated by MPEG during the
formative stages of the specification, but were eventually eliminated
for falling below a threshold set for coding gain vs. implementation
complexity. Often, proposals presented a significant departure from the
mainstream algorithms under consideration. Each bit added to the
syntax, or rule added to the semantics represents several gates to a
silicon implementation, or from a software perspective, an extra table,
if-then or case statement at multiple points in the decoding program.
What are the similarities and differences between MPEG and
During its formative stages, H.263 was known as "H.26P" or "H.26X". It
is an ITU-T standard for low-bitrate video and audio teleconferencing.
It is designed to be more efficient (by at least 2 dB) than H.261 for bit
rates below 64 kbits/sec (ISDN B channel). The primary target bit
rate, approximately 27,000 bits/sec, is the payload rate of the V.34
(a.k.a "V.Fast" or "V.Last") modem standard. In a typical scenario, 20
kbit/sec would be allocated for the video portion, and 6.5 kbit/sec for
the speech portion.
Since the H.261 syntax was defined in 1990, techniques and
implementation power have naturally improved. H.263 collects many of
the advanced methods proposed during MPEG's formative stages into a
syntax which shares a common basis more with MPEG-1 video than with
The detailed differences and similarities are summarized below:
Sample rate, precision, and color space:
H.263 pictures are transmitted with QCIF dimensions. MPEG and JPEG
allow nearly any picture size to be described in the headers. A fixed
picture size promotes interoperability by forcing all implementors to
operate at a common rate, rather than by allowing implementors to get
away with whatever lowest sample rate the consumer can be tricked into
buying. Another reason for a fixed sample rate is that, unlike MPEG
which is generic, H.263 is geared towards a specific application
(teleconferencing). Other MPEG applications such as CD Video and Cable
TV define their own fixed parameters. Chromaticity is again YCbCr, 4:2:0
macroblock structure, and 8 bits of uniform sample precision.
How would you describe MPEG to the Data Compression
A. MPEG video is a block-based coding scheme.
How does MPEG video really compare to TV, VHS, laserdisc ?
A. VHS picture quality can be achieved for film source video at about 1
million bits per second (with careful application of proprietary
encoding methods). Objective comparison of MPEG to VHS is complex.
The luminance response curve of VHS places -3 dB (50% response, the
common definition of bandlimit) at around analog 2 MHz (digital
equivalent to 200 samples/line). VHS chroma is considerably less dense
in the horizontal direction than MPEG's 4:2:0 signal (compare 80
samples/line equivalent to 176 !!). From a sampling density
perspective, VHS is superior only in the vertical direction (480
luminance lines compared to 240). When other analog factors are taken
into account, such as interfield crosstalk and the TV monitor Kell
factor, the perceptual vertical advantage becomes much less than 2:1.
VHS is also prone to such inconveniences as timing errors (an annoyance
addressed by time base correctors), whereas digital video is fully
discretized. Duplication processes for pre-recorded VHS tapes at high
speeds (5 to 15 times real time playback speed) introduce additional
handicaps. In gist, MPEG-1 at its nominal parameters can match VHS's
sexy low-pass-filtered look, but for critical sequences, is probably
overall inferior to a well mastered, well duplicated VHS tape.
With careful coding schemes, broadcast NTSC quality can be approximated
at about 3 Mbit/sec, and PAL quality at about 4 Mbit/sec for film
source video. Of course, sports sequences with complex spatial-
temporal activity should be treated with higher bit rates, in the
neighborhood of 5 and 6 Mbit/sec. Laserdisc is perhaps the most
difficult medium to make comparisons with.
First, the video signal encoded onto a laserdisc is composite, which
lends the signal to the familiar set of artifacts (reduced color
accuracy of YIQ, moiré patterns, crosstalk, etc). The medium's
bandlimited signal is often defined by laserdisc player manufacturers
and mainstream publications as capable of rendering up to 425 TVL (or
frequencies with Nyquist at 567 samples/line). An equivalent component
digital representation would therefore have sampling dimensions of 567
x 480 x 30 Hz. The carrier-to-noise ratio of a laserdisc video signal
is typically better than 48 dB. Timing accuracy is excellent,
certainly better than VHS. Yet some of the clean characteristics of
laserdisc can be simulated with MPEG-1 signals as low as 1.15 Mbit/sec
(SIF rates), especially for those areas of medium detail (low spatial
activity) in the presence of uniform motion (affine motion vector
fields). The appearance of laserdisc or Super VHS quality can therefore
be obtained for many video sequences with low bit rates, but for the
more general class of image sequences, a bit rate ranging from 3 to 6
Mbit/sec is necessary.
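The 425 TVL to 567 samples/line conversion quoted for laserdisc can be reproduced under one assumption: that TV lines are counted per picture height, so the count across the full width of a 4:3 picture is 4/3 larger. The function name is invented for this sketch.

```python
def tvl_to_samples_per_line(tvl, aspect_ratio=4 / 3):
    """Convert TV lines per picture height to samples across the width."""
    return round(tvl * aspect_ratio)


print(tvl_to_samples_per_line(425))  # 567
```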
What are the typical coded sizes for the MPEG frames?
Typical bit sizes for the three different picture types:
30 Hz SIF @ 1.15 Mbit/sec
30 Hz CCIR 601 @ 4 Mbit/sec
Note: the above example is taken from a standard test sequence coded by
the Test Model method, with an I frame distance of 15 (N = 15), and a P
frame distance of 3 (M = 3).
Of course, among differing source material, scene changes, and use of
advanced encoder models these numbers can be significantly different.
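The N = 15, M = 3 structure and the average bit budget per picture can be sketched as below. This is a simplified view: the pattern is shown in display order starting at the I picture, the function name is invented, and the averages simply divide the bit rate by the picture rate without saying how bits are split among I, P, and B pictures.

```python
from collections import Counter


def gop_pattern(n, m):
    """Picture types of one group of pictures, in display order."""
    return "".join(
        "I" if i == 0 else "P" if i % m == 0 else "B" for i in range(n)
    )


pattern = gop_pattern(15, 3)
counts = Counter(pattern)
print(pattern)       # IBBPBBPBBPBBPBB
print(dict(counts))

for label, bitrate in (("30 Hz SIF", 1_150_000), ("30 Hz CCIR 601", 4_000_000)):
    print(f"{label}: about {round(bitrate / 30)} bits per picture on average")
```

Each group of 15 pictures therefore holds one I, four P, and ten B pictures, and the I picture has to be paid for out of the same average budget.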
At what bitrates is MPEG-2 video optimal?
The Test subgroup has defined a few example "Sweet spot" sampling
dimensions and bit rates for MPEG-2:
Equivalent to VHS quality. Intended for film source video. Half
horizontal 601(HHR). Looks almost broadcast NTSC quality
PAL broadcast quality (nearly full capture of 5.4 MHz luminance
signal). 544 samples matches the width of a 4:3 picture windowed
within 720 sample/line 16:9 aspect ratio via pan&scan
Full CCIR 601 sampling dimensions
These numbers may be too ambitious. Bit rates of 3, 6, and 8 Mbit/sec
respectively provide transparent quality for the above application
examples when generated by a reasonably sophisticated encoder.
Why does film perform so well with MPEG ?
1. The frame rate is 24 Hz (instead of 30 Hz) which is a savings of
2. Film source video is inherently progressive. Hence no fussy
interlaced spectral frequencies.