Understanding Digital Video
By Steve Rose (copyright 2002, all rights reserved)
Wednesday, October 23, 2002
Digital Video Primer
(see also How Video On Demand Works)
The real world appears to be totally analog. The brightness of a scene, the loudness of a sound, the intensity of a taste or smell, all seem to vary on a continuous scale, without discrete breaks in level.
Electronic circuits can be made to work in this manner, so that their output is proportional to their input. If the circuit is an amplifier with a gain of 1000, then a one millivolt change in input produces a one volt change in the output, regardless of whether the input changed from zero volts to .001 volt, or from 2 volts to 2.001 volts. The output is a proportional replica, or analog, of the input. The characteristics that measure the accuracy of the amplifier are linearity and freedom from distortion. However, analog electronic circuits are never perfect, so whenever a signal is processed, its quality is degraded.
It is easy to build an electronic circuit whose response is all or nothing. These two-state, or binary, circuits can vary any parameter to represent the "all" and "nothing" states: a voltage, a current, a state of magnetization, the presence or absence of a stored charge, etc. All that is required is a clear threshold between the two states. It doesn't even matter if the signal representing each state is noisy, as long as one state can be clearly discriminated from the other. As a result, information about the state of a circuit can be passed through any number of stages without distortion, as the output will still be either "all" or "nothing". In most cases, one state will be a voltage near 0, representing the number 0, and the other state will be a voltage near the power supply voltage, normally positive, representing the number 1.
What good is a circuit that can just represent 0 or 1? By combining this type of binary circuit, you can represent any number, and by using numbers to represent other symbols or parameters, binary circuits can represent (and manipulate) any kind of information. It is surprising how many of the items that we think of as things (e.g. books, newspapers, recordings, transaction records, pictures, movies, etc.) are actually information. When you buy a newspaper, it isn't for the paper or the ink, but the shape of the ink on the paper!
The American Standard Code for Information Interchange (ASCII) uses the number 65 (decimal) to represent the letter A. Any time an ASCII based computer sends text to another computer, it sends the value 65 to represent A. When you type A on your keyboard, it is stored in your computer as the number 65. When you send an email, and the recipient's computer finds that 65, it puts an A on the screen, just as you intended.
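The ASCII mapping is easy to see for yourself. A quick Python sketch (the text "Hi" is just an arbitrary example):

```python
# ASCII assigns the number 65 to the letter A; ord() and chr() expose the mapping.
print(ord('A'))   # the number your computer stores and transmits: 65
print(chr(65))    # what the recipient's computer draws on screen: A

text = "Hi"
codes = [ord(c) for c in text]   # what actually travels between computers
print(codes)                     # [72, 105]
```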
To represent audio with numbers (for example, during a computer videoconference), the output level of your microphone is measured periodically, and the number representing that measurement is transmitted to the computer of the person with whom you are conferring. When the recipient's computer receives that number, it uses it to determine the amount of displacement of its speaker. Since the signal from the microphone represents the displacement of its diaphragm responding to vibrations from your voice, and the speaker of the recipient's computer is vibrating the air in the same way at the distant location, the recipient hears your voice (as long as the measured samples of the microphone occur frequently enough).
So how do you combine binary circuits to represent numbers? Consider how we count now. In the decimal system, we have ten symbols, 0 through 9, to represent a value in each position of a number. Each position to the left represents a value ten times greater than the preceding position. So the first position on the right (before the decimal point) represents how many 1's we have, from 0 to 9. The next position represents how many tens, the next how many hundreds, etc. We can count to 9 in the first position, so the number 1 in the second position has to represent 10. We can count to 99 with two positions, so the number 1 in the third position has to represent 100. This positional notation is compact and unambiguous: there is exactly one way to represent each number.
If we only have two symbols for each position, 0 and 1, we can still count to any number – it just takes more positions. Since we can only count to 1 in the first position, the second position has to represent 2. Since we can only count to 3 with two positions (one plus two, or a binary 11, pronounced one one), the third position has to represent 4. Since we can only count to 7 with three positions, the fourth position has to represent 8. So instead of the decimal system's ones place, tens place, hundreds place, thousands place, etc., we have the binary system's ones place, twos place, fours place, eights place, etc. We can still represent any number, it just takes more places. For example, to count to a million or so, we need seven places in the decimal system, and twenty positions in the binary system.
Notice that in the decimal system, the places represent the powers of 10:
Ten to the zeroth power is the ones place, ten to the first power is the tens place, ten to the second power is the hundreds place, etc. The same is true in the binary system: two to the zeroth power is the ones place, two to the first power is the twos place, two to the second power is the fours place, etc. In fact, in a number system based on any number of symbols, the same is true. If we can represent sixteen symbols, then the places would be 16^0 = ones, 16^1 = sixteens, 16^2 = two-hundred-fifty-sixes, etc.
In fact, when humans represent their work with computers, they typically use a base 16 representation (called hexadecimal) instead of the computer's native binary. Instead of working with binary numbers like 11111110 (254 decimal), hexadecimal (base 16) representation is used to make the numbers shorter and easier to work with. Instead of creating new symbols for the values 10 through 15, the letters A through F are used (lower or upper case doesn't matter). Counting from 0 to 15 in binary takes four places, so by visualizing the binary number in groups of four binary digits (called bits), 11111110 becomes 1111 (= 1 + 2 + 4 + 8 = 15), represented by the letter F in hexadecimal (frequently just called hex), followed by 1110 = 14 decimal or the letter E in hex. So 11111110 binary = 254 decimal = FE hexadecimal. It is the same number, the same value, just represented differently.
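Python can check this equivalence directly, since it accepts numbers written in any of the three bases:

```python
n = 0b11111110          # the binary number from the text
print(n)                # 254 (decimal)
print(hex(n))           # '0xfe' (hexadecimal)
print(bin(0xFE))        # '0b11111110' (back to binary)

# Grouping the bits in fours maps each group to one hex digit:
print(int('1111', 2), int('1110', 2))   # 15 14 -> the hex digits F and E
```

The same value, three notations, exactly as described above.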
So how high can you count on one hand? If you assign each finger a binary position value (e.g. 1, 2, 4, 8, 16), and that value is expressed when that finger is raised, then you can count from 0 to 31 on one hand, leading to the famous Hawaiian "seventeen, bra!".
It is possible to "throw away" some of the possible values in a binary representation of numbers, to make it more "human friendly", and use just ten of the sixteen possible values of a four bit binary number to represent 0 through 9. Each decimal position is represented by four bits. This is referred to as "binary coded decimal" (BCD), and although it means that equipment designers and software engineers have to jump through more hoops, it results in a system that is generally easier to understand, and for some financial calculations may be more accurate. How could it be more accurate? Well, any integer can be represented in any number system, but not every fraction. For example, in decimal, we can't exactly represent 1/3 (.333333...). In binary, we can't exactly represent all decimal fractions from 1/100 to 99/100 (pennies). As a result, unless math is very carefully done in binary, it ends up with penny rounding errors, which can of course compound themselves. Because BCD counts in decimal using binary representation, it can accurately represent all of the normal financial values in a way that can use the same calculation and rounding rules as traditional accounting.
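The penny problem is easy to demonstrate. Python's `decimal` module does exact decimal arithmetic (conceptually the same idea as BCD: count in decimal, store in binary), so it makes a fair comparison against native binary floating point:

```python
from decimal import Decimal

# Binary floating point cannot represent a dime or a nickel exactly:
print(0.10 + 0.20)           # 0.30000000000000004
print(0.10 + 0.20 == 0.30)   # False - a lurking penny-rounding error

# Decimal arithmetic keeps pennies exact, as BCD-style hardware did:
print(Decimal('0.10') + Decimal('0.20'))                      # 0.30
print(Decimal('0.10') + Decimal('0.20') == Decimal('0.30'))   # True
```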
Digital signals can be transmitted in parallel (several bits at a time, typically 8, each on a separate wire), or one bit after another (serially). Parallel interfaces are generally used for computer plug-in circuit boards (e.g. PCI and ISA boards), internal disk drives (hard disk, CD-ROM, floppy drive), and external printers and SCSI devices. Other external interfaces are serial (USB, FireWire, Ethernet, RS-232). Some serial interfaces are much faster than some parallel interfaces, although in theory it should always be possible to make a parallel interface faster (parallel = multiple synchronized serial lines). The speed of an interface generally depends on when it was devised, and the medium over which it is transmitted.
Serial and parallel interfaces may be unidirectional or bidirectional. If the same wires are used to transmit the data in both directions, each end has to take turns, which is called half duplex operation. If there are separate wires for each direction, full duplex operation is possible, which allows both ends to transmit simultaneously. (Optical fiber is generally used in this manner, but a single fiber may also be used to transmit information in both directions simultaneously.)
When bits are transmitted in parallel, there are one or more extra wires used to synchronize the reception of the data. Serial interfaces may use a separate clock line, a clock signal embedded in the data (both synchronous techniques), or signal the start of each byte and depend on a local clock at the receiver to maintain synchronization for the duration of that byte (asynchronous operation). Separate lines may be used for throttling purposes ("Hey! My buffer is almost full! Hold off for a while!") in each direction, especially with RS-232 interfaces.
So how are signals converted from analog to digital? Analog to digital converters (ADCs) grab a sample of the value of a signal at a specific instant, then measure the signal and convert it to a number. This number represents the value of the signal at that instant, for example, its voltage. By taking enough regularly spaced samples, it is possible to recreate the original signal from the measured values. Nyquist determined that if you take more than two samples per cycle of the highest frequency of interest, the signal can be recreated. He also determined that if any frequencies greater than half the sampling rate are present, alias signals will be created which were not present in the input. These relationships are illustrated by the following diagram. Typical values, as used for audio CDs, are a sampling rate of 44,100 times per second, and a highest frequency of interest of 20,000 Hz. This means that the input signal must contain almost no energy in frequencies above 22,050 Hz (½ the sample rate).
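Aliasing is easy to demonstrate numerically. In this sketch (frequencies chosen purely for illustration), a 3 Hz cosine sampled at 4 samples per second produces exactly the same sample values as a 1 Hz cosine, the alias at the sample rate minus the input frequency:

```python
import math

fs = 4.0      # sampling rate, samples per second
f_hi = 3.0    # above the Nyquist limit of fs/2 = 2 Hz
f_alias = fs - f_hi   # 1 Hz: the alias that will appear in the samples

hi = [math.cos(2 * math.pi * f_hi * n / fs) for n in range(16)]
lo = [math.cos(2 * math.pi * f_alias * n / fs) for n in range(16)]

# The two sample streams are indistinguishable - the receiver cannot
# know whether it is looking at 3 Hz or 1 Hz:
print(all(abs(a - b) < 1e-9 for a, b in zip(hi, lo)))   # True
```

This is why the input must be filtered to remove energy above half the sample rate before it ever reaches the ADC.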
When a signal is sampled, some precision is lost due to the resolution of the measurement. This is called quantization error, as the process of measurement is the process of quantifying the value, and the resolution of the measurement is determined by the number of bits used to represent the measurement. With 8 bits, the measurement can have one of 256 values. With 16 bits, it can have one of 65,536 values (2^16). The greater the resolution, the less the distortion caused by quantization error.
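Quantization error can also be made concrete. The `quantize` helper below is a hypothetical illustration (not how any particular ADC is built): it maps a value in the range -1..1 to one of 2^bits levels and back, and the round-trip error never exceeds half a step:

```python
def quantize(x, bits):
    """Map x in [-1.0, 1.0] to one of 2**bits levels, and back to a value."""
    levels = 2 ** bits
    step = 2.0 / levels
    code = min(int((x + 1.0) / step), levels - 1)   # integer code, 0..levels-1
    return (code + 0.5) * step - 1.0                # midpoint of the chosen step

# With 8 bits (256 levels), the error is bounded by half a step: 1/256.
xs = [i / 1000.0 - 0.5 for i in range(1000)]
worst = max(abs(x - quantize(x, 8)) for x in xs)
print(worst <= 1.0 / 256)   # True
```

Doubling the number of bits squares the number of levels, which is why 16-bit audio sounds so much cleaner than 8-bit.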
Once the analog signal has been sampled into a sequence of digital values, a great deal of magic can be accomplished. The signal can be transformed, filtered, compressed, have special effects added such as echo, be mixed with other signals, etc., all mathematically! This is frequently done using a special purpose computer called a Digital Signal Processor (DSP). Further, the signal can be transported over great distances, through many stages of processing, recorded and copied, without losing any further quality, due to its binary nature.
Once the data representing the signal reaches the point where it is to be consumed, it is converted back to an analog signal by a Digital to Analog Converter (DAC). This is a device that takes the digital number representing the value of the signal at that instant and converts it back to the corresponding voltage. By continuing to do this at each sample interval, and filtering the output so that it contains no signals at frequencies higher than the highest frequency of interest, the original signal (plus any intentional processing) is recreated.
Here is an interesting sidelight on our analog world: It is much more digital than analog, but the sample resolution is really high. Take the intensity of light. Light consists of photons, each a discrete amount of light. Light is already quantified, in the most literal sense. And there are manmade and natural sensors that can detect individual photons. What about the analog nature of air pressure, the basis of sound? Disregarding that the pressure represents the sum of the energy of individual atoms striking our ear drum, the response of neurons in our perceptual system is quantified. Each nerve impulse generated has the same strength as other impulses, but what matters is the rate and timing of impulses and which neurons are generating the impulses. So we really live in a digital (quantum) world, but with excellent resolution!
Another interesting thing about the real world: The signals to which we are exposed can vary in intensity by a million to one or more, and yet we have to be able to operate in either extreme. Our ears are so sensitive that they can detect a sound that is only moving our eardrums the distance of a hydrogen atom, but we can withstand (for short periods) the sound of a jet engine or rock band. Light can vary from moonlight to sunlight, but we can still see. It doesn't seem like a million to one in terms of brightness or loudness because our ears and eyes have a logarithmic sensitivity – we perceive on a curve. It takes a much larger change in actual brightness to be perceived in the day than in the night. Some sampling circuits, especially those intended for audio, take advantage of this logarithmic sensitivity to establish a nonlinear sampling response, where sample increments near zero represent much smaller changes than sample increments at the signal value extremes. This allows fewer bits to be used with the same apparent quality as perceived by a human, and is our first example of how knowledge of human perception can be used to reduce bit rates and cost when working with digitized signals.
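One standard example of such a nonlinear response is the mu-law companding curve used in North American telephony. A minimal sketch of the continuous curve (real systems quantize the compressed value to 8 bits, which is omitted here):

```python
import math

MU = 255.0   # the constant used by North American telephone mu-law companding

def compress(x):
    """Logarithmic (mu-law) compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """The inverse curve, applied at the receiving end."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# A signal at 1% of full scale already uses about 23% of the output range,
# so the quiet end of the scale gets most of the quantizer's codes:
print(round(compress(0.01), 3))   # 0.228
print(round(compress(0.5), 3))    # 0.876
print(abs(expand(compress(0.3)) - 0.3) < 1e-12)   # True: the curve itself is lossless
```

The loss only appears when the compressed value is quantized, and thanks to the curve, that loss lands where the ear is least sensitive.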
Signals, especially audio and video, typically contain a remarkable amount of redundancy. The sound of a musical instrument producing a note at a constant amplitude and pitch is typically almost the same waveform, produced over and over, thousands of times a second. A lot of redundant data can be eliminated by sending the original waveform once, then only sending the subtle changes. The original waveform can be fully reproduced at the receiving end, an example of lossless compression. (Lossless compression means that the original signal can be perfectly reproduced with the information provided.)
A masking phenomenon occurs with human hearing, where a soft sound at a given frequency will be overwhelmed by a louder sound at that frequency, and never heard. There is no need to transmit the subtle sound, which saves bandwidth. This is an example of lossy compression, as we have discarded information, but we have not affected the perceived quality of the reproduced signal. This is a second example of using knowledge of human perception to reduce the bit rate. (Lossy compression means that real information has been discarded, but the reproduced signal is as good as possible with the number of bits used, with the goal being that a human can't tell the difference between the reproduced signal and the original.)
A fax machine compresses data by looking at a page one narrow stripe at a time. Instead of sending "white white white white black black white...", it sends "4,1,2,0,18,1..." where white is 1 and black is 0. The message is interpreted by the receiver as "4 whites, 2 blacks, 18 whites...". This run length coding saves lots of space in the transmission, and is why a regular document is transmitted much faster than one with an image. Run length coding is lossless – the original image is reproduced exactly, within the limits of the sampling resolution of the fax machine.
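Run length coding fits in a few lines of Python. This sketch encodes the exact stripe from the fax example, as (count, value) pairs:

```python
def rle_encode(pixels):
    """Collapse runs of identical values into [count, value] pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][1] == p:
            runs[-1][0] += 1      # extend the current run
        else:
            runs.append([1, p])   # start a new run
    return runs

def rle_decode(runs):
    return [v for count, v in runs for _ in range(count)]

W, B = 1, 0   # white is 1 and black is 0, as in the fax example
line = [W, W, W, W, B, B] + [W] * 18
encoded = rle_encode(line)
print(encoded)                        # [[4, 1], [2, 0], [18, 1]]
print(rle_decode(encoded) == line)    # True - perfectly lossless
```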
Video consists of a sequence of still images, presented so quickly that the brain sees only a moving image (persistence of vision). Video not only contains redundant information within an image, it contains even more redundant information from frame to frame. To compress video, many techniques (tricks) are applied. First, the resolution of the luminance portion of the video may be reduced (MPEG-1 typically reduces the resolution to one quarter (CIF) or one sixteenth (QCIF) of full NTSC resolution). Second, the number of color samples is reduced by half ("4:2:0" versus 4:2:2, explained elsewhere), since the eye is less sensitive to this information. So before we have even considered processing, we have reduced the information we have to deal with (a lossy process).
Then a succession of techniques is applied. First, the image is divided into 8 by 8 blocks of pixels, forming bands across the screen. Each block is compressed individually. A Discrete Cosine Transform is applied to the matrix of samples representing the block, to change the data from a spatial orientation to a frequency orientation. This does not result in compression, it just organizes the data from a different point of view. By scanning the resulting matrix of values in a diagonal zigzag, though, the most significant values end up at the beginning of the list, and by quantizing the result (essentially like resampling and throwing away the least significant information), the total amount of information is significantly reduced. DCT is reversible and lossless. Quantization is irreversible and very lossy. The good news is that by doing the transform, the discarded information is generally below the threshold of human perception – it wouldn't have been seen at the other end anyway, even if we had spent the bandwidth to send it. The next steps for the data are to use Run Length Coding and Huffman Coding, both lossless compression techniques. This complete image is called an I frame. The video data is then packetized, interleaved with audio data and metadata, and transmitted.
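The DCT-then-quantize idea can be sketched in one dimension (MPEG applies the two-dimensional version to 8 by 8 blocks; the quantizer step of 4 below is chosen purely for illustration). The transform alone is lossless; rounding the coefficients is the lossy step, and for smooth image data the damage is small:

```python
import math

N = 8

def dct(x):
    """Orthonormal 1-D DCT-II of an 8-sample block."""
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

def idct(X):
    """Inverse DCT (DCT-III): recovers the original samples."""
    out = []
    for n in range(N):
        s = X[0] * math.sqrt(1.0 / N)
        s += sum(X[k] * math.sqrt(2.0 / N) * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                 for k in range(1, N))
        out.append(s)
    return out

# A smooth ramp of pixel values: most of the energy lands in the first coefficients.
row = [16, 18, 20, 22, 24, 26, 28, 30]
coeffs = dct(row)

# The transform by itself is lossless:
recovered = idct(coeffs)
print(all(abs(a - b) < 1e-9 for a, b in zip(row, recovered)))   # True

# Quantization (divide and round) is the lossy step; most small
# coefficients collapse to zero, yet the block survives almost intact:
q = [round(c / 4) for c in coeffs]
approx = idct([c * 4 for c in q])
print(max(abs(a - b) for a, b in zip(row, approx)) <= 2)        # True: small error
```

The long runs of zeros produced by quantization are exactly what the run length and Huffman stages then squeeze down.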
The next frame of video contains almost the same information, and there is no point in retransmitting everything. The first frame is decoded and compared with the next frame. Each group of four blocks (a macroblock) is compared with the region of the picture nearby, to determine if it has just moved. If so, we need not resend it, but just send its "motion vector" (which direction and how far it moved). Once the entire picture has been examined in this way, and recreated to the extent possible, it is compared to the decoded original picture, and the difference image compressed and sent (a predicted, or P frame). The comparison is done to the decoded frame rather than the original data for the first frame, as the decoded frame is all the receiver will have as a reference.
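Motion search can be illustrated with a toy block matcher. The sizes here are deliberately small (4 by 4 blocks on a 12 by 12 frame; MPEG macroblocks are 16 by 16), and the sum-of-absolute-differences cost is one common choice among several. A bright square moves down one row and right two columns between frames, and the search finds where it came from:

```python
def sad(block_a, block_b):
    """Sum of absolute differences: how badly two blocks match (0 = perfect)."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))

def get_block(frame, y, x, size=4):
    return [row[x:x + size] for row in frame[y:y + size]]

def best_motion_vector(prev, cur, y, x, search=2, size=4):
    """Search a small window of the previous frame for the best match."""
    target = get_block(cur, y, x, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = y + dy, x + dx
            if 0 <= py and 0 <= px and py + size <= len(prev) and px + size <= len(prev[0]):
                cost = sad(get_block(prev, py, px, size), target)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best

# A bright 4x4 square at (2,2) in the previous frame moves to (3,4) in the current one.
prev = [[0] * 12 for _ in range(12)]
cur = [[0] * 12 for _ in range(12)]
for i in range(4):
    for j in range(4):
        prev[2 + i][2 + j] = 200
        cur[3 + i][4 + j] = 200

print(best_motion_vector(prev, cur, 3, 4))
# (0, -1, -2): perfect match found one row up and two columns left,
# i.e. the block moved down 1 and right 2 - only the vector need be sent.
```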
In MPEG-2, it is also possible to compare a frame to one that comes later as well as one that came earlier. This creates a B frame, or bidirectional frame. This requires a new twist on transmitting the data – it must be transmitted out of order. To decode a B frame, the later reference frame must have already been received and decoded. It won't be presented, however, until after all B frames that reference it. The advantage that makes it worthwhile is that a B frame is about 1/3 of the data of a P frame, which is about 1/3 of the data of an I frame. MPEG-2 also knows how to handle the interleaved fields of NTSC video which make up a frame.
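The out-of-order transmission can be modeled simply. This is a simplified sketch (real streams carry explicit decode and presentation timestamps rather than relying on position alone): each forward reference frame is moved ahead of the B frames that depend on it.

```python
# Display order of a short GOP; B frames reference the I/P frame on each side.
display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

def transmit_order(frames):
    """Move each forward anchor (I or P) ahead of the B frames that need it."""
    out, pending_b = [], []
    for f in frames:
        if f.startswith("B"):
            pending_b.append(f)    # hold B frames until their forward anchor is sent
        else:
            out.append(f)          # send the anchor first...
            out.extend(pending_b)  # ...then the B frames that reference it
            pending_b = []
    return out + pending_b

print(transmit_order(display_order))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

When B1 arrives, the decoder already holds both of its references (I0 and P3), yet P3 is not displayed until B1 and B2 have been shown.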
At the receiving end, the signal goes through the same steps in reverse. Huffman and run length coding is removed, the matrix recreated with zeros substituted for discarded values, and a reverse DCT is performed. If this is a P or B frame, it is applied to the reference frame(s), and the luminance and chrominance signals are recreated.
It is important to realize that the process which takes the compressed video and audio, packetizes it, identifies it, adds metadata, multiplexes it with other program streams, and generally prepares it for transport is not part of MPEG-2 compression. It is the MPEG-2 transport standard, defined in the MPEG-2 systems document. It can be used as a protocol to deliver MPEG directly to the set top box. However, it can also serve as a protocol to deliver MPEG over IP (Internet Protocol). Several modulator manufacturers have adopted an MPEG Transport over IP, using Gigabit Ethernet connections, as a means of distributing MPEG content within the headend. The traditional means has been to use native MPEG Transport protocol over an interface called DVB-ASI (Digital Video Broadcast Asynchronous Serial Interface), which is far more expensive. The point is that there are two MPEG-2 standards, one for compressing audio and video, and the other for transporting the resulting data. MPEG transmitted using IP protocol, either locally or over the Internet, may be a program stream, or a transport stream.
With conventional analog video, all you have to do to switch between video sources is to synchronize them, wait for the next vertical sync interval, then switch. Ironically, digital video is a lot fussier about changing from one source to another (a process called splicing). First a word about GOPs (groups of pictures): A GOP starts with a reference I frame, containing a complete image. It is typically followed by a P frame, which references the I frame, or a B frame, which is derived from the preceding and following I or P frames. If the forward reference frame is the I frame of the next GOP (e.g. IBBI...), the current GOP is characterized as open. If the forward reference frame is part of the current GOP (e.g. IBBPI....), then the GOP is closed and may be used as a splice point. The video being switched from is switched at the end of a closed GOP, and the video being switched to must be at the beginning of a GOP. If the video being switched from is at the end of an open GOP, then it will use the I frame of the video being switched to as its forward reference. This can result in considerable embarrassment due to the extended weird glitch created by the decoder referring to something completely alien to the video frame used for encoding.
This has led to the creation of a class of equipment referred to as splicers, which vary in complexity. A simple splicer detects when a valid splice may be accomplished, and does so. A complex splicer manipulates both video streams to create an appropriate splice point on demand. Splicers typically have the ability to do the job for all of the video streams in a multiprogram MPEG multiplex. At the top of the line are splicers that can selectively take video / audio streams from several incoming multiplexes, combine the selected streams into a unique output multiplex, adjust the bit rate and other parameters to optimize the use of bandwidth in the resulting multiplex, and create perfect splices.
Adjusting the bit rate is referred to as transcoding, and it too can be done with varying levels of sophistication.
Once we have a digital signal and want to transmit it, we have to modulate an RF carrier with our digital information. This is generally done by using a modulator which produces a fixed intermediate frequency output (IF), then following it with a frequency agile upconverter (which takes the lower frequency IF and converts it to any selected channel assignment in the standard cable realm, typically 52 to 850 MHz). The modulator usually uses a technique called Quadrature Amplitude Modulation (QAM), which uses a combination of amplitude and phase to create 64 or 256 possible output values per symbol period (one cycle). When QAM modulation is visualized on a scope, it is displayed as a square group of dots (8 x 8 for QAM64, 16 x 16 for QAM256) called a constellation. The dots can be fuzzy (noise and impairments in the system), but as long as they don't overlap, the signal can be decoded.
QAM64 represents a six bit value (2^6=64) with each symbol (also called a baud). Its bit rate is six times its symbol rate. Since a little more than 5 MHz of the 6 MHz channel allocation is used (the rest is for guard bands), QAM64 has a symbol rate of 5.06 M symbols / second, and a bit rate of about 30 Mb/s. The cable standard for QAM256 has a higher symbol rate (narrower guard bands) at 5.36 M symbols / second, but since it represents 8 bits per symbol (2^8=256), its bit rate is about 42 Mb/s. About 10% of the bit rate is consumed by overhead, including forward error correction. So: In simplified terms, a symbol is transmitted every cycle, and another term for symbols per second is baud. The number of bits transmitted per symbol is the bit efficiency, and the bit rate = bits / symbol * symbols per second, or bit efficiency * baud. (Baud rate only equals bit rate when the bit efficiency is one bit per symbol, but baud is widely misused with telephone modems.)
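The arithmetic above is simple enough to verify directly. The symbol rates are the figures from the text, and the 10% overhead deduction is the text's rough figure, not an exact standard value:

```python
def qam_bit_rate(msymbols_per_sec, bits_per_symbol, overhead=0.10):
    """Gross bit rate, and net rate after subtracting overhead such as FEC (Mb/s)."""
    gross = msymbols_per_sec * bits_per_symbol
    return gross, gross * (1.0 - overhead)

for name, rate, bits in [("QAM64", 5.06, 6), ("QAM256", 5.36, 8)]:
    gross, net = qam_bit_rate(rate, bits)
    print(f"{name}: {gross:.2f} Mb/s gross, roughly {net:.1f} Mb/s usable")
```

This reproduces the ~30 Mb/s and ~42 Mb/s figures quoted above, and shows why one QAM256 channel can carry roughly 40% more programs than a QAM64 channel.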
QAM modulation works well in the downstream direction on a cable plant. The higher the number of possible values, the cleaner the system must be. QAM64 works downstream in almost any system, QAM256 works downstream in clean systems. Upstream is more difficult, and typically, QAM16 is considered a good result in a clean system. QAM signals are a little delicate, because they vary in amplitude and a low amplitude symbol can get lost in the noise. Other modulation schemes, such as Quadrature Phase Shift Keying (QPSK), are much more robust, although their bit efficiency is low (typically netting one bit per Hertz). Each QPSK symbol is full amplitude, giving the greatest noise immunity. For satellite transmission, this means that a transponder can run in saturated mode (most efficient), so QPSK is the most commonly used satellite modulation scheme. QPSK can also be used for the upstream signal in a noisy system. Even more robust is Code Division Multiple Access (CDMA), which transmits each bit in several different parts of the spectrum (chosen by a pseudorandom selection), then receives and sums each of those different transmissions. The random noise tends to cancel when it is added together, and the signal representing that bit "rises" out of the noise with successive summation. Not only can CDMA work in a noisy system, its own signal is generally indistinguishable from noise, unless you have the pseudorandom sequence and know where to look.
So that is about it. Acquire a signal, preprocess it, sample it, manipulate and combine it, squeeze the heck out of the result, store it if necessary, switch it with other signals, transmit it, receive it, fluff it back up, decode and reproduce it. Nothing to it. You may also want to make sure that only the intended recipient can see or hear the result, which means that you will have to encrypt the signal before transmission and decrypt it after reception, but that is another story. Some excellent references for understanding even more detail than we have discussed are:
Home page of ISO / MPEG
A private (MPEG TV) link page to MPEG resources (many links broken)
A favorite introduction