How Video On Demand Systems Work

By Steve Rose (copyright 2002, all rights reserved)
Thursday, October 24, 2002

When a subscriber requests a movie from the VOD system, the ultimate result is that a compressed digital stream of bits representing the movie is transmitted from the video server to the set top box (STB), where it is decoded back to good old analog video and audio, and displayed on the TV directly or by being modulated on an RF channel that links the TV to the set top box.

· First, the movie has to be digitized and compressed, and distributed to the video server.

· Information about the movie has to be added both to the management system (e.g. during what period does our contract with the studio allow us to play the movie), and to the list of movies from which the customer makes a selection (e.g. title, stars, cost, etc.).

· When the customer selects the VOD “channel” on the STB, we have to make sure that the application that displays the VOD information is loaded on the STB.

· When the customer chooses a movie, we have to make sure that there are no account problems that would make it a poor business decision to deliver the movie, and that we have enough bandwidth at this moment to deliver it.

· Then we have to allocate the resources for the delivery, and set up the protection scheme which allows only that subscriber to view the movie.

· Now we can start delivering the movie! However, we have to keep listening to the user in case they want to pause, fast forward, rewind, or stop the movie.

· And if they don’t finish the movie, we need to keep it on a list of movies that that subscriber is entitled to continue to view during the period for which it was purchased.

So that is why system diagrams look so complicated for an apparently simple job. Adding to the apparent complexity is the need for many separate computers. However, this is just because no individual processor is fast enough for all of the jobs we need to accomplish, so the jobs get subdivided and placed on different machines to speed things up. Also, we don’t want to put all of our eggs in one basket, so having more than one machine lets us continue to operate even if a machine fails.

An important note about the word “server”: Server can mean a physical box running one or more processes, or it can refer to a software process itself. Or in the case of the video server, it can refer to a collection of boxes running as one entity! So always feel free to ask for clarification about computer terms. Remember, computer guys are the ones who think that 1K is 1024, and can’t decide if 1 M is 1,000,000 or 1024 x 1024! (Actually, technoids think it is 1024 x 1024, and marketers for hard drives think it is 1,000,000 because it makes the drive seem larger).

Compression

To digitize the movie, it is scanned as a television image. Each line is divided into about 700 pieces, and (in theory) each piece is measured for brightness (luminance) and color (which really involves three measurements, for red, blue, and green). Each piece is called a pixel, which stands for “picture element”. Each measurement is a number, which can be expressed digitally (as a sequence of ones and zeros).

Now we can start playing games, to reduce the amount of information we need to transmit. This is really the first stage of compression. Here are some clues that will help.

Our eyes are least sensitive to blue and red, and most sensitive to green. Our perception of detail is most influenced by green. It turns out we can throw away ¾ of the red information, and ¾ of the blue information, and still reproduce an excellent picture. (The trick is to average between lines, as well as between samples, to reduce it this far.) When you hear the term “4:2:0” or “4:1:1” applied to video, it means that there are 4 luminance samples for each red sample and each blue sample (it is a little more detailed, but this is the essence).
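
To make the averaging trick concrete, here is a minimal sketch in Python. It assumes a color plane stored as a list of rows with even width and height, and the name subsample_420 is made up for this illustration. In practice the planes that get thinned out are color-difference signals derived from red and blue, but the arithmetic is the same.

```python
def subsample_420(plane):
    """Average each 2x2 block of a color plane into a single sample.

    `plane` is a list of equal-length rows with even width and height
    (an assumption made to keep the sketch short).
    """
    out = []
    for y in range(0, len(plane), 2):
        row = []
        for x in range(0, len(plane[y]), 2):
            total = (plane[y][x] + plane[y][x + 1]
                     + plane[y + 1][x] + plane[y + 1][x + 1])
            row.append(total / 4)
        out.append(row)
    return out


chroma = [
    [10, 12, 200, 202],
    [14, 16, 204, 206],
    [50, 50, 90, 90],
    [50, 50, 90, 90],
]
small = subsample_420(chroma)

# Fraction of the original samples that survive: one in four.
kept = sum(len(r) for r in small) / sum(len(r) for r in chroma)
```

Sixteen samples go in and four come out, which is the "throw away ¾" described above.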

The brightness of a pixel is determined by a weighted sum of the brightness of the red, blue, and green components. So if we know the luminance of a point, and how much is from red and how much from blue, the amount of green can be derived rather than being transmitted separately.
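
Here is a small sketch of that derivation. The weights are the ITU-R BT.601 luminance coefficients, which is an assumption on my part since no particular standard is named here; the function names are invented for the example.

```python
# ITU-R BT.601 luminance weights (assumed; the article names no standard).
KR, KG, KB = 0.299, 0.587, 0.114


def luminance(r, g, b):
    """Brightness as a weighted sum of the three color components."""
    return KR * r + KG * g + KB * b


def recover_green(y, r, b):
    """Solve the luminance equation for green, given Y, red and blue."""
    return (y - KR * r - KB * b) / KG


r, g, b = 120, 200, 40
y = luminance(r, g, b)
g_back = recover_green(y, r, b)  # matches g up to rounding error
```

Green carries the largest weight, which fits the point above that it dominates our perception of detail.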

Now we have a sequence of bits that represent the image. There are many regular patterns in the sequence, just as there are regularities in the images themselves. An extreme example is the similarity of successive frames of video, until there is a scene change. The point of compression is to eliminate these patterns, or redundancies. If I change my socks, it is a lot easier to say “same outfit, different socks” than to describe the whole outfit again. The same is true for digital video compression. It is a lot easier to start with the same picture and just send the differences.
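
The "same outfit, different socks" idea can be sketched in a few lines. This is a toy, assuming a frame is just a flat list of pixel values; real encoders work on blocks and motion, as described next. The function names are hypothetical.

```python
def frame_delta(prev, curr):
    """Return {index: new_value} for only the pixels that changed."""
    return {i: v for i, (p, v) in enumerate(zip(prev, curr)) if p != v}


def apply_delta(prev, delta):
    """Rebuild the current frame from the previous one plus the changes."""
    frame = list(prev)
    for i, v in delta.items():
        frame[i] = v
    return frame


frame1 = [7, 7, 7, 7, 3, 3, 9, 9]
frame2 = [7, 7, 7, 7, 3, 3, 5, 5]   # only the "socks" changed
delta = frame_delta(frame1, frame2)  # two entries instead of eight pixels
rebuilt = apply_delta(frame1, delta)
```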

To make the process more efficient, the picture is subdivided into blocks, and each block can be compared to a block that came in the preceding picture or in the next picture. The resulting information is how much the block has moved, and what are the differences. But wait – how can it be compared to the next picture, which in standard video hasn’t arrived yet? Easy – we send the pictures out of order, as required for decoding, then reorder them in the STB into the proper presentation order. To identify this reordering, we label the frame with a decoding time stamp and a separate presentation time stamp.
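
The reordering step might look roughly like this. It is a simplified sketch: the frame names are invented, and the time stamps are counted in whole frame periods to keep the numbers readable.

```python
from collections import namedtuple

Frame = namedtuple("Frame", "name dts pts")

# Arrival order: the P frame is sent before the B frames that refer to it,
# so the decoder already holds both of their reference pictures.
received = [
    Frame("I0", dts=0, pts=1),
    Frame("P3", dts=1, pts=4),
    Frame("B1", dts=2, pts=2),
    Frame("B2", dts=3, pts=3),
]


def decode_order(frames):
    """Order in which the STB decodes frames (by decoding time stamp)."""
    return [f.name for f in sorted(frames, key=lambda f: f.dts)]


def presentation_order(frames):
    """Order in which the STB displays frames (by presentation time stamp)."""
    return [f.name for f in sorted(frames, key=lambda f: f.pts)]
```

Here decode_order(received) gives I0, P3, B1, B2, while presentation_order(received) gives I0, B1, B2, P3.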

These techniques and a few others were invented by the Moving Picture Experts Group, a collection of crazed technoids from around the world (part of the International Organization for Standardization, or ISO), assembled by Leonardo Chiariglione from Italy. The standard we use, called MPEG-2, involves the transmission of three types of pictures: I frames, which are complete still images containing no references to other frames (Intra-coded frames); P frames, which use information from the preceding frame (Predicted frames); and B frames, which use information from preceding and following (either or both) frames (Bidirectional frames). Typically, a P frame is about 1/3 of the size of an I frame, and a B frame about 1/3 the size of a P frame. Typically an I frame is sent a few times each second to provide a point of reference. That is why you will sometimes see a digital image “assembling itself” from little blocks, until suddenly the whole image appears. When the whole image appears, an I frame has just been received.
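
Those size ratios let us estimate what the frame types buy us. The sketch below assumes a repeating 12-frame pattern of IBBPBBPBBPBB, which is a common arrangement but only an assumption here, and uses the rough 1/3 ratios quoted above.

```python
from fractions import Fraction

# Sizes relative to an I frame: P is about 1/3 of I, B about 1/3 of P.
RELATIVE_SIZE = {"I": Fraction(1), "P": Fraction(1, 3), "B": Fraction(1, 9)}


def average_frame_size(pattern):
    """Mean frame size for a pattern such as 'IBBP', in I-frame units."""
    return sum(RELATIVE_SIZE[t] for t in pattern) / len(pattern)


gop = "IBBPBBPBBPBB"           # assumed pattern: 1 I, 3 P and 8 B frames
avg = average_frame_size(gop)  # Fraction(13, 54), roughly 0.24
```

Under these assumptions the stream is about a quarter of the size it would be if every picture were sent as an I frame.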

Various techniques are used to compress I frames. One is a “run length” approach. When a whole line is white, it transmits “white, 700 times”, instead of “white, white, white,…white”. The most powerful is called “Discrete Cosine Transform”, or DCT. DCT looks at the video in each block from a different point of view – from a frequency perspective, instead of position – and with a small trick of reordering, this causes the information to be rearranged in a way that reflects our ability to perceive it. As a result, a great deal of the information that would make little or no difference in our perception can be lopped off and discarded. This is lossy compression, in that the decompressed image is not identical to the original, but the additional compression is remarkable. There are other transforms that could have been used, by the way, but DCT was selected because it uses the simplest math, and is easier and faster to implement.
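
The run length idea is simple enough to show in full. This is a generic sketch using pixel names instead of numbers, not the exact coding MPEG-2 applies; the function names are made up.

```python
from itertools import groupby


def rle_encode(line):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(line)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original line."""
    return [value for value, count in pairs for _ in range(count)]


white_line = ["white"] * 700
mixed_line = ["white"] * 3 + ["black"] * 2 + ["white"]

encoded_white = rle_encode(white_line)   # a single pair: "white, 700 times"
encoded_mixed = rle_encode(mixed_line)
```

Unlike the DCT step, this one is lossless: decoding gives back exactly the line that went in.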

Audio compression uses different techniques, but with similar results. Audio typically takes about 1/10 or less of the bandwidth of the video information, but still requires great sophistication to compress. And without clean audio, video is just about useless (which is why they charge to rent headsets on airplanes).

Finally, when all of the other compression techniques have been applied, a “vocabulary” is selected which provides the smallest representation. In other words, if you were writing a story about footwear, you wouldn’t want to use the term “bent cloth tube closed at one end” over and over, you’d just say “socks”.
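
One classic way to pick such a vocabulary is Huffman coding, which gives the most frequent symbols the shortest codes. The sketch below is a textbook version offered as an illustration; the article does not say exactly how the vocabulary is chosen, and the function name is hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(symbols):
    """Return {symbol: bitstring}; frequent symbols get shorter codes."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: partial_code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest groups, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


words = "socks socks socks shoes socks boots socks".split()
codes = huffman_codes(words)
encoded = "".join(codes[w] for w in words)
```

In this footwear story "socks" ends up with a one-bit code and the rarer words get two bits each, so all seven words fit in nine bits.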

The point of compression is to eliminate redundancy in the resulting bit stream. As a result, a perfect output is indistinguishable from random noise (unless you know where to start and what to do to decode it). However, this means that noise in the input signal can’t be compressed (it contains no redundancy), so it wastes a lot of bandwidth. Anything that generates a signal that looks like random noise requires more bandwidth to produce a clean image on the other end. Examples are explosions, panning past the crowd during a sporting event, and pictures of trees with their leaves blowing in the wind.
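
You can see this effect with any general purpose compressor. The sketch below uses zlib from the Python standard library as a stand-in (it is lossless and has nothing to do with MPEG-2 itself) and compares a regular pattern against pseudo-random bytes.

```python
import random
import zlib

random.seed(0)  # fixed seed so the sketch is repeatable
size = 20_000

patterned = bytes(i % 16 for i in range(size))                  # very regular
noise = bytes(random.getrandbits(8) for _ in range(size))       # no pattern

# Compressed size as a fraction of the original size.
patterned_ratio = len(zlib.compress(patterned)) / size
noise_ratio = len(zlib.compress(noise)) / size
```

The patterned data shrinks to a tiny fraction of its size, while the noise does not shrink at all and can even grow slightly.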

Satellite Distribution of Content

Once we have the movie digitized and compressed, how do we get it to our video servers, located around the country? There are three choices, each of which is used:

1. Put the results on a tape or disk, replicate it for each of hundreds or thousands of locations, and ship it by UPS / FedEx / etc. This way is expensive, slow, and exposes the content to delays and theft, but it works.

2. Send the content over a terrestrial data network. This is instantaneous, but expensive since each location requires a separate transmission. Even if terrestrial multicast were available, a great deal of redundant bandwidth would be required.

3. Send the content by satellite multicast. For most purposes, this is the smart bet. Although satellite bandwidth can be expensive, everyone can receive a single broadcast simultaneously, so no replication is required. “Multicasting” is implicit in satellite transmission. There are two gotchas: Local weather can interfere with satellite reception, and anyone in the satellite’s “footprint” can receive the signal. The first problem is solved by both sending redundant information (“forward error correction”), and by asking each site if any of the transmission was missed. If so, only that information needs to be retransmitted. The second problem is solved by encrypting the content to prevent its interception.

What is “multicast”? All it means is sending the same information to a selected set of receiving locations at the same time. There are two fundamental types of communication systems: In one, messages are sent from point to point, like a phone call from one person to another; in the other, messages are sent to everyone, like a radio broadcast, and it is up to the receiver to select which messages it listens to, if any. In a point to point, or switched, system, many messages can be transmitted simultaneously, as each is independent and only goes to the intended recipient. It is a parallel system. In a broadcast system, only one message may be sent at a time, but everyone can receive it – it is a serial system. A version of multicasting using a point to point network involves replicating information only at the last possible step, so that bandwidth is saved in earlier steps. This capability is not consistently deployed.

A satellite is a radio broadcaster – everyone in the satellite’s “footprint” (the area served by the satellite) can receive its signal. The only way to control who can use the information that is broadcast is to encrypt it, and only tell those who are entitled how to decrypt it.

This part of the system has the job of distributing produced content to many locations in the same time period. The content being distributed is not being watched in real time, it is being stored for subsequent use (when its license period begins). We can provide adequate protection by encryption. The content is already digitized and compressed, so it makes very efficient use of a data communications pathway. As a result, satellite distribution is ideal for our purposes.

Satellites operate at very high frequencies, where rain can actually interfere with the signal! When you are sending a signal to hundreds or thousands of locations, some will have inclement weather that will interfere with parts of the transmission. There are two techniques that we use to overcome the “rain fade” problem. One is called forward error correction, or FEC, where we send extra information that can be used to overcome many transmission errors, without having to resend any information. FEC uses mathematical “magic” to minimize the amount of extra information that must be sent in the first place. However, sometimes there are errors that are too big for FEC to overcome. In those cases, information must be retransmitted. By definition, different locations miss different parts of the data. Fortunately, the data is divided into packets that can be individually identified by sequence number. A receiving location can determine which packets it has missed, and ask for just those packets to be retransmitted. Any time that more than one location has missed the same packet, it only has to be retransmitted once. Typically, the retransmissions occur after the original transmission is complete, which gives a better chance that any interfering local weather has moved on.
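
Both techniques can be sketched in miniature. The example uses a single XOR parity packet as the simplest possible form of forward error correction; that is a stand-in chosen for brevity, since the actual codes used on satellite links are stronger and are not specified here. The site names and sequence numbers are invented.

```python
from functools import reduce


def xor_parity(packets):
    """Byte-wise XOR of equal-length packets: a very simple FEC packet."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*packets))


def recover_one(received, parity):
    """Rebuild the single missing packet (marked None) from the parity."""
    present = [p for p in received if p is not None]
    return xor_parity(present + [parity])


def packets_to_resend(missing_by_site):
    """Union of the sequence numbers each site reports as missing."""
    needed = set()
    for missing in missing_by_site.values():
        needed |= set(missing)
    return sorted(needed)


packets = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(packets)                       # the extra information sent
rebuilt = recover_one([b"AAAA", None, b"CCCC"], parity)

reports = {"site_a": [17, 42], "site_b": [42], "site_c": []}
resend = packets_to_resend(reports)                # packet 42 is sent only once
```

One parity packet can repair one lost packet with no retransmission; anything worse falls through to the request list, where a packet missed by several sites still appears only once.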

This brings up another reason that our job is so well suited to satellite. Because the content being transmitted is for later use, we can tolerate the delays involved in retransmitting any missed information. This means that we can use smaller antennas and less power and bandwidth than would be required for real time live transmissions (e.g. a 1.5 meter satellite antenna instead of a 10 meter dish). Our transmissions also consume half to a tenth of the bandwidth of analog video satellite transmissions for the same information. However, we are not constrained to transmit at a rate that represents the normal rate for video – we can go faster or slower than real time.
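
A little arithmetic shows what "faster or slower than real time" means. The sketch assumes a two hour movie and uses the 4 million bits per second stream rate quoted later in this article; the two link speeds are made-up examples.

```python
def movie_size_bits(bitrate_bps, duration_s):
    """Total size of a constant-rate stream."""
    return bitrate_bps * duration_s


def transfer_time_s(size_bits, link_bps):
    """Time to push the file through a link of the given speed."""
    return size_bits / link_bps


duration_s = 2 * 60 * 60                       # assumed 2-hour movie
size = movie_size_bits(4_000_000, duration_s)  # 28.8 billion bits (3.6 GB)

fast = transfer_time_s(size, 16_000_000)  # assumed 16 Mbit/s link: 30 minutes
slow = transfer_time_s(size, 1_000_000)   # assumed 1 Mbit/s link: 8 hours
```

Because nobody is watching during delivery, either link works; the choice is a trade between satellite bandwidth and how soon the title has to be on the server.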

Once the content gets to the headend, it is stored on the video server (more about that device later). There are several other elements to getting a movie to you aside from delivering it to your local cable headend. There has to be accountability for the delivery -- both for the cable company, to be sure that they have delivered what they have promised, to the right person; and for the customer, to be sure that there is a fair bill generated for all services received. Those pieces include:

1. The billing system, which is the ultimate repository for financial transactions.

2. A controller which manages on-demand services, and tells the video server what to do.

3. A controller for the delivery network, that allocates the resources necessary to fulfil a customer request (sets up a pathway from the video server to the customer).

4. A business management system, which is the “dispatcher”, knows how to talk to the other elements of the system, and keeps records of almost everything that transpires. It can ask the billing system if an account has enough credit to play a requested movie, it can ask the network controller if there is enough bandwidth to deliver it (and request that the path be set up), and it can tell the video server to play the title. It also knows whether a movie is in its "license period" where the cable operator is allowed to play it, and removes it from the menu if it is no longer available.

5. A network management system, which "oversees" the operation of each of the elements of the system, ranging from programs to computers to modulators to amplifiers, and lets the system operator know if any piece is misbehaving.
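
The dispatcher role of the business management system (item 4 above) can be sketched roughly as follows. Everything here is hypothetical: the class, the function names, the dates, and the idea that the answers from the billing system and network controller arrive as simple values.

```python
from datetime import date


class Title:
    """A movie plus the period during which we are licensed to play it."""

    def __init__(self, name, license_start, license_end):
        self.name = name
        self.license_start = license_start
        self.license_end = license_end

    def playable_on(self, day):
        return self.license_start <= day <= self.license_end


def menu_for(titles, day):
    """Only titles inside their license period appear on the menu."""
    return [t.name for t in titles if t.playable_on(day)]


def handle_request(title, day, credit_ok, free_streams):
    """Return the dispatcher's decision for one customer request."""
    if not title.playable_on(day):
        return "not available"
    if not credit_ok:          # answer from the billing system
        return "account problem"
    if free_streams < 1:       # answer from the network controller
        return "no bandwidth"
    return "play"              # tell the video server to start


titles = [
    Title("Movie A", date(2002, 10, 1), date(2002, 12, 31)),
    Title("Movie B", date(2002, 1, 1), date(2002, 9, 30)),
]
today = date(2002, 10, 24)
menu = menu_for(titles, today)   # Movie B has dropped off the menu
```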

Because some of these systems take time to respond, there is one important underlying philosophy: Assume the customer is right. It sometimes takes a while to verify the status of an account, especially if the billing system must be queried. The assumption is made that everything is OK, and services are launched. If it turns out that the customer is not entitled to the service (overdue bill, past credit limit, etc.), the service is stopped at the time that the determination is made. The period of service already received is considered a preview, and no bill results. [1]
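
In code, "assume the customer is right" amounts to starting the session first and applying the billing answer whenever it arrives. This is a minimal sketch with an invented class and an invented price, not a description of any real billing interface.

```python
class Session:
    """One viewing session that starts before the credit check returns."""

    def __init__(self, price):
        self.price = price
        self.state = "playing"   # service is launched immediately
        self.billed = 0.0

    def on_billing_reply(self, entitled):
        """Called when the (possibly slow) billing system finally answers."""
        if entitled:
            self.billed = self.price
        else:
            # Stop now; what was already watched counts as a free preview.
            self.state = "stopped"
        return self.state


good = Session(price=3.99)
good.on_billing_reply(entitled=True)    # keeps playing, bill is generated

bad = Session(price=3.99)
bad.on_billing_reply(entitled=False)    # stopped, nothing billed
```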

Once we have all of the “paperwork” taken care of, it is up to the video server to deliver the movie.

The job of the video server is to deliver isochronous (one second of content per second of real time) compressed digital video to the correct port (the one connected to the downstream modulator which can reach the customer who has made the request), and to address that video data stream so that the customer's STB can recognize and retrieve it. It also communicates with the Conditional Access system to set up encryption for that stream, so that only that customer can receive and decode it. Even after it begins to deliver the stream, it continues to listen for commands from the subscriber for pause, fast forward, rewind, and stop. It is the server that makes Video On Demand just like playing back a VCR.

The server is remarkable in the rate at which it can deliver data. When you consider that a small to medium cable system needs a server that can deliver about 10,000 streams for 50,000 to 100,000 subscribers, and that each stream is about 4 million bits per second, the total bandwidth of the server has to be about 40 gigabits per second!
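
The arithmetic behind that figure, as a quick sketch using only the numbers quoted above (the function name is made up):

```python
def server_bandwidth_bps(streams, stream_bps):
    """Total output rate when every stream is playing at once."""
    return streams * stream_bps


total = server_bandwidth_bps(10_000, 4_000_000)
gigabits = total / 1_000_000_000    # decimal giga, as data rates are quoted

# Share of subscribers who can watch at the same moment.
share_large_system = 10_000 / 100_000   # 10% of 100,000 subscribers
share_small_system = 10_000 / 50_000    # 20% of 50,000 subscribers
```

So the server is sized for somewhere between one in ten and one in five subscribers watching at the same time.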

Finally, we come to the Set Top Box (STB) itself. This is a conflicted box. It has to be able to decode legacy analog broadcast television channels as well as digital cable television signals. Then it has to retransmit them to the TV looking just like a regular broadcast station, usually on channel 3 or 4. (The raw video and audio are also made available for sets that can accept direct signals.) So far, we have most of the signal processing guts of a TV set, a miniature TV station, and MPEG-2 demultiplexing and decoding hardware. To that we add a graphics generation and display system (to generate the program guide screens, and other user interface screens), and a small general purpose computer. The computer has to have the ability to store a program semi-permanently, in a way that will be immune to power failures but can be updated automatically as required (flash memory). The STB also has sophisticated and secure high speed decryption hardware, including a “smart card” which can be inserted by the user and which may contain encrypted “secrets” to be used to allow access only to authorized customers. And, by the way, the STB has to be able to receive a downstream data channel (in addition to decoding data from a selected channel), as well as generating an upstream channel RF data signal strong enough to get back to the headend or hub where the customer data receivers are located (talk about a salmon swimming upstream!). And it has to do all of this for the lowest possible price, since the cable operator has to deploy so many! It is really one of the most remarkable pieces of the whole system.

As if the hardware of the STB weren’t enough, it also has to have sophisticated software which operates all of the components of the box in real time, generates displays, maintains up to date schedule information and operates the interactive program guide, handles emergency notifications, and runs the applications that enable new services to be offered to subscribers (even to play games on the STB!). Since the STB memory is limited, some applications have to be loaded on request, either over the high speed direct data path, or over a separate channel that repeatedly sends a list of programs for the STB. This is referred to as the data carousel, because the same information (updated occasionally) is repeated over and over on this channel. This allows the STB to download the information by just waiting for it to come around, rather than having to request it, and originated when there was no upstream data channel in STBs.
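
Here is a toy model of the carousel. It assumes each module takes one equal time slot, and the module names are invented; real carousels carry modules of different sizes.

```python
def slots_until(carousel, wanted, start_slot):
    """How many slots an STB tuning in at `start_slot` waits for `wanted`.

    `carousel` is the repeating list of module names, one per slot.
    """
    n = len(carousel)
    for waited in range(n):
        if carousel[(start_slot + waited) % n] == wanted:
            return waited
    raise LookupError(f"{wanted!r} is not on the carousel")


# The guide is repeated so that it never takes long to come around.
carousel = ["guide", "vod_app", "game", "guide", "weather"]

worst_vod = slots_until(carousel, "vod_app", 2)   # just missed it: 4 slots
next_guide = slots_until(carousel, "guide", 1)    # 2 slots
```

The STB never asks for anything; it just waits, and the worst case wait is one full trip around the carousel.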

There are several layers of software in the STB. Underlying it all is the operating system, which has to manage a very sophisticated box with lots going on with very little processing horsepower. Applications for the STB have to run in the limited processing and memory resources left, and require great skill to write. Program authors not only have to be aware of STB limitations, but also have to be familiar with the whole digital broadband delivery system.

So that is about it in a nutshell (although it turned into a Brazil Nut instead of a Filbert). The modern VOD capable cable system is sophisticated, but things that seem complex frequently consist of a bunch of simple components, and this one is approachable when you understand each piece. Have fun using it!

[1] There are already classic examples of people trying to abuse this philosophy! For example, one subscriber in San Diego (apparently out of credit) would keep requesting movies, then fast forward to the last place that he had seen in the movie and resume viewing from there until the system would determine his status and shut him off. Because he did this thousands (!) of times over a two week period, systems and operators are attuned to watch for abusive behavior and disconnect those folks rather than having to change their philosophy that almost all customers are honest.