Original Time Warner Server Documents, CRC, Capacity Plus
By Steve Rose
Monday, November 04, 2002
This page contains the components of the document that is referred to in US patent number 6,449,730, which was used as prior art in the initial lawsuit between nCUBE and SeaChange (see 6449730.pdf, page 3).
The first three papers were prepared for a Time Warner Cable VOD investigation that started in 1991. I was given Queens, NY, as the model for VOD deployment (1,000,000 subscribers, 10% peak usage, 3 Mb/s per stream), although that was scaled to 10,000 peak users for the first paper.
It also contains a copy of the original Capacity Plus Computer Systems brochure featuring RamDisk and self-repair. Capacity Plus computers were sold in Hawaii from 1979 to 1984, but we didn't get our brochure together until 1981. TWC chose us for the initial study because they were aware that we had invented RamDisk, and they had a RAM-based server proposal on the table from IBM and needed a sanity check on the numbers (as Jim Chiddix mentions in his letter, also included).
The EETimes article was a while in coming, and is not 100% accurate, but it was great and put me in touch with a lot of folks. What we had actually talked about was the economy of adding storage to a switch, versus using a switch to interconnect servers, since so little processing is required in a video server.
Finally there are the CRC Electronics brochures which were also part of the original document. A copy of the paper that Joan Van Tassel and I wrote for New Telecom Quarterly (published in 1996) can be found at www.viaduct.tv. The technical parts of the document are from a report prepared for CableLabs (a follow up to their first Media Server RFI) and delivered at a seminar in 1994.
(The formatting on the first paper is a little strange due to the OCR process.)
LARGE-SCALE VIDEO ON DEMAND
AN INVESTIGATION OF STORAGE, RETRIEVAL, AND SWITCHING
Video On Demand Playback Machine Investigation
for ATC by Steve Rose, Viaduct Corp.
The goal of this investigation was to determine whether any devices exist which could be used for Video On Demand (VOD) playback. The criteria included storage for about 400 movies; the ability to serve 100,000 subscribers of whom 10,000 may be active simultaneously, each able to start viewing any movie at any time; and the ability to combine the output signals to accommodate 200 to 1000 subscribers per neighborhood on an FTF distribution system. (FTF = HFC)
To determine the amount of memory required to store 400 or so movies, a movie length of two hours (7200 seconds) was multiplied by the data rate required by compressed high quality audio and video (3 megabits per second), and bits converted to bytes, with a result of 2.7 gigabytes. To store about 400 movies (given varying movie lengths) will require about 1000 gigabytes (= one terabyte).
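The sizing arithmetic above can be checked in a few lines (a sketch, using the figures quoted in the text):

```python
# Storage sizing check, using the figures quoted above.
seconds = 2 * 3600                     # two-hour movie
bits_per_second = 3_000_000            # 3 megabits per second, compressed
movie_bytes = seconds * bits_per_second // 8
print(movie_bytes)                     # 2_700_000_000 -- the 2.7 gigabyte figure
library_bytes = 400 * movie_bytes
print(library_bytes)                   # 1_080_000_000_000 -- about one terabyte
```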
To accommodate 100,000 subscribers in groups of 200 to 1000, 100 to 500 high speed ports are needed, each able to support up to a one gigabit per second output data stream. (It is estimated that about 10% of the subscribers to whom VOD is available will use it simultaneously. In a group of 1000 subscribers, the estimated demand will be for 100 simultaneous movies, each at a data rate of 3,000,000 bits per second, for a total of 300 megabits per second. A one gigabit per second data rate allows peak demand to be three times greater than anticipated demand.)
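The per-neighborhood figures reduce to the same kind of check (a sketch; the 10% peak-use estimate is from the text):

```python
# Per-neighborhood bandwidth check.
subscribers = 1000
peak_users = subscribers // 10          # the 10% simultaneous-use estimate
stream_bps = 3_000_000
peak_demand_bps = peak_users * stream_bps
print(peak_demand_bps)                  # 300_000_000 -- 300 megabits per second
port_bps = 1_000_000_000
print(port_bps / peak_demand_bps)       # ~3.3 -- roughly the threefold margin cited
```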
The VOD playback machine has four functions: Storage of the movies; simultaneous retrieval of the movies at up to 10,000 points in time; switching the correct movie to the correct subscriber group; and combining (multiplexing) the signals for all subscribers in each group. This all must be accomplished without interruption in the viewer’s program material.
Because of timing demands versus the speed of reasonably affordable memory, the storage must be spread over many elements (typically 8192 elements of 128 megabytes apiece). Because of the high speed involved, each output port will require its own processor. And the most complex part of the machine will be the switching network connecting the storage to the output.
A “massively parallel supercomputer” faces demands which are so similar that it is an ideal platform for the VOD machine. It has thousands of processor modules, each with its own memory, which must share information on a fundamentally random interchange basis with the other processor modules. Each of the jobs in the VOD machine can be accomplished by a group of processor modules and appropriate software.
Potential vendors contacted included Teradata, MasPar, nCUBE, Intel, and Thinking Machines, all leaders in the field who have delivered working equipment. They are unaware of who the customer is, or of the exact nature of the project (an analogous project was used to convey our needs). Intel, Thinking Machines, and nCUBE have equipment being delivered now or available within three months which they feel will come close to the requirements outlined above. Maximum memory sizes range from .1 to .5 terabyte, but each manufacturer felt they could find a way to mix electronic memory with magnetic or optical storage in a manner invisible to the subscriber to meet our one terabyte requirement. nCUBE also has some constraints on the number of output ports.
All three are enthusiastically willing to be involved, including making changes to the next generation of their equipment to better support our needs (about a two year timeframe). Further testing would be required by each manufacturer to determine the extent to which their current devices meet our criteria. Prices discussed range from ten to thirty million dollars, but could be considerably lower if they are successful in substituting magnetic or optical storage for electronic memory.
The entire market for massively parallel supercomputers in 1991 was less than 300 million dollars. The manufacturers are primarily aimed at the grand challenge science and transaction processing markets, and are just starting to look for commercial applications. The future VOD market could expand their total existing market by a factor of two or more. This should make it easy for manufacturers to justify changing designs to more closely accommodate VOD.
Video On Demand Investigation
for ATC by Steve Rose, Viaduct Corp.
1) To discover whether technology exists or will soon exist to construct a video library and switching system using compressed video which when combined with Fiber To Feeder (FTF) cable architecture could provide true Video On Demand (VOD).
2) Assuming the answer is affirmative, to determine the most promising available technologies.
3) To determine conceptually how such approaches to VOD might work.
4) To make broad cost estimates for the above.
1) A system with 100,000 subscribers equipped to receive this service, with up to 10,000 simultaneous accesses.
2) Each fiber trunk feeding a neighborhood of 200 to 1000 subscribers, with a downstream digital carrying capacity of 1 gigabit per second.
3) High quality compression of video and audio into a 3 megabit per second data stream per program.
4) An on-line library of about 400 programs (primarily movies).
5) A separate billing computer which receives the program requests, tracks the billing, and sends a message to the VOD machine to start a viewing process for a given subscriber.
6) A maximum delay from subscriber request to beginning of delivery of program of two seconds.
Components of the Video On Demand machine:
1) A storage system for the video and audio information to be distributed.
2) A retrieval system for the stored information which guarantees access at the minimum interval specified for the total number of clients to be served.
3) A multiplexer for each cluster of subscribers (e.g. all subscribers on one fiber) which combines the retrieved information for each active subscriber into one data stream.
4) A switching system which routes the retrieved information to the multiplexer.
The Storage System
The storage system must be large enough for the programming to be distributed on an immediate access basis. Each of the 400 or so programs will be 1.5 to 3 gigabytes long, based on 60 to 120 minutes each at 3 megabits per second, for a total of about one terabyte of information storage. For efficient transmission bandwidth utilization, the data should be compressed to the maximum extent possible, which implies a variable data rate. Therefore, the data will need to be stored with some sort of time tags as the system will not otherwise know at what rate the information needs to be transmitted. A one second granularity would probably be adequate, stored in a length:data format. It isn’t necessary to know which second of data is being handled, only how much data constitutes the next second, so the overhead for time tagging should be less than .01%.
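The length:data format described above can be sketched as follows (the 4-byte count prefix and the function names are illustrative choices, not from the original):

```python
import struct

def tag_seconds(chunks):
    # Serialize one-second chunks in length:data form: a byte count,
    # then that second's compressed data. A 4-byte big-endian count
    # is assumed here; the original does not specify a width.
    out = bytearray()
    for chunk in chunks:
        out += struct.pack(">I", len(chunk))
        out += chunk
    return bytes(out)

def read_next_second(buf, offset):
    # Return (data, new_offset) for the second starting at offset.
    # The system never needs to know *which* second this is, only
    # how many bytes constitute it -- exactly as described above.
    (count,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + count], start + count

# Overhead check: an average second at 3 Mb/s is 375,000 bytes, so a
# 4-byte tag costs about 0.001% -- comfortably under the .01% bound.
print(4 / 375_000 * 100)
```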
The Retrieval System
The retrieval system places very heavy demands on storage. Assuming 10,000 active subscribers and an average data rate of .5 megabytes per second, a byte of information will have to be recovered every 200 picoseconds (5 gigabytes per second). The use of cost effective memory implies a memory cycle time of about 100 nanoseconds. Random access memory for a single processor can recover information in any order, but only one word at a time (a word typically consists of one to eight bytes of data). Even with 64 bit wide memory, the average rate per byte is 12 nanoseconds, or about two orders of magnitude too slow. This means that memory must be segmented so that thousands of bytes may be recovered simultaneously. Each segment will require a controller of some type.
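The rates above can be verified directly (a sketch with the text's figures):

```python
# Retrieval-rate check, using the figures quoted above.
active = 10_000
avg_bytes_s = 500_000                  # .5 megabytes per second each
aggregate = active * avg_bytes_s
print(aggregate)                       # 5_000_000_000 -- 5 GB/s
print(1 / aggregate)                   # 2e-10 -- one byte every 200 picoseconds
ns_per_byte = 100 / 8                  # 100 ns cycle, 64-bit (8-byte) words
print(ns_per_byte / 0.2)               # ~62 -- roughly two orders of magnitude short,
                                       # hence segmenting memory into thousands of banks
```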
In terms of currently available storage technology, one terabyte would require nine million one megabit chips, or 2.25 million four megabit chips. Memory chips are commonly packaged as four megabyte blocks (nine bits per byte to provide parity checking) at a cost in small quantities of about $150 each for a 60 nanosecond access specification. About 256,000 of these blocks would be needed at a cost of around 37 million dollars not including quantity purchase discounts. This price includes the memory only, not the supporting circuitry. Within two years, these prices should fall by about a factor of four.
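A sketch of the same arithmetic (decimal megabytes are assumed here; the text's "about 256,000" suggests binary units, but the totals agree to within a few percent):

```python
# Chip-count and cost check for one terabyte of parity-protected RAM.
bits = 10**12 * 9                       # one terabyte at nine bits per byte
chips_1mbit = bits // 10**6
print(chips_1mbit)                      # 9_000_000 one-megabit chips
chips_4mbit = chips_1mbit // 4
print(chips_4mbit)                      # 2_250_000 four-megabit chips
blocks = chips_4mbit // 9               # nine 4-Mbit chips per 4 MB parity block
print(blocks)                           # 250_000 blocks
print(blocks * 150)                     # 37_500_000 -- "around 37 million dollars"
```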
Assuming neighborhoods of 200 to 1000 subscribers, 100 to 500 output multiplexers will also be required. Each multiplexer will combine the data for all active subscribers in its group in a statistically time domain multiplexed fashion. Since the data rate to each subscriber is variable, it makes more sense to use only as much time as required to transmit the necessary data rather than using a fixed time slot for each subscriber. The multiplexer’s job is to combine the output data in a manner which maximizes bandwidth utilization, and to guarantee that each subscriber receives data in a manner which prevents program interruptions from the subscriber’s viewpoint. During those moments that realtime bandwidth demand is low, the multiplexer works to fill each subscriber’s buffer in anticipation of peak demand periods. Each multiplexer will require a controller to keep track of the status of each subscriber.
The big job falls to the switcher which routes up to 10,000 individual program processes (each program at each point in time in use) to 10,000 active subscribers, many of whom may be sharing a given program process. The aggregate data rate of this switch is the same as the data retrieval rate, 5 gigabytes per second. And the switch must keep track of the program processes, since each program will span ten or more storage processors. Since several multiplexers could be receiving the same program process data stream, there should be a broadcast (multicast) mode to the switch. This would allow several multiplexers to be connected in parallel to one program process, and the data transmitted to all of them simultaneously.
Outline of Initial Considerations
I. Overview of challenge
A. Major increase in deliverable bandwidth results in new service opportunities.
B. Movies on demand seem to be a natural extension of pay per view plus the movie rental business.
C. Movies on demand may be implemented with several technologies. To determine the most appropriate, criteria of service must be developed.
1. Maximum start delay: This is the period between a customer requesting a service, and the service beginning. Two seconds has been chosen to provide an interval which will be perceived as an immediate response by the subscriber.
2. Maximum period of service interruption: This is the longest tolerable period of interruption resulting from the playback technology. No perceptible interruption is acceptable.
D. Other factors affecting choice of technology depend on system criteria.
1. How many movies need to be on line with < 2 second start delay? Four hundred movies (or other program material) occupying about one terabyte of storage is the number used for this investigation.
2. What is the anticipated service life of the equipment? This affects the degree of flexibility expected from the equipment, for example, if it can be upgraded to HDTV as demand changes. A four year minimum service life has been used.
3. What is the minimum criterion for video quality? Standard VHS is a minimum criterion, since that level has proven acceptable in the video rental market.
II. Summary of approaches
A. Multiple video tape machines
1. Each machine can serve only one set of viewers (a set is a group viewing a program at the same point in the timeline of the program), which may consist of a single viewer. There have to be as many machines as there are simultaneous viewer sets (each separated in time by two seconds or more or by different program material), and as many copies of each tape as there are possible simultaneous viewer sets for that program. The origination facility would have to be staffed or a complex mechanical juke box system devised. True random access to material would not be possible.
B. Multiple optical disk machines, each with multiple heads.
1. The constraints for optical disk would be similar to videotape, except that the number of machines required would be reduced in proportion to the number of heads per machine. Optical disk would also take advantage of precompressed data rather than video and audio having to be compressed in real time.
2. A good idea for the use of optical disk is having preinterleaved recordings on a nine or ten minute interval on the disk. However, there is a broad gap between providing programming with a two second delay versus a ten minute delay.
3. Another concept to make optical disk technology workable is the use of an optical drum with multiple playback heads. However, if the heads are not totally independent, it will be necessary to use a constant data rate encoding scheme, which discards the bandwidth conservation advantages of variable data rates. And again, the interval would be minutes instead of seconds.
C. Digital playback
1. Direct transmission of compressed digital signal with decompression at customer’s end, resulting in conservation of bandwidth throughout the cable system.
2. Easier to achieve multiple access to a single program, due to use of electronic memory for storage instead of mechanical optical disk or video tape storage.
3. True random access to programming is inherent in this storage technique. (It should be possible to bounce from place to place within the timeline of a program with a minimum of delay -- instant fast forward and rewind, or random access for instructional purposes.) The ability to take advantage of random access depends on having a short interval from a user request to the beginning of the requested service.
III. Approaches to Digital Playback
A. RAM Disk approach: Programs are downloaded to electronic memory (roughly a terabyte for 400 programs), and all playback is from RAM.
B. Virtualized memory approach: Programs come from a RAM buffer, which is constantly being reloaded from longer term mechanical storage (magnetic and/or optical disk).
IV. Interim conclusions
A. Compressed digital storage with decompression at the subscriber end seems
to make the most sense.
B. The problem is most straightforward when approached as a RAM disk, since virtualization (mechanical mass storage substituted invisibly for RAM) can be added later.
C. Two approaches to constructing such a RAM disk are the design of a large hardware memory, with an N x N switch connecting the user ports to the memory; and a multiprocessor computer, with the internal switching network of the computer used to route data from the terabyte memory to the I/O processor connected to the user port.
Brute force hardware reality check:
Assume a RAM only system. Although this is the most expensive approach, it avoids the confusion of a virtualized system (one where most of the storage is on disk) for initial considerations. Nothing in this assumption will prevent later virtualization of the design, if it is practical to do so.
Assume the Lowest Common Denominator approach to the selection of components. Since we will be using 70,000 to 250,000 memory devices (up to 2.25 million chips) to reach one terabyte, use the fastest which is commonly available in 1992, rather than the absolute fastest (an order of magnitude more expensive).
Third, look closely for the bottlenecks. Cacheing and buffering are frequently offered as solutions to slow memory access. But the main memory must be able to keep up with the data demand rate placed on it for hours at a time under all circumstances; and cacheing and buffering are generally short term fixes.
A straightforward approach to the design of the system would be each memory device connected to an N x N routing switcher, with the other axis of the switcher connected to a user port. The user ports would be multiplexed into neighborhood groups. If we assume 4MB memory devices (22 bits for address), multiplexing of data and address, 10,000 users, 250,000 memory devices, and 100 X 100 integrated crosspoint switches, then the total number of switches needed in our system is 22 X 200 X 2500. With this sort of design, we would need 44 times more switching ICs than memory devices, or 11 million switches. This is before we consider control signals, the resolution of contention for individual memory devices, or multiplexing the output. The problem in building a dedicated hardware VOD device is primarily switching rather than storage.
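The switch count works out as claimed (the 22 X 200 X 2500 factorization is taken from the text):

```python
# Crosspoint-switch count for the brute-force hardware design.
memory_devices = 250_000
switch_ics = 22 * 200 * 2500            # factorization as given in the text
print(switch_ics)                       # 11_000_000 switching ICs
print(switch_ics // memory_devices)     # 44 -- switch chips per memory device
```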
Centralized vs. Distributed VOD
A distributed approach to VOD has also been mentioned, where program resources might be distributed throughout a cable system. But since one program might have many viewers at different points in time within the program, considerable cable bandwidth would have to be used to redistribute the program. Centralized VOD makes the best use of resources.
The VOD machine could be built with around 8000 storage controllers with 128MB of RAM each, 100 output multiplexers, and enough switching controllers to handle up to 10,000 simultaneous processes. It might be possible to build special purpose devices for each of these jobs. However, it seems to make more sense to use a general purpose device, tailor its configuration, and code it for the job at hand. “Massively parallel supercomputers” fit this description. These devices consist of up to several thousand processors, each with 16 to 128 megabytes of memory, embedded in a very high speed switching network capable of many simultaneous paths. Each available device is subject to its own limitations outlined later. The three manufacturers whose hardware seems to be within the realm of meeting the needs of a VOD machine, and who are excited about the project even with the little they have been told, are Thinking Machines Corp., nCUBE, and Intel.
Video On Demand Attributes
The VOD application on a parallel computer system has the following attributes:
VOD requires very little processing power:
All of the VOD processes work with integers (addresses and byte counts). There is no floating point processing. Everything revolves round transferring information from place to place, timing, and setting up transfers.
VOD requires lots of communication bandwidth:
At every point in the system, the ability to move massive amounts of information in short periods of time is critical. The information is not processed or manipulated, just moved.
VOD requires massive storage:
With only four hundred programs on line, each stored with maximum compression, a terabyte of storage is still needed. And this terabyte will probably have to be RAM. Virtualization (primary storage on magnetic or optical media configured to look like RAM) is discussed below. But with the assumed maximum interval between subscriber requests, and variable data rates, it is unlikely that magnetic or optical storage will be practical. (oops – and this paper describes striping, but just for RAM! ’96 NTQ article is explicit, as was ’94 CableLabs presentation. sr)
VOD is an inherently load balanced application:
When examining the suitability of an application for a parallel processor, a primary issue is load balancing, or the distribution of the processing across the resources of the entire machine. VOD is ideal in this respect. The storage of a program is distributed over multiple processors (referred to as striping), as is the job of multiplexing the output and controlling the routing of data.
VOD needs guaranteed access rates:
It must be guaranteed that a subscriber’s video will not be interrupted once begun, or the ability to charge the subscriber for that event will be lost. This boils down to requiring guaranteed access rates within the switching network of the parallel computer.
VOD has no file access conflicts:
VOD files are not updated with information from subscribers. Program data is essentially read only, so many of the data sharing problems faced in a typical multiuser situation disappear.
VOD updates programs offline:
The only time program memory is written is when the target storage processors are not in use, again avoiding multiple access problems.
VOD processes are extremely predictable once initiated:
Under normal circumstances, once a subscriber begins viewing a movie, nothing will interrupt the process until the movie is over. This means that a subscriber process and its corresponding program process keep going for 60 minutes or longer once begun, which may simplify some aspects of their coding.
VOD Applications with Parallel Processing
The use of a general purpose parallel computer allows much more flexibility than mechanical or other fixed architecture approaches to the same problem. One important reason is the speed with which new processes may be set up, allowing a sequence of events to be perceived by the viewer as one session. At a minimum, this would correspond to VCR like controls for the user. However, it could also be used for:
Interactive instruction:
A branching hierarchy of instruction dependent on user feedback can be used to allow a student to advance as quickly as possible.
Special event programming:
It would be possible to set up special event programming which would play back all of the romantic scenes from Lloyd Bridges movies, or all the classic science fiction scenes involving robots before 1960, or any other salable collage.
Virtual reality entertainment (nonlinear participation):
In conjunction with more sophisticated decoders, it would be possible for a VOD to act as a repository of information used in virtual reality scenarios -- even to involve the interaction of multiple viewers.
Theater Minus One (linear participation):
In the same context as Music Minus One, and taking advantage of virtual reality concepts to project the viewer into the scene, this would involve productions where the viewer could take on a fixed role in a theatrical production.
On line catalog shopping:
In the style of home shopping channels, except the content would be under the control of the shopper. Flexible VOD is perfect for this application.
Barker capability to preview and sell PPV programs:
Any excess capacity of the VOD system could be used to sell VOD services to subscribers not currently active. When the system is fully loaded with active subscribers, barker activity would not be available -- which is fine, as there would be no further system capacity to sell.
Real time PPV broadcasting:
An ideal VOD compression strategy should be front loaded if possible. The majority of the processing should be done during compression, allowing simpler real time decompressors and better compression ratios. However, for live events, this poses a problem. Parallel processors are generally well suited for compression, and excess capacity of the VOD computer could be used to provide real time compression when required.
Compression and encryption of new programs:
By the same token, compression and encryption of new programming may be effectively handled by the VOD computer.
Available computer time:
Inherent in the design of most parallel processors is the ability to dynamically partition resources. During periods of low VOD demand, a significant portion of the computer’s power could be used for other purposes.
The program process controls the flow of data from the group of storage processors containing the selected program to the multiplexers connected to the active subscribers for this process. It needs to know which storage processors contain the program; which multiplexer processors contain the subscribers; the subscriber number within each receiving multiplexer; the number of bytes representing the next second of program material; and what the current real time is. Its job is to ensure that every second, a second’s worth of program material is delivered to the subscriber. Once that has been accomplished, it can check the multiplexers to see if they are ready for more material, and work ahead. The goal is always to keep the pipeline full, where the pipeline is represented by a buffer in the multiplexer, and a buffer in the subscriber’s decoder.
The storage process in the read mode responds as quickly as possible to demands placed by all program processes currently reading from its memory. If program processes are reading at an average rate of .5 megabytes per second, and there is a two second window for subscribers for each program process, there will be as many as one program process per megabyte of memory of the storage processor. If the processor holds 128 megabytes of memory, it must be able to service 128 processes with one byte every two microseconds on the average. Since a second’s worth of data would be transferred as a block, though, each process would control the processor for an average of 1/128th of a second.
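The storage-process figures above reduce to a few lines of arithmetic:

```python
# Load on one 128 MB storage processor (figures from the text).
mem_mb = 128
window_s = 2                            # playback-point interval per program
rate_mb_s = 0.5                         # average subscriber data rate
processes = int(mem_mb / (window_s * rate_mb_s))
print(processes)                        # 128 program processes per processor
print(1 / (rate_mb_s * 1e6))            # 2e-06 -- one byte every 2 microseconds, on average
print(1 / processes)                    # 0.0078125 -- each block transfer holds the
                                        # processor for about 1/128th of a second
```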
The subscriber process runs on the multiplexer. It needs to keep track of the subscriber’s address, how many seconds of material are stored in the subscriber’s decoder (how much leeway there is to accommodate periods of high bandwidth or throughput demand by postponing service to this subscriber), where in memory the next chunk of the subscriber’s data stream is stored, and how many bytes of data are in the next second of programming.
The multiplexer process is in charge of giving each of the subscriber processes a turn at accessing the output port. During its turn, the subscriber process sends the subscriber address, the byte count to be transmitted (the next second of program), and the data. It increments the count of seconds of information stored in the subscriber’s decoder, and decrements the same count if a second of real time has elapsed since the last transmission. When there is extra time (all subscribers have received the next second of programming so that there will be no interruption of service), the multiplexer process determines which subscriber has the least cushion of programming in their decoder buffer and whether the number of bytes to be transmitted for the next second of program can be transmitted within the time remaining. It gives as many subscriber processes a turn as can meet these criteria.
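The slack-time policy described above can be sketched as a greedy loop: among subscribers whose next second of data fits in the remaining port budget, refill the one with the smallest decoder cushion first. The field names and the byte budget below are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Sub:
    addr: int
    cushion_s: int        # seconds already buffered in the subscriber's decoder
    next_bytes: int       # size of the next second of programming (must be > 0)

def fill_ahead(subs, budget_bytes):
    # Work ahead during slack time; return the order in which
    # subscribers were granted a turn. Terminates because each
    # grant consumes part of the fixed byte budget.
    sent = []
    while True:
        ready = [s for s in subs if s.next_bytes <= budget_bytes]
        if not ready:
            return sent
        s = min(ready, key=lambda x: x.cushion_s)   # least cushion first
        budget_bytes -= s.next_bytes
        s.cushion_s += 1                            # one more second buffered
        sent.append(s.addr)
```

For example, with a 1 MB budget, a subscriber holding only one buffered second is served repeatedly before better-cushioned neighbors get a turn.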
Hopefully, the code for the storage processor would be small enough to run from the processor’s instruction cache, and the transfer of data could be handled via DMA from the main memory by a communications controller. This would be desirable for all other processes as well, which implies they should be written in assembly language.
Currently available standard high speed I/O channels run at up to 100 megabytes per second using HIPPI technology. (HIPPI stands for High Performance Parallel Interface, ANSI draft standard X3T9.3.) This is close to the target of 1 gigabit per second, although it is also possible to frequency multiplex two or more output channels (each statistically time domain multiplexed) for each neighborhood fiber.
It would be highly desirable in terms of cost to use optical or magnetic storage instead of RAM for playback. However, with the criterion of being able to provide a playback point every two seconds of each program, mechanical storage is not able to keep up. Even with drastically relaxed VOD requirements, mechanical storage would require specially built devices, and would prevent a wide range of future applications. Because of the high predictability of each subscriber process once initiated, it may be possible to develop a practical means of virtualizing some of the RAM storage (invisibly substituting less expensive optical or magnetic storage) while maintaining our two second criterion.
Thinking Machines Corporation’s newest Connection Machine, the CM-5, supports up to sixteen thousand process addresses. A processor consumes one, a HIPPI output port consumes six. Each processor can support 8 to 32 megabytes of memory. The data network guarantees a minimum bandwidth per processor of 5 megabytes per second for transmissions between any two processors, no matter what their distance. (Processors in closer proximity can communicate at up to 20 megabytes per second.)
The five megabyte per second rate implies no more than ten simultaneous users per storage processor (at .5 MB/sec each). This means no more than 10 MB per processor at a 2 second interval, or a six second interval for 32 MB storage processors. With six hundred addresses consumed by 100 HIPPI outputs, fifteen thousand processors could be used. The 100 HIPPI outputs could serve up to 100,000 simultaneous users. If twelve thousand were used for storage processors with a six second interval and 32 MB each, the machine would have a storage capacity of 384 GB, or about 110 to 150 programs. This would leave three thousand processors to manage the data transfer.
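These CM-5 figures follow directly from the guaranteed link bandwidth:

```python
# CM-5 sizing check (figures from the text).
link_mb_s = 5                           # guaranteed bandwidth per processor
stream_mb_s = 0.5
print(link_mb_s / stream_mb_s)          # 10.0 -- simultaneous streams per storage processor
storage_procs = 12_000
proc_mb = 32
total_gb = storage_procs * proc_mb / 1000
print(total_gb)                         # 384.0 GB of program storage
print(total_gb / 2.7)                   # ~142 two-hour programs, inside the 110-150 range
```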
The price of this machine if delivered as soon as possible would be “about the same as a B2”. This is the first MIMD machine for TMC. Their previous machines have been SIMD. This implies a lot of newly developed operating system software.
The Intel Paragon supports up to 1/8 terabyte of RAM in its current proposed configuration (Beta deliveries will begin shortly). Its active 2D mesh data network is viewed by many as the most effective data transfer configuration, with data rates of 200 MB per second and clean wiring layout. The Paragon also uses HIPPI output ports.
The Paragon is the most likely machine to achieve the data rates between storage and output which will be required for the largest VOD systems. Its operation is readily understandable, and straightforward plans exist for future speedups. It is possible to plug proprietary processors into the 2D mesh, although typically one might add proprietary processors to the Paragon processor board.
The nCUBE 2 supports up to 8192 processing elements and up to 1/2 terabyte of RAM. However, the maximums appear to be mutually exclusive, with the greatest memory for 8192 processors being 4MB each or 32 GB total (about 10 to 20 programs). Although each data path from the processor supports a rate of only 2.2 MB / second, there are 14 paths from each processor. In optimal conditions, 8 paths can be active at their maximum rate simultaneously. Its proprietary I/O system supports an aggregate bandwidth of about 10 gigabytes per second, or about 20,000 simultaneous output channels at .5 MB per second each.
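The nCUBE 2 numbers can be checked the same way:

```python
# nCUBE 2 capacity check (figures from the text).
procs = 8192
mb_each = 4
total_gb = procs * mb_each / 1000
print(total_gb)                         # 32.768 -- the ~32 GB total
print(total_gb / 2.7, total_gb / 1.5)   # ~12 to ~22 programs ("about 10 to 20")
io_bytes_s = 10_000_000_000             # ~10 GB/s aggregate I/O bandwidth
print(io_bytes_s // 500_000)            # 20_000 output channels at .5 MB/s each
```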
The nCUBE hardware is unique in several respects: First, it has been around for a while, and is deliverable immediately. Second, there is a very high level of integration in the nCUBE approach which means that each processor module consists of just a processor and memory on a board that is about 2 x 4 inches, which plugs into a master board which holds up to 64 processor boards. Their costs in putting together a machine of a given size should be significantly lower than their competition. Third, instead of one high speed channel from each processor, the nCUBE approach uses multiple lower speed channels.
Considerations in using parallel processors for VOD
The big problem in parallel processing is maintaining aggregate bandwidth between all processors in the light of communication path conflicts. Different data network architectures have evolved to solve the problem. nCUBE uses a “hypercube” architecture, which is an n-dimensional interconnect scheme which is supposed to maintain guaranteed throughput regardless of how many processors are added. Intel pioneered the hypercube architecture. They have switched from hypercube to an intelligent two dimensional mesh (xy grid) architecture, with a special purpose processor at each intersection which acts as an “overpass” for data (data can go straight through, make a right turn, or go to the processor attached to the intersection, and two bidirectional data streams can pass through an intersection simultaneously without interfering with one another, each running at 200 megabytes per second bidirectionally). Intel is convinced that this approach offers at least equal performance to a hypercube architecture up to 8000 processors. TMC uses a “fat tree” interconnection architecture in the CM5, which is their fifth machine (CM 1, CM2, CM2G, CM200, CM5). The CM5 has three independent interconnection schemes: The data network, the control network, and the diagnostic network. The control network handles messages affecting the entire machine. The diagnostic network is used to verify and maintain the integrity of the machine.
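The mesh behavior described above (a packet goes straight through an intersection or makes a single turn toward its destination) is essentially dimension-ordered "XY" routing. A minimal illustrative sketch, not taken from the report, of how such a route could be computed:

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing on a 2D mesh: travel along the
    x axis first, then the y axis, so a packet passes straight through
    intersections and makes at most one turn, as described above for
    the Intel mesh."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    step = lambda a, b: a + (1 if b > a else -1)
    while x != dx:                 # move horizontally first
        x = step(x, dx)
        path.append((x, y))
    while y != dy:                 # then vertically (the single turn)
        y = step(y, dy)
        path.append((x, y))
    return path

# A packet from (0, 0) to (3, 2) crosses 5 links:
print(len(xy_route((0, 0), (3, 2))) - 1)  # prints 5
```

Because every packet between the same pair of nodes always takes the same deterministic path, throughput analysis reduces to counting how many streams share each link, which is the conflict problem the passage describes.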
In general, the guarantees of whether a parallel processor approach will work must come from each manufacturer, as their switching systems differ and the calculations to determine the suitability of their system to the task at hand are proprietary.
From discussions with salesmen it appears that Intel has the best internal bandwidth, that nCUBE has the best cost per processor, and that TMC has the biggest currently deliverable machine. All manufacturers promise bigger memories with the availability of 16Mb DRAM chips. Bigger memories per processor only help us when the memory bandwidth is there to support the necessary additional accesses. The Intel design has the most room for expansion, as their memory bandwidth is (said to be) 400MB per second.
The design of the decoder box to be located at the subscriber’s end is the most important issue facing us for this project because it will consume a great deal of the budget for the entire project. The methods of multiplexing, modulation and encryption will be fixed by considerations for the decoder box.
Buffering: The use of memory to smooth out an irregular flow of information, where the ability to supply information is asynchronous with the demand for the information. The buffer acts as a reservoir. In the long run, the input data rate must equal the output data rate.
Caching: The use of fast memory to hold frequently accessed information, with the cache being reloaded from the larger, slower memory only when it does not hold the requested information. When this occurs, the oldest unused information in the cache is replaced with the newly requested information, since the new information is more likely to be reused. A cache can make a memory system appear much faster than the response time of the large, slow main memory behind the cache, but only if reloads occur infrequently (reloads are slower than a direct access to the main memory).
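The replacement policy described (evict the oldest unused entry) is what is now called least-recently-used (LRU). A minimal sketch, not part of the original report, using Python's ordered dictionary:

```python
from collections import OrderedDict

class LRUCache:
    """Small cache in front of a slower 'main memory' lookup function.
    A hit marks the entry most recently used; a miss evicts the least
    recently used entry and reloads from the slow path, as described
    in the glossary entry above."""
    def __init__(self, capacity, fetch_from_main):
        self.capacity = capacity
        self.fetch = fetch_from_main      # slow path: reload from main memory
        self.store = OrderedDict()
        self.misses = 0

    def read(self, key):
        if key in self.store:
            self.store.move_to_end(key)       # hit: now most recently used
        else:
            self.misses += 1                  # miss: slow reload
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict oldest unused entry
            self.store[key] = self.fetch(key)
        return self.store[key]

cache = LRUCache(2, fetch_from_main=lambda k: k.upper())
for k in ["a", "b", "a", "c", "b"]:   # "a" stays warm; "c" evicts "b"
    cache.read(k)
print(cache.misses)  # prints 4
```

The miss count is what matters for the argument in the text: a cache only pays off when reloads (misses) are rare relative to hits.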
MIMD: Multiple instruction, multiple data: each processor in a multiprocessor system is able to do its own thing: independent instructions for each processor, and an independent data stream for each.
SIMD: Single instruction, multiple data: each processor is running the same instruction at the same time, but on different data. Independent branching based on the contents of one data stream is not possible.
 (7200 seconds * 3,000,000 bits per second) / 8 bits per byte = 2,700,000,000 bytes per movie
Video on Demand Overview
Many approaches to Video on Demand (VOD) have been discussed. In one, the video is transmitted as quickly as possible to a storage unit in the home, then decompressed from local storage with full VCR-type controls (it was this approach for which a patent was recently granted). Another has video played centrally from a bank of tape or optical players (Queens, other field trials) and transmitted as conventional analog video. A third approach is to store compressed video as digital information on banks of disk drives (generally magnetic, in RAIDs [Redundant Arrays of Inexpensive Disks]). In different scenarios of this approach, the signal is either converted back to video at the head end and transmitted conventionally, or transmitted as digitally encoded compressed video to a decoder in the home which reproduces the video signal in real time rather than storing it. Finally, there is the approach of storing all programming in the electronic memory (RAM) of a massively parallel supercomputer (MPS), and using the computer to provide not just the playback, but the switching and multiplexing of signals for each cable neighborhood of 300 to 600 subscribers. Although this approach could be used with headend decoding to video and conventional distribution, it is really aimed at digital distribution with settop real-time decoders.

Playback from analog media is extremely limiting, and really suitable only for very small systems (such as an in-house hotel system) or field marketing trials.

Playback from RAID arrays has several drawbacks as yet unrecognized by those proposing it as an architecture. Making the most optimistic assumptions about hard disk capability (10 GB hard disk capacity, 10 ms access time, 10 megabytes/second full-time data output rate, zero overhead for seeking to a new part of the disk), and assuming a 3 megabit per second data rate per user, it is clear that each hard disk can support about 25 users (10 megabytes per second divided by 3 megabits per second).
We have assumed that we would like to have about a terabyte of memory on line to support 300 to 500 movies, or a total of about 100 drives plus 25 for redundancy. This implies 2500 maximum users per system if everything is operating at statistical perfection and the drives are all reading linearly (no seeking to different cylinders until the current cylinder is finished). When realistic constraints are included (30% seek overhead, 4:1 reduction for statistical anomalies), the number of users supported on a one terabyte system sinks to 425. For a 100,000 subscriber system with an anticipated demand of 10% (our design criteria), this would imply having about 24 such physically large systems, each containing redundant information. With 425 users per system, the data transfer rate from storage to distribution would be about a gigabit per second for each system. And each system would be connected to about eight neighborhoods, leaving a big switching problem to be tackled.

With this approach, programming would be stored redundantly on each system. This presents two problems: each system would have to be updated individually with the same information, and the RAM cache buffers for each system would generally be holding almost the same information at any time, given the popularity curve for any set of programming. It is probable that the resources required for this type of system for our design target would be greater than the terabyte RAM storage approach. However, the RAID approach shines in a small to medium market, or for limited testing of concept in a large market.

The massively parallel supercomputer approach still seems the most logical for a 100,000 subscriber system. The programming is stored in a non-redundant fashion in RAM. Although RAM is much more expensive than rotating media, it can support a much larger number of users because of its extremely fast access (<100 nanoseconds versus 10 to 20 milliseconds, a factor of more than 10,000).
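The derating above can be retraced with the report's own figures (a sketch only; the raw arithmetic gives about 437 users, which the report rounds to 425):

```python
# Retracing the RAID capacity derating with the figures given in the text.
users_per_drive = 25                 # optimistic: 10 MB/s drive vs. 3 Mb/s streams
data_drives = 100                    # ~1 TB of programming (plus 25 redundant drives)
ideal_users = users_per_drive * data_drives    # statistical perfection
after_seek = ideal_users * 7 // 10             # minus 30% seek overhead
realistic = after_seek / 4                     # 4:1 reduction for statistical anomalies
print(ideal_users)   # prints 2500
print(realistic)     # prints 437.5 -- "sinks to 425" in the text
```

The per-system output rate follows the same way: roughly 425 users at 3 Mb/sec each is about 1.3 gigabits per second, the "about a gigabit per second" figure above.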
And when caching is not required because the programming is already stored in RAM, redundant caching in multiple systems is eliminated. Further, this approach inherently solves the switching side of the distribution problem, which can rapidly exceed the storage problem in complexity. And this approach preserves the greatest amount of cable bandwidth due to its multiplexed digital compressed video output.

In our discussions, the downloading approach to settop storage has been dismissed as requiring too large an investment at the subscriber end, and as being clumsy overall. However, as storage becomes orders of magnitude less expensive (with, for example, three dimensional solid state optical storage), this approach will become more practical. The same storage economies that benefit local storage will benefit central storage, however. The best system will combine technologies on a statistical basis: the most popular programming should be stored in RAM, with less popular programming (with only a few simultaneous viewers) kept in RAIDs attached to the back end of the MPS system. In large subscriber communities, most programming would be kept in RAM (intentionally stored, not cached). In smaller subscriber communities, most programming would be kept in RAID arrays. The size of the MPS would largely be determined by the number of neighborhoods, or fiber nodes, and the resulting switching complexity. The MPS approach works on the largest systems and scales down as required, where the other approaches work on small systems but have difficulty scaling up.

Multimedia considerations: The user interface for the VOD/Multimedia system is only of concern when a program has not yet been selected, or when the service being used is not a VOD program. Once a program has been selected and its playback begun, it takes over the screen.
Even if future viewing habits change and the selected program occupies only a portion (window) of the screen, this sort of manipulation should be done at the decoder box (with different screen windows representing different data streams addressed to the same subscriber). Otherwise, video has to be decoded at the head end, other windows overlaid, and the video recompressed and multiplexed before transmission. Further, different windows are bound to have widely varying data rates which could most effectively be transmitted separately. For the time being, switching from the user interface screen server to the output of the VOD system once programming is underway allows the reuse of the multimedia screen server.

One thing which has become clear in the last few months is that there is a consensus protocol for data/video/audio transmission in ATM/SONET. Even if this protocol were not the most optimal for our purposes, there would be many reasons to adopt it: First, it seems to have tremendous momentum, with worldwide and industry wide support. This will ensure widespread availability of ATM protocol support equipment. Its standardization means that for those situations where it makes sense, we can use other carriers to get from one of our systems to another (the regional playback center concept might make use of this, as well as allowing WAN services between existing cable systems). It is a protocol which has been designed to be equally adept at handling video, audio, and data, all areas of interest to us. It means that our VOD data stream can easily carry other services. For computer network interconnection, it means no LAN/WAN barrier, as currently exists. Because ATM separates the physical network from the logical network, a workstation may be moved anywhere on the WAN without network reconfiguration (very important for corporate data transport customers who need to move a department across town).
And yet ATM allows up to a two order of magnitude difference in data rate between local area network data rates and wide area network backbones, resulting in few or no data bottlenecks regardless of the distance separating workstations and servers. The presumption is that an MPS system could prepare its output data stream preencoded for ATM cells, allowing its easy integration with other ATM services.

“I Frames” (completely encoded compressed video frames in a data stream of partially encoded frames, allowing a fresh start at video decoding) are the logical dividing point in any compressed video storage scheme for our minimum package of information. For instance, if we decide that we will be sending an I Frame every two seconds, then our compressed video/audio data stream should be partitioned into two second intervals of data for storage and tracking purposes. Although compression rates are variable depending on content, resulting in variable size two second data packages, the whole system depends on knowing about time, not data rate (except as a guaranteed maximum). As mentioned in the first report, time tagging of data is critical, and we should not lose sight of this fact as we discuss various storage and distribution systems for compressed digital signals.

Our initial field trials should probably make use of RAID technology for storage and playback, using conventional video channels and force-tuned receivers. This will allow further time for the MPS market to mature, and for ATM’s apparent predominance to become clear. RAID technology is currently available, and we can focus on the user interface software and commercially successful services at a relatively low cost. Experiments can begin with MPS manufacturers in the lab so that they can demonstrate to us that their machines can perform as advertised in our application. When MPS is ready, we will be ready, and will have experience with the RAID systems we will be using for the back end.
Further, we will be able to assure that the investment required by MPS technology will result in a commercially viable set of cable services.

Video on Demand: Current Status (1993)

Terms:

VOD: Video on Demand (2 second latency from request to delivery on popular titles). Although our conversations are based on VOD, it is only one of the services made possible by the underlying technology.

ATM: Asynchronous Transfer Mode, a data transmission protocol adaptable to the widest range of transmission speeds and suitable to the demands of video transmission.

MPP: Massively Parallel Processor, a computer assembled from many small processing elements, each with its own memory. A primary emphasis in MPPs is switching data among elements.

Assumptions:

Three megabits per second is adequate to transmit an acceptable standard video signal (with stereo audio) at this time. Assuming a movie length of 120 minutes, total data storage capacity for 300 movies will need to be 300 (movies) * 120 (minutes) * 60 (seconds per minute) * (3x10e6)/8 (bytes per second), or 810 thousand megabytes -- about a terabyte of storage capacity. Each movie will use 120 (minutes) * 60 (seconds per minute) * 3x10e6/8 (bytes per second) = 2.7 gigabytes of storage.

Neighborhoods of subscribers served by one fiber will range from 200 to 1000 subscribers. Peak overall demand for VOD service will be 10%; peak neighborhood demand will be 40%. The data rate required for the largest neighborhood at the period of peak demand will be 40% * 1000 (subscribers) * 3x10e6 (bits per second), or 1.2 gigabits per second of data. We have upped this to 1.5 gigabits per second to account for data stream overhead (addressing, error correction).

Subscriber decoder equipment capable of extracting a 3 megabit per second signal from a 1.5 gigabit per second stream will be reasonably priced.
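The storage and bandwidth assumptions above can be checked with a few lines of arithmetic (a sketch using only the figures given in the text):

```python
# Checking the storage and neighborhood bandwidth assumptions in the text.
MOVIES, MINUTES = 300, 120
BITS_PER_SEC = 3_000_000              # 3 Mb/s per compressed stream

bytes_per_movie = MINUTES * 60 * BITS_PER_SEC // 8
total_bytes = MOVIES * bytes_per_movie
print(bytes_per_movie)                # prints 2700000000 -> 2.7 GB per movie
print(total_bytes / 1e9)              # prints 810.0 (GB) -- "about a terabyte"

peak_subs = int(0.40 * 1000)          # 40% of the largest 1000-sub neighborhood
peak_bits = peak_subs * BITS_PER_SEC
print(peak_bits / 1e9)                # prints 1.2 (Gb/s), padded to 1.5 for overhead
```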
The VOD system is to be designed without taking advantage of statistical tricks which may be applied later (for example, 3 megabits/second represents a peak rather than average data rate per subscriber).

Anticipated conditions and suggestions:

ATM is likely to be the transmission format of choice in the future.

All program material should be stored in I Frame units. If the desired granularity of service is two seconds (e.g., fast forward and rewind jump in 2 second increments with picture display), then there should be an I Frame (reference frame which is fully encoded) every two seconds, and the material should be stored in two second chunks regardless of the number of bytes in that chunk. More than one I Frame may occur in a given chunk if there is a scene transition, but the chunk always begins with an I Frame.

Areas to be addressed:

Storage of program material: Types of storage (RAM, magnetic, optical), nature of stored material (incremental quality vs. data bandwidth, scenes and transitions for real time assembly, live event real-time compression vs. preprocessed material, accommodation of various compression protocols and services), granularity of service, maximum delay times between users.

Scheduling, billing, and system management: Which program is loaded at what priority of service, which programs are on hold by users, has the time allotted to a user for a program been exceeded (will programs be licensed by viewing time rather than continuous play to allow for “fast forward” and “rewind”?). Accept and process user requests for service or control.

Distribution of program data from storage to neighborhood server: “Broadcast” mode from storage to all neighborhood servers using one point in time; timely distribution of data from each storage unit to neighborhood servers for all of the points in time it is handling.
Neighborhood servers: Assembling ATM data streams for all current users in a neighborhood, and tracking the amount of data transmitted to each user versus the size of the buffer in their decoder so that it never runs out of material, and is never sent more than it can hold.

Subscriber equipment: Digital data will need to be encrypted as well as digitized and compressed, as it will quickly become “easy” to intercept and decompress any portion of a high speed data stream. The advanced technology itself is not adequate security. The home unit will have to do all that it does now, plus demodulate the digital data stream(s) and demultiplex them. It probably should have simple local data screen generation and overlay capability, as well as accommodating two way interaction (remote control with card swipe, telephone, trackball, house control).

Small systems or trial systems:

Transaction processing based models will probably work satisfactorily for very small systems. Figuring 2 GB per drive and a 10 MB/sec transfer rate per drive, and disregarding any bottlenecks beyond the base sustained transfer rate per drive, each drive can serve the equivalent of about 26 users at 3 Mb/sec (B=byte, b=bit). The 500 drives required to store a terabyte of information could serve a theoretical maximum of 13,000 users. Unfortunately, 10 MB/sec is a peak theoretical maximum sustained output, akin to Peak Instantaneous Music Power ratings in stereos. On top of that, the output rate falls to zero during those intervals when the drive is required to seek to another cylinder. Furthermore, RAIDs frequently impose a constraint on the output data rate of the array which is less than the sum of the output rates of the individual drives in the array. And all of this is before we get to the caching and switching portion of the system.
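The per-drive back-of-envelope above works out as follows (a sketch with the figures given in the text; B=byte, b=bit):

```python
# The transaction-processing estimate: users per drive and per terabyte.
drive_rate_bits = 10 * 8_000_000          # 10 MB/s sustained output, in bits/s
user_rate_bits = 3_000_000                # 3 Mb/s per viewer
users_per_drive = drive_rate_bits // user_rate_bits
drives_for_terabyte = 10**12 // (2 * 10**9)   # 2 GB drives

print(users_per_drive)                        # prints 26
print(drives_for_terabyte)                    # prints 500
print(users_per_drive * drives_for_terabyte)  # prints 13000 (theoretical maximum)
```

As the passage notes, this is a ceiling: seek overhead and RAID controller limits pull the real number far below 13,000.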
I expect that the transaction processing approach will be lucky to serve 1300 users from each terabyte machine, and that number only with very large cache memories, storing the most popular programs in RAM almost all the time. Even with only 1300 users, the total data rate on the output side will be 1300 * 3 Mb/sec, or 3.9 gigabits per second, which seems beyond the capability of most machines in this class. With careful choice of equipment, a system of this type might be adequate for systems of 10,000 to 50,000 subscribers, where the systems are duplicated for each 10,000 subs. However, each of the systems will have to store identical data (a popular movie is likely to be in demand throughout the system), and the drives and cache memories of each system will hold redundant information (without the normal advantages of redundancy). Soon, we will have paid more for RAM to cache multiple copies of the same movie in each machine than it might have cost to store the most popular titles in RAM in the first place in one central machine capable of operating at the required rates. Eliminating redundancy by allowing transmission between servers is the same as turning these separate systems into a single massively parallel system the hard way.

Metropolitan areas or shared area systems:

One class of computer architecture fits Video on Demand perfectly: massively parallel processors (MPPs). Our computing needs would be considered trivial. What we require is storage and switching, but in vast quantities and at very high speeds. MPPs are completely designed around solving our problems.

What we need: Just as hard disk drives are limited by their sustained output rate, RAM is limited by memory bandwidth (how often a word of memory may be accessed). As a result, the memory of our ideal machine would have to be segmented, with a memory controller for each segment.
The speed of accessing a different location in memory is five orders of magnitude faster than accessing a different location on a hard disk (100 ns vs. 10 ms), and the peak output rate of RAM is around 10 MB/sec times the bytes per word of memory (with 64 bit wide memory, the output rate is about 80 MB/sec). This means that the size of a chunk of memory is limited by the number of users which consume its complete output bandwidth. Assuming 80 MB/sec, users every 2 seconds, no penalty for broadcasting data from one point in time to multiple users, and 3 Mb/sec per user, each chunk of memory would be limited to 210 users at different points in time, or about 150 megabytes. If 20% of the three hundred movies offered are popular enough to be required to remain in memory all the time, then 60 (20% of 300 movies) * 2.7x10e9 (bytes per movie) / 150x10e6 (bytes per controller), or around 1100 controllers, will be required for this storage, each with 150 MB of RAM.

Each output channel serving a neighborhood will also require a controller with adequate local buffering, as each neighborhood may require a data stream of up to 1.5 gigabits per second, or about 200 megabytes per second. A 100 megabyte buffer on the output channel controller will allow about 1/2 second of material to be assembled for all users in the peak output case. A given chunk of compressed program information may be going to several output channel controllers, and each output channel controller may be collecting information from as many as 400 storage locations (for the worst case of a 40% load in a 1000 subscriber neighborhood) (400 subs * 3 megabits/second each = 1.2 gigabits/second plus overhead). It is clear that there will be many controllers involved in a middle switching layer. The irony is that the architecture described is exactly that of a massively parallel processor.

Video on demand is not the only possible application for high speed interactive digital transport on cable.
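The segment sizing above can be retraced numerically (a sketch with the text's figures; the report rounds 213 down to 210 users, 157.5 MB down to "about 150 megabytes", and 1080 up to "around 1100" controllers):

```python
# Retracing the memory-segment sizing from the figures in the text.
segment_bits = 80 * 8_000_000            # 80 MB/s segment bandwidth, in bits/s
users_max = segment_bits // 3_000_000    # viewers at 3 Mb/s each
print(users_max)                         # prints 213 (rounded to 210 in the text)

chunk_seconds = 210 * 2                  # 210 points in time, 2 s apart
chunk_bytes = chunk_seconds * 3_000_000 // 8
print(chunk_bytes)                       # prints 157500000 -> "about 150 megabytes"

resident = int(0.20 * 300)               # 60 movies pinned in RAM all the time
controllers = resident * 2_700_000_000 // 150_000_000
print(controllers)                       # prints 1080 -> "around 1100 controllers"
```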
Using MPPs to accomplish the initial goal of VOD leaves the entire realm of future applications open, since MPPs are general purpose machines. But no application I am aware of better fits the architecture of an MPP than ours -- it’s as though the machines were designed for our application. Further, MPP manufacturers are hungry for applications, with Grand Challenge Science experiments having been their primary target so far. No manufacturer I have spoken with (Intel, nCUBE, Thinking Machines, Kendall Square, Teradata) had thought of VOD as a possible application -- and they were all interested and excited, since it could double the demand for the class of machines they manufacture.

Pursuing Video on Demand using MPP technology gives us several advantages: First, although MPP technology seems ideally suited to VOD, it is not limited to VOD, and paves the way to retail services, interactive multimedia, and wide area virtual reality for education and entertainment. Second, MPP manufacturers are eager to push ahead with our project, which will give us several vendors to choose from for each new system, rather than being compelled to deal with one vendor and a “one size fits all” solution.

Squeaks:

There are several squeaky places in the scenario we are examining. First, recently announced modulation techniques allow up to sixteen bits per cycle to be modulated on a carrier. With our worst case of 1.5 gigabits per second per neighborhood, we may fill as little as 100 megahertz of our new spectrum. And a 1.5 gigabit per second rate is more than most equipment can cope with, meaning greater expense on both the sending and receiving ends. With 200 subscriber neighborhoods, the 40% worst case demand would translate to less than 300 megabits per second. Although this is an easier rate to cope with, it also implies that five times as many output ports are required on the source machine (up to 5000 for a million subscribers).
In any case of neighborhood size versus transmission rate, the internal data rate in the source machine remains the same. Smaller neighborhoods increase the switching complexity, but reduce the bandwidth required for each output. Current real machine limits range from 32 processors to 2000 processors, and from 32 to 128 megabytes per processor, although all manufacturers claim that they will be scaling up very rapidly. Everyone is working on ATM output channels, but no one is delivering yet. Our job is data shuffling at very high speeds, and there are no benchmarks to prove which machines will meet our demands in terms of guaranteed throughput. With 10,000 active subscribers, the throughput from our machine will have to be 30 gigabits per second; with 100,000, 300 Gb/sec. Signaling and administration will probably double that demand. Few manufacturers are planning for 500 to 5000 output ports (especially at 1.5 Gb/sec each).
Senior Vice President
Engineering & Technology
TIME WARNER CABLE
June 15, 1993
Mr. Steven W. Rose
P. O. Box A
Haiku, Hawaii 96708
Dear Mr. Rose:
Please consider this a letter of reference with regard to your computer consulting services. You were of great help as Time Warner formulated its approach to the Full Service Network, which is now under construction in Orlando, Florida.
Early in the planning stages of the Full Service Network, we faced fundamental questions about the practicality and architecture of very large video servers appropriate to provide video-on-demand service to a large cable system, potentially encompassing many hundreds of thousands of subscribers. At the time, the only proposal we had in front of us was for a very large RAM server, obviously an extremely expensive approach.
In addition to obtaining sufficient storage capacity, a terabyte or more, the fundamental issue was one of providing thousands of simultaneous accesses to the same memory space, with each access being under independent control in order to allow for VCR-like services.
You confirmed to us that the construction of very large video servers of the type needed was feasible, and that an interesting avenue of exploration might be massively parallel supercomputers. This was an innovative approach, and appears to have been news even to the manufacturers of such devices, which are generally devoted to scientific computation.
As you know, we recently announced that we will undertake the Full Service Network in a large pilot program in Orlando, Florida, initially serving 4,000 homes, but scalable to a much larger number. While we will not initially use the massively parallel supercomputer approach, that is an option which we may explore in the future as this project sees mass deployment. Your help in testing the feasibility of this project was much appreciated.
Please feel free to pass along this letter to other prospective clients. I would be happy to recommend your services to them.
Very truly yours,
Time Warner Cable 300 First Stamford Place Stamford CT 06902-6732 Tel 203.328.0615 800.950.2266 Fax 203.328.0690
A Division of Time Warner Entertainment Company, L.P.