FreshPatents.com Logo FreshPatents.com icons
Monitor Keywords Patent Organizer File a Provisional Patent Browse Inventors Browse Industry Browse Agents

4

views for this patent on FreshPatents.com
updated 05/24/2013


Inventor Store

    Free Services  

  • MONITOR KEYWORDS
  • Enter keywords & we'll notify you when a new patent matches your request (weekly update).

  • ORGANIZER
  • Save & organize patents so you can view them later.

  • RSS rss
  • Create custom RSS feeds. Track keywords without receiving email.

  • ARCHIVE
  • View the last few months of your Keyword emails.

  • COMPANY PATENTS
  • Patents sorted by company.

Layered multicast and fair bandwidth allocation and packet prioritization   

pdficondownload pdfimage preview


Abstract: Embodiments include an overlay multicast network. The overlay multicast network may provide a set of features to ensure reliable and timely arrival of multicast data. The embodiments include a congestion control system that may prioritize designated layers of data within a data stream over other layers of the same data stream. Each data stream transmitted over the network may be given an equal share of the bandwidth. Addressing in routing tables maintained by routers in the may utilize summarized addressing based on the difference in location of the router and destination address. Summarization levels may be adjusted to minimize travel distances for packets in the network. Data from high priority data stream layers may also be retransmitted upon request from a destination machine to ensure reliable delivery of data. ...


USPTO Applicaton #: #20090303997 - Class: 370390 (USPTO) - 12/10/09 - Class 370 
Related Terms: Bandwidth Allocation   Congest   Congestion   Destination Address   Destination Mac   Multicast   Overlay Multicast   Routing Table   
view organizer monitor keywords


The Patent Description & Claims data below is from USPTO Patent Application 20090303997, Layered multicast and fair bandwidth allocation and packet prioritization.

pdficondownload pdf

CROSS-REFERENCE TO RELATED APPLICATION

This application is a divisional of copending U.S. patent application Ser. No. 11/342,167, filed Jan. 26, 2006, which claims priority to U.S. Provisional Application No. 60/647,601, filed Jan. 26, 2005, which is incorporated by reference as if set forth in full herein.

FIELD OF THE INVENTION

The invention relates to network management. Specifically, the invention relates to the management of data packets to support multicasting.

BACKGROUND

Despite the versatility in digital communication, interoperability and internationally accepted communication protocols of the Internet, its fundamental design has not changed much since its conception and does not excel in everything. Watching live TV for example is something which is not typically done over the Internet, even though television has been around almost twice as long as the Internet Protocol and represents a huge market. The reasons for this are based on the design of the Internet and Internet Protocol (IP).

The Internet is a packet-switching network where data is exchanged in small units or packets that are independently transported over the network and concatenated again at the receiver into its original form. A strength of packet-switching is that it allows for very flexible use of the physical network wires. When two communicating parties have no data to exchange for a certain period of time, no packets are sent and the wires can carry packets from other parties. On the Internet, bandwidth is not reserved, but available to and shared by everyone. The consequence is that it cannot guarantee a minimum amount of end-to-end bandwidth, making live video streams often appear jerky because frames are skipped due to congestion that delays or prevents delivery.

Even though with help from specialized protocols such as Distance Vector Multicase Routing Protocol (DVMRP) or Protocol Independent Multicast (PIM), the Internet Protocol allows for data packets to be multicast to a large number of receivers simultaneously, using this feature to successfully realize a live video broadcast is a challenge. A video stream is transmitted at a fixed high rate and not all parts of the network are likely to have sufficient bandwidth available to forward the stream.

When a bandwidth bottleneck is reached, the router discards the packets that cannot immediately be forwarded. This causes two problems. The data stream that is eventually received by one or more receivers further down the network is corrupt and the congestion also has a negative impact on communication sessions of other nodes that communicate through the bottleneck router. The only way to avoid this problem using the Internet Protocol and standard multicast is to find a transmission rate that is supported by all parts of the network. However, since the network is available to anyone, this rate will continuously change. A transmission rate is selected and the packet loss is accepted. However, when packets are dropped randomly by overloaded routers the data stream will suffer packet loss. If additional packets are sent through the bottleneck router, there is a larger chance that the router will choose one of these packets when ready to send another packet, implicitly rewarding heavy streams during congestion. This encourages sending redundant data thereby exacerbating the problem.

A more fundamental problem of flow control using the Internet Protocol is that slowing down the data may not be an option for certain types of live data streams. However, packet loss is unavoidable using the Internet Protocol and while data types such as audio and video data can usually withstand some packet loss without becoming too corrupted to play, this does not apply to all types of live data. Real-time financial data, for example, will become useless and even dangerous to use if random packets of trades are lost.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this discussion are not necessarily to the same embodiment, and such references mean at least one.

FIG. 1 is a diagram of one embodiment of an overlay multicast network.

FIG. 2 is a diagram of one embodiment of the basic components of a router daemon in the overlay multicast system.

FIG. 3A is diagram of one example embodiment of a layered multicast system network divided into logical clusters.

FIG. 3B is a diagram of a routing table for the network of FIG. 3A.

FIG. 4 is a diagram of an example network with hierarchical structure.

FIG. 5A is a diagram demonstrating summarization of an example network with an S2 domain.

FIG. 5B is a diagram demonstrating the stretch factor for the example network of FIG. 5A and shows the inner network of the S2 domain.

FIG. 6 is a diagram of an example embodiment of a multicast distribution or “sink” tree.

FIG. 7 is a diagram of an overlay multicast network with a failed link.

FIG. 8A is a flowchart of one embodiment of a process or managing congestion in the overlay multicast system.

FIG. 8B is a flowchart of one embodiment of a process for handling layer repair.

FIG. 9A is a diagram of two router daemons connected by a link.

FIG. 9B is a diagram of two router daemons connected by a link where a retransmission is requested.

FIG. 10 is a diagram that shows the process of selective packet repair and ordering.

FIG. 11A is a diagram showing thinning of a stream of data.

FIG. 11B is a diagram of a multi-level token ring structure.

FIG. 11C is a diagram of a multi-level token ring structure showing thinning of data.

DETAILED DESCRIPTION

To provide multicast services a network needs to support one-to-many communication that can send data packets from a data source to more than one receiver, ideally without putting extra stress on the network or source when the number of receivers increases. Multicast routing can be offered by different methods. One method is to let receivers tell the network, but not necessarily the source, which data streams they want to receive and let the network compute data distribution paths that deliver just the right packets to each receiver. Multicasting can also be done by letting the source encode the list of receivers in each data packet, thereby freeing the network from the potentially computationally intensive task of maintaining multicast distribution paths. However, this method does not scale to handle a large number of receivers. A third method relies on logic at the receivers by letting the network apply a broadcast mechanism whereby each packet is delivered to every connected node and letting the receivers filter out only those packets that are interesting. This method may also generate a heavy load on a larger network but it is simple.

In one embodiment, a multicast network is constructed as an overlay network. In one embodiment, the overlay network includes a number of software implemented routers connected by normal IP or TCP connections. A mesh is created in which every software router is connected to one or more other software routers by means of virtual connections that appear to be direct connections to the other software routers, but are likely implemented by a number of intervening traditional TCP/IP routers situated between the software routers. Two routers that are connected this way, are adjacent in the perspective of the overlay, but in reality are many physical hops away from one another. Also, a software router that has three software router neighbors has three independent virtual links. However, it is possible that this software router only has a single physical network connection that is shared by the three virtual links.

To the underlying network, the overlay network is nothing more than a collection of applications that send data between static pairs. Beyond each router pair (identifiable by their TCP connection) there is no relation between the individual daemons. As such, the overlay network can easily work through firewalls, NAT (IP masquerading), proxies and VPN\'s. Firewalls cannot control which software router POPs can talk to each other. The system shares some principals with a HTTP proxy server tunneling traffic to and from web browsers. In an example HTTP proxy server system, an intranet has two web browser machines. They both browse the Internet using the proxy server also on the local intranet. Although the individual web browser machines can surf the net individually, they never make a direct connection with a remote webserver, but only with the local proxy server that acts like a relay point. As such, the firewall that sits between the proxy server and the remote webserver can only choose to either allow the proxy server to talk to the webserver or deny it. It is generally unable to enforce unique policies for individual browser machines. It generally cannot tell on behalf of which client the proxy server is fetching a webpage. Returning to the overlay network, because each software router is a relay point that tunnels data traffic for different senders and receivers similar to a HTTP proxy server, firewalls have no fine-grained control over communication over the overlay network. As soon as a firewall allows for only a single TCP connection between an internal and an external POP, all software routers connected to the internal one can talk to all POPs connected to the external one, and vice versa, without restriction.

In one embodiment, the overlay multicast routing system also manages flow control and timely delivery. Non interactive live data streams do not actively anticipate network congestion. To manage congestion the network manages the available bandwidth to allow for fair or equal division among the streams. Without management, high volume streams are assigned a larger capacity percentage on overloaded links resulting in little benefit in keeping the bandwidth requirements of a stream low, as the packet loss percentage is determined by how much the network is overloaded by all streams combined and not by the requirements of the individual streams. An alternative to letting the network handle the flow control and congestion is to put the responsibility at the source and receivers. However, letting data streams anticipate network conditions requires a form of feedback information from the network or the receivers. In this case it is beneficial that the amount of feedback does not grow linearly with the size of the audience, as that would reduce the scalability of the multicast. Even when a scalable form of feedback information can be realized and the data stream adapts its transmission rate according to the network conditions, the problem remains that live streams loose their value when they are slowed down and delivered late.

In one embodiment, an overlay multicast system also implements or manages delivery of packets including an option for guaranteed delivery. It would be ideal if every receiver would receive the data stream without loss or corruption. However, when the content is ‘live’ and cannot be slowed down, but the network has insufficient capacity, packet loss is difficult to avoid. In fact, even when there is sufficient bandwidth at all times, store-and-forward packet-switched networks are not able to guarantee the delivery of all packets. For example, when a router crashes, all packets in its buffers may be irrecoverably lost. If a network uses dynamic routing, packets are also dropped when routing policies change or packets are trapped in an occasional, temporary routing loop. In cases where packets are ‘accidentally’ lost, an end-to-end mechanism of retransmissions can be applied that can compensate for the loss. However, since this requires a form of feedback information, it is beneficial for reasons of scalability that the overhead involved with retransmissions is not linearly related to the size of the audience.

End-to-end retransmission feedback may be avoided in at least two ways. First, it is possible to let the network components keep a copy of the most recently forwarded packets and let them participate in retransmissions by intercepting the retransmission requests and servicing them locally. This approach often utilizes greater storage and increased processing power requirements at the network components.

The second alternative to end-to-end retransmission requests is that of encoding redundant information in the data packets. If enough redundancy is encoded, a lost packet\'s content may be entirely recovered from the extra information in the other packets. The downside of this system is that it comes with a constant level of bandwidth overhead that is related to the level of packet loss tolerance, regardless of whether packets are actually lost. Each of these approaches to packet loss fail in the case of a live data stream that produces more bytes than the network can forward. When local or end-to-end retransmission requests are used, the problem may even be increased as the retransmission requests use extra bandwidth, causing more data packets to be lost.

Embodiments of the overlay multicast system are designed to offer an end-to-end solution to efficiently multicast live data including stock market data to any machine connected to the Internet. The system is capable of giving certain hard guarantees over what is delivered to receivers. If packets must be dropped due to congestion or other irrecoverable problems, it is done in a fully deterministic way that does not corrupt the data. Where a receiver of data such as a viewer of a film may accept the random loss of one or two video frames, this type of data loss may wreak havoc in financial data when the missed packet contains an important trade. The system supports deterministically delivery designated parts of a data stream when the network lacks sufficient capacity. The system utilizes a layered multicast with data streams subdivided into individual streams or layers. This allows receivers to subscribe to only those layers that the network can handle, so that random packet loss can largely be avoided.

In another embodiment, an enhanced form of layered multicast is used that guarantees complete delivery for certain layers to avoid random loss altogether, making it suitable for certain types of critical data such as market data. The system is characterized as controlling two primary activities. The first activity is running and managing a robust and scalable overlay network that uses its own routing algorithms and supports multicast, while the second activity is managing flow control and congestion when live streams overload the network and ensuring layered multicast can be offered with guarantees.

FIG. 1 is a diagram of one example embodiment of an overlay multicast system. In one embodiment, the system includes an overlay network 100 that includes a number of software router daemons 109, 115, that are interconnected by normal TCP connections 119 or similar reliable communication connections. The overlay multicast system forms an intricate web of routers and virtual links similar to the Internet, but operating on the application layer. This system operating at the application layer ‘overlays’ the physical network and other layers of the Internet providing its own system of routing, controlling the lower levels of the network. Any number or routers and client applications may be a part of the system. In the example, two of the routers 109, 115 are in communication with local client applications 101, 111.

In one embodiment, each of the routers have a unique, string-based addresses 117. In another embodiment, the routers may have other types of unique addresses such as numerical addresses. Each of the router daemons 109, 115 executes a routing algorithm that computes the shortest paths that allows each router to send data packets to any other router in the overlay multicast system as well as any machine in communication with these routers.

In one embodiment, the system includes a runtime client library 103, 113, application programming interface (API) or similar system for sending and receiving packets that can be utilized by user applications to communicate over the system. This library connects an application to a nearby system router daemon 109, 115 through a TCP connection 107, 121, native inter-process communication or similar communication method. The native inter-process communication method may be used if both the router and the client application run on the same physical machine. When connected to a router daemon, the client application 101, 111 can send and receive data packets from the network under the router\'s unique address. Router daemons have a unique address, while user applications connected to a router daemon are identified through logical ports.

In one embodiment, the topology of the overlay network may be configured at router daemon startup. The topology may be relatively static after configuration. In one embodiment, the relationship of router daemons to applications may be one to many, with a single router daemon serving multiple applications. In another embodiment, the relationship may be many to many.

In one embodiment, the data packets in the overlay network may be addressed to a single communication endpoint or to abstract multicast addresses. Single communication endpoints are network address used by user applications. Access to them is exclusive such that only one user application can use them for sending and receiving packets at a time.

In one embodiment, a unique network address in the overlay network that is used by an application is the combination of a logical node address assigned to the overlay router and the port name chosen by the application. If an application that is connected to a overlay router with the logical node address “n1.msx.1” wants to use a port named “myport” for receiving unicast data packets, the fully qualified network address becomes “n1.mxs.1.:myport.” Other applications connected to the overlay network that want to send packets to this application can use this as a destination address. Ports may be referred to as tports. Overlay network addresses must be bound, prior to sending or receiving packets. When a data packet is sent from an endpoint address it will contain this information as its source, allowing other routers or the receiving application to send a response.

In one embodiment, the overlay network may support packets being sent to abstract multipoint destinations, or multicast addresses. An overlay network multicast address is a destination that each machine on the network may subscribe to. The overlay network software routers ensure that a copy of each data packet published to this multicast address is delivered to every user application that subscribed to it. In one embodiment, multipoint destinations are single sourced. Only one application can publish data packets to the multicast address, making the overlay network suitable for one to many, but not many to many communications. Because a tport session of a multicast address can be freely chosen, each multicast address explicitly contains the location of the source on the overlay network. This makes multicast communication less flexible because publishing is not anonymous, but it greatly simplifies subscription management. In another embodiment, multiple applications may publish to a multicast address and many to many communication is supported.

In one embodiment, both unicast and multicast addresses are syntactically equal. For example, the address “n1.mxs.1:timeserice” could be a unicast address used and bound by an application that provides the local date and time in response to any data packet it receives. However, it could also be a multicast address that anyone can subscribe to. For example, subscription to the address may provide periodic date and time broadcasts from the user application that bound this address as a multicast address. In one embodiment, each packet may contain a flag or similar indicator that indicates whether its destination address should be interpreted as a unicast or multicast address.

In one embodiment, the overlay network provides a set of higher layered protocols as services over the packet oriented base layer. These services may be referred to as protocol endpoints. Any number of protocol endpoints may be defined and supported. In one embodiment, five protocol endpoints may be supported. A unreliable unicast protocol (UUP), a reliable unicast protocol (RUP), unreliable multicast protocol (UMP), ordered layered multicast protocol (OLMP) and reliable ordered layered multicast protocol (ROLMP) may be supported. The UUP offers a best effort unicast datagram service to applications. The RUP offers reliable unicast communication between peers. The UMP offers best effort multicast datagram services to applications on the overlay network. The OLMP offers multicast communication with receiver driven rate control. Complete delivery is not guaranteed, but the packets that are received are guaranteed to be in their original order. The ROLMP offers reliable multicast communication with receiver driven rate control. Stream layering allows each subscriber to receive the data stream in the highest possible quality, while the source never has to slow down.

FIG. 2 is a diagram of one embodiment of the basic components of a router daemon in the overlay multicast system. Each overlay network router 109 includes a packet switching core or kernel 201. In one embodiment, packets that are received, either from a connection to a neighbor router or from a connected application, pass through the kernel 201. The kernel 201 may not handled some specialized control packets. The task of the kernel 201 is to inspect the destination of each packet and use a set of routing tables to determine how to forward the packet.

In one embodiment, the kernel 201 forwards packets to neighbors through a set of interface modules 203, 205. In one embodiment, the kernel 201 may execute on its own thread and be event driven. The kernel 201 remains idle until it is notified by an interface module 203, 205 of an incoming packet or by an application 227 that is sending a packet. The kernel thread is woken up, reads the packet from the interface module or application and processes it. If the kernel decides that the packet must be sent out through an interface, it passes it to that interface and waits for the next notification.

In one embodiment, each router daemon in the network is connected to one or more neighbors. This is done by establishing connections between the routers. In one embodiment, the connections may be long lived TCP connections or similar communication connections. In one embodiment, a router daemon 109 runs one interface module 203, 205 instance for each configured neighbor or configured communication channel. In another embodiment, the router 109 may run multiple interface modules for a configured neighbor or communication channel or a single interface module for multiple configured neighbors or communication channels. The responsibility of an interface module 203, 205 is to establish a connection, e.g., a TCP connection, and to pass packets from the kernel 201 to the connection and vice versa. Packets of the first kind are referred to as outbound packets, while the latter are inbound packets.

In one embodiment, the kernel 201 maintains a unicast routing table that is used for packet switching. To make it possible for the network to find shortest paths as well as adjusting these paths when the underlying physical network\'s characteristics change, each interface module may measure the quality of its virtual connection. These measurements are passed on to the kernel 201 when the link quality is found to have changed. Inside the kernel 201, the measurements may be fed to a routing algorithm to determine if the changed link quality alters any of the entries in the routing table. If the routing table is changed, it is advertised to the neighbors by encoding the entire table or a portion of the table in a data packet and passing it to each interface. In one embodiment, if this type of packet is received from a neighbor and propagated to the kernel 201 through the receiving interface, the kernel inspects the message and analyzes it through the routing algorithm. If the new information leads to changes in routing entries, the router sends its own related routing entries to all its neighbors.

In one embodiment, if a neighbor router crashes, the interface detects this through an error on the virtual link and passes an indicator to the routing algorithm. The routing algorithm then changes the cost indication associated with the link to an infinite value or similar value indicating that the link should not be utilized. In one embodiment, the kernel 201 does not distinguish between a crashed neighbor and an infinitely slow link. The kernel 201 only detects that the link is not to be utilized when reading the entry in the routing table.

In one example, when a source application connected to router Swants to publish a live data stream to multicast group s:mygroup, where applications on node p and q want to receive it, the source may first bind the group in publish-mode. Binding a multicast group in publish-mode means that the group is owned by the binding process. Only the owner of the group is able to publish to it. The receiving or “sink” applications connected to routers p and q now bind group s:mygroup in subscribe-mode. Binding a multicast group in subscribe-mode results in the router node being connected to a virtual multicast distribution tree. The subscribers receive a copy of any packet published by the source.

In one embodiment, multicast groups do not necessarily need to be bound in publish-mode first. Any application may subscribe to any multicast group at any time. If there is no source, there will be no data to receive. Binding a multicast group either in publish-mode or subscribe-mode are distinct operations. When the source also wants to receive a copy of its own stream, it binds its group in subscribe-mode and uses the resulting tsocket to read the data back. Data packets carry both the address of their source application, as well as the network address of their destination.

For unicast packets, the router uses its unicast routing table to find the preferred next hop, while multicast packets are routed according to a subscription list in a multicast subscription table. In one embodiment, a unicast data packet contains the node address of the source router, the tport of the source application, the address of the destination router and the tport of the destination application. A multicast data packet contains the multicast group it was published to, represented by the node address of the source router and the group identifier that was bound by the source application. It does not contain any other addressing information.

In one embodiment, the overlay multicast system determines unicast routing tables and multicast subscription tables for each router. The system utilizes any type of routing algorithms or protocols to determine routing tables. Protocols that are utilized include distance-vector protocols and link-state protocols. Link-state protocols take the relatively simple approach of propagating the state of the entire network as a list of all known links in the network with their cost. Distance-vector protocols are based on the Bellman-Ford protocol. They work by letting every router periodically advertise its own routing table to its neighbors. In one embodiment, the Extended Bellman-Ford protocol, hereafter referred to as ExBF, is used as the basis of the overlay multicast network. For sake of convenience, the embodiments of overlay multicast system are described that utilize the ExBF, however, those of ordinary skill in the art would understand that other routing algorithms may also be utilized.

In the ExBF protocol, a slight increase in the amount of information is kept for every destination. Instead of storing just the distance to each destination for every neighbor, ExBF also stores the address of the pre-final node of each path. Hence, instead of storing the collection of distances {Di(j)} where j represents a destination, D the distance between this node and j, while i ranges over the node\'s neighbors, each router also stores {Pi(j)} where P is the pre-final, or so-called ‘head-of-path’ node in the shortest path to destination j. Now because the router knows the pre-final node of every destination, it can backtrack the full path to any destination by recursively looking at the head-of-path of a destination and treating it as a new destination.

In one embodiment, the interface modules 203, 205 implement a simple algorithm to establish the communication connection to the neighbor machine or device. The interface module 203, 205 attempts to connect to the configured neighbor router by actively trying to connect to its network address, e.g., its TCP/IP network address. If the connection is aborted with a connection refused or other error, it is assumed that the neighbor is not yet running and the interface module 203, 205 starts to listen on its configured IP port so that the neighbor can connect to it when it is started up. The interface module 203, 205 waits for an incoming connection request for a brief period of time. After this time period expires, the interface module 203, 205 returns to actively connecting to its neighbor. To avoid a situation where both neighbors continue to switch between the listen and active states at the same time, the duration of the listening state is influenced by a random factor. An advantage of allowing each neighbor to switch between the active connection and listening states when establishing a connection is that it allows the connection of routers even if one of the routers is on a masqueraded IP network. A router on a masqueraded IP network is unable to accept incoming connections such as TCP connections.

In one embodiment, interface modules 203, 205 can be configured at router deployment through configuration files. This will make the router daemon automatically create the specified interfaces at startup. If the configured neighbor routers are online, all long-lived connections will automatically be established. It is also possible to add new neighbor connections and interface module instances dynamically at runtime. This way the overlay network topology can be changed flexibly and new routers can be added to the network. If an interface module 203, 205 manages to establish a connection with a neighbor router, the interface modules of the routers exchange overlay network node address information to inform each other of their presence. When the interface module 203, 205 receives the node address of its neighbor, it passes this information, together with an announcement that a connection has been made to the kernel 201. This information allows the kernel routing algorithms to build dynamic routing tables.

In one embodiment, a role of the interface modules 203, 205 is to establish the connection with a configured neighbor router and to send data packets from the router kernel 201 to the neighbor connection and vice versa. The interface module 203, 205 incorporates a framework that allows custom software plug-ins to influence the stream of packets that flows between the network and the kernel 201. This mechanism is referred to as the interceptor pipeline 251, 257. In one embodiment, each software plug in component that needs to control the packet flow is a class that implements a simple programming interface. In another embodiment, the software plug ins may have any implementation structure including objected oriented structures. This interface allows interceptor instances to be chained together, forming a packet processing pipeline. The contract of an interceptor is that it receives a packet, applies it operations and then passes the modified packet to the next interceptor in the chain.

In one embodiment, an interceptor pipeline 251, 257 sits between the router switching core and the network connection. When the router kernel 201 delivers a packet to the interface module 203, 205 for transmission to the neighbor router, the interface module runs the packet through the interceptor pipeline, giving the interceptors the chance to modify the packet. Each packet that comes out of the pipeline is transmitted to the neighbor router.

In one embodiment, each interface module 203, 205 has two interceptor pipelines 251, 257. The first interceptor pipeline 257 is used to process outbound packets. The second interceptor pipeline 251 is used for inbound packets. These pipelines are independent of one another, the ordering of the interceptors and the number of processing steps can be different for inbound and outbound packets. Each pipeline can be configured uniquely. Interceptor pipelines may have any number of interceptors, provided they do not add too much latency.

In one embodiment, an example of an interceptor is one that filters packets from a specific overlay network router. When this interceptor is implemented as a manageable component that can be configured dynamically at runtime, it can be used to implement basic firewall functionality on the overlay network. If it receives a packet that matches its rejection pattern, it discards the packet by not passing it to the next interceptor in the pipeline. Another type of interceptor that may be used is a traffic monitor that counts the size of each packet that passes by and uses this to log traffic activity and compute bandwidth statistics. This plug in mechanism allows an overlay network router to be extended with additional functionality without modifications to the underlying software.

In one embodiment, the interceptor pipelines 251, 257 act as a packet buffer between the router kernel 201 and the network. The interface modules 203, 205 temporarily buffer inbound and outbound packets. This ensures that the kernel 201 is not blocked when sending a packet to an interface. In one embodiment, the interface modules have a separate thread or set of threads that continuously dequeue and serialize packets from the buffer and write them to the communication connection or enqueue into a buffer received packets.

In one embodiment, the interceptor pipeline provides a temporary packet buffer and offers inter-thread communication between the kernel thread and the interface thread. Interceptors may be divided into two categories: interceptors that block and interceptors that return immediately. The first category is referred to as blocking or synchronous interceptors. In one embodiment, to avoid situations where a router kernel 201 is blocked for an arbitrarily long time, an interceptor pipeline may contain one non-blocking interceptor. A non-blocking interceptor guarantees immediate return of control to a caller by storing the packet in an internal buffer. The packets in the buffer may be discarded if the buffer exceeds a certain threshold size.

In one embodiment, a maximum size limit is placed on the packet buffers to prevent them from exhausting the router\'s memory. Storing packets before processing them means that their delivery will be delayed. The larger the buffer gets, the longer the packets are delayed. Because of this, the interceptor drops packets when the buffer reaches its maximum size. In one embodiment, the system uses reliable TCP or similar connections for transmitting packets between routers in which packets are only dropped inside the buffer interceptors of the interface modules.

In one embodiment, the interceptor plays a role in making packet loss deterministic. Packets are not explicitly dropped in any other part of the layered multicast system network, except the buffer interceptor in the interface pipelines. However, that may not guarantee that a packet that successfully makes it through all buffer interceptors of the network\'s routers is delivered at its destination. Aside from controllable packet loss, the network may also occasionally lose packets in a non-deterministic way, for example when a router crashes with pending packets in its buffers, or when a connection between adjacent routers is unexpectedly closed during transmission of a packet.

In one example, an inbound interceptor pipeline 251 may be structured such that traffic throughput monitor interceptors 253 are positioned after a buffer interceptor 255 and a firewall interceptor 265 may be positioned before for the buffer interceptor 255. In an example outbound interceptor pipeline 257, a throughput limiting interceptor 259 may be positioned before a buffer interceptor 261 and a traffic monitor interceptor 267 may be positioned after the buffer interceptor 261.

In one embodiment, the overlay network is accessible to applications through the use of the client-side programming library 221. This library connects to a router daemon 109 at application startup and communicates with it using remote procedure protocol or similar communication protocols. The communication between the library 221 and the router daemon 109 is driven by the application 227. The application 227 invokes functions in the router 109 through the library 221. In one embodiment, when the kernel 201 receives packets addressed to a tport that is bound by the application 227, it stores them until the client actively picks them up. The application 227 through the library 221 continuously polls the router for packets. To minimize the overhead of the polling mechanism, the router poll function does not return until there is at least one packet delivered by the kernel 201. If more than one packet is waiting, the poll function returns all waiting packets at the time it is invoked. In another embodiment, the router may send an indication such as an invoking a marshaled stub from the client, event notification or similar indicator to the application 227 through the library 221 to indicate the reception of a data packet for the application 227.

In one embodiment, a user application 227 uses a client library 221 through instances of the overlay multicast system communication sockets 225. If a user wants to be able to receive packets, they reserve a network address. In one embodiment, a network address in the overlay multicast system is represented as a combination of the address of the router and a unique port identifier, reserved for the socket of the router. For sake of convenience, a port in the layered multicast system will be referred to as a tport and a socket as a tsocket. When a user application creates a tsocket for receiving normal unicast packets, the tsocket automatically binds a local tport at the router daemon through a remote call to the client IO multiplexor 207 at the router. In one embodiment tports are bound exclusively and other clients may not use it concurrently.

In one example embodiment, communication sockets 225 communicate with a local input output (IO) multiplexor 229 that coordinates the handling of communication between the sockets and the router kernel through the formation of RPC calls, native inter-process communication or similar systems. A local IO multiplexor 229 utilizes an RPC stub 223 or similar program to communicate with the router via an RPC skeleton 219 and client adaptor 209. A client IO multiplexor 207 at the router manages the relay of these socket requests to the kernel 201.

In one embodiment, the router daemon 109 uses a packet buffer to temporarily store packets for each connected client application until they are picked up. An interceptor pipeline in the client adaptor 209 or similar process may be utilized for this buffering function.

To overcome the problem of growing routing tables, computation time and excessive advertisement overhead, large networks can be partitioned into smaller sub sections, connected by gateways. While the gateway routers maintain routing information necessary to route packets to nodes in other network sections, nodes inside a section or domain only maintain information for those nodes inside the same domain. By substituting logical ranges of hosts in the routing table by one single condensed entry, the size of the routing table is reduced. This is called address summarization. This mechanism introduces a form of hierarchy that allows the network as a whole to grow far beyond the practical limits of standard distance-vector or link-state algorithms. The farther a destination host is away, the more efficient it can be condensed together with other remote hosts. The more summarization is applied, though, the less efficient the paths become on average. The factor by which the actual data paths on summarized networks differ from the optimal paths, is known as the stretch-factor: the maximum ratio between the length of a route computed by the routing algorithm and that of a shortest path connecting the same pair of nodes.

In one embodiment, the overlay multicast system takes a relatively straightforward approach to address summarization. An administrator decides at deploy time which nodes form clusters and which clusters form aggregations of clusters. This is done by encoding hierarchy in the layered multicast system node addresses using a dotted or similar notation. Node addresses may be ASCII strings. In one embodiment, the strings are limited to at most 127 characters. In one embodiment, only [a-z] and [0-9] are available. In another embodiment any characters, numbers of similar symbols may be utilized. Addressing may be case-sensitive or case-insensitive.

FIG. 3A is diagram of one example embodiment of a layered multicast system network divided into logical clusters. The example illustrates a network of eight nodes, divided into three clusters. Assigning nodes to clusters may be based on geographical properties, administrative boundaries, wide area links and similar considerations. For example, nodes inside a corporate network are all assigned the same logical domain, whereas a network that connects nodes from different corporations, would usually assign a separate domain to each corporate network. Another criterion for assigning nodes is that nodes that often disconnect and reconnect again later, possibly because the overlay multicast system routers run on personal computers, are placed in a subdomain, to avoid the routing updates triggered by their changing state to propagate far into the network.

Given the logical domains that cluster groups of nearby nodes, each node can treat domains other than its own as a single entity and use a wildcard address that matches every host inside that domain. This reduces the size of the routing table, as well as the amount of routing information that needs to be exchanged between the nodes when the topology changes inside a domain. For example, when a new host is added to domain S1, there is no need to propagate that information to the other domains, as they already have a wildcard entry that will match the new address.

FIG. 3B is a diagram of a routing table for the network of FIG. 3A. The example routing table shows the effect of address summarization in this network on the routing table of node S2.C. The cost value that is listed in the third column represents the cost of the shortest path to the nearest node inside that domain. In the fourth routing entry, the number in the cost column is the cost to reach S1.R from S2.C. If it is assumed that both interdomain links S2.C-S1.R and S2.A-S1.R have the same weight or cost, S2.C will route all traffic for domain S1 through neighbor S1.R.

In one embodiment, when nodes exchange distance vectors, either because of a link change, or as a normal, periodic exchange, the receiving node first summarizes the destination address of each distance vector relative to its own address. Summarization is done by matching the destination address with its own address field-by-field and when a field does not match, the rest of the address is substituted by a wildcard. For example, when S2.C receives a distance-vector from S1.R that contains a path to destination S1.P, it is immediately summarized to S1.* upon arrival at node S2.C. This is because the first field differs from the first field of the local address and as such the remaining fields are replaced by a wildcard. This wildcard value is then fed to the distance vector algorithm that checks whether the new cost or path-length is shorter than the cost of the entry that was already in the routing table. In the present example there already is an S1.* wildcard entry in the routing table that was derived from neighbor destination S1.R. Since the path to S1.P runs through S1.R, the path to S1.R will always be shorter than the path to S1.P, so the entry in the routing table will not be changed and no further routing updates will be propagated to S2.C\'s neighbors.

In general, when a local address is 1.2.3.4 and an incoming distance-vector advertises destination 1.2.2.2, it will be summarized to 1.2.2.*. Destination 2.1.6 becomes 2.*, destination 1.2.3.4.5.6 becomes 1.2.3.4.5.*, destination 1.2.3.5 stays 1.2.3.5 and destination 1.2.3.4.5 also stays 1.2.3.4.5. Note that 1.2.3 and 1.2.3.* are two different addresses. The first only matches the exact address 1.2.3 while the second is a wildcard that matches everything that starts with 1.2.3 and has at least 4 address fields. This includes 1.2.3.4 and 1.2.3.4.5.6, but not 1.2.3. If this mechanism of address summarization is used in an ExBF implementation that carries tuples containing destination, cost and head-of-path attributes in its vectors, then the destination address is summarized, as well as the head-of-path address.

FIG. 4 is a diagram of an example network with a hierarchical structure. When the example network first starts to converge using the ExBF algorithm, node B.X advertises the following routing information to neighbor A.S: DVB.X.A.S: {(B.X, *, 0), (B.Y, B.X, 1), (B.Z, B.X, 1)}. When A.S receives the vectors, it summarizes its entries. What remains is the single vector DVBX.AS: {(B.*, *, 0)}. The routing table of node A.S now contains RTAS: {(A.S, *, *, 0), (A.R, A.R, A.S, 1) (B.*, B.*, A.S, 1)}. All addresses are summarized before processing. This includes neighbor addresses. The consequence of this is that when a node has more than one connection with a foreign domain, both neighbor addresses will be summarized into the same wildcard. This leads to ambiguities and non-deterministic routing when this wildcard is listed as the preferred hop in a routing entry, as it can not identify a single outgoing link. This problem is solved by assigning a local identifier to each link and using these numbers in the preferred hop column, rather than the addresses of the links peers. The routing update A.S sends to A.S.Y contains DVAs,A.S.Y: {(A.S, *, 0), (A.R, A.S, 1), (B.*, A.S, 1)} and leads to A.S.Y\'s routing table RTA.S.Y: {(A.S.Y, *, *, 0), (A.S.X, A.S.X, A.S.Y, 1), (A.S.Z, A.S.Z, A.S.Y, 1), (A.S, A.S, A.S.Y, 1), (A.R, A.S, A.S, 2), (B.*, A.S, A.S, 2)}. When A.S.Y has finished updating its routing table, it advertises DVA.Y.A.S: {(A.S.Y, *, 0), (A.S.X, A.S.Y, 1), (A.S.Z, A.S.Y, 1), (B.*, *, *), (A.S, *, *), (A.R, *, *)} back to neighbor A.S where the asterisk indicates unreachable in the last three records to avoid long-lived loops. These loops were detected through the normal back trace mechanism of ExBF.

In one embodiment, summarization means that only a single path for a range of remote nodes is maintained. The consequence of this is that packets will not always be routed according to the real shortest path between source and destination. To illustrate this effect in the multicast network, consider the example network of FIG. 4. Node S1.R receives the distance-vectors of both S2.A and S2.C. And after summarization learns that both neighbors offer a route for wildcard S2.*. If it is assumed that both interdomain links (S1.R-S2.A and S1.R-S2.C) have equals costs, then S1.R will choose S2.A to be the preferred hop for S2.* because of the fact that its address logically comes first. When S1.R needs to forward a data packet to S2.B, it uses the S2.* wildcard entry and forwards the packet to neighbor S2.A. Unfortunately this is not the shortest path to S2.B, as S2.A first has to route the packet through S2.C. Instead, S1.R should have sent the packet directly through neighbor S2.C. This ratio between the length of the actual path and the length of the optimal path between the endpoints are called the stretch factor.

In one embodiment, path stretching on the overlay multicast system occurs when packets are forwarded between nodes that are in different logical domains. An entire subnet is treated as a single node with several outgoing links. Because a virtual node often contains a large number of nodes, connected by its own internal network structure, it is sometimes better to choose a different interdomain link when sending packets to the domain.

Since the overlay network runs its own adaptive routing algorithms, the content streams through the network are constantly rerouted to avoid the network\'s hot spots and congestion. This can be particularly useful on wide area networks that are used for very different types of applications and hot spots are dynamic (i.e., moving around). Another advantage of having custom routing algorithms is the freedom of substituting them with others in the future.

In one embodiment, the software routers implement load-balancing inside the network. Traditionally, routing algorithms seek for “best paths” through the network and send datastreams over these. However, it is sometimes much more desirable not to just send streams over this optimal path, but to also select several sub-optimal paths and divide the stream over all of them. This also avoids the optimal path from getting congested when the stream requires more bandwidth than this single path can provide.

FIG. 5A is a diagram demonstrating summarization of an example network with an S2 domain. When node S1.P needs to forward a packet for destination S2.F, it will send the packet directly down its interdomain link to the S2 domain. In this example both interdomain links are assumed to have equal costs. FIG. 5B is a diagram demonstrating the stretch factor for the example network of FIG. 5A and shows the inner network of the S2 domain. FIG. 5B clearly shows that the stretch-factor is quite high for packets from S1.P to S2.F, as the optimal path runs through S1.R instead. Although the overlay multicast system address summarization technique can not guarantee a maximum upper bound on the stretch-factor, it can manipulate the stretch-factor by changing the summarization opacity.

By default, an address is summarized after the first field that differs from the local address. However, if that is changed to the second field, the overlay multicast system can look inside other domains for one level. A node with an address 1.2.3.4 will then summarize 2.3.4.5 into 2.3.*, rather than 2.* and 1.3.4.5 into 1.3.4.*, rather than 1.3.*. Doing this at least at the border nodes in the overlay multicast network that have the interdomain links reduces the stretch-factor under certain circumstances. As the overlay multicast network was designed to be administrated by independent parties, administrators are free to experiment with different summarization opacity levels without jeopardizing the other subnets or domains.

One advantage of the summarization of addresses in routing tables is that the number of links that are present between nodes from different logical domains is irrelevant with respect to the number of routing entries in the routing table of a distance-vector protocol or similar protocols. Since these tables only contain destinations with a single forwarding strategy, the number of interdomain links does affect the size of the routing tables. If y is identified to be the number of entries in a node\'s routing table, x to be the total number of nodes in the entire network, n to be the number of entities (nodes or nested domains) inside a domain and m to be the depth of the hierarchy. From this it follows that the total number of nodes in the network can be calculated with:

∀x=nm where nε|N̂n>1 and mε=|N̂m>0

The relation between domain depth, number of nodes and domain density can be expressed by the following formula for a network with a uniform topology:

y = ( ln  ( x ) ln  ( n ) ) * ( n - 1 ) + 1

Given this formula of the address summarization\'s effectiveness, it can be shown that routing entries in any network node remains under 100 even when the network as a whole grows to well over 10 million nodes. It can also be shown that the domain size n has no spectacular effect on the scalability. While a small value n yields a relatively deeply nested network hierarchy, which implies more routing entries, it also means that each hierarchy level only contains a small number of entities. A large value n in a network of the same size yields a relatively flat hierarchy with few levels, each level contains a large number of entities. As a routing table can only contain a natural number of entries, its number remains constant as nodes are added to domains that are already captured by the existing wildcard entries.

An advantage of using a string based node addresses scheme is its large address space and flexible naming. New subdomains may be added at any time as there is no need for predefined or fixed hierarchical levels. All nodes inside a flat domain can be reconfigured into nested subdomains without the need to reconfigure routers or compute netmasks. Also, as address fields are not bound to a limited number of characters, they contain location or company names to make administration of the network easier and more transparent. An example might be the address n1.amsterdam.level3.marketxs.node4, or simply n1.ams.13.mxs.n3 to keep the address relatively short. The drawback of these kinds of unrestricted addresses is the amount of bytes they require. Because every data packet always contains a source and a destination address, the amount of overhead per packet is excessive at times.

In one embodiment, the string based node addresses scheme is implemented by letting the first bytes of each address header specify the length of the address in bytes, followed by the address itself as a normal 8 bit ASCII string with a maximum of, for example, 256 bytes. In one alternative, null-terminated strings are used.

In another embodiment, a sufficiently large address space is used and divided it into logical subsets, similar to IPv4 and IPv6 addresses. In one embodiment, a 64 bit address space is used and divided into 8 bit fields (identical to IPv6) used to identify logical subnets. In this embodiment, addresses might look like 120-23-61-201-43-146-128-132. To make distinct addresses, IPv4 uses the dot “.” for separation, IPv6 uses the colon “:” and the present system uses a hyphen “-”. When each address field represents exactly one logical subnet level, this scheme provides a simple, but at the same time somewhat limited, way of addressing. It means that the complete network can address as many individual hosts as IPv6, while grouping them in at most 8 levels of sub domains, where each sub domain can contain a maximum of 256 hosts or sub domains. The sub domains can be utilized to indicate a network hierarchy in the topology, including a geographical or similarly based hierarchy.

In one embodiment, to reduce the amount of bandwidth wasted on this overhead, a mechanism is used for substituting the address strings for small 16 bit values or similar small sized value when transmitting a packet to a neighbor. Also, a translation table is kept at both sides of the connection that is used to put the original address string back in the packet before passing it to the router\'s packet switching kernel. This optimization works on a per-link basis and may be totally independent from the communication with other neighbors. When it is used on more than one interface, each interface maintains its own translation table.

If a packet is transmitted to a neighbor, all address strings are stripped from the packet header. For each address string a 16 bit or similar value is generated and stored in the translation table. These 16 bits or similar values replace the original addresses and before the packet is transmitted, its protocol version field in the first header of its wire-level representation is changed to a higher version. This version number is reserved for optimized packets that would be unreadable for routers without support for address substitution. To ensure proper translation by the peer, the interface first sends a special packet that contains the new address substitutions. Such a packet is tagged with the higher protocol version number and contains one or more tuples of address strings and their substitute values. In one embodiment, such a packet takes the form:

Type 0x1 is used for issuing new substitutions and type 0x2 is used to invalidate an earlier or unknown substitution. The system attempts to ensure that this (or similar) packet is received by the peer prior to the data packet itself, which is something that cannot be guaranteed by every transport layer. TCP or similar guaranteed communication method may be used for communication between neighbor nodes in the overlay multicast network.

In one embodiment, configuring an interface for address substitution can be done manually, or automatically. In the latter case the interface uses a special handshake packet that is part of the higher protocol version. In one embodiment, the handshake packet takes the form:

The version field has the higher protocol version. The type for the handshake packet is 0x0 and the length is 4. A peer transmits such a (or similar) packet when the connection is first established. When the packet is echoed back by the peer, the interface knows that the neighbor also supports substitution and goes into substitution mode. After a connection with a neighbor has been lost and is re-established, the handshake is performed again as the peer could have been replaced by software that does not support substitution.

In one embodiment, entries in the translation table all have a time-out or similar tracking mechanism to ensure timelines and accuracy of entries. After this period the entry is removed from the table and any packet still using the substituted value will not be processed and cause the receiving node to respond with a special packet containing the substituted values that weren\'t recognized, allowing the peer to synchronize translation tables by sending a packet containing the substitution tuples. Since the invalidation of timed-out entries is automatic, both peers use the same time-out or similar values in tracking mechanisms. In one embodiment, when the translation table is full, entries are removed according to a last recently used scheme. The address substitution is implemented in a substitution interceptor that is the lowest interceptor or a part of the lowest interceptor or other interceptor reconstructing packets immediately on arrival in an interface module of a router. Substitution introduces additional processing overhead and enabling it is optional for each node.

In one embodiment, the overlay multicast system may be used solely for routing. The overlay multicast system offers multicast support over existing IP networks. The overlay multicast system goes beyond offering plain multicast functionality and additionally focuses on supporting live multicast streams with limited jitter and delay through the combination of overlay multicast, multicast congestion control and layered multicast.

In the overlay multicast system the ExBF protocol or similar routing algorithm implicitly contains information on which links in the network are child links. For example, this is a by-product of the long-lived loop avoidance mechanism in ExBF. In ExBF, a node will inform its neighbors about the fastest route it offers to a destination S, except the neighbor that is in the path towards S. Instead, an infinite distance is advertised to this neighbor. This information is used by a router to conclude that each neighbor advertising that node S is unreachable, is actually using it as their parent node towards S and hence will immediately mark those links as child links in the shortest path tree rooted at S.

Neighbors that do advertise reachable routes to destination S are not flagged as child links and therefore not used when forwarding multipoint packets from S. In another embodiment, this implicit information is available through any distance-vector protocol that uses poisoned reverse or similar techniques. Both methods are used to quickly deliver a copy of any multipoint packet to all nodes in the network. A disadvantage of both mechanisms however is that a copy of every packet is delivered to every node, regardless of whether or not that node is actually interested in multipoint packets from that source.

In one embodiment, the overlay multicast system is equipped with a sparse forwarding mechanism. The overlay multicast system network is used for a single source multicast for high volume market data, video streams or similar data to be sent to large numbers receivers. In these types of uses it is not necessary that receivers themselves or other nodes be capable of publishing data packets to the multicast group. Instead, for security reasons it is better to know that only the real source can publish to the group. Given these requirements, together with the fact that multi-source protocols are more complex than their single-source alternatives, in one embodiment, the overlay multicast system preferably uses a custom single-source protocol for multicast distribution on the network. In another embodiment, the overlay multicast system supports multiple sources. For sake of convenience an embodiment of a single source multicast network is discussed. One of ordinary skill in the art would understand that the principles of operation can also be applied to a multi-source application as well.

FIG. 6 is a diagram of an example embodiment of a multicast distribution or “sink” tree. In the example embodiment, the multicast system starts with a source q that is active, but in this example, a sparse-mode protocol, with no receiver yet subscribed, no distribution tree exists. When the first receiver application p0 subscribes, its router initiates the reverse computation of one branch of the sink tree rooted at source q. It does so by first marking that p0 has a local subscriber for q:G by setting the bit in Lsp0[q,S] (LSp[ ] is a local array at node p that keeps track of all subscriptions of the local user applications connected to router p) to true and then sending a join (e.g., a <join (q:G)>) packet to the preferred hop towards source q. This neighbor p1 receives the <join (q:G)> packet and marks the link with p0 as a child link for multicast group q:G. It then forwards the packet to the next hop p2 in the shortest path towards q. Eventually node pn sends the <join (q:G)> packet to q. On receipt, q marks the link on which the packet was received as a child link for its local multicast group G and starts forwarding all packets addressed to G over the link to node pn. The multipoint packets addressed to G are inserted into the overlay multicast network at node q by the user application that previously bound the local multicast group q:G in publish mode. Nodes pn, pn-1, . . . , p1 all forward the packets to their neighbors from which a <join (q:G)> packet was received earlier. Node p0 has no child links for q:G, but does have Lsp0[q:G] set, so it delivers the packets to the local application. When another node p1 becomes interested in q:G while LSp1[q:G] is not already set, it sends a <join (q:G)> packet to its preferred hop towards q. Let pi+n be in the path from pi to q that receives the <join (q:G)> packet and suppose that pi+n is also in the path between p0 and q. In this case, pi+n is already in the established part of the sink tree of q, so it does not need to forward the <join (q:G)> packet further to q. Instead, it only marks the link on which it received the packet as a child link for q:G. In general, a node u only forwards a <join (S:G)> packet towards S if it is not already subscribed to S:G (hence, LSu[S:G] is not set and the collection of child links for S:G is empty).

The example sparse-mode distribution tree generated when processes subscribe to a S:Q group originally equals a part or all of the optimal sink tree rooted at S. However, when the underlying unicast routing algorithm detects changes in the network performance and updates some of the routing table entries, the optimal sink tree changes accordingly and may no longer be matched by the multicast distribution tree, rendering the tree as well as the multicast performance sub-optimal. An extreme example of this is when a link that is part of the multicast distribution fails entirely.

FIG. 7 is a diagram of an example overlay multicast network with a failed link. In the example a node u currently in the distribution tree for q:G only has a single route to q via w. The link between u and w fails. The node u concludes that it can no longer be part of the distribution tree (as U and q are no longer connected) and invalidates all child link subscription information. In case LSu[q:G] is set (i.e., the subscription table indicates that U is subscribed to group G from node q), u will rejoin the distribution tree as soon as a new path to q is found. In another embodiment, the algorithm sends a notification to the locally subscribed application to indicate that u is no longer connected to q.

In the example, it is assumed that all links have weight 1, while the link between p and w has weight 100. The result is that the link connecting p and w is not in the optimal sink tree rooted at q. Since only t and v have a local subscriber (LSt[q:G] and LSv[q:G] are set), the sparse distribution tree in this example equals the dotted arrows of FIG. 7.

In the example, nodes w, u, r and s all have only a single route to q. Node p has a path through both s and w, while s is the preferred hop towards q. When the link between u and w fails, u becomes disconnected from q and invalidates the subscription information for link u-r. Node u informs neighbor r of this fact as part of an immediate distance-vector exchange. This routing update implicitly tells node r that its join for q:G through u is no longer valid and that it should look for an alternative neighbor to rejoin.

In the example, an explicit leave packet to u is unnecessary. Since r is also left with no alternative path to q and has no locally subscribed applications, it invalidates its q:G subscription and is no longer part of the distribution tree. Neighbors t and s are implicitly informed about this through the ExBF routing update or similar routing update sent by r. Eventually w, u, r and s all leave the distribution tree, while nodes t and v schedule a rejoin when a new path to q is found. Again, they inform their local subscriber applications about this. When p receives the routing update from s, it looses its preferred hop for q and switches to neighbor w. The new cost to q is now 100 plus distance (w, q) and neighbor s is informed. Upon receipt, s discovers the new path to q and informs its neighbors r and v. Node v then sends the <join (q:G)> packet to S. Nodes s, p and w then construct the first part of the new, optimal distribution tree rooted at q. When the new path reaches r and t, t sends a join packet to r and r to s, reconnecting all receivers to the data stream.

In one embodiment, recovering from link failure or similar communication errors is divided in three steps. In the first step the underlying ExBF or similar routing protocol starts a wave that propagates a link failure through the network. Depending on the actual topology, this takes up to N−1 packets, where N represents the number of nodes in the network. The worst case time complexity for recovering from a link failure is 3N when ExBF or similar algorithm is used as the underlying unicast routing algorithm. A best case time complexity would be 2N.

The example discusses a scenario where the system recovers from a failure of a link that is part of the multicast distribution tree. More often however, links will not fail completely, but rather fluctuate in quality, causing the underlying routing algorithm to reassign preferred neighbors. In this example case it is not a requirement that the distribution tree is changed to reflect these changes, as it is not partitioned. However, since the tree becomes sub-optimal, it is changed.

In one embodiment, the decision whether or not to recompute a distribution tree after a routing table update, is made according to the quality of the used path. When a node p in the tree detects that the neighbor in the optimal path towards the multicast source has a cost that is only marginally lower than the neighbor that is currently p\'s parent in the actual distribution tree, the overhead of recomputation and the risk of packet loss outweighs the increased performance of the new tree. Also, since the quality of each link is continuously re-evaluated, the updated sink tree may only be temporal. Two straightforward solutions to this can be used. Either the algorithm sets a threshold on recomputation so that subscriptions are only moved to the neighbor with the currently optimal path towards the source if the decrease in cost is at least a ratio x, where x>1. A larger value of x then postpone tree adjustments until a substantial improvement can be gained, while a smaller x makes the tree actively follow the changing characteristics of the underlying network.

Another solution is to postpone recomputation of the tree for a period inversely proportional to the cost decrease ration x. The latter has the advantage that the optimal distribution tree is always guaranteed to be reached in finite time since the last routing table change.

In one embodiment, to run the tree building protocol, two new messages are introduced: the join message and the leave message. Both messages use a GroupData control message of the following format:

This message is used to send a list of subscriptions to a neighbor node or to cancel an aggregated list of subscriptions with a neighbor. The action field (9th byte) is used to distinguish between join and leave. The 10th and 11th byte indicate the number of (S, G) groups in the message. The action field is either “join” (0) or “leave” (1). In an alternative embodiment, this is extended to “stale”, “dead”, etc, to indicate the state of the publishing application.

Many multicast applications require some level of reliable communication. Examples of this are uploading files to several receivers simultaneously or replicating web server caches. Without guaranteed delivery, a multicast transport service has limited applicability. In one embodiment, the overlay multicast network utilizes standard delivery control based on moderating the sending of data based on the lowest bandwidth available on the route to a destination. However, this delivery system is unsuitable for certain applications.

One example is financial data distribution, it needs a transport service that can multicast live data without packet loss or corruption and without substantial end-to-end delay when parts of the network are temporarily or permanently congested. While live market data cannot tolerate random packet loss, it can be thinned. Under certain circumstances it is acceptable to omit certain stock quote updates. An example is a desktop application that displays real-time updates for hundreds or thousands of financial instruments. If all quote updates were delivered to this application, this would require a substantial amount of bandwidth and would cause the value of the more volatile instruments to change faster than a human can read. The data may be thinned through a process that involves inspection of the individual stock quotes and encoding them in individual data packets, tagging each with a priority value. For this application, as long as all packets labeled with the highest priority number are received, the partial stream can be considered intact. Additionally, when all updates of the second highest priority are also received, the quality of the partial stream is increased usually in the form of less latency.

In one embodiment of the overlay multicast system, priority numbers are associated with data packets. The priority numbers represent a logical layer inside a data stream. Data provided by a multicast application may be in the form of a stream of data. This data is subdivided into categories or priorities based on the nature of the data. When packets in a data stream are labeled with priorities in the range 0 to 3, the stream is said to have 4 layers. Also, the convention is to treat 0 as the highest priority and 3 as the lowest. Any other system of identifying priority levels may be utilized including alpha numeric indicators or similar identifiers. If a stream only contains a single layer, all packets are labeled with priority 0.

To software router daemons, a priority value of a packet is relative to the stream and becomes relevant when a decision to discard data at a congested router is made. The priority value has no meaning other than as a criterion for discarding packets on congested links. When an outgoing link of a router has insufficient bandwidth to transmit all pending packets, it forwards only those packets with a designated priority or the highest priority. This technique is also applied to any data type, including audio/video data and financial data. Using this system, the packet priority numbers cannot be misused by sources to give their data packets a greater chance of prioritized transmission by giving them the highest priority to gain advantage over other sources. Packet priorities are only compared between packets that are part of the same data stream.

In one embodiment, knowing that routers will use the packet priority numbers when making forwarding selections on congested parts of the network, a source carefully divides its data packets over different layers or priorities, in a way that a subset of the layers still contain a useful, uncorrupted representation of the data. This system is especially useful when multicasting a live data stream with a high data rate to a large number of receivers, scattered over a heterogeneous wide area network. Receivers that are on congested parts of the network will then receive the highest priority parts of the data only. This eliminates the need to lower the publishing rate to match the slowest receiver, while still being able to offer live, uncorrupted, thinned data streams to clients suffering from insufficient bandwidth. Example applications of the system include audio and video codecs that divide live multimedia content over layers to enhance the user experience over wide area networks.

In one embodiment, incoming messages are sorted by unicast or multicast sender address. As described above, a sender address is the combination of source router address and application session. An example of a source address is n1.mxs.office.erik:video.an2 which could be used by a user application broadcasting a video channel. In this case, only the source address is relevant, not the (uni- or multicast) destination address. Each incoming message is added to the queue that holds messages sent by that particular sender. If this is the first message from a sender, a new queue is automatically created to store it.

FIG. 8A is a flowchart of one embodiment of a process for managing congestion in the overlay multicast system. The congestion can be managed at each individual router. In one embodiment, the congestion is managed at the interface module level in each router. Each interface module has an inbound pipeline and outbound pipeline discussed above for processing inbound and outbound data. Each pipeline buffers data that is awaiting further processing. However, if either pipeline is unable to keep up with the pace of incoming data that needs to be processed some data must dropped.

In one embodiment, data is received as a set of data streams at each router (block 851). The data is then buffered in the inbound pipeline buffer (block 853). The same process applies to outbound data that is received from the kernel by the outbound pipeline. This data is stored in the outbound buffer. After the data has been stored a check is made of the inbound or outbound buffer to determine if it is full (block 855). In one embodiment, data is stored in the data structures in the buffers that organize the data packets into a set of queues. Each source address, data stream or layer in a data stream has a separate queue. A queue is sorted with highest priority and oldest data packets at the front of the queue. If the buffer is full then a decision is made to drop a designated amount of data in the form of packets from the buffer to make room for incoming data packets (block 857).

In one embodiment, a queue is chosen randomly or in a round robin to have data dropped. In another embodiment, the queue with the most data is chosen to have data dropped. In a further embodiment, a weighting factor is calculated to determine which queue is selected to have data dropped. The weighting factor is based on the amount of data in a queue, size of packets in a queue and similar factors. The weighting factor counteracts unfair distribution that is caused by selecting a queue by a round robin, random or similar method of selection. Data streams with large packets are unfairly affected by other methods because a disproportionate amount of data is dropped in comparison with other queues with smaller packets. Queue selection is also influence by the size and the amount of data that is intended to be dropped. A large queue that can drop close to the amount of data desired is weighted for selection.

In one embodiment, data is dropped if a total amount of data stored in all queues exceeds a threshold value. This threshold value is set by an administrator or is a set value. These systems enforce the fair allotment of bandwidth between data streams. Also, this system implements the prioritization of logical layers by dropping lower priority level layers when congestion occurs. If the buffer is not full then the pipelines may continue to store data in the buffers (block 851).

In one embodiment, while the queues grow in size as packets are received, a background thread constantly dequeues packets from the queues and transmits them over the network.

In one embodiment, the system that manages in- and output for the queues is divided in two parts. The first part is run by the background thread that constantly dequeues packets from the queues and transmits them over the network, while the second part is in charge of queuing new packets that are to be transmitted. The latter also implements the logic that defines when and which packet should be discarded (due to a buffer overflow).

The dequeuing part of the system, when selecting a packet for transmission, only looks at the first packet (most urgent) of each queue. To make sure each stream gets an equal share of bandwidth, it takes the individual packet sizes into account. When all packets (note that only the first packet of each queue is observed) have equal size, the dequeuing thread simply selects a random queue with equal probability and dequeues one of its packets. Since each queue has the same probability of being selected, each queue will deliver an equal amount of packets per time unit. Hence, each source will transmit an equal amount of bytes per second.

Since individual packets can have any size between 1 and some determined maximum number of bytes, their size must be considered when the transmit thread selects queues. Queues with many large packets should generally have a smaller probability of getting selected than queues with lots of small packets in order to keep the bandwidth division fair.

In one embodiment, only the first packet of each queue is considered to represent its queue and give it a probability of setting selected that is reversely proportional to its size in bytes. In an example where there are three queues, P, Q and R where the first packet in P is 100 bytes, the first packet from sender Q is 300 bytes, while the packet from R is 500 bytes, to calculate the selection probabilities, first their respective selection weights are defined. For example, the weight of the packet in P is computed by dividing the total size of P, Q and R by P\'s size: (100+300+500)/100=9. Q\'s weight is 3 and R\'s weight is 9/5. The weights are converted into selection probabilities by dividing them by the sum of all weights. This gives P a probability of 9/(9+3+1.8)=0.65, Q a probability of 0.22 and R of 0.13. When a queue becomes empty after dequeuing, it is removed.

In one embodiment, to keep the delay introduced by buffering packets under control, the second part of the system enforces a maximum total queue size. The sum of all packets in all queues may never exceed this threshold. When a new packet comes in, it is always accepted at first. If necessary a new queue is created for it, or it is added to its designated, existing queue. After adding a new packet, the total size of all queues is checked. If it is larger than the configured maximum, the algorithm runs a removal round in which it first selects a queue and then tells that queue to remove one packet. If the queues are still to big after removal of one packet, the process of selecting a queue and removing a packet is repeated iteratively until the total queue size is smaller than or equal to the configured maximum.

In one embodiment, shrinking the queues is always done after a packet was added, never preemptively. The reason for this is to let the new packet immediately participate in the removal selection process, rather than discarding it or making room for it at the expense of the other queues. Contrary to de-queuing packets, where the algorithm tries to select the “most urgent” packet, we now need to select the “least urgent” one. This packet is determined by looking at the size of the individual queues (the largest queue should generally be shrunk to keep resource division among the streams fair) and which queue can match the required size most accurately. This comes from the fact that individual packets can differ greatly in size, so removing the “least urgent” packet from queue P could result in freeing 8 kilobytes, while removing the “least urgent” packet from queue Q yields 40 bytes of space for example. Now if the queues in total exceed the maximum size by only a couple of bytes, it seems logical to remove the small packet from Q. This policy comes with a consequence, namely that it implicitly promotes the use of larger packets. After all, streams with lots of very small packets are more likely to accurately match the amount of buffer space that must be freed than stream queues with few very large packets. This property may encourage developers to use larger packets, increasing the efficiency and throughput of the network. In short, when selecting a queue for shrinking, the system favors queues that are will expunge the least amount of bytes in order to match the total buffer capacity and are large compared to the others.

In one embodiment, Before the selection can be made, the absolute weight factor for each queue is computed. The weight factor combines the queue\'s size and ability to accurately match the overcapacity of the buffer by removing packets. The latter is exposed in v. It represents the sum of the sizes of the packets that a queue must remove in order to eliminate the buffer\'s overcapacity. For example, if a queue P has 10 packets of 100 bytes each while the buffer currently has an overcapacity of 170 bytes (suppose the maximum is 10000 bytes, while all queues combined add up to 10170 bytes), P would need to remove at least 2 packets (2*100 bytes). In this case, vp is 200. The formula to compute the absolute weight factor of queue m is defined:

w n = ( ∑ i = 0 n  v i ) · s m v m

Download full PDF for full patent description/claims.




You can also Monitor Keywords and Search for tracking patents relating to this Layered multicast and fair bandwidth allocation and packet prioritization patent application.

Patent Applications in related categories:

20130121336 - Data plane independent assert election - Avoiding duplicative forwarding of multicast traffic is disclosed. A join message received at a first router is directed to a peer router of the first router from a first downstream node of the first router and the peer router without passing through the peer router. The join message indicates that ...

20130121335 - Dynamic multicast mode selection in a communication network - In one embodiment, a network device selectively operates according to a sparse multicast mode where the network device stores individual devices interested in one or more multicast groups and distributes corresponding multicast group traffic based on the individual devices. Alternatively, the network device selectively operates according to a dense multicast ...


###
monitor keywords

Other recent patent applications listed under the agent :



Keyword Monitor How KEYWORD MONITOR works... a FREE service from FreshPatents
1. Sign up (takes 30 seconds). 2. Fill in the keywords to be monitored.
3. Each week you receive an email with patent applications related to your keywords.  
Start now! - Receive info on patent apps like Layered multicast and fair bandwidth allocation and packet prioritization or other areas of interest.
###


Previous Patent Application:
Communication device with a path protection function, and network system using the communication device
Next Patent Application:
Methods of broadcastng and receiving scrambled multimedia programs, a terminal and a network head end for said methods
Industry Class:
Multiplex communications

###

FreshPatents.com Support - Terms & Conditions
Thank you for viewing the Layered multicast and fair bandwidth allocation and packet prioritization patent info.
- - - AAPL - Apple, BA - Boeing, GOOG - Google, IBM, JBL - Jabil, KO - Coca Cola, MOT - Motorla

Results in 0.88648 seconds


Other interesting Freshpatents.com categories:
Novartis , Pfizer , Philips , Procter & Gamble , g2