7 Case study: performance evaluation of an overlay network
In this section, a simulation to assess the performance of an overlay network for supporting synchronous e-training activities is described. In general terms, an overlay network is a virtual network deployed on top of an existing network to provide additional services. In this case, the overlay is used to provide a multicast delivery service between the participants of an e-training activity, so the multimedia data streams generated by a participant are delivered efficiently to the rest of the participants. This simulation is a simplified version of that carried out in [37].
7.1 Overlay architecture
The overlay network is organized in three planes. A simplified illustration of the overlay is depicted in Figure 7.16. These planes are managed by a central entity called the Rendezvous Point (RP).
Figure 7.16. Architecture of the overlay network.
First, the relay mesh is composed of several relays that are responsible for forwarding data to participants. Each of these relays is located in a multicast island where IP multicast is available and they are all interconnected through the Internet forming a full-mesh topology to reduce latency. Multimedia data streams are delivered using RTP, so the relays must forward RTP streams between participants. Various media types are used in e-training sessions, with audio, video, instant messaging and shared whiteboard annotations among them [38], and an RTP session is used for each media type.
The participants use IP multicast to send their data streams to the relay located in the same multicast island, which forwards them to the rest of the relays in the mesh. The streams are subsequently forwarded to the participants in the multicast islands of the relays. The efficiency of the relay mesh relies heavily on the availability of native multicast in the underlying network and the clustering of participants (number of multicast islands).
Second, a signaling plane is established between the participants of an e-training activity and the RP. This link allows participants to join the activity and facilitates the negotiation of the multimedia configuration (media types, encoding parameters, IP addresses and ports, etc.). The Session Initiation Protocol (SIP) is used as the signaling protocol. The RP plays the role of a SIP focus to implement a tightly coupled conferencing service, so every participant establishes a SIP dialog with the RP while the activity is running. The configuration of the multimedia sessions is described using the Session Description Protocol (SDP).
Finally, a mesh control plane is used for communication between the RP and the relays. The RP uses these links to reorganize the relay mesh according to participants joining and leaving and to failures in the overlay, in order to keep the real-time data delivery service efficient. Relays register with the RP as soon as possible, so the RP can establish TCP connections with them during activities. These connections are used as a keep-alive mechanism and to regularly gather link status information from the relays. It is not necessary for all the relays to participate in an ongoing activity. In fact, a relay will not be part of the relay mesh if there is no participant in its multicast island.
The overlay network includes a self-optimization technique that ensures a minimum number of streams interchanged among multicast islands. As previously mentioned, the RP maintains a list of registered relays, associating them with the identifier of the multicast island in which they are located. A relay is active if at least one participant joins the activity from the multicast island where the relay is located. The RP is responsible for including and excluding relays from the mesh as participants join and leave, so the relay mesh is only composed of active relays.
7.2 Simulation
One of the main reasons for using simulation to assess the performance of the overlay is the impracticality of asking a high number of users to participate in synchronous e-training activities purely for testing, which would be extremely expensive and disruptive for the users. Besides, the network resources required to evaluate the performance of the overlay thoroughly might not be available or might be too expensive.
The ns-3 simulator is used to simulate the operation of the network of an organization geographically dispersed over several sites, the entities of the overlay and the participants in a synchronous e-training activity. The ns-3 simulator includes a large number of models of wired and wireless channels and of many of the elements found in modern networks. It is written in C++ and uses the combined multiple-recursive random number generator MRG32k3a proposed in [39]. The whole TCP/IP stack is implemented on top of the channel models, so programmers can reuse the algorithms (and even the code) of existing implementations to build their ns-3 models. Thus, to simulate the overlay, existing RTP and SIP libraries are integrated into the simulator by using the simulator's socket library instead of the operating system's socket library.
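As a rough illustration of this integration approach, the following sketch shows how an application model can exchange packets through ns-3 sockets instead of operating-system sockets. The socket calls (Socket::CreateSocket, Bind, SetRecvCallback, SendTo) are part of the public ns-3 API, but the RelayApp class, its forwarding logic and the peer configuration are hypothetical and are not taken from the code used in [37].

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/internet-module.h"

using namespace ns3;

// Hypothetical application that forwards every received packet to a peer,
// standing in for an RTP library ported onto ns-3 sockets.
class RelayApp : public Application
{
public:
  void Setup (Address peer, uint16_t port) { m_peer = peer; m_port = port; }

private:
  virtual void StartApplication (void)
  {
    m_socket = Socket::CreateSocket (GetNode (), UdpSocketFactory::GetTypeId ());
    m_socket->Bind (InetSocketAddress (Ipv4Address::GetAny (), m_port));
    m_socket->SetRecvCallback (MakeCallback (&RelayApp::HandleRead, this));
  }

  virtual void StopApplication (void)
  {
    if (m_socket) m_socket->Close ();
  }

  void HandleRead (Ptr<Socket> socket)
  {
    Ptr<Packet> packet;
    while ((packet = socket->Recv ()))
      {
        // Forward the payload unchanged to the configured peer relay.
        m_socket->SendTo (packet, 0, m_peer);
      }
  }

  Ptr<Socket> m_socket;
  Address m_peer;
  uint16_t m_port;
};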
The model assumes that there are various sites where IP multicast is available and that there is an RTP relay in each site. An ideal wide area network (WAN) connects the sites through 20 Mbps network links with a delay of 20 ms. The network link connecting the RP to the WAN has a data rate of 10 Mbps and introduces no delay in communications. These parameters are obtained empirically from the network of a large corporation [38]. A CSMA channel is used to simulate the WAN between the sites. This channel is modeled as a simplistic Ethernet-like network in which the state of the medium is instantaneously shared among all the devices, so there is no need for collision detection; transmissions are instead handled through queuing. The local area network (LAN) of each site is also modeled as a CSMA channel. A point-to-point channel, modeled as a simplistic serial line link, interconnects each LAN with the ideal WAN. The CSMA and point-to-point channels model both the physical and link layers of the protocol stack.
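A minimal sketch of how such a topology might be declared with the standard ns-3 helpers follows. The helper classes and attribute names are part of the public ns-3 API, while the number of sites, the way the RP is attached and the omission of the per-site LANs and IP addressing are assumptions made only for illustration.

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/internet-module.h"
#include "ns3/csma-module.h"
#include "ns3/point-to-point-module.h"

using namespace ns3;

int main (int argc, char *argv[])
{
  // One gateway node per multicast island (site); four sites are assumed here.
  NodeContainer sites;
  sites.Create (4);
  Ptr<Node> rp = CreateObject<Node> ();      // Rendezvous Point

  // Ideal WAN between the site gateways: CSMA channel, 20 Mbps, 20 ms delay.
  CsmaHelper wan;
  wan.SetChannelAttribute ("DataRate", StringValue ("20Mbps"));
  wan.SetChannelAttribute ("Delay", TimeValue (MilliSeconds (20)));
  NetDeviceContainer wanDevices = wan.Install (sites);

  // Link connecting the RP to the WAN: point-to-point, 10 Mbps, no delay.
  PointToPointHelper rpLink;
  rpLink.SetDeviceAttribute ("DataRate", StringValue ("10Mbps"));
  rpLink.SetChannelAttribute ("Delay", TimeValue (MilliSeconds (0)));
  NetDeviceContainer rpDevices = rpLink.Install (rp, sites.Get (0));

  // Each site LAN would be another CsmaHelper instance attached to its
  // gateway through a point-to-point link (omitted here, as is IP addressing).

  InternetStackHelper stack;
  stack.Install (sites);
  stack.Install (rp);

  Simulator::Run ();
  Simulator::Destroy ();
  return 0;
}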
The range of possible overlay deployments is extremely wide. Many parameters, such as the number of relays in the mesh, the number of participants and the balance of participants between relays, can be varied. However, in order to evaluate the impact of each parameter individually, only one is varied at a time during the simulations. For simplicity, only balanced deployments are considered—that is, all the relays serve the same number of participants.
A participant has to be modeled to simulate the traffic generated during a synchronous e-training activity. The participant can use several kinds of media types to communicate with other participants in an activity. In this case, the participant is represented as an entity that generates one audio stream, one video stream, annotations on the shared whiteboard and telepointer movements, resembling the data generated by a synchronous e-training tool [38].
The key decision is how to model the traffic generated by a participant. Traces reporting the behavior of users can be collected from real e-training activities, so the model of a participant can be derived. For example, a participant may join the activity after a waiting time, simulated using an exponential random variable Exp(1/45) s. Then, the participant establishes an SIP dialog with the RP to join the activity and negotiate the media configuration. The audio and video streams of a participant are activated regularly. A video stream is a simulated 160×120 variable bitrate H.264 video stream at 10 frames per second (the size of the video packets sent to the network also has to be modeled). An audio stream is a simulated constant bitrate iLBC audio stream with a packetization time of 20 ms. Both the interval between activations and the duration of the streams are assumed to be normal random variables. Thus, a participant generates an audio stream of duration N(15, 2.6) s after an elapsed time N(25, 3.3) min, and generates a video stream of duration N(30, 9.5) s after an elapsed time N(40, 8.2) min.
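In ns-3, this traffic model can be expressed directly with the built-in random variable classes. The sketch below is only illustrative: the structure name is an assumption, and the second parameter of each N(·,·) above is passed through unchanged (i.e., treated as a variance, since ns-3's NormalRandomVariable is parameterized by mean and variance); the text does not state whether it is a variance or a standard deviation.

#include "ns3/core-module.h"

using namespace ns3;

// Illustrative container for the participant traffic model described above.
// All values keep the units used in the text (seconds for join delays and
// stream durations, minutes for the intervals between activations).
struct ParticipantModel
{
  Ptr<ExponentialRandomVariable> joinDelay;    // Exp(1/45): mean 45 s
  Ptr<NormalRandomVariable> audioInterval;     // N(25, 3.3) min
  Ptr<NormalRandomVariable> audioDuration;     // N(15, 2.6) s
  Ptr<NormalRandomVariable> videoInterval;     // N(40, 8.2) min
  Ptr<NormalRandomVariable> videoDuration;     // N(30, 9.5) s

  ParticipantModel ()
  {
    joinDelay = CreateObject<ExponentialRandomVariable> ();
    joinDelay->SetAttribute ("Mean", DoubleValue (45.0));

    audioInterval = MakeNormal (25.0, 3.3);
    audioDuration = MakeNormal (15.0, 2.6);
    videoInterval = MakeNormal (40.0, 8.2);
    videoDuration = MakeNormal (30.0, 9.5);
  }

  static Ptr<NormalRandomVariable> MakeNormal (double mean, double variance)
  {
    Ptr<NormalRandomVariable> v = CreateObject<NormalRandomVariable> ();
    v->SetAttribute ("Mean", DoubleValue (mean));
    v->SetAttribute ("Variance", DoubleValue (variance));
    return v;
  }
};

A draw such as Seconds (model.joinDelay->GetValue ()) can then be passed to Simulator::Schedule to trigger the corresponding event in the participant model.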
The formats of the annotations on the shared whiteboard and the telepointers used by participants are described in [38]. A participant generates an RTP stream containing their annotations on the shared whiteboard, while another RTP stream is used to convey the information of the telepointer. Similarly to audio and video, the activation time and the duration of these streams are assumed to be normal random variables. A participant generates an annotation stream of duration N(60, 10) s after an elapsed time N(16.6, 1.3) min, and generates a telepointer stream of duration N(15, 2.5) s after an elapsed time N(20, 1.6) min.
Real e-training activities are moderated in order to manage the interactions between participants and to avoid excessive network resource consumption due to too many simultaneous multimedia streams. Floor control protocols can be used to share the data channels between the participants of the activity. The maximum numbers of data streams that can be active simultaneously in the simulations are: 4 audio streams, 4 video streams, 2 annotation streams and 2 telepointer streams.
Moreover, two roles are usually identified in the participants of a synchronous e-training activity: one instructor and many regular participants. The instructor is modeled as a participant who uses all the media types continuously during the activity. The rest of the participants issue floor requests to use the data channels. This situation closely resembles the behavior of users in synchronous e-training activities where participants usually interrupt the activity to ask questions and the instructor can be seen and heard throughout the duration of the activity. In all cases, the requests by participants to use the data channels are granted in a first-in first-out (FIFO) order.
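A minimal sketch of such a FIFO floor-control policy is shown below. The FloorControl class, its method names and the way grants are signaled are illustrative assumptions, while the per-media limits are the ones listed above.

#include <deque>
#include <map>
#include <string>

// FIFO floor control: requests are queued per media type and granted in
// arrival order while the number of active streams stays under the limits
// used in the simulations (4 audio, 4 video, 2 annotation, 2 telepointer).
class FloorControl
{
public:
  FloorControl ()
  {
    limits = { {"audio", 4}, {"video", 4}, {"annotation", 2}, {"telepointer", 2} };
  }

  // A participant asks for the floor on a given media type.
  void Request (const std::string &media, int participantId)
  {
    pending[media].push_back (participantId);
    TryGrant (media);
  }

  // A participant releases the floor, letting the next queued request in.
  void Release (const std::string &media)
  {
    if (active[media] > 0) --active[media];
    TryGrant (media);
  }

private:
  void TryGrant (const std::string &media)
  {
    while (!pending[media].empty () && active[media] < limits[media])
      {
        int next = pending[media].front ();   // FIFO order
        pending[media].pop_front ();
        ++active[media];
        Grant (media, next);                  // e.g., signal the participant model
      }
  }

  void Grant (const std::string &media, int participantId)
  {
    // Placeholder: activate the corresponding stream in the simulation.
    (void) media; (void) participantId;
  }

  std::map<std::string, int> limits;
  std::map<std::string, int> active;
  std::map<std::string, std::deque<int>> pending;
};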
Various simulations have to be conducted to assess the performance of the overlay when varying the number of participants and the number of sites where they are located. The duration of a typical e-training activity is one hour.
7.3 Analysis of results
Different performance metrics can be used to analyze the overall performance of the overlay. The workload at the relays when forwarding streams can be estimated using the average bandwidth consumption. The user experience when participating in synchronous e-training activities supported by the platform can be estimated using the average packet loss and the interarrival jitter of the audio and video packets. As an example, Figure 7.17 shows the cumulative average network bandwidth consumed in the links of the sites to the Internet when varying the number of participants in the activity and the number of relays in the mesh. It shows that the network bandwidth consumed in the links of the sites grows asymptotically toward a bound, because the floor control policy limits the maximum number of streams that can be active in the overlay.
Figure 7.17. Cumulative average network bandwidth consumption in the network links of the sites.
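For reference, the interarrival jitter of the audio and video packets can be computed at the receivers as specified in RFC 3550. The short sketch below is not part of the original simulation code; it applies the standard 1/16 smoothing to the difference in packet transit times and assumes that arrival times and RTP timestamps have already been converted to a common time unit.

#include <cmath>

// Running interarrival-jitter estimate as defined in RFC 3550, Section 6.4.1:
// D = (R_i - R_j) - (S_i - S_j) is the difference in relative transit time
// between consecutive packets, and the jitter is smoothed with a 1/16 gain.
struct JitterEstimator
{
  double jitter = 0.0;
  double prevTransit = 0.0;
  bool first = true;

  // arrival and rtpTimestamp must be expressed in the same units (e.g., ms).
  void Update (double arrival, double rtpTimestamp)
  {
    double transit = arrival - rtpTimestamp;
    if (!first)
      {
        double d = std::fabs (transit - prevTransit);
        jitter += (d - jitter) / 16.0;
      }
    prevTransit = transit;
    first = false;
  }
};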
Since stochastic processes modeling different entities of the overlay are involved in the simulations, confidence intervals are very useful for describing the accuracy of the results. As can be seen in Figure 7.18, these intervals are quite narrow due to the large number of replications in the tests.
Figure 7.18. Average network bandwidth consumption in the network links of the sites for each media type.
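A confidence interval such as those plotted above can be obtained from the independent replications as in the following sketch. It assumes the per-replication averages are available in a vector and uses the normal approximation (the 1.96 critical value corresponds to a 95% confidence level), which is reasonable given the large number of replicas; with few replicas a Student-t quantile should be used instead.

#include <cmath>
#include <vector>

struct Interval { double mean; double halfWidth; };

// replicas holds the average bandwidth (or any other metric) measured in each
// independent replication; at least two replications are required.
Interval ConfidenceInterval95 (const std::vector<double> &replicas)
{
  const double n = static_cast<double> (replicas.size ());
  double sum = 0.0;
  for (double x : replicas) sum += x;
  const double mean = sum / n;

  double sq = 0.0;
  for (double x : replicas) sq += (x - mean) * (x - mean);
  const double stddev = std::sqrt (sq / (n - 1.0));   // sample standard deviation

  return { mean, 1.96 * stddev / std::sqrt (n) };
}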
An overlay network is a computer network that is built on top of another network. An example is shown in Fig. 2.16. Nodes in the overlay network can be thought of as being connected by virtual or logical links, each of which corresponds to a path, perhaps through many physical links, in the underlying network. For example, distributed systems such as peer-to-peer networks are overlay networks because their nodes run on top of the Internet. The Internet was originally built as an overlay upon the telephone network, while today (through the advent of Voice over IP (VoIP)), the telephone network is increasingly turning into an overlay network built on top of the Internet.
Figure 2.16. An illustrative example of overlay networks.
Overlay networks form the foundation of virtual networks: they run as independent virtual networks on top of a physical network infrastructure. These virtual network overlays allow resource providers, such as cloud providers, to provision and orchestrate networks alongside other virtual resources. They also offer a new path to converged networks and programmability.
A virtual network is a computer network that consists, at least in part, of virtual network links. A virtual network link is a link that does not consist of a physical (wired or wireless) connection between two computing devices but is implemented using methods of network virtualization (NV).
The two most common forms of NV are protocol-based virtual networks, such as Virtual LANs (VLANs) [7], Virtual Networks (VNs), Virtual Private Networks (VPNs) [51], and Virtual Private LAN Services (VPLS) [173], and virtual networks that are based on virtual devices (such as the networks connecting VMs inside a hypervisor). In practice, both forms can be used in conjunction.
VLANs are logical local area networks (LANs) based on physical LANs as shown in Fig. 2.16. A VLAN can be created by partitioning a physical LAN into multiple logical LANs using a VLAN ID. Alternatively, several physical LANs can function as a single logical LAN. The partitioned network can be on a single router, or multiple VLANs can be on multiple routers just as multiple physical LANs would be.
A VPN consists of multiple remote end points (typically routers, VPN gateways, or software clients) joined by some sort of tunnel over another network, usually a third-party network. Two such end points constitute a “Point to Point Virtual Private Network” (or a PTP VPN). Connecting more than two end points by putting in place a mesh of tunnels creates a “Multipoint VPN.”
A VPLS is a specific type of Multipoint VPN. VPLSs are divided into Transparent LAN Services (TLS) and Ethernet Virtual Connection Services (EVCS). A TLS sends what it receives, so it provides geographic separation but not VLAN subnetting. An EVCS adds a VLAN ID, so it provides both geographic separation and VLAN subnetting.
A common example of a virtual network that is based on virtual devices is the network inside a hypervisor, where traffic between virtual servers is routed using virtual switches (vSwitches) along with virtual routers and virtual firewalls for network segmentation and data isolation. Such networks can use nonvirtual protocols such as Ethernet as well as virtualization protocols such as the VLAN protocol IEEE 802.1Q [146].
MSNs are overlay networks built on top of existing mobile networks. As illustrated in Fig. 2, mobile nodes in the underlying mobile network are connected by physical wireless links, while nodes in the MSN overlay network are connected by social relationships, each of which can be regarded as a virtual link consisting of several physical links in the underlying mobile network. Each node in the MSN maintains its social relations with other nodes, and the information exchange between them can be fulfilled through opportunistic networking schemes when the Internet is unavailable.
Some architectures use Blockchain as an overlay network (Dorri, Kanhere, Jurdak, & Gauravaram, 2019). In these scenarios, healthcare data are stored on cloud storage infrastructure, and the cloud servers compute the hash value of the stored data and send it to an overlay network implemented as a peer-to-peer network with a decentralized architecture. This model, shown in Fig. 10.5, consists of four parts: a patient equipped with healthcare sensors, smart contracts, cloud storage, and an overlay network powered by Blockchain technology.
Figure 10.5. Using Blockchain as an overlay network to provide privacy preservation.
The function of each part is as follows:
1.
Patient: the patient is connected to essential sensors. These sensors collect health data such as heartbeat, sleeping conditions, blood pressure, blood glucose, etc.
2.
Smart contracts: smart contracts allow conditions for data gathering to be defined, such as the highest and lowest acceptable patient vital signs. For example, the highest and lowest acceptable blood pressure values are defined in the smart contract. When data sensed and gathered from the wearable sensors exceed these boundaries, the smart contract sends the abnormal data to the cloud. Smart contracts run on smartphones, PDAs, or desktop computers connected to the sensors, and these devices send the data over a secure channel to cloud storage.
3.
Cloud storage: healthcare data require massive storage capacity and reliable storage media, so cloud storage is used to manage them.
4.
Overlay network: the overlay network is a peer-to-peer network based on a distributed architecture consisting of computers, smartphones, tablets, or any other devices.
When health information gathered from the wearable sensors is classified as abnormal by the smart contracts, it is formatted as healthcare records (EMRs). These records are sent to the cloud. Cloud servers calculate the hash index of these records and save it on the overlay network; in this way, tampering with or manipulating the saved data becomes impossible. When a patient decides to share his data with someone, he creates a request, signs it, and sends it to the network. The receiver's public key defines the destination of this transaction. The overlay network verifies the patient's signature and then broadcasts the patient's request to all nodes. Each node receives the transaction; if the transaction belongs to that node, the node processes it, retrieves the patient's data, analyzes it, creates a reply, and sends it to the patient. All these transactions are saved on the Blockchain.
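As an illustration of one step of this flow, the sketch below computes the hash index of a serialized EMR with SHA-256 from OpenSSL before it would be saved on the overlay. The function name and the assumption that the record is already serialized as a string are hypothetical; the actual proposal may use a different hash or encoding.

#include <openssl/sha.h>
#include <iomanip>
#include <sstream>
#include <string>

// Compute the hash index of a serialized healthcare record (EMR), as a cloud
// server would before saving the index on the Blockchain-based overlay.
std::string HashIndex (const std::string &emrRecord)
{
  unsigned char digest[SHA256_DIGEST_LENGTH];
  SHA256 (reinterpret_cast<const unsigned char *> (emrRecord.data ()),
          emrRecord.size (), digest);

  std::ostringstream hex;
  for (unsigned char byte : digest)
    hex << std::hex << std::setw (2) << std::setfill ('0') << static_cast<int> (byte);

  // Any later change to the EMR would produce a different index, which is how
  // tampering with the stored data can be detected.
  return hex.str ();
}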
As mentioned before, three main security requirements need to be addressed by such models: confidentiality, integrity, and availability. Confidentiality ensures that only authorized users can access data. Integrity guarantees that data is not changed between sender and receiver while it is transferred, and availability means that data is available to authorized users anytime and anywhere. In this model, only authorized users can join the network, and data is exposed to registered users only if the data owner shares it with them; this mechanism guarantees confidentiality. Integrity is achieved through Blockchain properties: the hash index of all data is saved on the Blockchain, and any change to the data must be approved and accepted by PoW on the Blockchain, which guarantees the integrity of the data. Using the cloud infrastructure and the distributed ledger mechanism of the Blockchain guarantees availability in this model. A drawback of this proposal is that the mining algorithm can become a bottleneck for the system. Some proposals divide the mining nodes into separate clusters and implement an election algorithm to elect a cluster head; in this way, they reduce network traffic and increase the throughput of the model.
We use a JSON structure to save patient properties and diagnosed data, whereas most researchers have used the XML format to manage gathered and diagnosed data. Our proposed data structure is shown in Fig. 10.6.
In recent years, overlay networks and overlay routing have received considerable attention. From our discussion so far on multilayer routing, you can see that the notion of an overlay has been around for quite some time. For instance, consider the telephone network over the transport network, or the Internet over the transport network; we can say that any such “service” network is also an overlay network over the telecommunication transport network. The interaction of such overlay networks with the telecommunications transport network has been studied for quite some time. One of the key issues to understand is how a failure in the underlying transport network, for example, due to a fiber cut, can impact rerouting in the service network [224], [295], [320], [559], [564], [565], [566], [590], [844], [886]. Any such routing decision also needs to consider shared risk link groups, both in terms of the reaction after a failure and in terms of preplanning during route provisioning through diversity or capacity expansion. For instance, consider Figure 24.10, in which IP links R1–R2 and R1–R4 would likely be routed on SONET routes S1–S5–S2 and S1–S5–S4, respectively; here, link S1–S5 falls into the shared risk link group category since the failure of this link will affect multiple IP network links; in fact, it would isolate router R1. Thus, to protect against such situations, the SONET network should provide diversity by adding, say, link S1–S4.
The overlay concept is, however, not limited to just two layers. Consider a three-layer network architecture such as IP over MPLS over WDM. In this case, the MPLS network is an overlay over the WDM network while it is, in turn, an underlay to the IP network; in other words, the IP network is an overlay over the MPLS network. It is important to recognize that each such network can employ routing within its own context; typically, however, the routing decisions in each such network are made on different time scales. Regardless, when a failure occurs, each such network might decide to react based on its own knowledge, which could lead to instability in the overall infrastructure. As of now, there is very little protocol-level coordination between networks in different layers to deploy an orchestrated recovery for an overall benefit.
A more recent use of overlay networking is in the context of a virtualized network on top of the Internet. In this case, nodes can be set up that act as overlay network routing nodes, with a logical path established between any two such nodes over the Internet, for example, using a TCP session. Here, overlay network nodes are computers that need not be directly connected to a router in the Internet. Since a TCP session is set up between two computers to form a virtual link for the overlay network, any traffic can be transported over this virtual TCP-based link. Moreover, the computers can run their own routing protocol on this overlay network topology.
You may wonder what the advantage of creating such overlay networks is. An important point to note is that routing in the core of the Internet does not change much over short time periods. In other words, we may assume the routing path to be static in a specific time window. A problem with this is that if there is a failure, it may take a while to recover from it; in other words, the network may not respond as quickly as you would like it to. On the other hand, using an overlay network, frequent probes can be generated to learn about a failure as soon as possible, thus allowing rerouting through the overlay nodes. Consider Figure 24.11. The core Internet is shown with a number of routers, and four computers serving as overlay network nodes sit at the edge of the network, marked as O1, O2, O3, and O4. Any two of them are connected through a TCP session forming a virtual tunnel/link. We can see that the logical link between O3 and O4 uses the link between routers R2 and R3. If this physical link fails, O3 and O4 would immediately recognize the failure from the probing module. Then, O4 may route its traffic to O2 to deliver to O3. Thus, an overlay network may be formed to provide resilience [30].
Figure 24.11. Overlay Network over Internet.
From the perspective of the overlay network, an estimate of logical link bandwidth may need to be refreshed frequently, so that the information is as accurate as possible in the absence of specifics about the underlying topology; this is then useful for the services that use the overlay network [894]. Similarly, a delay estimate might be needed for some applications that use the overlay network. Thus, to estimate bandwidth or delay, or to check whether the connectivity is up, probe packets can be sent periodically. To even out unusual fluctuations, it might be useful to smooth the available bandwidth or the delay estimate using the exponentially weighted moving average method (see Appendix B.6). Such smoothed estimates can be periodically communicated between the overlay network nodes using a customized link state protocol so that all nodes have a reasonably accurate view. In turn, based on the information obtained by the overlay network nodes, a routing decision for services that use the overlay network can be made. This depends on the scope of the service, though. If, for example, a service requires a bandwidth guarantee, then a QoS routing based approach can be employed (refer to Chapter 21), which may involve alternate routing through overlay network nodes; in this case, a performance measure such as the bandwidth denial ratio would be important to consider. If, however, services that use such an overlay network require only a soft guarantee, then performance measures other than the bandwidth denial ratio, such as throughput, would be necessary to consider [894]. In addition, understanding the interaction between the overlay and the underlay in terms of routing and the impact on performance is an important problem to consider [503], [739].
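A minimal sketch of the exponentially weighted moving average smoothing mentioned above is given below; the smoothing factor 0.125 is only an illustrative default, not a value prescribed by the text.

// Exponentially weighted moving average of a probed link metric (available
// bandwidth or delay). A smaller alpha gives smoother estimates that react
// more slowly to change.
class EwmaEstimator
{
public:
  explicit EwmaEstimator (double alpha = 0.125) : m_alpha (alpha) {}

  double Update (double sample)
  {
    if (!m_initialized)
      {
        m_estimate = sample;          // seed with the first probe result
        m_initialized = true;
      }
    else
      {
        m_estimate = m_alpha * sample + (1.0 - m_alpha) * m_estimate;
      }
    return m_estimate;
  }

  double Value () const { return m_estimate; }

private:
  double m_alpha;
  double m_estimate = 0.0;
  bool m_initialized = false;
};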
It may be noted that the customized link state protocol for the overlay network does not need to use the flooding mechanism used by link state routing protocols such as OSPF and IS–IS. The reason is that the overlay network may be set up as a full-mesh topology; therefore, every overlay node is logically connected to every other node. Thus, a simple approach can be used to communicate what a node has learned from another node through probing, letting a third node know about the status of a virtual link.
An important problem is to understand the interactions between the overlay network and the underlying network. In particular, each may have a different goal: for example, the overlay network may wish to reduce traffic delay, while the objective of the underlay network is to minimize network cost. Due to such differing goals, the interaction problem can be formulated as a two-person, non-cooperative, non-zero-sum game; see [502] for details.
At a conceptual level, P2P/Overlay networks resemble the overlay networks presented in detail throughout this book. Just as the data center virtual networks are overlaid on a physical infrastructure whose details are masked from the virtual network, so also is the P2P/Overlay network overlaid over the public Internet without concern for or knowledge of the underlying network topology. Such networks are composed of a usually ad hoc collection of host computers in diverse locations owned and operated by separate entities, with each host connected to the Internet in either a permanent or temporary fashion. The peer-to-peer (and hence the name P2P) connections between these hosts are usually TCP connections. Thus all of the hosts in the network are directly connected. Napster is the earliest well-known example of a P2P/Overlay network.
We introduce P2P/Overlay networks here primarily to distinguish them from the overlay networks we describe in SDN via Hypervisor-based Overlays. Although the nature of the overlay itself is different, it is interesting to consider where there might be some overlap between the two technologies. Just as scaling SDN will ultimately require coordination of controllers across controlled environments, there is a need for coordination between P2P/Overlay devices. SDN helps move up the abstraction of network control, but there will never be a single controller for the entire universe, and thus there will still need to be coordination between controllers and controlled environments. P2P/Overlay peers also must coordinate among each other, but they do so in a topology independent way by creating an overlay network. A big distinction is that Open SDN can also control the underlay network. The only real parallel between these two technologies is that at some scaling point, they must coordinate and control in a distributed fashion. The layers at which this is applied are totally different, however.
The primary attribute of SDN via overlays as it addresses adds, moves, and changes is that the technology revolves around virtualization. It does not deal with the physical infrastructure at all. The networking devices that it manipulates are most often the virtual switches that run in the hypervisors. Furthermore, the network changes required to accomplish the task are simple and confined to the construction and deletion of virtual networks, which are carried within tunnels that are created expressly for that purpose. These virtual networks are easily manipulated via software.
Consequently, the task of making adds, moves, deletes, and changes in an overlay network is quite straightforward and is easily automated. Because the task is isolated and constrained to tunnels, problems of complexity are less prevalent compared to what would be the case if the changes needed to be applied and replicated on all the physical devices in the network. Thus, many would argue that overlays are the simplest way to provide the automation and agility required to support frequent adds, moves, deletes, and changes.
A downside of this agility is that since the virtual networks are not tightly coupled with the physical network, it is easy to make these adds, moves, and changes without being certain that the underlying physical network has the capacity to handle them. The obvious solution is to over-engineer the physical network with great excess capacity, but this is not an efficient solution to the problem.
From its inception, the Internet has adopted a clean model, in which the routers inside the network are responsible for forwarding packets from source to destination, and application programs run on the hosts connected to the edges of the network. The client/server paradigm illustrated by the applications discussed in the first two sections of this chapter certainly adheres to this model.
In the last few years, however, the distinction between packet forwarding and application processing has become less clear. New applications are being distributed across the Internet, and in many cases these applications make their own forwarding decisions. These new hybrid applications can sometimes be implemented by extending traditional routers and switches to support a modest amount of application-specific processing. For example, so-called level-7 switches sit in front of server clusters and forward HTTP requests to a specific server based on the requested URL. However, overlay networks are quickly emerging as the mechanism of choice for introducing new functionality into the Internet.
You can think of an overlay as a logical network implemented on top of some underlying network. By this definition, the Internet started out as an overlay network on top of the links provided by the old telephone network. Figure 9.19 depicts an overlay implemented on top of an underlying network. Each node in the overlay also exists in the underlying network; it processes and forwards packets in an application-specific way. The links that connect the overlay nodes are implemented as tunnels through the underlying network. Multiple overlay networks can exist on top of the same underlying network—each implementing its own application-specific behavior—and overlays can be nested, one on top of another. For example, all of the example overlay networks discussed in this section treat today's Internet as the underlying network.
Figure 9.19. Overlay network layered on top of a physical network.
We have already seen examples of tunneling, for example, to implement virtual private networks (VPNs). As a brief refresher, the nodes on either end of a tunnel treat the multi-hop path between them as a single logical link; the nodes that are tunneled through forward packets based on the outer header, never aware that the end nodes have attached an inner header. Figure 9.20 shows three overlay nodes (A, B, and C) connected by a pair of tunnels. In this example, overlay node B might make a forwarding decision for packets from A to C based on the inner header (IHdr), and then attach an outer header (OHdr) that identifies C as the destination in the underlying network. Nodes A, B, and C are able to interpret both the inner and outer header, whereas the intermediate routers understand only the outer header. Similarly, A, B, and C have addresses in both the overlay network and the underlying network, but they are not necessarily the same; for example, their underlying address might be a 32-bit IP address, while their overlay address might be an experimental 128-bit address. In fact, the overlay need not use conventional addresses at all but may route based on URLs, domain names, an XML query, or even the content of the packet.
Figure 9.20. Overlay nodes tunnel through physical nodes.
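The encapsulation step performed at an overlay node such as B can be sketched as follows. The header layout is deliberately simplified (a real outer header would be a full IP header) and the field and function names are illustrative only.

#include <cstdint>
#include <cstring>
#include <vector>

// Outer header understood by the underlying network; the inner header stays
// untouched inside the payload and is interpreted only by the overlay nodes.
struct OuterHeader
{
  uint32_t underlaySrc;   // e.g., B's 32-bit IP address
  uint32_t underlayDst;   // e.g., C's 32-bit IP address
};

std::vector<uint8_t> Encapsulate (const std::vector<uint8_t> &innerPacket,
                                  uint32_t src, uint32_t dst)
{
  OuterHeader outer{src, dst};
  std::vector<uint8_t> wire (sizeof (outer) + innerPacket.size ());
  std::memcpy (wire.data (), &outer, sizeof (outer));
  std::memcpy (wire.data () + sizeof (outer), innerPacket.data (), innerPacket.size ());
  return wire;   // intermediate routers forward using only the outer header
}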
Overlays and the Ossification of the Internet
Given its popularity and widespread use, it is easy to forget that at one time the Internet was a laboratory for researchers to experiment with packet-switched networking. The more the Internet has become a commercial success, however, the less useful it is as a platform for playing with new ideas. Today, commercial interests shape the Internet's continued development.
In fact, as far back as 2001, a report from the National Research Council pointed to the ossification of the Internet, both intellectually (pressure for compatibility with current standards stifles innovation) and in terms of the infrastructure itself (it is nearly impossible for researchers to affect the core infrastructure). The report went on to observe that, at the same time, a whole new set of challenges were emerging that may require a fresh approach. The dilemma, according to the report, is that
… successful and widely adopted technologies are subject to ossification, which makes it hard to introduce new capabilities or, if the current technology has run its course, to replace it with something better. Existing industry players are not generally motivated to develop or deploy disruptive technologies…
Finding the right way to introduce disruptive technologies is an interesting issue. Such innovations are likely to do some things very well, but overall they lag current technology in other important areas. For example, to introduce a new routing strategy into the Internet, one would have to build a router that not only supports this new strategy but also competes with established vendors in terms of performance, reliability, management toolset, and so on. This is an extremely tall order. What the innovator needs is a way to allow users to take advantage of the new idea without having to write the hundreds of thousands of lines of code needed to support just the base system.
Overlay networks provide exactly this opportunity. Overlay nodes can be programmed to support the new capability or feature and then depend on conventional nodes to provide the underlying connectivity. Over time, if the idea deployed in the overlay proves useful, there may be economic motivation to migrate the functionality into the base system—that is, add it to the feature set of commercial routers. On the other hand, the functionality may be complex enough that an overlay layer may be exactly where it belongs.
9.4.1 Routing Overlays
The simplest kind of overlay is one that exists purely to support an alternative routing strategy; no additional application-level processing is performed at the overlay nodes. You can view a virtual private network (see Section 4.1.8) as an example of a routing overlay, but one that doesn't so much define an alternative strategy or algorithm as it does alternative routing table entries to be processed by the standard IP forwarding algorithm. In this particular case, the overlay is said to use “IP tunnels,” and the ability to utilize these VPNs is supported in many commercial routers.
Suppose, however, you wanted to use a routing algorithm that commercial router vendors were not willing to include in their products. How would you go about doing it? You could simply run your algorithm on a collection of end hosts, and tunnel through the Internet routers. These hosts would behave like routers in the overlay network: As hosts they are probably connected to the Internet by only one physical link, but as nodes in the overlay they would be connected to multiple neighbors via tunnels.
Since overlays, almost by definition, are a way to introduce new technologies independent of the standardization process, there are no standard overlays we can point to as examples. Instead, we illustrate the general idea of routing overlays by describing several experimental systems that have been built by network researchers.
Experimental Versions of IP
Overlays are ideal for deploying experimental versions of IP that you hope will eventually take over the world. For example, IP multicast (Section 4.2) started off as an extension to IP and even today is not enabled in many Internet routers. The MBone (multicast backbone) was an overlay network that implemented IP multicast on top of the unicast routing provided by the Internet. A number of multimedia conference tools were developed for and deployed on the MBone. For example, IETF meetings—which are a week long and attract thousands of participants—were for many years broadcast over the MBone.
Like VPNs, the MBone used both IP tunnels and IP addresses, but unlike VPNs, the MBone implemented a different forwarding algorithm—forwarding packets to all downstream neighbors in the shortest path multicast tree. As an overlay, multicast-aware routers tunnel through legacy routers, with the hope that one day there will be no more legacy routers.
The 6-BONE was a similar overlay that was used to incrementally deploy IPv6. Like the MBone, the 6-BONE used tunnels to forward packets through IPv4 routers. Unlike the MBone, however, 6-BONE nodes did not simply provide a new interpretation of IPv4's 32-bit addresses. Instead, they forwarded packets based on IPv6's 128-bit address space. The 6-BONE also supported IPv6 multicast.
End System Multicast
Although IP multicast is popular with researchers and certain segments of the networking community, its deployment in the global Internet has been limited at best. In response, multicast-based applications like videoconferencing have recently turned to an alternative strategy, called end system multicast. The idea of end system multicast is to accept that IP multicast will never become ubiquitous and to instead let the end hosts that are participating in a particular multicast-based application implement their own multicast trees.
Before describing how end system multicast works, it is important to first understand that, unlike VPNs and the MBone, end system multicast assumes that only Internet hosts (as opposed to Internet routers) participate in the overlay. Moreover, these hosts typically exchange messages with each other through UDP tunnels rather than IP tunnels, making it easy to implement as regular application programs. This makes it possible to view the underlying network as a fully connected graph, since every host in the Internet is able to send a message to every other host. Abstractly, then, end system multicast solves the following problem: Starting with a fully connected graph representing the Internet, the goal is to find the embedded multicast tree that spans all the group members.
Since we take the underlying Internet to be fully connected, a naive solution would be to have each source directly connected to each member of the group. In other words, end system multicast could be implemented by having each node send unicast messages to every group member. To see the problem in doing this, especially compared to implementing IP multicast in routers, consider the example topology in Figure 9.21. Figure 9.21(a) depicts an example physical topology, where R1 and R2 are routers connected by a low-bandwidth transcontinental link; A, B, C, and D are end hosts; and link delays are given as edge weights. Assuming A wants to send a multicast message to the other three hosts, Figure 9.21(b) shows how naive unicast transmission would work. This is clearly undesirable because the same message must traverse the link A–R1 three times, and two copies of the message traverse R1–R2. Figure 9.21(c) depicts the IP multicast tree constructed by the Distance Vector Multicast Routing Protocol (DVMRP). Clearly, this approach eliminates the redundant messages. Without support from the routers, however, the best one can hope for with end system multicast is a tree similar to the one shown in Figure 9.21(d). End system multicast defines an architecture for constructing this tree.
Figure 9.21. Alternative multicast trees mapped onto a physical topology.
The general approach is to support multiple levels of overlay networks, each of which extracts a subgraph from the overlay below it, until we have selected the subgraph that the application expects. For end system multicast, in particular, this happens in two stages: First we construct a simple mesh overlay on top of the fully connected Internet, and then we select a multicast tree within this mesh. The idea is illustrated in Figure 9.22, again assuming the four end hosts A, B, C, and D. The first step is the critical one: Once we have selected a suitable mesh overlay, we simply run a standard multicast routing algorithm (e.g., DVMRP) on top of it to build the multicast tree. We also have the luxury of ignoring the scalability issue that Internet-wide multicast faces since the intermediate mesh can be selected to include only those nodes that want to participate in a particular multicast group.
Figure 9.22. Multicast tree embedded in an overlay mesh.
The key to constructing the intermediate mesh overlay is to select a topology that roughly corresponds to the physical topology of the underlying Internet, but we have to do this without anyone telling us what the underlying Internet actually looks like since we are running only on end hosts and not routers. The general strategy is for the end hosts to measure the roundtrip latency to other nodes and decide to add links to the mesh only when they like what they see. This works as follows.
First, assuming a mesh already exists, each node exchanges the list of all other nodes it believes to be part of the mesh with its directly connected neighbors. When a node receives such a membership list from a neighbor, it incorporates that information into its membership list and forwards the resulting list to its neighbors. This information eventually propagates through the mesh, much as in a distance vector routing protocol.
When a host wants to join the multicast overlay, it must know the IP address of at least one other node already in the overlay. It then sends a “join mesh” message to this node. This connects the new node to the mesh by an edge to the known node. In general, the new node might send a join message to multiple current nodes, thereby joining the mesh by multiple links. Once a node is connected to the mesh by a set of links, it periodically sends “keep alive” messages to its neighbors, letting them know that it still wants to be part of the group.
When a node leaves the group, it sends a “leave mesh” message to its directly connected neighbors, and this information is propagated to the other nodes in the mesh via the membership list described above. Alternatively, a node can fail or just silently decide to quit the group, in which case its neighbors detect that it is no longer sending “keep alive” messages. Some node departures have little effect on the mesh, but should a node detect that the mesh has become partitioned due to a departing node, it creates a new edge to a node in the other partition by sending it a “join mesh” message. Note that multiple neighbors can simultaneously decide that a partition has occurred in the mesh, leading to multiple cross-partition edges being added to the mesh.
As described so far, we will end up with a mesh that is a subgraph of the original fully connected Internet, but it may have suboptimal performance because (1) initial neighbor selection adds random links to the topology, (2) partition repair might add edges that are essential at the moment but not useful in the long run, (3) group membership may change due to dynamic joins and departures, and (4) underlying network conditions may change. What needs to happen is that the system must evaluate the value of each edge, resulting in new edges being added to the mesh and existing edges being removed over time.
To add new edges, each node i periodically probes some random member j that it is not currently connected to in the mesh, measures the round-trip latency of edge (i, j), and then evaluates the utility of adding this edge. If the utility is above a certain threshold, link (i, j) is added to the mesh. Evaluating the utility of adding edge (i, j) might look something like this:
EvaluateUtility(j)
    utility = 0
    for each member m not equal to i
        CL = current latency to node m along route through mesh
        NL = new latency to node m along mesh if edge (i, j) is added
        if (NL < CL) then
            utility += (CL - NL)/CL
    return utility
Deciding to remove an edge is similar, except each node i computes the cost of each link to current neighbor j as follows:
EvaluateCost(j)
    Costij = number of members for which i uses j as next hop
    Costji = number of members for which j uses i as next hop
    return max(Costij, Costji)
It then picks the neighbor with the lowest cost and drops it if the cost falls below a certain threshold.
Finally, since the mesh is maintained using what is essentially a distance vector protocol, it is trivial to run DVMRP to find an appropriate multicast tree in the mesh. Note that, although it is not possible to prove that the protocol just described results in the optimum mesh network, thereby allowing DVMRP to select the best possible multicast tree, both simulation and extensive practical experience suggest that it does a good job.
Resilient Overlay Networks
Another function that can be performed by an overlay is to find alternative routes for traditional unicast applications. Such overlays exploit the observation that the triangle inequality does not hold in the Internet. Figure 9.23 illustrates what we mean by this. It is not uncommon to find three sites in the Internet—call them A, B, and C—such that the latency between A and B is greater than the sum of the latencies from A to C and from C to B. That is, sometimes you would be better off indirectly sending your packets via some intermediate node than sending them directly to the destination.
Figure 9.23. The triangle inequality does not necessarily hold in networks.
How can this be? Well, the Border Gateway Protocol (BGP) never promised that it would find the shortest route between any two sites; it only tries to find some route. To make matters more complex, BGP's routes are heavily influenced by policy issues, such as who is paying whom to carry their traffic. This often happens, for example, at peering points between major backbone ISPs. In short, that the triangle inequality does not hold in the Internet should not come as a surprise.
How do we exploit this observation? The first step is to realize that there is a fundamental tradeoff between the scalability and optimality of a routing algorithm. On the one hand, BGP scales to very large networks, but often does not select the best possible route and is slow to adapt to network outages. On the other hand, if you were only worried about finding the best route among a handful of sites, you could do a much better job of monitoring the quality of every path you might use, thereby allowing you to select the best possible route at any moment in time.
An experimental overlay, called the Resilient Overlay Network (RON), does exactly this. RON scales to only a few dozen nodes because it uses an n × n strategy of closely monitoring (via active probes) three aspects of path quality—latency, available bandwidth, and loss probability—between every pair of sites. It is then able to both select the optimal route between any pair of nodes, and rapidly change routes should network conditions change. Experience shows that RON is able to deliver modest performance improvements to applications, but more importantly, it recovers from network failures much more quickly. For example, during one 64-hour period in 2001, an instance of RON running on 12 nodes detected 32 outages lasting over 30 minutes, and it was able to recover from all of them in less than 20 seconds on average. This experiment also suggested that forwarding data through just one intermediate node is usually sufficient to recover from Internet failures.
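The core of RON-style path selection can be sketched as a search over a single intermediate node, choosing the direct virtual link unless some detour offers lower measured latency. The data layout and function below are illustrative rather than RON's actual implementation, and they consider only latency, whereas RON also monitors available bandwidth and loss probability.

#include <limits>
#include <vector>

// latency[i][j] holds the most recent probed latency between overlay nodes i
// and j; a negative value marks a failed (unreachable) virtual link.
// Returns the node to forward to: either dst (direct path) or the best
// single intermediate node.
int ChooseNextHop (const std::vector<std::vector<double>> &latency, int src, int dst)
{
  double best = latency[src][dst] >= 0 ? latency[src][dst]
                                       : std::numeric_limits<double>::infinity ();
  int via = dst;   // default: use the direct virtual link

  for (std::size_t k = 0; k < latency.size (); ++k)
    {
      if (static_cast<int> (k) == src || static_cast<int> (k) == dst) continue;
      if (latency[src][k] < 0 || latency[k][dst] < 0) continue;   // link down
      double indirect = latency[src][k] + latency[k][dst];
      if (indirect < best)
        {
          best = indirect;
          via = static_cast<int> (k);
        }
    }
  return via;
}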
Since RON is not designed to be a scalable approach, it is not possible to use RON to help random host A communicate with random host B; A and B have to know ahead of time that they are likely to communicate and then join the same RON. However, RON seems like a good idea in certain settings, such as when connecting a few dozen corporate sites spread across the Internet or allowing you and 50 of your friends to establish your own private overlay for the sake of running some application. The real question, though, is what happens when everyone starts to run their own RON. Does the overhead of millions of RONs aggressively probing paths swamp the network, and does anyone see improved behavior when many RONs compete for the same paths? These questions are still unanswered.
All of these overlays illustrate a concept that is central to computer networks in general: virtualization. That is, it is possible to build a virtual network from abstract (logical) resources on top of a physical network constructed from physical resources. Moreover, it is possible to stack these virtualized networks on top of each other and for multiple virtual networks to coexist at the same level. Each virtual network, in turn, provides new capabilities that are of value to some set of users, applications, or higher-level networks.
9.4.2 Peer-to-Peer Networks
Music-sharing applications like Napster® and KaZaA introduced the term “peer-to-peer” into the popular vernacular. But what exactly does it mean for a system to be “peer-to-peer”? Certainly in the context of sharing MP3 files it means not having to download music from a central site, but instead being able to access music files directly from whoever in the Internet happens to have a copy stored on their computer. More generally then, we could say that a peer-to-peer network allows a community of users to pool their resources (content, storage, network bandwidth, disk bandwidth, CPU), thereby providing access to a larger archival store, larger video/audio conferences, more complex searches and computations, and so on than any one user could afford individually.
Quite often, attributes like decentralized and self-organizing are mentioned when discussing peer-to-peer networks, meaning that individual nodes organize themselves into a network without any centralized coordination. If you think about it, terms like these could be used to describe the Internet itself. Ironically, however, Napster was not a true peer-to-peer system by this definition since it depended on a central registry of known files, and users had to search this directory to find what machine offered a particular file. It was only the last step—actually downloading the file—that took place between machines that belong to two users, but this is little more than a traditional client/server transaction. The only difference is that the server is owned by someone just like you rather than a large corporation.
So we are back to the original question: What's interesting about peer-to-peer networks? One answer is that both the process of locating an object of interest and the process of downloading that object onto your local machine happen without your having to contact a centralized authority, and at the same time the system is able to scale to millions of nodes. A peer-to-peer system that can accomplish these two tasks in a decentralized manner turns out to be an overlay network, where the nodes are those hosts that are willing to share objects of interest (e.g., music and other assorted files), and the links (tunnels) connecting these nodes represent the sequence of machines that you have to visit to track down the object you want. This description will become clearer after we look at two examples.
Gnutella
Gnutella is an early peer-to-peer network that attempted to distinguish between exchanging music (which likely violates somebody's copyright) and the general sharing of files (which must be good since we've been taught to share since the age of two). What's interesting about Gnutella is that it was one of the first such systems to not depend on a centralized registry of objects. Instead, Gnutella participants arrange themselves into an overlay network similar to the one shown in Figure 9.24. That is, each node that runs the Gnutella software (i.e., implements the Gnutella protocol) knows about some set of other machines that also run the Gnutella software. The relationship “A and B know each other” corresponds to the edges in this graph. (We'll talk about how this graph is formed in a moment.)
Figure 9.24. Example topology of a Gnutella peer-to-peer network.
Whenever the user on a given node wants to find an object, Gnutella sends a QUERY message for the object—for example, specifying the file's name—to its neighbors in the graph. If one of the neighbors has the object, it responds to the node that sent it the query with a QUERY RESPONSE message, specifying where the object can be downloaded (e.g., an IP address and TCP port number). That node can subsequently use GET or PUT messages to access the object. If the node cannot resolve the query, it forwards the QUERY message to each of its neighbors (except the one that sent it the query), and the process repeats. In other words, Gnutella floods the overlay to locate the desired object. Gnutella sets a TTL on each query so this flood does not continue indefinitely.
In addition to the TTL and query string, each QUERY message contains a unique query identifier (QID), but it does not contain the identity of the original message source. Instead, each node maintains a record of the QUERY messages it has seen recently: both the QID and the neighbor that sent it the QUERY. It uses this history in two ways. First, if it ever receives a QUERY with a QID that matches one it has seen recently, the node does not forward the QUERY message. This serves to cut off forwarding loops more quickly than the TTL might have done. Second, whenever the node receives a QUERY RESPONSE from a downstream neighbor, it knows to forward the response to the upstream neighbor that originally sent it the QUERY message. In this way, the response works its way back to the original node without any of the intermediate nodes knowing who wanted to locate this particular object in the first place.
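The bookkeeping just described can be sketched as follows; the class, type, and helper names are placeholders rather than part of any actual Gnutella implementation.

#include <map>
#include <set>
#include <string>

using NodeId = int;
using QueryId = std::string;

class GnutellaNode
{
public:
  // Called when a QUERY arrives from 'from'. Returns false for duplicates.
  bool OnQuery (const QueryId &qid, NodeId from, int ttl)
  {
    if (seen.count (qid)) return false;        // duplicate QID: cut off the loop
    seen.insert (qid);
    upstream[qid] = from;                      // remember who to answer later
    if (ttl > 1)
      {
        for (NodeId n : neighbors)             // flood to all other neighbors
          if (n != from) ForwardQuery (qid, n, ttl - 1);
      }
    return true;
  }

  // Called when a QUERY RESPONSE arrives from a downstream neighbor.
  void OnQueryResponse (const QueryId &qid)
  {
    auto it = upstream.find (qid);
    if (it != upstream.end ())
      ForwardResponse (qid, it->second);       // send back toward the originator
  }

private:
  void ForwardQuery (const QueryId &, NodeId, int) { /* send on overlay link */ }
  void ForwardResponse (const QueryId &, NodeId) { /* send on overlay link */ }

  std::set<QueryId> seen;                      // recently seen query IDs
  std::map<QueryId, NodeId> upstream;          // QID -> neighbor that sent it
  std::set<NodeId> neighbors;                  // filled as the overlay evolves
};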
Returning to the question of how the graph evolves, a node certainly has to know about at least one other node when it joins a Gnutella overlay. The new node is attached to the overlay by at least this one link. After that, a given node learns about other nodes as the result of QUERY RESPONSE messages, both for objects it requested and for responses that just happen to pass through it. A node is free to decide which of the nodes it discovers in this way it wants to keep as neighbors. The Gnutella protocol provides PING and PONG messages, by which a node probes whether or not a given neighbor still exists and receives that neighbor's response, respectively.
It should be clear that Gnutella as described here is not a particularly clever protocol, and subsequent systems have tried to improve upon it. One dimension along which improvements are possible is in how queries are propagated. Flooding has the nice property that it is guaranteed to find the desired object in the fewest possible hops, but it does not scale well. It is possible to forward queries randomly, or according to the probability of success based on past results. A second dimension is to proactively replicate the objects, since the more copies of a given object there are, the easier it should be to find a copy. Alternatively, one could develop a completely different strategy, which is the topic we consider next.
Structured Overlays
At the same time that file sharing systems have been fighting to fill the void left by Napster, the research community has been exploring an alternative design for peer-to-peer networks. We refer to these networks as structured, to contrast them with the essentially random (unstructured) way in which a Gnutella network evolves. Unstructured overlays like Gnutella employ trivial overlay construction and maintenance algorithms, but the best they can offer is unreliable, random search. In contrast, structured overlays are designed to conform to a particular graph structure that allows reliable and efficient (probabilistically bounded delay) object location, in return for additional complexity during overlay construction and maintenance.
If you think about what we are trying to do at a high level, there are two questions to consider: (1) How do we map objects onto nodes, and (2) How do we route a request to the node that is responsible for a given object? We start with the first question, which has a simple statement: How do we map an object with name x into the address of some node n that is able to serve that object? While traditional peer-to-peer networks have no control over which node hosts object x, if we could control how objects get distributed over the network, we might be able to do a better job of finding those objects at a later time.
A well-known technique for mapping names into an address is to use a hash table, so that

hash(x) → n

implies object x is first placed on node n, and at a later time a client trying to locate x would only have to perform the hash of x to determine that it is on node n. A hash-based approach has the nice property that it tends to spread the objects evenly across the set of nodes, but straightforward hashing algorithms suffer from a fatal flaw: How many possible values of n should we allow? (In hashing terminology, how many buckets should there be?) Naively, we could decide that there are, say, 101 possible hash values, and we use a modulo hash function; that is,

hash(x)
    return x % 101
Unfortunately, if there are more than 101 nodes willing to host objects, then we can't take advantage of all of them. On the other hand, if we select a number larger than the largest possible number of nodes, then there will be some values of x that will hash into an address for a node that does not exist. There is also the not-so-small issue of translating the value returned by the hash function into an actual IP address.
To address these issues, structured peer-to-peer networks use an algorithm known as consistent hashing, which hashes a set of objects x uniformly across a large ID space. Figure 9.25 visualizes a 128-bit ID space as a circle, where we use the algorithm to place both objects (hash(object_name) → objid) and nodes (hash(IP_addr) → nodeid) onto this circle. Since a 128-bit ID space is enormous, it is unlikely that an object will hash to exactly the same ID as a machine's IP address hashes to. To account for this unlikelihood, each object is maintained on the node whose ID is closest, in this 128-bit space, to the object ID. In other words, the idea is to use a high-quality hash function to map both nodes and objects into the same large, sparse ID space; you then map objects to nodes by numerical proximity of their respective identifiers. Like ordinary hashing, this distributes objects fairly evenly across nodes, but, unlike ordinary hashing, only a small number of objects have to move when a node (hash bucket) joins or leaves.
Figure 9.25. Both nodes and objects map (hash) onto the ID space, where objects are maintained at the nearest node in this space.
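A minimal sketch of this placement rule, assuming SHA-1 as the high-quality hash function and plain circular numeric distance; the helper names and example addresses are illustrative only:

import hashlib

ID_BITS = 128
ID_SPACE = 1 << ID_BITS

def ident(name: str) -> int:
    # Hash a node address or an object name into the 128-bit ID space.
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % ID_SPACE

def distance(a: int, b: int) -> int:
    # Distance on the circle (wraps around at 2^128).
    d = abs(a - b)
    return min(d, ID_SPACE - d)

def responsible_node(object_name: str, node_addresses: list) -> str:
    # The object is maintained on the node whose ID is numerically closest to the object's ID.
    oid = ident(object_name)
    return min(node_addresses, key=lambda addr: distance(ident(addr), oid))

# Example: adding or removing one node changes the assignment of only nearby objects.
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
print(responsible_node("some-file.mp3", nodes))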
We now turn to the second question—how does a user that wants to access object x know which node is closest to x's ID in this space? One possible answer is that each node keeps a complete table of node IDs and their associated IP addresses, but this would not be practical for a large network. The alternative, which is the approach used by structured peer-to-peer networks, is to route a message to this node! In other words, if we construct the overlay in a clever way—which is the same as saying that we need to choose entries for a node's routing table in a clever way—then we find a node simply by routing toward it. Collectively, this approach is sometimes called a distributed hash table (DHT), since conceptually, the hash table is distributed over all the nodes in the network.
Figure 9.26 illustrates what happens for a simple 128-bit ID space. To keep the discussion as concrete as possible, we consider the approach used by a particular peer-to-peer network called Pastry. Other systems work in a similar manner. (See the papers cited at the end of the chapter for additional examples.)
Figure 9.26. Objects are located by routing through the peer-to-peer overlay network.
Suppose you are at the node with ID 65a1fc (hex) and you are trying to locate the object with ID d46a1c. You realize that your ID shares nothing with the object's, but you know of a node that shares at least the prefix d. That node is closer than you in the 128-bit ID space, so you forward the message to it. (We do not give the format of the message being forwarded, but you can think of it as saying “locate object d46a1c.”) Assuming node d13da3 knows of another node that shares an even longer prefix with the object, it forwards the message on. This process of moving closer in ID-space continues until you reach a node that knows of no closer node. This node is, by definition, the one that hosts the object. Keep in mind that as we logically move through “ID space” the message is actually being forwarded, node to node, through the underlying Internet.
Each node maintains both a routing table (described below) and the IP addresses of a small set of numerically larger and smaller node IDs. This is called the node's leaf set. The relevance of the leaf set is that, once a message is routed to any node in the same leaf set as the node that hosts the object, that node can directly forward the message to the ultimate destination. Said another way, the leaf set facilitates correct and efficient delivery of a message to the numerically closest node, even though multiple nodes may exist that share a maximal length prefix with the object ID. Moreover, the leaf set makes routing more robust because any of the nodes in a leaf set can route a message just as well as any other node in the same set. Thus, if one node is unable to make progress routing a message, one of its neighbors in the leaf set may be able to. In summary, the routing procedure is defined as follows:
Route(D)
    if D is within range of my leaf set
        forward to numerically closest member in leaf set
    else
        let l = length of shared prefix
        let d = value of l-th digit in D's address
        if RouteTab[l, d] exists
            forward to RouteTab[l, d]
        else
            forward to known node with at least as long a shared prefix
            and numerically closer than this node
The routing table, denoted RouteTab, is a two-dimensional array. It has a row for every hex digit in an ID (there are 32 such digits in a 128-bit ID) and a column for every hex value (there are obviously 16 such values). Every entry in row i shares a prefix of length i with this node, and within this row the entry in column j has the hex value j in the (i+1)th position. Figure 9.27 shows the first three rows of an example routing table for node 65a1fcx, where x denotes an unspecified suffix. This figure shows the ID prefix matched by every entry in the table. It does not show the actual value contained in this entry—the IP address of the next node to route to.
Figure 9.27. Example routing table at the node with ID 65a1fcx.
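A small sketch of the table lookup the pseudocode relies on. The representation of the routing table (a dictionary keyed by row and next digit) and of the leaf set (a dictionary of node IDs to IP addresses) are assumptions made for illustration, not Pastry's actual data structures:

def shared_prefix_len(a: str, b: str) -> int:
    # Number of leading hex digits two IDs have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id: str, dest_id: str, route_tab: dict, leaf_set: dict):
    # leaf_set maps node IDs (hex strings) to IP addresses;
    # route_tab maps (row, hex digit) to the IP address of the next node.
    if leaf_set:
        ids = sorted(leaf_set, key=lambda i: int(i, 16))
        if int(ids[0], 16) <= int(dest_id, 16) <= int(ids[-1], 16):
            closest = min(ids, key=lambda i: abs(int(i, 16) - int(dest_id, 16)))
            return leaf_set[closest]           # deliver within the leaf set
    l = shared_prefix_len(my_id, dest_id)
    if l >= len(dest_id):
        return None                            # we are the destination
    return route_tab.get((l, dest_id[l]))      # entry that shares one more digit, if any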
Adding a node to the overlay works much like routing a “locate object message” to an object. The new node must know of at least one current member. It asks this member to route an “add node message” to the node numerically closest to the ID of the joining node, as shown in Figure 9.28. It is through this routing process that the new node learns about other nodes with a shared prefix and is able to begin filling out its routing table. Over time, as additional nodes join the overlay, existing nodes also have the option of including information about the newly joined node in their routing tables. They do this when the new node adds a longer prefix than they currently have in their table. Neighbors in the leaf sets also exchange routing tables with each other, which means that over time routing information propagates through the overlay.
Figure 9.28. Adding a node to the network.
The reader may have noticed that although structured overlays provide a probabilistic bound on the number of routing hops required to locate a given object—the number of hops in Pastry is bounded by log_16 N, where N is the number of nodes in the overlay—each hop may contribute substantial delay. This is because each intermediate node may be at a random location in the Internet. (In the worst case, each node is on a different continent!) In fact, in a world-wide overlay network using the algorithm as described above, the expected delay of each hop is the average delay among all pairs of nodes in the Internet! Fortunately, one can do much better in practice. The idea is to choose each routing table entry such that it refers to a nearby node in the underlying physical network, among all nodes with an ID prefix that is appropriate for the entry. It turns out that doing so achieves end-to-end routing delays that are within a small factor of the delay between source and destination node.
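As a rough sense of scale, a back-of-the-envelope calculation (illustrative numbers, not measurements):

import math

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10} nodes -> at most {math.ceil(math.log(n, 16))} overlay routing hops")
# Even ten million nodes need no more than about six overlay hops.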
Finally, the discussion up to this point has focused on the general problem of locating objects in a peer-to-peer network. Given such a routing infrastructure, it is possible to build different services. For example, a file sharing service would use file names as object names. To locate a file, you first hash its name into a corresponding object ID and then route a “locate object message” to this ID. The system might also replicate each file across multiple nodes to improve availability. Storing multiple copies on the leaf set of the node to which a given file normally routes would be one way of doing this. Keep in mind that even though these nodes are neigrs in the ID space, they are likely to be physically distributed across the Internet. Thus, while a power outage in an entire city might take down physically close replicas of a file in a traditional file system, one or more replicas would likely survive such a failure in a peer-to-peer network.
Services other than file sharing can also be built on top of distributed hash tables. Consider multicast applications, for example. Instead of constructing a multicast tree from a mesh, one could construct the tree from edges in the structured overlay, thereby amortizing the cost of overlay construction and maintenance across several applications and multicast groups.
BitTorrent
BitTorrent is a peer-to-peer file sharing protocol devised by Bram Cohen. It is based on replicating the file or, rather, replicating segments of the file, which are called pieces. Any particular piece can usually be downloaded from multiple peers, even if only one peer has the entire file. The primary benefit of BitTorrent's replication is avoiding the bottleneck of having only one source for a file. This is particularly useful when you consider that any given computer has a limited speed at which it can serve files over its uplink to the Internet, often quite a low limit due to the asymmetric nature of most broadband networks. The beauty of BitTorrent is that replication is a natural side effect of the downloading process: As soon as a peer downloads a particular piece, it becomes another source for that piece. The more peers downloading pieces of the file, the more piece replication occurs, distributing the load proportionately, and the more total bandwidth is available to share the file with others. Pieces are downloaded in random order to avoid a situation where peers find themselves lacking the same set of pieces.
Each file is shared via its own independent BitTorrent network, called a swarm. (A swarm could potentially share a set of files, but we describe the single file case for simplicity.) The lifecycle of a typical swarm is as follows. The swarm starts as a singleton peer with a complete copy of the file. A node that wants to download the file joins the swarm, becoming its second member, and begins downloading pieces of the file from the original peer. In doing so, it becomes another source for the pieces it has downloaded, even if it has not yet downloaded the entire file. (In fact, it is common for peers to leave the swarm once they have completed their downloads, although they are encouraged to stay longer.) Other nodes join the swarm and begin downloading pieces from multiple peers, not just the original peer. See Figure 9.29.
Figure 9.29. Peers in a BitTorrent swarm download from other peers that may not yet have the complete file.
If the file remains in high demand, with a stream of new peers replacing those who leave the swarm, the swarm could remain active indefinitely; if not, it could shrink back to include only the original peer until new peers join the swarm.
Now that we have an overview of BitTorrent, we can ask how requests are routed to the peers that have a given piece. To make requests, a would-be downloader must first join the swarm. It starts by downloading a .torrent file containing meta-information about the file and swarm. The .torrent file, which may be easily replicated, is typically downloaded from a web server and discovered by following links from Web pages. It contains:
■
The target file's size
■
The piece size
■
SHA-1 hash values (Section 8.1.4) precomputed from each piece
■
The URL of the swarm's tracker
A tracker is a server that tracks a swarm's current membership. We'll see later that BitTorrent can be extended to eliminate this point of centralization, with its attendant potential for bottleneck or failure.
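A sketch of how the metainfo listed above might be assembled. The dictionary layout and field names are illustrative; the real .torrent format uses a specific (bencoded) encoding that is not shown here:

import hashlib

def build_metainfo(file_bytes: bytes, tracker_url: str, piece_size: int = 256 * 1024) -> dict:
    # Split the file into pieces and precompute a SHA-1 hash for each piece.
    pieces = [file_bytes[i:i + piece_size] for i in range(0, len(file_bytes), piece_size)]
    return {
        "file_size": len(file_bytes),
        "piece_size": piece_size,
        "piece_hashes": [hashlib.sha1(p).hexdigest() for p in pieces],
        "tracker_url": tracker_url,
    }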
The would-be downloader then joins the swarm, becoming a peer, by sending a message to the tracker giving its network address and a peer ID that it has generated randomly for itself. The message also carries a SHA-1 hash of the main part of the .torrent file, which is used as a swarm ID.
Let's call the new peer P. The tracker replies to P with a partial list of peers giving their IDs and network addresses, and P establishes connections, over TCP, with some of these peers. Note that P is directly connected to just a subset of the swarm, although it may decide to contact additional peers or even request more peers from the tracker. To establish a BitTorrent connection with a particular peer after their TCP connection has been established, P sends its own peer ID and swarm ID, and the peer replies with its peer ID and swarm ID. If the swarm IDs don't match, or the reply peer ID is not what P expects, the connection is aborted.
The resulting BitTorrent connection is symmetric: Each end can download from the other. Each end begins by sending the other a bitmap reporting which pieces it has, so each peer knows the other's initial state. Whenever a downloader (D) finishes downloading another piece, it sends a message identifying that piece to each of its directly connected peers, so those peers can update their internal representation of D's state. This, finally, is the answer to the question of how a download request for a piece is routed to a peer that has the piece, because it means that each peer knows which directly connected peers have the piece. If D needs a piece that none of its connections has, it could connect to more or different peers (it can get more from the tracker) or occupy itself with other pieces in hopes that some of its connections will obtain the piece from their connections.
How are objects—in this case, pieces—mapped onto peer nodes? Of course each peer eventually obtains all the pieces, so the question is really about which pieces a peer has at a given time before it has all the pieces or, equivalently, about the order in which a peer downloads pieces. The answer is that they download pieces in random order, to keep them from having a strict subset or superset of the pieces of any of their peers.
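A sketch of that random selection, assuming each peer tracks its own pieces and a bitmap (represented here as a set of piece indices) for every directly connected peer; the function and parameter names are illustrative:

import random

def choose_piece(my_pieces: set, peer_bitmaps: dict):
    # Pick, uniformly at random, a (piece, peer) pair for a piece we still need
    # that some directly connected peer already has.
    candidates = [
        (piece, peer)
        for peer, pieces in peer_bitmaps.items()
        for piece in pieces - my_pieces
    ]
    if not candidates:
        return None   # connect to more peers, or wait for HAVE announcements
    return random.choice(candidates)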
The BitTorrent protocol described so far utilizes a central tracker that constitutes a single point of failure for the swarm and could potentially be a performance bottleneck. Also, providing a tracker can be a nuisance for someone who would like to make a file available via BitTorrent. Newer versions of BitTorrent additionally support “trackerless” swarms that use a DHT-based implementation. BitTorrent client software that is trackerless-capable implements not just a BitTorrent peer but also what we'll call a peer finder (the BitTorrent terminology is simply node), which the peer uses to find peers.
Peer finders form their own overlay network, using their own protocol over UDP to implement a DHT. Furthermore, a peer finder network includes peer finders whose associated peers belong to different swarms. In other words, while each swarm forms a distinct network of BitTorrent peers, a peer finder network instead spans swarms.
Peer finders randomly generate their own finder IDs, which are the same size (160 bits) as swarm IDs. Each finder maintains a modest table containing primarily finders (and their associated peers) whose IDs are close to its own, plus some finders whose IDs are more distant. The following algorithm ensures that finders whose IDs are close to a given swarm ID are likely to know of peers from that swarm; the algorithm simultaneously provides a way to look them up. When a finder F needs to find peers from a particular swarm, it sends a request to the finders in its table whose IDs are close to that swarm's ID. If a contacted finder knows of any peers for that swarm, it replies with their contact information. Otherwise, it replies with the contact information of the finders in its table that are close to the swarm, so that F can iteratively query those finders.
After the search is exhausted, because there are no finders closer to the swarm, F inserts the contact information for itself and its associated peer into the finders closest to the swarm. The net effect is that peers for a particular swarm get entered in the tables of the finders that are close to that swarm.
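The iterative lookup can be sketched as follows. The Finder class and its lookup method are stand-ins invented for illustration, and the XOR distance shown (as in Kademlia-style DHTs) is an assumption rather than a statement about the exact wire protocol:

class Finder:
    # Minimal stand-in for a peer finder: an ID, a table of known finders,
    # and the peers it knows about, per swarm.
    def __init__(self, fid: int):
        self.id = fid
        self.known_finders = []        # list of Finder
        self.peers_by_swarm = {}       # swarm ID -> list of peer addresses

    def lookup(self, swarm_id: int):
        peers = self.peers_by_swarm.get(swarm_id, [])
        closer = sorted(self.known_finders, key=lambda f: f.id ^ swarm_id)[:3]
        return peers, closer

def find_peers(start: Finder, swarm_id: int, max_rounds: int = 20):
    # Iteratively query the finders whose IDs are closest to the swarm ID.
    frontier = sorted(start.known_finders, key=lambda f: f.id ^ swarm_id)[:3]
    queried = set()
    for _ in range(max_rounds):
        progress = False
        for finder in list(frontier):
            if finder.id in queried:
                continue
            queried.add(finder.id)
            peers, closer = finder.lookup(swarm_id)
            if peers:
                return peers
            for c in closer:
                if c.id not in queried and c not in frontier:
                    frontier.append(c)
                    progress = True
        if not progress:
            break                      # no closer finders: register ourselves with the closest ones
        frontier = sorted(frontier, key=lambda f: f.id ^ swarm_id)[:3]
    return []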
The above scheme assumes that F is already part of the finder network, that it already knows how to contact some other finders. This assumption is true for finder installations that have run previously, because they are supposed to save information about other finders, even across executions. If a swarm uses a tracker, its peers are able to tell their finders about other finders (in a reversal of the peer and finder roles) because the BitTorrent peer protocol has been extended to exchange finder contact information. But how can a newly installed finder discover other finders? The .torrent files for trackerless swarms include contact information for one or a few finders, instead of a tracker URL, for just that situation.
An unusual aspect of BitTorrent is that it deals head-on with the issue of fairness, or good “network citizenship.” Protocols often depend on the good behavior of individual peers without being able to enforce it. For example, an unscrupulous Ethernet peer could get better performance by using a backoff algorithm that is more aggressive than exponential backoff, or an unscrupulous TCP peer could get better performance by not cooperating in congestion control.
The good behavior that BitTorrent depends on is peers uploading pieces to other peers. Since the typical BitTorrent user just wants to download the file as quickly as possible, there is a temptation to implement a peer that tries to download all the pieces while doing as little uploading as possible—this is a bad peer. To discourage bad behavior, the BitTorrent protocol includes mechanisms that allow peers to reward or punish each other. If a peer is misbehaving by not nicely uploading to another peer, the second peer can choke the bad peer: It can decide to stop uploading to the bad peer, at least temporarily, and send it a message saying so. There is also a message type for telling a peer that it has been unchoked. The choking mechanism is also used by a peer to limit the number of its active BitTorrent connections, to maintain good TCP performance. There are many possible choking algorithms, and devising a good one is an art.
9.4.3 Content Distribution Networks
We have already seen how HTTP running over TCP allows web browsers to retrieve pages from web servers. However, anyone who has waited an eternity for a Web page to return knows that the system is far from perfect. Considering that the backbone of the Internet is now constructed from OC-192 (10-Gbps) links, it's not obvious why this should happen. It is generally agreed that when it comes to downloading Web pages there are four potential bottlenecks in the system:
■
The first mile. The Internet may have high-capacity links in it, but that doesn't help you download a Web page any faster when you're connected by a 56-Kbps modem or a poorly performing 3G wireless link.
■
The last mile. The link that connects the server to the Internet can be overloaded by too many requests, even if the aggregate bandwidth of that link is quite high.
■
The server itself. A server has a finite amount of resources (CPU, memory, disk bandwidth, etc.) and can be overloaded by too many concurrent requests.
■
Peering points. The handful of ISPs that collectively implement the backbone of the Internet may internally have high-bandwidth pipes, but they have little motivation to provide high-capacity connectivity to their peers. If you are connected to ISP A and the server is connected to ISP B, then the page you request may get dropped at the point where A and B peer with each other.
There's not a lot anyone except you can do about the first problem, but it is possible to use replication to address the remaining problems. Systems that do this are often called Content Distribution Networks (CDNs). Akamai operates what is probably the best-known CDN.
The idea of a CDN is to geographically distribute a collection of server surrogates that cache pages normally maintained in some set of backend servers. Thus, rather than having millions of users wait forever to contact www.cnn.com when a big news story breaks—such a situation is known as a flash crowd—it is possible to spread this load across many servers. Moreover, rather than having to traverse multiple ISPs to reach www.cnn.com, if these surrogate servers happen to be spread across all the backbone ISPs, then it should be possible to reach one without having to cross a peering point. Clearly, maintaining thousands of surrogate servers all over the Internet is too expensive for any one site that wants to provide better access to its Web pages. Commercial CDNs provide this service for many sites, thereby amortizing the cost across many customers.
Although we call them surrogate servers, in fact, they can just as correctly be viewed as caches. If they don't have a page that has been requested by a client, they ask the backend server for it. In practice, however, the backend servers proactively replicate their data across the surrogates rather than wait for surrogates to request it on demand. It's also the case that only static pages, as opposed to dynamic content, are distributed across the surrogates. Clients have to go to the backend server for any content that either changes frequently (e.g., sports scores and stock quotes) or is produced as the result of some computation (e.g., a database query).
Having a large set of geographically distributed servers does not fully solve the problem. To complete the picture, CDNs also need to provide a set of redirectors that forward client requests to the most appropriate server, as shown in Figure 9.30. The primary objective of the redirectors is to select the server for each request that results in the best response time for the client. A secondary objective is for the system as a whole to process as many requests per second as the underlying hardware (network links and web servers) is able to support. The average number of requests that can be satisfied in a given time period—known as the system throughput—is primarily an issue when the system is under heavy load, such as when a flash crowd is accessing a small set of pages or a Distributed Denial of Service (DDoS) attacker is targeting a particular site, as happened to CNN, Yahoo, and several other high-profile sites in February 2000.
Figure 9.30. Components in a Content Distribution Network (CDN).
CDNs use several factors to decide how to distribute client requests. For example, to minimize response time, a redirector might select a server based on its network proximity. In contrast, to improve the overall system throughput, it is desirable to evenly balance the load across a set of servers. Both throughput and response time are improved if the distribution mechanism takes locality into consideration; that is, it selects a server that is likely to already have the page being requested in its cache. The exact combination of factors that should be employed by a CDN is open to debate. This section considers some of the possibilities.
Mechanisms
As described so far, a redirector is just an abstract function, although it sounds like something a router might be asked to do, since it logically forwards a request message much as a router forwards packets. In fact, there are several mechanisms that can be used to implement redirection. Note that for the purpose of this discussion we assume that each redirector knows the address of every available server. (From here on, we drop the “surrogate” qualifier and talk simply in terms of a set of servers.) In practice, some form of out-of-band communication takes place to keep this information up-to-date as servers come and go.
First, redirection could be implemented by augmenting DNS to return different server addresses to clients. For example, when a client asks to resolve the name www.cnn.com, the DNS server could return the IP address of a server hosting CNN's Web pages that is known to have the lightest load. Alternatively, for a given set of servers, it might just return addresses in a round-robin fashion. Note that the granularity of DNS-based redirection is usually at the level of a site (e.g., cnn.com) rather than a specific URL (e.g., http://www.cnn.com/2002/WORLD/europe/06/21/william.birthday/index.html). However, when returning an embedded link, the server can rewrite the URL, thereby effectively pointing the client at the most appropriate server for that specific object.
Commercial CDNs essentially use a combination of URL rewriting and DNS-based redirection. For scalability reasons, the high-level DNS server first points to a regional-level DNS server, which replies with the actual server address. In order to respond to changes quickly, the DNS servers tweak the TTL of the resource records they return to a very short period, such as 20 seconds. This is necessary so clients don't cache results and thus fail to go back to the DNS server for the most recent URL-to-server mapping.
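A toy sketch of such a DNS-based redirector; the server addresses, load table, and 20-second TTL below are placeholders rather than real values:

import itertools

SERVERS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]               # example surrogate addresses
LOAD = {"192.0.2.10": 0.7, "192.0.2.11": 0.2, "192.0.2.12": 0.4}   # hypothetical load estimates
_round_robin = itertools.cycle(SERVERS)

def resolve(site: str, policy: str = "least-loaded"):
    # A real redirector would keep a server set per site; here there is one set.
    if policy == "round-robin":
        addr = next(_round_robin)
    else:
        addr = min(SERVERS, key=lambda s: LOAD.get(s, 0.0))
    return addr, 20   # a short TTL keeps clients coming back for fresh mappings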
Another possibility is to use the HTTP redirect feature: The client sends a request message to a server, which responds with a new (better) server that the client should contact for the page. Unfortunately, server-based redirection incurs an additional round-trip time across the Internet, and, even worse, servers can be vulnerable to being overloaded by the redirection task itself. Instead, if there is a node close to the client (e.g., a local Web proxy) that is aware of the available servers, then it can intercept the request message and instruct the client to instead request the page from an appropriate server. In this case, either the redirector would need to be on a choke point so that all requests leaving the site pass through it, or the client would have to cooperate by explicitly addressing the proxy (as with a classical, rather than transparent, proxy).
At this point you may be wondering what CDNs have to do with overlay networks, and while viewing a CDN as an overlay is a bit of a stretch, they do share one very important trait. Like an overlay node, a proxy-based redirector makes an application-level routing decision. Rather than forward a packet based on an address and its knowledge of the network topology, it forwards HTTP requests based on a URL and its knowledge of the location and load of a set of servers. Today's Internet architecture does not support redirection directly—where by “directly” we mean the client sends the HTTP request to the redirector, which forwards it to the destination—so instead redirection is typically implemented indirectly by having the redirector return the appropriate destination address, after which the client contacts the server itself.
Policies
We now consider some example policies that redirectors might use to forward requests. Actually, we have already suggested one simple policy—round-robin. A similar scheme would be to simply select one of the available servers at random. Both of these approaches do a good job of spreading the load evenly across the CDN, but they do not do a particularly good job of lowering the client-perceived response time.
It's obvious that neither of these two schemes takes network proximity into consideration, but, just as importantly, they also ignore locality. That is, requests for the same URL are forwarded to different servers, making it less likely that the page will be served from the selected server's in-memory cache. This forces the server to retrieve the page from its disk, or possibly even from the backend server. How can a distributed set of redirectors cause requests for the same page to go to the same server (or small set of servers) without global coordination? The answer is surprisingly simple: All redirectors use some form of hashing to deterministically map URLs into a small range of values. The primary benefit of this approach is that no inter-redirector communication is required to achieve coordinated operation; no matter which redirector receives a URL, the hashing process produces the same output.
So what makes for a good hashing scheme? The classic modulo hashing scheme—which hashes each URL modulo the number of servers—is not suitable for this environment. This is because, should the number of servers change, the modulo calculation will leave only a small fraction of pages with their original server assignments. While we do not expect frequent changes in the set of servers, the fact that the addition of new servers into the set will cause massive reassignment is undesirable.
An alternative is to use the same consistent hashing algorithm discussed in Section 9.4.2. Specifically, each redirector first hashes every server into the unit circle. Then, for each URL that arrives, the redirector also hashes the URL to a value on the unit circle, and the URL is assigned to the server that lies closest on the circle to its hash value. If a node fails in this scheme, its load shifts to its neighbors (on the unit circle), so the addition or removal of a server only causes local changes in request assignments. Note that unlike the peer-to-peer case, where a message is routed from one node to another in order to find the node whose ID is closest to the object's, each redirector knows how the set of servers map onto the unit circle, so they can each, independently, select the “nearest” one.
This strategy can easily be extended to take server load into account. Assume the redirector knows the current load of each of the available servers. This information may not be perfectly up-to-date, but we can imagine the redirector simply counting how many times it has forwarded a request to each server in the last few seconds and using this count as an estimate of that server's current load. Upon receiving a URL, the redirector hashes the URL plus each of the available servers and sorts the resulting values. This sorted list effectively defines the order in which the redirector will consider the available servers. The redirector then walks down this list until it finds a server whose load is below some threshold. The benefit of this approach compared to plain consistent hashing is that server order is different for each URL, so if one server fails its load is distributed evenly among the other machines. This approach is the basis for the Cache Array Routing Protocol (CARP) and is shown in pseudocode below.
SelectServer(URL, S)
    for each server s_i in server set S
        weight_i = hash(URL, address(s_i))
    sort weight
    for each server s_j in decreasing order of weight_j
        if Load(s_j) < threshold then
            return s_j
    return server with highest weight
As the load increases, this scheme changes from using only the first server on the sorted list to spreading requests across several servers. Some pages normally handled by busy servers will also start being handled by less busy servers. Since this process is based on aggregate server load rather than the popularity of individual pages, servers hosting some popular pages may find more servers sharing their load than servers hosting collectively unpopular pages. In the process, some unpopular pages will be replicated in the system simply because they happen to be primarily hosted on busy servers. At the same time, if some pages become extremely popular, it is conceivable that all of the servers in the system could be responsible for serving them.
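A runnable rendering of the SelectServer pseudocode above; the hash construction, load values, and threshold are simplifying assumptions:

import hashlib

def weight(url: str, server: str) -> int:
    # Combined hash of the URL and the server address.
    return int.from_bytes(hashlib.sha1((url + server).encode()).digest()[:8], "big")

def select_server(url: str, servers: list, load: dict, threshold: float = 0.8) -> str:
    # Each URL gets its own server ordering, so the load of a busy or failed
    # server spreads evenly over the remaining machines.
    ranked = sorted(servers, key=lambda s: weight(url, s), reverse=True)
    for s in ranked:
        if load.get(s, 0.0) < threshold:
            return s
    return ranked[0]   # all servers above threshold: fall back to the highest weight

# Example (hypothetical server names and loads)
servers = ["cache1.example.com", "cache2.example.com", "cache3.example.com"]
print(select_server("http://www.cnn.com/index.html", servers, {"cache1.example.com": 0.9}))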
Finally, it is possible to introduce network proximity into the equation in at least two different ways. The first is to blur the distinction between server load and network proximity by monitoring how long a server takes to respond to requests and using this measurement as the “server load” parameter in the preceding algorithm. This strategy tends to prefer nearby/lightly loaded servers over distant/heavily loaded servers. A second approach is to factor proximity into the decision at an earlier stage by limiting the candidate set of servers considered by the above algorithms (S) to only those that are nearby. The harder problem is deciding which of the potentially many servers are suitably close. One approach would be to select only those servers that are available on the same ISP as the client. A slightly more sophisticated approach would be to look at the map of autonomous systems produced by BGP and select only those servers within some number of hops from the client as candidate servers. Finding the right balance between network proximity and server cache locality is a subject of ongoing research.
A fully distributed P2P protocol does not have a central indexing server; instead, data is searched for directly among the peers. To be able to perform searches, the peers must form a virtual overlay network into which the search queries are broadcast. The overlay network is formed and maintained through peer discovery, in which special discovery messages are broadcast into the overlay network to detect changes in topology, i.e., peers that have joined, left, or moved. At the time of writing, Gnutella is the most widely deployed fully distributed P2P protocol.
There are clear parallels between the network service provider (NSP) role in telecommunications networks and the concept of an overlay operator, also referred to as an overlay ISP [383] or overlay service provider [390]. In the future, NSPs could also take on the role of deploying and managing overlays to provide services that extend beyond their network operating regions. It is important for managed overlays to leverage the significant investment and capability in management systems that exists in enterprise and telecommunications network environments.
The Telecommunications Management Network (TMN) model is a network management model standardized by ITU-T that is often used to explain the key functions of network management. TMN divides the functions into four layers: business management, service management, network management, and element management. Within the lower layers, the management functions are further divided into fault, configuration, accounting, performance, and security. This set of functions is usually referred to by the abbreviation FCAPS. At the service layer, concepts such as service creation, service directories, and service provisioning have emerged. In general the service layer provides higher-level capability that represents a market need rather than a network element feature. Services at this layer may also cross or aggregate multiple network technologies, which might be separately managed at lower layers.
In this framework, an overlay is viewed as a service, and creation and configuration of the overlay are service creation and service provisioning steps. This perspective supports both the case in which an overlay operator manages a single overlay as well as the overlay operator that operates many overlays.
Service Creation
The overlay operator identifies an application that can be supported by an overlay. Peer software is developed or preexisting peer software is reused. The overlay operator makes the peer software available to the user community, deploys infrastructure peers (if any), and provides a suitable bootstrap mechanism so that users can connect to the overlay. The peer software is designed using P2P principles described in previous chapters so that peers self-organize and automatically connect to other peers in the overlay. In addition, the peer software collects operational statistics and status to deliver to the overlay operator's management agent, which monitors the overlay operation.
The operator controls the intrinsic services of the overlay and may provide a service advertisement and discovery mechanism by which third parties can add new services. For third-party services, the overlay operator can provide service assurance capability to the third party for operations carried by the overlay. The overlay operator should be able to monitor service-use statistics where service invocation uses the overlay for messaging. Service usage records are needed for billing.
Service Provisioning
In conventional network services, a customer selects service options and service quality levels. The service provider in turn configures the necessary network resources and activates the service. In the P2P overlay context, service options and quality levels are determined by the host running the peer software, as well as the collective resources of the current set of connected peers. The overlay operator may deploy high-capacity peers to enhance service quality, such as superpeers, relays, multicast proxies, and gateways. These deployments might be made for specific classes of overlay users, in specific regions of the network, or for specific types of overlay services.
Service Assurance
The service provider enforces the service quality levels agreed on with the customer and sets configuration and monitoring mechanisms in place to enforce these levels. The configuration mechanism relies on the configuration management component of FCAPS. The monitoring mechanism relies on the fault and performance management components of FCAPS.
The P2P overlay is adaptive by design, and peers select the best connection or path from the currently available set by periodically measuring the various links. Mechanisms such as dead node detection and replication are used to avoid information loss. Peer functionality can be upgraded when the operator updates the peer software algorithms and provides updates to existing peers via automatic software download. The overlay operator monitors in real time any faults in the overlay and maintains performance statistics. As in conventional fault management, the operator is concerned with failure conditions at each peer and with correlating these across peers to identify conditions that might cascade into larger regions over the overlay or that indicate a coordinated attack on the operation of the overlay. As in conventional performance management, the overlay operator is also concerned with performance conditions, such as those that violate thresholds for the service class, that indicate problems with the overlay design, or that require provisioning of additional overlay resources.
Accounting
Service costs in telecommunications networks can be packaged in a variety of ways, including by base rate, by usage, and by class of service. In addition, the service-level agreement (SLA) may have penalties to the service provider if the service quality falls below agreed-on levels.
In a P2P overlay, usage-based charging is feasible if each use is securely associated with its source and the amount of use can be reliably and securely measured. For example, a basic service might permit a number of indexing operations per day, and a peer performing beyond this level of indexing operations might either be charged or required to contribute more resources to the overlay. Subscription-based use is less demanding in terms of measuring use but requires subscription management and enforcement of subscriber-only use.
Examples
Table 15.2 illustrates overlay management scenarios for both service assurance and service provisioning. For each example there is a brief description and operations illustrating the use of the operators described. Next we look at scenarios involving specific types of overlays such as resilient overlay networks (discussed in Chapter 11) and P2P storage systems.
Table 15.2. Example Management Scenarios
Category | Example | Operation
Service assurance | Peer notifies management agent (MA) that its DHT storage capacity has exceeded a threshold | notify(p, m, e); e = threshold exceeded event
Service assurance | MA uses remote control of a peer to perform a diagnostic latency measurement to another peer | cmd(p, d, m); d = latency measurement
Service assurance | Peer forwards routing table statistics to MA, such as size of routing table, average age, distribution by region of overlay | notify(p, m, e); e = performance report
Service assurance | A peer notifies MA that an object inserted into the DHT is determined to be invalid due to mismatch of digital signature and public key | notify(p, m, e); e = object invalid
Service provisioning | MP broadcasts the list of peers that are identified as vehicles for a Sybil attack | set(P, r, r-list, r-list1); P = all peers in overlay; r-list = type of state is blocked peer list; r-list1 = list of blocked peers
Service provisioning | 1. MP collects response time history from a set of peers P in a given region to their neighbors. 2. MP uses remote control at the peers P to determine response time to peers outside the region. 3. MP provisions superpeers in adjacent regions to reduce end-to-end delay. 4. MP configures peers P with an updated superpeer list | get(P, r, h, t, m); h = history; t = most recent. cmd(p, d, m); d = measure index operation response time to specified peers. get(P, r, rly-list, m); set(P, r, rly-list, rly-list1); rly-list = state is superpeer list; rly-list1 = updated superpeer list
Managing a Resilient Overlay Network
A Resilient Overlay Network (RON) is an overlay network that routes application traffic by finding low-latency and available paths that might not be identified by the usual routing protocols. Most RONs have a small peer population, connect all peers in a mesh, and exchange their link measurements with all other peers in the RON. To monitor the operation of the RON, a well-known peer can be added to the overlay to take on the special role of management agent. This role allows the peer to connect with other peers in the overlay and receive status messages without having to actively participate in flow routing, since the overhead of participating in the operation of the overlay could interfere with its management agent role.
Following the FCAPS model, let's divide the management parameters into configuration, performance, and fault categories (Table 15.3). Configuration refers to operational settings of peers that can be set either by the management agent or the peer itself, using its self-organizing algorithms. Performance refers to variables that can be measured and that are significant to the function of the overlay. For example, for most overlays churn rate is an important parameter and could be calculated by having the management agent track join and leave events. Fault refers to functional failures or performance thresholds being exceeded.
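For instance, the churn-rate calculation mentioned above could look roughly like the following sketch, where the event format and window length are assumptions:

import time
from collections import deque

class ChurnMonitor:
    # Track join/leave notifications and report churn over a sliding window.
    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.events = deque()              # (timestamp, "join" or "leave")

    def record(self, kind: str) -> None:
        now = time.time()
        self.events.append((now, kind))
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def churn_rate(self) -> float:
        # Join plus leave events per minute over the window.
        return len(self.events) / (self.window / 60.0)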
Table 15.3. Example FCAPS Variables of Interest for Overlay Management by Type of Overlay
Overlay Type | Configuration | Performance | Fault
Generic DHT | Peer membership; overlay topology; object max size; max request rate; max number of objects | Churn rate; inbound and outbound bandwidth usage per peer; number of superpeers; average connection degree |
Table 15.4. Roles of Peers and Corresponding State Information for Performance Management
Role | Description | State Information
Superpeer | A peer that mediates NAT traversal for other peers | Client peers
Media relay, mixer, or transcoder | For applications, a peer that acts as an intermediary and may perform media processing for a media session | Number of relayed sessions; data rate per session; latency measurements for endpoints
Multicast proxy | An infrastructure peer for improving the performance of application layer multicasting | Number of multicast sessions; node degree per session; data rate per session
Quarantined peer | A peer that is restricted to client status until its lifetime reaches a minimum value | Operational statistics of quarantined peers
Client | A node that is not part of the overlay but that can use the overlay services | Operational statistics of clients
Virtual home agent | A peer with a static IP address that mediates overlay messages for one or more mobile peers | Number of mobile peers; roaming handoff statistics
Gateway | A peer that processes messages between peers in two or more different overlays, translating between protocols as needed | Message rates; message histograms
The topology of the overlay is an important configuration parameter to monitor, since it can be used to determine load distribution and detect overlay partitions. For a RON, the network measurements that each peer collects and distributes to the other peers can also be sent to the management agent. Current values can be displayed with the topology to visualize bottleneck and low-capacity paths. Historic values can be used for trend detection or variation by workload or time of day. The RON also maintains periodic per-flow measurements. These measurements can be evaluated against a threshold so that a fault is generated to the management agent when the path the RON selects is not meeting the performance available to the native path. To evaluate and improve the behavior of the overlay, it's important to collect statistics for switching to alternate paths.
Managing a Distributed File Storage Service
A P2P distributed file storage service can be constructed using the secondary storage areas of peers in the overlay. A number of designs have been proposed, including PAST [539], which runs on Pastry, and CFS [541], which runs on Chord. In PAST, each user has a quota for the amount of storage that can be used in the overlay. Files are stored using the hashed file name as the key. Thus two different files from the same owner are likely to be stored at different sets of peers in the overlay. Each file is stored at the k closest peers in the overlay. Since peer addresses are generated randomly, there is a high probability that the k nodes are geographically dispersed.
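A sketch of the PAST-style placement rule just described, with the peer representation and distance metric simplified for illustration:

import hashlib

def file_key(filename: str) -> int:
    # The file is stored under the hash of its name.
    return int.from_bytes(hashlib.sha1(filename.encode()).digest(), "big")

def storage_peers(filename: str, peer_ids: list, k: int = 4) -> list:
    # The file is replicated at the k peers whose IDs are numerically closest to the key.
    key = file_key(filename)
    return sorted(peer_ids, key=lambda pid: abs(pid - key))[:k]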
CFS provides a distributed read-only file system and stores files in the overlay by dividing each file into blocks. The blocks for popular files will be spread over many servers. Each block is identified using a hash of its contents as the key. Each block is replicated k times in the overlay, with replicas at the peers immediately after the block's successor in the Chord ring. A peer sends a request for a block as a DHT lookup of the block's key. Each peer along the lookup path checks its cache to see whether the block is present. If it is, the block is immediately returned. If the request reaches the primary peer, the block is returned to the requesting peer, which then sends a copy to each peer along the lookup path to add to its cache.
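A sketch of the CFS-style block handling described above. The lookup path is modeled simply as a list of per-peer caches, which abstracts away the Chord routing; the block size and names are illustrative:

import hashlib

BLOCK_SIZE = 8 * 1024   # illustrative block size

def split_into_blocks(data: bytes) -> dict:
    # Each block is identified by a hash of its contents.
    return {
        hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest(): data[i:i + BLOCK_SIZE]
        for i in range(0, len(data), BLOCK_SIZE)
    }

def fetch_block(key: str, lookup_path: list, primary_store: dict):
    # Check each cache along the lookup path; on a miss, fetch from the primary
    # peer and populate the caches on the way back.
    for cache in lookup_path:
        if key in cache:
            return cache[key]
    block = primary_store.get(key)
    if block is not None:
        for cache in lookup_path:
            cache[key] = block
    return block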
Figure 15.1 illustrates the role of the management agent in monitoring the service quality of a distributed file store that is similar to the PAST model. A peer stores a file by the hash of its filename. The peer that receives the request forwards it to the k-1 closest peers in the overlay address space for replication. The management agent is notified that the distributed file service has accepted a file with a specified identifier. Later one of the peers storing the replica leaves the overlay. This generates a replication-level exception to both the peer that inserted the file and the management agent. If the primary storing peer fails to add another replica peer within a given time window, the management agent may intervene.
Figure 15.1. Overlay messages between peers storing file replicas and the manager agent.
In addition, the management agent can monitor the uptime and storage integrity of each peer. The software for each peer can perform periodic file system integrity checks and send negative results to the management agent. A notification is also sent when the file system usage exceeds the threshold.