Sunday, August 23, 2009

Looking Up Data in P2P Systems

P2P systems are are made up individual computers that are independent of each other, but it still reliable and robust as the number of participants grow. These P2P systems are attractive because they usually do not require any administrative or financial agreements, allows us of storage and computation resources of many computers over the internet, is decentralized and distributed -- making it robust, less susceptible to attacks, and fit for long computations and storage.

The authors of the paper discuss some algorithms that aim to look up data in a large P2P system in a scalable manner, without any centralized servers or hierarchy. The design points and trade-offs for each algorithm are also highlighted.

The first such algorithm discussed in length was that of Chord, which uses a successor list to ensure correct lookup even if nodes frequently join or leave the system. The data structure that Chord uses resembles a skip list. (The Chord protocol has already been lengthily discussed in a previous blog article, so I will not repeat it here)

Another way of searching is using tree-like routing wherein for each prefix, each node in the tree-based algorithm records the location of some node with that prefix. Examples of algorithms that use a tree-like data structure are: Pastry, Tapestry, and Kademlia. With Pastry, each node is given a random ID that marks its position in the identifier circle. Messages are routed with a key to the live node with a node ID closest to the key. In turn, there is a leaf set in each node with half the set having values greater than the node, and the other half having values less than the node. Queries are forwarded among the nodes and leaf sets until it reaches a point closest to the key's ID. Then, the routing table is used and the message is sent to the node that has a longer shared prefix with the sought key. The number of hops in Pastry for a query is at most log(base2^b)N.

The last algorithm discussed in detail is in routing in multiple dimensions. An example is CAN, wherein a d-dimensional Cartesian coordinate apce is used to implement the distributed hash table. The coordinate space is divided into rectangular spaces called zones. And each node is responsible for a specific zone. A key is assigned to a coordinate point, and thus to a specific node. Each node maintains a routing table containing its neigbors. When a query is made, a node forwards the query to a neighbor with coordinates closest to the destination key. The lookup cost for such a system is O(dN^(1/d)). The method for joining and leaving nodes in this system were also discussed.

In conclusion, the authors stated that the discussed algorithms in the article are still all under development. Some issues that should be looked into when using the algorithms are: the distance function, operation costs, fault tolerance and concurrent changes, proximity routing, malicious nodes, and indexing and keyword search.

Reference:

Balakrishnan, H. et al.2003.Looking Up Data in P2P Systems.Communications of the ACM

No comments:

Post a Comment