07-02-2012, 12:03 PM
Fault Tolerant Distributed Systems
fault-tol.ppt (Size: 597 KB / Downloads: 29)
#1: Revannaswamy, Bhatt - ‘97
O(log N), N = Number of nodes in the spanning tree
Communicate with neighbours only (Raymonds algorithm)
Tolerant to single link / single node failures
Eliminate the failed component and obtain different tree structure (network is biconnected)
Mechanism to detect the failures is assumed to exist
Only “Branches” are used for message exchanges
#1 Contd - Reconfiguaration
Repeat the steps below till all connected OR insufficient chords
At the end of reconfiguration, all data-structures are reset and algorithm is restarted, older token holder is pre-empted and token is transferred to a newly elected leader
#2 Contd. - Fault Tolerance
Mechanism to detect the failures is assumed to exist
Single link / single node failure and recovery
Any path constructed by taking outgoing edges always leads to the token holder
Studies all the states of the system at which the failure may occur and gives solutions to each of the states for tolerance
At recovery, the node asks its neighbours about its state in order to reconstruct the data structures
Fairness compromized at recovery - queue order is arbitrary
Some ideas - for our protocol
The first node experiencing timeout may send a message to every other node asking for their states
The node experiencing timeout must be: waiting for a grant/token to arrive OR waiting for a release to arrive
In case of grant/token, propagate the message to the parent chain to detect the pooint of failure
In case of release, send the message down the chain of children to see who is supposed to send a release
Once detected the point, recalculate the state of the affected nodes - the ones falling on the chain
Assuming no link failures for now, multiple failures can be detected