[Jacob Gorm Hansen]
On Mon, 2003-01-13 at 16:06, Josh English wrote:
The largest problem I see with this scheme is that I cannot conceive of a mechanism by which data can be sent to both the primary CPU and the failover CPU without causing enormous delays. Any failover system will require confirmation that the backup CPU has received the data. Even if the transport and confirmation mechanism were implemented in hardware, the bottleneck inherent in the design would render the system infeasible in a real-world implementation.
I imagine that if the system is only ever accessed over the network, all the data that is to be duplicated comes from there. One could then use an Ethernet hub or similar hardware to make sure all Ethernet frames arrive at both hosts, coupled with a dedicated low-latency link between the hosts used for synchronisation (which, given certain guarantees about the performance of the local Ethernet segment, could be as simple as each host counting all incoming frames and sharing the counter over the dedicated link). A rough sketch of that counting scheme is below.
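For illustration, here is a minimal sketch of the counting scheme, assuming Linux raw packet sockets, root privileges, and made-up addresses for the dedicated sync link (none of this comes from an existing system):

import socket
import struct

ETH_P_ALL = 0x0003                     # receive frames of every protocol
SYNC_PEER = ("192.168.100.2", 9999)    # assumed address of the other host on the sync link
DIVERGENCE_LIMIT = 16                  # how far the counters may drift before we worry

def monitor(iface="eth0"):
    # Raw packet socket sees every frame the hub duplicates to this host.
    sniff = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sniff.bind((iface, 0))

    # Cheap UDP exchange of the counter over the dedicated low-latency link.
    sync = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sync.bind(("0.0.0.0", 9999))
    sync.setblocking(False)

    frames_seen = 0
    while True:
        sniff.recv(65535)              # only counting; the frame contents don't matter here
        frames_seen += 1
        sync.sendto(struct.pack("!Q", frames_seen), SYNC_PEER)
        try:
            data, _ = sync.recvfrom(8)
            peer_count = struct.unpack("!Q", data)[0]
            if abs(peer_count - frames_seen) > DIVERGENCE_LIMIT:
                print("hosts have diverged; trigger resynchronisation")
        except BlockingIOError:
            pass                       # no counter from the peer right now

if __name__ == "__main__":
    monitor()

In practice one would probably compare counters per flow or add checksums, but the point is that the only traffic on the critical path is a tiny counter exchange over the dedicated link, not an acknowledgement of every frame.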
In many situations you can also achieve fault-tolerance using logging, and exploit some sort of causal logging technique to avoid unnecessary acknowledgements from the logger (a toy sketch of the piggybacking idea follows below the links). I suspect that the following two projects might give some useful solutions and/or pointers:
Lightweight Fault-Tolerance URL:http://www.cs.utexas.edu/users/lorenzo/lft.html
WAFT: Support for Fault-Tolerance in Wide-Area Object-Oriented Systems URL:http://www-cse.ucsd.edu/users/marzullo/WAFT/index.html
In particular, there are proposals on how to achieve fault-tolerance (through logging) for TCP-like protocols.
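For what it is worth, here is a toy sketch of the piggybacking idea behind causal logging; the names and structure are mine, not taken from those projects. Each process keeps determinants (its message delivery order) in volatile memory and piggybacks the not-yet-stable ones on outgoing messages, so no synchronous acknowledgement from a logger sits on the critical path:

import itertools

class CausalLoggingProcess:
    def __init__(self, name):
        self.name = name
        self.recv_seq = itertools.count()   # delivery order at this process
        self.determinants = []              # determinants not yet known to be stable
        self.log = []                       # everything learned here (usable for replay)

    def deliver(self, sender, msg, piggyback):
        # Determinants piggybacked by the sender are now replicated at this process,
        # which is what makes them recoverable without a logger acknowledgement.
        self.log.extend(piggyback)
        det = (self.name, next(self.recv_seq), sender, msg)
        self.determinants.append(det)
        self.log.append(det)

    def send(self, msg):
        # Piggyback our not-yet-stable determinants instead of waiting for a logger.
        return (self.name, msg, list(self.determinants))

# Toy usage: p's delivery order reaches q purely by piggybacking.
p, q = CausalLoggingProcess("p"), CausalLoggingProcess("q")
p.deliver("client", "request-1", [])
sender, msg, piggyback = p.send("reply-1")
q.deliver(sender, msg, piggyback)
print(q.log)   # q now holds p's determinant and could replay p after a crash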
eSk