Wednesday, November 4, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 5: Programming Fault-Tolerant Systems

"... [I]f we cannot do what we want to do, then try to do something simpler." I would like to see some examples of this, because I cannot think of many ways to apply it in the systems that I have been working on lately. For example, one system is supposed to place orders for financial instruments, and I cannot think of any way to break that down into simpler, alternative tasks. If the system cannot thoroughly and completely validate the order, then it cannot place the order. If the order cannot be persisted to the database, then it cannot be placed. There is nothing simpler to try. At a cursory glance, most key business tasks seem to be all-or-nothing.

Here is another attempt to differentiate errors, exceptions, failures, faults, bugs, etc. I know there are some official, IEEE-backed definitions for these terms, but my problem is that the terms are usually too similar, drawing the distinctions seems like an academic exercise, and the definitions usually do not clearly separate the terms. Here is my stab at interpreting what the author was trying to say:

Errors occur at run-time when the system does not know what to do. Dividing by zero is an error to the VM since it cannot perform the computation, but this may or may not be an error to the parent process, depending on whether or not the developer accounted for this scenario. If the code corrects the divide-by-zero error, it ceases to be an error because the system now knows what to do with it. (I had to read ahead a bit to section 5.3 to get a decent definition.)

Exceptions are a mechanism to communicate about errors between processes. When an error occurs in the VM, it raises an exception to let the calling process know what happened. That process may pass the exception on or create a new one to tell its parent that an error has occurred that it cannot correct. If a process receives an exception for which it does not have a catch handler, the process fails (shuts down) and notifies its linked peers why.

So in a nutshell, an error is a run-time problem that the system does not know how to handle. It is communicated between processes by exceptions, and those exceptions cause failures in processes that do not have catch blocks for them. Now let’s see how those definitions help us build a reliable system...
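To make my reading of these three terms concrete, here is a minimal sketch in Python rather than Erlang. It is not from the thesis. The function names and the "return 0.0" policy are my own assumptions, and `run_worker` is only a stand-in for a process boundary, not a real process.

```python
def divide(a, b):
    # The runtime cannot compute a / 0, so it raises ZeroDivisionError.
    return a / b


def careful_ratio(a, b):
    # This caller accounted for the scenario, so the error is corrected here
    # and stops being an error for everyone above it.
    try:
        return divide(a, b)
    except ZeroDivisionError:
        return 0.0  # assumed policy, purely for illustration


def careless_ratio(a, b):
    # No handler: the exception passes straight through to the caller.
    return divide(a, b)


def run_worker(task, *args):
    # Stand-in for a process boundary: an exception nobody caught becomes
    # a failure, reported to the parent together with the reason.
    try:
        return ("ok", task(*args))
    except Exception as exc:
        return ("failed", repr(exc))
```

Calling `run_worker(careful_ratio, 1, 0)` reports `("ok", 0.0)`, while `run_worker(careless_ratio, 1, 0)` reports a `"failed"` tuple that carries the reason for the failure.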

The author recommends creating supervision trees.
+Supervisors monitor other supervisors and worker nodes.
+Worker nodes execute well-behaved functions that raise exceptions when errors occur.
+Supervisors are responsible for detecting exceptions in child nodes & implementing the SSRS (stop, start, restart specification) for each node that they supervise.
+Supervisors are responsible for stopping all children if their parent stops them & restarting children that fail.
+Supervisors can be branded either AND or OR. If they are of the AND type & one child fails, they stop all other children & then restart them all. These are used for coordinated processes where a process cannot continue if any of its siblings fail. If they are of the OR type & one child fails, they restart that node and leave the rest of the children alone.
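The AND/OR restart behavior described above can be sketched roughly as follows. This is a toy, single-threaded Python model and not how Erlang/OTP implements supervisors. The `Worker` and `Supervisor` classes and the `child_failed` method are hypothetical names of my own, and a real supervisor would detect failures itself instead of being told about them.

```python
class Worker:
    """Hypothetical worker that records whether it runs and how often it started."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.starts = 0

    def start(self):
        self.running = True
        self.starts += 1

    def stop(self):
        self.running = False


class Supervisor:
    """Restarts children according to an AND or OR policy."""

    def __init__(self, kind, children):
        if kind not in ("and", "or"):
            raise ValueError(f"unknown supervisor kind: {kind!r}")
        self.kind = kind
        self.children = list(children)

    def start(self):
        for child in self.children:
            child.start()

    def stop(self):
        # A supervisor that is stopped by its parent stops all of its children.
        for child in self.children:
            child.stop()

    def child_failed(self, failed):
        failed.stop()
        if self.kind == "and":
            # AND: stop every sibling, then restart the whole group.
            self.stop()
            self.start()
        else:
            # OR: restart only the failed child and leave the siblings alone.
            failed.start()
```

Because `Supervisor` exposes the same `start`/`stop` methods as `Worker`, a supervisor can itself be a child of another supervisor, which is what makes this a tree.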

"The above discussion is all rather vague, since we have never said what an error is, nor have we said how we can in practice distinguish between a correctable and an uncorrectable error." The author absolutely read my mind! Time for some concrete examples...

Rules for well-behaved functions:
+"The program should be isomorphic to the specification": everything that is in the specification must be in the function and everything that is in the function must be in the specification. Nothing from the specification should be omitted in the function, and nothing should be added to the function that is not in the specification.
+"If the specification doesn’t say what to do, raise an exception": Don't guess; fail fast.
+Be sure the exceptions contain enough useful information so that you can isolate and fix what caused them.
+"Turn non-functional requirements into assertions (invariants) that can be checked at run-time": Not sure how I would do this in general. The author's example is timing function calls to make sure none go into uncontrolled infinite loops.
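Here is a small Python sketch of how I imagine the last three rules could look in practice. Everything in it is an assumption for illustration: the fee "specification", the `UnspecifiedCase` exception, and the `checked` helper are invented, not taken from the thesis. Note that the timing check only runs after the call returns, so it flags slow calls but cannot interrupt one that truly never returns.

```python
import time


class UnspecifiedCase(Exception):
    """Raised when the (hypothetical) specification does not cover the input."""


def fee_for(order_type, quantity):
    # Assumed spec: "market" orders cost 1 per unit, "limit" orders cost 2 per unit.
    if not isinstance(quantity, int) or quantity <= 0:
        raise UnspecifiedCase(f"fee_for: no rule for quantity={quantity!r}")
    if order_type == "market":
        return 1 * quantity
    if order_type == "limit":
        return 2 * quantity
    # The spec says nothing about other order types: do not guess, fail fast,
    # and include the offending values so the cause can be isolated.
    raise UnspecifiedCase(
        f"fee_for: no rule for order_type={order_type!r} (quantity={quantity!r})"
    )


def checked(max_seconds, func, *args):
    # Non-functional requirement "must finish within max_seconds",
    # expressed as an assertion that is checked at run-time.
    started = time.monotonic()
    result = func(*args)
    elapsed = time.monotonic() - started
    if elapsed > max_seconds:
        raise AssertionError(
            f"{func.__name__} took {elapsed:.3f}s, limit is {max_seconds}s"
        )
    return result
```

For example, `checked(1.0, fee_for, "limit", 2)` returns `4`, while `fee_for("stop", 1)` raises `UnspecifiedCase` with the unexpected order type in the message.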
