Friday, October 23, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 4: Programming Techniques

I took the time to read Chapter 3 on the language itself because I am quite interested in learning new languages whenever possible. One “best practice” for self-development that I often read is to learn one new programming language every year. The goal is not to become a master in the language and use it in production systems, but to become competent with it so you can understand its patterns and paradigms and have those influence your day job. Functional languages have been all the rage lately and have had significant impacts on modern OO languages, e.g. LINQ in C#. An example of applying functional concepts to your OO programming could be to create generic functions that let you program more declaratively and using immutable value objects.

The one section that I found interesting outside the syntax of the language was the treatment on page 65 of tail recursion. Although I am familiar with the concept, it was an interesting refresher on how it works technically with return pointers & how you can convert to tail recursion by introducing an auxiliary function.



Abstracting Concurrency

The author advocates putting all concurrency-related code into generic reusable modules written by expert programmers and putting all sequential code into plug-ins that anyone can write. Obviously concurrent programming is harder, even in a language that natively supports it like Erlang, than writing sequential code due to the complexities of deadlocks, starvation, interprocess communication, etc. He also advocates creating a plug-in architecture because he claims that you cannot write the concurrent code in a side-effect free manner, whereas you can easily do this with sequential code. To answer Ben’s question, I do believe this statement to be true. The language does not allow side effects in terms of mutating state, so the author is not referring to that. He is referring to the fact that the functions in the concurrent modules need to affect the environment by spawning & exiting processes.

I am not sure why the “find” method in the VSHLR responds with an error when it cannot find the person. I thought that each process was supposed to complete successfully or exit. Granted, this is a central server and we do not want it to die, but although it makes logical sense, it seems to go against the whole paradigm that the author has laid out thus far in the paper. (After reading the rest of the chapter & re-reading this section, the server should not need to exit here. The point of the “fail fast” and “let another process handle the problem” philosophies here means that the server should simply return an error and let the client figure out what to do rather than take any corrective action.)

I like how the author was able to “factor out” concurrency in the example of the VSHLR server. I am wondering if this pattern is always so easy to apply, though.

The idea of having supervisor processes monitor the health of the nodes in the system so a client does not have to worry about its server dying seems like it would work as long as there was a supervisor for the supervisor and a supervisor for that supervisor and so on. So far, the application seems durable, but I think that this same level of durability (except in the face of hardware failures) could be easily replicated in Java or C# without having separate JVMs running in separate processes, as the author said was necessary earlier. I have a “catch (Exception ex) {/*show error message*/}” at the top level in most of my apps that works just fine for errors that are not due to hardware failures. There are obviously many other benefits to Erlang that would be missed, but the durability argument is not shaping up to be quite what I thought it would be.

The whole idea of swapping code out on the fly is much simpler here than I imagined it to be. Here the server is just sent a new function/closure/lambda to execute instead of the one that it was using before. I know there is support for a much more extensive upgrade given what was mentioned in Chapter 3, and I’m interested to see how that works.



Error Handling Philosophy

Even after the author lists out four reasons why having another process handle the errors or the faulty process, I still do not see why it is much more advantageous than having the faulty process handle its own errors unless the errors are hardware related and/or unrecoverable and the faulty process dies. In most systems that I build, we rely on failover systems to handle hardware failures, but I imagine most of the systems that go to the lengths of using Erlang want a more bulletproof Plan B than that. None of his arguments related to code clutter or code distribution sell me at all, though (I really want to be sold on the idea & think outside my Java/.NET box, but the author isn’t a convincing salesman thus far or the pitch is going right over my head).

Later…

I guess that after reading a bit farther, I should recant what I said above. Having the error handling completely removed from the code the worker processes will be executing, even if it is not generic, seems much nicer than having to rely on “catch (Exception ex)” blocks sprinkled throughout the code. If the error handling code in the supervisors can be made generic enough to work with multiple different types of worker processes, all the better.

No comments:

Post a Comment