Sunday, October 18, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 2: The Architectural Model

I’m very excited for this paper. I’ve heard a lot of buzz about how Erlang facilitates the building of highly reliable systems. I was a little disappointed when the introduction said “The thesis does not cover in detail many of the algorithms used as building blocks for construction fault-tolerant systems—it is not the algorithms themselves which are the concern of this thesis, but rather the programming language in which such algorithms are expressed.” This gives me the impression that the paper will focus on how Erlang helps you develop reliable systems but leave out lessons that I can apply in other languages. We’ll see how things go…

I’m in complete agreement with the definition of architecture presented. The author says that there are myriad such definitions, but this one contains all the elements that I find noteworthy. It’s by Grady Booch, though, and he usually has some good ideas </understatement> .

I think that a philosophy or set of guiding principles is a commonly overlook portion of a software architecture, and I’m glad to see it called out here. As cheesy as this and related ideas like a system metaphor from Extreme Programming may sound, I have found that teams bigger than four or five people really do benefit from having a foundation that architects and developers can refer back to throughout the project’s lifecycle in an attempt to keep a uniform feel. A good example from a developer’s perspective, which bleed over into a “set of construction guidelines,” that still touches on some architectural decisions can be found here: http://www.cauldwell.net/patrick/blog/ThisIBelieveTheDeveloperEdition.aspx.

This sounds like an extremely interesting challenge. Each of the ten requirements would make for a challenging system, but mixing high concurrency, high reliability, in-place upgrades, and complex features sounds like quite the daunting task.

I’ve heard that Erlang isolates very small pieces of code in independent processes to restrict how far errors can propagate & that each process/module should always make the assumption any other modules that it will talk to will fail. However, the thought of truly to use a full process at the operating system level, this sounds like a very heavy-weight solution due to the overhead of having so many processes and the cost of interprocess communication versus intraprocess communication. It’s good to see that the processes used by Erlang in this sense “are part of the programming language and are not provided by the host operating system.” It’s interesting that towards the end of the chapter the author presents the idea that the isolation from errors that lightweight, hardware independent Erlang processes provide can only be matched in Java by running applications in different JVMs in different operating system processes, which is obviously a tad bit more resource intensive.

Concurrency orient programming doesn’t feel like it has too many differences from OOP to me. It just sounds like OOP in a COPL. Real-world activities are mapped to concurrent processes in much the same way the real-world objects and actions are mapped to classes and methods. Nothing about the characteristics of a COPL seem new, but it is a novel idea to have all the functionality listed (support for processes, process isolation, PIDs, communication only via unreliable message passing, failure detection) implemented completely within constructs of the programming language that can be completely independent of the operating system. It does seem like it could be a problem to have to copy data between processes, though. I can’t imagine that processes could always be organized so that large chunks of data never need to be passed between processes.

I’m never a fan of security through obscurity, so when the author mentions that not knowing the PID of a process means you can’t interact with it and thus can’t compromise it in any way. I’m sure that any decent architect would put more layers of security in place, but I dislike security through obscurity so much that I wouldn’t even bother bringing it up.

Message passing seems to be the best way to build reliable systems at any level of abstraction from my recent readings. At the enterprise architecture scale, event driven architectures and message passing systems are at the crux of many of the most durable sites as seen on High Scalability, e.g. http://highscalability.com/amazon-architecture. At the complete opposite end of the spectrum from systems integration is Erlang, which builds these exact same concepts, e.g. asynchronous message passing, into the programming language itself. Since I hear good things about reliable systems being coded in Erlang and reliable enterprise architectures building SOA stacks on EDAs and message passing systems, I’m guessing there’s something to this pattern.

The idea of failing fast as soon as you detect an unrecoverable error seems like common sense to me. Maybe I’m being naïve or not entirely comprehending the details of the idea, but I don’t understand why anyone would have their system continue after encountering an error which they can’t handle. I’ve come across this idea many times in the past but still have yet to feel like I’m not missing something.

No comments:

Post a Comment