Thursday, November 5, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 6: Building an Application

I am getting the sense that the author loves his hierarchies. When it comes to structuring a system, we throw a few more layers on top of the supervisor-worker trees mentioned in the last chapter. We group systems into applications that are either independent or loosely coupled. Each application then consists of supervisor-worker trees.

I find it very interesting that just by passing a 'global' flag into the server's start function, the server can be transparently accessed from any machine. I wonder what setup is needed at the system or application level to make communication between machines, theoretically anywhere in the world, transparent. How does the system avoid naming collisions? If you simply pass {global, Name} as an argument to start up a server, I imagine it could be very easy to get collisions. Sure, the process will likely fail fast, but what are good options for recovery? Restarting it will not help because it will have the same name. I imagine that changing the name will break other processes that want to communicate with it... Oh, wait. I think it is the PID that is used for IPC, not the name. What is the point of the name, then?
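A toy model may help with the name-versus-PID question. The sketch below is Python, not Erlang, and it is not how OTP implements global registration; the `Registry` class, its methods, and the `account_server` name are all made up for illustration. It assumes only two things: that a restarted process comes back with a different PID, and that registering a name already in use is refused rather than silently overwritten.

```python
import itertools


class NameTaken(Exception):
    """Raised when a name is already bound to a live process."""


class Registry:
    """Hypothetical stand-in for a global name table: name -> pid."""

    def __init__(self):
        self._names = {}
        self._next_pid = itertools.count(1)

    def spawn(self, name):
        """Create a 'process' (here just a fresh pid) and bind it to name."""
        if name in self._names:
            # Fail fast on a collision instead of replacing the current owner.
            raise NameTaken(name)
        pid = next(self._next_pid)
        self._names[name] = pid
        return pid

    def release(self, name):
        """The process died, so its name becomes free again."""
        self._names.pop(name, None)

    def whereis(self, name):
        """Resolve a name to the current pid, or None if nobody holds it."""
        return self._names.get(name)


registry = Registry()
old_pid = registry.spawn("account_server")

# A second server asking for the same name is rejected immediately.
try:
    registry.spawn("account_server")
    collided = False
except NameTaken:
    collided = True

# Crash and restart: the name is released, then bound to a new pid.
registry.release("account_server")
new_pid = registry.spawn("account_server")

# A client that cached old_pid now holds a stale handle;
# a client that looks up the name reaches the restarted process.
current = registry.whereis("account_server")
```

Under those assumptions, the name is the stable handle and the PID is the thing that changes on restart, so a restart under the same name only collides if the old registration is still held by a live process.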

I am really overwhelmed by the API documentation. Although I can follow it somewhat (Chapter 3 is my only exposure to Erlang), I do not understand what benefit I am supposed to be getting from it. I love having concrete examples to back up the abstract concepts, but I do not feel like this material, e.g. the structure of an .app file, is conveying anything interesting.

I am still somewhat confused about how SSRS is supposed to solve all your reliability problems. If the problem lies in some data, or in the combination of some data and a certain piece of code, restarting and reprocessing will just hit the same problem over and over. The only solution is to completely ignore that data. I suppose the goal is just to not crash things at the application or system level, but simply throwing your hands up and saying "I don't know what to do" seems like it wouldn't be good enough in many cases.
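The restart loop described above can at least be bounded. OTP supervisors are configured with a maximum restart frequency, and a supervisor that exceeds it gives up and fails itself, handing the problem to the level above. Here is a rough Python model of that idea, not the thesis's code: `supervise`, `process_all`, `max_restarts`, and the sample inputs are invented for illustration, the restart budget is applied per item rather than per process, and the time window that real supervisors use is left out.

```python
class GaveUp(Exception):
    """The supervisor exceeded its restart budget and escalates."""


def supervise(worker, item, max_restarts):
    """Run worker(item), restarting on failure up to max_restarts times."""
    attempts = 0
    while True:
        try:
            return worker(item)
        except Exception as err:
            if attempts >= max_restarts:
                # Stop retrying and report upward instead of looping forever.
                raise GaveUp(f"{item!r} failed {attempts + 1} times") from err
            attempts += 1


def parse_amount(text):
    # Deterministic failure: bad input fails the same way on every attempt.
    return int(text)


def process_all(items, max_restarts=3):
    """Process each item and set aside the ones that keep failing."""
    done, rejected = [], []
    for item in items:
        try:
            done.append(supervise(parse_amount, item, max_restarts))
        except GaveUp:
            # The higher level decides what to do with data nobody can handle.
            rejected.append(item)
    return done, rejected


done, rejected = process_all(["10", "oops", "32"])
```

In this model the bad item is retried a fixed number of times and then set aside for a higher level to deal with, which is roughly the "ignore that data" outcome, but reached deliberately and without taking the rest of the work down with it.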

I've been impressed by how Erlang abstracts away a lot of the tricky bits, like managing concurrency & supervision, while making it easy for everyday developers to plug into that. I think that is key for building reliable systems. If everyday programmers are writing the code that is supposed to ensure reliability & fault tolerance, odds are that they will mess it up. I'm not sure whether the examples just lend themselves to this extremely well or things really are much easier in Erlang, but separating out error-handling concerns is not nearly this easy in the OO languages that I use. It reminds me of AOP, which has not taken off the way I was hoping it would, even though there are a few good AOP tools out there.
