“Apology-based computing”

I came across this phrase some time back and was instantly intrigued by it. So, like any good samaritan, let me share what I could make of it, for the larger good of humankind! While presenting my assessment, I’ll also highlight the aspects that make it viable in most computing contexts. We will also delve into how this phrase, at some level, touches on facets of modern-day computing such as eventual consistency, the trade-off between performance and correctness, and Amdahl’s law.

First, let’s get to understand what the phrase means.

Let me make a claim: we come across apology-based computing in our day-to-day digital lives, be it shopping on e-commerce websites, chatting on our phone apps, or general browsing.

So what is it?!

It merely points to the fact that, in this age of highly distributed systems¹, in a majority of situations it’s OK for messages to be delayed, go undelivered, or simply go awry! Note that “messages” here means any sort of communication between two or more systems¹.

If this sounds a bit overbearing and makes you cringe, you might want to recall situations in which you had to “refresh the page” in order to see an update, or those times when your text or chat messages did not deliver and a cute little (!) sign appeared beside them, or (for the technically inclined) when you had to explicitly invalidate a cache so that updated data would be reflected quickly. Also recall that you were sort of okay with it, and in a majority of such scenarios did not complain.

And, that’s precisely it!

Over time, as systems have grown and apps have proliferated into almost every aspect of our lives, we have become more and more OK when, every once in a while, the systems do not behave as expected. This is not just philosophical; we humans did not become more patient overnight! Rather, over the years, something interesting has happened: our response as intermediate or ultimate users of these systems has evolved. So much so, that a seemingly bureaucratic statement,

It’s easier to ask for forgiveness than to get permission.


has found a benign presence in computing, and forgiveness, or apology, has become the order of the day!

Why did this happen, you ask? Because there’s no other option!

Something of this sort is what David Ungar has deliberated upon in his iconic talk titled “Everything You Know (about Parallel Programming) Is Wrong!” (see the embedded video).

In this talk, in the light of the above quote, David Ungar highlights how the bias in computing is leaning (or should lean) more towards something he refers to as “end-to-end nondeterminism”, or “race-and-repair”, rather than correctness.

Correctness, or determinism, comes at a cost. Despite one’s best efforts, we are limited by Amdahl’s law within the confines of a single system, and by aspects like the CAP theorem and other distributed-system vagaries the moment a process (transaction) crosses the boundaries of one system.
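To make that cost concrete, Amdahl’s law says that if a fraction p of a workload is parallelizable, then n processors yield a speedup of at most 1 / ((1 - p) + p / n). A quick, purely illustrative sketch in Java:

```java
// Amdahl's law: the speedup with n processors, when only a fraction p of the
// work can be parallelized, is bounded by S(n) = 1 / ((1 - p) + p / n).
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% parallelizable work, 1024 cores give only about a 20x
        // speedup: the serial 5% dominates.
        System.out.printf("%.1f%n", speedup(0.95, 1024));
    }
}
```

The serial fraction, however small, puts a hard ceiling on what more hardware can buy, which is part of why chasing perfect determinism gets so expensive.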

So, what do we do?

Well, distributed systems engineers are well-aware of the phenomenon that,

Failure is a norm rather than an exception.

which, when you come to think of it, is a paradigm shift from the conventional thinking, where we, as programmers or system architects, were told to treat failure as catastrophic! Treating failures as the norm in modern-day computing leads us to something very practical, something we call “designing for failure”.
That is to say, we need to build systems with better resiliency, quick failure detection, and fault tolerance, and, with CAP in perspective, be willing to compromise on correctness in favour of availability by being eventually consistent! That pretty much takes care of most scenarios. (Of course, we’re not talking about mission-critical systems or transactions, where correctness and/or availability, whatever the cost, is indispensable!)
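As a toy illustration of designing for failure, here is a minimal retry-with-backoff helper in plain Java. The names and the retry policy are my own assumptions, not taken from any particular system; the point is simply that a transient failure is absorbed and retried rather than surfaced as a catastrophe:

```java
import java.util.concurrent.Callable;

// A "failure is the norm" sketch: run a flaky operation, and on failure
// apologize quietly by waiting and trying again, up to a bounded number
// of attempts.
public class Retry {
    public static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();                      // happy path
            } catch (Exception e) {
                last = e;                              // failure is expected, not fatal
                Thread.sleep(backoffMillis * attempt); // simple linear backoff
            }
        }
        throw last; // give up only after exhausting all attempts
    }
}
```

In a real system the backoff would typically be exponential with jitter, and only retriable (transient) exceptions would be retried, but the shape of the idea is the same.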

So, going back to the aforementioned scenarios, what the systems are doing by making us ‘refresh’ the browser window or log back in is realigning us with the correct state of the application. The other option, of course, would have been a 503 or something similar, which leads to far more painful memories!

¹ Systems = processes

Reliable Distributed Communication with JGroups

We were looking for an alternative to the Java Message Service (JMS) for a project requirement; I had had my fair share of unpleasant experiences with JMS in the past, most of which had to do with the housekeeping involved in the event of a broker outage.

Issue at hand

The solution we were working on was a legacy web-based application that was being enabled for a cluster. Scalability, it seems, had not been designed into the system. The task, therefore, demanded making the system scalable with a quick turnaround, minimal code changes, and, of course, no negative (if not a positive) impact on application throughput.

The main challenge was the presence of a programmatic cache (built on core Java data structures) for some specific requirements, which, of course, needed to be made cluster-aware. Data consistency is the primary concern in such cases, because application pages can potentially be serviced by any node.

In this post, I want to talk about how we achieved this, since such a scenario of different nodes needing to “talk” can arise in many present-day applications as distributed computing becomes more and more popular.


When we say cluster-aware, we imply that the clustered application might need to talk to its peers at various junctures. It might or might not decide to perform an operation based on those interactions, but the conversation does need to occur. There are, of course, several options today to facilitate this: JMS, Hazelcast, JGroups, etc. JMS requires the presence of an active broker process at all times, and the housekeeping involved in broker outages, especially when message persistence is enabled (for reliability), becomes a problem. Solutions like Hazelcast would have required us to modify the existing code to use the data structures they provide. This wasn’t feasible, because we wanted to minimize code changes and instead work out a solution on top of the existing code base.

We settled for JGroups because it seemed to address all the concerns:

  • there is no active process required beforehand; rather, each node either joins an existing communication session or starts one of its own, which means there is no single point of failure (SPOF)
  • message reliability, ordering, and flow control are all configurable
  • code changes could be contained, because only minor modifications were involved.

JGroups is a reliable IP multicast toolkit based upon a flexible protocol stack. IP multicast implies sending a message to multiple recipients. Traditionally, multicast communication is UDP (User Datagram Protocol) based and is unreliable. For instance, think of streaming a video from YouTube: once the streaming starts, it matters little if a few frames are missed or jagged. Such transmissions are typically unreliable, UDP-based ones.

JGroups addresses the shortcomings of traditional IP multicast by adding aspects like reliability and membership to it.

Flexible protocol stack implies that there are options to tailor JGroups behaviour according to the application requirements. For example, aspects like Failure Detection, Encryption, etc., can be configured.
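For illustration, a heavily abbreviated UDP-based stack might look like the following. The protocol names are real JGroups protocols, but a production stack (such as the udp.xml shipped with JGroups) carries many more protocols and attributes:

```xml
<!-- Abbreviated, illustrative JGroups stack; not a complete configuration -->
<config xmlns="urn:org:jgroups">
    <UDP mcast_port="45588"/>  <!-- transport: IP multicast over UDP -->
    <PING/>                    <!-- initial discovery of group members -->
    <FD_ALL/>                  <!-- failure detection via heartbeats -->
    <VERIFY_SUSPECT/>          <!-- double-check suspected members before eviction -->
    <pbcast.NAKACK2/>          <!-- reliable, ordered multicast via retransmission -->
    <UNICAST3/>                <!-- reliable one-to-one messages -->
    <pbcast.STABLE/>           <!-- garbage-collect messages seen by everyone -->
    <pbcast.GMS/>              <!-- group membership: views, joins, leaves -->
    <FRAG2 frag_size="60000"/> <!-- fragment large messages -->
</config>
```

Each layer can be swapped, tuned, or removed, which is exactly what makes the stack “flexible”: an application that can tolerate loss can drop the reliability layers, while one that needs encryption can add it.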


Conceptually, the design was very simple: the caches on each node needed to “talk” to other nodes, whenever a significant operation occurred. Since we were dealing with user-defined entities, these operations mostly involved some kind of update of these entities.

Since each node needed a JGroups handle in order to talk, we invoked a local JGroups session on each. Through the concepts of groups and membership, a JGroups session looks for other active sessions when it comes alive; consequently, members belonging to the same group are aware of all their peers. For example:
Suppose node n1 comes up as the very first node and initializes a JGroups session, creating a group called ProjectXGroup. Any subsequent node, say n2 or n3, does the same, but in its case it finds an active ProjectXGroup already in place and hence joins it rather than creating a new session. Thus, each node (peer) is aware of all the members that form the group. JGroups, by design, provides failure detection (FD): any node that goes down is removed from the list of active members, ensuring that at any given point in time all the group members are aware of the current state of the group.
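The join behaviour above can be sketched against the JGroups API (a hedged sketch using the JGroups 3.x/4.x-style API; the class name is illustrative, while the group name matches the example in the text):

```java
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

// Each node creates a local channel; connect() either joins an existing
// ProjectXGroup or, if none exists yet, starts one (the n1 case above).
public class CacheNode extends ReceiverAdapter {
    private JChannel channel;

    public void start() throws Exception {
        channel = new JChannel();      // default protocol stack (udp.xml)
        channel.setReceiver(this);     // register callbacks for messages and views
        channel.connect("ProjectXGroup");
    }

    @Override
    public void viewAccepted(View view) {
        // Invoked on every membership change: joins, leaves, and nodes
        // removed by failure detection.
        System.out.println("Current members: " + view.getMembers());
    }

    @Override
    public void receive(Message msg) {
        System.out.println("Message from " + msg.getSrc() + ": " + msg.getObject());
    }
}
```

There is no broker to start first: whichever node comes up first effectively becomes the group, and everyone else discovers and joins it.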

Any local update of one or more entities required the change to be propagated to all the peers. We identified that this should be a 3-step process. Given an entity EntityX, the following needed to be done on all the nodes:

  • Lock EntityX
  • Update EntityX
  • Unlock EntityX

We also needed to ensure that failure to perform any of these operations would result in a rollback. For any given operation, we subdivided it into smaller tasks; examples of tasks are LockTask, UpdateTask, UnlockTask, etc. It is these tasks that form the language of communication. Typically, a task would have a taskID, information about the entities involved, a status flag (SCHEDULED, SUCCESSFUL, FAILED, …), etc.
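A minimal shape for such a task might look like this (plain Java; the class and field names are assumptions for illustration, not from the original code base). It is Serializable because task instances travel between JVMs:

```java
import java.io.Serializable;

// The "language of communication" between peers: a task carries its id,
// the entity it operates on, the kind of sub-task, and an outcome flag.
public class Task implements Serializable {
    public enum Status { SCHEDULED, SUCCESSFUL, FAILED }

    private final String taskId;    // unique id, used to correlate replies
    private final String kind;      // e.g. "LOCK", "UPDATE", "UNLOCK"
    private final String entityId;  // the entity involved, e.g. "E1"
    private Status status = Status.SCHEDULED;

    public Task(String taskId, String kind, String entityId) {
        this.taskId = taskId;
        this.kind = kind;
        this.entityId = entityId;
    }

    public String getTaskId()   { return taskId; }
    public String getKind()     { return kind; }
    public String getEntityId() { return entityId; }
    public Status getStatus()   { return status; }
    public void setStatus(Status s) { status = s; }
}
```

A peer performs the work the task describes, flips the status flag, and sends the same object back, so the sender can match the reply to the request by taskId.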

JGroups has options to either broadcast a message or unicast it. Since our implementation required that the caches on all peers be consistent, a sender broadcasts a task to all the peers. Each peer would then perform the task and report the outcome back to the sender. Continuing with our earlier example, consider a task T1 which involves the following sub-tasks on an entity known as E1:

  1. Lock E1
  2. Update E1
  3. Unlock E1

Let’s consider the first sub-task — Lock E1:
This would require E1 to be locked on all the nodes, so we invoke the LockTask with information about E1. Now, suppose n2 is the node catering to the user at a given point in time; n2 would first lock E1 locally and then broadcast the LockTask to the group. Since it is a broadcast, all the nodes (including n2) would receive the LockTask. We can easily identify the sender and skip performing the task there (since the task was issued to the other nodes only after the sender had performed it). This means that n2 would ignore the task, but n1, n3, …, nn would need to perform it (and lock E1 locally).

Once a given task is processed, in order to report the outcome back, the peer sets the appropriate status flags on the original Task object and sends it back to the sender. This time, however, we use unicast communication (by specifying the recipient), since we know to whom the message is directed. In JGroups terms: after performing the task, n1, n3, …, nn each get the handle to the JGroups session and can use the same LockTask instance to report the outcome back. Upon receiving responses from all the peers, and only if the responses are positive, the sender can proceed with the remaining sub-tasks: Update E1, Unlock E1, and so on.
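The sender-side bookkeeping, waiting for every peer to report a positive outcome before moving on to the next sub-task, can be sketched as follows (plain Java; peer identity is modelled as a String here, whereas in JGroups it would be an Address, and the class name is an assumption for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Tracks which peers still owe a reply for one broadcast sub-task.
// The sender proceeds only when every peer has responded positively;
// any negative reply means the operation must be rolled back.
public class ResponseCollector {
    private final Set<String> pending;    // peers we still expect to hear from
    private boolean anyFailure = false;

    public ResponseCollector(Set<String> peers) {
        this.pending = new HashSet<>(peers);
    }

    public synchronized void report(String peer, boolean success) {
        if (!success) anyFailure = true;  // one failed sub-task taints the whole operation
        pending.remove(peer);
    }

    public synchronized boolean allResponded() { return pending.isEmpty(); }

    // true only when every peer replied and none reported a failure
    public synchronized boolean canProceed() { return allResponded() && !anyFailure; }
}
```

A real implementation would also arm a timeout so that a silent peer (one of the failure scenarios below) does not block the operation forever.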

Exceptional Scenarios

There are various junctures at which the above conversation could fail, and all such exceptional scenarios need to be taken care of. We handled several failure scenarios; to list a few:

  • No response (or response timeout) from one or more peers
  • Failure of subtask in the sender node
  • Failure of subtask in any peer, etc.


We discussed how JGroups can be an apt solution in situations where reliable distributed communication is necessary. We also saw an example where JGroups was used to enable inter-node communication with both broadcast and unicast messaging.