
Where is the “retry” in BPMN 2.0?

In his well-known article “Your Coffee Shop Doesn’t Use Two-Phase Commit”, Gregor Hohpe describes four strategies for dealing with failures in a business transaction:

  • Write-off
  • Retry
  • Compensation
  • Two-Phase Commit

How does this map to BPMN 2.0? Here are a few experiments I ran.

Compensation

In BPMN 2.0 we can model compensation explicitly:

If I detect that I have no milk after making coffee, I throw the coffee away. It is important not to serve coffee without milk, even at the expense of having an unsatisfied customer.
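The compensation model can also be sketched in code. This is a minimal illustration under my own assumptions, not how a BPMN engine implements compensation: each completed activity registers an undo action, and when “Add Milk” fails, the registered compensations run in reverse order. The names `CompensationScope`, `NoMilkError`, and `make_cappuccino` are hypothetical.

```python
class NoMilkError(Exception):
    """Raised by the (hypothetical) Add Milk activity when there is no milk."""


class CompensationScope:
    """Collects compensation handlers for completed activities."""

    def __init__(self):
        self._handlers = []  # (activity name, undo callable) in completion order
        self.log = []

    def complete(self, name, undo):
        # Record that `name` finished and how to compensate it later.
        self.log.append(f"completed {name}")
        self._handlers.append((name, undo))

    def compensate(self):
        # Undo completed work in reverse order of completion.
        while self._handlers:
            name, undo = self._handlers.pop()
            undo()
            self.log.append(f"compensated {name}")


def make_cappuccino(milk_available):
    scope = CompensationScope()
    cup = {"coffee": True, "milk": False}
    # Compensation for "Make Coffee" is throwing the coffee away.
    scope.complete("Make Coffee", lambda: cup.update(coffee=False))
    try:
        if not milk_available:
            raise NoMilkError("no milk left")
        cup["milk"] = True
        scope.complete("Add Milk", lambda: cup.update(milk=False))
    except NoMilkError:
        scope.compensate()
        return None, scope.log  # nothing is served: unhappy customer, no black coffee
    return cup, scope.log
```

Note that the failure of one activity (“Add Milk”) triggers the compensation of another (“Make Coffee”); the coordination lives in the process, not in either activity.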


Two-Phase Commit

According to the specification, a BPMN transaction subprocess can be used for 2PC-style transactions: “A Transaction is a specialized type of Sub-Process that will have a special behavior that is controlled through a transaction protocol (such as WS-Transaction)” (page 178). I won’t go into the ambiguity of the semantics of the BPMN 2.0 transaction subprocess here. For now, let’s simply assume that the transaction subprocess allows modelling potentially distributed transactions backed by two-phase commit:
Maybe this could be implemented by a system in which the coffee-making unit and the milk-adding unit are strongly coupled** and must reach consensus on whether they can produce a cappuccino atomically. If a failure occurs, the customer is still unhappy, but we do not waste any resources (no coffee is thrown away). Implementing this, however, might require more expensive and more complex infrastructure.
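To make the strongly coupled variant more concrete, here is a toy two-phase commit between the two units. It is an in-memory sketch that ignores timeouts, coordinator crashes, and durable logging, all of which a real transaction protocol has to handle; the `Unit` class and `two_phase_commit` function are hypothetical.

```python
class Unit:
    """A hypothetical participant, e.g. the coffee-making or milk-adding unit."""

    def __init__(self, name, can_do):
        self.name = name
        self.can_do = can_do
        self.state = "idle"

    def prepare(self):
        # Phase 1: reserve resources and vote, without consuming anything yet.
        self.state = "prepared" if self.can_do else "refused"
        return self.can_do

    def commit(self):
        # Phase 2 (success): actually brew the coffee / add the milk.
        self.state = "committed"

    def abort(self):
        # Phase 2 (failure): release reservations; nothing was wasted.
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: everyone voted yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2: at least one "no" vote, so everyone aborts.
    for p in participants:
        p.abort()
    return False
```

With `Unit("coffee-making unit", True)` and `Unit("milk-adding unit", False)`, the call returns `False` and both units end up `"aborted"`: no coffee is brewed, so none has to be thrown away.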

Write-off

Write-off means that you detect an error, but you don’t need to (or don’t want to) handle it right now.

So if there is no milk, I simply give up.

It could also mean that I drink the coffee anyway.

This means that the “error” exists, but I don’t make it explicit in the diagram. I know that I might run out of milk (actually, it happens to me all the time!), but that “error” is handled at a lower level, in the implementation of the “Add Milk” activity. So in this case we do not model the error, BUT: the “Drink Coffee” activity can handle both coffee with milk and coffee without milk, and we know it (I think that this awareness is the important part).
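Here is a minimal sketch of that lower-level write-off, assuming a hypothetical `fridge` dictionary as the milk supply: the missing-milk error is caught and swallowed inside the “Add Milk” implementation, and “Drink Coffee” is written to accept either kind of cup. All names are illustrative.

```python
class NoMilkError(Exception):
    """Hypothetical error raised when the fridge has no milk."""


def fetch_milk(fridge):
    if fridge.get("milk", 0) == 0:
        raise NoMilkError("out of milk")
    fridge["milk"] -= 1
    return "milk"


def add_milk(cup, fridge):
    """'Add Milk' activity: a missing-milk error is written off here."""
    try:
        cup.append(fetch_milk(fridge))
    except NoMilkError:
        pass  # write-off: carry on with black coffee
    return cup


def drink_coffee(cup):
    """'Drink Coffee' deliberately accepts coffee with or without milk."""
    return "white coffee" if "milk" in cup else "black coffee"
```

The error never reaches the process level, which is only safe because `drink_coffee` was designed with both outcomes in mind.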

Adding milk is optional, but still desirable. So even if I cannot do anything about it now, I might want to do something about it later:

In my opinion this is still a write-off, since I am still drinking this particular cup of coffee without milk.
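One way to sketch “doing something about it later” is to record a follow-up item when the error is written off, while this cup still stays black. The `follow_ups` list and the “Buy milk” item are illustrative assumptions; in a process model the follow-up might be a separate activity or process.

```python
def add_milk_or_note(cup, milk_available, follow_ups):
    """Write off the missing milk now, but remember to fix the cause later."""
    if milk_available:
        cup.append("milk")
    else:
        # This cup stays black (write-off); the follow-up only helps the next cup.
        follow_ups.append("Buy milk")
    return cup
```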

Retry

Retry can be implemented at different levels. 

First, we can of course model the retrying of an activity explicitly:

(Ok, granted, maybe the retry doesn’t make much sense in this particular real-world example 🙂 )
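An explicitly modeled retry is essentially a loop in the process: execute the activity, check the outcome at a gateway, and loop back if it failed. The sketch below mirrors that structure; the `max_attempts` limit and the “gave up” outcome are assumptions, since a modeled loop needs some exit condition.

```python
def run_with_modeled_retry(activity, max_attempts):
    """Loop: activity -> gateway 'succeeded?' -> back to activity, or give up."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if activity():  # the activity reports success as True
            return "done", attempts
    return "gave up", attempts
```

Because the loop is part of the model, the attempt limit and the give-up path are visible to anyone reading the diagram.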

Second, the retry can be handled by the process engine. For example, the Activiti process engine supports the concept of asynchronous continuations: it allows you to define a safe-point before an activity. When execution reaches the safe-point, the process engine commits its current transaction and releases the thread. A background thread then executes the activity and, if it fails, retries it until it eventually succeeds (or until the configured number of retries is exhausted). None of this is visible in the process diagram; it is configured using vendor extensions in the process model.
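The following is a simplified model of an asynchronous continuation, not Activiti’s actual implementation (in Activiti, the safe-point itself is declared with the `activiti:async="true"` attribute). Reaching the safe-point persists a job and returns immediately; a job executor later runs the activity and, on failure, decrements a retry counter. `ToyEngine`, `Job`, and the default of three retries are assumptions for illustration, and `run_jobs` is called explicitly instead of from a background thread so the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    """A persisted 'continue here' marker created at the safe-point."""
    activity: Callable[[], None]  # raises an exception on failure
    retries: int = 3  # assumed default; real engines make this configurable
    done: bool = False
    last_error: str = ""


class ToyEngine:
    """Hypothetical engine illustrating asynchronous continuations."""

    def __init__(self):
        self.jobs = []  # stands in for the engine's persistent job table

    def reach_safe_point(self, activity):
        # "Commit" the current transaction by persisting a job, then return,
        # releasing the caller's thread before the activity has run.
        job = Job(activity)
        self.jobs.append(job)
        return job

    def run_jobs(self):
        # One pass of the job executor; a real engine does this in background threads.
        for job in self.jobs:
            if job.done or job.retries == 0:
                continue  # finished, or out of retries (needs manual attention)
            try:
                job.activity()
                job.done = True
            except Exception as exc:
                job.retries -= 1  # roll back and leave the job for a later pass
                job.last_error = str(exc)
```

The point of the sketch is the split: the caller’s transaction ends at the safe-point, and each retry runs in a new unit of work, which is exactly why the safe-point affects threading and transactions.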

Third, the retry can be handled by asynchronous messaging middleware. We put a message in a queue, and the middleware keeps trying to deliver it until delivery eventually succeeds.
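Middleware-level retry can be sketched with an in-memory queue that puts a message back whenever delivery fails. This only stands in for a real broker’s redelivery mechanism (brokers usually add redelivery delays and dead-letter handling, which are omitted here); `deliver_all` and the `max_deliveries` cap are hypothetical, and the cap exists only so the sketch always terminates.

```python
from collections import deque


def deliver_all(messages, consumer, max_deliveries=10):
    """Deliver each message, re-enqueuing it after a failed delivery."""
    queue = deque((msg, 0) for msg in messages)
    delivered, dead = [], []
    while queue:
        msg, attempts = queue.popleft()
        try:
            consumer(msg)
            delivered.append(msg)
        except Exception:
            attempts += 1
            if attempts < max_deliveries:
                queue.append((msg, attempts))  # redeliver later
            else:
                dead.append(msg)  # give up (only so this sketch terminates)
    return delivered, dead
```

The process only enqueues the message; whether delivery takes one attempt or five is invisible to it.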

Finally, the retry can be handled at the service level. In this case it is transparent to the service consumer: we simply call the service and wait for a callback response. In the meantime, the service might perform retries internally, perhaps also using a message queue, without the process being aware of it.
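At the service level, the consumer just sends a request and supplies a callback; any retrying happens inside the service. The sketch below is synchronous for brevity, so the callback fires before `request` returns, whereas a real service would respond later, for example via a message. `CoffeeService`, `request`, and the retry budget are illustrative assumptions.

```python
class CoffeeService:
    """Hypothetical service that hides its own retries from the caller."""

    def __init__(self, backend, max_retries=3):
        self.backend = backend  # flaky dependency the service calls internally
        self.max_retries = max_retries
        self.internal_attempts = 0  # exposed for inspection only; callers ignore it

    def request(self, order, callback):
        for _ in range(self.max_retries + 1):
            self.internal_attempts += 1
            try:
                result = self.backend(order)
            except ConnectionError:
                continue  # internal retry, invisible to the process
            callback(("ok", result))
            return
        callback(("failed", order))  # retries exhausted: report failure once
```

From the process’s point of view there is exactly one request and exactly one callback, however many attempts happened in between.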

I find it very interesting that the retry pattern can be implemented at different levels; this makes it somewhat different from the other patterns:

  • This is different from the way compensation is handled: since the failure of one activity leads to the compensation of another activity, compensation is best modeled at the process level. If one service triggered the compensation of another one directly, we would introduce explicit dependencies between services.
  • If retry is handled at the service or middleware level, the service invocation must be asynchronous (from the point of view of the process engine). This means that you need either polling or callbacks to retrieve the results and continue the process instance.
  • If retry is handled by the process engine, you need some concept of a “safe-point” before the retried activity (cf. asynchronous continuations in Activiti). This usually influences threading and transactions.

Furthermore, we can make the following two observations:

  • Every service that is unavailable is retryable. What I mean is: not every service is retryable by its nature (see the cited article by Gregor Hohpe for details). However, if a service call fails because the service is unavailable (or because the messaging system is unavailable), the call can always be retried.
  • Whether to apply retry sometimes depends on the context of the service invocation. You might want to handle the failure of the same service using retry in one process but using compensation in another.
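Both observations can be combined in a small decision function: unavailability is always answered with a retry, while for other failures the calling process chooses the strategy. The exception types and the `choose_handling` function are hypothetical, intended only to illustrate the idea.

```python
class ServiceUnavailable(Exception):
    """The service (or the messaging system) could not be reached."""


class BusinessFailure(Exception):
    """The service was reached but could not fulfil the request."""


def choose_handling(error, on_business_failure):
    """Decide how a process reacts to a failed service call."""
    if isinstance(error, ServiceUnavailable):
        return "retry"  # unavailability is always retryable
    if isinstance(error, BusinessFailure):
        return on_business_failure  # depends on the calling process
    raise error  # any other failure is outside this sketch
```

The same `BusinessFailure` can then lead to `"retry"` in one process and `"compensate"` in another, simply by passing a different `on_business_failure` value.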

From this, I conclude that, in general, it is useful to have some concept of retry at the process-engine level.

However, since implementing retry at the process-engine level requires some concept of a “safe-point”, which influences the execution semantics of the process, I am asking myself whether the “safe-point” should be visible in the process diagram.

———-

** Interestingly, the article is only available in the German Wikipedia.
