Sagas and distributed transactions

Controversial opinion

I’ve never had a useful discussion about sagas with anyone at any company I’ve worked with. I’ve found that the people who bring sagas up and make a song and dance about them are generally the same people who tend to over-engineer solutions and have difficulty ‘keeping it simple’. I’d like to dispel a lot of the mystery around these processes, which seem so different from everything else in the event-driven world.

A web search for sagas will likely throw plenty of material at you about distributed transactions (ugh) – a concept I’d hoped we left behind with the unlaced Dr. Martens and eye-length fringes of the 1990s. In an event-driven microservices world, a distributed transaction is a poisonous concept.

Yes, I’m caveating this post with event-driven microservices, because that’s what I spend my life working with.

With eventual consistency, it doesn’t matter if a data change just made in Service A is at some point reversed by that same service. The reversal is just another step on the way to a consistent state, which is normal and shouldn’t be at all concerning.

But if distributed transactions aren’t a thing in our world, what’s the problem? Well, it turns out that a lot of engineers still cling to the old world and want something which explicitly tracks the progress of a cross-service flow. This is where the ‘saga discussion’ usually starts.

Eventual consistency

Service A completes some task and raises an event called Task1Completed. Services B, C, and F are all listening to that event because they either need to record something about the new state of the subdomain Service A represents, or they need to trigger some new process which is completely internal to themselves. As they finish doing what they need to, they raise events called Task2Completed, Task3Completed, etc.
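That fan-out can be sketched with a minimal in-memory event bus. Everything here – the bus, the handler names, the payload shape – is illustrative, not from any real framework:

```python
# Minimal in-memory publish/subscribe sketch (all names are hypothetical).
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, payload):
    # Events fan out to every subscriber; the publisher knows nothing about them.
    for handler in subscribers[event_name]:
        handler(payload)

log = []

# Services B, C, and F each react to Task1Completed for their own reasons.
subscribe("Task1Completed", lambda e: log.append("B handled " + e["task_id"]))
subscribe("Task1Completed", lambda e: log.append("C handled " + e["task_id"]))
subscribe("Task1Completed", lambda e: log.append("F handled " + e["task_id"]))

publish("Task1Completed", {"task_id": "t-42"})
```

The key property is that Service A raises its event and moves on; it has no idea who is listening or what they do next.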

This is pretty basic, so let’s drop in some more interesting requirements. Let’s say that Task1, which was completed by Service A, needs to be reversed if Task2 in Service B fails. The thing to remember here is that events are one-way: they indicate that something has happened; they don’t carry instructions. The way to convey the failure is to follow the same pattern: Service B raises a Task2Failed event, which Service A listens for. The Task2Failed event must contain enough state to allow Service A to do what it needs to, but that’s true of all events – there’s no nastiness here.

Let’s call the process of reverting Task1 ‘TaskX’. When TaskX completes, Service F needs to store some data about it, so Service F listens for the TaskXCompleted event.

This is completely normal event driven stuff, but it’s also a saga, which grows in complexity as we add more consumers of failure events. But we haven’t had to talk about sagas, or tracking – we haven’t even thought about anything beyond what should be considered bread and butter eventing.
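The whole compensation flow above can be sketched in the same plain pub/sub style. Again, every name and payload below is made up for illustration:

```python
# Sketch of the compensation flow: Task2Failed -> Service A reverts (TaskX)
# -> TaskXCompleted -> Service F records it. All names are illustrative.
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, event):
    for handler in subscribers[event_name]:
        handler(event)

audit = []

def service_a_on_task2_failed(event):
    # Task2Failed carries enough state for Service A to revert Task1 (TaskX),
    # then it announces the reversal as just another event.
    audit.append(f"A reverted {event['task1_id']}")
    publish("TaskXCompleted", {"task1_id": event["task1_id"]})

def service_f_on_taskx_completed(event):
    audit.append(f"F recorded reversal of {event['task1_id']}")

subscribe("Task2Failed", service_a_on_task2_failed)
subscribe("TaskXCompleted", service_f_on_taskx_completed)

# Service B hits a failure while handling Task2 and raises the event.
publish("Task2Failed", {"task1_id": "t-42", "reason": "validation error"})
```

Notice there’s no coordinator and no saga state table – the ‘saga’ is nothing more than services reacting to events, including failure events.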

More tools in the toolkit

The above is a pretty simple example, and I’ll wager that 95% of the time, you’ll be dealing with pretty simple examples. But what happens if the process is highly critical and much more complicated? I’m of the opinion that there are plenty of techniques available to keep track of things. These are the three techniques I rely on most:

Domain-driven design. The moment I see a well-defined, critical process running across multiple services, I start to think that I’ve scoped my services poorly, or perhaps that I’ve missed a subdomain to which this process belongs. Using DDD to help align service boundaries with bounded contexts will get rid of a good proportion of these processes. It might even be useful to have a new service orchestrate the flow across the other services using a combination of async commands and HTTP requests (see Open vs Closed Distributed Processes) – the coupling can be worth it, if it already exists conceptually within your domain.

You might think this is just following the ‘saga orchestration’ pattern, and you might be right, but the reason we reach that pattern is not to ‘solve a saga’, it’s well scoped services.
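An orchestrating service like that can be as plain as a function that runs the steps and unwinds them on failure. The service stubs below (stock reservation, payment) are hypothetical stand-ins for real async commands or HTTP calls:

```python
# Hypothetical orchestrator sketch: one service owns the flow, issues
# commands to the others, and compensates explicitly on failure.

def reserve_stock(order):
    # Stand-in for a command/HTTP call to a stock service.
    return True

def take_payment(order):
    # Stand-in for a command/HTTP call to a payment service.
    return order.get("card_valid", False)

def release_stock(order):
    # Compensating action for reserve_stock.
    return True

def place_order(order):
    """Run the cross-service flow; unwind completed steps on failure."""
    if not reserve_stock(order):
        return "rejected"
    if not take_payment(order):
        release_stock(order)  # the orchestrator owns the compensation logic
        return "payment_failed"
    return "confirmed"
```

For example, `place_order({"card_valid": True})` returns `"confirmed"`, while `place_order({})` releases the stock and returns `"payment_failed"`. The point isn’t the pattern’s name – it’s that the flow lives in a service whose boundary the domain justifies.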

Over-arching system integration tests. Any good developer will tell you that much can be learnt about a system by reviewing the tests. Implementing an automated test to verify the related event flows work together as expected gives a huge pointer to anyone working on that code in future. Name these tests well, in line with the ubiquitous language of the business, and the successful linking of these processes becomes strikingly clear.
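A sketch of what such a test might look like, with the name drawn from the ubiquitous language. The `run_flow` harness here is a hypothetical stand-in for whatever drives your real services in a test environment:

```python
# Hypothetical integration-test sketch. run_flow stands in for a harness
# that replays events through the system and returns the events raised.

def run_flow(events):
    raised = []
    for event in events:
        if event == "Task2Failed":
            # In the real system, Service A would revert Task1 and raise this.
            raised.append("TaskXCompleted")
    return raised

def test_failed_task2_causes_task1_to_be_reverted():
    # The test name documents the business rule linking the event flows.
    raised = run_flow(["Task1Completed", "Task2Failed"])
    assert "TaskXCompleted" in raised
```

The test body is trivial; the value is that the name and the asserted flow document, in one place, how the events across services relate.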

Well-conceived namespaces. This might seem too simple to be worth mentioning, but grouping code together in namespaces is how we show things are related. By placing the event handlers for the events concerned into the same namespace, we can convey their relationship to the next developer.
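As a hypothetical illustration, a layout along these lines puts the handlers for the related events side by side, so the relationship is visible at a glance:

```
service_a/
  handlers/
    task1_reversal/          # one namespace for the whole reversal flow
      task2_failed.py        # listens for Task2Failed, triggers TaskX
      taskx_completed.py     # raises TaskXCompleted when the revert is done
```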

Keep it simple

I’ve been building event-driven microservices for many years, almost as long as there has been such a thing as a microservice. In all that time, I’ve never had cause to care that I’m implementing a saga. Complexity isn’t an inherent property of a saga. Complexity is something we always mitigate, whether we’re working on a saga or anything else. The three simple tactics above are (not exclusively) great ways to tackle all sorts of complexity.

More to the point, I’ve seen NuGet packages appearing which purport to deliver some kind of ‘plug and play’ saga management, which I find insulting because they’re playing to the hype. You really don’t need anything special to handle a saga – it’s just business-as-usual event-driven design. In truth, if you break from your normal patterns to handle some special case because you’ve been sold on the saga capabilities of a third-party tool, you’re building a snowflake which will confuse the next developer to work on that code. Forget about it being a saga: if there is complexity, deal with it as you would any complexity.
