How to Design Decoupled Systems

Decoupled architecture is one of the biggest enablers of agility, speedy delivery, and high quality; yet many software designers have limited experience of what ‘decoupled’ actually looks like. I hear people talk about varying degrees of indirection and API layers, without understanding that these don’t inherently decouple anything.

I’ve been confronted with the insistence that because there’s a queue involved, two services couldn’t possibly be coupled – even though they could rarely be worked on in isolation or redeployed without impacting each other.

To me, coupling isn’t just the obvious stuff. Software can be coupled in many different ways, some human as well as conceptual, and depending on your role, you’ll tend to notice different kinds of coupling.

  • An Architect might see that multiple systems have to carry out the same very precise process in order for data to match up later on, downstream.
  • A Developer might see that they can’t deploy a change to one service without updating another.
  • A Delivery Lead might find that they can’t integrate with a particular platform without first arranging time for another team to do some work.
  • A QA might notice that some functionality can’t be tested in isolation.
  • An Ops person might notice that there is competition for a single server’s resources between multiple services.

These are all different ways in which systems can be coupled. They can all impact delivery, performance, and reliability.

Speak to any one of these roles about the problem, and each will give you a different solution, specific to their role. Each will solve their own piece of the puzzle by adding new working practices, or 3rd party services, or a new class of tests, and so on, but each solution will introduce additional complexity and each will be applied indiscriminately. What’s more, each solution is just a sticking plaster over a much bigger problem which isn’t getting addressed at all.

I have spent the majority of my professional career as a contractor. I’ve been lucky enough to have worked with some of the smartest people in the industry, at some of the most forward thinking businesses in London. In that time, I’ve seen a pattern. It isn’t a secret formula – plenty of other technologists see the same, but it is far easier to implement than many believe.

In this post, I want to tackle the very broad question of how we can avoid coupling problems altogether.

Something to solve

First of all, we need a software solution to design – something which is reasonably realistic, but isn’t too complicated, or this will be an incredibly long post.

Let’s tackle a new sales web portal for a business which normally sells via showrooms and by phone order. We’ll assume they currently have a monolithic architecture and are experiencing all the problems that come with that after a decade or more of neglect. Where do we start?

In the beginning

I want to start with an idea which should be obvious but never is: if this new sales portal is to be decoupled, it doesn’t have to obey any current conventions, it doesn’t have to call any existing systems, it is… “de-coupled”!

When you spend your life working with a particular system which everyone knows about, and which every piece of software written ends up talking to, it can be a surprisingly difficult leap to accept that something new can be whatever it needs to be. We can literally start afresh, with concepts that work for the new system and might not even exist anywhere else. Or maybe a concept is familiar from a different service, but we need to use it in a different way. We can design very specific architecture, which differs from what’s gone before because it’s intended to work for this system, not anything else. We have the freedom to build something really good, and that should already excite you.

This has just become the type of design problem software engineers should eat for breakfast. We need a UI, we need a database, we need some authentication and security, probably a firewall, reverse proxy, and some load balancing. This is definitely sounding more refreshing than worrying about integrating with any big, legacy monolith.

The stocklist

This is going to be a sales system, so the main information we are capturing is the creation of a sale. A customer sitting at home will hopefully be happy to click the button which creates the sale record, but where will the stock list come from?

This data currently sits in a different system’s database. We could just query the data directly, but that would be a form of coupling. We’d have a direct dependency on that database being available, and we’d be impacting the performance of one system by adding the load of another.

We could build an API in front of the existing database and call that. This is a little cleaner than going directly to the database, but it’s still coupling. If the database is unavailable for any reason, our sales system fails. If we have a flurry of sales, we can impact the other application using the database.

We can place as many layers of abstraction between the two systems as we like, but ultimately we will always have some form of coupling. Querying the data by API or database doesn’t just couple the applications; it couples the teams who work on them.

If you want to make a change to a system, but in order to do so you have to synchronise that change with another development team, then those systems are coupled. Even if the changes are kept within your team – if you can’t update one system without updating another, that’s coupling.

Depend on yourself

So querying data from an external system is not decoupling. But we still need that data if we want a user to select a product to purchase. If we aren’t going to go and get that data on demand, we must already have it in our new service. That would mean we only depend on ourselves – we’d be well on the way to ‘decoupled’.

So how do we get the stocklist from the legacy database into the new sales portal, and keep it up to date?

Moving data

Once upon a time, back in the 1990s, ETL ruled the world of data movement. Every enterprise worth talking about ran ever-growing overnight data transfers, starting as soon as they believed everyone would be heading home from the office, and hopefully finishing before anything important started the next day.

Unfortunately, this approach only works for so long. Eventually there are just too many data jobs needing to move and copy too much data for them all to complete overnight. IT departments start to spend huge amounts of money on bigger and faster arrays of disks and faster (and much more expensive) WAN links. Massive sums would be invested before anyone dared to suggest it was a losing battle.

What’s more, we aren’t just talking about moving data. A ‘Stock Item’ in the legacy database will certainly have all sorts of properties which we aren’t interested in. It might have property names which are simply outdated and no longer make sense, and while it might feel like a safer option, continuing to reuse terminology which is no longer appropriate would introduce yet another form of coupling.

We don’t just want to copy the stock list. We want to have our own list of objects with properties which only support the new portal. We want to have a new concept owned by the sales system.
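
As a quick sketch of what that might look like in TypeScript (every name here is hypothetical – the point is that the shape serves the portal and nothing else):

// A minimal sketch of the portal's own stock concept. All names are
// hypothetical and deliberately not taken from the legacy schema; the
// shape exists purely to serve the new portal.
interface PortalStockItem {
  id: string;                   // the portal's identifier for the item
  displayName: string;          // what the customer sees on the product page
  availableQuantity: number;    // locally held stock level
  price: string;                // display price, e.g. "$300"
  shippableCountries: string[]; // where geographical rules allow shipping
}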

Events

And so we come to events.

Events are a specific type of message where the originating system doesn’t care how (or if) they are consumed.

Implemented correctly, events can keep the source and target systems decoupled. Firstly, the system raising the event defines the payload. There is no consideration given to mapping properties to something consumers would prefer – the source system knows nothing about the consumers. The payload of the event is structured and named as per the source system’s own standards.

Event consumers use an anti-corruption layer (ACL) to make the event payload consumable. The ACL is not necessarily a separate service – it could simply be a mapping class in the consumer itself – but that ACL code is coupled specifically to the event it is designed to receive.

The ACL is coupled to the events, not to the source systems. An event is a static thing which will always look the way it looks: the payload will always be structured according to the same rules, even if the source system is updated or changed. It is not possible to change an event type after it has gone into use; it can only be replaced or re-versioned (more on that below).

There isn’t just one right solution for which events we could use to keep the stock list up to date, but let’s look at two events which go some of the way: NewStockItem and StockLevelChanged.

{
  "EventType": "NewStockItem",
  "EventVersion": 1,
  "Payload": {
    "StockItemId": "TCW567",
    "StockItemName": "The Cool Widget",
    "InitialStockLevel": 200,
    "RetailPrice": "$300",
    "CanShipToCountry": [
      "UK", "Australia", "USA", "France"
    ]
  }
}

{
  "EventType": "StockLevelChanged",
  "EventVersion": 1,
  "Payload": {
    "StockItemId": "TCW567",
    "CurrentStockLevel": 50
  }
}

The first event tells our system that a new product is now available and provides some additional information about how the item can be sold. The second event is much more specific and doesn’t contain much data – it just sets a new stock level.

There are different levels of granularity which could be used here without breaking the pattern. Perhaps all updates to The Cool Widget could come via a StockItemUpdated event; but sometimes making events a bit more specific can save additional work later, when a consumer is only interested in changes to the price, for example. We might not want a consumer to be constantly comparing the updated record with the previous record to find out what’s changed.
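
To make the ACL idea concrete, here’s a minimal TypeScript sketch which maps the raw NewStockItem payload into the portal’s own concept (names are illustrative, and PortalStockItem matches the earlier sketch):

// Anti-corruption layer sketch (hypothetical names throughout). Source
// system naming never leaks past this one mapping function.

// The event payload exactly as the source system publishes it.
interface NewStockItemV1 {
  StockItemId: string;
  StockItemName: string;
  InitialStockLevel: number;
  RetailPrice: string;
  CanShipToCountry: string[];
}

// The portal's own concept, as sketched earlier.
interface PortalStockItem {
  id: string;
  displayName: string;
  availableQuantity: number;
  price: string;
  shippableCountries: string[];
}

function mapNewStockItemV1(payload: NewStockItemV1): PortalStockItem {
  return {
    id: payload.StockItemId,
    displayName: payload.StockItemName,
    availableQuantity: payload.InitialStockLevel,
    price: payload.RetailPrice,
    shippableCountries: payload.CanShipToCountry,
  };
}

Everything downstream of this function sees only the portal’s own naming; if the consumer ever moves to a new event version, this is the only code that changes.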

Versioning events

Maybe the last obvious point of coupling is this: if the source system needs to change the event payload in some way, it can break consumers. This is why events in the wild shouldn’t change once they’re in use. That’s not to say you’re stuck with the way things work; there is a way to introduce change without reintroducing coupling.

There’s a concept in SOLID development: the Open-Closed Principle – open for extension, closed for modification. This applies to events too.

Let’s say the architectural team has decided that exposing the StockItemId is leaking internal concepts (because it’s the internal primary key for the source system). They want to stop using that and start using a StockItemRef property, which is a shared identifier. Until all systems consuming the above events have been updated to use StockItemRef, we can’t stop publishing the StockItemId altogether. We handle this by versioning the events.

{
  "EventType": "NewStockItem",
  "EventVersion": 2,
  "Payload": {
    "StockItemRef": "SI00k45",
    "StockItemName": "The Cool Widget",
    "InitialStockLevel": 200,
    "RetailPrice": "$300",
    "CanShipToCountry": [
      "UK", "Australia", "USA", "France"
    ]
  }
}

{
  "EventType": "StockLevelChanged",
  "EventVersion": 2,
  "Payload": {
    "StockItemRef": "SI00k45",
    "CurrentStockLevel": 50
  }
}

The source system is updated so it raises both versions of the events. This doesn’t immediately break anything and allows event consumers to move to version 2 in their own time. Once there are no consumers of version 1, the source system is updated so it only raises version 2.
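
As a sketch, dual publishing during the migration window might look something like this (publish and lookupSharedRef are stand-ins for whatever transport and identifier mapping the source system really has):

// Sketch of the source system raising both event versions side by side.
// `publish` and `lookupSharedRef` are assumed stand-ins, not a real API.
declare function publish(event: object): void;
declare function lookupSharedRef(stockItemId: string): string;

function raiseStockLevelChanged(stockItemId: string, currentLevel: number): void {
  // Version 1 keeps existing consumers working, completely untouched.
  publish({
    EventType: "StockLevelChanged",
    EventVersion: 1,
    Payload: { StockItemId: stockItemId, CurrentStockLevel: currentLevel },
  });

  // Version 2 carries the shared reference; consumers migrate when ready.
  publish({
    EventType: "StockLevelChanged",
    EventVersion: 2,
    Payload: {
      StockItemRef: lookupSharedRef(stockItemId),
      CurrentStockLevel: currentLevel,
    },
  });
}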

The strategy of versioning events decouples not just the services, but the development teams building them; one team does not have to work in sync with the others, and the change rolls out gradually.

Eventual consistency

There will be some people reading this who have spotted a possible issue with holding the data in two places. Which is the system of record for a sale?

If the new sales portal sells what it thinks is the last Cool Widget, at the same time as someone buys it in the store, what happens?

It sounds like a bit of a nightmare for a business to be in, but let’s put some perspective on this.

  1. If you’re selling out of something, that’s great, it means it’s a popular item – be happy.
  2. The alternative to this architecture will cause problems at some point. Decoupling is the only way to scale painlessly.
  3. There are ways to avoid this situation using other business rules (only allow the last few to be sold in store, or make sure stock levels never get that low) – see the sketch after this list.
  4. A phone call to say “I’m sorry, that’s a very popular item – the last one was sold just before you ordered. I can recommend this item instead, or you can wait until we have more in stock” is not going to hurt anyone.
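
For point 3, a buffer rule becomes trivial once the portal owns its own stock levels. A minimal sketch, with a purely illustrative threshold:

// Stop offering an item online once the locally held stock level drops
// below a safety buffer, leaving the last few units for in-store sales.
// The threshold is purely illustrative.
const IN_STORE_BUFFER = 5;

function isAvailableOnline(locallyHeldStockLevel: number): boolean {
  return locallyHeldStockLevel > IN_STORE_BUFFER;
}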

The modern world is so used to eventual consistency that we don’t even realise it’s happening. High-performance systems make use of this practice all the time, because it’s more efficient to handle the few clashes that arise than to wait for one big system of truth to keep up.

So in answer to the question “which is the system of record for a sale?” – both are.

The sales portal is the system of record for an online sale; the original system is the system of record for the other, existing types of sale. Sales from each are equally valid within their own context. Accidentally overselling stock gets managed by other business processes.

The online sales subdomain

I talk about the subdomain in reference to the Domain-Driven Design concept of a subdomain. Choosing to scope our portal so it contains all core domain concepts of selling online means we have a clear boundary of responsibility. It means that even if we get other things wrong, our sales portal shouldn’t need to go elsewhere for any processing related to selling online. It solidifies the self-contained nature of the decoupled architecture we are chasing.

If everything else in the business breaks, we can still sell online, because we are not dependent on anything else. This also means that we don’t need any other systems to be running in order to carry out all functional testing of this subdomain.

This is the removal of yet another type of coupling – the need for having external dependencies deployed at just the right version in order to test what’s been built. The effort of coordination across teams and departments is an ever-growing overhead that we can do without.

In fact, because we version our events, the sales portal no longer cares what happens once an event is raised. Any consumer would write their own tests to make sure those events are consumed correctly. This class of testing (a type of contract testing) should be automated and can benefit from projects such as Pact, but I find such third-party helpers a little opinionated, overly complicated, and inflexible – plenty will disagree, so make your own choice.
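
For flavour, a hand-rolled consumer-side check can be as simple as asserting that an event sample still has the shape the consumer was built against. A minimal sketch (in practice the sample would be loaded from the producer’s published examples, not defined inline):

import assert from "node:assert";

// A hand-rolled consumer-side contract check: assert that the event shape
// the consumer depends on hasn't drifted. The inline sample is illustrative;
// a real test would load it from the producer's published examples.
const sampleEvent = {
  EventType: "StockLevelChanged",
  EventVersion: 2,
  Payload: { StockItemRef: "SI00k45", CurrentStockLevel: 50 },
};

assert.strictEqual(sampleEvent.EventType, "StockLevelChanged");
assert.strictEqual(sampleEvent.EventVersion, 2);
assert.strictEqual(typeof sampleEvent.Payload.StockItemRef, "string");
assert.strictEqual(typeof sampleEvent.Payload.CurrentStockLevel, "number");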

There can be no doubt that by bringing all related concepts together in one system, we make testing much easier. It also helps us keep things simple – there’s no need to have a myriad of messages flying around until something solid has actually happened. The completion of a sale is a very solid thing – it deserves an event. Overly distributed systems end up sharing concepts and business logic, preventing developers from writing clear unit tests and forcing logic to be tested at the integration level – huge coupling which we are trying to avoid.

So which concepts will belong to our new portal? This is my list (but it isn’t necessarily the only valid list).

  • Shopping cart
  • Stock items and availability
  • Geographical based availability rules
  • Recording of sales
  • Taking payment
  • Storing customer information for subsequent visits
  • Identifying recurring customers

There’s nothing here about managing the fulfilment of a sale, or marketing emails, or tracking deliveries, etc. These are all concepts belonging to other subdomains, and they would unnecessarily complicate the sales portal’s code.

Everything is a microservice

There are a LOT of people who scoff at the microservice concept on social media. Whenever I dig into what they’re saying, their problems (almost) always come from poorly chosen abstractions. A microservice must comprise a complete abstraction – it isn’t just the smallest thing you can make. So let’s put the naysayers to one side and focus on the properties of a microservice which mean it’s decoupled.

  1. Dedicated data store which isn’t shared – nothing else ever accesses its database.
  2. It’s deployed in a way which doesn’t depend on other systems – maybe into a container, or simply onto its own micro instance.
  3. It can be tested in isolation.
  4. It represents a business capability.
  5. Logic is contained within the service – no mapping of concepts by message transport systems.

The list goes on (thank you Martin Fowler).

So we will have similar rules in place for our sales portal.

  1. Nothing but the sales portal will ever access the database.
  2. The sales portal application will be deployed in a container.
  3. The database won’t be in the container, but it will be updated by the application itself as new versions are deployed. (The sales portal owns and manages its database.)
  4. The build pipeline will run unit tests – 100% of business logic should have unit tests. These unit tests will form part of the test approach, and will be acknowledged and observed by QA.
  5. The deploy pipeline will run developer-written integration tests which will use the portal’s API (events and endpoints) – they won’t read or write directly to the database; the portal will expose whatever API is needed to set up test scenarios (see the sketch after this list).
  6. The deploy pipeline will automate the configuration of reverse proxies, load balancers, DNS entries, and any other traffic rules required.
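
As a sketch of point 5, a developer-written integration test drives the portal only through its API – never the database. The endpoints here, including the test-support one used to arrange the scenario, are assumptions for illustration (Node 18+ for the global fetch):

import assert from "node:assert";

const BASE_URL = process.env.PORTAL_URL ?? "http://localhost:8080";

async function lastItemCanBeSold(): Promise<void> {
  // Arrange: set up the scenario through the portal's own API,
  // never by writing to its database.
  await fetch(`${BASE_URL}/test-support/stock-items`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      id: "TCW567",
      displayName: "The Cool Widget",
      availableQuantity: 1,
    }),
  });

  // Act: complete a sale through the public endpoint.
  const response = await fetch(`${BASE_URL}/sales`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ stockItemId: "TCW567", quantity: 1 }),
  });

  // Assert: the sale was recorded – again, purely via the API.
  assert.strictEqual(response.status, 201);
}

lastItemCanBeSold().catch((err) => {
  console.error(err);
  process.exit(1);
});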

By isolating the portal in its own container, and by strictly following microservice concepts, we prevent most opportunities for cascading errors from other systems – another form of coupling avoided.

Finally decoupled

So, now our sales portal doesn’t depend on other systems. It can scale horizontally without impacting other systems. Developers can work on it and make changes without having to wait for other development teams to align. It can be tested and deployed in isolation, without impacting anything other than itself.

This is decoupled, and it isn’t difficult to do. In fact, I think 85% of decoupling is easier than depending on external APIs, because it’s just a very simple stand-alone application design problem which we’ve all been pretty good at since we were junior devs.

I’d really be interested to hear whether you are managing to build decoupled software – and if you aren’t, what’s stopping you?
