Conway’s Shackles?

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.

Melvin E. Conway

When someone builds a bridge, or a house, or a road, they do so with the intention that it can remain in place, unchanged for decades. Conway’s Law tells us that a software system will follow the ‘shape’ of the organisation which builds it. So if we build software with the rigidity and ‘final plan’ view with which we build bridges, will the architecture of the software prevent the organisation from evolving?

Let’s consider an imaginary company ‘Widgets Are Cool’. For decades, they’ve been selling widgets both through shop fronts and over the phone. Their entire software infrastructure is built on top of a point of sale (POS) application and an order tracking system. Neither of these is particularly rich in functionality, but they have worked well for the business so far. In fact, Widgets Are Cool is the number one supplier of widgets in the UK, with several large outlets across the country.

Recently a new CEO took over and decided that the business needed to move into the new century and focus on an online presence. Undoubtedly a good idea, and surely can’t be that difficult to do…

From the top

Initially, Widgets Are Cool has two sales channels:

  1. Over the phone sales
  2. Over the counter sales

The way people buy widgets differs little between these channels; what you can discuss and agree with a shop assistant over a counter, you can generally discuss and agree with a call handler over the phone. This alignment isn’t accidental – the shop assistant is working with the same Point of Sale application that the call handler has in front of them. They both have access to the same functionality.

When they review the functionality required by both scenarios, the Software Architects decide that a digital presence will most closely fit the phone sales channel. It will need access to the POS system so it can process payments against an inventory, and it will need access to the order tracking system so it can initiate order fulfilment and delivery. So they plan for two points of integration, one into each system.

Into the cloud

The existing systems are deployed into some hired space in two data centres, with the POS system including an on-prem database for each location which is synchronised back to the data centres to build a combined read model of the enterprise.

The architects are excited to move into the cloud as they build their digital platform. They decide to use some very simple web application containers which can be scaled horizontally either manually or automatically, based on load.

This is actually a really good idea for a first attempt at working in the cloud. Both AWS and Azure offer their own versions of simple to deploy, easy to scale out, web products. AWS has Elastic Beanstalk and Azure has Azure Web Apps. It makes it very easy to make progress very quickly.

There then follows a heated debate on the best way to have the cloud hosted web application talk to the existing systems. This is when someone hits upon an old favourite: building a unified API layer with an Enterprise Service Bus.

Four teams

There is already a team responsible for the upkeep and customisation of the POS system. This team work closely with the sales staff to make sure their needs are met. There’s a second team already in place to work with the order tracking system; they work closely with the manufacturing team and warehouse staff to make sure their needs are met. The CTO decides that they also need two new teams:

  1. An Integration team to build the ESB.
  2. A Digital team to build the web platform.

So far, everyone at Widgets Are Cool is excited about their new direction and what it can bring. Unfortunately, they’ve already made all the mistakes which will end up costing them significantly more than need be.

2 months in

After a couple of months, there’s an obvious pattern forming. Features are suggested by the business, discussed by architects, and split down into items of work for each team. Generally each feature requires something along the lines of:

  1. An API exposing some functionality of the POS system.
  2. An API exposing some functionality of the order tracking system.
  3. A new API in the ESB giving some level of abstraction for the back end systems.
  4. Changes to the new web application to consume the new API on the ESB.

The web application has a hard dependency on the ESB, which has a hard dependency on both APIs it abstracts, which have hard dependencies on the underlying business systems.
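To make that chain concrete, here is a minimal sketch in Python. The component names and the `affected_by` helper are illustrative assumptions rather than anything from a real system; the code simply walks the dependencies described above to show what is exposed when one component changes.

```python
# Hard dependencies as described above: each key depends directly on the
# components in its list. All names are illustrative assumptions.
DEPENDS_ON = {
    "web_app": ["esb"],
    "esb": ["pos_api", "order_tracking_api"],
    "pos_api": ["pos_system"],
    "order_tracking_api": ["order_tracking_system"],
    "pos_system": [],
    "order_tracking_system": [],
}


def affected_by(changed: str) -> set[str]:
    """Return every component that depends, directly or transitively, on `changed`."""
    affected: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for component, deps in DEPENDS_ON.items():
            if current in deps and component not in affected:
                affected.add(component)
                frontier.append(component)
    return affected


if __name__ == "__main__":
    # A change to the POS system can ripple all the way up to the web app.
    print(sorted(affected_by("pos_system")))
```

In this sketch nothing depends on the web application, yet the web application is exposed to a change in any component below it.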

When a feature is being built, everyone tries their best to get ahead by building as much as possible before their dependencies are ready. What actually happens is that once the APIs are in place in front of the business systems, the integration team realise there are gaps, or bugs, in the design. They either try to work around this, or send the API teams back to make changes. Working around missing functionality generally means making multiple calls, which makes the service less performant; getting the API team to fix it takes additional time.

Once the integration team have got into a testing phase, the digital team start to consume their API. The digital team start to find gaps, or data which is formatted ineffectively for how they want to use it. They now either work around the issues or ask the integration team to make changes. If the integration team make changes, they could find they need the API teams to make changes, which they may find can’t be made because the underlying business systems don’t allow it.

Regression

What’s worse is that whenever there’s more than a small amount of work carried out downstream of the digital team, other things seem to break, and often this will go unnoticed for a few days.

There are relationships between entities in the underlying systems which have been ‘bent’ to fit the way the business works. Exposing these brings some odd data structures and concepts which aren’t relevant outside of the business application. Because outside of the digital team there is little appreciation of the user flows they’re building, or how they relate to the underlying systems, no-one has a clear view of how and where to place abstractions.

The business system APIs expose the business systems’ internals. The integration API exposes the models and concepts exposed by the business system APIs. So the web application ends up with data structures in its JavaScript which are owned by the underlying business systems!

Aligning the teams

Middle management see that the development efforts must be aligned. They have been working in a pseudo-agile way with sprints and some scrum ceremonies, so they decide to have a ‘scrum of scrums’ to coordinate everything.

Unfortunately they can’t align their way around the fact that the design of the web applications is very fluid – often being changed after something is built and found to be a bit clunky. They find that it isn’t really possible to ‘manage’ quality into software – simply telling teams to communicate more is just lip service. During the scrum of scrums, people talk about what they’re planning to do, but each team only has their small piece of the puzzle – it’s quite easy for someone to sit and listen to how a developer is about to change everything they’re building against but not realise there’s an impact.

Now there’s pressure building

Some of the web application looks pretty good, but there are places where it’s either incredibly slow, or particularly awkward to use. Testers find that getting end to end successful journeys is hit and miss. Bugs are raised which bounce around from team to team with no-one wanting to take on the additional work. Several basic features are still not complete and are going through a second or third redesign. The deployment of new code has been restricted to twice a day, meaning there’s no end to end integration test feedback until the next day. The business want to release soon but want it finished first.

What’s also apparent is that several key engineers are not happy with the way the project is being managed and are looking for other jobs.

The project suffers its biggest setback so far when someone tries to upgrade the POS system to the latest version and discovers that it not only breaks their API, it also completely invalidates several orchestrations in the ESB, and even causes JavaScript errors in the web application!

New digital requirements

It takes more than a year to get the web application released. That’s three times as long as the business had planned for, and far more than three times the cost.

Once it’s in place, the business want to ‘double down on digital’ to get the most out of their investment, so they decide to create an industry-first online widget customisation platform. They don’t just want users to be able to buy their pre-designed widgets, they want them to be able to order something unique with every visit.

The architects start discussing the idea, but the business is dismayed to hear estimates of one, or even two, years. When pressed to show their reasoning, the architects explain that the back end systems will need extensive customisation to provide this functionality. They might have to replace the POS system entirely with something more expensive, which would incur a rebuild of parts of the original digital platform.

What went wrong?

When this approach was being considered, there will have been conversations about ongoing management of the platform, reducing complexity, sharing functionality, and utilising existing resources. Principles will have been established, such as not writing anything more than once, and trusting the ESB to introduce resilience.

These all sound like excellent endeavours. They all seem to be genuine concerns and quite reasonable ideas on which to base an architecture. Unfortunately, none of the really important things are among them.

In software, change is the only constant. From the point where the first line of code is written, the likelihood of needing to change that line increases with each subsequent line. Each hard dependency introduces another source of change for the entire system, which is then a source of change for everything which depends on it, and so on. The biggest killer of a software system is being designed in such a way that making changes is painful.

Changes can be painful for different reasons: perhaps the code-base is a mess and not well understood, or maybe even the simplest change requires coordination of multiple teams. Sometimes it can be impossible to plan a change because the resources which are eventually required to make the change aren’t even in the same business unit which is wanting to make the change.

Certainly this project has all of these problems and others. Because the action of a ‘sale’ is owned by a back-end system, it’s impossible for Digital to sell widgets when the POS system is unavailable. It’s also incredibly difficult for Digital to sell in a manner that the POS system is not designed for. Yes, they can fail elegantly, but why introduce the source of failure at all?

Consider the situation where payment is taken via the POS system, but then something breaks while dispatching via the order management system – what do you tell the user? If we reverse the flow, then we might send something and never take payment for it.
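The first half of that scenario can be sketched in a few lines of Python. Everything here is hypothetical: the function names, the `DispatchError` exception and the simulated failure are assumptions made purely to illustrate a sale that spans two back-end systems.

```python
class DispatchError(Exception):
    """Raised by the hypothetical order tracking client when dispatch fails."""


def take_payment_via_pos(order_id: str, log: list[str]) -> None:
    # Stand-in for a call to the POS system; assumed to succeed here.
    log.append(f"payment taken for {order_id}")


def dispatch_via_order_tracking(order_id: str, log: list[str]) -> None:
    # Stand-in for a call to the order tracking system; simulated as failing.
    raise DispatchError(f"order tracking unavailable for {order_id}")


def buy(order_id: str, log: list[str]) -> str:
    """Synchronous 'buy' flow where the sale spans two back-end systems."""
    take_payment_via_pos(order_id, log)
    try:
        dispatch_via_order_tracking(order_id, log)
    except DispatchError:
        # The customer has been charged but nothing will be dispatched.
        # There is no good message to show them at this point.
        return "paid-but-not-dispatched"
    return "complete"
```

Swapping the two calls inside `buy` gives the reversed flow, where the failure mode becomes dispatching without payment.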

Inverting dependencies

The problem of rigidity can pretty much vanish when we invert our external dependencies. This is a significant change in thinking, which works like this:

1. Have the Digital platform own its own processes.

When a user interacts with the Digital platform, they don’t care about downstream systems. When they click the ‘buy’ button, they want to buy. There are already plenty of ways a purchase can fail, so don’t introduce more: take payment and save the paid status locally with the details of the order. Pretend the Digital platform is just a single application, and don’t rely on back office systems to reach the stage of having sold something. The system of record for a Digital sale becomes ‘Digital’.

By doing this, Digital can now sell whatever they want, however they want (and as quickly as they want). The web is a rapidly changing environment for sales – being tied to the back office systems will kill innovation.
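As a rough sketch of what owning the process could look like, the Python below completes a sale using only things Digital controls. The `Order` fields, the in-memory `DigitalOrderStore` and the `charge_card` placeholder are all assumptions for illustration; a real platform would use a database and a payment provider.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    items: list[str]
    amount_pence: int
    paid: bool = False


@dataclass
class DigitalOrderStore:
    """Digital's own system of record (in memory here, a database in practice)."""
    orders: dict[str, Order] = field(default_factory=dict)

    def save(self, order: Order) -> None:
        self.orders[order.order_id] = order


def charge_card(amount_pence: int) -> bool:
    # Placeholder for a payment provider call; in this sketch it succeeds
    # for any positive amount.
    return amount_pence > 0


def buy(store: DigitalOrderStore, order: Order) -> bool:
    """Complete a sale without calling any back office system."""
    if not charge_card(order.amount_pence):
        return False
    order.paid = True
    store.save(order)  # the sale is now recorded, whatever happens downstream
    return True
```

Note that nothing in `buy` knows the POS or order tracking systems exist.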

2. Raise events

An event is a specific type of message which is 100% owned by the system raising it. If Digital raise an event to say an order has been purchased (let’s call it DigitalOrderPurchased) then Digital define the data in that event. Digital are in control of the data in that event. If Digital change the event without telling downstream systems that’s what they’ve done, then they break their own platform – they know it’s happened and see it when they test. They aren’t surprised to find someone else has changed something.

The pub/sub pattern means any interested system can listen for the event and do something with it. If that process must guarantee success, then that system implements a mechanism for guaranteeing success, to the point of requesting manual intervention if necessary. The downstream system’s successful handling of the message in no way impacts the validity of the sale.
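The shape of this can be sketched in Python. Only the event name DigitalOrderPurchased comes from the text above; its fields, the in-process `EventBus` (standing in for a real message broker) and the `DispatchSubscriber` with its retry count are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DigitalOrderPurchased:
    """Event owned and defined by Digital. The fields are illustrative."""
    order_id: str
    items: tuple[str, ...]


Handler = Callable[[DigitalOrderPurchased], None]


class EventBus:
    """Tiny in-process publish/subscribe bus, standing in for a real broker."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: DigitalOrderPurchased) -> None:
        for handler in self._handlers[type(event)]:
            handler(event)


class DispatchSubscriber:
    """Downstream handler that owns its own retries and escalation."""

    def __init__(self, dispatch: Handler, max_attempts: int = 3) -> None:
        self.dispatch = dispatch
        self.max_attempts = max_attempts
        self.manual_intervention: list[DigitalOrderPurchased] = []

    def handle(self, event: DigitalOrderPurchased) -> None:
        for _ in range(self.max_attempts):
            try:
                self.dispatch(event)
                return
            except Exception:
                continue  # a real system would back off before retrying
        # Out of attempts: park the event for a human to look at.
        self.manual_intervention.append(event)
```

The publisher never learns whether dispatch worked. The subscriber retries, then escalates, and the sale recorded by Digital stays valid throughout.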

3. Version everything

If Digital are going to change an event they’re raising, then they can start raising a ‘version 2’ while still raising the original version, until other systems have been updated to start handling version 2. Digital can implement a change without requiring any work from other teams!
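A minimal Python sketch of raising two versions side by side might look like this. The fields, and the idea that version 2 adds a currency, are hypothetical examples of a change rather than anything prescribed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalOrderPurchasedV1:
    order_id: str
    total_pence: int


@dataclass(frozen=True)
class DigitalOrderPurchasedV2:
    # Hypothetical change: version 2 adds an explicit currency.
    order_id: str
    total_pence: int
    currency: str


def events_for_purchase(
    order_id: str, total_pence: int, currency: str = "GBP"
) -> list[object]:
    """Build every version of the event that Digital currently supports."""
    return [
        DigitalOrderPurchasedV1(order_id, total_pence),
        DigitalOrderPurchasedV2(order_id, total_pence, currency),
    ]
```

Subscribers written against version 1 keep working untouched, and once every consumer has moved to version 2 the first entry can be dropped from the list.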

Sources of failure

Because the ESB team are now handling an event, they have no choice but to maintain a working contract with Digital, because that contract is defined by Digital: the dependency has been inverted. The ESB team can’t change the contract, and Digital will maintain backwards compatibility by ‘versioning’ it, so each system has only itself as a source of failure.

It may well be harder to invert the dependencies between the ESB and the back-end systems, as raising events from an off-the-shelf application might not be possible. However, failures between the ESB and these systems are not causing your users problems – they have made their purchase and gone to put the kettle on, blissfully unaware that it might be several hours before your order management system manages to process and dispatch their order because someone pulled a patch lead.

Conway’s law

When I think about how the two approaches differ, I’m struck that Conway’s law still applies either way. With the first approach, we found that evolving the business was very difficult. After inverting the dependencies, we found that changes could be made in isolation, without fear of breaking a system belonging to another team. In my mind, I see Conway’s law as a blanket, covering an enterprise – choosing to couple or decouple systems changes that cover from something holding the business tight in one position, to something fluid which can move with the business.
