A great post from someone who really understands the problem. Worth a read!
At the time of writing this post, I am 41 years old, I’ve been in the business of writing software for over 20 years, and I have never ever experienced a release weekend. Until now.
It’s now nearly 1 pm. I’ve been here since 7 am. There are a dozen or so different applications which are being deployed today, which are highly coupled and maddeningly unresilient. For my part, I was deploying a web application and some config to a security platform. We again hit a myriad of issues which hadn’t been seen in prior environments and spent a lot of time scratching our heads. The automated deployment pipeline I built for the change takes roughly a minute do deploy everything, and yet it took us almost 3 hours to get to the point where someone could log in.
The release was immediately labelled a ‘success’ and everyone starts singing praises. As subsequent deployments of other applications start to fail.
This is not success!
Success is when the release takes the 60 seconds for the pipeline to run and it’s all working! Success isn’t having to intervene to diagnose issues in an environment no-one’s allowed access to until the release weekend! Success is knowing the release is good because the deploy status is green!
But when I look at the processes being followed, I know that this pain is going to happen. As do others, who appear to expect it and accept it, with hearty comments of ‘this is real world development’ and ‘this is just how we roll here’.
So much effort and failure thrown at releasing a fraction of the functionality which could have been out there if quality was the barrier to release, not red tape.
And yet I know I’m surrounded here by some very intelligent people, who know there are better ways to work. I can’t help wondering where and why progress is being blocked.
Some enterprises have grown their technical infrastructure to the point where dev ops and continuous deployment are second nature. The vast majority of enterprises are still on their journey, or don’t even realise there is a journey for them to take. Businesses aren’t generally built around great software development practices – many businesses are set up without much thought to how technical work even gets done, as this work is seen as only a supporting function, not able to directly increase profitability. This view of technical functions works fine for some time, but eventually stresses begin to form.
Failing at software delivery.
Each area of a young business can be easily supported by one or two primary pieces of software. They’re probably off the shelf solutions which get customised by the teams who use them. They probably aren’t highly integrated; information flows from department to department in spreadsheets. You can think of the integration points between systems as being manual processes.
While the flow of work is funnelled through a manual process such as sales staff on phones or shop staff, this structure is sufficient. The moment the bottleneck of sales staff is removed (in other words, once an online presence is built where customers can be serviced automatically) things need to run a bit quicker. Customers online expect to get instant feedback and delivery estimates. They expect to be able to complete their business in one visit and only expect to receive a follow up communication when there is a problem. Person to person interaction can be very flexible; a sales person can explain why someone has to wait in a way which sounds perfectly fine to a customer. The self-service interaction on a website is less flexible – a customer either gets what they want there and then, or they go somewhere else.
And so businesses start to prioritise new features by how many sales they will bring in, either by reducing the number of customers jumping out of the sales flow or by drawing additional customers into the sales flow.
Problems arise as these features require more and more integration with each of the off the shelf solutions in place throughout the various areas of the business. Tight coupling starts to cause unexpected (often unexplained) outages. Building and deploying new features becomes harder. Testing takes longer – running a full regression becomes so difficult and covers so many different systems that if there isn’t a full day dedicated to it, it won’t happen. The online presence gets new features, slowly, but different instabilities make it difficult for customers to use. The improvements added are off-set by bugs.
Developers find it increasingly difficult to build quality software. The business puts pressure on delivery teams to build new features in less time. The few movements among the more senior developers to try to improve the business’ ability to deliver are short lived because championing purely technical changes to the business is a much more complicated undertaking than getting approval from technical piers. It’s not long before enough attempts to make things better have either been shot down or simply ignored, that developers realise there are other companies who are more willing to listen. The business loses a number of very talented technologists over a few months. They take with them the critical knowledge of how things hang together which was propping up the development team. So the rate of new features to market plummets, as the number of unknown bugs deployed goes through the roof.
It’s usually around this point that the management team starts talking about off-shoring the development effort. The same individuals also become prime targets for sales people peddling their own off the shelf, monolithic, honey trap of a system – promising stability, fast feature development, low prices, and high scalability. The business invests in some such platform without properly understanding why they got into the mess they are in. Not even able to define the issues they need to overcome, never mind able to determine if the platform delivers on them.
By the time the new system is plumbed in and feature parity has been achieved with existing infrastructure, the online presence is a long way behind the competition. Customers are leaving faster than ever, and a decision is made to patch things up for the short term so the business can be sold.
Not an inevitability.
Ok, so yes this is a highly pessimistic, cynical, and gloomy prognosis. But it’s one that I’ve seen more than a few times, and I’m guessing that if you’re still reading then it had a familiar ring for you too. I’m troubled by how easy it is for a business to fall into this downwards spiral. Preventing this is not just about being aware that it’s happening, it’s about knowing what to do about it.
Successfully scaling development effort requires expertise in different areas: architecture, development, operations, whatever it is the business does to make money, people management, leadership, change management, and others I’m sure. It’s a mix of technical skills and soft skills which can be hard to come by. To make things even harder, ‘good’ doesn’t look the same everywhere. Different teams need different stimuli in order to mature, and there are so many different tools and approaches which are largely approved of that one shoe likely won’t fit all. What’s needed is strong, experienced, and inspirational leadership, with buy in from the highest level. That and a development team of individuals who want to improve.
Such leaders will have experience of growing other development teams. They will probably have an online presence where they make their opinions known. They will be early adopters of technologies which they know make sense. They will understand how dev ops works. They will understand about CI and CD, and they will be excited by the idea that things can get better. They’ll want to grow the team that’s there, and be willing to listen to opinion.
Such leaders will make waves. Initially, an amount of effort will be directed away from directly delivering new functionality. Development effort will no longer be solely about code going out the door. They will engage with technology teams at all levels, as comfy discussing the use of BDD as they are talking Enterprise Architecture. Strong opinions will emerge. Different technologies will be trialled. To an outsider (or senior management) it might appear like a revolt – a switch in power where the people who have been failing to deliver are now being allowed to make decisions, and they’re not getting them all right. This appearance is not completely deceptive; ownership and empowerment are huge drivers for team growth. Quality will come with this, as the team learn how to build it into their work from architecture to implementation, and learn how to justify what they’re doing to the business.
Software delivery becomes as much a part of the enterprise as, for example, Human Resources, or Accounts. The CEO may not understand all aspects of HR, but it is understood that the HR department will do what needs to be done. The same level of respect and trust has to be given to software delivery, as it is very unlikely the CEO or any other non-technical senior managers will understand fully the implications of the directions taken. But that’s fine – that’s the way it’s meant to be.
In time to make a difference.
Perhaps the idea of strong technical leadership being critical to technical success is no surprise, it seems sensible enough. So why doesn’t this happen?
There are probably multiple reasons, but I think it’s very common for senior managers to fear strong technical leadership. There seems to be a belief that devolving responsibility for driving change among senior technicians can bring about similar results as a single strong leader while avoiding the re-balancing of power. I see this scenario as jumping out of the same plane with a slightly larger parachute – it’ll be a gentler ride down, but you can only pretend you’re flying. By the time the business makes it’s mind up to try to hire someone, there’s often too much of a car crash happening to make an enticing offering.
If we accept that lifting the manual sales bottleneck and moving to web based sales is the catalyst for the explosion of scale and complexity (which I’m not saying is always the case) then this would be the sweet spot in time to start looking for strong technology leadership. Expect to pay a lot more for someone capable of digging you out of a hole than for someone who has the experience to avoid falling in it to begin with. And other benefits include keeping your customers, naturally.
There are lots of problems that prevent businesses from responding to market trends as quickly as they’d like. Many are not IT related, some are. I’d like to discuss a few problems that I see over and over again, and maybe present some useful solutions. As you read this, please remember that there are always exceptions. But deciding that you have one of these exceptional circumstances is always easier when starting from a sensible basic idea.
Business focused targeting.
For many kinds of work, quicker is better. For software development, quicker is better. But working faster isn’t the same thing as delivering faster.
I remember working as technical lead for a price comparison site in the UK, where once a week each department would read out a list of the things they achieved in the last week and how that had benefited the business. For many parts of the business there was a nice and easy line that could be drawn from what they did each week and a statistic of growth (even if some seemed quite contrived). But the development team was still quite inexperienced, and struggling to do CI never mind CD. For the less experienced devs, being told to “produce things quicker” had the opposite effect. Traditional stick and carrot doesn’t have the same impact on software development as on other functions, because a lot of the time what speeds up delivery seems counter intuitive.
- Have two people working on each task (pair programming)
- Focus on only one feature at a time
- Write as much (or more) test code as functional code
- Spend time discussing terminology and agreeing a ubiquitous language
- Decouple from other systems
- Build automated delivery pipelines
These are just a few examples of things which can be pushed out because someone wants the dev team to work faster. But in reality, having these things present is what enables a dev team to work faster.
Development teams feel a lot of pressure to deliver, because they know how good they can be. They know how quickly software can be written, but it takes mature development practices to deliver quickly and maintain quality. Without the required automation, delivering quick will almost always mean a reduction in quality and more time taken fixing bugs. Then there are the bugs created while fixing other bugs, and so on. Never mind the huge architectural spirals because not enough thought went into things at the start. In the world of software, slow and steady may lose the first round, but it sets the rest of the race up for a sure win.
Tightly coupling systems.
I can’t count how often I’ve heard someone say “We made a tactical decision to tightly couple with <insert some system>, because it will save us money in the long run.”
Please stop thinking this.
Is it impossible for highly coupled systems to be beneficial? No. Is yours one of these cases? Probably not.
There are so many hidden expenses incurred due to tightly coupled designs that it almost never makes any sense. The target system is quite often the one thing everything ends up being coupled with, because it’s probably the least flexible ‘off the shelf’ dinosaur which was sold to the business without any technical review. There are probably not many choices for how to work with it. Well the bottom line is: find a way, or get rid. Ending up with dozens of applications all tightly bound to one central monster app. Changes become a nightmare of breaking everyone else’s code. Deployments take entire weekends. License fees for the dinosaur go through the roof. Vendor lock in turns into shackles and chains. Reality breaks down. Time reverses, and mullets become cool.
Maybe I exaggerated with the mullets.
Once you start down this path, you will gradually lose whatever technical individuals you have who really ‘get’ software delivery. The people who could make a real difference to your business will gradually go somewhere their skills can make a difference. New features will not only cost you more to implement but they’ll come with added risk to other systems.
If you are building two services which have highly related functionality, ie. they’re in the same sub-domain (from a DDD perspective), then you might decide that they should be aware of each other on a conceptual level, and have some logic which spans both services and depends on both being ‘up’, and which get versioned together. This might be acceptable and might not lead to war or famine, but I’m making no promises.
It’s too hard to implement Dev Ops.
No, it isn’t.
Yes, you need at least someone who understands how to do it, but moving to a Dev Ops approach doesn’t mean implementing it across the board right away. That would be an obscene way forwards. Start with the next thing you need to build. Make it deployable, make it testable with integration tests written by the developer. Work out how to transform the configuration for different environments. Get it into production. Look at how you did it, decide what you can do better. Do it better with the next thing. Update the first thing. Learn why people use each different type of technology, and whether it’s relevant for you.
Also, it’s never too early to do Dev Ops. If you are building one ‘thing’ then it will be easier to work with if you are doing Dev Ops. If you have the full stack defined in a CI/CD pipeline and you can get all your changes tested in pre-production environments (even infra changes) then you’re winning from the start. Changes become easy.
If you have a development team who don’t want to do Dev Ops then you have a bigger problem. It’s likely that they aren’t the people who are going to make your business succeed.
Ops do routing, DBA’s do databases.
Your developers should be building the entire stack. They should be building the deployment pipeline for the entire stack. During deployment, the pipeline should configure DNS, update routing tables, configure firewalls, apply WAF rules, deploy EC2 instances, install the built application, run database migration scripts, and run tests end to end to make sure the whole lot is done correctly. Anything other than this is just throwing a problem over the fence to someone else.
The joke of the matter is that the people doing the developer’s ‘dirty work’ think this is how they support the business. When in reality, this is how they allow developers to build software that can never work in a deployed state. This is why software breaks when it gets moved to a different environment.
Ops, DBA’s, and other technology specialists should be responsible for defining the overall patterns which get implemented, and the standards which must be met. The actual work should be done by the developer. If for no other reason than the fact that when the developer needs a SQL script writing, there will never be a DBA available. The same goes for any out-of-team dependencies – they’re never available. This is one of the biggest blockers to progress in software development: waiting for other people to do their bit. It’s another form of tight coupling, building inter-dependent teams. It’s a people anti-pattern.
If you developers need help to get their heads around routing principals or database indexing, then get them in a room with your experts. Don’t get those people to do the dirty work for everyone else, that won’t scale.
BAU handle defects.
A defect found by a customer should go straight back to the team which built the software. If that team is no longer there, then whichever team was ‘given’ responsibility for that piece of software gets to fix the bug.
Development teams will go a long way to give themselves an easy life. That includes adding enough error handling, logging, and resilient design practices to make bug fixing a cinch, but only if they’re the ones who have to deal with the bugs.
Fundamental design flaws won’t get fixed unless they’re blocking the development team.
This isn’t an exhaustive list. Even now there are more and more things springing to mind, but if I tried to shout every one out then I’d have a book, not a blog post. The really unfortunate truth is that 90% of the time I see incredibly intelligent people at the development level being ignored by the business, by architects, even by each other, because even though a person hears someone saying ‘this is a good/bad idea’ being able to see past their own preconceptions to understand that point of view is often incredibly difficult. Technologists all too often lack the soft skills required to make themselves heard and understood. It’s up to those who have made a career from their ‘soft skills’ to recognise that and pay extra attention. A drowning person won’t usually thrash about and make a noise.
A friend tweeted recently about how it isn’t always possible to decide late on which product to use for data storage as different products often force an application to use different patterns. This got me thinking about making other decisions in software design. In general it’s accepted that deciding as late as possible is usually a good thing but I think people often miss-interpret ‘as late as possible’ to mean ‘make it an after-thought’.
‘As late as possible’ is a wonderfully subjective term. It suggests that although we want to wait longer before make a decision, we might not be able to. We might need to make a decision to support the design of the rest of the system. Or perhaps in some cases, making the decision early might be more important than making the ‘perfect’ decision.
I started thinking about how to decide whether a decision should be put off and I was reminded of the work of Roy Osherove. He suggests that development teams transition through different states and should be managed differently in each state. There is a similar methodology which relates the approach to software design with different categories of problem space. It’s called Cynefin (pronounced like ‘Kevin’ but with a ‘n’ straight after the ‘K’).
To quote Wikipedia:
The framework provides a typology of contexts that guides what sort of explanations or solutions might apply.
There’s a diagram that goes with this and helps give some context (thanks Wikipedia):
I don’t want to do a deep dive into Cynefin in this article (maybe another day) but to summarise:
- Obvious – these solutions are easy to see, easy to build, and probably available off the peg. They’re very low complexity and shouldn’t require a lot of attention from subject matter experts.
- Complicated – these solutions are easier to get wrong. Maybe no-one on the team has implemented anything similar before, but experience and direction is available possibly from another team.
- Complex – these solutions have lots of possible answers but it’s unclear which will be best. While this is new to your company, other companies have managed to build something similar so you know it is possible.
- Chaotic – these solutions are totally new and you have no point of reference as to what would be the best way to implement. You may not even know if it’s possible.
In general, an enterprises core domain will sit in both Complicated and Complex. Chaotic might be something you do for a while but the focus is to move the solution back into one of the other categories.
So what does this have to do with making decisions?
Well, I suggest that the decision making process might change depending on what category your solution falls into.
- Obvious – obvious solutions probably don’t have many difficult decisions to make. This is not your core domain (unless you work in a really boring sector) so throwing money and resources at obvious solutions is not sensible. The driver for choosing tech stack might well be just “what’s already available” and might be a constraint right up front. You may want to buy an off the shelf product, in which case a lot of decisions are already made for you. If SQL Server is the quickest path to delivery then it might well be the right thing to use here, even if you’re longing to try a NoSQL approach.
- Complicated – complicated solutions are often solved by taking advice from an expert. “We found a separate read concern solved half our problems.” and “A relational database just doesn’t work for this.” are both great nuggets of advice someone who’s done this before might put forward. These solutions are in your core domain, you want to avoid code rot and inflexible architectures – deciding late seems generally sensible, but the advice from your experts can help scope those decisions. Focus on finding the abstractions on which to base the solution. You might know that you’ll need the elasticity that only the cloud can provide, but you might leave the decision on which provider until as late as possible.
- Complex – complex solutions are where experts are harder to get involved. They might be in different teams or hired from a consultancy. The focus should still be on finding the right abstractions to allow critical decisions to be delayed. Running multiple possible solutions in parallel to see what works best is a great approach which will give your team confidence in the chosen option. A subject matter expert might be more useful in explaining how they approached defining a solution rather than just what the solution was.
- Chaotic – it might seem a terrible idea to try and make decisions in this situation but there are advantages. Chaotic can become Complex if you can find an anchor for the solution. “How do we solve this with ‘x’?” is a lot easier for a team to decide than a more general approach. You’ll almost certainly want to run with two or three possible options in parallel. Keep in mind that whatever decision you make may well eventually be proved incorrect.
I think this shows how the approach to decision making can be affected by what category of solution you’re working on. By picking the right strategy for the right kind of problem, you can focus resources more cost effectively.