At the time of writing this post, I am 41 years old, I’ve been in the business of writing software for over 20 years, and I have never ever experienced a release weekend. Until now.
It’s now nearly 1 pm. I’ve been here since 7 am. There are a dozen or so different applications being deployed today, which are highly coupled and maddeningly unresilient. For my part, I was deploying a web application and some config to a security platform. We again hit a myriad of issues which hadn’t been seen in prior environments and spent a lot of time scratching our heads. The automated deployment pipeline I built for the change takes roughly a minute to deploy everything, and yet it took us almost 3 hours to get to the point where someone could log in.
The release was immediately labelled a ‘success’ and everyone started singing its praises, even as subsequent deployments of other applications started to fail.
This is not success!
Success is when the release takes the 60 seconds for the pipeline to run and it’s all working! Success isn’t having to intervene to diagnose issues in an environment no-one’s allowed access to until the release weekend! Success is knowing the release is good because the deploy status is green!
But when I look at the processes being followed, I know that this pain is going to happen. As do others, who appear to expect it and accept it, with hearty comments of ‘this is real world development’ and ‘this is just how we roll here’.
So much effort and failure is thrown at releasing a fraction of the functionality which could have been out there if quality were the barrier to release, not red tape.
And yet I know I’m surrounded here by some very intelligent people, who know there are better ways to work. I can’t help wondering where and why progress is being blocked.
I’ve heard a lot of people say something like “but we don’t need huge scalability” when pushed for a reason why their architecture is straight out of the 90s. “We’re not big enough for devops” is another regular excuse. But while it’s certainly true that many enterprises don’t need to worry so much about high loads and high availability, there are some other, very real benefits to embracing early 21st century architecture principles.
Scalable architecture is simple architecture
Keep it simple, stupid! It’s harder to do than it might seem. What initially appears to be the easy solution can quickly turn into a big, unmanageable ball of tightly coupled dependencies, where one bad line of code can affect a dozen different applications.
In order to scale easily, a system should be simple. When scaling, you could end up with dozens or even hundreds of instances, so any complexity is multiplied. Complexity is also a recipe for waste. If you scale a complex application, the chances are you’re scaling bits which simply don’t need to scale. Systems should be designed so hot functions can be scaled independently of those which are underutilised.
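To put some rough numbers on that waste, here is a small sketch. Every figure in it is made up for illustration: the function names, the loads, the memory footprints, and the assumption that one running copy of anything handles 100 requests per second.

```python
import math

# Hypothetical figures, for illustration only.
CAPACITY_PER_COPY = 100  # requests per second one running copy can handle
LOAD = {"search": 900, "checkout": 80, "reports": 5}  # requests per second
MEMORY_MB = {"search": 200, "checkout": 300, "reports": 500}  # per running copy


def copies_needed(requests_per_second: int) -> int:
    """Number of copies required to serve the load, with a minimum of one."""
    return max(1, math.ceil(requests_per_second / CAPACITY_PER_COPY))


def memory_scaled_independently() -> int:
    """Each function runs only as many copies as its own load requires."""
    return sum(copies_needed(LOAD[name]) * MEMORY_MB[name] for name in LOAD)


def memory_scaled_as_monolith() -> int:
    """Every copy of the monolith carries every function, hot or not."""
    copies = copies_needed(sum(LOAD.values()))
    return copies * sum(MEMORY_MB.values())


print(memory_scaled_independently())  # 9*200 + 1*300 + 1*500 = 2600
print(memory_scaled_as_monolith())    # 10 copies * 1000 = 10000
```

With these invented numbers the monolith needs nearly four times the memory, almost all of it spent on extra copies of the reporting and checkout code that nobody asked to scale.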
Simple architecture takes thought and consideration. It’s decoupled for good reason – small things are easier to keep ‘easy’ than big things. An array of small things, all built with the same basic rules and standards, can be easily managed if a little effort is put into working out an approach which works for you. Once you have a few small things all being managed in the same way, growing to lots of small things is easy, if it’s needed.
Simple architecture is also resilient, because simple things tend not to break. And even if you aren’t bothered about a few outages, it’s better to only have the outages you plan for.
Scalable architecture is decoupled
If you need to make changes in anything more than a reverse proxy in order to scale one service, then your architecture is coupled, and shows signs of inelasticity. Besides being scalable, decoupled architecture is much easier to maintain, and keeps a much higher level of quality because it’s easier to test.
Decoupled architecture is scoped to a specific few modules which can be deployed together repeatedly as a single stack with relative ease, once automated. Outages are easy to fix, as it’s just a case of hitting the redeploy button.
Your end users will find that your decoupled architecture is much nicer to use as well. Instead of making dozens of calls to load and save data in a myriad of different applications and databases, a decoupled application makes only one or two calls to load or save the data to a dedicated store, then raises events for other systems to handle. It’s called eventual consistency and it isn’t difficult to make work. In fact it’s almost impossible to avoid in an enterprise system, so embracing the principle wholeheartedly makes the required thought processes easier to adopt.
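A minimal sketch of the save-then-raise-an-event idea might look like the following. It assumes an in-memory store and an in-process queue standing in for a real message broker; `EventBus`, `OrderService`, `order_saved` and `shipping_view` are hypothetical names invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class EventBus:
    """Hypothetical in-process stand-in for a real message broker."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self._pending: List[Tuple[str, dict]] = []

    def subscribe(self, event_name: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        # Queue the event; subscribers see it later, not during the save.
        self._pending.append((event_name, payload))

    def deliver_pending(self) -> None:
        # Simulates the broker delivering queued events asynchronously.
        while self._pending:
            event_name, payload = self._pending.pop(0)
            for handler in self._handlers[event_name]:
                handler(payload)


class OrderService:
    """Owns its own store; other systems learn about changes via events."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self.store: Dict[str, dict] = {}

    def save_order(self, order_id: str, order: dict) -> None:
        self.store[order_id] = order  # one call to the dedicated store
        self._bus.publish("order_saved", {"order_id": order_id, **order})


bus = EventBus()
shipping_view: Dict[str, dict] = {}  # another system's own copy of the data
bus.subscribe("order_saved", lambda event: shipping_view.update({event["order_id"]: event}))

orders = OrderService(bus)
orders.save_order("o-1", {"item": "kettle"})
# Here shipping_view is still empty: the two systems briefly disagree.
bus.deliver_pending()
# Now shipping_view has caught up, which is all "eventual consistency" means.
```

The order service never calls the shipping system directly, so either one can be redeployed or scaled without touching the other.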
Scalable architecture is easier to test
If you are deploying a small, well understood, stack with very well known behaviours and endpoints, then it’s going to be a no-brainer to get some decent automated tests deployed. These can be triggered from a deployment platform with every deploy. As the data store is part of the stack and you’re following micro-architecture rules, the only records in the stack come from something in the stack. So setting up test data is simply a case of calling the APIs you’re testing, which in turn tests those APIs. You don’t have to test beyond the interface, as it shouldn’t matter (functionally) how the data is stored, only that the stack functions correctly.
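As a rough illustration of setting up test data through the API under test, here is a sketch in which a plain Python class stands in for a deployed stack. `CustomerApi` and its methods are hypothetical; in a real stack these would be HTTP calls against the deployed endpoints.

```python
import itertools
from typing import Dict, Optional


class CustomerApi:
    """Hypothetical API of a small stack; the dict stands in for its private data store."""

    def __init__(self) -> None:
        self._rows: Dict[int, dict] = {}
        self._ids = itertools.count(1)

    def create_customer(self, name: str) -> int:
        if not name:
            raise ValueError("name is required")
        customer_id = next(self._ids)
        self._rows[customer_id] = {"name": name}
        return customer_id

    def get_customer(self, customer_id: int) -> Optional[dict]:
        row = self._rows.get(customer_id)
        return dict(row) if row is not None else None


def test_created_customer_can_be_read_back() -> None:
    api = CustomerApi()
    # Test data is set up through the public API, never by writing to the store.
    customer_id = api.create_customer("Ada")
    assert api.get_customer(customer_id) == {"name": "Ada"}


def test_unknown_customer_returns_none() -> None:
    api = CustomerApi()
    assert api.get_customer(999) is None


test_created_customer_can_be_read_back()
test_unknown_customer_returns_none()
```

Neither test looks inside `_rows`, so the storage could be swapped for a real database without changing a line of test code.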
Scalable architecture moves quicker to market
Given small, easily managed, scalable stacks of software, adding a new feature is a doddle. Automated tests reduce the manual test overhead. Some features can get into production in a single day, even when they require changes across several systems.
Scalable architecture leads to higher quality software
Given that in a scaling situation you would want to know your new instances are going to function, you need to attain a high standard of quality in what’s built. Fortunately, as it’s easier to test, quicker to deploy, and easier to understand, higher quality is something you get. Writing test-first code becomes second nature, even writing integration tests up front.
Scalable architecture reduces staff turnover
It really does! If you’re building software with the same practices which have been causing headaches and failures for the last several decades, then people aren’t going to want to work for you for very long. Your best people will eventually get frustrated and go elsewhere. You could find yourself in a position where you finally realise you have to change things, but everyone with the knowledge and skills to make the change has left.
I guess what I’m trying to point out is that I haven’t ever heard a good reason for not building something which can easily scale. Building for scale helps focus solutions on good architectural practices; decoupled, simple, easily testable, micro-architectures. Are there any enterprises where these benefits are seen as undesirable? Yet, when faced with the decision of either continuing to build the same, tightly coupled, monoliths which require full weekends (or more!) just to deploy, or building something small, lightweight, easily deployed, easily maintained, and ultimately scalable, there are plenty of people claiming “Only in an ideal world!” or “We aren’t that big!”.
Some enterprises have grown their technical infrastructure to the point where dev ops and continuous deployment are second nature. The vast majority of enterprises are still on their journey, or don’t even realise there is a journey for them to take. Businesses aren’t generally built around great software development practices – many businesses are set up without much thought to how technical work even gets done, as this work is seen as only a supporting function, not able to directly increase profitability. This view of technical functions works fine for some time, but eventually stresses begin to form.
Failing at software delivery.
Each area of a young business can be easily supported by one or two primary pieces of software. They’re probably off the shelf solutions which get customised by the teams who use them. They probably aren’t highly integrated; information flows from department to department in spreadsheets. You can think of the integration points between systems as being manual processes.
While the flow of work is funnelled through a manual process such as sales staff on phones or shop staff, this structure is sufficient. The moment the bottleneck of sales staff is removed (in other words, once an online presence is built where customers can be serviced automatically) things need to run a bit quicker. Customers online expect to get instant feedback and delivery estimates. They expect to be able to complete their business in one visit and only expect to receive a follow up communication when there is a problem. Person to person interaction can be very flexible; a sales person can explain why someone has to wait in a way which sounds perfectly fine to a customer. The self-service interaction on a website is less flexible – a customer either gets what they want there and then, or they go somewhere else.
And so businesses start to prioritise new features by how many sales they will bring in, either by reducing the number of customers jumping out of the sales flow or by drawing additional customers into the sales flow.
Problems arise as these features require more and more integration with each of the off the shelf solutions in place throughout the various areas of the business. Tight coupling starts to cause unexpected (often unexplained) outages. Building and deploying new features becomes harder. Testing takes longer – running a full regression becomes so difficult and covers so many different systems that if there isn’t a full day dedicated to it, it won’t happen. The online presence gets new features, slowly, but different instabilities make it difficult for customers to use. The improvements added are offset by bugs.
Developers find it increasingly difficult to build quality software. The business puts pressure on delivery teams to build new features in less time. The few movements among the more senior developers to try to improve the business’ ability to deliver are short lived because championing purely technical changes to the business is a much more complicated undertaking than getting approval from technical peers. It’s not long before enough attempts to make things better have either been shot down or simply ignored, that developers realise there are other companies who are more willing to listen. The business loses a number of very talented technologists over a few months. They take with them the critical knowledge of how things hang together which was propping up the development team. So the rate of new features to market plummets, as the number of unknown bugs deployed goes through the roof.
It’s usually around this point that the management team starts talking about off-shoring the development effort. The same individuals also become prime targets for sales people peddling their own off the shelf, monolithic, honey trap of a system – promising stability, fast feature development, low prices, and high scalability. The business invests in some such platform without properly understanding why they got into the mess they are in. They are not even able to define the issues they need to overcome, never mind determine whether the platform delivers on them.
By the time the new system is plumbed in and feature parity has been achieved with existing infrastructure, the online presence is a long way behind the competition. Customers are leaving faster than ever, and a decision is made to patch things up for the short term so the business can be sold.
Not an inevitability.
Ok, so yes this is a highly pessimistic, cynical, and gloomy prognosis. But it’s one that I’ve seen more than a few times, and I’m guessing that if you’re still reading then it had a familiar ring for you too. I’m troubled by how easy it is for a business to fall into this downwards spiral. Preventing this is not just about being aware that it’s happening, it’s about knowing what to do about it.
Successfully scaling development effort requires expertise in different areas: architecture, development, operations, whatever it is the business does to make money, people management, leadership, change management, and others I’m sure. It’s a mix of technical skills and soft skills which can be hard to come by. To make things even harder, ‘good’ doesn’t look the same everywhere. Different teams need different stimuli in order to mature, and there are so many different tools and approaches which are largely approved of that one shoe likely won’t fit all. What’s needed is strong, experienced, and inspirational leadership, with buy in from the highest level. That and a development team of individuals who want to improve.
Such leaders will have experience of growing other development teams. They will probably have an online presence where they make their opinions known. They will be early adopters of technologies which they know make sense. They will understand how dev ops works. They will understand about CI and CD, and they will be excited by the idea that things can get better. They’ll want to grow the team that’s there, and be willing to listen to opinion.
Such leaders will make waves. Initially, an amount of effort will be directed away from directly delivering new functionality. Development effort will no longer be solely about code going out the door. They will engage with technology teams at all levels, as comfy discussing the use of BDD as they are talking Enterprise Architecture. Strong opinions will emerge. Different technologies will be trialled. To an outsider (or senior management) it might appear like a revolt – a switch in power where the people who have been failing to deliver are now being allowed to make decisions, and they’re not getting them all right. This appearance is not completely deceptive; ownership and empowerment are huge drivers for team growth. Quality will come with this, as the team learn how to build it into their work from architecture to implementation, and learn how to justify what they’re doing to the business.
Software delivery becomes as much a part of the enterprise as, for example, Human Resources, or Accounts. The CEO may not understand all aspects of HR, but it is understood that the HR department will do what needs to be done. The same level of respect and trust has to be given to software delivery, as it is very unlikely the CEO or any other non-technical senior managers will understand fully the implications of the directions taken. But that’s fine – that’s the way it’s meant to be.
In time to make a difference.
Perhaps the idea of strong technical leadership being critical to technical success is no surprise, it seems sensible enough. So why doesn’t this happen?
There are probably multiple reasons, but I think it’s very common for senior managers to fear strong technical leadership. There seems to be a belief that devolving responsibility for driving change among senior technicians can bring about similar results as a single strong leader while avoiding the re-balancing of power. I see this scenario as jumping out of the same plane with a slightly larger parachute – it’ll be a gentler ride down, but you can only pretend you’re flying. By the time the business makes its mind up to try to hire someone, there’s often too much of a car crash happening to make an enticing offering.
If we accept that lifting the manual sales bottleneck and moving to web based sales is the catalyst for the explosion of scale and complexity (which I’m not saying is always the case) then this would be the sweet spot in time to start looking for strong technology leadership. Expect to pay a lot more for someone capable of digging you out of a hole than for someone who has the experience to avoid falling in it to begin with. And other benefits include keeping your customers, naturally.
There are lots of problems that prevent businesses from responding to market trends as quickly as they’d like. Many are not IT related, some are. I’d like to discuss a few problems that I see over and over again, and maybe present some useful solutions. As you read this, please remember that there are always exceptions. But deciding that you have one of these exceptional circumstances is always easier when starting from a sensible basic idea.
Business focused targeting.
For many kinds of work, quicker is better. For software development, quicker is better. But working faster isn’t the same thing as delivering faster.
I remember working as technical lead for a price comparison site in the UK, where once a week each department would read out a list of the things they achieved in the last week and how that had benefited the business. For many parts of the business there was a nice and easy line that could be drawn from what they did each week and a statistic of growth (even if some seemed quite contrived). But the development team was still quite inexperienced, and struggling to do CI never mind CD. For the less experienced devs, being told to “produce things quicker” had the opposite effect. Traditional stick and carrot doesn’t have the same impact on software development as on other functions, because a lot of the time what speeds up delivery seems counterintuitive.
- Have two people working on each task (pair programming)
- Focus on only one feature at a time
- Write as much (or more) test code as functional code
- Spend time discussing terminology and agreeing a ubiquitous language
- Decouple from other systems
- Build automated delivery pipelines
These are just a few examples of things which can be pushed out because someone wants the dev team to work faster. But in reality, having these things present is what enables a dev team to work faster.
Development teams feel a lot of pressure to deliver, because they know how good they can be. They know how quickly software can be written, but it takes mature development practices to deliver quickly and maintain quality. Without the required automation, delivering quickly will almost always mean a reduction in quality and more time taken fixing bugs. Then there are the bugs created while fixing other bugs, and so on. Never mind the huge architectural spirals because not enough thought went into things at the start. In the world of software, slow and steady may lose the first round, but it sets the rest of the race up for a sure win.
Tightly coupling systems.
I can’t count how often I’ve heard someone say “We made a tactical decision to tightly couple with <insert some system>, because it will save us money in the long run.”
Please stop thinking this.
Is it impossible for highly coupled systems to be beneficial? No. Is yours one of these cases? Probably not.
There are so many hidden expenses incurred due to tightly coupled designs that it almost never makes any sense. The target system is quite often the one thing everything ends up being coupled with, because it’s probably the least flexible ‘off the shelf’ dinosaur which was sold to the business without any technical review. There are probably not many choices for how to work with it. Well, the bottom line is: find a way, or get rid. Otherwise you end up with dozens of applications all tightly bound to one central monster app. Changes become a nightmare of breaking everyone else’s code. Deployments take entire weekends. License fees for the dinosaur go through the roof. Vendor lock in turns into shackles and chains. Reality breaks down. Time reverses, and mullets become cool.
Maybe I exaggerated with the mullets.
Once you start down this path, you will gradually lose whatever technical individuals you have who really ‘get’ software delivery. The people who could make a real difference to your business will gradually go somewhere their skills can make a difference. New features will not only cost you more to implement but they’ll come with added risk to other systems.
If you are building two services which have highly related functionality, i.e. they’re in the same sub-domain (from a DDD perspective), then you might decide that they should be aware of each other on a conceptual level, and have some logic which spans both services and depends on both being ‘up’, and which get versioned together. This might be acceptable and might not lead to war or famine, but I’m making no promises.
It’s too hard to implement Dev Ops.
No, it isn’t.
Yes, you need at least someone who understands how to do it, but moving to a Dev Ops approach doesn’t mean implementing it across the board right away. That would be an obscene way forwards. Start with the next thing you need to build. Make it deployable, make it testable with integration tests written by the developer. Work out how to transform the configuration for different environments. Get it into production. Look at how you did it, decide what you can do better. Do it better with the next thing. Update the first thing. Learn why people use each different type of technology, and whether it’s relevant for you.
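One simple way to "transform the configuration for different environments" is to keep a single base config and overlay a small set of per-environment overrides. The sketch below assumes that approach; the keys, hostnames and environment names are all invented for illustration.

```python
import copy
from typing import Any, Dict

# Hypothetical settings; keys and values are illustrative only.
BASE_CONFIG: Dict[str, Any] = {
    "log_level": "INFO",
    "database": {"host": "localhost", "port": 5432},
}

ENVIRONMENT_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "dev": {"log_level": "DEBUG"},
    "production": {"database": {"host": "db.internal.example"}},
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict: base with override applied, merging nested dicts."""
    merged = copy.deepcopy(base)  # never mutate the shared base config
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def config_for(environment: str) -> Dict[str, Any]:
    """Build the full config for one environment, failing loudly on typos."""
    if environment not in ENVIRONMENT_OVERRIDES:
        raise KeyError(f"unknown environment: {environment}")
    return deep_merge(BASE_CONFIG, ENVIRONMENT_OVERRIDES[environment])


print(config_for("production"))
```

Because each environment only states what differs from the base, the differences between dev and production stay small enough to review at a glance.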
Also, it’s never too early to do Dev Ops. If you are building one ‘thing’ then it will be easier to work with if you are doing Dev Ops. If you have the full stack defined in a CI/CD pipeline and you can get all your changes tested in pre-production environments (even infra changes) then you’re winning from the start. Changes become easy.
If you have a development team who don’t want to do Dev Ops then you have a bigger problem. It’s likely that they aren’t the people who are going to make your business succeed.
Ops do routing, DBAs do databases.
Your developers should be building the entire stack. They should be building the deployment pipeline for the entire stack. During deployment, the pipeline should configure DNS, update routing tables, configure firewalls, apply WAF rules, deploy EC2 instances, install the built application, run database migration scripts, and run tests end to end to make sure the whole lot is done correctly. Anything other than this is just throwing a problem over the fence to someone else.
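The shape of such a pipeline can be sketched as an ordered list of steps that stops at the first failure. This is only an outline under stated assumptions: the step names follow the list above, but every action is a placeholder, and `run_pipeline` is a hypothetical helper rather than part of any real deployment tool.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure, so later steps
    never run against a half-built environment.
    Returns the names of the completed steps."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            raise RuntimeError(f"pipeline failed at '{name}' after {completed}") from error
        completed.append(name)
    return completed


def placeholder() -> None:
    """Stands in for real work, such as calling a cloud provider's API."""


# Every action here is a placeholder; only the ordering is the point.
DEPLOY_STEPS: List[Step] = [
    ("configure DNS", placeholder),
    ("update routing tables", placeholder),
    ("configure firewalls", placeholder),
    ("apply WAF rules", placeholder),
    ("deploy EC2 instances", placeholder),
    ("install application", placeholder),
    ("run database migrations", placeholder),
    ("run end-to-end tests", placeholder),
]

completed_steps = run_pipeline(DEPLOY_STEPS)
```

Putting the end-to-end tests last, inside the same pipeline, is what lets a green deploy status stand in for "the release is good".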
The joke of the matter is that the people doing the developer’s ‘dirty work’ think this is how they support the business. When in reality, this is how they allow developers to build software that can never work in a deployed state. This is why software breaks when it gets moved to a different environment.
Ops, DBAs, and other technology specialists should be responsible for defining the overall patterns which get implemented, and the standards which must be met. The actual work should be done by the developer. If for no other reason than the fact that when the developer needs a SQL script writing, there will never be a DBA available. The same goes for any out-of-team dependencies – they’re never available. This is one of the biggest blockers to progress in software development: waiting for other people to do their bit. It’s another form of tight coupling, building inter-dependent teams. It’s a people anti-pattern.
If your developers need help to get their heads around routing principles or database indexing, then get them in a room with your experts. Don’t get those people to do the dirty work for everyone else; that won’t scale.
BAU handle defects.
A defect found by a customer should go straight back to the team which built the software. If that team is no longer there, then whichever team was ‘given’ responsibility for that piece of software gets to fix the bug.
Development teams will go a long way to give themselves an easy life. That includes adding enough error handling, logging, and resilient design practices to make bug fixing a cinch, but only if they’re the ones who have to deal with the bugs.
Fundamental design flaws won’t get fixed unless they’re blocking the development team.
This isn’t an exhaustive list. Even now there are more and more things springing to mind, but if I tried to shout every one out then I’d have a book, not a blog post. The really unfortunate truth is that 90% of the time I see incredibly intelligent people at the development level being ignored by the business, by architects, even by each other, because even though a person hears someone saying ‘this is a good/bad idea’, being able to see past their own preconceptions to understand that point of view is often incredibly difficult. Technologists all too often lack the soft skills required to make themselves heard and understood. It’s up to those who have made a career from their ‘soft skills’ to recognise that and pay extra attention. A drowning person won’t usually thrash about and make a noise.