Testing Times

Although I see developers writing more and more tests, their efforts are often ignored by QA and not taken into account by the test strategy. It's common for developers to involve QA in their work, but this is not the full picture. To maximise the efficiency of test coverage, I think developer tests should be accounted for as part of the test approach.

The role of QA as the gateway of quality, rather than the implementors of quality, allows for good judgement to be used when deciding if a developer's tests provide suitable coverage to count toward the traditional QA effort. This requires QA to include some development skills, so the role is capable of seeing the benefits and flaws in developer-written tests.

What follows is a breakdown of how I categorise different types of test, followed by a “bringing it all together” section where I hope to outline a few approaches for streamlining the amount of testing done.


Unit Tests

Unit testing comes in a few different flavours. I've noticed some differences depending on language and platform. There are also different reasons for writing unit tests.

Post development tests

These are the unit tests developers would write before TDD became popular. They do have a lot of value. They are aimed at ensuring business logic does what it’s supposed to at the time of development, and keeps on working at the time of redevelopment five years later.

TDD

I fall into the crowd who believe TDD is more about code design than it is about functional correctness. Taking a TDD approach will help a developer write SOLID code, and hopefully make it much easier to debug and read. Having said that, it's a fantastic tool for writing complex business logic, because generally you will have had a very kind analyst work out exactly what the output of your complex business logic is expected to be. Tackling the development with a TDD approach will make translating that logic into code much easier, as it's immediately obvious when you break something you wrote 2 minutes ago.

BDD

I'm a big believer in BDD tests being written at the unit level. For me, these are tests which are named and namespaced in a way to indicate the behaviour being tested. They will often have most of the code in a single setup method and test the result of running the setup in individual test methods, named appropriately. These can be used in the process of TDD, to design the code, and they can also be excellent at making a connection between acceptance criteria and business logic. Because the context of the test is important, I find there's about a 50/50 split of when I can usefully write unit tests in a BDD fashion vs working with a more general TDD approach. I've also found that tests like these can encourage the use of domain terminology from the story being worked on, as a result of the wording of the ACs.
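To make that concrete, here's a rough sketch of the shape I mean, written in C# with NUnit (the Order, OrderProcessor, and OrderResult types are made up purely for illustration):

using System;
using NUnit.Framework;

[TestFixture]
public class WhenAnOrderIsPlacedWithAnExpiredDiscountCode
{
    private OrderResult _result;

    [SetUp]
    public void GivenAnExpiredDiscountCode()
    {
        // All the arrangement and the action live in one setup method...
        var order = new Order { DiscountCode = "SUMMER-SALE" };
        var processor = new OrderProcessor(currentDate: new DateTime(2020, 1, 1));
        _result = processor.Place(order);
    }

    // ...and each test method asserts a single aspect of the resulting behaviour.
    [Test]
    public void TheDiscountIsNotApplied()
    {
        Assert.That(_result.DiscountApplied, Is.False);
    }

    [Test]
    public void TheCustomerIsToldTheCodeHasExpired()
    {
        Assert.That(_result.Messages, Does.Contain("Discount code has expired"));
    }
}

The fixture name plus the test names read as a sentence describing the behaviour, which is exactly what shows up in the test report.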

Without a doubt, BDD style unit tests are much better at ensuring important behaviours remain unbroken than more traditional class level unit tests, because the way they’re named and grouped makes the purpose of the tests, and the specific behaviour under test, much clearer. I can tell very quickly if a described behaviour is important or not, but I can’t tell you whether the method GetResult() should always return the number 5. I would encourage this style of unit testing where business logic is being written.

However! BDD is not Gherkin. BDD tests place emphasis on the behaviour being tested rather than just on the correctness of the results. Don’t be tied to arbitrarily writing ‘given, when, then’ statements.

Hitting a database

The DotNet developer in me screams "No!" whenever I think about this, but the half of me which loves Ruby on Rails understands that because MySQL is deployed with a single 'apt' command and because I need to ensure my dynamically typed objects save and load correctly, hitting the db is a really good idea. My experience with RoR tells me that abstracting the database is way more work than just installing it (because 'apt install mysql2' is far quicker to write than any number of mock behaviours). In the DotNet world, you have strongly typed objects, so checking that an int can be written to an integer column in a database is a bit redundant.

Yes, absolutely, this is blatantly an integration test. When working with dynamic languages (especially on Linux) I think the dent in the concept is worth the return.

Scope of a unit test

This is important, because if we are going to usefully select tests to apply toward overall coverage, we need to know what is and isn’t being tested. There are different views on unit testing, on what does and doesn’t constitute a move from unit testing to integration testing, but I like to keep things simple. My view is that if a test crosses application processes, then it’s an integration test. If it remains in the application you are writing (or you’re hitting a db in RoR) and doesn’t require the launching of the application prior to test, then it’s a unit test. That means unit tests can cover a single method, a single class, a single assembly, or multiple of all or any of these. Pick whatever definition of ‘a unit’ works to help you write the test you need. Don’t be too constrained on terminology – if you need a group of tests to prove some functionality which involves half a dozen assemblies in your application, write them and call them unit tests. They’ll run after compile just fine. Who cares whether someone else’s idea of a ‘unit’ is hurt?

Who writes these?

I really hope that it’s obvious to you that developers are responsible for writing unit tests. In fact, developers aren’t only responsible for writing unit tests, they’re responsible for realising that they should be writing unit tests and then writing them. A developer NEVER has to ask permission to write a unit test, any more than they need permission to turn up to work. This is a part of their job – does a pilot need to ask before extending the landing gear?

How can QA rely on unit tests?

Firstly, let's expel the idea that QA might 'rely' on a unit test. A unit test lives in the developer's domain and may change at any moment or be deleted. As is so often the case in software development, the important element is the people themselves. If a QA and a dev work together regularly and the QA knows there are unit tests, has even seen them and understands how the important business logic is being unit tested, then that QA has far greater confidence that the easy stuff is probably OK. Hopefully QA have access to the unit test report from each build, and the tests are named well enough to make some sense. With this scenario, it's easier to be confident that the code has been written to a standard that is ready for a QA to start exposing it to real exploratory "what if" testing, rather than just checking it meets the acceptance criteria. Reasons for story rejection are far less likely to be simple logic problems.


Component Tests

I might upset some hardware people, but in my mind a component test is run against your code while it is running, but without crossing application boundaries downstream. So if you have written a DotNet Web API service, you would be testing the running endpoint of that service while intercepting downstream requests and stubbing the responses. I've found Mountebank to be an excellent tool for this, but I believe there are many more to choose from.
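As a rough sketch of what that looks like in practice (C# with NUnit and HttpClient; the ports, URLs, and payloads are all assumptions for illustration), the test first tells Mountebank to stub the downstream dependency, then hits the real, running endpoint of the service under test:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using NUnit.Framework;

[TestFixture]
public class WhenTheDownstreamPriceServiceReturnsAPrice
{
    private static readonly HttpClient Http = new HttpClient();

    [OneTimeSetUp]
    public async Task StubTheDownstreamPriceService()
    {
        // Mountebank's admin API listens on 2525 by default; this creates an
        // imposter on 4545, which the service under test is configured to call.
        const string imposter = @"{
            ""port"": 4545,
            ""protocol"": ""http"",
            ""stubs"": [{
                ""predicates"": [{ ""equals"": { ""method"": ""GET"", ""path"": ""/prices/ABC123"" } }],
                ""responses"": [{ ""is"": { ""statusCode"": 200, ""body"": ""{ \""price\"": 9.99 }"" } }]
            }]
        }";
        await Http.PostAsync("http://localhost:2525/imposters",
            new StringContent(imposter, Encoding.UTF8, "application/json"));
    }

    [Test]
    public async Task TheBasketEndpointReturnsThePriceFromTheStub()
    {
        // Hypothetical endpoint of the Web API under test, already running locally.
        var body = await Http.GetStringAsync("http://localhost:5000/api/basket/ABC123");
        StringAssert.Contains("9.99", body);
    }
}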

TDD and BDD

Component tests can be run on a developer's machine, so it's quite possible for these to be useful in a TDD/BDD fashion. The downside is that the application needs to be running in order to run the tests, so if they are to be executed from a build server, then the application needs to be started and the stubbing framework set up – this can be trickier to orchestrate. As with any automation, this is only really tricky the first time it's done. After that, the code and patterns are in place.

In my experience, component tests have limited value. I've found that this level of testing is often swallowed by the combination of good unit testing and good integration testing. From one point of view, they are actually integration tests, as you are testing the integration between the application and the operating system.

Having said that, if the downstream systems are not available or reliable, then this approach allows the application functionality to be tested separately from any integrations.

Who writes these?

Developers write component tests. They may find that they are making changes to an older application, which they don't fully understand. Being able to sandbox the running app and stub all external dependencies can help in these situations.

How can QA rely on component tests?

This again comes down to the very human relationship between the QA and the developer. If there is a close working relationship, and the developer takes the time to show the QA the application under test and explain why they've tested it in that way, then it increases the confidence the QA has that the code is ready for them to really go to town on it. It might be that in a discussion prior to development, the QA had suggested that they would be comfier if this component test existed. The test could convey as much meaning as the ACs in the story.

Again, quality is ensured by having a good relationship between the people involved in building it.


Application Scoped Integration Tests

Integration tests prove that the application functions correctly when exposed to the other systems it will be talking to once in production. They rely on the application being installed and running, and they rely on other applications in ‘the stack’ also running. Strictly speaking, any test that crosses application boundaries is an integration test, but I want to focus on automated tests. We haven’t quite got to manual testing yet.

TDD and BDD

With an integration test, we are extending the feedback time a little too far for it to be your primary TDD strategy. You may find it useful to write an integration test to show a positive and negative result from some business logic at the integration level before that logic is deployed, just so you can see the switch from failing to passing tests, but you probably shouldn't be testing all boundary results at an integration level as a developer. It's definitely possible to write behaviour focussed integration tests. If you're building an API and the acceptance criteria include a truth table, you pretty much have your behaviour tests already set out for you, but consider what you are testing at the integration level – if you have unit tests proving the logic, then you only need to test that the business logic is hooked in correctly.

The difficult part of this kind of testing often seems to be setting up the data in downstream systems for the automated tests. I find it difficult to understand why anyone would design a system where test data can’t be injected at will – this seems an obvious part of testable architecture; a non-functional requirement that keeps getting ignored for no good reason. If you have a service from which you need to retrieve a price list, that service should almost certainly be capable of saving price list data sent to it in an appropriate way; allowing you to inject your test data.

Scope

The title of this section is ‘Application Scoped Integration Tests’ – my intention with that title is to draw a distinction between tests which are intended to test the entire stack and tests which are intended to test the application or service you are writing at the time. If you have 10 downstream dependencies in your architecture, these tests would hit real instances of these dependencies but you are doing this to test the one thing you are building (even though you will generally catch errors from further down the stack as well).

Who writes these?

These tests are still very closely tied with the evolution of an application as it’s being built, so I advocate for developers to write these tests.

How can QA rely on application scoped integration tests?

Unit and component tests are written specifically to tell the developer that their code works; integration tests are higher up the test pyramid. This means they are more expensive and their purpose should be considered carefully. Although I would expect a developer to write these tests alongside the application they are building, I would expect significant input from a QA to help with what behaviours should and shouldn’t be tested at this level. So we again find that QA and dev working closely gives the best results.

Let's consider a set of behaviours defined in a truth table which has different outcomes for different values of an enum retrieved from a downstream system. The application doesn't have control over the values in the enum; it's the downstream dependency that is returning them, so they could conceivably change without the application development team knowing it. At the unit test level, we can write a test to prove every outcome of the truth table. At the integration level, we don't need to re-write those tests, but we do need to verify that the enum contains exactly the values we are expecting, and what happens if the enum can't be retrieved at all.
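A sketch of that integration-level check might look something like this (C# with NUnit; the endpoint URL and the enum are made up for illustration):

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using NUnit.Framework;

[TestFixture]
public class CustomerStatusIntegrationTests
{
    // The enum our unit-tested truth table was written against (hypothetical).
    private enum CustomerStatus { Active, Suspended, Closed }

    [Test]
    public async Task TheDownstreamSystemReturnsExactlyTheStatusesWeExpect()
    {
        using var http = new HttpClient();

        // A real instance of the downstream dependency in the test environment.
        var json = await http.GetStringAsync("https://downstream.test.internal/customer-statuses");
        var actual = JsonSerializer.Deserialize<string[]>(json);

        // No need to repeat the truth table here; the unit tests cover every
        // outcome. If the enum changes underneath us, this test fails.
        CollectionAssert.AreEquivalent(Enum.GetNames(typeof(CustomerStatus)), actual);
    }
}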

Arriving at this approach through discussion between QA and Dev allows everyone to understand how the different tests at different levels complement each other to prove the overall functionality.


Consumer Contracts

These are probably my favourite type of test! Generally written for APIs and message processors, these tests can prevent regression issues caused by services getting updated beyond the expectation of consumers. When a developer is writing something which must consume a service (whether via synchronous or asynchronous means) they write tests which execute against the downstream service to prove it behaves in a way that the consumer can handle.

For example: if the consumer expects a 400 HTTP code back when it POSTs an address with no ‘line 1’ field, then the test will intentionally POST the invalid address and assert that the resulting HTTP code is 400. This gets tested because subsequent logic in the consumer relies on having received the 400; if the consumer didn’t care about the response code then this particular consumer contract wouldn’t include this test.
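A minimal sketch of that contract test (C# with NUnit; the service URL and payload are made up for illustration) might be:

using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using NUnit.Framework;

[TestFixture]
public class AddressServiceConsumerContract
{
    [Test]
    public async Task PostingAnAddressWithNoLine1ReturnsA400()
    {
        using var http = new HttpClient();

        // Deliberately invalid: no 'line 1' field.
        var invalidAddress = new StringContent(
            @"{ ""line2"": ""Some Town"", ""postcode"": ""AB1 2CD"" }",
            Encoding.UTF8, "application/json");

        var response = await http.PostAsync("https://address-service.ci.internal/addresses", invalidAddress);

        // Our subsequent logic branches on this 400, which is why the contract pins it.
        Assert.That(response.StatusCode, Is.EqualTo(HttpStatusCode.BadRequest));
    }
}

The same test project is handed to the address service team and run in their CI, so they find out about the dependency the moment they break it.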

The clever thing about these tests is when they are run: the tests are given to the team who develop the consumed service and are run as part of their CI process. They may have similar tests from a dozen different consumers, some testing the same thing; the value is that it's immediately obvious to the service developers who relies on what behaviour. If they break something then they know who will be impacted.

Scope

This is the subject of some disagreement. There is a school of thought which suggests nothing more than the shape of the request, the shape of the response, and how to call the service should be tested; beyond that should be a black box. Personally, I think that while this is probably correct most of the time, there should be some wiggle room, depending on how coupled the services are. If a service sends emails, then you might want to check for an 'email sent' event being raised after getting a successful HTTP response (even though the concept of raising the event belongs to the emailer service) – the line is narrow; testing too deep increases coupling, but all situations are different.

Who writes these?

These are written by developers and executed in CI.

How can QA rely on consumer contracts?

Consumer contracts are one of the most important classes of tests. If the intention is ever to achieve continuous delivery, these types of test will become your absolute proof that a change to a service hasn’t broken a consumer. Because they test services, not UI, they can be automated and all encompassing but they might not test at a level that QA would normally think about. To get a QA to understand these tests, you will probably have to show how they work and what they prove. They will execute well before any code gets to a ‘testing’ phase, so it’s important for QA to understand the resilience that the practice brings to distributed applications.

Yet again we are talking about good communication between dev and QA as being key to proving the tests are worth taking into consideration.


Stack Scoped Integration Tests

These are subtly different from application scoped integration tests.

There are probably respected technologists out there who will argue that these are the same class of test. I have seen them combined and I have seen them adopted individually (or more often not adopted at all) – I draw a distinction because of the different intentions behind writing them.

These tests are aimed at showing the interactions between the entire stack are correct from the point of view of the main entry point. For example, a microservice may call another service which in turn calls a database. A stack scoped test would interact with the first microservice in an integrated environment and confirm that relevant scenarios from right down the stack are handled correctly.

TDD and BDD

You would be forgiven for wondering how tests at such a high level can be included in TDD or BDD efforts; the feedback loop is pretty long. I find these kinds of tests are excellent for establishing behaviours around NFRs, which are known up front so failing tests can be put in place. These are also great at showing some happy path scenarios, while trying to avoid full-on boundary testing (simply because the detail of boundary values can be tested much more efficiently at a unit level). It might be worth looking at the concept of executable specifications and tools such as FitNesse – these allow behaviours to be defined in a hierarchical wiki and linked directly to both integration and unit tests to prove they have been fulfilled. It's an incredibly efficient way to produce documentation, automated tests, and functioning code at the same time.

Scope

Being scoped to the stack means that there is an implicit intention for these tests to be applied beyond the one service or application. We are expecting to prove integrations right down the stack. This also means that it might not be just a developer writing these. If we have a single suite of stack tests, then anyone making changes to anything in the stack could be writing tests in this suite. For new features, it would also be efficient if QA were writing some of these tests; this can help drive personal integration between dev and QA, and help the latter get exposure to what level of testing has already been applied before it gets anywhere near a classic testing phase.

These tests can be brittle and expensive if the approach is wrong. Testing boundary values of business logic at a stack level is inefficient. That isn’t to say that you can’t have an executable specification for your business logic, just that it possibly shouldn’t be set up as an integration test – perhaps the logic could be tested at multiple levels from the same suite.

How can QA rely on stack scoped integration tests?

These tests are quite often written by a QA cooperating closely with a BA, especially if you are using something like FitNesse and writing executable specifications. A developer may write some plumbing to get the tests to execute against their code in the best way. Because there is so much involvement from QA, it shouldn't be difficult for these tests to be trusted.

I think this type of testing applied correctly demonstrates the pinnacle of cooperation between BA, QA, and dev; it should always result in a great product.


Automated UI Tests

Many user interfaces are browser based, and as such need to be tested in various different browsers and different versions of each browser. Questions like “does the submit button work with good data on all browsers?” are inefficient to answer without automation.

Scope

This is a tricky class of test to scope correctly. Automated UI tests are often brittle and hard to change, so if you have too many it can start to feel like the tests are blocking changes. I tend to scope these to the primary sales pipelines and calls to action: "can you successfully carry out your business?" – anything more than this tends to quickly become more pain than usefulness. It's far more efficient to look at how quickly a small mistake in a less important part of your site/application could be fixed.
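As an illustration, a single critical-journey test might look like this sketch (C# with Selenium WebDriver and NUnit; the URLs and element ids are made up):

using NUnit.Framework;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

[TestFixture]
public class CriticalJourneyTests
{
    [Test]
    public void ACustomerCanGetFromProductPageToOrderConfirmation()
    {
        using var driver = new ChromeDriver();

        // One end-to-end "can you carry out your business" journey; no boundary
        // testing, no business logic, just the pipeline that earns the money.
        driver.Navigate().GoToUrl("https://staging.example.com/products/ABC123");
        driver.FindElement(By.Id("add-to-basket")).Click();
        driver.FindElement(By.Id("checkout")).Click();
        driver.FindElement(By.Id("card-number")).SendKeys("4111111111111111");
        driver.FindElement(By.Id("place-order")).Click();

        StringAssert.Contains("Order confirmed", driver.FindElement(By.Id("confirmation-message")).Text);
    }
}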

This is an important problem to take into consideration when deciding where to place business logic. If you have a web application calling a webservice, business logic can be tested behind the service FAR more easily than in the web application.

Who writes these?

I've usually seen QAs writing these, although I have written a handful myself in the past. They tend to get written much later in the development lifecycle than other types of test as they rely on attributes of the actual user interface to work. This is the very characteristic which often makes them brittle, as when the UI changes then they tend to break.

How can QA rely on automated UI tests?

Automated UI tests are probably the most brittle and most expensive tests to write, update, and run. It is one of the most expensive ways to find a bug, unless the bug is a regression issue detected by a pre-existing test (and your UI tests are running frequently). To rely on these tests, they need to be used carefully; just test the few critical journeys through your application which can be tested easily. Don't test business logic this way, ever. These tests often sit solely in the QA domain, written by a QA, so trusting them shouldn't be a problem.


Exploratory Testing

This is one of the few types of testing which human beings are built for. It's generally applied to user interfaces and is executed by a QA who tries different ways to break the application either by acting 'stupid' or malicious. It simply isn't practical yet to carry out this kind of testing in an automated fashion; it requires the imagination of an actual person. The intention is to catch problems which were not thought of before the application was built. These might be things which were missed, or they may be a result of confusing UX which couldn't be foreseen without the end result in place.

Who does this?

This is (IMO) the ‘traditional’ QA effort.


User Acceptance Testing

Doesn’t UAT stand for “test everything all over again in a different environment”?

I’ve seen the concept of UAT brutalised by more enterprises than I can count. User acceptance testing is meant to be a last, mostly high level, check to make sure that what has been built is still usable when exposed to ‘normal people’ (aka. the end users).

Things that aren’t UAT:

  1. Running an entire UI automation suite all over again in a different environment.
  2. Running pretty much any automation tests (with the possible exception of some core flows).
  3. Blanket re-running of the tests passing in other environments in a further UAT environment.

If you are versioning your applications and tests in a sensible way, you should know what combinations of versions lead to passing tests before you start UAT. UAT should be exactly what it says on the tin: give it to some users. They’re likely to immediately try to do something no-one has thought about – that’s why we do UAT.

Any new work coming out of UAT will likely not be fixed in that release – don’t rely on UAT to find stuff. If you don’t think your pre-UAT test approach gives sufficient confidence then change your approach. If you feel that your integration environment is too volatile to give reliable test results for features, have another environment with more controls on deployments.

Who runs these tests?

It should really be end users, but often it's just other QAs and BAs. I recommend these not being the same people who have been involved right through the build, although some exposure to the requirements will make life easier.

How can QA rely on User Acceptance Testing?

QA should not rely on UAT. By the time software is in a UAT phase, QA should be ok for whatever has been built to hit production. That doesn’t mean outcomes from UAT are ignored, but the errors found during UAT should be more fundamental gaps in functionality which were either not considered or poorly conceived before a developer ever got involved, or (more often than not) yet more disagreement on the colour of the submit button.


Smoke Testing

Smoke testing originated in the hardware world, where a device would be powered up and if it didn’t ‘smoke’, it had passed the test. Smoke testing in the software world isn’t a huge effort. These are a few, lightweight tests which can confirm that your application deployed properly; slightly more in-depth than a simple healthcheck, but nowhere near as comprehensive as your UI tests.

Who runs these tests?

These should be automated and executed from your deployment platform after each deploy. They give early feedback of likely success or definite failure without having to wait for a full suite of integration tests to run.
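As an illustration, the post-deploy step might run little more than a couple of curl checks along these lines (the URLs and the response shape are assumptions):

# Fail the deploy step if the freshly deployed instance isn't serving its healthcheck
curl --fail --silent https://staging.example.com/healthcheck | grep -q '"status": "OK"'

# ...and prove one real page renders without error
curl --fail --silent --output /dev/null https://staging.example.com/products/ABC123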

How can QA rely on Smoke Testing?

QA don't rely on smoke tests; these are really more for developers to see fundamental deployment issues early. QA are helped by the early feedback to the developer, which doesn't require them to waste their time trying to test something which won't even run.


Penetration Testing

Penetration testing is a specific type of exploratory test which requires some specialist knowledge. The intention is to gain access to servers and/or data maliciously via the application under test. There are a few automated tools for this, but they only cover some very basic things. This is again better suited to a human with an imagination.

Who runs these tests?

Generally a 3rd party who specialises in penetration testing is brought in to carry out these tests. That isn't to say that security shouldn't be considered until then, but keeping up with new attacks and vulnerabilities is a full time profession.

I haven't yet seen anyone take learnings from penetration testing and turn them into a standard automation suite which can run automatically against new applications of a similar architecture (e.g. most web applications could be tested for the same set of vulnerabilities), but I believe this would be a sensible thing to do; better to avoid the known issues rather than repeatedly fall foul of them and have to spend time rewriting.

How can QA rely on Penetration Testing?

Your QA team will generally not have the expert knowledge to source penetration testing anywhere other than from a 3rd party. The results often impact Ops as well as Dev, so QA are often not directly involved, as they are primarily focussed on the application, not how it sits in the wider enterprise architecture.


Bringing It All Together

There are so many different ways to test our software and yet I see a lot of enterprises completely ignoring half of them. Even when there is some knowledge of the different classes of test, it’s deemed too difficult to use more than a very limited number of strategies.

I’ve seen software teams write almost no unit tests or developer written integration tests and then hand over stories to a QA team who write endless UI automation tests. Why is this a bad thing? I think people forget that the test pyramid is about where business value is maximised; it isn’t an over-simplification taught to newbies and university students, it reflects something real.

Here is my list of test types in the pyramid:

My test pyramid

Notice that I haven't included Consumer Contracts in my list. This is because Consumer Contracts can be run at Unit, Integration, or Component levels, so they are cross-cutting in their way.

In case you need reminding: the higher the test type is on the pyramid, the more expensive it is to fix a bug which is discovered there.

The higher levels of the pyramid are often over-inflated because the QA effort is focused on testing, and not on assuring quality. In an environment where a piece of work is 'thrown over the fence' to the QA team, there is little trust (or interest) in any efforts the developer might have already gone to in the name of quality. This leads to inefficiently testing endless combinations of request properties of APIs or endless possibilities of paths a user could navigate through a web application.

If the software team can build an environment of trust and collaboration, it becomes easier for QA to work more closely with developers and combine efforts for test coverage. Some business logic in an API being hit by a web application could be tested with 100% certainty at the unit level, leaving integration tests to prove the points of integration, and just a handful of UI tests to make sure the application handles any differing responses correctly.

This is only possible with trust and collaboration between QA and Developers.

Distrust and suspicion leads to QA ignoring the absolute proof of a suite of passing tests which fully define the business logic being written.

What does it mean?

Software development is a team effort. Developers need to know how their code will be tested, QA need to know what testing the developer will do, even Architects need to pay attention to the testability of what they design; and if something isn’t working, people need to talk to each other and fix the problem.

Managers of software teams all too often focus on getting everyone to do their bit as well as possible, overlooking the importance of collaborative skills; missing the most important aspect of software delivery.

A Refreshing Change

Many of my clients have used large scale data refresh processes to pull production data down into staging and development environments. This is generally accompanied by a complicated process of depersonalising the data and masking anything which could be deemed private or confidential. In larger enterprises, the process can take several days for a single environment, making it unavailable for deployments and testing. The process often breaks integrations where relational integrity between very separate systems is lost.

So, if it’s such a large, difficult task, why does it happen?

Where it works

Let's start by looking at where this practice is useful (the list isn't very long).

User acceptance testing and load testing are best performed with production-like datasets. This is because the results could be affected by the shape, size, and detail of the data in the system. These types of tests are generally carried out toward the end of an iteration, whether that iteration is the delivery of a sprint or the delivery of a feature – they test the combined results of all the small changes which have been made. It makes sense to run these against a dataset which has been generated from the production data, as that is guaranteed to contain all your production scenarios (including data corruptions).

Because these types of tests are not being run constantly and they can be run on the same datasets, they can be run in the same environment. When they aren't being run, the data refresh processes can be running to update that single environment with up-to-date records from production. This needs to be made efficient; otherwise, as more of the platform is built and more data ends up in prod, the refresh will take longer and longer.

I'm pretty sure I'm going to cop some flak for saying that UAT tests and load tests aren't being run continuously, but I stand by it. UAT tests carried out at the story level are not real UAT tests unless the story encompasses an entire feature. A story can be integration tested, UI tested, auto tested, manually tested, unit tested, but usually not UAT'd. A user acceptance test is from the point of view of a user, and that generally happens with feature releases (especially when a later story may change the functionality of an earlier story, making the earlier UAT irrelevant).

There might be an amount of load testing carried out in other environments, but on a much smaller scale and with narrower scopes. The tests we’re talking about here are end to end.

Because only a single environment is being affected, temporary outages due to the complicated nature of refreshing data and masking personal data tend not to impact ongoing work.

Where it doesn’t work

As a rule, don’t let developers near your production datasets. Not even obfuscated copies. This isn’t a security problem, it’s an architecture problem. If developers and architects don’t have to worry about the composition of a record, if they don’t have to think of how many different systems need data injecting into them in order for a single screen to function, then things start to sprawl in horrible ways. I’ve seen first hand the ridiculous scenario where there is simply no known way to reliably inject a user in such a way that a system will work fully. What’s worse, is that I’ve seen this more than once.

I’ve been in the situation where there is no single developer who knows exactly where a user record comes from in full. The idea of building a ‘User Service’ which could create a user seemed mindbogglingly complicated.

Why is this a bad thing? If your development teams don’t understand where the data is coming from, they don’t understand the behaviour of the system they’re building, and they can’t write tests which cover all scenarios. You start to rely on the (incorrect) idea that the production data is a ‘golden recordset’ which contains so much data it must cover all scenarios. Then the developers start to realise they can’t write reliable tests against data which is refreshed every few weeks and randomly masked in different ways. It becomes a manual QA effort to find records to use in tests. Problems aren’t found until much later and cost much more to solve, or worse: problems aren’t noticed.

If it isn’t possible for developers to understand and write coded tests for all behaviours and inject data to drive each behaviour, then you are slowly grinding to a halt.

Avoid it

Avoid pushing production-like data to development, or staging, or any other environment where it isn’t needed. Behaviours should be sufficiently defined, and architecture should be properly conceived, so injecting test data as part of automated testing is simple. There are no swings and roundabouts here – there’s just a good and bad approach. Please pick the good one.

Automation with Forgerock AM 6.5

Beware – here be dragons!

Over the last year, I've become very familiar with Forgerock's Access Manager platform. Predominantly I've been working with a single, manually managed, 13.5 instance, but since experiencing 3 days of professional services from Forgerock, I've been busily working on automating AM 6.5.1 using Team City, Octopus, and Ansible. While the approach I've taken isn't explicitly recommended by Forgerock, it isn't frowned upon, and it is in line with the containerised deployment mechanisms which are expected to become popular with AM v7. I can't share the source code for what was implemented as it would be a breach of client trust, but given the lack of material available on automating AM (and the sheer complexity of the task), I think it's worth outlining the approach.

Disclaimer alert!
What I cover here is a couple of steps on from what was eventually implemented for my client. The reason being that automating something as complex as Forgerock AM is new for them, as are Ansible Roles, and volatile infra. We went as far as having a single playbook for the AM definition, and we had static infra – the next logical step would be to break down into roles and generate the infra with each deploy.

I've already been through the pain of distilling non-functional requirements down to a final approach, so I feel it would be easier here to start at the end. Let's talk implementation.

Tech Stack

The chosen tech stack was driven by what was already in use by my client. The list is augmented with some things we felt the pain of missing.

Code repositories: git in Azure DevOps
Build platform: Team City
Deployment platform: Octopus
Configuration management: Ansible
Package management: JFrog Artifactory
Local infra as code tool: Vagrant

A few points I’d like to make about some of these:

  1. Azure DevOps looks really nice, but has an appalling range of thousands of IP addresses which need to be whitelisted in order to use any of the hosted build / deploy agents. The problem goes away if you self host agents, but it’s a poor effort on the part of Microsoft.
  2. Octopus isn't my preferred deployment tool. I find Octopus is great for beginners, but it lends itself to point and click rather than versioning deploy code in repos. It's also very over-engineered and opinionated, forcing its concepts onto users. My personal preference is Thoughtworks' Go Deploy which takes the opposite approach.
  3. You don’t need to use Vagrant for local development, I only call it out here because I believe it can help speed things up considerably. It’s possible to execute Ansible playbooks via the Vagrantfile, or (my preference) write a bash script which can be used manually, via Ansible, or from virtually any other platform.
  4. I don’t have huge amounts of experience with Ansible, but it seems to do the job pretty well. I’m sure I probably missed a few tricks in how I used it.

Architecture

Generally, with a multi-node deployment of Forgerock AM, we end up with something looking like fig. 1.

AM6.5 Automation - AM - Shared Config - Affinity Token Stores
Fig. 1: Basic multi-node configuration with affinity enabled.

There are two items to note about this configuration:

  1. The shared config database means ssoadm/Amster commands only need executing against one instance. The other instance then just needs restarting to pick up the config which has been injected into the config database.
  2. Affinity is the name for the mechanism AM uses to load balance the token stores without risking race conditions and dirty reads. If a node writes a piece of data to token store instance 1, then every node will always go back to instance 1 for that piece of data (failing over to other options if instance 1 is unavailable). This helps where replication takes longer than the gap between writing and reading.

Affinity rocks. Until we realised this was available, there was a proxy in front of the security token stores set to round robin. If you tried to read something immediately after writing, you’d often get a dirty read or an exception. Affinity does away with this by deciding where the data should be stored based on a hash of the data location which all nodes can calculate. Writes and reads from every node will always go to the same STS instance first.

For my purposes, I found that the amount of data I needed to store in the user's profile was tiny; I had maybe two properties, which led me down the path of trying to use client based sessions to store the profile. The benefit of this approach is that we don't really need any security token stores. Our architecture ends up looking like fig. 2.

AM6.5 Automation - AM - Default Config - Client Based Sessions
Fig. 2: No need for token stores.

We don’t just do away with the token stores. Because we are fully automating the deployment, we don’t need to share a config database – we know our config is aligned because it is recreated with every deploy exactly as it is in the source code.

Keys

Ok, so it isn't quite as easy as that. Because we aren't sharing config, we can't allow the deploy process to pick random encryption keys. These keys are used to encode session info, security tokens, and cookies. To align these we need to run a few commands during deployment.

set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-signing-rsa-certificate-alias=<< your cert alias >>"
set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-encryption-rsa-certificate-alias=<< your cert alias >>"
set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-encryption-aes-key=<< your aes key >>"
set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-signing-hmac-shared-secret=<< your hmac key >>"
set-attr-defs --verbose --servicename iPlanetAMAuthService -t organization -a "iplanet-am-auth-hmac-signing-shared-secret=<< your hmac key >>"
set-attr-defs --verbose --servicename iPlanetAMAuthService -t organization -a "iplanet-am-auth-key-alias=<< your cert alias >>"
set-attr-defs --verbose --servicename RestSecurityTokenService -t organization -a "oidc-client-secret=<< your oidc secret >>"

These settings are ssoadm commands mostly found in this helpful doco, but I think I had to dig a bit further for one or two. Some of these have rules over minimum complexity. The format I’ve given is how they would appear if you are using the ssoadm do-batch command to run a number of instructions via batch file.
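For reference, the batch file itself is then executed with something along these lines (the admin id, password file, and batch file path are your own):

./ssoadm do-batch --adminid amadmin --password-file /tmp/ampwd --batchfile /path/to/session-config.batch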

SAML gotcha

To make client based profiles work for SAML authentication, I was surprised to find that I needed to write a couple of custom classes.

SAML auth isn’t something we wanted to do, but we were forced down this route due to limitations of another 3rd party platform.

I started off with this doco from Forgerock, and with Forgerock’s am-external repo. With some debug level logging, I was able to find that I needed to create a custom AccountMapper and a custom AttributeMapper. It seems that both of the default classes were coded to expect the profile to be stored in a db, regardless of whether client sessions were enabled or not. Rather than modifying the existing classes, I added my own classes to avoid breaking anything else which might be using them.

Referencing the new classes is annoyingly not well documented. Firstly, build the project and drill down into the compiled output (I just used the two .class files created for my new classes) – copy them over into the war file under WEB-INF/lib/openam-federation-library.jar. Make sure you put the .class files in the right location. I managed to reference these classes in my 'identityprovider.properties' file with these XML elements:

<Attribute name="idpAccountMapper">
    <Value>com.sun.identity.saml2.plugins.YourCustomAccountMapper</Value>
</Attribute>
<Attribute name="idpAttributeMapper">
    <Value>com.sun.identity.saml2.plugins.YourCustomAttributeMapper</Value>
</Attribute>

As code

To fully define the deployment of AM in code, I ended up with the git repositories shown in fig. 3.

AM6.5 Automation - DevOps Git Repo's
Fig. 3: The collection of git repos listed against the teams which could own them. The colour coding used here will be continued through other diagrams.

Infra space

Hopefully you’re using fully volatile instances, and creating/destroying new webservers all the time. If you are, then this should make some sense. There’s a JBoss webserver role which references a RedHat server role. These can be reused by various deployments and they’re configured once by the Infra team.

I’m not going to go into much detail about these, as standards for building instances will change from place to place.

We didn’t have fully volatile infra when I implemented AM automation, which meant it was important to completely remove every folder from the deploy before re-deploying. While developing I’d often run into situations where a setting was left over from a previous run and would fail on a new instance.

Platform space

The Platform Space is about managing the 3rd party applications that support the enterprise. This space owns the repo for customising the war file, a copy of the am-external repo from Forgerock, and the Ansible Role defining how to deploy a totally vanilla instance of AM – this references the JBoss webserver role. These are all artifacts which are needed in order to just deploy the vanilla, reusable Forgerock AM platform without any realms.

Dev space

The Dev Space should be pretty straightforward. It contains a repo for the protected application, and a repo for the AM Realm to which the application belongs. The realm definition is an Ansible Playbook, rather than a role. It's a playbook because there isn't a scenario where it would be shared. Also, although ansible-galaxy can be used to download the dependencies from git, it doesn't execute them; you still need a point of entry for running the play and its dependencies, which can just be a playbook. One of the files in the playbook should be a requirements.yml, which is used to initiate the chain of dependencies through the other roles (mentioned below in a little more detail).

Repo: Forgerock AM war file

My solution structure for building the war file looks like this:

- root
  - warFile
  - customCode
    - code
    - tests
  - amExternal
  - xui
    - openam-ui-api
    - openam-ui-ria
  - staticResources

We can go through each of these subfolders in turn.

warFile

The war file is an unzipped copy of the official file, downloaded from here. I was using version 6.5.1 of Access Manager. This folder is a Maven project configured to output a .war file. Before compiling this war file, we need to pull in all the customisations from the rest of the solution.

customCode

This is (unsurprisingly) for custom Java code, built against the Forgerock AM code. The type of things you might find here would be plugins, auth nodes, auth modules, services, and all sorts of other points of extension where you can just create a new ‘thing’ and reference it by classname in the realm config.

Custom code is pretty straightforward. New libraries are the easiest to deal with as you're writing code to interface. You compile to a jar and copy that jar into the war file under /WEB-INF/lib/ along with any dependencies. As long as you are careful with your namespaces and keep an eye on the size of what you're writing, you can probably get away with just building a single jar file for all your custom code. This makes things easier for you in the sense that you can do everything in one project, right alongside an unzipped war file. If you start to need multiple jars to break down your code further, consider moving your custom code to a different repo, and hosting jars on an internal Maven server.
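As a rough illustration (assuming the custom code folder is also a Maven project; the jar name and paths are made up), the build script boils down to something like:

# Build the custom code and drop the resulting jar into the unzipped war file
cd customCode && mvn package && cd ..
cp customCode/target/custom-am-extensions.jar warFile/WEB-INF/lib/

# Then package the war file itself
cd warFile && mvn package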

Because you are writing new code here, there is of course the opportunity to add some unit tests, and I suggest you do. I found that keeping my logic out of any classes which implement anything from Forgerock was a good move – allowing me to test logic without worrying about how the Forgerock code hangs together. This is probably sage advice at any time on any other platform, as well.

Useful link: building custom auth nodes (may require a Forgerock account to access)

amExternal

The code in am-external is a little trickier. This repo from Forgerock has around 50 modules in it, and you'll probably only want to recompile a couple. I'm not really a Java developer, so rather than try to get every module working, I elected to create my own git repo with am-external in it, keeping track of customisations in the git history and in README.md, then manually copying the recompiled jars over into my war file build. I placed these compiled jars into the amExternal folder, with a build script which simply copied them into /WEB-INF/lib/ before the war file is compiled.

xui

This is (in my opinion) a special case from am-external, a module called openam-ui. We already had XUI customisations from a while ago, otherwise I would probably not have bothered with XUI. From my own experience and having discussed this during some on-site Forgerock Professional Services, XUI is a pretty clunky way to do things. The REST API in AM 6.5+ is excellent, and you can easily consume it from your own login screen.

For previous versions of AM, it’s been possible just to copy the XUI files into the war file before compilation, but now we have to use the compiled output.

Instructions for downloading the am-external source and the XUI are here. New themes can be added at: /openam-ui-ria/src/resources/themes/ – just copy the ‘dark’ folder and start from there. There are a couple of places where you have to add a reference to the new theme, but the above link should help you out with that as well.

This module needs compiling at openam-ui, and the output copying into the war file under the /XUI folder.

staticResources

We included a web.xml and a keepAlive.jsp as non-XUI resources. I found a nice way to handle these is to recreate the warFile structure in the staticResources folder, add your files there, and use a script to copy the entire folder structure recursively into the war file while leaving the existing destination files in place.

Some of these (amExternal and staticResources) could have been left out, and the changes made directly into the war file. I didn’t do this for two reasons:

  1. The build scripts which copy these files into place explain to any new developers what’s going on far better than a git history would.
  2. By leaving the war file clean (no changes at all since downloading from Forgerock), I can confidently replace it with the next version and know I haven’t lost any changes.

The AM-SSOConfiguratorTools

The AM-SSOConfiguratorTools-{version}.zip file can be downloaded from here. The version I was using is 5.1.2.2, but you will probably want the latest version.

Push this zip into Artifactory, so it can be referenced by the Ansible play which installs AM.

You have a choice to make here about how much installation code lives with the Ansible play, and how much is in the Configurator package you push to Artifactory. There are a number of steps which go along with installing and using the Configurator which you might find apply to all usages, in which case I would tend to add them to the Configurator package. These steps are things like:

  • Verify the right version of the JDK is available (1.8).
  • Unzip the tools.
  • Copy to the right locations.
  • Apply permissions.
  • Add certificates to the right trust store / copy your own trust store into place.
  • Execute the Configurator referencing the install config file (which will need to come from your Ansible play, pretty much always).

Repo: Forgerock AM (Ansible Role)

This role has to run the following (high level) steps:

  1. Run the JBoss webserver role.
  2. Configure JBoss’ standalone.xml to point at a certificate store with your SSL cert in it.
  3. Grab the war file from Artifactory and register it with JBoss.
  4. Pull the Configurator package from Artifactory.
  5. Run the Configurator with an install config file from the Role.
  6. Use the dsconfig tool to allow anonymous access to Open DS (if you are running the default install of Open DS).
  7. Add any required certs into the Open DS keystore (/{am config directory}/opends/config/keystore)
  8. Align passwords on certs using the keystore.pin file from the same directory.

More detailed install instructions can be found here.

I ran into a lot of issues while trying to write an install script which would work. Googling the problems helped, but having a Forgerock Backstage account and being able to ask their support team directly was invaluable.

A lot of issues were around getting certificates into the right stores, with the correct passwords. You need to take special care to make sure that Open DS also has access to the right SSL certs and trust stores.

Repo: Forgerock AM Realm X (Ansible Playbook)

Where 'X' is just some name for your realm. With AM 6.5+ you have a few options for configuring realms: ssoadm, amster, and the REST API. As I already had a number of scripts built for ssoadm from another installation, I went with that, with the exception of a custom auth tree, which ssoadm doesn't know about. For these you can use either Amster or the REST API, but at the time I was working on this there was a bug in Amster which meant Forgerock were suggesting the REST API was the best choice.

For reference on how to use the command line tools and where to put different files, see here.

Running ssoadm commands one at a time to build a realm is very slow. Instead use the do-batch command, referenced here.

DevOps

Our DevOps tool chain is git, Team City, Octopus, Ansible, and Artifactory. These work together well, but there are some important concepts to allow a nice separation between Dev teams and Platform/Infra teams.

Firstly, Octopus is the deployment platform, not Ansible. Deploying in this situation can be defined as moving the configuration to a new version, and then verifying the new state. It’s the Ansible configuration which is being deployed. Ansible maintains that configuration. When Ansible detects a failure or a scaling scenario, and brings up new instances, it doesn’t need to run the extensive integration tests which Octopus would, because the existing state has been validated already. Ansible just has to hit health checks to verify the instances are in place.

Secondly, developers should be building deployable packages for their protected applications and registering them in Artifactory. This means the Ansible play for a protected application is just 'choco install blah' or 'apt install something'. It also means the developers are somewhat isolated from the ins and outs of Ansible – they can run their installer over and over without ever thinking about Ansible.
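To illustrate how thin that play can be, here's a sketch (the host group and package name are made up):

- hosts: protected_app_servers
  become: true
  tasks:
    - name: Install the protected application from the internal package repository
      apt:
        name: protected-web-app
        state: latest
        update_cache: yes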

Fig. 4 and fig. 5 show the Team City builds and the Octopus Deploy jobs.

AM6.5 Automation - Team City Builds
Fig. 4: Strictly speaking, the Ansible Roles don't need build configurations as Ansible consumes them directly from their git repos. However, if you want a level of automated testing, then a build will need to run in order to trigger an Octopus deploy, which doesn't auto-trigger from a push to git.

Notice that the Forgerock AM Realm X repo is an Ansible Playbook rather than a Role. The realm is “the sharp end”, there will only ever be a single realm X, so there’s never going to be a requirement to share it. We can package this playbook up and push it to Artifactory so it can be ‘yum installed’. We can include a bash script to process the requirements.yml (which installs the dependent roles and their dependencies) and then execute the play. The dependencies of each role (being one of the other roles in each case) are defined in a meta/main.yml file as explained here.

AM6.5 Automation - Octopus Deploy Projects
Fig. 5: Only the protected application will ever be used to deploy a usable production system. The other deploy projects are all for the Ansible Roles. These get triggered with each commit and are there specifically to test the infra. They might still deploy right through to production, to test in all environments, but wouldn’t generally result in usable estate.

The dev owned deploy project for the protected application includes the deployment of the entire stack. In this case that means first deploying the Ansible playbook for AM realm X, which will reference the Forgerock AM role (and so its dependencies) to build the instance. A cleverly defined deploy project might deploy both the protected application and the AM realm role simultaneously, but fig. 6 shows the pipeline deploying one, then the other.

AM6.5 Automation - Pipeline - Commit to Web App
Fig. 6: The deploy pipeline for the protected application. Shown as far as a first, staging environment for brevity.

Now, I am not an Ansible guru by any stretch of the imagination. I’ve enjoyed using Chef in the past, but more often I find companies haven’t matured far enough to have developed an appetite for configuration management, so it happens that I haven’t had any exposure to Ansible until this point. Please keep in mind that I might not have implemented the Ansible components in the most efficient way.

Roles vs playbooks

  • Roles can be pulled directly from git repositories by ansible-galaxy. Playbooks have to be pushed to an Ansible server to run them.
  • Roles get versioned in and retrieved from git repos. Playbooks get built and deployed to a package management platform such as Artifactory.
  • Roles are easily shareable and can be consumed from other roles or playbooks. Playbooks reference files which are already in place on the Ansible server, either as part of the play or downloaded by ansible-galaxy.
  • An application developer generally won’t have the exposure to Ansible to be able to write useful roles. An application developer can pretty quickly get their head around a playbook which just installs their application.
  • My view of the world!

Connecting the dots

Take a look at an example I posted to github recently, where I show a role and a playbook being installed via Vagrant. The playbook is overly simplified, without any variables, but it demonstrates how the ‘playbook to role to role…’ dependency chain can be initiated with only three files:

  1. configure.sh
  2. requirements.yml
  3. test_play.yml

Imagine if these files were packaged up and pushed to Artifactory. That is my idea of what would comprise the ‘Realm X Playbook’ package in fig. 6. Of course the test_play.yml would actually be a real play with templates that set up the realm and with variables for each different environment. I hope that makes the diagram easier to trace, with this in mind.
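As a hedged sketch of the idea (the repo URL, role name, and version below are invented), requirements.yml is what starts the dependency chain, and configure.sh needs little more than ‘ansible-galaxy install -r requirements.yml -p roles/’ followed by ‘ansible-playbook test_play.yml’:

    # requirements.yml (invented URL, name, and version, for illustration only)
    - src: git+https://git.example.com/infra/forgerock-am-role.git
      name: forgerock_am
      version: "1.4.0"    # a git tag or branch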

Developer exposure

The only piece of Forgerock AM development which the dev team are exposed to is the realm definition. This is closely related to the protected application, so any developers working on authentication need to understand how the realm is set up and how to modify it. Having the dev team own the realm playbook helps distribute understanding of the platform.

Working in the platform and infra space

The above is just the pipeline for the protected application; every other repo triggers a pipeline as well. Fig. 7 shows what happens when there’s a commit to the war file repo.

AM6.5 Automation - Pipeline - Commit to Forgerock AM war file
Fig. 7: The war file install package gets punted into Artifactory and the specific version number is updated in the Forgerock AM role. It’s the update of the role that triggers the role’s pipeline, which includes the war file in its configuration. The war file is tested as part of the Forgerock AM role.
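One way to picture that version bump (the variable and package names are purely illustrative): the Forgerock AM role pins the war package version in its defaults, and the only change the war file pipeline makes to the role repo is that number.

    # roles/forgerock_am/defaults/main.yml (illustrative names)
    am_war_package_version: "6.5.3-1"

    # ...and somewhere in the role's tasks, that pinned version is what gets installed:
    - name: Install the AM war package from Artifactory
      ansible.builtin.yum:
        name: "forgerock-am-war-{{ am_war_package_version }}"
        state: present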

Ultimately, the platform and infra QA pipelines (I call them QA pipelines because they’re there purely for testing) are kicked off with commits to the role repos. It’s probably a good idea to agree a reasonable branching strategy so master is always prod ready, because it could go at any time!

The next step for the pipeline in fig. 7 might be to kick off a deploy of all protected applications consuming that role. This might present a scoping and scale problem. If there are 30 applications being protected by Forgerock AM and a deploy is kicked off for every one of these at the same time into the same test environment, you may see false failures and it may take a LONG time to verify the change. Good CI practice would suggest you integrate and test as early as possible, but if the feedback loop is unavoidably long, then you probably won’t want to kick it off for every commit when commits arrive two minutes apart.

The risk is that changes may build up before you have a chance to find out that they are wrong – mitigate it in whatever way works best for you. The right answer will depend on your own circumstances.

I think that’s it

That pretty much covers everything I’ve had to implement or work around when putting together CI/CD automation for Forgerock Access Manager. Reading back through, I’m struck by how many tiny details need to be taken into consideration in order to make this work. It has been a huge effort to get this far, and yet I know this solution is far from complete – blue/green and rollbacks would be next on the agenda.

I think with the release of version 7, this will get much easier as they move to containers. That leaves the cluster instance management to Ansible and container orchestration to something like Kubernetes – a nice separation of concerns.

Even with containers, AM is still a hugely complicated platform. I’ve worked with it for just over a year and I’m struggling to see the balance of cost vs benefit. I wrote this article because I wanted to show how much complexity there really was. I think if I had read this at the beginning I would have been able to estimate the work better, at least!

Even with the huge complexity, it’s worth noting that this is now redeploying reliably with every commit. It’s easy to see all the moving parts because there is just a single deploy pipeline which deploys the entire stack. Ownership of each different component is nicely visible due to having the repos in projects owned by different teams.

A success story?

Where Patterns go to Die

At every point in my career in software there has been a current trend in architecture. The trend has usually changed every couple of years, far faster than there have been significant changes in the underlying languages. This has often struck me as significant, as (ignoring the growing areas of machine learning and AI) the basic problems we’re trying to solve have not changed hugely over the years. So why was it ok in the 1990s to build a monolith? Why is it no longer a good idea to install an Enterprise Service Bus? What changed?

I’ve often heard people talk about past patterns with phrases like “back when we thought this was a good idea”, but I’m hesitant to believe that so many people could have been wrong about the same thing at the same time. I think there must be more to it than that.

There are always benefits

When I think about all the monoliths I’ve worked on, I can’t say that the experience has always been terrible. There are obvious benefits to a monolithic approach, such as being able to easily share code without the need for package management. They can be a lot easier to test as there are no application boundaries: just about every test you can think of could be implemented as a unit test, although the units may get rather large. Because everything is in one solution we don’t get as many network calls, so we aren’t exposed to network outages or versioning issues during releases.

What about SOA? That was huge at one point. Services focused on business processes which can be called by various applications needing that functionality: it doesn’t sound unreasonable. In fact it sounds a lot like microservices. I have worked on dozens of service oriented architectures over the last decade, none of which would have been described by their builders as SOA.

Enterprise Service Bus – almost no-one likes these any more. Yet the idea of having an integration platform which allows each application to carry on its day-to-day processes without ever having to be aware of any other application which might need to know about its data or processes is not a silly one.

How about: the Service Locator Pattern, Stored Procedures, utility classes, remote procedure calls? I’m sure if you think long enough, there will always be some other ‘good idea at the time’ that is now generally frowned upon.

But what about: microservices, serverless, cloud computing, native apps, NoSQL databases? Surely these things are destined to be around forever..? We got it right these times. Right?

You still have to design stuff

“How do you build a microservice?”

Is this a good question? If you can answer this question, does that mean you can implement a micro-architecture successfully?

If you know how to deploy, manage, and push code to an enterprise service bus, does that mean you can successfully implement one?

Let me ask these questions in another way:

Which business problem is solved by either micro-architecture or ESB? Because if you aren’t solving a business problem, then you aren’t successfully implementing anything.

It seems to me that an awful lot of technologists follow trends as if each one were their next true religion, without ever seeing the actual truths behind them. I know with absolute certainty that every single ‘bad idea’ that has at one time been ‘the latest trend’ will fix a specific problem pretty well. It may lead to other problems if not managed correctly, but that isn’t the point – if you choose an approach, you must choose one which is going to work for you and continue to work.

Characteristics

These are some of the characteristics of microservices:

  • They are individually deployable.
  • They are individually testable.
  • They are individually versionable.
  • They represent a business process.
  • They own their own database and data.
  • When changing one service, it’s important to know what other services consume it and test them alongside the change.
  • The number of microservices in an enterprise can grow pretty quickly, and managing them requires a high degree of automation.

These are some of the characteristics of monoliths:

  • All code is in a single solution.
  • Boundaries are defined by namespaces.
  • The entire application is redeployed each time.
  • User interfaces are built into the same application as business logic.
  • They often write to databases which are used by other applications.
  • If they become too big, the code becomes gridlocked and difficult to change.

These are some of the characteristics of enterprise service busses:

  • They can be highly available.
  • They allow for moving data in a resilient fashion.
  • Changes can be deployed without interfering with other applications.
  • They can integrate applications across LAN boundaries in a resilient fashion.
  • They can abstract away the implementation of business concerns behind facades.
  • They can quickly become an expensive dependency which can be updated only by a specific few people who understand the technology.

These are some of the characteristics of the service locator pattern:

  • It allows access to an IOC kernel from objects which haven’t necessarily been instantiated by that kernel.
  • It allows access to different implementations based on the requirements of the consuming class.
  • It isn’t immediately obvious that the pattern is in use, which can lead to confusion when working on an unfamiliar codebase.

These are some of the characteristics of a serverless approach:

  • Developers can think more about the code they’re writing and less about the platform.
  • The container running the code is generally the same in dev as in test and production.
  • Some serverless implementations are reduced to the function level and are very small.
  • When they become very small, services can become harder to group and manage.
  • Building serverless can sometimes require extra technical knowledge of the specific flavour of serverless in use.

Each of these patterns, approaches, and technologies has benefits and downsides. There is a time and a place for each. More importantly, there are more scenarios where each of these patterns would be a mistake than where they would work well. Even where one of these could be a good idea, there’s still plenty of scope to implement it poorly.

Blind faith

I think this is what happens. I think technologists at work get pressured into delivering quickly, or have people telling them they have to work with specific technologies, or their peers laugh when they build something uncool. I think as technologists there are too many of us out there who don’t put enough consideration into whether they are solving the business problem, or whether they are just regurgitating the same stuff that was built previously because ‘that pattern worked ok for them’. I think too many people only see the label, when they should be looking at what’s behind the label.

Piling code on top of code in a situation which isn’t being watched because “it’s worked perfectly fine so far” is what leads to problems. Whether you’re building a single application, pushing into a service fabric, or programming an ESB – if you take your eye off the ball, your pattern itself will rot.

Take SOA for example: how many huge, complicated, poorly documented, misunderstood APIs are deployed which back onto a dozen databases and a range of other services? APIs getting called by all sorts of applications deployed to all sorts of locations, with calls proxied through multiple layers to find their way to this one point of functionality. At some point those APIs were small enough to be a good idea. They were consumed by a few applications deployed somewhere not too distant, and didn’t need much documentation because their functionality was well scoped. Then someone took their eye off the ball and logic which should have been implemented in new services was thrown into the existing one because ‘deploying something else would be tricky’.

This is where patterns get thrown out, as if it was inevitable that the pattern would lead to something unmanageable. Well I have news for you: the pattern doesn’t make decisions, you do.

If you solve a problem by building a service which represents a business process, doesn’t need to call other services, but has been stuck on top of a well-used legacy monolithic database, then well done! Who cares that it isn’t quite a microservice? As long as you have understood the downsides to the choices you have made, and know how they are managed or mitigated in your specific circumstances, then that’s just fine. You don’t have to build from the textbook every time.

By solving problems rather than implementing cool patterns, we move the focus onto what is actually important. Support the business first.

My First Release Weekend

At the time of writing this post, I am 41 years old, I’ve been in the business of writing software for over 20 years, and I have never ever experienced a release weekend. Until now.

It’s now nearly 1 pm. I’ve been here since 7 am. There are a dozen or so different applications being deployed today, which are highly coupled and maddeningly unresilient. For my part, I was deploying a web application and some config to a security platform. We again hit a myriad of issues which hadn’t been seen in prior environments and spent a lot of time scratching our heads. The automated deployment pipeline I built for the change takes roughly a minute to deploy everything, and yet it took us almost 3 hours to get to the point where someone could log in.

The release was immediately labelled a ‘success’ and everyone started singing its praises, even as subsequent deployments of other applications started to fail.

This is not success!

Success is when the release takes the 60 seconds for the pipeline to run and it’s all working! Success isn’t having to intervene to diagnose issues in an environment no-one’s allowed access to until the release weekend! Success is knowing the release is good because the deploy status is green!

But when I look at the processes being followed, I know that this pain is going to happen. As do others, who appear to expect it and accept it, with hearty comments of ‘this is real world development’ and ‘this is just how we roll here’.

So much effort and failure thrown at releasing a fraction of the functionality which could have been out there if quality was the barrier to release, not red tape.

And yet I know I’m surrounded here by some very intelligent people, who know there are better ways to work. I can’t help wondering where and why progress is being blocked.