Automation with Forgerock AM 6.5

Beware – here be dragons!

Over the last year, I’ve become very familiar with Forgerock’s Access Manager platform. Predominantly I’ve been working with a single, manually managed, 13.5 instance, but since experiencing 3 days of professional services from Forgerock, I’ve been busily working on automating AM 6.5.1 using Team City, Octopus, and Ansible. While the approach I’ve taken isn’t explicitly the one recommended by Forgerock, it isn’t frowned upon, and it is in line with the containerised deployment mechanisms which are expected to become popular with AM v7. I can’t share the source code for what was implemented as it would be a breach of client trust, but given the lack of material available on automating AM (and the sheer complexity of the task), I think it’s worth outlining the approach.

Disclaimer alert!
What I cover here is a couple of steps on from what was eventually implemented for my client, because automating something as complex as Forgerock AM was new for them, as were Ansible roles and volatile infra. We went as far as having a single playbook for the AM definition, and we had static infra – the next logical step would be to break the playbook down into roles and generate the infra with each deploy.

Since I’ve already been through the pain of distilling non-functional requirements down to a final approach, I feel it would be easier here to start at the end. So let’s talk implementation.

Tech Stack

The chosen tech stack was driven by what was already in use by my client, augmented with a couple of things we felt the pain of missing.

Code repositories: git in Azure DevOps
Build platform: Team City
Deployment platform: Octopus
Configuration management: Ansible
Package management: JFrog Artifactory
Local infra as code tool: Vagrant

A few points I’d like to make about some of these:

  1. Azure DevOps looks really nice, but has an appalling range of thousands of IP addresses which need to be whitelisted in order to use any of the hosted build / deploy agents. The problem goes away if you self host agents, but it’s a poor effort on the part of Microsoft.
  2. Octopus isn’t my preferred deployment tool. I find Octopus is great for beginners, but it lends itself to point and click rather than versioning deploy code in repos. It’s also very over-engineered and opinionated, forcing its concepts onto users. My personal preference is Thoughtworks’ GoCD, which takes the opposite approach.
  3. You don’t need to use Vagrant for local development; I only call it out here because I believe it can speed things up considerably. It’s possible to execute Ansible playbooks via the Vagrantfile, or (my preference) write a bash script which can be used manually, via Ansible, or from virtually any other platform – see the sketch after this list.
  4. I don’t have huge amounts of experience with Ansible, but it seems to do the job pretty well. I’m sure I probably missed a few tricks in how I used it.
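
On the bash script point (3, above), here’s a minimal sketch of the kind of single entry point I mean – something that works identically from a Vagrant shell provisioner, an Ansible task, or a terminal. Every path and file name here is an assumption for illustration:

#!/usr/bin/env bash
# provision.sh - push a locally built AM war file into a local JBoss instance.
set -euo pipefail

WAR_SOURCE="${1:-/vagrant/build/openam.war}"    # hypothetical build output
DEPLOY_DIR="/opt/jboss/standalone/deployments"  # hypothetical JBoss install

cp "$WAR_SOURCE" "$DEPLOY_DIR/openam.war"
touch "$DEPLOY_DIR/openam.war.dodeploy"         # marker telling the JBoss deployment scanner to (re)deploy

Because the script takes its one input as an argument with a sane default, it behaves the same whether Vagrant, Ansible, or a human runs it.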

Architecture

Generally, with a multi-node deployment of Forgerock AM, we end up with something looking like fig. 1.

AM6.5 Automation - AM - Shared Config - Affinity Token Stores
Fig. 1: Basic multi-node configuration with affinity enabled.

There are two items to note about this configuration:

  1. The shared config database means ssoadm/Amster commands only need executing against one instance. The other instances then just need restarting to pick up the config which has been injected into the config database.
  2. Affinity is the name for the mechanism AM uses to load balance the token stores without risking race conditions and dirty reads. If a node writes a piece of data to token store instance 1, then every node will always go back to instance 1 for that piece of data (failing over to other options if instance 1 is unavailable). This helps where replication takes longer than the gap between writing and reading.

Affinity rocks. Until we realised this was available, there was a proxy in front of the security token stores set to round robin. If you tried to read something immediately after writing, you’d often get a dirty read or an exception. Affinity does away with this by deciding where the data should be stored based on a hash of the data location which all nodes can calculate. Writes and reads from every node will always go to the same STS instance first.

For my purposes, I found that the amount of data I needed to store in the user’s profile was tiny – maybe two properties – which led me down the path of trying to use client based sessions to store the profile. The benefit of this approach is that we don’t really need any security token stores. Our architecture ends up looking like fig. 2.

AM6.5 Automation - AM - Default Config - Client Based Sessions
Fig. 2: No need for token stores.

And we don’t just do away with the token stores. Because we are fully automating the deployment, we don’t need to share a config database either – we know our config is aligned because it is recreated with every deploy, exactly as it is in the source code.

Keys

Ok, so it isn’t quite as easy as that. Because we aren’t sharing config, we can’t allow the deploy process to pick random encryption keys. These keys are used to encode session info, security tokens, and cookies. To align them across instances we need to run a few commands during deployment.

set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-signing-rsa-certificate-alias=<< your cert alias >>"
set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-encryption-rsa-certificate-alias=<< your cert alias >>"
set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-encryption-aes-key=<< your aes key >>"
set-attr-defs --verbose --servicename iPlanetAMSessionService -t global -a "openam-session-stateless-signing-hmac-shared-secret=<< your hmac key >>"
set-attr-defs --verbose --servicename iPlanetAMAuthService -t organization -a "iplanet-am-auth-hmac-signing-shared-secret=<< your hmac key >>"
set-attr-defs --verbose --servicename iPlanetAMAuthService -t organization -a "iplanet-am-auth-key-alias=<< your cert alias >>"
set-attr-defs --verbose --servicename RestSecurityTokenService -t organization -a "oidc-client-secret=<< your oidc secret >>"

These settings are ssoadm commands, mostly found in this helpful doco, though I had to dig a bit further for one or two. Some of the values have minimum complexity rules. The format I’ve given is how they would appear in a batch file run via the ssoadm do-batch command.
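
For reference, a hedged sketch of how a batch file of the commands above might be executed – the ssoadm install path, password file, and batch file name are all assumptions:

/opt/ssoadm/bin/ssoadm do-batch \
  --adminid amadmin \
  --password-file /opt/ssoadm/pwd.txt \
  --batchfile align-keys.txt \
  --continue

The --continue flag keeps the batch running past individual failures; during an automated deploy you may prefer to drop it so the deploy fails fast.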

SAML gotcha

To make client based profiles work for SAML authentication, I was surprised to find that I needed to write a couple of custom classes.

SAML auth isn’t something we wanted to do, but we were forced down this route due to limitations of another 3rd party platform.

I started off with this doco from Forgerock, and with Forgerock’s am-external repo. With some debug level logging, I was able to find that I needed to create a custom AccountMapper and a custom AttributeMapper. It seems that both of the default classes were coded to expect the profile to be stored in a db, regardless of whether client sessions were enabled or not. Rather than modifying the existing classes, I added my own classes to avoid breaking anything else which might be using them.

Referencing the new classes is, annoyingly, not well documented. First, build the project and drill down into the compiled output (I just used the two .class files created for my new classes), then copy them into the war file’s WEB-INF/lib/openam-federation-library.jar, making sure the .class files land in the path matching their package. I managed to reference these classes in my ‘identityprovider.properties’ file with these xml elements:

<Attribute name="idpAccountMapper">
    <Value>com.sun.identity.saml2.plugins.YourCustomAccountMapper</Value>
</Attribute>
<Attribute name="idpAttributeMapper">
    <Value>com.sun.identity.saml2.plugins.YourCustomAttributeMapper</Value>
</Attribute>
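
For the copy step itself, one option is to update the existing jar in place rather than rebuild the whole federation library – a sketch, using the hypothetical class names above and an assumed compiled-output directory:

cd build/classes   # assumed location of the compiled .class files
jar uf ../../warFile/WEB-INF/lib/openam-federation-library.jar \
  com/sun/identity/saml2/plugins/YourCustomAccountMapper.class \
  com/sun/identity/saml2/plugins/YourCustomAttributeMapper.class

jar uf adds each .class file to the jar under its path relative to the current directory, which is what keeps the package structure (and therefore class loading) intact.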

As code

The fully coded definition of the AM deployment which I ended up with uses the git repositories shown in fig. 3.

AM6.5 Automation - DevOps Git Repo's
Fig. 3: The collection of git repos listed against the teams which could own them. The colour coding used here will be continued through other diagrams.

Infra space

Hopefully you’re using fully volatile instances, and creating/destroying new webservers all the time. If you are, then this should make some sense. There’s a JBoss webserver role which references a RedHat server role. These can be reused by various deployments and they’re configured once by the Infra team.

I’m not going to go into much detail about these, as standards for building instances will change from place to place.

We didn’t have fully volatile infra when I implemented AM automation, which meant it was important to completely remove every folder from a previous deploy before re-deploying. While developing, I’d often run into situations where a setting left over from a previous run caused failures which wouldn’t have happened on a fresh instance.

Platform space

The Platform Space is about managing the 3rd party applications that support the enterprise. This space owns the repo for customising the war file, a copy of the am-external repo from Forgerock, and the Ansible Role defining how to deploy a totally vanilla instance of AM – this references the JBoss webserver role. These are all artifacts which are needed in order to just deploy the vanilla, reusable Forgerock AM platform without any realms.

Dev space

The Dev Space should be pretty straightforward. It contains a repo for the protected application, and a repo for the AM realm to which the application belongs. The realm definition is an Ansible playbook, rather than a role, because there isn’t a scenario where it would be shared. Also, although ansible-galaxy can be used to download the dependencies from git, it doesn’t execute them; you still need a point of entry for running the play and its dependencies, and that can simply be a playbook. One of the files in the playbook should be a requirements.yml, which is used to initiate the chain of dependencies through the other roles (covered in a little more detail below).

Repo: Forgerock AM war file

My solution structure for building the war file looks like this:

- root
  - warFile
  - customCode
    - code
    - tests
  - amExternal
  - xui
    - openam-ui-api
    - openam-ui-ria
  - staticResources

We can go through each of these subfolders in turn.

warFile

The war file is an unzipped copy of the official file, downloaded from here. I was using version 6.5.1 of Access Manager. This folder is a Maven project configured to output a .war file. Before compiling this war file, we need to pull in all the customisations from the rest of the solution.

customCode

This is (unsurprisingly) for custom Java code, built against the Forgerock AM code. The type of things you might find here would be plugins, auth nodes, auth modules, services, and all sorts of other points of extension where you can just create a new ‘thing’ and reference it by classname in the realm config.

Custom code is pretty straightforward. New libraries are the easiest to deal with, as you’re writing fresh code against Forgerock’s interfaces. You compile to a jar and copy that jar into the war file under /WEB-INF/lib/ along with any dependencies. As long as you are careful with your namespaces and keep an eye on the size of what you’re writing, you can probably get away with building a single jar file for all your custom code. This makes things easier in the sense that you can do everything in one project, right alongside an unzipped war file. If you start to need multiple jars to break down your code further, consider moving your custom code to a different repo, and hosting the jars on an internal Maven server.
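
As a sketch, the ‘compile to a jar and copy it into the war file’ step can be as small as this in a build script – the project layout matches the solution structure above, but the jar name is an assumption:

# Build the custom code, then drop the jar into the unzipped war.
mvn -f customCode/code/pom.xml clean package
cp customCode/code/target/custom-am-extensions.jar warFile/WEB-INF/lib/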

Because you are writing new code here, there is of course the opportunity to add some unit tests, and I suggest you do. I found that keeping my logic out of any classes which implement anything from Forgerock was a good move – allowing me to test logic without worrying about how the Forgerock code hangs together. This is probably sage advice at any time on any other platform, as well.

Useful link: building custom auth nodes (may require a Forgerock account to access)

amExternal

The code in am-external is a little trickier. This repo from Forgerock has around 50 modules in it, and you’ll probably only want to recompile a couple. I’m not really a Java developer, so rather than try to get every module working, I elected to create my own git repo with am-external in it, keeping track of customisations in the git history and in README.md, then manually copying the recompiled jars over into my war file build. I placed these compiled jars into the amExternal folder, with a build script which simply copies them into /WEB-INF/lib/ before the war file is compiled.

xui

This is (in my opinion) a special case from am-external: a module called openam-ui. We already had XUI customisations from a while ago, otherwise I would probably not have bothered with XUI. From my own experience, and having discussed this during some on-site Forgerock Professional Services, XUI is a pretty clunky way to do things. The REST API in AM 6.5+ is excellent; you can easily consume it from your own login screen.

For previous versions of AM, it’s been possible just to copy the XUI files into the war file before compilation, but now we have to use the compiled output.

Instructions for downloading the am-external source and the XUI are here. New themes can be added at: /openam-ui-ria/src/resources/themes/ – just copy the ‘dark’ folder and start from there. There are a couple of places where you have to add a reference to the new theme, but the above link should help you out with that as well.

This module needs compiling at openam-ui, and the output copying into the war file under the /XUI folder.
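
In build-script terms that looks something like the following sketch. The module paths come from the solution structure above, but the compiled output location is an assumption – check what your XUI build actually emits before wiring this in:

# Compile the XUI module, then overlay the compiled output into the war's /XUI folder.
mvn -f xui/pom.xml clean package
cp -r xui/openam-ui-ria/target/XUI/. warFile/XUI/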

staticResources

We included a web.xml and a keepAlive.jsp as non-XUI resources. A nice way to handle these is to recreate the warFile structure in the staticResources folder, add your files there, and use a script to copy the entire folder structure recursively into the war file, leaving the rest of the war’s files in place.
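
The copy script itself can be a one-liner – a sketch, assuming the mirrored folder layout described above:

# Overlay staticResources onto the unzipped war: files with matching names are
# replaced, and everything else already in the war is left untouched.
cp -r staticResources/. warFile/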

Some of these (amExternal and staticResources) could have been left out, and the changes made directly into the war file. I didn’t do this for two reasons:

  1. The build scripts which copy these files into place explain to any new developers what’s going on far better than a git history would.
  2. By leaving the war file clean (no changes at all since downloading from Forgerock), I can confidently replace it with the next version and know I haven’t lost any changes.

The AM-SSOConfiguratorTools

The AM-SSOConfiguratorTools-{version}.zip file can be downloaded from here. The version I was using is 5.1.2.2, but you will probably want the latest version.

Push this zip into Artifactory, so it can be referenced by the Ansible play which installs AM.

You have a choice to make here about how much installation code lives with the Ansible play, and how much is in the Configurator package you push to Artifactory. There are a number of steps which go along with installing and using the Configurator which you might find apply to all usages, in which case I would tend to add them to the Configurator package. These steps are things like:

  • Verify the right version of the JDK is available (1.8).
  • Unzip the tools.
  • Copy to the right locations.
  • Apply permissions.
  • Add certificates to the right trust store / copy your own trust store into place.
  • Execute the Configurator referencing the install config file (which will need to come from your Ansible play, pretty much always).
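
Scripted, those steps might look like this sketch. The zip matches the version mentioned above, but the install path, the configurator jar name inside the zip, and the config file location are all assumptions:

#!/usr/bin/env bash
# install-configurator.sh - a hedged sketch of the common Configurator steps.
set -euo pipefail

# Verify JDK 1.8 is available.
java -version 2>&1 | grep -q '1\.8\.' || { echo "JDK 1.8 required" >&2; exit 1; }

# Unzip the tools, copy into place, apply permissions.
unzip -o AM-SSOConfiguratorTools-5.1.2.2.zip -d /opt/am-tools
chmod -R 750 /opt/am-tools

# Execute the Configurator against the install config rendered by the Ansible play.
java -jar /opt/am-tools/openam-configurator-tool.jar --file /tmp/am-install.cfg

The trust store step is deliberately missing here, as it’s the part most likely to vary from site to site.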

Repo: Forgerock AM (Ansible Role)

This role has to run the following (high level) steps:

  1. Run the JBoss webserver role.
  2. Configure JBoss’ standalone.xml to point at a certificate store with your SSL cert in it.
  3. Grab the war file from Artifactory and register it with JBoss.
  4. Pull the Configurator package from Artifactory.
  5. Run the Configurator with an install config file from the Role.
  6. Use the dsconfig tool to allow anonymous access to Open DS (if you are running the default install of Open DS).
  7. Add any required certs into the Open DS keystore (/{am config directory}/opends/config/keystore) – see the sketch after this list.
  8. Align passwords on certs using the keystore.pin file from the same directory.
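
For steps 7 and 8, the keytool incantation is the fiddly part – a sketch, where the AM config directory, alias, and cert file are assumptions and the keystore paths come from the list above:

AM_CFG=/path/to/am-config    # assumed AM config directory
# Import a cert into the embedded Open DS keystore, reading the store password
# from the keystore.pin file which lives alongside it.
keytool -importcert -noprompt \
  -alias my-ssl-cert \
  -file /tmp/my-ssl-cert.pem \
  -keystore "$AM_CFG/opends/config/keystore" \
  -storepass "$(cat "$AM_CFG/opends/config/keystore.pin")"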

More detailed install instructions can be found here.

I ran into a lot of issues while trying to write an install script which would work. Googling the problems helped, but having a Forgerock Backstage account and being able to ask their support team directly was invaluable.

A lot of issues were around getting certificates into the right stores, with the correct passwords. You need to take special care to make sure that Open DS also has access to the right SSL certs and trust stores.

Repo: Forgerock AM Realm X (Ansible Playbook)

Where ‘X’ is just some name for your realm. With AM 6.5+ you have a few options for configuring realms: ssoadm, Amster, and the REST API. As I already had a number of scripts built for ssoadm from another installation, I went with that – with the exception of a custom auth tree, which ssoadm doesn’t know about. For auth trees you can use either Amster or the REST API, but at the time I was working on this there was a bug in Amster which meant Forgerock were suggesting the REST API was the better choice.

For reference on how to use the command line tools and where to put different files, see here.

Running ssoadm commands one at a time to build a realm is very slow. Instead, use the do-batch command, referenced here.

DevOps

Our DevOps tool chain is git, Team City, Octopus, Ansible, and Artifactory. These work together well, but there are some important concepts to allow a nice separation between Dev teams and Platform/Infra teams.

Firstly, Octopus is the deployment platform, not Ansible. Deploying in this situation can be defined as moving the configuration to a new version, and then verifying the new state. It’s the Ansible configuration which is being deployed. Ansible maintains that configuration. When Ansible detects a failure or a scaling scenario, and brings up new instances, it doesn’t need to run the extensive integration tests which Octopus would, because the existing state has been validated already. Ansible just has to hit health checks to verify the instances are in place.
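
To make that last point concrete: the health check Ansible needs can be as light as a single HTTP call. AM ships an isAlive.jsp endpoint for exactly this; the host name here is an assumption:

# Exit non-zero if the instance isn't serving (-f turns HTTP errors into failures).
curl -fsS https://am.example.internal/openam/isAlive.jsp > /dev/null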

Secondly, developers should be building deployable packages for their protected applications and registering them in Artifactory. This means the Ansible play for a protected application is just ‘choco install blah’ or ‘apt install something’. It also means the developers are somewhat isolated from the ins and outs of Ansible – they can run their installer over and over without ever thinking about Ansible.

Fig. 4 and fig. 5 show the Team City builds and the Octopus Deploy jobs.

AM6.5 Automation - Team City Builds
Fig. 4: Strictly speaking, the Ansible Roles don’t need build configurations as Ansible consumes them directly from their git repos. However, if you want a level of automated testing, then a build will need to run in order to trigger an Octopus deploy, which doesn’t auto trigger from a push to git.

Notice that the Forgerock AM Realm X repo is an Ansible Playbook rather than a Role. The realm is “the sharp end”, there will only ever be a single realm X, so there’s never going to be a requirement to share it. We can package this playbook up and push it to Artifactory so it can be ‘yum installed’. We can include a bash script to process the requirements.yml (which installs the dependent roles and their dependencies) and then execute the play. The dependencies of each role (being one of the other roles in each case) are defined in a meta/main.yml file as explained here.

AM6.5 Automation - Octopus Deploy Projects
Fig. 5: Only the protected application will ever be used to deploy a usable production system. The other deploy projects are all for the Ansible Roles. These get triggered with each commit and are there specifically to test the infra. They might still deploy right through to production, to test in all environments, but wouldn’t generally result in usable estate.

The dev owned deploy project for the protected application includes the deployment of the entire stack. In this case that means first deploying the Ansible playbook for AM realm X, which will reference the Forgerock AM role (and so its dependencies) to build the instance. A cleverly defined deploy project might deploy both the protected application and the AM realm role simultaneously, but fig. 6 shows the pipeline deploying one, then the other.

AM6.5 Automation - Pipeline - Commit to Web App
Fig. 6: The deploy pipeline for the protected application. Shown as far as a first, staging environment for brevity.

Now, I am not an Ansible guru by any stretch of the imagination. I’ve enjoyed using Chef in the past, but more often I find companies haven’t matured far enough to have developed an appetite for configuration management, so I hadn’t had any exposure to Ansible until this point. Please keep in mind that I might not have implemented the Ansible components in the most efficient way.

Roles vs playbooks

  • Roles can be pulled directly from git repositories by ansible-galaxy. Playbooks have to be pushed to an Ansible server to run them.
  • Roles get versioned in and retrieved from git repos. Playbooks get built and deployed to a package management platform such as Artifactory.
  • Roles are easily shareable and can be consumed from other roles or playbooks. Playbooks reference files which are already in place on the Ansible server, either as part of the play or downloaded by ansible-galaxy.
  • An application developer generally won’t have the exposure to Ansible to be able to write useful roles. An application developer can pretty quickly get their head around a playbook which just installs their application.
  • My view of the world!

Connecting the dots

Take a look at an example I posted to github recently, where I show a role and a playbook being installed via Vagrant. The playbook is overly simplified, without any variables, but it demonstrates how the ‘playbook to role to role…’ dependency chain can be initiated with only three files:

  1. configure.sh
  2. requirements.yml
  3. test_play.yml

Imagine if these files were packaged up and pushed to Artifactory. That is my idea of what would comprise the ‘Realm X Playbook’ package in fig. 6. Of course the test_play.yml would actually be a real play, with templates that set up the realm and variables for each different environment. I hope that makes the diagram easier to trace.
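
In case it helps to connect those dots further, the heart of a configure.sh like the one in that example is just two commands – a sketch, assuming the dependent roles install into a local folder:

# Pull the dependent roles (and, via each role's meta/main.yml, their dependencies)...
ansible-galaxy install -r requirements.yml -p ./roles --force
# ...then run the play which consumes them.
ansible-playbook test_play.yml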

Developer exposure

The only piece of Forgerock AM development which the dev team are exposed to is the realm definition. This is closely related to the protected application, so any developers working on authentication need to understand how the realm is set up and how to modify it. Having the dev team own the realm playbook helps distribute understanding of the platform.

Working in the platform and infra space

The above is just the pipeline for the protected application; every other repo triggers a pipeline as well. Fig. 7 shows what happens when there’s a commit to the war file repo.

AM6.5 Automation - Pipeline - Commit to Forgerock AM war file
Fig. 7: The war file install package gets punted into Artifactory and the specific version number updated in the Forgerock AM role. It’s the update of the role that triggers the role’s pipeline which includes the war file in its configuration. The war file is tested as part of the Forgerock AM role.

Ultimately, the platform and infra QA pipelines (I call them QA pipelines because they’re there purely for testing) are kicked off with commits to the role repos. It’s probably a good idea to agree a reasonable branching strategy so master is always prod ready, because it could go at any time!

The next step for the pipeline in fig. 7 might be to kick off a deploy of all protected applications consuming that role. This might present a scoping and scale problem. If there are 30 applications being protected by Forgerock AM and a deploy is kicked off for every one of them at the same time into the same test environment, you may see false failures and it may take a LONG time to verify the change. Good CI practice would suggest you integrate and test as early as possible, but if the feedback loop is unpreventably long, then you probably won’t want to kick it off for every commit arriving a couple of minutes apart.

The risk is that changes may build up before you have a chance to find out that they are wrong – mitigate it in whatever way works best for you. The right answer will depend on your own circumstances.

I think that’s it

That pretty much covers everything I’ve had to implement or work around when putting together CI/CD automation for Forgerock Access Manager. Reading back through, I’m struck by how many tiny details need to be taken into consideration in order to make this work. It has been a huge effort to get this far, and yet I know this solution is far from complete – blue/green and rollbacks would be next on the agenda.

I think with the release of version 7, this will get much easier as they move to containers. That leaves the cluster instance management to Ansible and container orchestration to something like Kubernetes – a nice separation of concerns.

Even with containers, AM is still a hugely complicated platform. I’ve worked with it for just over a year and I’m struggling to see the balance of cost vs benefit. I wrote this article because I wanted to show how much complexity there really was. I think if I had read this at the beginning I would have been able to estimate the work better, at least!

Even with the huge complexity, it’s worth noting that this is now redeploying reliably with every commit. It’s easy to see all the moving parts because there is just a single deploy pipeline which deploys the entire stack. Ownership of each different component is nicely visible due to having the repos in projects owned by different teams.

A success story?

Where Patterns go to Die

At every point in my career in software there has been a current trend in architecture. The trend has usually changed every couple of years, far faster than there have been significant changes in the underlying languages. This has often struck me as significant, as (ignoring the growing areas of machine learning and AI) the basic problems we’re trying to solve have not changed hugely over the years. So why was it ok in the 1990s to build a monolith? Why is it no longer a good idea to install an Enterprise Service Bus? What changed?

I’ve often heard people talk about past patterns with phrases like “back when we thought this was a good idea”, but I’m hesitant to believe that so many people could have been wrong about the same thing at the same time. I think there must be more to it than that.

There are always benefits

When I think about all the monoliths I’ve worked on, I can’t say that the experience has always been terrible. There are obvious benefits to a monolithic approach, such as being able to easily share code without the need for package management. They can be a lot easier to test as there are no application boundaries; just about every test you can think of could be implemented as a unit test, although the units may get rather large. Because everything is in one solution we don’t get as many network calls, so we aren’t exposed to network outages or versioning issues during releases.

What about SOA? That was huge at one point. Services focused on business processes which can be called by various applications needing that functionality – it doesn’t sound unreasonable. In fact it sounds a lot like microservices. I have worked on dozens of service oriented architectures over the last decade, none of which would have been described by their builders as SOA.

Enterprise Service Bus – almost no-one likes these any more. Yet the idea of having an integration platform which allows each application to carry on its day-to-day processes without ever having to be aware of any other application which might need to know about its data or processes is not a silly one.

How about: the Service Locator Pattern, Stored Procedures, utility classes, remote procedure calls? I’m sure if you think long enough, there will always be some other ‘good idea at the time’ that is now generally frowned upon.

But what about: microservices, serverless, cloud computing, native apps, NoSQL databases? Surely these things are destined to be around forever…? We got it right this time. Right?

You still have to design stuff

“How do you build a microservice?”

Is this a good question? If you can answer this question, does that mean you can implement a micro-architecture successfully?

If you know how to deploy, manage, and push code to an enterprise service bus, does that mean you can successfully implement one?

Let me ask these questions in another way:

Which business problem is solved by either micro-architecture or ESB? Because if you aren’t solving a business problem, then you aren’t successfully implementing anything.

It seems to me that an awful lot of technologists follow trends like each one is their next true religion, without ever seeing the actual truths behind them. I know with absolute certainty that every single ‘bad idea’ that has at one time been ‘the latest trend’ will fix a specific problem pretty well. It may lead to other problems if not managed correctly, but that isn’t the point – if you choose an approach, you must choose one which is going to work for you and continue working.

Characteristics

These are some of the characteristics of microservices:

  • They are individually deployable.
  • They are individually testable.
  • They are individually versionable.
  • They represent a business process.
  • They own their own database and data.
  • When changing one service, it’s important to know what other services consume it and test them alongside the change.
  • The number of microservices in an enterprise can grow pretty quickly, and managing them requires a high degree of automation.

These are some of the characteristics of monoliths:

  • All code is in a single solution.
  • Boundaries are defined by namespaces.
  • The entire application is redeployed each time.
  • User interfaces are built into the same application as business logic.
  • They often write to databases which are used by other applications.
  • If they become too big, the code becomes gridlocked and difficult to change.

These are some of the characteristics of enterprise service busses:

  • They can be highly available.
  • They allow for moving data in a resilient fashion.
  • Changes can be deployed without interfering with other applications.
  • They can integrate applications across LAN boundaries in a resilient fashion.
  • They can abstract away the implementation of business concerns behind facades.
  • They can quickly become an expensive dependency which can be updated only by a specific few people who understand the technology.

These are some of the characteristics of the service locator pattern:

  • It allows access to an IOC kernel from objects which haven’t necessarily been instantiated by that kernel.
  • It allows access to different implementations based on the requirements of the consuming class.
  • It isn’t immediately obvious that the pattern is in use, which can lead to confusion when working on an unfamiliar codebase.

These are some of the characteristics of a serverless approach:

  • Developers can think more about the code they’re writing and less about the platform.
  • The container running the code is generally the same in dev as in test and production.
  • Some serverless implementations are reduced to the function level and are very small.
  • When they become very small, services can become harder to group and manage.
  • Building serverless can sometimes require extra technical knowledge of the specific flavour of serverless in use.

Each of these patterns, approaches, and technologies has benefits and downsides. There is a time and a place for each. More importantly, there are more scenarios where each of these patterns would be a mistake than where they would work well. Even where one of them could be a good idea, there’s still plenty of scope to implement it poorly.

Blind faith

I think this is what happens. I think technologists at work get pressured into delivering quickly, or have people telling them they have to work with specific technologies, or their peers laugh when they build something uncool. I think there are too many of us technologists out there who don’t put enough consideration into whether we are solving the business problem, or whether we are just regurgitating the same stuff that was built previously because ‘that pattern worked ok for them’. I think too many people only see the label, when they should be looking at what’s behind the label.

Piling code on top of code in a situation which isn’t being watched because “it’s worked perfectly fine so far” is what leads to problems. Whether you’re building a single application, pushing into a service fabric, or programming an ESB – if you take your eye off the ball, your pattern itself will rot.

Take SOA for example: how many huge, complicated, poorly documented, misunderstood APIs are deployed which back onto a dozen databases and a range of other services? APIs getting called by all sorts of applications deployed to all sorts of locations, with calls proxied through multiple layers to find their way to this one point of functionality. At some point those APIs were small enough to be a good idea. They were consumed by a few applications deployed somewhere not too distant, and didn’t need much documentation because their functionality was well scoped. Then someone took their eye off the ball and logic which should have been implemented in new services was thrown into the existing one because ‘deploying something else would be tricky’.

This is where patterns get thrown out, as if it was inevitable that the pattern would lead to something unmanageable. Well I have news for you: the pattern doesn’t make decisions, you do.

If you solve a problem by building a service which represents a business process, doesn’t need to call other services, but has been stuck on top of a well used legacy monolithic database, then well done! Who cares that it isn’t quite a microservice? As long as you have understood the downsides to the choices you have made, and know how they are managed or mitigated in your specific circumstances, then that’s just fine. You don’t have to build from the text book every time.

By solving problems rather than implementing cool patterns, we move the focus onto what is actually important. Support the business first.

My First Release Weekend

At the time of writing this post, I am 41 years old, I’ve been in the business of writing software for over 20 years, and I have never ever experienced a release weekend. Until now.

It’s now nearly 1 pm. I’ve been here since 7 am. There are a dozen or so different applications being deployed today, which are highly coupled and maddeningly unresilient. For my part, I was deploying a web application and some config to a security platform. We again hit a myriad of issues which hadn’t been seen in prior environments and spent a lot of time scratching our heads. The automated deployment pipeline I built for the change takes roughly a minute to deploy everything, and yet it took us almost 3 hours to get to the point where someone could log in.

The release was immediately labelled a ‘success’ and everyone started singing praises – even as subsequent deployments of other applications started to fail.

This is not success!

Success is when the release takes the 60 seconds for the pipeline to run and it’s all working! Success isn’t having to intervene to diagnose issues in an environment no-one’s allowed access to until the release weekend! Success is knowing the release is good because the deploy status is green!

But when I look at the processes being followed, I know that this pain is going to happen. As do others, who appear to expect it and accept it, with hearty comments of ‘this is real world development’ and ‘this is just how we roll here’.

So much effort and failure thrown at releasing a fraction of the functionality which could have been out there if quality was the barrier to release, not red tape.

And yet I know I’m surrounded here by some very intelligent people, who know there are better ways to work. I can’t help wondering where and why progress is being blocked.

Legislature and Off the Shelf Thinking

I’m always pleasantly surprised when I find an aspect of software delivery which I hadn’t previously considered, or seen as fully as I might have.

Today I was chatting with a colleague who, it turns out, has a long history in the business of superannuation (pensions, for those in the UK). I was expressing my very heartfelt belief that building a business’ core domain using an off the peg system is a risky undertaking. I talked about how I could understand a company purchasing a CRM system, as customer management is a well understood space, and why spend money developing an in-house CRM system when you really only want it to do what every other CRM system does? I talked about how leveraging an off the peg system leaves the core domain at the mercy of the business which owns the system. I expressed dissatisfaction at the architecture of off the peg solutions; that they are generally monolithic and impossible to marry with today’s continuous delivery practices.

It was about at this point when she explained that the business of superannuation is predominantly legislated by the Australian government, and the main difference between funds is in choice of investment opportunity. The services made available are mandated by legislation. The way the fund is managed is by and large mandated by legislation. So much is legislated that there are off the shelf systems available which cover pretty much every aspect of managing a superannuation fund, from managing fund investments to giving members access to manage their accounts online. These systems are also kept up to date with legislative changes as and when they happen. So funds stay within the bounds of the law simply by using that software.

Ok, so the off the shelf story always sounds rosier than it is in real life, but there’s an interesting point here. Because the core domain of the business is not proprietary business logic – in fact it’s not just well known, it’s enforced by an external third party – an off the peg solution could perhaps model it perfectly well. Updates to the system from the vendor are driven by the same legislation as drives the business. The main differentiators between systems become UX and architecture.

The conversation has left me wondering whether there are any other businesses which are so highly legislated that the same logic would apply. It has also left me wondering how a business which embraces an off the shelf solution for their core domain might also find it difficult to embrace modern software delivery techniques. In fact, I wonder if they would ever reach the tipping point where it becomes necessary to raise the dev teams above simply hacking solutions together.

Scale or Fail

I’ve heard a lot of people say something like “but we don’t need huge scalability” when pushed for a reason why their architecture is straight out of the 90s. “We’re not big enough for devops” is another regular excuse. But while it’s certainly true that many enterprises don’t need to worry so much about high loads and high availability, there are some other, very real benefits to embracing early 21st century architecture principles.

Scalable architecture is simple architecture

Keep it simple, stupid! It’s harder to do than it might seem. What initially appears to be the easy solution can quickly turn into a big, unmanageable ball of tightly coupled dependencies, where one bad line of code can affect a dozen different applications.

In order to scale easily, a system should be simple. When scaling, you could end up with dozens or even hundreds of instances, so any complexity is multiplied. Complexity is also a recipe for waste. If you scale a complex application, the chances are you’re scaling bits which simply don’t need to scale. Systems should be designed so hot functions can be scaled independently of those which are under-utilised.

Simple architecture takes thought and consideration. It’s decoupled for good reason – small things are easier to keep ‘easy’ than big things. An array of small things, all built with the same basic rules and standards, can be easily managed if a little effort is put into working out an approach which works for you. Once you have a few small things all being managed in the same way, growing to lots of small things is easy, if it’s needed.

Simple architecture is also resilient, because simple things tend not to break. And even if you aren’t bothered about a few outages, it’s better to only have the outages you plan for.

Scalable architecture is decoupled

If you need to make changes in anything more than a reverse proxy in order to scale one service, then your architecture is coupled and inelastic. Beyond being scalable, decoupled architecture is much easier to maintain, and keeps a much higher level of quality because it’s easier to test.

Decoupled architecture is scoped to a specific few modules which can be deployed together repeatedly as a single stack with relative ease, once automated. Outages are easy to fix, as it’s just a case of hitting the redeploy button.

Your end users will find that your decoupled architecture is much nicer to use as well. Rather than making dozens of calls to load and save data in a myriad of different applications and databases, a decoupled application makes only one or two calls to load or save the data to a dedicated store, then raises events for other systems to handle. It’s called eventual consistency, and it isn’t difficult to make work. In fact it’s almost impossible to avoid in an enterprise system, so embracing the principle wholeheartedly makes the required thought processes easier to adopt.

Scalable architecture is easier to test

If you are deploying a small, well understood stack with very well known behaviours and endpoints, then it’s going to be a no-brainer to get some decent automated tests deployed. These can be triggered from a deployment platform with every deploy. As the data store is part of the stack and you’re following micro-architecture rules, the only records in the stack come from something in the stack. So setting up test data is simply a case of calling the APIs you’re testing, which in turn tests those APIs. You don’t have to test beyond the interface, as it shouldn’t matter (functionally) how the data is stored, only that the stack functions correctly.

Scalable architecture moves quicker to market

Given small, easily managed, scalable stacks of software, adding a new feature is a doddle. Automated tests reduce the manual test overhead. Some features can get into production in a single day, even when they require changes across several systems.

Scalable architecture leads to higher quality software

Given that in a scaling situation you would want to know your new instances are going to function, you need to attain a high standard of quality in what’s built. Fortunately, as it’s easier to test, quicker to deploy, and easier to understand, higher quality is something you get. Writing test-first code becomes second nature, even writing integration tests up front.

Scalable architecture reduces staff turnover

It really does! If you’re building software with the same practices which have been causing headaches and failures for the last several decades, then people aren’t going to want to work for you for very long. Your best people will eventually get frustrated and go elsewhere. You could find yourself in a position where you finally realise you have to change things, but everyone with the knowledge and skills to make the change has left.

Fringe benefits

I guess what I’m trying to point out is that I haven’t ever heard a good reason for not building something which can easily scale. Building for scale helps focus solutions on good architectural practices: decoupled, simple, easily testable micro-architectures. Are there any enterprises where these benefits would be seen as undesirable? Yet, when faced with the decision of either continuing to build the same tightly coupled monoliths which require full weekends (or more!) just to deploy, or building something small, light weight, easily deployed, easily maintained, and ultimately scalable, there are plenty of people claiming “Only in an ideal world!” or “We aren’t that big!”.

Bonkers!