Going Deep Enough with Microservices

Moving from a monolith architecture to microservices is a widely debated process, with many recommendations and nuggets of advice available on the web in blogs like this. There are so many different opinions out there mainly because where an enterprise finds its main complexities lie depends on the skillsets of its technologists, the domain knowledge within the business and the existing code base. During the years I’ve spent as a contractor in a very wide range of enterprises, I’ve seen lots of monolith architectures – all of them causing slightly different headaches because those responsible for developing them let different aspects of the architecture slip. After all, the thing that is often forgotten is that if a monolith is maintained well, then it can work. The reverse is also true – if a microservice architecture is left to evolve on its own, it can cause as many problems as a poorly maintained monolith.

Domains

One popular way to break things down is using Domain Driven Design. Two books which cover most concepts involved in this process are ‘Building Microservices’ by Sam Newman (http://shop.oreilly.com/product/0636920033158.do) and ‘Implementing Domain Driven Design’ by Vaughn Vernon (http://www.amazon.com/Implementing-Domain-Driven-Design-Vaughn-Vernon/dp/0321834577), which largely references ‘Domain Driven Design: Tackling Complexity in the Heart of Software’ by Eric Evans (http://www.amazon.com/Domain-Driven-Design-Tackling-Complexity-Software/dp/0321125215). I recommend Vaughn’s book over Evans’ as the latter is a little dry.

If you take on board even just half the content covered in these books, you’ll be on a reasonable footing to get started. You’ll make mistakes, but as Sam Newman points out (and I’ve seen for myself), that’s inevitable.

Something that seems to be left out of a lot of domain driven discussions is what happens beyond the basic CRUD processes and domain logic in the application layer. Attention sits primarily with the thin interaction between a web interface and the domain processing by the aggregate in question. When dismantling a monolith architecture into microservices, focusing on just the application layer can give the impression of fast progress, but in reality half the picture is missing. It’s likely that in a few months there will be several microservices but, instead of them operating solely in their sub-domains, they’ll still be tied to the database that the original monolith was using.

Context

It’s hugely important to pull the domain data out of the monolith store. This is for the very same reasons we segregate service responsibilities into sub-domains. Data pertaining to a given domain may exist in other domains as well, but changes will not necessarily be subjected to the same domain rules and individual records may have different properties. There may be a User record in several sub-domains, each with a Username property, but the logic around how duplicate Usernames are prevented should sit firmly in a single sub-domain. If a service in a different sub-domain needs to update the username, it should either call a public service from the Profile sub-domain or raise a ‘Username Updated’ event that the Profile sub-domain would handle and process, possibly responding with a ‘Username Update Failed’ event of its own.
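
As a very rough sketch of that second option (the names and the eventing abstraction here are hypothetical stand-ins, not a prescription), the publishing side might look something like this:

using System;

// A hypothetical event raised by a service outside the Profile sub-domain when it wants a
// username changed. The Profile sub-domain owns the rules (e.g. duplicate checks) and can
// respond with a 'Username Update Failed' event of its own.
public class UsernameUpdated
{
    public Guid UserId { get; set; }
    public string NewUsername { get; set; }
    public DateTime OccurredAtUtc { get; set; }
}

// Stand-in for whatever eventing infrastructure is actually in place.
public interface IEventPublisher
{
    void Publish<TEvent>(TEvent @event);
}

public class AccountService
{
    private readonly IEventPublisher _events;

    public AccountService(IEventPublisher events)
    {
        _events = events;
    }

    public void ChangeUsername(Guid userId, string newUsername)
    {
        // No duplicate check here – that logic sits firmly in the Profile sub-domain.
        _events.Publish(new UsernameUpdated
        {
            UserId = userId,
            NewUsername = newUsername,
            OccurredAtUtc = DateTime.UtcNow
        });
    }
}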

This example may be a little contrived – checking for duplicates could be something that’s implemented everywhere it’s needed. But consider what would happen if it became necessary to check for duplicates within another external system every time a Username is updated. That logic could easily be encapsulated behind the call to the Profile service but having to update every service that updates Usernames wouldn’t be good practice.

So if we are now happy that the same data represented in different sub-domains could, at any one time, be different (given the previous two paragraphs), then we shouldn’t store the data for both sub-domains in the same table.

Local Data

In fact, we’re now pretty well removed from needing a classic relational database for storing data that’s local to the sub-domain. We’re dealing with data that is limited in scope and is intended for use solely by the microservices built to sit in that sub-domain. NoSQL databases are ideal for this scenario and no matter which platform you’ve chosen to build on there are excellent options available. One piece of advice I think is pretty sound is that if you are working in the cloud, you’ll usually get the best performance by using the data services provided by your cloud provider. Make sure you do your homework, though – some have idiosyncrasies that can impact performance if you don’t know about them.

So now we have data stored locally to the sub-domain, but this isn’t where the work stops. It’s likely there’s a team of DBAs jumping around wondering why their data warehouse isn’t getting any new data.

The problem is that the relational database backing the monolith wasn’t just acting as a data-store for the application. There were processes feeding other data-stores for things like customer reporting, machine learning platforms and BI warehouses. In fact, anything that requires a historical view of things will be reading it from one or more stores that are loaded incrementally from the monolith’s relational database. Now that data is being stored in a manner best suiting each sub-domain, there isn’t a central source for that data to be pulled from into these downstream stores.

Shift of Responsibility

Try asking a team of DBAs if they fancy writing CLR-based stored procedures to detect changes and pull new records into their warehouse by querying whatever data-store technologies have been decided on in each case – I doubt they’ll be too receptive. The responsibility for getting data out of each local data-store now has to move closer to the application services.

The data guys are interested in recording historical and aggregated records, which is convenient as there is a useful, well-known tool for informing different systems that something has happened – an event.

It’s been argued that using events to communicate across sub-domains is misusing an event stream as a message bus. My argument in this case is that the back-end historical data-store is still within the original sub-domain. The data being stored belongs specifically to that sub-domain and still holds the same context as when it was saved. There has been a transition to a new medium of storage but that’s all.

So we are now free to raise events from our application microservices into event streams, which are then handled by a service specifically designed to transfer data from events into whatever downstream stores were originally being fed from the monolith database. This gives us full extraction from the monolithic architecture and breaks the sub-domain’s dependency on the monolith database.
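
A minimal sketch of that transfer service, reusing the hypothetical UsernameUpdated event from the earlier sketch (the subscription and warehouse abstractions are also just stand-ins for whatever infrastructure you actually use):

using System;

// Stand-ins for the event stream subscription and the downstream store.
public interface IEventStreamSubscription
{
    void Subscribe<TEvent>(Action<TEvent> handler);
}

public interface IWarehouseWriter
{
    void Append(string table, object row);
}

// Consumes the sub-domain's events and loads them into the store that used to be
// fed by incremental loads from the monolith database.
public class UsernameHistoryLoader
{
    private readonly IWarehouseWriter _warehouse;

    public UsernameHistoryLoader(IEventStreamSubscription subscription, IWarehouseWriter warehouse)
    {
        _warehouse = warehouse;
        subscription.Subscribe<UsernameUpdated>(Handle);
    }

    private void Handle(UsernameUpdated @event)
    {
        // Each change is recorded individually, giving more fine-grained history
        // than the original batch loads ever did.
        _warehouse.Append("UsernameHistory", new
        {
            @event.UserId,
            @event.NewUsername,
            ChangedAtUtc = @event.OccurredAtUtc
        });
    }
}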

There is also the possibility that we can now record more fine-grained detail of changes than was being captured previously.

Gaps in the Monolith Database

Of course back-end data-stores aren’t the only consumers of the sub-domain’s data. Most likely there will be other application level queries that used to read the data you’re now saving outside of the monolith database. How you manage these dependencies will depend on whether the read requests are coming from the same sub-domain or another. If they’re from the same sub-domain then it’s equally correct to either pull the data from an event stream or from microservices within that sub-domain. Gradually, a sub-domain’s dependency on the monolith database will die. If the queries are coming from a different sub-domain then it’s better to continue to update the monolith database as a consumer of the data stored locally to the sub-domain – the original table no longer contains data that is relevant to the sub-domain you’re working on, it’s just kept up to date for the other sub-domains that still query it.

Switching

Obviously we don’t want any gaps in the data being sent to our back-end stores, so as we pull functionality into microservices and add new data-stores local to the sub-domain, we also need to build the pipeline for our new back-end processing of domain events into the warehouse. As this gets switched on, the loading processes from the original monolith can be switched off.

External Keys

Very few enterprise systems function in isolation. Most businesses make use of off-the-shelf packages or cloud-based services such as Salesforce. Mapping records into these systems usually means using the primary key of each record to create a reference. If this has happened then the primary key from the monolith is most likely being relied on to hold things together. Moving away from the monolith database means that source of primary key generation has probably been lost.

There are two options here and I’d suggest going with whatever is the easiest – they both have their merits and problems.

  1. Continue to generate unique IDs in the same way as the monolith database did and continue to use these IDs for reference across different systems. Don’t rely on the monolith for ID generation here – create a new process in the microservice that continues the same pattern.
  2. Start generating IDs with a new scheme and copy the new keys out to the external systems for reference. The original keys can eventually be dropped.

Deeper than Expected

When planning the transition from monolithic architecture to microservices, there may well be promises from the management team that time will be given to build each sub-domain out properly. Don’t take this at face value – Product Managers will still have their roadmaps to fulfil and, unfortunately, an end user will only ever see maybe 30% of any given slice of functionality pulled out of a monolith. Expect the process to be difficult no matter what promises are made.

What I really want to get across here is that extracting even a small amount of functionality into microservices carries with it a much deeper dive into the enterprise’s tech stack than just creating a couple of application services. It requires time and focus from more than just the Dev team and, before it can even be started, there has to be an architectural plan spanning the full vertical slice of a sub-domain, from front end to warehoused historical data.

Consequences of Not Going Deep Enough

How difficult do you find it in your organisation to get approval for technical upgrade work, or for dealing with technical debt as a project (which I’m not advocating is a good strategy), or for doing anything which doesn’t have a directly measurable positive impact on new product? In my experience, it isn’t easy and I’m not sure it should be, but that’s for another post.

Imagine you’ve managed to extract maybe 70% of your application layer away from your monolith but you’re still tied to the same data model. Have you achieved what you set out to do? You certainly don’t have loose coupling because everything is tied at the data level. You don’t have domain isolation. You are preventing your data team from getting access to the juicy new events you don’t really need to be raising (because the changed data is already available everywhere). You’ve turned a monolith into an abomination – it isn’t really microservices and it isn’t a classic monolith, it isn’t really any desired pattern at all. Even worse, the work you are missing is pretty big and may not directly carry with it any new features. Will you get agreement to remove coupling with the database as a project itself?

How are your developers doing? How many of them see that the strategy is only going half way? How many are moaning about paying lip service to the architecture? Wasn’t that one of the reasons you started with microservices in the first place?

Can you deploy the microservices without affecting other sub-domains? What if there are schema changes? What if there are schema changes in 2 sub-domains and one needs to be rolled back after release because it wasn’t quite right? Wasn’t this something microservices was supposed to prevent?

How many dodgy hacks or ‘surprises’ are there in your new code where devs have managed to make domain isolated services work with a single relational data model? How many devs waste time hand-wringing when they know they’re building something that is going to be technical debt the moment it goes live?

Ok, so I’m painting a darker picture than you’ll probably feel, but each of these scenarios will almost certainly come up – you just might not get to hear about it.

The crux for me is thinking about the reasons for pursuing a microservice architecture. The flexibility, loose coupling, technology agnosticity (if that’s a real term), the speed of continuous delivery that you’re looking for. Unless you go deeper than the low-hanging fruit of the application layer, you’ll be cheating yourself out of these benefits. Sure, you’ll see improvements short term but you are building something which is already technical debt. No matter what architecture you choose, if you don’t invest in maintaining it properly (or even building it properly in the first place) then it will ultimately become your albatross.

Events vs Commands

In the world of service oriented architectures and CQRS style processes there is a tendency for nearly everything to raise events. Going back a few years, however, before REST became fashionable, many interactions were by RPC and often the result of processing commands from a queue.

So when did commands become an anti-pattern? Well of course, they never did. These days we just have to understand when it’s more appropriate to send a command or raise an event.

Here’s a comparison to help you decide which you should be using:

Events:
  - An event is all about something that has already happened.
  - A service raising an event doesn’t care what happens to it. Something consuming an event is not critical to the service’s function.
  - An event could be consumed by one, many or no consumers.
  - An event can suggest loose coupling between services.
  - A service prevented from raising an event can only report that the event was not raised.

Commands:
  - A command is all about something that the originating service wants to happen (although it might not be successful).
  - A service sending a command needs that command to be processed as part of its functionality.
  - A command is intended for one specific consumer.
  - A command definitely indicates tight coupling – the originating service knows about the command target.
  - A service prevented from sending a command can report the failure to a team with specific domain knowledge about what will happen downstream if the command is not processed. The service may be designed to fail its own process if the command fails.

A really good example of the right use of an event is communicating between services within a bounded context that something has happened. The originating service will have successfully completed its function before raising the event. Consumers of the event do something else in addition that the originating service doesn’t really care about.

A good example of the right use of a command is where two different platforms need to be kept in sync with each other. When data is updated in one system a sync command is sent to update the other. If something stops that command getting sent (e.g. an auth issue between the service and a message queue) then the service can react and alert people to the issue, or it may be that the update in the originating service needs to fail.
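
To make the distinction concrete, here’s a rough sketch of both patterns side by side (the bus abstractions and type names are hypothetical, not taken from any particular library):

using System;

// Hypothetical messaging abstractions.
public interface IEventPublisher { void Publish<T>(T @event); }
public interface ICommandSender { void Send<T>(T command); }

public class OrderPlaced { public Guid OrderId { get; set; } }           // event: it has already happened
public class SyncCustomerToCrm { public Guid CustomerId { get; set; } }  // command: we want it to happen

public class OrderService
{
    private readonly IEventPublisher _events;
    private readonly ICommandSender _commands;

    public OrderService(IEventPublisher events, ICommandSender commands)
    {
        _events = events;
        _commands = commands;
    }

    public void PlaceOrder(Guid orderId, Guid customerId)
    {
        // ... the order is saved successfully first ...

        // Event: fire and forget – this service doesn't care who (if anyone) consumes it.
        _events.Publish(new OrderPlaced { OrderId = orderId });

        // Command: this service needs it to be processed, so a failure to send is a failure here.
        try
        {
            _commands.Send(new SyncCustomerToCrm { CustomerId = customerId });
        }
        catch (Exception ex)
        {
            // React – alert someone, or fail the whole operation, depending on the domain.
            throw new InvalidOperationException("Could not send CRM sync command", ex);
        }
    }
}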

Both events and commands are important in a distributed system. Using them in the right places makes your intent much clearer and helps keep your system structured.

User Secrets in ASP.NET 5

Accidentally pushing credentials to a public repo has never happened to me, but I know a few people for whom it has. AWS have an excellent workaround for this by using credential stores that can be configured via the CLI or IDE but this technique only works for IAM user accounts, it doesn’t allow you to connect to anything outside of the AWS estate.

Welcome to User Secrets in ASP.NET 5 – and they’re pretty cool.

User Secrets are a part of the new ASP.NET configuration mechanism. If you open Visual Studio 2015 and create a new Web API project, for example, you’ll be presented with something somewhat different to previous versions. Configuration is carried out in Startup.cs, where we can conditionally load configuration from one or many sources including .config and .json files, environment variables and the User Secret store. To access User Secrets, you want to modify the constructor like so:

public Startup(IHostingEnvironment env, IApplicationEnvironment appEnv)
{
    var builder = new ConfigurationBuilder(appEnv.ApplicationBasePath)
        .AddJsonFile("config.json")
        .AddUserSecrets()
        .AddEnvironmentVariables();

    Configuration = builder.Build();
}

In this example, the order of calls to AddJsonFile(), AddUserSecrets() and AddEnvironmentVariables() makes a difference. If the property ‘Username’ is defined in config.json and also as a secret, then the value in config.json will be ignored in favour of the secret. Similarly, if there is a ‘Username’ environment variable set, that would win over the other two. Sources added later override those added earlier.

To create a secret, first open a Developer Command Prompt for VS2015. This is all managed via the command line tool ‘user-secret’. To check if you have everything installed, at the prompt, type ‘user-secret -h’.

C:\Program Files (x86)\Microsoft Visual Studio 14.0>user-secret -h

If user-secret isn’t recognised then you may need to install the SecretManager command in the .NET Development Utilities (DNU). Do this by typing ‘dnu command install SecretManager’.

C:\Program Files (x86)\Microsoft Visual Studio 14.0>dnu command install SecretManager

In my case, this was again not recognised, even though I had just completed a full install of every component of Visual Studio 2015 Professional. If this is still not working for you, then you need to update the .NET Version Manager (DNVM). Do this by typing ‘dnvm upgrade’.

C:\Program Files (x86)\Microsoft Visual Studio 14.0>dnvm upgrade

Hopefully, you should get a similar response to this:

C:\Program Files (x86)\Microsoft Visual Studio 14.0>dnvm upgrade
Determining latest version
Downloading dnx-clr-win-x86.1.0.0-beta6 from https://www.nuget.org/api/v2
Installing to C:\Users\Peter\.dnx\runtimes\dnx-clr-win-x86.1.0.0-beta6
Adding C:\Users\Peter\.dnx\runtimes\dnx-clr-win-x86.1.0.0-beta6\bin to process PATH
Adding C:\Users\Peter\.dnx\runtimes\dnx-clr-win-x86.1.0.0-beta6\bin to user PATH
Native image generation (ngen) is skipped. Include -Ngen switch to turn on native image generation to improve application startup time.
Setting alias 'default' to 'dnx-clr-win-x86.1.0.0-beta6'

Now try installing the command. You should see all of your registered NuGet sources being queried for updates and then a whole host of System.* packages being installed. The very end of the response should look something like this:

Installed:
    10 package(s) to C:\Users\Peter\.dnx\bin\packages
    56 package(s) to C:\Users\Peter\.dnx\bin\packages
The following commands were installed: user-secret

Now when you run ‘user-secret -h’ you should get this:

Usage: user-secret [options] [command]

Options:
  -?|-h|--help  Show help information
  -v|--verbose  Verbose output

Commands:
  set     Sets the user secret to the specified value
  help    Show help information
  remove  Removes the specified user secret
  list    Lists all the application secrets
  clear   Deletes all the application secrets

Use "user-secret help [command]" for more information about a command.

You can see five possible commands listed, and getting help on any particular one is also explained. As an example, if you want to set a property ‘Username’ to ‘Guest’ then type this:

C:\Program Files (x86)\Microsoft Visual Studio 14.0>cd MyProjectFolder
C:\MyProjectFolder>user-secret set Username Guest

Where MyProjectFolder is the location of a project.json file.

So there you have it. You’re ready to create secrets that can never be accidentally pushed into a public repo or shared anywhere they shouldn’t be. Just remember that emailing them to the dev sitting next to you might not be much better.

Useful links:

https://github.com/aspnet/Home/wiki/DNX-Secret-Configuration

http://stackoverflow.com/questions/30106225/where-to-find-dnu-command-in-windows

http://typecastexception.com/post/2015/05/17/DNVM-DNX-and-DNU-Understanding-the-ASPNET-5-Runtime-Options.aspx

When Things Just Work

A particularly tricky epic hits the development team’s radar. The Product Manager has been mulling it over for a while, has a few use cases from end users and has scoped things pretty well. He’s had a session with the Product Owner who has since fleshed things out into an initial set of high priority stories from a functional point of view.

The Product Owner spends an hour with the Technical Lead and another Developer going over the epic. It quickly becomes apparent that they’re going to have to build some new infrastructure and a new deployment pipeline so they grab one of the Architects to make sure their plans are in line with the technical roadmap. Some technical stories are generated and some new possibilities crop up. The Product Owner takes these to the Product Manager. Between them they agree what is expected functionally and re-prioritise now they have a clearer picture of what’s being built.

In the next grooming session the wider team hit the epic hard. There’s a lot of discussion and the Architect gets called back in when there are some disagreements. The QA has already spent some time considering how testing will work and she’s added acceptance criteria to all the stories which are reviewed by the devs during the meeting. Each story is scored and only when everyone is happy that there is a test approach, a deployment plan and no unresolved dependencies, does a story get queued for planning.

It takes three sprints to get the epic completed, during which the team focus primarily on that epic. They pair program and work closely with the QA and Product Owner who often seem like walking, talking specifications. There are a few surprises along the way and some work takes longer than expected but everyone’s ok with that. Some work gets completed quicker than expected. The Product Manager has been promoting the new feature so when it finally goes fully live the team get some immediate feedback on what users like and dislike. This triggers a few new stories to make some changes that go out in the next release.

Well, that all sounded really simple. Everyone involved did their bit, no-one expected anything unreasonable, the only real pressure came from the team’s own pride in their work and everyone went home on an evening happy. So why is it that so often this isn’t the case? Why do some teams or even entire companies expend huge amounts of effort but only succeed in increasing everyone’s stress levels, depriving people of a decent night’s sleep and releasing product with more bugs than MI5?

What Worked?

If you’re reading this then hopefully team management is something you’re interested in and you’re probably well aware that there is never just one thing that makes it work. In fact, Roy Osherove described three distinctly different states that a team can find itself in which require very different management styles and expose very different team dynamics.

If you haven’t already done so, take a look at Roy’s blog here: http://5whys.com/ and download his book here: https://leanpub.com/teamleader

Often team members are affected by influences external to the team – for example, the Product Manager is almost certain to be dealing with multiple teams and end users. Their day could have started pretty badly and that will always colour interactions with others – we’re only human. So at any one given moment of interaction, the team could be in one of many states that could either be conducive to a good outcome or that could be leading to a problem.

Let’s try pulling apart this idealistic scenario and see how many opportunities there were for derailment.

Early Attention

In our scenario, the epic isn’t straightforward so the Product Manager has been thinking about what it means to his end users and where it fits into his overall plan. A lesser Product Manager might not have given it any attention yet and instead of having some stories outlined there might not be much more than a one line description of the functionality. Lack of care when defining work speaks to the rest of the team about how little the epic matters. If the Product Manager doesn’t care enough about the work to put effort into defining it properly, then others will often care just as little.

Early Architectural Input

Right at the beginning the Architect is questioned about how they see the work fitting in with the rest of the enterprise. Without talking tech early the team could waste time pursuing an approach which is at best not preferred and possibly just won’t work.

Product Owner and Technical Lead

The Product Owner and the Technical Lead take on the initial task of trying to get to grips with the story together. These two are a perfect balance of product and development. Moving up the seniority tree, the Product Manager and the Architect can often disagree vehemently, but the less senior pair need the relationship of mutual trust they’ve built. Nowhere is there a better meeting of the two disciplines. Lose either of these roles and the team will suffer.

Changing Things

After looking at things from a technical point of view, the Product Owner goes back to the Product Manager to discuss some changes. If the Product Owner isn’t open to this then many opportunities for quick wins will be missed and it becomes much more likely that the team will be opening at least one can of worms that they’d prefer not to.

Grooming Wide

Although strictly speaking all the work that goes into defining a story is ‘grooming’ and doesn’t have to include the whole team, there should be a point (probably in a grooming session) where the majority of the team gets the chance to review things and give their own opinions. If this doesn’t happen then much of the team’s expertise is being wasted. Also, some team members will be specialists in certain areas and may see things that haven’t yet been considered. Lastly (and maybe most importantly) if the team are simply spoon fed a spec and expected to build it, they aren’t being engaged in the work – they are far more likely to care less about what is being built.

Ready for Planning

The team make sure that the stories have test approaches, reams of acceptance criteria, no unresolved dependencies and that everyone believes the stories are clear enough to work on them. This is a prerequisite to allowing the work to be planned into a sprint. Without this gate to progression, work can be started that just can’t be finished. Working on something that can’t yet be delivered is a waste of time.

Questions with Answers

During the sprints, the Product Owner and QA are on hand all the time for direct face to face discussions about the work. “Does this look right?” “Can we reduce the number of columns on the screen?” “Is this loading quick enough?” – developers are at the sharp end and need to have answers right there and then so they can build. Without Product and QA representation available, a simple question like these can stall a story for an hour, a day or maybe too long to complete in the sprint.

And More

This is one small, quite contrived scenario. In real life things are rarely as straightforward but with every interaction and every piece of work within the team and beyond there is the risk for some kind of derailment. To list every scenario would take forever, so how can problems be avoided?

Knowing Their Jobs

Each individual in this team knew where they fitted into the puzzle. Everyone worked together as a team. In fact things need to go a little further than that for the best outcome – team members should know what each other’s jobs entail. Take the Technical Lead role; they should be protecting their team from ill-conceived ideas (even when there seems to have been a lot of effort put into those ideas). It’s part of their job to question whether proceeding with a given piece of work is the best thing to do. When they do raise concerns, these should be listened to and discussed with a level of maturity and mutual respect that befits someone who is fulfilling one of the requirements of their job. Equally, when a QA suggests that a feature seems flawed even though it passes acceptance criteria, their issue should be addressed, even if nothing is ultimately changed. This is part of the QA’s job – raising issues of quality.

I’d like to outline what I see as an excellent balance of team members and responsibilities. I don’t mean to insinuate that this is the only way a team can work – in fact I encourage every team to find their own way. Regular retrospectives where the team look at how they’re performing, what works and what doesn’t, allow the team to form their own processes. This is nearly always preferable to mandating how things should be done as people are more likely to stick to their own ideas.

That having been said, I believe the lines are there to blur and if I were to define roles around those lines, it would be like this:

Product Manager

The Product Manager is responsible for the product road map. They are focussed on what the end users are missing in the product as it stands and they are working to define feature sets that will better meet the end user’s requirements and better any competition. It’s a trade-off between effort, risk, how badly the feature is wanted and when it is technically feasible to implement it.

It is vitally important that their roadmap is fully understood initially by the Product Owner and then by the senior members of the team. Decisions are constantly being made during delivery which benefit from a knowledge of product focus.

The Product Manager is not responsible for writing stories or working with the development team beyond receiving demos of completed or near completed work. They’re not responsible for the velocity of the team or for managing the team in any way, although they do have a vested interest in when things are going to get delivered and will almost certainly require an explanation if things can’t or don’t happen as quickly as they’d like.

Product Owner

The Product Owner can best be thought of as an agile Business Analyst. They are responsible for communicating the vision of the Product Manager to the team. They sit within the team and are answerable to the Team Lead rather than the Product Manager. This might seem odd but the team needs them to put their need to understand the product ahead of anything else. Without good understanding of what is being built, the team cannot hope to build the right thing.

The Product Owner wants their stories to be understood, so they will want the team to be involved in deciding what information is recorded in a story and how it is formatted. Story design will usually evolve over a few sprints from discussions during retrospectives. It is the Product Owner’s sole responsibility to make sure that all functional details are described once and only once and that none are left out. They will often work closely with the team’s QA to make sure all functional aspects have acceptance criteria before the story is presented in a grooming session. At a minimum, before a story is allowed to be planned into a sprint, all functional details must have acceptance criteria to go with them. It is quite acceptable for these acceptance criteria to be considered the specification, rather than there being two different descriptions of the same thing. This reduces duplication, which is desirable because duplication leaves a story open to misunderstanding.

QA

The QA should be focussing on automation as much as possible. The last thing a QA wants to do is spend their time testing – it isn’t a good use of their time. Yes, there are a few things that are still best done by opening the app and having a look and I’m not saying that automation should be solely relied on, but it should be the default approach. Recognising that something can’t be automated should be the exception rather than noticing that something could. Something automated never needs to be tested again until that thing changes. Something manual adds to the existing manual set of tests. Only so much of the latter can go on before the QA has no time to really think about the quality of the product.

The QA should be happy that they understand how they are going to ensure the quality of every single story picked up by the team. They should have recorded acceptance criteria which act as a gateway to ‘done’ for the developers and a reminder for the QA about different aspects of the story.

Sometimes there will be several different levels of criticality to functionality. For example, imagine a piece of work on a back office system which includes modifications to an integration to a customer-facing system. It’s not the end of the world if the back office system needs a further tweak to make things absolutely perfect, but if the integration breaks, then end users may be affected and data may be lost. The integration should definitely be tested with automation – for every field and every defined piece of logic involved. The internal system’s UI could probably be tested manually, depending on the circumstances.

The QA should make sure they spend time with the Product Owner looking at new stories and working quite abstractly to define acceptance criteria around the functionality. During grooming, technical considerations will kick up other acceptance criteria, and a decision around what needs to be automated and what doesn’t can be made by the team – a decision which the QA should make sure is made actively and not just left to a default recipe.

Architect

The Architect is not really a part of a specific team. They will have their own relationship with the Product Manager and will be familiar with the product road map. They will have three main considerations on their mind:

  1. They will be looking at the direction that technology in general is going and will be fighting the rot that long-lived systems suffer from. This isn’t because code breaks, quite the opposite – the problem is that working code could stay in place for years after anyone who had anything to do with writing it has left the company, even if that code is not particularly good. Replacing working systems is a difficult thing for businesses to see a need for, but if the technology stack that is relied on isn’t continuously refreshed then it becomes ‘technical debt’ – something that no-one wants to touch and that is expensive to fix.
  2. They will be making sure that the development teams have enough information about the desired approach to system building. If following a DDD approach, the teams need to know what the chosen strategy is for getting data from one sub-domain to another. They want to know what format they should be raising their domain events in and how they should be versioned. Given new ideas and technologies, they need to know when it’s ok to start using them and, more importantly, when it’s too premature to start using them.
  3. In conjunction with the Product Roadmap, they will be defining the Technical Roadmap. This document takes into consideration what Product are wanting to do and what the technical teams are wanting to do. It’s almost a given that the Technical Roadmap will have to feed back into the Product Roadmap, as it shows what needs to be done to deliver. For this reason, it’s generally a good idea not to consider a Product Roadmap complete until it has been adjusted to accommodate the technical plan.

In the scenario at the start of this post, the Architect was consulted because additional infrastructure was going to be needed. This is something that could happen several times a week and something that the Architect should be prepared for. They need to give definite answers for what should be done right now, not what that decision would be if the question was asked in a month’s time – this leads to premature adoption of new concepts and technologies that aren’t always fully understood or fully agreed on.

Developers

The Developers are at the sharp end. They should have two main focusses, and make no mistake, both are as important as each other:

  1. They need to build the Product.
  2. They need to work out how they can become better developers.

Notice how I haven’t mentioned anything about requirements gathering. This is classically seen as something most developers overlook, but in a development team it is done by the Product Owner, Product Manager and, to a lesser extent, the QA. It’s this support net of walking, talking specifications, combined with well defined stories, that allows a developer to focus on doing what they do best: writing code.

Something that can be ignored by less experienced developers is that the Product is not just the code that makes it up; it has to be delivered. If building a web application, then this will ultimately have to be deployed, most likely into at least one test environment and into production. Maybe this process is automated or is handed on to a ‘DevOps’ team, but it’s still down to the developers to build something that can actually be deployed. For example, if a piece of code that would normally talk to a database is changed so that it now only works with a new version of the database, how do we deploy? If we deploy the database first, then the old code won’t work when it calls it. If we deploy the code first then the call to the old database won’t work. There could be dozens of instances of the code, which could take hours to deploy one by one – it isn’t usually possible to deploy data changes at exactly the same time as deploying code. Basically, the Developers have to keep bringing their heads out of the code to think about how their changes impact other systems.
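
One way to ease that particular chicken-and-egg problem is to write code that tolerates both the old and the new schema for a release or two, so the code and the database don’t have to be deployed in lock-step. A minimal sketch, with made-up column names:

using System;
using System.Data;

public static class CustomerRecordReader
{
    // Reads the email address whether the row still has the old EmailAddress column
    // or the renamed Email column, so this code can be deployed before or after the
    // database migration without breaking.
    public static string ReadEmail(IDataRecord record)
    {
        for (int i = 0; i < record.FieldCount; i++)
        {
            string name = record.GetName(i);
            if (string.Equals(name, "Email", StringComparison.OrdinalIgnoreCase) ||
                string.Equals(name, "EmailAddress", StringComparison.OrdinalIgnoreCase))
            {
                return record.IsDBNull(i) ? null : record.GetString(i);
            }
        }
        throw new InvalidOperationException("Neither 'Email' nor 'EmailAddress' was found in the result set.");
    }
}

Once every instance is running the tolerant code, the database change can go out on its own schedule and the fallback can be removed in a later release.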

Technical Lead

Technical Lead is a role that is very close to my heart as it’s a role I tend to find myself in. I believe that if a development team delivers a product, then a Technical Lead delivers the team. They are technically excellent but their focus is split – they are as much about growing the team as they are about making sure the team are delivering what is expected.

There is a lot of noise about a good Technical Lead being a ‘servant leader’ and for the most I think this is true. Still, there are times when it’s necessary to be direct, make a decision and tell people to do it. When to use each tactic is something only experience teaches but get it wrong and the team will lose focus and cohesion.

It’s quite common to find that a Technical Lead has one or more developers in their team that are better coders than themselves. It’s important for someone in the lead role to realise that this is not a slur on their abilities; the better coder is a resource to be used and a person to grow. They aren’t competition. This often happens when developers don’t want to split their focus away from the technology and onto the people. They become incredible developers and eventually often move into a more architectural role. Because of their seniority, they can also stand in for the Technical Lead when it’s needed. Which can allow a lead to occasionally focus on building something specific (it’s what we all enjoy doing, after all).

Ultimately it’s down to the Technical Lead to recognise what is missing or lacking and make sure that it is made available. They see what the team needs and make sure the team gets it (even if sometimes that’s a kick up the arse).

Blurring the Lines

Not every team has enough people to fill all these roles. Not every company is mature enough to have the staff for each of these roles. So this is where the lines begin to blur.

It isn’t always critical to have each of these roles fulfilled as a full time resource. Each company and team’s specific situation needs to be considered on its own merits. What is critical is understanding why these roles exist. This knowledge allows the same individual to wear different hats at different times. But beware of giving too much power to one person, or of removing the lines of discussion which are what generate great ideas.

So What Actually Makes it Work?

In a highly technical environment I find my own opinion on this very surprising. I don’t think it’s any individual’s technical ability or how many different programming languages they can bring to bear. I don’t think it’s having every detail of every epic prepared completely before development work starts. Primarily, consideration must be given to the structure and dynamics of a team. The above is an excellent starting point although, as I’ve already mentioned, it may be that multiple roles are taken by any one individual. Other than this, I believe a successful team will have these qualities:

Patience, mutual respect and a healthy dose of pragmatism.

If there is one constant in every team I’ve ever worked with or in, it’s that some things will go wrong. Mistakes will happen – no-one is immune (and if they claim otherwise don’t believe them). Making progress forwards doesn’t mean always building the most incredibly perfect solution, it means getting product out the door which end users will be happy with. So if there are mistakes, have the patience to realise that it was inevitable. If you make the mistake, have the respect for the rest of your team to realise that they can help fix things – believe that they aren’t judging you by your mistake. Finally, remember that the perfect solution is not the perfect system – there’s more to delivering software than trying to find all the bugs.

A team with these values will ultimately always outperform a collection of technically excellent individuals.

Code Libraries and Dependencies

NuGet has made it really straightforward to share libraries across multiple applications: just add a nuspec file and run ‘nuget pack’. But before you do that next time, spare a thought for the poor dev who’s trying to fit your library into their project among a dozen others, when any of them may make use of the same 3rd parties you’ve referenced as dependencies.

Breaking Changes

Breaking changes happen. Sometimes intentionally, sometimes not. A breaking change in a popular 3rd party library can be pretty tricky to deal with.

Not so long ago RestSharp made a breaking change. They changed the way a value was set from being a property setter to a function call. The two things are not the same and no amount of assembly redirection will make one version work in place of the other.

My client at the time had a large application with probably about two dozen references to the old version of RestSharp. When people started to use the new version (often without even realising that it was a new version) in their libraries, it took a while before someone hit on the problem of referencing one of these libraries from code that uses the older version.

Chain of Incompatibility

So some app uses two libraries, which both use different and incompatible versions of RestSharp. Ok, so let’s just upgrade the older version to the newer version in the library and everyone’s happy.

Then we find out that it isn’t that library which referenced RestSharp, it was a further library which was referenced. So we open that library and upgrade that. Build a NuGet package, re-reference that in our original library and do the same to get that into our app. Great.

Then, a couple of days later, someone is now having trouble after extending the library we just changed with some new functionality. Because it now has an updated version of RestSharp, it won’t work when it’s updated in another app because again the app references the old version.

And so the dance continues…

Just Avoid It

The best way to deal with this? Just avoid it.

A library should only provide the logic that it relates to. Trying to make a library responsible for everything is a mistake.

3rd party dependencies are also a bad thing. They can change and different versions are not always backwards compatible.

A library is not well encapsulated if it depends on 3rd parties; it means those 3rd parties have to be versioned carefully, and it’s possible for all the developers in your enterprise to decide just to stick with the old version – legacy code in real time, nice.

To avoid it, simply create an interface in your library that defines the contract you need. Then leave it up to the people using your library as to how it should work. You can always provide some sample code in a readme.md or even an additional NuGet package with a preferred implementation. Give the consumer some options.

An example interface for RestSharp functionality could be as simple as:

public interface IRestClient
{
    void SetBaseUrl(string url);
    dynamic Send(string path, string verb, string payload);
}

Although it could be much better.
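
As an illustration of what a consumer might then provide, here’s one possible implementation of that contract using HttpClient rather than RestSharp (purely a sketch – synchronous for brevity and with no error handling, and it assumes Newtonsoft.Json is available):

using System;
using System.Net.Http;
using System.Text;
using Newtonsoft.Json;

// One possible implementation of the library's IRestClient contract.
// The consumer owns this class, so the shared library carries no 3rd party dependency itself.
public class HttpRestClient : IRestClient
{
    private readonly HttpClient _http = new HttpClient();

    public void SetBaseUrl(string url)
    {
        _http.BaseAddress = new Uri(url);
    }

    public dynamic Send(string path, string verb, string payload)
    {
        var request = new HttpRequestMessage(new HttpMethod(verb), path);
        if (payload != null)
        {
            request.Content = new StringContent(payload, Encoding.UTF8, "application/json");
        }

        // Synchronous wait keeps the sample short; a real implementation would be async.
        HttpResponseMessage response = _http.SendAsync(request).Result;
        string body = response.Content.ReadAsStringAsync().Result;
        return JsonConvert.DeserializeObject<dynamic>(body);
    }
}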

A Helpful Circuit Breaker in C#

Introduction

With the increasing popularity of SOA in the guise of ‘microservices’, circuit breakers are now a must-have weapon in any developer’s arsenal. Services are rarely 100% reliable; outages happen, network connections get pulled, memory gets filled, routing tables get corrupted. In an environment where multiple services are each calling multiple other services, the result of an outage in a small, seemingly unimportant service can be a random slow down in response times in your web application that gradually leads to complete server lock up. (If you don’t believe me, read Release It by Michael Nygard from the Pragmatic bookshelf).

The idea of a circuit breaker is to detect that a service is down and fail immediately for subsequent calls in an expected manner that your application can handle gracefully. Then, every so often, the breaker will attempt to close and allow a call to be sent to the troubled service. If that call is successful then the breaker starts allowing calls through; if that call fails then the breaker remains in an open state and continues to fail with an expected exception.

Helpful.CircuitBreaker is a simple implementation that allows a developer to be proactive about the way their code handles failures.

Usage

There are 2 primary ways that the circuit breaker can be used:

  1. Exceptions thrown from the code you wish to break on can trigger the breaker to open.
  2. A returned value from the code you wish to break on can trigger the breaker to open.

Here are some basic examples of each scenario.

In the following example, exceptions thrown from _client.Send(request) will cause the circuit breaker to react based on the injected configuration.

public class MakeProtectedCall
{
    private ICircuitBreaker _breaker;
    private ISomeServiceClient _client;

    public MakeProtectedCall(ICircuitBreaker breaker, ISomeServiceClient client)
    {
        _breaker = breaker;
        _client = client;
    }

    public Response ExecuteCall(Request request)
    {
        Response response = null;
        _breaker.Execute(() => response = _client.Send(request));
        return response;
    }
}

In the following example, exceptions thrown by _client.Send(request) will still trigger the exception handling logic of the breaker, but the lambda applies additional logic to examine the response and trigger the breaker without ever receiving an exception. This is particularly useful when using an HTTP based client that may return failures as error codes and strings instead of thrown exceptions.

public class MakeProtectedCall
{
    private ICircuitBreaker _breaker;
    private ISomeServiceClient _client;

    public MakeProtectedCall(ICircuitBreaker breaker, ISomeServiceClient client)
    {
        _breaker = breaker;
        _client = client;
    }

    public Response ExecuteCall(Request request)
    {
        Response response = null;
        _breaker.Execute(() =>
        {
            response = _client.Send(request);
            return response.Status == "OK" ? ActionResult.Good : ActionResult.Failure;
        });
        return response;
    }
}

Initialising

The scope of a circuit breaker must be considered first. When the breaker opens, subsequent calls will not succeed, but if your breaker is in the scope of an HTTP request then there may not be a subsequent request hitting that open breaker. The next request would hit a newly built, closed breaker.

The following code will initialise a basic circuit breaker which once open will not try to close until 1 minute has passed (60 seconds is the default timeout, so there’s no need to specify it).

CircuitBreakerConfig config = new CircuitBreakerConfig
{
    BreakerId = "Some unique and constant identifier that indicates the running instance and executing process"
};
CircuitBreaker circuitBreaker = new CircuitBreaker(config);

To inject a circuit breaker into class TargetClass using Ninject, try code similar to this:

Bind<ICircuitBreaker>().ToMethod(c => new CircuitBreaker(new CircuitBreakerConfig
{
    BreakerId = string.Format("{0}-{1}-{2}", "Your breaker name", "TargetClass", Environment.MachineName)
})).WhenInjectedInto(typeof(TargetClass)).InSingletonScope();

The above code will reuse the same breaker for all instances of the given class, so the breaker reports a consistent state across different threads. When opened by one usage, all instances of TargetClass will have an open breaker.

Tracking Circuit Breaker State

The suggested method for tracking the state of the circuit breaker is to handle the breaker events. These are defined on the CircuitBreaker class as:

/// <summary>
/// Raised when the circuit breaker enters the closed state
/// </summary>
public event EventHandler ClosedCircuitBreaker;

/// <summary>
/// Raised when the circuit breaker enters the opened state
/// </summary>
public event EventHandler OpenedCircuitBreaker;

/// <summary>
/// Raised when trying to close the circuit breaker
/// </summary>
public event EventHandler TryingToCloseCircuitBreaker;

/// <summary>
/// Raised when the breaker tries to open but remains closed due to tolerance
/// </summary>
public event EventHandler ToleratedOpenCircuitBreaker;

/// <summary>
/// Raised when the circuit breaker is disposed
/// </summary>
public event EventHandler UnregisterCircuitBreaker;

/// <summary>
/// Raised when a circuit breaker is first used
/// </summary>
public event EventHandler RegisterCircuitBreaker;

Attach handlers to these events to send information about the event to a logging or monitoring system. In this way, sending state to Zabbix or logging to log4net is trivial.
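
For example, wiring a couple of those events up to log4net could look something like this (the logger name and messages are purely illustrative, and the library namespace is assumed):

using log4net;
using Helpful.CircuitBreaker;

public static class BreakerLogging
{
    private static readonly ILog Log = LogManager.GetLogger("CircuitBreakers");

    // Attach logging handlers so breaker state changes show up in the monitoring pipeline.
    public static void Attach(CircuitBreaker breaker)
    {
        breaker.OpenedCircuitBreaker += (sender, args) =>
            Log.Warn("Circuit breaker opened");

        breaker.ClosedCircuitBreaker += (sender, args) =>
            Log.Info("Circuit breaker closed");

        breaker.ToleratedOpenCircuitBreaker += (sender, args) =>
            Log.Debug("Circuit breaker tolerated a failure without opening");
    }
}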

Configuration Options

Make sure each circuit breaker has its own configuration injected using the CircuitBreakerConfig class.

using System;
using System.Collections.Generic;
using Helpful.CircuitBreaker.Events;

namespace Helpful.CircuitBreaker.Config
{
    /// <summary>
    /// Configuration options for a circuit breaker instance.
    /// </summary>
    [Serializable]
    public class CircuitBreakerConfig : ICircuitBreakerDefinition
    {
        /// <summary>
        /// Initializes a new instance of the <see cref="CircuitBreakerConfig"/> class.
        /// </summary>
        public CircuitBreakerConfig()
        {
            ExpectedExceptionList = new List<Type>();
            ExpectedExceptionListType = ExceptionListType.None;
            PermittedExceptionPassThrough = PermittedExceptionBehaviour.PassThrough;
            BreakerOpenPeriods = new[] { TimeSpan.FromSeconds(60) };
        }

        /// <summary>
        /// The number of times an exception can occur before the circuit breaker is opened
        /// </summary>
        /// <value>
        /// The open event tolerance.
        /// </value>
        public short OpenEventTolerance { get; set; }

        /// <summary>
        /// Gets or sets the list of periods the breaker should be kept open.
        /// The last value will be what is repeated until the breaker is successfully closed.
        /// If not set, a default of 60 seconds will be used for all breaker open periods.
        /// </summary>
        /// <value>
        /// The array of timespans representing the breaker open periods.
        /// </value>
        public TimeSpan[] BreakerOpenPeriods { get; set; }

        /// <summary>
        /// Gets or sets the expected type of the exception list. <see cref="ExceptionListType"/>
        /// </summary>
        /// <value>
        /// The expected type of the exception list.
        /// </value>
        public ExceptionListType ExpectedExceptionListType { get; set; }

        /// <summary>
        /// Gets or sets the expected exception list.
        /// </summary>
        /// <value>
        /// The expected exception list.
        /// </value>
        public List<Type> ExpectedExceptionList { get; set; }

        /// <summary>
        /// Gets or sets the timeout.
        /// </summary>
        /// <value>
        /// The timeout.
        /// </value>
        public TimeSpan Timeout { get; set; }

        /// <summary>
        /// Gets or sets a value indicating whether [use timeout].
        /// </summary>
        /// <value>
        ///   <c>true</c> if [use timeout]; otherwise, <c>false</c>.
        /// </value>
        public bool UseTimeout { get; set; }

        /// <summary>
        /// Gets or sets the breaker identifier.
        /// </summary>
        /// <value>
        /// The breaker identifier.
        /// </value>
        public string BreakerId { get; set; }

        /// <summary>
        /// Sets the behaviour for passing through exceptions that won't open the breaker
        /// </summary>
        public PermittedExceptionBehaviour PermittedExceptionPassThrough { get; set; }
    }
}

Conclusion

This library has helped me build resilient microservices that have remained stable when half the internet has been falling over. I hope it can help you as well.

Building a Resilient Bidirectional Integration with Salesforce


18 months ago I started building an integration between my client’s existing systems and Salesforce. Up until that point I had no exposure to Salesforce so my client also brought in a consultancy for whom it was a speciality. Between us we came up with a strategy where we would expose a collection of REST services for code within Salesforce to interface with while calls in the opposite direction would use the standard Salesforce REST API. In a room where 50% of us had never worked with Salesforce before, this seemed like a reasonable approach but it turns out we were all being a bit naive.

Some of the Pitfalls

Outbound Messaging

Salesforce has a predetermined method of outgoing sync calls which is pretty inflexible. On every save of any given entity, a SOAP message can be sent to a specified HTTP endpoint with a representation of the changed entity. We did originally try using this but hit a few problems pretty quickly. One big problem was that after we managed to get it working, we came in the next morning to find it broken. After a lot of debugging we found that the message had changed format very slightly, which our Salesforce consultants explained could happen at any time as Salesforce release updates. As my client had a release cycle of once every two weeks, we all agreed the risk of the integration breaking for that length of time was unacceptable, so we decided that on each save, Salesforce would just send us an entity type and id, then we would use the API to retrieve the new data.
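
The receiving side of that arrangement doesn’t need to be much more than an endpoint that accepts an entity type and an id, then calls back into the Salesforce REST API. A rough sketch using ASP.NET Web API (the controller shape, route binding and auth handling are simplified and hypothetical; only the /services/data/... resource path follows the documented Salesforce pattern):

using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Http;

public class SalesforceNotificationsController : ApiController
{
    private readonly HttpClient _salesforce;

    public SalesforceNotificationsController(HttpClient salesforceClient)
    {
        // Assumes BaseAddress is the Salesforce instance URL and an OAuth access
        // token has already been attached to the client elsewhere.
        _salesforce = salesforceClient;
    }

    [HttpPost]
    public async Task<IHttpActionResult> Post(string entityType, string id)
    {
        // Salesforce only tells us what changed; we go back to the API for the data
        // so we aren't coupled to the outbound message format.
        var response = await _salesforce.GetAsync(
            string.Format("/services/data/v34.0/sobjects/{0}/{1}", entityType, id));
        response.EnsureSuccessStatusCode();

        string record = await response.Content.ReadAsStringAsync();
        // ... map and save the record into the local systems here ...

        return Ok();
    }
}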

Race Conditions

This pattern worked well until we hit production servers where we suddenly found that at certain times of day, the request to the Salesforce API would result in a dirty read. Right away the problem looked like a race condition and when we looked further into how Salesforce saves records, we realised how it could happen. Here’s a list of steps that Salesforce takes to save a record (taken from the Salesforce online documentation):

1. Loads the original record from the database or initializes the record for an upsert statement.

2. Loads the new record field values from the request and overwrites the old values.

   If the request came from a standard UI edit page, Salesforce runs system validation to check the record for:
     - Compliance with layout-specific rules
     - Required values at the layout level and field-definition level
     - Valid field formats
     - Maximum field length
   Salesforce doesn’t perform system validation in this step when the request comes from other sources, such as an Apex application or a SOAP API call.
   Salesforce runs user-defined validation rules if multiline items were created, such as quote line items and opportunity line items.

3. Executes all before triggers.

4. Runs most system validation steps again, such as verifying that all required fields have a non-null value, and runs any user-defined validation rules. The only system validation that Salesforce doesn’t run a second time (when the request comes from a standard UI edit page) is the enforcement of layout-specific rules.

5. Executes duplicate rules. If the duplicate rule identifies the record as a duplicate and uses the block action, the record is not saved and no further steps, such as after triggers and workflow rules, are taken.

6. Saves the record to the database, but doesn’t commit yet.

7. Executes all after triggers.

8. Executes assignment rules.

9. Executes auto-response rules.

10. Executes workflow rules.

11. If there are workflow field updates, updates the record again.

12. If workflow field updates introduced new duplicate field values, executes duplicate rules again.

13. If the record was updated with workflow field updates, fires before update triggers and after update triggers one more time (and only one more time), in addition to standard validations. Custom validation rules are not run again.

14. Executes processes.

If there are workflow flow triggers, executes the flows.

Flow trigger workflow actions, formerly available in a pilot program, have been superseded by the Process Builder. Organizations that are using flow trigger workflow actions may continue to create and edit them, but flow trigger workflow actions aren’t available for new organizations. For information on enabling the Process Builder in your organization, contact Salesforce.

15. Executes escalation rules.

16. Executes entitlement rules.

17. If the record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the parent record. Parent record goes through save procedure.

18. If the parent record is updated, and a grandparent record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the grandparent record. Grandparent record goes through save procedure.

19. Executes Criteria Based Sharing evaluation.

20. Commits all DML operations to the database.

21. Executes post-commit logic, such as sending email.

Our entity id was being sent from an ‘after trigger’, which runs at step 7, but data isn’t committed to the database until step 20. Discovering this led us down the path of sending the entire record in the trigger, getting around the need to wait for a committed save. Even this isn’t ideal though, as a save could be rolled back after the trigger is executed, leaving our systems out of sync. The general consensus was that this was a reasonably small risk with limited impact on the business.

Unexpected Changes from Superusers

For the business, one of the big selling points of Salesforce is that it empowers users, allowing them to create workflows, install plugins, add validations, change fields, and so on. To the business this sounds fantastic – no more waiting around for technical teams to come up with a solution. The drawback is that every change that goes in without the technical team being aware of it has the potential to break everything. It took a few attempts before we managed to rein everyone in and get them to try their changes in our development and QA orgs before deploying to production. Until then, things would just suddenly stop working: exceptions would start getting thrown and data would fail to synchronise.

Quick to Diagnose Problems

I think one of the nastiest restrictions we had was being tied to the two-week release cycle, which would often break when some piece of code written by one of the other two dozen developers in the company did something unexpected and forced us to roll back the release. The next release might then slip to 3 or 4 weeks as a result. When the integration developed a problem in production that wasn’t seen anywhere else, we had to get some tracing in place, or tweak the logging levels of existing tracing, to get enough detail. That’s something you want to do that day, not 3 weeks down the line. In an environment where breaking changes can come from the platform itself, it’s really important to be able to get in and see what’s going on right away.

The Key Requirements of the Correct Solution

Ok, so we can probably agree that we didn’t get our solution right. The idea was conceived without really understanding how Salesforce worked and this bit us over and over again as we reacted to architectural problems with pretty large changes in direction. If I could go back and sit in on that first meeting where we conceived our monster, I would interject with the following requirements:

  1. The solution must not be tied to the two weekly deployment cycle of the main project.
  2. It should be easy and quick to change.
  3. All data passed in both directions should be logged for debugging purposes and to allow replay in the case of major outage.
  4. The solution shouldn’t use Salesforce triggers.
  5. The solution should include a space for integration specific business logic that is aware of both Salesforce and the main system (removing all leakage of concepts in either direction).
  6. It should provide its own health analysis to allow monitoring.
  7. Health issues and major errors should trigger notifications.
  8. It should be scalable independently of either Salesforce or the existing systems.

The Solution

Overview

My revised solution is to build a piece of middleware architected as microservices working with Amazon’s Simple Queue Service (SQS) and a Relational Database Service (RDS) instance. Figure 1 is a conceptual diagram giving an overall view of what I mean. I’ve left out logging and notifications for brevity.

Figure 1

The Flow

The flow of data is pretty much symmetrical in processing order, so starting from either end with a payload of data to be synchronised (a rough sketch of a queue processor follows the list):

  1. The payload is dropped into an SQS queue in AWS.
  2. A queue processor picks up the message within a few seconds.
  3. The full payload is logged to the Sync DB’s history (which may have an automatic expiration configured).
  4. The processor checks in the Sync DB for an existing mapping for the entity represented by the payload.
  5. If a mapping is found, then an update payload is sent to the target system.
  6. If a mapping is not found, then a create payload is sent to the target system.
  7. Whether updating or creating, the payload is also recorded in the Sync DB’s history.
  8. A response is received back from the target system, the result of which is recorded into the Sync DB’s history along with updates to the mapping record.
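To make the processor’s role concrete, here’s a rough sketch of what a single polling pass might look like using the AWS SDK for .NET. The ISyncStore and ITargetSystemClient abstractions, and the payload handling generally, are assumptions standing in for the real Sync DB and target-system code rather than anything from the actual project.

using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class SyncQueueProcessor
{
    private readonly IAmazonSQS _sqs;
    private readonly string _queueUrl;
    private readonly ISyncStore _syncDb;          // hypothetical Sync DB access (mappings + history)
    private readonly ITargetSystemClient _target; // hypothetical client for the target system

    public SyncQueueProcessor(IAmazonSQS sqs, string queueUrl, ISyncStore syncDb, ITargetSystemClient target)
    {
        _sqs = sqs;
        _queueUrl = queueUrl;
        _syncDb = syncDb;
        _target = target;
    }

    public async Task PollOnceAsync()
    {
        // Long poll so we aren't hammering SQS with empty receives.
        var response = await _sqs.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl = _queueUrl,
            MaxNumberOfMessages = 10,
            WaitTimeSeconds = 20
        });

        foreach (var message in response.Messages)
        {
            // 1. Log the full payload to the Sync DB's history.
            await _syncDb.LogHistoryAsync(message.Body);

            // 2. Look for an existing mapping for the entity in this payload.
            var mapping = await _syncDb.FindMappingAsync(message.Body);

            // 3. Update the target system if mapped, otherwise create.
            var result = mapping != null
                ? await _target.UpdateAsync(mapping, message.Body)
                : await _target.CreateAsync(message.Body);

            // 4. Record the outcome and refresh the mapping record.
            await _syncDb.RecordResultAsync(message.Body, result);

            // 5. Only remove the message once processing has succeeded.
            await _sqs.DeleteMessageAsync(_queueUrl, message.ReceiptHandle);
        }
    }
}

// Hypothetical abstractions for the Sync DB and the target system.
public interface ISyncStore
{
    Task LogHistoryAsync(string payload);
    Task<EntityMapping> FindMappingAsync(string payload);
    Task RecordResultAsync(string payload, SyncResult result);
}

public interface ITargetSystemClient
{
    Task<SyncResult> UpdateAsync(EntityMapping mapping, string payload);
    Task<SyncResult> CreateAsync(string payload);
}

public class EntityMapping { /* source/target ids, last sync time, etc. */ }
public class SyncResult { /* outcome details */ }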

Scalability

Scaling of SQS can be achieved by horizontal scaling and batching, and the two strategies can be used together. Batching may be difficult to achieve from the Salesforce side, as I would recommend sticking to their standard outbound messaging system, which means a further service may be needed to transpose these payloads into the queue. Horizontal scaling should be completely transparent to all systems, allowing throughput of several thousand messages per second if taken to its limit.
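As a small illustration of the batching side, the helper below groups payloads into batches of up to 10 (the SQS per-request limit) before sending. The transposing service mentioned above could do something along these lines; the names here are illustrative only.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class QueueSender
{
    // Send payloads in batches of up to 10 rather than one request per message.
    public static async Task SendBatchedAsync(IAmazonSQS sqs, string queueUrl, IEnumerable<string> payloads)
    {
        var chunks = payloads
            .Select((payload, index) => new { payload, index })
            .GroupBy(x => x.index / 10);

        foreach (var chunk in chunks)
        {
            // Each entry needs an id that is unique within the batch.
            var entries = chunk
                .Select(x => new SendMessageBatchRequestEntry(x.index.ToString(), x.payload))
                .ToList();

            await sqs.SendMessageBatchAsync(new SendMessageBatchRequest(queueUrl, entries));
        }
    }
}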

The queue processors would be deployed to EC2 instances, each with its own auto-scaling group and a scaling policy driven by CloudWatch alarms on queue size. Even though the number of consumers for each queue would increase, SQS hides messages that are mid-processing (the visibility timeout), so other consumers don’t pick up a message that’s already being handled (although in our scenario, if that did happen, it wouldn’t be likely to cause any problems).

The Sync DB would require some tuning, and only running this architecture for real would give an idea of what size of instance to use (or indeed whether multiple instances were required). The choice of RDS over DynamoDB is specifically for scalability reasons – DynamoDB is fantastic for lightweight requirements, but it doesn’t handle bursts of traffic well at all and needs to be carefully configured to avoid read or write failures when under stress.

Resilience

In this scenario, resilience is an interesting topic: if we store up payloads during an outage and re-run them afterwards, we may well overwrite data that was added at the destination while the outage was ongoing. It may be that the data is so sensitive and critical that every write process has to check the last-updated timestamp of the target record to decide whether to allow the write. The collision-handling logic that follows from that would add complexity to the system, though, and in my client’s case it was voted not worth worrying about.
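If that timestamp check were wanted, a minimal sketch might look like the following; all of the types here are hypothetical and real collision handling would be more involved.

using System;
using System.Threading.Tasks;

public class ConflictAwareWriter
{
    private readonly ITargetRecordStore _store; // hypothetical access to the destination system

    public ConflictAwareWriter(ITargetRecordStore store)
    {
        _store = store;
    }

    // Returns false when the replayed payload is older than the target record,
    // signalling that collision handling (or a manual decision) is needed.
    public async Task<bool> TryWriteAsync(string recordId, string payload, DateTime payloadLastModifiedUtc)
    {
        var targetLastModifiedUtc = await _store.GetLastModifiedUtcAsync(recordId);

        if (targetLastModifiedUtc > payloadLastModifiedUtc)
        {
            // The destination has moved on since this payload was captured;
            // writing it would overwrite newer data.
            return false;
        }

        await _store.WriteAsync(recordId, payload);
        return true;
    }
}

public interface ITargetRecordStore
{
    Task<DateTime> GetLastModifiedUtcAsync(string recordId);
    Task WriteAsync(string recordId, string payload);
}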

This architecture is of course a distributed design, so some protection has to be put in place to prevent failures cascading through to other parts of the system. All calls across application boundaries should be made via circuit breakers. This is a fantastic pattern that prevents callers from flooding a service with more requests when it’s obviously already having problems. It also forces the developer to consider what action to take when their call fails with a CircuitBreakerOpenException. When these exceptions occur, events can be logged, monitoring systems (such as Zabbix) can be called, processing can be temporarily suspended, messages can be written to a dead letter queue, or any combination of the above and more – the precise strategy for each call depends on the balance between the need for resilience and the expense of delivering it. An excellent implementation of a circuit breaker is Helpful.CircuitBreaker, which is very lightweight and easy to use. It’s also available on NuGet.
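As a sketch of the dead-letter-queue option, the snippet below wraps a call to the target system in a breaker and parks the payload when the breaker is open. The ICircuitBreaker interface and the Execute call shape are stand-ins so the example compiles on its own; they are not necessarily Helpful.CircuitBreaker’s exact API, and CircuitBreakerOpenException is declared here only for the same reason.

using System;
using System.Threading.Tasks;
using Amazon.SQS;

public class GuardedTargetCall
{
    private readonly ICircuitBreaker _breaker;
    private readonly IAmazonSQS _sqs;
    private readonly string _deadLetterQueueUrl;

    public GuardedTargetCall(ICircuitBreaker breaker, IAmazonSQS sqs, string deadLetterQueueUrl)
    {
        _breaker = breaker;
        _sqs = sqs;
        _deadLetterQueueUrl = deadLetterQueueUrl;
    }

    public async Task SendToTargetAsync(Func<Task> callTargetSystem, string payload)
    {
        try
        {
            // Every call across the application boundary goes through the breaker.
            await _breaker.Execute(callTargetSystem);
        }
        catch (CircuitBreakerOpenException)
        {
            // The target is already struggling, so park the payload for later
            // replay rather than adding to the load.
            await _sqs.SendMessageAsync(_deadLetterQueueUrl, payload);
        }
    }
}

// Stand-in declarations so this sketch compiles on its own; in reality these
// come from the circuit breaker library in use.
public interface ICircuitBreaker
{
    Task Execute(Func<Task> action);
}

public class CircuitBreakerOpenException : Exception { }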

From experience with Salesforce, the one thing that is guaranteed is a breaking change coming from a source you have no control over. This architecture helps you deal with this in two ways. Firstly, the logging of every payload allows you to see what’s changed straight away. Secondly, because this is hosted middleware in AWS it’s a cinch to fix and redeploy. This is one of the widely celebrated features of a microservice philosophy.

Business Logic

As much as possible, each ‘piece’ of business logic should sit on one side of the integration or the other – preferably the side where it was triggered. In reality there are often knock-on effects from changes on either side that need to be cascaded across the application boundary, and it can become difficult to decide exactly if and how the logic should be split. Whatever the split, one solution for triggering the remote logic is for entities to fall into a state where they are ‘pending’ some action that needs to be carried out on the opposite side of the integration. A flag for this is added to the payload to trigger the logic. The question is: should the consumption of the pending flag occur in the target system or in the queue processor?

One benefit of leveraging the queue processor is that no concept of the integration is leaked to the target system. The queue processor can make sure that the correct processes are triggered in the target system before placing a message on the queue in the opposite direction to update the originating system from a pending status.

When hitting this problem for the first time, splitting this business logic out of the queue processor into another service (again deployed to an EC2 instance) maintains a good separation of concerns, and it’s the implementation I would suggest.
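The sketch below shows one way such a handler might look: it consumes a pending flag, triggers the remote process in the target system, and then queues a message back so the originating system can clear the flag. The payload shape, flag name and service interfaces are all assumptions made for illustration.

using System.Threading.Tasks;

public class PendingActionHandler
{
    private readonly IRemoteProcessClient _target; // hypothetical client for the target system
    private readonly IQueueWriter _returnQueue;    // queue flowing back to the originating system

    public PendingActionHandler(IRemoteProcessClient target, IQueueWriter returnQueue)
    {
        _target = target;
        _returnQueue = returnQueue;
    }

    public async Task HandleAsync(SyncPayload payload)
    {
        // Nothing to do unless the entity is flagged as pending a remote action.
        if (!payload.PendingRemoteAction)
        {
            return;
        }

        // Trigger the remote business process in the target system first...
        await _target.TriggerRemoteProcessAsync(payload.EntityId);

        // ...then queue a message back so the originating system can clear the
        // entity's pending status without ever knowing about the integration.
        await _returnQueue.SendAsync(new SyncPayload
        {
            EntityId = payload.EntityId,
            PendingRemoteAction = false
        });
    }
}

// Hypothetical payload and abstractions used only for this illustration.
public class SyncPayload
{
    public string EntityId { get; set; }
    public bool PendingRemoteAction { get; set; }
}

public interface IRemoteProcessClient
{
    Task TriggerRemoteProcessAsync(string entityId);
}

public interface IQueueWriter
{
    Task SendAsync(SyncPayload payload);
}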

Wrapping Up

With the benefit of hindsight, it seems obvious that the integration strategy we first picked would never work well. We failed in a lot of places to identify the finer points of how integrations with Salesforce should work, and maybe a little too much blind trust was placed in ‘the expert 3rd party’.

That having been said, the result of these mistakes is an architecture that could easily be applied to any other integration. I’m sure some would view it as over-engineering but I think that’s only valid if you know both systems intimately and are happy that every breaking change is something you’ll be doing yourself. Even then, this approach maintains a good separation of concerns and allows you to decouple your domain concepts.