Building a Resilient Bidirectional Integration with Salesforce

blockquote {font-size: 12px;}

18 months ago I started building an integration between my client’s existing systems and Salesforce. Up until that point I had no exposure to Salesforce so my client also brought in a consultancy for whom it was a speciality. Between us we came up with a strategy where we would expose a collection of REST services for code within Salesforce to interface with while calls in the opposite direction would use the standard Salesforce REST API. In a room where 50% of us had never worked with Salesforce before, this seemed like a reasonable approach but it turns out we were all being a bit naive.

Some of the Pitfalls

Outbound Messaging

Salesforce has a predetermined method of outgoing sync calls which is pretty inflexible. On every save of any given entity, a SOAP message can be sent to a specified http endpoint with a representation of the changed entity. We did originally try using this but hit on a few problems pretty quickly. One big problem was that after we managed to get it working, we came in the next morning to find it broken. After a lot of debugging we found that the message had changed format very slightly, which our Salesforce consultants explained could happen at any time as Salesforce release updates. As my client had a release cycle of once very two weeks, we all agreed the risk of the integration breaking for that length of time was unacceptable, so we decided that on each save, Salesforce would just send us an entity type and id, then we would use the API to retrieve the new data.

Race Conditions

This pattern worked well until we hit production servers where we suddenly found that at certain times of day, the request to the Salesforce API would result in a dirty read. Right away the problem looked like a race condition and when we looked further into how Salesforce saves records, we realised how it could happen. Here’s a list of steps that Salesforce takes to save a record (taken from the Salesforce online documentation):

1. Loads the original record from the database or initializes the record for an upsert statement.

2. Loads the new record field values from the request and overwrites the old values.

If the request came from a standard UI edit page, Salesforce runs system validation to check the record for:

Compliance with layout-specific rules

Required values at the layout level and field-definition level

Valid field formats

Maximum field length

Salesforce doesn’t perform system validation in this step when the request comes from other sources, such as an Apexapplication or a SOAP API call.

Salesforce runs user-defined validation rules if multiline items were created, such as quote line items and opportunity line items.

3. Executes all before triggers.

4. Runs most system validation steps again, such as verifying that all required fields have a non-null value, and runs any user-defined validation rules. The only system validation that Salesforce doesn’t run a second time (when the request comes from a standard UI edit page) is the enforcement of layout-specific rules.

5. Executes duplicate rules. If the duplicate rule identifies the record as a duplicate and uses the block action, the record is not saved and no further steps, such as after triggers and workflow rules, are taken.

6. Saves the record to the database, but doesn’t commit yet.

7. Executes all after triggers.

8. Executes assignment rules.

9. Executes auto-response rules.

10. Executes workflow rules.

11. If there are workflow field updates, updates the record again.

12. If workflow field updates introduced new duplicate field values, executes duplicate rules again.

13. If the record was updated with workflow field updates, fires before update triggers and after update triggers one more time (and only one more time), in addition to standard validations. Custom validation rules are not run again.

14. Executes processes.

If there are workflow flow triggers, executes the flows.

Flow trigger workflow actions, formerly available in a pilot program, have been superseded by the Process Builder. Organizations that are using flow trigger workflow actions may continue to create and edit them, but flow trigger workflow actions aren’t available for new organizations. For information on enabling the Process Builder in your organization, contact Salesforce.

15. Executes escalation rules.

16. Executes entitlement rules.

17. If the record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the parent record. Parent record goes through save procedure.

18. If the parent record is updated, and a grandparent record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the grandparent record. Grandparent record goes through save procedure.

19. Executes Criteria Based Sharing evaluation.

20. Commits all DML operations to the database.

21. Executes post-commit logic, such as sending email.

Our entity id was being sent from an ‘after trigger’ which was getting run at step 7, data isn’t committed to the database until step 20. Discovering this led us to the path of sending the entire record in the trigger, getting round the need to wait for a committed save. Even this isn’t ideal though, as a save could be rolled back after the trigger is executed, leaving our systems out of sync. The general consensus was that this is a reasonably small risk with limited impact to the business.

Unexpected Changes from Superusers

For the business, one of the big selling points of Salesforce is that it empowers users, allowing them to create workflows, install plugins, add validations, change fields, and so on. To the business this sounds fantastic – none of all the waiting around for technical teams to come up with a solution. The drawback is that every time a change goes in that the technical team aren’t aware of, it has the potential to break everything. It took a few attempts before we managed to reign everyone into cooperating with the technical team and getting them to try their changes in our development and QA orgs before deploying to production. Until then, things would just suddenly stop working. Exceptions would start getting thrown and data would fail to synchronise.

Quick to Diagnose Problems

I think one of the nastiest restrictions we had was being tied to the two-week release cycle. A release cycle that would often break when some piece of code written by one of the other two dozen developers in the company would do something unexpected and require us to roll back the release. The next release may be delayed to 3 or 4 weeks as a result. When the integration develops a problem in production that isn’t seen anywhere else, we have to get some tracing in place, or tweak the logging levels of existing tracing to get enough detail. This is something you want to do that day, not 3 weeks down the line. In an environment where breaking changes can come from the platform itself, it’s really important to be able to get in and see what’s going on right away.

The Key Requirements of the Correct Solution

Ok, so we can probably agree that we didn’t get our solution right. The idea was conceived without really understanding how Salesforce worked and this bit us over and over again as we reacted to architectural problems with pretty large changes in direction. If I could go back and sit in on that first meeting where we conceived our monster, I would interject with the following requirements:

The solution must not be tied to the two weekly deployment cycle of the main project.
It should be easy and quick to change.
All data passed in both directions should be logged for debugging purposes and to allow replay in the case of major outage.
The solution shouldn’t use Salesforce triggers.
The solution should include a space for integration specific business logic that is aware of both Salesforce and the main system (removing all leakage of concepts in either direction).
It should provide its own health analysis to allow monitoring.
Health issues and major errors should trigger notifications
It should be scalable independently of either Salesforce or the existing systems.

The Solution

Overview

My revised solution is to build a piece of middleware architected as microservices working with Amazon’s Simple Queuing Service (SQS) and a Relational Database Service (RDS) instance. Figure 1 is a conceptual diagram giving an overall view of what I mean. I’ve left out logging and notifications for brevity.

FIGURE 1

The Flow

The flow of data is pretty much symmetrical in processing order, so starting from either end with a payload of data to be synchronised:

The payload is dropped into an SQS queue in AWS.
A queue processor picks up the message within a few seconds.
The full payload is logged to the Sync DB’s history (which may have an automatic expiration configured)
The processor checks in the Sync DB for an existing mapping for the entity represented by the payload.
If a mapping is found, then an update payload is sent to the target system.
If a mapping is not found, then a create payload is sent to the target system.
Whether updating or creating, the payload is also recorded in the Sync DB’s history.
A response is received back from the target system, the result of which is recorded into the Sync DB’s history along with updates to the mapping record.

Scalability

Scaling of SQS can be achieved by horizontal scaling and batching. Both strategies can be used in conjunction. Batching may be difficult to achieve from the Salesforce side as I would recommend sticking to their standard outbound messaging system which means a further service may be needed to transpose these payloads into the queue. Horizontal scaling should be completely transparent to all systems allowing throughput of several thousands of messages per second, if taken to its limit.

The queue processors would be deployed to EC2 instances and each would have its own auto-scaling group. An auto scaling policy would be needed for each to scale based on CloudWatch alarms triggered by queue size. Even though the number of consumers for each queue would increase, Amazon hide messages that are ‘mid processing’ so other consumers don’t pick up a message that’s already being handled (although in our scenario, if that did happen, it wouldn’t be likely to cause any problems).

The Sync DB would require some tuning and only running this architecture would really give an idea of what size of instance to use (or indeed whether multiple instances were required). The choice of RDS over dynamoDB is specifically for scalability reasons – dynamoDb is fantastic for light weight requirements but it doesn’t handle bursts of traffic well at all and needs to be carefully configured to avoid read or write failures when under stress.

Resilience

In this scenario, resilience is an interesting topic as if during an outage, we store up payloads and re-run them, we may well be overwriting data that has been added during the outage at the destination. It may be that the data is so sensitive and critical that every write process would have to check the last updated timestamp of the target record to see whether to allow the write. Subsequent collision handling logic would add complexity to the system, though and in my client’s case was voted not worth worrying about.

This architecture is of course a distributed design, so some protection has to be put in place to prevent failures cascading through to other parts of the system. All calls across application boundaries should be made via circuit breakers. This is a fantastic pattern that prevents callers from flooding a service with more requests when it’s obviously already having problems. It also forces the developer to consider what action to take when their call fails with a CircuitBreakerOpenException. When these exceptions occur, events can be logged, monitoring systems (such as Zabbix) can be called, processing can be temporarily suspended, messages written to a dead letter queue, or any combination of the above and more – the precise strategy for different calls depends on the balance between need for resilience and expense of delivery. An excellent implementation of a circuit breaker is Helpful.CircuitBreaker which is very light weight and easy to use. It’s also available in Nuget.

From experience with Salesforce, the one thing that is guaranteed is a breaking change coming from a source you have no control over. This architecture helps you deal with this in two ways. Firstly, the logging of every payload allows you to see what’s changed straight away. Secondly, because this is hosted middleware in AWS it’s a cinch to fix and redeploy. This is one of the widely celebrated features of a microservice philosophy.

Business Logic

As much as possible, each ‘piece’ of business logic should sit on either one side of an integration or the other – preferably on the side where it was triggered. In reality there are often knock on effects from changes on either side that need to be cascaded across that application boundary and it can become difficult to decide exactly if and how the logic should be split. Whatever the split is, a solution for triggering the remote logic is for entities to fall into a state where they are ‘pending’ some action that needs to be carried out on the opposite side of the integration. A flag for this is added to the payload to trigger the logic. The question is: should the consumption of the pending flag occur in the target system or in the queue processor?

One benefit of leveraging the queue processor is that no concept of the integration is leaked to the target system. The queue processor can make sure that the correct processes are triggered in the target system before placing a message on the queue in the opposite direction to update the originating system from a pending status.

When hitting this problem for the first time, splitting this business logic out from the processor into another service (again deployed to an EC2 instance) would maintain good separation of concerns. This is also the implementation I would suggest.

Wrapping Up

With the benefit of hindsight, it seems obvious that the integration strategy we first picked would never work well. There were obvious failures in a lot of places where we didn’t identify the more finer points for how integrations with Salesforce should work, and maybe there was a little too much blind trust placed in ‘the expert 3rd party’.

That having been said, the result of these mistakes is an architecture that could easily be applied to any other integration. I’m sure some would view it as over-engineering but I think that’s only valid if you know both systems intimately and are happy that every breaking change is something you’ll be doing yourself. Even then, this approach maintains a good separation of concerns and allows you to decouple your domain concepts.