Large JSON Responses

The long slog from a 15 year old legacy monolith system to an agile, microservice based system will almost inevitably include throwing some API’s in front of a big old database. Building a cleaner view of the domain allows for some cleaner lines to be drawn between concerns, each with their own service. But inside those services there’s usually a set of ridiculous SQL queries building the nice clean models being exposed. These ugly SQL queries add a bit of time to the responses, and can lead to a bit of complexity but this is the real world, often we can only do so much.

So there you are after a few months of work with a handful of services deployed to an ‘on premises’ production server. Most responses are pretty quick, never more than a couple of seconds. But now you’ve been asked to build a tool for generating an Excel document with several thousand rows in it. To get the data, a web app will make an http request to the on prem API. So far so good. But once you’ve written the API endpoint and requested some realistic datasets through it, you realise the response takes over half an hour. What’s more, while that response is being built, the API server runs pretty hot. If more than a couple of users request a new Excel doc at the same time then everything slows down.

Large responses from API calls are not always avoidable, but there are a couple of things we can do to lessen the impact they have on resources.

Chunked Response

Firstly, lets send the response a bit at a time. In .NET Web Api, this is pretty straight forward to implement. We start with a simple HttpMessageHandler:

public class ChunkedResponseHttpHandler : DelegatingHandler
{
    protected override async Task SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = await base.SendAsync(request, cancellationToken);
        response.Headers.TransferEncodingChunked = true;
        return response;
    }
}

Now we need to associate this handler with the controller action which is returning our large response. We can’t do that with attribute based routing, but we can be very specific with a routing template.

config.Routes.MapHttpRoute("CustomersLargeResponse",
                "customers",
                new { controller = "Customers", action = "GetByBirthYear" },
                new { httpMethod = new HttpMethodConstraint(HttpMethod.Get)  },
                new ChunkedResponseHttpHandler(config));

In this example, the url ‘customers’ points specifically at CustomersController.GetByBirthYear() and will only accept GET requests. The handler is assigned as the last parameter passed to MapHttpRoute().

The slightly tricky part comes when writing the controller action. Returning everything in chunks won’t help if you wait until you’ve loaded the entire response into memory before sending it. Also, streaming results isn’t something that many database systems natively support. So you need to be a bit creative about how you get data for your response.

Let’s assume you’re querying a database, and that your endpoint is returning a collection of resources which you already have a pretty ugly SQL query for retrieving by id. The scenario is not as contrived as you might think. Dynamically modifying the ‘where’ clause of the ‘select by id’ query and making it return all the results you want would probably give the fastest response time. It’s a valid approach, but if you know you’re going to have a lot of results then you’re risking memory issues which can impact other processes, plus you’re likely to end up with some mashing of SQL strings to share the bulk of the select statement and add different predicates which isn’t easily testable. The approach I’m outlining here is best achieved by breaking the processing into two steps. First, query for the ID’s for the entities you’re going to return. Secondly, use your ‘select by ID’ code to retrieve them one at a time, returning them in an enumerator, rather than a fully realised collection type. Let’s have a look at what a repository method might look like for this.

public IEnumerator GetByBirthYear(int birthYear)
{
    IEnumerable customerIds = _customersRepository.GetIdsForBirthYear(birthYear);
    foreach (var id in customerIds)
    {
        Customer customer;
        try
        {
            customer = Get(id);
        }
        catch (Exception e)
        {
            customer = new Customer
            {
                Id = id,
                CannotRetrieveException = e
            };
        }
        yield return customer;
    }
}

public Customer Get(int customerId)
{
    ...
}

The important things to notice here are:

  1. The first call is to retrieve the customer ID’s we’re interested in.
  2. Each customer is loaded from the same Get(int customerId) method that is used to return customers by ID.
  3. We don’t want to terminate the whole process just because one customer couldn’t be loaded. Equally, we need to do something to let the caller know there might be some missing data. In this example we simply return an empty customer record with the exception that was thrown while loading. You might not want to do this if your API is public, as you’re leaking internal details, but for this example let’s not worry.

The controller action which exposes this functionality might look a bit like this:

public IEnumerable GetByBirthYear(int birthYear)
{
    IEnumerator iterator;
    iterator = _customersServices.GetForBirthYear(birthYear);
    while (iterator.MoveNext())
    {
        yield return iterator.Current;
    }
}

Things to notice here are:

  1. There’s no attribute based routing in use here. Because we need to assign our HttpHandler to the action, we have to use convention based routing.
  2. At no point are the results loaded into a collection of any kind. We retrieve an enumerator and return one result at a time until there are no more results.

JSON Stream

Using this mechanism is enough to return the response in a chunked manner and start streaming the response as soon as there’s a single result ready to return. But there’s still one more piece to the puzzle for our service. Depending what language the calling client is written in, it can either be straight forward to consume the JSON response as we have here or it can be easier to consume what’s become known as a JSON stream. For a DotNet consumer, sending our stream as a comma delimited array is sufficient. If we’re expecting calls from a Ruby client then we should definitely consider converting our response to a JSON stream.

For our customers response, we might send a response which looks like this (but hopefully much bigger):

{"id":123,"name":"Fred Bloggs","birthYear":1977}
{"id":133,"name":"Frank Bruno","birthYear":1961}
{"id":218,"name":"Ann Frank","birthYear":1929}

This response is in a format called Line-Delimited JSON (LDJSON). There’s no opening square bracket to say this is a collection, because it isn’t a collection. This is a stream of individual records which can be processed without having to wait for the entire response to be evaluated. Which makes a lot of sense; just as we don’t want to have to load the entire response on the server, we also don’t want to load the entire response on the client.

A chunked response is something that most HTTP client packages will handle transparently. Unless the client application is coded specifically to receive each parsed object in each chunk, then there’s no difference on the client side to receiving an unchunked response. LDJSON breaks this flexibility, because the response is not valid JSON – one client will consume it easily, but another would struggle. At the time of writing, DotNet wants only standard JSON whereas it’s probably easier in Ruby to process LDJSON. That’s not to say it’s impossible to consume the collection in Ruby or LDJSON in DotNet, it just requires a little more effort for no real reward. To allow different clients to still consume the endpoint, we can add a MediaTypeFormatter specifically for the ‘application/x-json-stream’ media type (this isn’t an official media type, but it has become widely used). So any consumer can either expect JSON or an LDJSON stream.

public class JsonStreamFormatter : MediaTypeFormatter
{
    public JsonStreamFormatter()
    {
        SupportedMediaTypes.Add(new MediaTypeHeaderValue("application/x-json-stream"));
    }

    public override async Task WriteToStreamAsync(Type type, object value, Stream writeStream, HttpContent content,
        TransportContext transportContext)
    {
        using (var writer = new StreamWriter(writeStream))
        {
            var response = value as IEnumerable;
            if (response == null)
            {
                throw new NotSupportedException($"Cannot format for type {type.FullName}.");
            }
            foreach (var item in response)
            {
                string itemString = JsonConvert.SerializeObject(item, Formatting.None);
                await writer.WriteLineAsync(itemString);
            }
        }
    }

    public override bool CanReadType(Type type)
    {
        return false;
    }

    public override bool CanWriteType(Type type)
    {
        Type enumerableType = typeof(IEnumerable);
        return enumerableType.IsAssignableFrom(type);
    }
}

This formatter only works for types implementing IEnumerable, and uses Newtonsoft’s JsonConvert object to serialise each object in turn before pushing it into the response stream. Enable the formatter by adding it to the Formatters collection:

config.Formatters.Add(new JsonStreamFormatter());

DotNet Consumer

Let’s take a look at a DotNet consumer coded to expect a normal JSON array of objects delivered in a chunked response.

public class Client
{
    public IEnumerator Results()
    {
        var serializer = new JsonSerializer();
        using (var httpClient = new HttpClient())
        {
            httpClient.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
            using (var stream = httpClient.GetStreamAsync("http://somedomain.com/customers?birthYear=1977").Result)
            {
                using (var jReader = new JsonTextReader(new StreamReader(stream)))
                {
                    while (jReader.Read())
                    {
                        if (jReader.TokenType != JsonToken.StartArray && jReader.TokenType != JsonToken.EndArray)
                            yield return serializer.Deserialize(jReader);
                    }
                }
            }
        }
    }
}

Here we’re using the HttpClient class and we’re requesting a response as “application/json”. Instead of using a string version of the content, we’re working with a stream. The really cool part is that we don’t have to do much more than just throw that stream into a JsonTextReader (part of NewtonSoft.Json). We can yield the response of each JSON token as long as we ignore the first and last tokens, which are the open and closing square brackets of the JSON array. Calling jReader.Read() at the next level reads the whole content of that token, which is one full object in the JSON response.

This method allows each object to be returned for processing while the stream is still being received. The client can save on memory usage just as well as the service.

Ruby Consumer

I have only a year or so experience with Ruby on Rails, but I’ve found it to be an incredibly quick way to build services. In my opinion, there’s a trade off when considering speed of development vs complexity – because the language is dynamic the developer has to write more tests which can add quite an overhead as a service grows.

To consume our service from a Ruby client, we might write some code such as this:

def fetch_customers(birth_year)
  uri  = "http://somedomain.com/customers"
  opts = {
  query: { birthYear: birth_year },
    headers: {'Accept' => 'application/x-json-stream'},
    stream_body: true,
    timeout: 20000
  }

  parser.on_parse_complete = lambda do |customer|
      yield customer.deep_transform_keys(&:underscore).with_indifferent_access
  end

  HTTParty.get(uri, opts) do |chunk|
     parser << chunk
  end
end

private
def parser
  @parser ||= Yajl::Parser.new
end

Here we're using HTTParty to manage the request, passing it ‘stream_body: true’ in the options and ‘Accept’ => ‘application/x-json-stream’ set. This tells HTTParty that the response will be chunked, the header tells our service to respond with LDJSON.

From the HTTParty.get block, we see that each chunk is being passed to a JSON parser Yajl::Parser, which understands LDJSON. Each chunk may contain a full JSON object, or several, or partial objects. The parser will recognise when it has enough JSON for a full object to be deserialized and it will send it to the method assigned to parser.on_parse_complete where we’re simply returning the object as a hash with indifferent access.

The Result

Returning responses in a chunked fashion is more than just a neat trick, the amount of memory used by a service returning data in this fashion compared to loading the entire result set into memory before responding is tiny. This means more consumers can request these large result sets and other processes on the server are not impacted.

From my own experience, the Yajl library seems to become unstable after handling a response which streams for more than half an hour or so. I haven’t seen anyone else having the same issue, but on one project I’ve ended up removing Yajl and just receiving the entire response with HTTParty and parsing the collection fully in memory. It isn’t ideal, but it works. It also doesn’t stop the service from streaming the response, it’s just the client that waits for the response to load completely before parsing it.

It’s a nice strategy to understand and is useful in the right place, but in an upcoming post I’ll be explaining why I think it’s often better to avoid large JSON responses altogether and giving my preferred strategy and reasoning.

Integration Testing Behaviour with Mountebank

Developer’s machine > dev shared environment > staging environment > UAT > production.

Probably not exactly how everyone structures their delivery pipelines but probably not that far off. It allows instant feedback on whether what a developer is writing actually works with the code other developers are writing. And that’s a really good thing. Unfortunately, it misses something…

Each environment (other than the developer’s own machine) is shared with other developers who are also deploying new code at the same time. So how do you get an integration test for component A that relies on component B behaving in a custom manner (maybe even failing) to run automatically, without impacting the people who are trying to build and deploy component B?

If we were writing a unit test we would simply inject a mocked dependency. Fortunately there’s now a fantastic piece of kit available for doing exactly this but on an integration scale: enter Mountebank.

This clever piece of kit will intercept a network call for ANY protocol and respond in the way you ask it to. Transparent to your component and as easy to use as most mocking frameworks. I won’t go into detail about how to configure ‘Imposters’ as their own documentation is excellent, but suffice to say it can be easily configured in a TestFixtureSetup or similar.

So where does this fit into our pipeline? Personally, I think the flow should be:

Push code to repo > Code is pulled onto a build server > Build > Unit test > Integration test > Start deployment pipeline

The step where Mountebank comes in is obviously ‘integration testing’.

Keep in mind that installing the component and running it on the build agent is probably not a great idea, so make good use of the cloud or docker or both to spin up a temporary instance which has Mountebank already installed and running. Push your component to it, and run your integration tests. Once your tests have run then the instance can be blown away (or if constantly destroying environments gets a bit slow, maybe have them refreshing every night so they don’t get cluttered). Docker will definitely help keeping these processes efficient.

This principle of spinning up an isolated test instance can work in all kinds of situations, not just where Mountebank would be used. Calls to SQL Server can be redirected to a .mdf file for data dependent testing. Or DynamoDb tables can be generated specifically scoped to the running test.

What we end up with is the ability to test more behaviours than we can do in a shared environment where other people are trying to run their tests at the same time. Without this, our integration tests can get restricted to only very basic ‘check they talk to each other’ style tests which although have value do not cover everything we’d like.

Going Deep Enough with Microservices

Moving from a monolith architecture to microservices is a widely debated process, with many recommendations and nuggets of advice available on the web in blogs like this. There are so many different opinions out there mainly because where an enterprise finds their main complexities lay depends on the skillsets of their technologists, the domain knowledge within the business and the existing code base. During the years I’ve spent as a contractor in a very wide range of enterprises, I’ve seen lots of monolith architectures – all of them causing slightly different headaches because those responsible for developing them let different aspects of the architecture slip. After all, the thing that is often forgotten is that if a monolith is maintained well, then it can work. The reverse is also true – if a microservice architecture is left to evolve on its own, it can cause as many problems as a poorly maintained monolith.

Domains

One popular way to break things down is using Domain Driven Design. Two books which cover most concepts involved in this process are ‘Building Microservices’ by Sam Newton (http://shop.oreilly.com/product/0636920033158.do) and ‘Implementing Domain Driven Design’ by Vaughn Vernon (http://www.amazon.com/Implementing-Domain-Driven-Design-Vaughn-Vernon/dp/0321834577) which largely references ‘Domain Driven Design: Tackling Complexity in the Heart of Software’ by Eric Evans (http://www.amazon.com/Domain-Driven-Design-Tackling-Complexity-Software/dp/0321125215). I recommend Vaughn’s book over Evans’ as the latter is a little dry.

If you take on board even just half the content covered in these books, you’ll be on a reasonable footing to get started. You’ll make mistakes but as Sam Newton points out (and I’ve seen for myself) that’s inevitable.

Something that seems to be left out of a lot of domain driven discussions is what happens beyond the basic CRUD processes and domain logic in the application layer. Attention sits primarily with the thin interaction between a web interface and the domain processing by the aggregate in question. When dismantelling a monolith architecture into microservices, focus on just the application layer can give the impression of fast progress but in reality half the picture is missing. It’s likely that in a few months there will be several microservices but instead of them operating solely in their sub-domains, they’ll still be tied to the database that the original monolith was using.

Context

It’s hugely important to pull the domain data out of the monolith store. This is for the very same reasons we segregate service responsibilities into sub-domains. Data pertaining to a given domain may exist in other domains as well but changes will not necessarily be subjected to the same domain rules and individual records may have different properties. There may be a User record in several sub-domains, each with a Username property but the logic around how duplicate Usernames are prevented should sit firmly in a single sub-domain. If a service in a different sub-domain needs to update the username, it should either call a public service from the Profile sub-domain or raise a ‘Username Updated’ event that the Profile sub-domain would handle, process and possibly respond with a ‘Username Update Failed’ event of its own.

This example may be a little contrived – checking for duplicates could be something that’s implemented everywhere it’s needed. But consider what would happen if it became necessary to check for duplicates within another external system every time a Username is updated. That logic could easily be encapsulated behind the call to the Profile service but having to update every service that updates Usernames wouldn’t be good practice.

So if we are now happy that the same data represented in different sub-domains could at any one time be different (given the previous two paragraphs) then we shouldn’t store the data for both sub-domains in the same table.

Local Data

In fact, we’re now pretty well removed from needing a classic relational database for storing data that’s local to the sub-domain. We’re dealing with data that is limited in scope and is intended for use solely by the microservices built to sit in that sub-domain. NoSQL databases are ideal for this scenario and no matter which platform you’ve chosen to build on there are excellent options available. One piece of advice I think is pretty sound is that if you are working in the cloud, you’ll usually get the best performance by using the data services provided by your cloud provider. Make sure you do your homework, though – some have idiosyncracies that can impact performance if you don’t know about them.

So now we have data stored locally to the sub-domain, but this isn’t where the work stops. It’s likely there’s a team of DBA’s jumping around wondering why their data warehouse isn’t getting any new data.

The problem is that the relational database backing the monolith wasn’t just acting as a data-store for the application. There were processes feeding other data-stores for things like customer reporting, machine learning platforms and BI warehouses. In fact, anything that requires a historical view of things will be reading it from one or more stores that are loaded incrementally from the monolith’s relational database. Now data is being stored in a manner best suiting any given sub-domain, there isn’t a central source for that data to be pulled from into these downstream stores.

Shift of Responsibility

Try asking a team of dba’s if they fancy writing CLR based stored procedures to detect changes and pull new records into their warehouse by querying whatever data-store technologies have been decided on in each case – I doubt they’ll be too receptive. The responsibility for getting data out of each local data-store now has to move closer to the application services.

The data guys are interested in recording historical and aggregated records, which is convenient as there is a useful well known tool for informing different systems that something has happened – an event.

It’s been argued that using events to communicate across sub-domains is miss-using an event stream as a message bus. My argument in this case is that the back-end historical data-store is still within the original sub-domain. The data being stored belongs specifically to that sub-domain and still holds the same context as when it was saved. There has been a transition to a new medium of storage but that’s all.

So we are now free to raise events from our application microservices into eventstreams which are then handled by a service specifically designed to transfer data from events into whatever downstream stores were originally being fed from the monolith database. This gives us full extraction from the monolithic architecture and breaks the sub-domain’s dependency on the monolith database.

There is also the possibility that we can now give more fine grained detail of changes than was being recorded previously.

Gaps in the Monolith Database

Of course back end data-stores aren’t the only consumers of the sub-domain’s data. Most likely there will be other application level queries that used to read the data you’re now saving outside of the monolith database. How you manage these dependencies will depend on whether the read requests are coming from the same sub-domain or another. If they’re from the same sub-domain then it’s equally correct to either pull the data from an event stream or from microservices within that sub-domain. Gradually, a sub-domain’s dependency on the monolith database will die. If the queries are coming from a different sub-domain then it’s better to continue to update the monolith database as a consumer of the data stored locally to the sub-domain. The original table no longer containing data that is relevant to the sub-domain you’re working on.

Switching

Obviously we don’t want to have any gaps in the data being sent to our back-end stores, so as we pull functionality into microservices and add new data-stores local to the sub-domain, we also need to build the pipeline for our new back end processing of domain events into the warehouse. As this gets switched on, the loading processes from the original monolith can be switched off.

External Keys

Very few enterprise systems function in isolation. Most businesses make use of off-the-shelf packages or cloud based services such as Salesforce. Mapping records into these systems usually means using the primary key of each record to create a reference. If this has happened then the primary key from the monolith is most likely being relied on to hold things together. Moving away from the monolith database means the primary key generation has probably been lost.

There are two options here and I’d suggest going with whatever is the easiest – they both have their merits and problems.

  1. Continue to generate unique id’s in the same way as the monolith database did and continue to use these id’s for reference across different systems. Don’t rely on the monolith for id generation here, create a new process in the microservice that continues the same pattern.
  2. Start generating a new version of id generation and copy the new keys out to the external systems for reference. The original keys can eventually be lost.

Deeper than Expected

When planning the transition from monolithic architecture to microservices, there may well be promises from the management team that time will be given to build each sub-domain out properly. Don’t take this at face value – Product Managers will still have their roadmaps to fulfill and unfortunately there is maybe only 30% of any given slice of functionality being pulled out of a monolith that an end user will ever see. Expect the process to be difficult no matter what promises are made.

What I really want to get across here is that extracting even a small amount of functionality into microservices carries with it a much deeper dive into the enterprise’s tech stack than just creating a couple of application services. It requires time and focus from more than just the Dev team and before it can even be started, there has to be a architectural plan spanning the full vertical slice of a sub-domain, from front end to warehoused historical data.

Consequences of Not Going Deep Enough

How difficult do you find it in your organisation to get approval for technical upgrade work, or for dealing with technical debt as a project (which I’m not advocating is a good strategy), or for doing anything which doesn’t have a directly measurable positive impact on new product? In my experience, it isn’t easy and I’m not sure it should be, but that’s for another post.

Imagine you’ve managed to extract maybe 70% of your application layer away from your monolith but you’re still tied to the same data model. Have you achieved what you set out to do? You certainly don’t have loose coupling because everything is tied at the data level. You don’t have domain isolation. You are preventing your data team from getting access to the juicy new events you don’t really need to be raising (because the changed data is already available everywhere). You’ve turned a monolith into an abomination – it isn’t really microservices and it isn’t a classic monolith, it isn’t really any desired pattern at all. Even worse, the work you are missing is pretty big and may not directly carry with it any new features. Will you get agreement to remove coupling with the database as a project itself?

How are your developers doing? How many of them see that the strategy is only going half way? How many are moaning about paying lip service to the architecture? Wasn’t that one of the reasons you started with microservices in the first place?

Can you deploy the microservices without affecting other sub-domains? What if there are schema changes? What if there are schema changes in 2 sub-domains and one needs to be rolled back after release because it wasn’t quite right? Wasn’t this something microservices was supposed to prevent?

How many dodgy hacks or ‘surprises’ are there in your new code where devs have managed to make domain isolated services work with a single relational data model? How many devs waste time hand wrangling when they know they’re building something that is going to be technical debt the moment it goes live?

Ok, so I’m painting a darker picture than you’ll probably feel, but each of these scenarios will almost certainly come up, you just might not get to hear about it.

The crux for me is thinking about the reasons for pursuing a microservice architecture. The flexibility, loose coupling, technology agnosticity (if that’s a real term), the speed of continuous delivery that you’re looking for. Unless you go deeper than the low lying fruit of the application layer, you’ll be cheating yourself out of these benefits. Sure, you’ll see improvements short term but you are building something which is already technical debt. No matter what architecture you choose, if you don’t invest in maintaining it properly (or even building it properly in the first place) then it will ultimately become your albatross.

A Helpful Circuit Breaker in C#

Introduction

With the increasing popularity of SOA in the guise of ‘microservices’, circuit breakers are now a must have weapon in any developer’s arsenal. Services are rarely 100% reliable; outages happen, network connections get pulled, memory gets filled, routing tables get corrupted. In an environment where multiple services are each calling multiple other services, the result of an outage in a small, seemingly unimportant service can be a random slow down in response times in your web application that gradually leads to complete server lock up. (If you don’t believe me, read Release It by Micheal Nygard from the Pragmatic bookshelf).

The idea of a circuit breaker is to detect that a service is down and fail immediately for subsequent calls in an expected manner that your application can handle gracefully. Then, every so often, the breaker will attempt to close and allow a call to be sent to the troubled service. If that call is successful then the breaker starts allowing calls through, if that call fails then the breaker remains in an open state and continues to fail with an expected exception.

Helpful.CircuitBreaker is a simple implementation that allows a developer to be proactive about the way their code handles failures.

Usage

There are 2 primary ways that the circuit breaker can be used:

  1. Exceptions thrown from the code you wish to break on can trigger the breaker to open.
  2. A returned value from the code you wish to break on can trigger the breaker to open.

Here are some basic examples of each scenario.

In the following example, exceptions thrown from _client.Send(request) will cause the circuit breaker to react based on the injected configuration.

public class MakeProtectedCall
{
    private ICircuitBreaker _breaker;
    private ISomeServiceClient _client;

    public MakeProtectedCall(ICircuitBreaker breaker, ISomeServiceClient client)
    {
        _breaker = breaker;
        _client = client;
    }

    public Response ExecuteCall(Request request)
    {
        Response response = null;
        _breaker.Execute(() => response = _client.Send(request));
        return response;
    }
}

In the following example, exceptions thrown by _client.Send(request) will still trigger the exception handling logic of the breaker, but the lamda applies additional logic to examine the response and trigger the breaker without ever receiving an exception. This is particularly useful when using an HTTP based client that may return failures as error codes and strings instead of thrown exceptions.

public class MakeProtectedCall
{
    private ICircuitBreaker _breaker;
    private ISomeServiceClient _client;

    public MakeProtectedCall(ICircuitBreaker breaker, ISomeServiceClient client)
    {
        _breaker = breaker;
        _client = client;
    }

    public Response ExecuteCall(Request request)
    {
        Response response = null;
        _breaker.Execute(() => {
        response = _client.Send(request));
        return response.Status == "OK" ? ActionResponse.Good : ActionResult.Failure;
    }
}

Initialising

The scope of a circuit breaker must be considered first. When the breaker opens, subsequent calls will not succeed, but if your breaker is in the scope of an HTTP request then there may not be a subsequent request hitting that open breaker. The next request would hit a newly built, closed breaker.

The following code will initialise a basic circuit breaker which once open will not try to close until 1 minute has passed (60 seconds is the default timeout, so there’s no need to specify it).

CircuitBreakerConfig config = new CircuitBreakerConfig
{
    BreakerId = "Some unique and constant identifier that indicates the running instance and executing process"
};
CircuitBreaker circuitBreaker = new CircuitBreaker(config);

To inject a circuit breaker into class TargetClass using Ninject, try code similar to this:

Bind().ToMethod(c =&gt; new CircuitBreaker(new CircuitBreakerConfig
{
    BreakerId = string.Format("{0}-{1}-{2}", "Your breaker name", "TargetClass", Environment.MachineName)
})).WhenInjectedInto(typeof(TargetClass)).InSingletonScope();

The above code will reuse the same breaker for all instances of the given class, so the breaker continues to report state continuously across different threads. When opened by one use, all instances of TargetClass will have an open breaker.

Tracking Circuit Breaker State

The suggested method for tracking the state of the circuit breaker is to handle the breaker events. These are defined on the CircuitBreaker class as:

///
/// Raised when the circuit breaker enters the closed state
///
public event EventHandler ClosedCircuitBreaker;

///
/// Raised when the circuit breaker enters the opened state
///
public event EventHandler OpenedCircuitBreaker;

///
/// Raised when trying to close the circuit breaker
///
public event EventHandler TryingToCloseCircuitBreaker;

///
/// Raised when the breaker tries to open but remains closed due to tolerance
///
public event EventHandler ToleratedOpenCircuitBreaker;

///
/// Raised when the circuit breaker is disposed
///
public event EventHandler UnregisterCircuitBreaker;

///
/// Raised when a circuit breaker is first used
///
public event EventHandler RegisterCircuitBreaker;

Attach handlers to these events to send information about the event to a logging or monitoring system. In this way, sending state to Zabbix or logging to log4net is trivial.

CONFIGURATION OPTIONS
Make sure each circuit breaker has it’s own configuration injected using the CircuitBreakerConfig class.

using System;
using System.Collections.Generic;
using Helpful.CircuitBreaker.Events;

namespace Helpful.CircuitBreaker.Config
{
    /// <summary>
    ///
    /// </summary>
    [Serializable]
    public class CircuitBreakerConfig : ICircuitBreakerDefinition
    {
        /// <summary>
        /// Initializes a new instance of the <see cref="CircuitBreakerConfig"/> class.
        /// </summary>
        public CircuitBreakerConfig()
        {
            ExpectedExceptionList = new List<Type>();
            ExpectedExceptionListType = ExceptionListType.None;
            PermittedExceptionPassThrough = PermittedExceptionBehaviour.PassThrough;
            BreakerOpenPeriods = new[] { TimeSpan.FromSeconds(60) };
        }

        /// <summary>
        /// The number of times an exception can occur before the circuit breaker is opened
        /// </summary>
        /// <value>
        /// The open event tolerance.
        /// </value>
        public short OpenEventTolerance { get; set; }

        /// <summary>
        /// Gets or sets the list of periods the breaker should be kept open.
        /// The last value will be what is repeated until the breaker is successfully closed.
        /// If not set, a default of 60 seconds will be used for all breaker open periods.
        /// </summary>
        /// <value>
        /// The array of timespans representing the breaker open periods.
        /// </value>
        public TimeSpan[] BreakerOpenPeriods { get; set; }

        /// <summary>
        /// Gets or sets the expected type of the exception list. <see cref="ExceptionListType"/>
        /// </summary>
        /// <value>
        /// The expected type of the exception list.
        /// </value>
        public ExceptionListType ExpectedExceptionListType { get; set; }

        /// <summary>
        /// Gets or sets the expected exception list.
        /// </summary>
        /// <value>
        /// The expected exception list.
        /// </value>
        public List<Type> ExpectedExceptionList { get; set; }

        /// <summary>
        /// Gets or sets the timeout.
        /// </summary>
        /// <value>
        /// The timeout.
        /// </value>
        public TimeSpan Timeout { get; set; }

        /// <summary>
        /// Gets or sets a value indicating whether [use timeout].
        /// </summary>
        /// <value>
        ///   <c>true</c> if [use timeout]; otherwise, <c>false</c>.
        /// </value>
        public bool UseTimeout { get; set; }

        /// <summary>
        /// Gets or sets the breaker identifier.
        /// </summary>
        /// <value>
        /// The breaker identifier.
        /// </value>
        public string BreakerId { get; set; }

        /// <summary>
        /// Sets the behaviour for passing through exceptions that won't open the breaker
        /// </summary>
        public PermittedExceptionBehaviour PermittedExceptionPassThrough { get; set; }
    }
}

Conclusion

This library has helped me build resilient microservices that have remained stable when half the internet has been falling over. I hope it can help you as well.

Building a Resilient Bidirectional Integration with Salesforce

blockquote {font-size: 12px;}

18 months ago I started building an integration between my client’s existing systems and Salesforce. Up until that point I had no exposure to Salesforce so my client also brought in a consultancy for whom it was a speciality. Between us we came up with a strategy where we would expose a collection of REST services for code within Salesforce to interface with while calls in the opposite direction would use the standard Salesforce REST API. In a room where 50% of us had never worked with Salesforce before, this seemed like a reasonable approach but it turns out we were all being a bit naive.

Some of the Pitfalls

Outbound Messaging

Salesforce has a predetermined method of outgoing sync calls which is pretty inflexible. On every save of any given entity, a SOAP message can be sent to a specified http endpoint with a representation of the changed entity. We did originally try using this but hit on a few problems pretty quickly. One big problem was that after we managed to get it working, we came in the next morning to find it broken. After a lot of debugging we found that the message had changed format very slightly, which our Salesforce consultants explained could happen at any time as Salesforce release updates. As my client had a release cycle of once very two weeks, we all agreed the risk of the integration breaking for that length of time was unacceptable, so we decided that on each save, Salesforce would just send us an entity type and id, then we would use the API to retrieve the new data.

Race Conditions

This pattern worked well until we hit production servers where we suddenly found that at certain times of day, the request to the Salesforce API would result in a dirty read. Right away the problem looked like a race condition and when we looked further into how Salesforce saves records, we realised how it could happen. Here’s a list of steps that Salesforce takes to save a record (taken from the Salesforce online documentation):

1. Loads the original record from the database or initializes the record for an upsert statement.

2. Loads the new record field values from the request and overwrites the old values.

If the request came from a standard UI edit page, Salesforce runs system validation to check the record for:

Compliance with layout-specific rules

Required values at the layout level and field-definition level

Valid field formats

Maximum field length

Salesforce doesn’t perform system validation in this step when the request comes from other sources, such as an Apexapplication or a SOAP API call.

Salesforce runs user-defined validation rules if multiline items were created, such as quote line items and opportunity line items.

3. Executes all before triggers.

4. Runs most system validation steps again, such as verifying that all required fields have a non-null value, and runs any user-defined validation rules. The only system validation that Salesforce doesn’t run a second time (when the request comes from a standard UI edit page) is the enforcement of layout-specific rules.

5. Executes duplicate rules. If the duplicate rule identifies the record as a duplicate and uses the block action, the record is not saved and no further steps, such as after triggers and workflow rules, are taken.

6. Saves the record to the database, but doesn’t commit yet.

7. Executes all after triggers.

8. Executes assignment rules.

9. Executes auto-response rules.

10. Executes workflow rules.

11. If there are workflow field updates, updates the record again.

12. If workflow field updates introduced new duplicate field values, executes duplicate rules again.

13. If the record was updated with workflow field updates, fires before update triggers and after update triggers one more time (and only one more time), in addition to standard validations. Custom validation rules are not run again.

14. Executes processes.

If there are workflow flow triggers, executes the flows.

Flow trigger workflow actions, formerly available in a pilot program, have been superseded by the Process Builder. Organizations that are using flow trigger workflow actions may continue to create and edit them, but flow trigger workflow actions aren’t available for new organizations. For information on enabling the Process Builder in your organization, contact Salesforce.

15. Executes escalation rules.

16. Executes entitlement rules.

17. If the record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the parent record. Parent record goes through save procedure.

18. If the parent record is updated, and a grandparent record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the grandparent record. Grandparent record goes through save procedure.

19. Executes Criteria Based Sharing evaluation.

20. Commits all DML operations to the database.

21. Executes post-commit logic, such as sending email.

Our entity id was being sent from an ‘after trigger’ which was getting run at step 7, data isn’t committed to the database until step 20. Discovering this led us to the path of sending the entire record in the trigger, getting round the need to wait for a committed save. Even this isn’t ideal though, as a save could be rolled back after the trigger is executed, leaving our systems out of sync. The general consensus was that this is a reasonably small risk with limited impact to the business.

Unexpected Changes from Superusers

For the business, one of the big selling points of Salesforce is that it empowers users, allowing them to create workflows, install plugins, add validations, change fields, and so on. To the business this sounds fantastic – none of all the waiting around for technical teams to come up with a solution. The drawback is that every time a change goes in that the technical team aren’t aware of, it has the potential to break everything. It took a few attempts before we managed to reign everyone into cooperating with the technical team and getting them to try their changes in our development and QA orgs before deploying to production. Until then, things would just suddenly stop working. Exceptions would start getting thrown and data would fail to synchronise.

Quick to Diagnose Problems

I think one of the nastiest restrictions we had was being tied to the two-week release cycle. A release cycle that would often break when some piece of code written by one of the other two dozen developers in the company would do something unexpected and require us to roll back the release. The next release may be delayed to 3 or 4 weeks as a result. When the integration develops a problem in production that isn’t seen anywhere else, we have to get some tracing in place, or tweak the logging levels of existing tracing to get enough detail. This is something you want to do that day, not 3 weeks down the line. In an environment where breaking changes can come from the platform itself, it’s really important to be able to get in and see what’s going on right away.

The Key Requirements of the Correct Solution

Ok, so we can probably agree that we didn’t get our solution right. The idea was conceived without really understanding how Salesforce worked and this bit us over and over again as we reacted to architectural problems with pretty large changes in direction. If I could go back and sit in on that first meeting where we conceived our monster, I would interject with the following requirements:

  1. The solution must not be tied to the two weekly deployment cycle of the main project.
  2. It should be easy and quick to change.
  3. All data passed in both directions should be logged for debugging purposes and to allow replay in the case of major outage.
  4. The solution shouldn’t use Salesforce triggers.
  5. The solution should include a space for integration specific business logic that is aware of both Salesforce and the main system (removing all leakage of concepts in either direction).
  6. It should provide its own health analysis to allow monitoring.
  7. Health issues and major errors should trigger notifications
  8. It should be scalable independently of either Salesforce or the existing systems.

The Solution

Overview

My revised solution is to build a piece of middleware architected as microservices working with Amazon’s Simple Queuing Service (SQS) and a Relational Database Service (RDS) instance. Figure 1 is a conceptual diagram giving an overall view of what I mean. I’ve left out logging and notifications for brevity.

FIGURE 1

Figure 1

The Flow

The flow of data is pretty much symmetrical in processing order, so starting from either end with a payload of data to be synchronised:

  1. The payload is dropped into an SQS queue in AWS.
  2. A queue processor picks up the message within a few seconds.
  3. The full payload is logged to the Sync DB’s history (which may have an automatic expiration configured)
  4. The processor checks in the Sync DB for an existing mapping for the entity represented by the payload.
  5. If a mapping is found, then an update payload is sent to the target system.
  6. If a mapping is not found, then a create payload is sent to the target system.
  7. Whether updating or creating, the payload is also recorded in the Sync DB’s history.
  8. A response is received back from the target system, the result of which is recorded into the Sync DB’s history along with updates to the mapping record.

Scalability

Scaling of SQS can be achieved by horizontal scaling and batching. Both strategies can be used in conjunction. Batching may be difficult to achieve from the Salesforce side as I would recommend sticking to their standard outbound messaging system which means a further service may be needed to transpose these payloads into the queue. Horizontal scaling should be completely transparent to all systems allowing throughput of several thousands of messages per second, if taken to its limit.

The queue processors would be deployed to EC2 instances and each would have its own auto-scaling group. An auto scaling policy would be needed for each to scale based on CloudWatch alarms triggered by queue size. Even though the number of consumers for each queue would increase, Amazon hide messages that are ‘mid processing’ so other consumers don’t pick up a message that’s already being handled (although in our scenario, if that did happen, it wouldn’t be likely to cause any problems).

The Sync DB would require some tuning and only running this architecture would really give an idea of what size of instance to use (or indeed whether multiple instances were required). The choice of RDS over dynamoDB is specifically for scalability reasons – dynamoDb is fantastic for light weight requirements but it doesn’t handle bursts of traffic well at all and needs to be carefully configured to avoid read or write failures when under stress.

Resilience

In this scenario, resilience is an interesting topic as if during an outage, we store up payloads and re-run them, we may well be overwriting data that has been added during the outage at the destination. It may be that the data is so sensitive and critical that every write process would have to check the last updated timestamp of the target record to see whether to allow the write. Subsequent collision handling logic would add complexity to the system, though and in my client’s case was voted not worth worrying about.

This architecture is of course a distributed design, so some protection has to be put in place to prevent failures cascading through to other parts of the system. All calls across application boundaries should be made via circuit breakers. This is a fantastic pattern that prevents callers from flooding a service with more requests when it’s obviously already having problems. It also forces the developer to consider what action to take when their call fails with a CircuitBreakerOpenException. When these exceptions occur, events can be logged, monitoring systems (such as Zabbix) can be called, processing can be temporarily suspended, messages written to a dead letter queue, or any combination of the above and more – the precise strategy for different calls depends on the balance between need for resilience and expense of delivery. An excellent implementation of a circuit breaker is Helpful.CircuitBreaker which is very light weight and easy to use. It’s also available in Nuget.

From experience with Salesforce, the one thing that is guaranteed is a breaking change coming from a source you have no control over. This architecture helps you deal with this in two ways. Firstly, the logging of every payload allows you to see what’s changed straight away. Secondly, because this is hosted middleware in AWS it’s a cinch to fix and redeploy. This is one of the widely celebrated features of a microservice philosophy.

Business Logic

As much as possible, each ‘piece’ of business logic should sit on either one side of an integration or the other – preferably on the side where it was triggered. In reality there are often knock on effects from changes on either side that need to be cascaded across that application boundary and it can become difficult to decide exactly if and how the logic should be split. Whatever the split is, a solution for triggering the remote logic is for entities to fall into a state where they are ‘pending’ some action that needs to be carried out on the opposite side of the integration. A flag for this is added to the payload to trigger the logic. The question is: should the consumption of the pending flag occur in the target system or in the queue processor?

One benefit of leveraging the queue processor is that no concept of the integration is leaked to the target system. The queue processor can make sure that the correct processes are triggered in the target system before placing a message on the queue in the opposite direction to update the originating system from a pending status.

When hitting this problem for the first time, splitting this business logic out from the processor into another service (again deployed to an EC2 instance) would maintain good separation of concerns. This is also the implementation I would suggest.

Wrapping Up

With the benefit of hindsight, it seems obvious that the integration strategy we first picked would never work well. There were obvious failures in a lot of places where we didn’t identify the more finer points for how integrations with Salesforce should work, and maybe there was a little too much blind trust placed in ‘the expert 3rd party’.

That having been said, the result of these mistakes is an architecture that could easily be applied to any other integration. I’m sure some would view it as over-engineering but I think that’s only valid if you know both systems intimately and are happy that every breaking change is something you’ll be doing yourself. Even then, this approach maintains a good separation of concerns and allows you to decouple your domain concepts.