Integration Testing Behaviour with Mountebank

Developer’s machine > dev shared environment > staging environment > UAT > production.

That's probably not exactly how everyone structures their delivery pipeline, but it's unlikely to be far off. It gives instant feedback on whether what a developer is writing actually works with the code other developers are writing. And that's a really good thing. Unfortunately, it misses something…

Each environment (other than the developer’s own machine) is shared with other developers who are also deploying new code at the same time. So how do you get an integration test for component A that relies on component B behaving in a custom manner (maybe even failing) to run automatically, without impacting the people who are trying to build and deploy component B?

If we were writing a unit test we would simply inject a mocked dependency. Fortunately there’s now a fantastic piece of kit available for doing exactly this but on an integration scale: enter Mountebank.

This clever tool will intercept a network call for ANY protocol and respond in the way you ask it to. It's transparent to your component and as easy to use as most mocking frameworks. I won't go into detail about how to configure 'Imposters' as their own documentation is excellent, but suffice it to say an imposter can easily be configured in a TestFixtureSetup or similar.
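
As a flavour of how little ceremony is involved, here's a minimal sketch that creates and tears down an HTTP imposter by calling Mountebank's REST API (which listens on port 2525 by default). The imposter port and stub content below are purely illustrative; something along these lines could live in a fixture setup and teardown.

```csharp
// Sketch: create an HTTP imposter on port 4545 that returns a canned failure
// response, by POSTing to Mountebank's REST API (default port 2525).
// The port numbers and stub content are illustrative.
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class MountebankSetup
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task CreateFailingImposterAsync()
    {
        const string imposter = @"{
            ""port"": 4545,
            ""protocol"": ""http"",
            ""stubs"": [{
                ""responses"": [{
                    ""is"": { ""statusCode"": 503, ""body"": ""component B is down"" }
                }]
            }]
        }";

        var response = await Client.PostAsync(
            "http://localhost:2525/imposters",
            new StringContent(imposter, Encoding.UTF8, "application/json"));

        response.EnsureSuccessStatusCode();
    }

    // Call this from the teardown to remove the imposter again.
    public static Task RemoveImposterAsync() =>
        Client.DeleteAsync("http://localhost:2525/imposters/4545");
}
```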

So where does this fit into our pipeline? Personally, I think the flow should be:

Push code to repo > Code is pulled onto a build server > Build > Unit test > Integration test > Start deployment pipeline

The step where Mountebank comes in is obviously ‘integration testing’.

Keep in mind that installing the component and running it on the build agent is probably not a great idea, so make good use of the cloud or Docker (or both) to spin up a temporary instance which has Mountebank already installed and running. Push your component to it and run your integration tests. Once your tests have run, the instance can be blown away (or if constantly destroying environments gets a bit slow, maybe have them refreshing every night so they don't get cluttered). Docker will definitely help keep these processes efficient.

This principle of spinning up an isolated test instance can work in all kinds of situations, not just where Mountebank would be used. Calls to SQL Server can be redirected to a .mdf file for data-dependent testing, or DynamoDB tables can be generated specifically scoped to the running test.
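
As a hedged sketch of that last idea, a table whose name is scoped to the running test can be created in a fixture setup and dropped again in teardown using the AWS SDK for .NET. The schema and naming convention here are just illustrative assumptions.

```csharp
// Sketch: create a DynamoDB table scoped to a single test run and remove it in teardown.
// The table schema and naming convention are illustrative assumptions.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public class TestScopedTable
{
    private readonly IAmazonDynamoDB _client = new AmazonDynamoDBClient();

    public string TableName { get; } = $"orders-test-{Guid.NewGuid():N}";

    public Task CreateAsync() =>
        _client.CreateTableAsync(new CreateTableRequest
        {
            TableName = TableName,
            AttributeDefinitions = new List<AttributeDefinition>
            {
                new AttributeDefinition("Id", ScalarAttributeType.S)
            },
            KeySchema = new List<KeySchemaElement>
            {
                new KeySchemaElement("Id", KeyType.HASH)
            },
            ProvisionedThroughput = new ProvisionedThroughput(5, 5)
        });
        // In a real test you'd poll DescribeTable until the table is ACTIVE before using it.

    public Task DeleteAsync() => _client.DeleteTableAsync(TableName);
}
```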

What we end up with is the ability to test more behaviours than we can in a shared environment where other people are trying to run their tests at the same time. Without this, our integration tests get restricted to only very basic 'check they talk to each other' style tests which, although valuable, don't cover everything we'd like.

PaaS

My last two clients have had completely contrasting views on PaaS, specifically on whether it should be used at all. Both clients deploy to AWS and Azure. Both want to embrace software volatility. Neither want to introduce unnecessary complexity. Both have a similarly scaled online offering where traffic is subject to peaks and troughs which aren’t always predictable.

With such similar goals and problems to solve, I'm intrigued by how different their approaches have been. Admittedly one client has a much more mature relationship with the cloud, whereas the other is jumping in with both feet while still not sure how to swim. Perhaps that's the crux of the matter and both will eventually become more similar in their approaches.

For this article I want to focus on the perceived issues with PaaS and try to explain why I think many concerns are unfounded.

The Concerns

My current client has raised a number of concerns about PaaS and I’ve dug around on the internet to find what has been worrying other people. Here’s a list of the most popular concerns I’ve seen.

  • Vendor lock in – the fear that if software makes use of PaaS from one cloud provider, it will be too difficult to move to a different provider in future.
  • Compliance – the fear of audit.
  • B.A.U. – the fear of managing a PaaS based solution after the developers have left the building.
  • Lack of published SLAs – the fear that a platform may not be as reliable as you need.
  • Confusing marketing message – the fear of relying on something that isn't defined the same way by any two providers.
  • Lack of standard approach – the fear of ending up with software tightly coupled to a single platform.

This is certainly not an exhaustive list but I think it covers all the popular problems and those concerns raised directly to me from my clients. So now let’s try to address things.

Vendor Lock In

This sounds very scary. The idea that once we start allowing our software to make use of the APIs and services provided by one cloud provider, we’ll be unable to move to a different provider next year.

First of all, let’s talk about what’s driving this footloose requirement. At some level in the business, someone or some people are unsure that the currently chosen cloud provider is going to remain so. They may even want to review how suitable they are on an annual basis and have reserved the right to change their minds at that point. This isn’t unusual and it could be the right thing to do – any company that blindly continues to use the same vendors and service providers without questioning if they still offer the right solution is destined to find themselves hindered by a provider who can no longer meet the business needs. So for example, let’s assume that there is a distinct possibility that although AWS is the flavour of the month, this time next year might see a shift to Microsoft Azure.

At the point of that shift to Azure, what is the expectation for existing systems? There has been a year of development effort pushing software into AWS, does the business think that it can be re-deployed into Azure ‘as is’? I would expect that there would be a plan for a period of transition. I would also expect that it would be recognised that there are some systems for which it isn’t worth spending the money to move. New development will undoubtedly happen in Azure with as little effort as possible focused on AWS. The business doesn’t expect a ‘big bang’ change (which would be incredibly high risk).

Now let’s think about how well your software currently running in AWS will run in Azure. Both AWS and Azure offer hosting with the same Operating Systems, so we’re off to a good start – you should at least be able to manually deploy and get something running. The catch is in the way that the virtual environments are delivered. If your app relies on local HD storage, then moving from AWS to Azure may mean quite a hit. At the time of writing this article, the best throughput you can get from Azure Premium storage is 200MB/s whereas AWS’ EBS Provisioned Volumes will give you a throughput of 320MB/s. So moving to Azure could impact your application’s performance under load, especially if it relies on a self managed instance of a database (Mongo DB for example). In fact, if you want high performance storage in Azure then Table Storage or DocumentDB are probably the best options – both of which are PaaS.

This is only one example of how moving cloud provider could impact your software; there are others. The virtual machine options are different – not just in hard disk size but in available memory, processor speeds and in how their performance changes with load. So what you're deploying quite happily onto half a dozen instances with one cloud provider may require nine or ten instances on another, plus a few tweaks to how the software stores its data.

What I'm trying to highlight here isn't that using PaaS won't be a barrier to moving from one cloud provider to another, rather that it isn't the barrier you'd most need to worry about. Changing the API that is used for caching data is a well defined problem with easily understood steps to implement. Understanding the impact of the subtle differences in how each cloud provider delivers your virtual environments – that's hard.

That's not the end of this issue. Let's look at it from the software side. How often do developers use 3rd party software? From my own experience, I can't remember the last time I spent a day writing code that didn't involve several NuGet Install-Package statements. Every time, I'm careful to prevent tight coupling between my code and the installed packages, so why wouldn't I expect the same care to be taken when working with PaaS? It's really good practice to write a client for your PaaS interaction that abstracts the detail of the implementation away from the logical flow of your software – this is good programming 101. When moving to another cloud provider, the impact of changing the API is predominantly limited to the client; by far not the biggest problem you'll have to deal with.
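
As a sketch of what I mean by a client, here's the shape of the abstraction. All of the names below are made up for illustration; the point is simply that only one class per provider ever touches a provider SDK.

```csharp
// Sketch: keep the PaaS-specific detail behind a small client interface so the
// logical flow of the software never talks to a provider SDK directly.
// The interface and implementations here are illustrative, not a real library.
using System.Threading.Tasks;

public interface IDocumentStore
{
    Task SaveAsync(string id, string json);
    Task<string> LoadAsync(string id);
}

// One implementation per provider; only these classes reference provider SDKs.
public class DynamoDbDocumentStore : IDocumentStore
{
    public Task SaveAsync(string id, string json) { /* AWS SDK calls live here */ return Task.CompletedTask; }
    public Task<string> LoadAsync(string id) { /* AWS SDK calls live here */ return Task.FromResult(""); }
}

public class AzureTableDocumentStore : IDocumentStore
{
    public Task SaveAsync(string id, string json) { /* Azure SDK calls live here */ return Task.CompletedTask; }
    public Task<string> LoadAsync(string id) { /* Azure SDK calls live here */ return Task.FromResult(""); }
}

// Business logic depends only on the abstraction, so a provider move is a
// change to one class, not to the logical flow of the application.
public class OrderService
{
    private readonly IDocumentStore _store;

    public OrderService(IDocumentStore store) => _store = store;

    public Task SaveOrderAsync(string orderId, string orderJson) =>
        _store.SaveAsync(orderId, orderJson);
}
```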

Compliance

Depending on what your business does, you may have restrictions on where you can store your data. Conversely, storing your data in some territories may incur restrictions on how that data must be encrypted. Some territories may just not allow certain types of data to be stored at all; or you may need to be certified in some way and prove correct storage policies by external audit.

These rules don't change if you store your data in a traditional data centre. You still have to be aware of where your data is going and what that means – it isn't just a cloud provider that might make use of geolocation for resilience purposes. So the problem exists either way.

Cloud providers are aware of this issue and, specifically for compliance reasons, are very clear about where their data is stored and what control you have over it.

B.A.U.

Once a system is in place and running, the developers are rarely interested in maintaining it from day to day. That job usually falls to a combination of Operations and DevOps. The concern with PaaS is that it will somehow be harder for a non-development team to manage than something well known and self-managed. I think this falls into the category of 'fear of the unknown' – the question I would ask is: will a service that is managed for you really be harder to look after than something you have to fully manage yourself? Even if you have a dedicated team with a lot of expertise in a particular technology, they still have to do the work of managing it. PaaS is usually configured and then left, with nothing more to do than respond to any alerts which might suggest a need to provision more resources. It's made resilient and highly available by clicking those buttons during configuration or setting those values in an automation script.

Perhaps there is a concern that in future it will be harder to find development resource to make changes. This is a baseless fear – no-one debates this problem when referencing 3rd party libraries via NuGet, and there really isn't any difference. Sure, there may be some subtle behaviours of a PaaS service which a developer may not be aware of, but any problems should be caught by testing. The documentation for PaaS services is often very good and quite to the point; I'd expect any developer working with a PaaS service to spend as much time in its documentation as they would for any 3rd party library they used.

Take a look at the AWS docs for DynamoDB – the behaviour of the database when spikes take reads or writes beyond what has been provisioned is a pretty big gotcha, but it’s described really well and is pretty obvious just from a quick read through.
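
To make that gotcha concrete: if a burst takes writes beyond the provisioned capacity, the AWS SDK for .NET surfaces a ProvisionedThroughputExceededException, and the calling code needs some kind of back-off. A minimal sketch follows; the retry counts and delays are arbitrary illustrations, and the SDK applies some retries of its own before the exception ever reaches you.

```csharp
// Sketch: back off and retry when a DynamoDB write is throttled because the
// table's provisioned write capacity has been exceeded. Retry counts and
// delays are arbitrary illustrations, not recommendations.
using System;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class ThrottledWriter
{
    public static async Task PutWithBackoffAsync(
        IAmazonDynamoDB client, PutItemRequest request, int maxAttempts = 5)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await client.PutItemAsync(request);
                return;
            }
            catch (ProvisionedThroughputExceededException) when (attempt < maxAttempts)
            {
                // Exponential back-off before trying again.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));
            }
        }
    }
}
```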

There is definitely going to be a need to get some monitoring in place, but that is true for the whole system anyway. When establishing the monitoring and alerts, there will have to be some decisions made about which changes are worth monitoring and which warrant alerts. Thinking of the utilised PaaS as just another thing pushing monitoring events is a pretty good way to make sure the right people will know well in advance if any problems are going to be encountered.

Lack of Published SLAs

This can cause some worries, and it's something that annoys me a lot about AWS in particular. I don't see any reason why an SLA wouldn't be published – people want to know what they're buying, and this is an important part of it. But let's get our sensible heads on: we're talking pretty damned decent uptimes even if it isn't always 99.999%.

In my opinion, worrying about the SLA for a PaaS service provided by the likes of Amazon, Microsoft or Google doesn't always make much sense. These companies have massive resources behind them – you're far more likely to mess it up than they are. But let's think about how failures in any service should be handled. There should always be a failure state which defaults to something that at least isn't broken, otherwise your own SLA becomes tied to the combined SLAs of every 3rd party you depend on. Your system has to be resilient to outages of the services it relies on. Also, let's remember where your system is hosted – in the same data centre the PaaS service is running in. If there is an outage of the PaaS service, it could also be impacting your own system. Leveraging the flexibility of geolocation and availability zones allows you to get around those kinds of outages. I'm not saying you're guaranteed constant availability, but how often have you seen amazon.co.uk go down?

Given the nature of cloud hosting coupled with a resilient approach to calling 3rd party services, a lack of published SLA isn’t as terrifying as it seems. Code for outages and do some research about what problems have occurred in the past for any given service.

Confusing Marketing Message

This is an interesting one. What is PaaS? Where does infrastructure end and platform begin? That might be pretty easy to answer in a world of traditional data-centers, but in the cloud things are a bit more fluffy. Take Autoscaling Groups, for example, or more specifically the ability to automatically scale the number of instances of your application horizontally across new instances based on some measure. I’ve heard this described as IaaS, PaaS and once as ‘IaaS plus’.

The line between IaaS and PaaS is being continuously blurred by cloud providers who I don’t think are particularly worried about the strict categorisation of the services they provide. With services themselves consisting of several elements, some of which might or might not fall neatly into either PaaS or IaaS, the result is neither.

I think this categorisation is causing a degree of analysis paralysis among people who feel the need for services to be pigeonholed in some way. Perhaps being able to slot a service into a nice, pre-defined category makes it somehow less arduous to decide whether it could be useful. "Oh, IaaS – yeah, we like that! Use it everywhere." Such categorisations give comfort to an ivory tower, fully top-down approach, but they don't change the fundamental usefulness of any given service.

This feels a little 1990s to me. Architecture is moving on and people are becoming more comfortable with the idea of transferring responsibility for the problematic bits to our cloud provider's solution. We don't have to do everything ourselves to have confidence that it's as good as it could be – in fact that idea is being turned on its head.

I love the phrase "do the hard things often" – well, no-one does any of this as often as the people who provide your cloud infrastructure. They do it way more often than you and they're far better at it, which is fine – your company isn't a cloud provider, it's good at something else.

So should we worry that a service might or might not be neatly described as either PaaS or IaaS? I think it would be far more sensible to ask the question “is it useful?” or even “how much risk is being removed from our architecture by using it?” and that isn’t going anywhere near the cost savings involved.

Lack of Standard Approach

In my mind, this could be a problem, as it does seem to push toward vendor lock in. But let's consider the differing standards across cloud providers – where are they the same? The different mechanisms for providing hard disks for VMs result in Amazon's best offering being half as fast again as Azure's. What about the available VM types? I'm not sure there's much correlation. Auto-scaling mechanisms? Definitely completely different. Code deployment services? Definitely not the same.

I suppose what I'm trying to get at is that each cloud provider has come up with their own service which does things in its own specific way – not surprising really. We don't complain when an Android device doesn't have a Windows-style Start button, so why would we expect two huge feats of engineering like these cloud platforms to obey the same rules? They were created by different people with different ideas, initially to solve different problems.

So there is a lack of standards, but this doesn’t just impact PaaS. If this is a good reason to fear PaaS then it must be a good reason to fear the cloud altogether. I think we’ve found the 1990’s again.

Round Up

I'm not in any way trying to say that PaaS is some kind of silver bullet, or that it is inherently always less risky than a self-managed solution. What I am trying to make clear is that much of the fear around PaaS comes from a lack of understanding. The further away an individual is from dealing with the different implementations (writing the code), the harder it is to see the truth of the detail. We've had decades of indoctrination telling us that physical architecture forms a massive barrier to change, but the cloud and associated technologies (such as DevOps) remove that barrier. We don't have fewer points of contact with external systems – we actually have more – but each of those points is far more easily changed than was once true.

Some Useful Links

http://www.forbes.com/sites/mikekavis/2014/09/15/top-8-reasons-why-enterprises-are-passing-on-paas/

http://devops.com/2014/05/01/devops-paas-give-platform-lets-rock-lets-rock-today/

https://azure.microsoft.com/en-gb/documentation/articles/storage-scalability-targets/

 

Building a Resilient Bidirectional Integration with Salesforce


18 months ago I started building an integration between my client's existing systems and Salesforce. Up until that point I'd had no exposure to Salesforce, so my client also brought in a consultancy for whom it was a speciality. Between us we came up with a strategy where we would expose a collection of REST services for code within Salesforce to interface with, while calls in the opposite direction would use the standard Salesforce REST API. In a room where 50% of us had never worked with Salesforce before, this seemed like a reasonable approach, but it turned out we were all being a bit naive.

Some of the Pitfalls

Outbound Messaging

Salesforce has a predetermined method for outgoing sync calls which is pretty inflexible: on every save of a given entity, a SOAP message can be sent to a specified HTTP endpoint with a representation of the changed entity. We did originally try using this but hit a few problems pretty quickly. One big problem was that after we managed to get it working, we came in the next morning to find it broken. After a lot of debugging we found that the message format had changed very slightly, which our Salesforce consultants explained could happen at any time as Salesforce release updates. As my client had a release cycle of once every two weeks, we all agreed the risk of the integration breaking for that length of time was unacceptable, so we decided that on each save Salesforce would send us just an entity type and id, and we would then use the API to retrieve the new data.

Race Conditions

This pattern worked well until we hit production servers, where we suddenly found that at certain times of day the request to the Salesforce API would result in a dirty read. Right away the problem looked like a race condition, and when we looked further into how Salesforce saves records, we realised how it could happen. Here's the list of steps Salesforce takes to save a record (taken from the Salesforce online documentation):

  1. Loads the original record from the database or initializes the record for an upsert statement.
  2. Loads the new record field values from the request and overwrites the old values. If the request came from a standard UI edit page, Salesforce runs system validation to check the record for:
       • Compliance with layout-specific rules
       • Required values at the layout level and field-definition level
       • Valid field formats
       • Maximum field length
     Salesforce doesn't perform system validation in this step when the request comes from other sources, such as an Apex application or a SOAP API call. Salesforce runs user-defined validation rules if multiline items were created, such as quote line items and opportunity line items.
  3. Executes all before triggers.
  4. Runs most system validation steps again, such as verifying that all required fields have a non-null value, and runs any user-defined validation rules. The only system validation that Salesforce doesn't run a second time (when the request comes from a standard UI edit page) is the enforcement of layout-specific rules.
  5. Executes duplicate rules. If the duplicate rule identifies the record as a duplicate and uses the block action, the record is not saved and no further steps, such as after triggers and workflow rules, are taken.
  6. Saves the record to the database, but doesn't commit yet.
  7. Executes all after triggers.
  8. Executes assignment rules.
  9. Executes auto-response rules.
  10. Executes workflow rules.
  11. If there are workflow field updates, updates the record again.
  12. If workflow field updates introduced new duplicate field values, executes duplicate rules again.
  13. If the record was updated with workflow field updates, fires before update triggers and after update triggers one more time (and only one more time), in addition to standard validations. Custom validation rules are not run again.
  14. Executes processes. If there are workflow flow triggers, executes the flows. (Flow trigger workflow actions, formerly available in a pilot program, have been superseded by the Process Builder. Organizations that are using flow trigger workflow actions may continue to create and edit them, but flow trigger workflow actions aren't available for new organizations. For information on enabling the Process Builder in your organization, contact Salesforce.)
  15. Executes escalation rules.
  16. Executes entitlement rules.
  17. If the record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the parent record. Parent record goes through save procedure.
  18. If the parent record is updated, and a grandparent record contains a roll-up summary field or is part of a cross-object workflow, performs calculations and updates the roll-up summary field in the grandparent record. Grandparent record goes through save procedure.
  19. Executes Criteria Based Sharing evaluation.
  20. Commits all DML operations to the database.
  21. Executes post-commit logic, such as sending email.

Our entity id was being sent from an 'after trigger', which runs at step 7, but data isn't committed to the database until step 20. Discovering this led us down the path of sending the entire record in the trigger, getting round the need to wait for a committed save. Even this isn't ideal, though, as a save could be rolled back after the trigger is executed, leaving our systems out of sync. The general consensus was that this is a reasonably small risk with limited impact on the business.

Unexpected Changes from Superusers

For the business, one of the big selling points of Salesforce is that it empowers users, allowing them to create workflows, install plugins, add validations, change fields, and so on. To the business this sounds fantastic – none of all the waiting around for technical teams to come up with a solution. The drawback is that every time a change goes in that the technical team aren't aware of, it has the potential to break everything. It took a few attempts before we managed to rein everyone in and get them cooperating with the technical team, trying their changes in our development and QA orgs before deploying to production. Until then, things would just suddenly stop working, exceptions would start getting thrown and data would fail to synchronise.

Quick to Diagnose Problems

I think one of the nastiest restrictions we had was being tied to the two-week release cycle – a cycle that would often break when some piece of code written by one of the other two dozen developers in the company did something unexpected and forced us to roll back the release, delaying the next release to three or four weeks. When the integration develops a problem in production that isn't seen anywhere else, we need to get some tracing in place, or tweak the logging levels of existing tracing, to get enough detail. That's something you want to do that day, not three weeks down the line. In an environment where breaking changes can come from the platform itself, it's really important to be able to get in and see what's going on right away.

The Key Requirements of the Correct Solution

Ok, so we can probably agree that we didn’t get our solution right. The idea was conceived without really understanding how Salesforce worked and this bit us over and over again as we reacted to architectural problems with pretty large changes in direction. If I could go back and sit in on that first meeting where we conceived our monster, I would interject with the following requirements:

  1. The solution must not be tied to the two weekly deployment cycle of the main project.
  2. It should be easy and quick to change.
  3. All data passed in both directions should be logged for debugging purposes and to allow replay in the case of major outage.
  4. The solution shouldn’t use Salesforce triggers.
  5. The solution should include a space for integration specific business logic that is aware of both Salesforce and the main system (removing all leakage of concepts in either direction).
  6. It should provide its own health analysis to allow monitoring.
  7. Health issues and major errors should trigger notifications.
  8. It should be scalable independently of either Salesforce or the existing systems.

The Solution

Overview

My revised solution is a piece of middleware architected as microservices, working with Amazon's Simple Queue Service (SQS) and a Relational Database Service (RDS) instance. Figure 1 is a conceptual diagram giving an overall view of what I mean; I've left out logging and notifications for brevity.

Figure 1: Conceptual overview of the synchronisation middleware (diagram not reproduced here).

The Flow

The flow of data is pretty much symmetrical in processing order, so starting from either end with a payload of data to be synchronised (a sketch of a queue processor covering these steps follows the list):

  1. The payload is dropped into an SQS queue in AWS.
  2. A queue processor picks up the message within a few seconds.
  3. The full payload is logged to the Sync DB's history (which may have an automatic expiration configured).
  4. The processor checks in the Sync DB for an existing mapping for the entity represented by the payload.
  5. If a mapping is found, then an update payload is sent to the target system.
  6. If a mapping is not found, then a create payload is sent to the target system.
  7. Whether updating or creating, the payload is also recorded in the Sync DB’s history.
  8. A response is received back from the target system, the result of which is recorded into the Sync DB’s history along with updates to the mapping record.
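
To make the flow a little more concrete, here is a minimal sketch of a queue processor built on the AWS SDK for .NET. The SQS calls are real SDK calls, but ISyncDb and ITargetSystem are hypothetical interfaces standing in for the Sync DB and for whichever system is the target in that direction.

```csharp
// Sketch of a queue processor implementing the flow above. The SQS calls are
// real AWS SDK calls; ISyncDb and ITargetSystem are hypothetical abstractions
// over the Sync DB and the target system (Salesforce or the main system).
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class QueueProcessor
{
    private readonly IAmazonSQS _sqs;
    private readonly string _queueUrl;
    private readonly ISyncDb _syncDb;
    private readonly ITargetSystem _target;

    public QueueProcessor(IAmazonSQS sqs, string queueUrl, ISyncDb syncDb, ITargetSystem target)
    {
        _sqs = sqs;
        _queueUrl = queueUrl;
        _syncDb = syncDb;
        _target = target;
    }

    public async Task PollOnceAsync()
    {
        var response = await _sqs.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl = _queueUrl,
            MaxNumberOfMessages = 10,
            WaitTimeSeconds = 20   // long polling
        });

        foreach (var message in response.Messages)
        {
            await _syncDb.LogPayloadAsync(message.Body);                  // step 3

            var mapping = await _syncDb.FindMappingAsync(message.Body);   // step 4
            var result = mapping == null
                ? await _target.CreateAsync(message.Body)                 // step 6
                : await _target.UpdateAsync(mapping, message.Body);       // step 5

            await _syncDb.RecordResultAsync(message.Body, result);        // steps 7 and 8

            await _sqs.DeleteMessageAsync(_queueUrl, message.ReceiptHandle);
        }
    }
}

// Hypothetical abstractions over the Sync DB and the target system.
public interface ISyncDb
{
    Task LogPayloadAsync(string payload);
    Task<object> FindMappingAsync(string payload);
    Task RecordResultAsync(string payload, object result);
}

public interface ITargetSystem
{
    Task<object> CreateAsync(string payload);
    Task<object> UpdateAsync(object mapping, string payload);
}
```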

Scalability

Scaling of SQS can be achieved by horizontal scaling and batching, and both strategies can be used together. Batching may be difficult to achieve from the Salesforce side, as I would recommend sticking to their standard outbound messaging system, which means a further service may be needed to transpose these payloads into the queue. Horizontal scaling should be completely transparent to all systems, allowing throughput of several thousand messages per second if taken to its limit.
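
On the batching side, SQS accepts up to ten messages per batch send. A minimal sketch of what that transposing service might do is below; the queue URL and payload contents are placeholders.

```csharp
// Sketch: send payloads to SQS in batches of up to ten, the batch limit SQS
// imposes. Queue URL and payload contents are placeholders.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class BatchSender
{
    public static async Task SendAsync(IAmazonSQS sqs, string queueUrl, IEnumerable<string> payloads)
    {
        // Group the payloads into chunks of ten, the SQS batch limit.
        foreach (var batch in payloads.Select((p, i) => new { p, i })
                                      .GroupBy(x => x.i / 10))
        {
            var entries = batch
                .Select(x => new SendMessageBatchRequestEntry(x.i.ToString(), x.p))
                .ToList();

            await sqs.SendMessageBatchAsync(new SendMessageBatchRequest(queueUrl, entries));
        }
    }
}
```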

The queue processors would be deployed to EC2 instances, each with its own auto-scaling group and a scaling policy driven by CloudWatch alarms on queue size. Even though the number of consumers for each queue would increase, SQS hides messages that are mid-processing so other consumers don't pick up a message that's already being handled (although in our scenario, if that did happen, it wouldn't be likely to cause any problems).

The Sync DB would require some tuning, and only running this architecture would really give an idea of what size of instance to use (or indeed whether multiple instances were required). The choice of RDS over DynamoDB is specifically for scalability reasons – DynamoDB is fantastic for lightweight requirements, but it doesn't handle bursts of traffic well and needs to be carefully configured to avoid read or write failures when under stress.

Resilience

In this scenario, resilience is an interesting topic: if, during an outage, we store up payloads and re-run them later, we may well overwrite data that was added at the destination during the outage. It may be that the data is so sensitive and critical that every write would have to check the last-updated timestamp of the target record to decide whether to allow the write. The collision-handling logic that follows would add complexity to the system, though, and in my client's case it was deemed not worth worrying about.

This architecture is of course a distributed design, so some protection has to be put in place to prevent failures cascading through to other parts of the system. All calls across application boundaries should be made via circuit breakers. This is a fantastic pattern that prevents callers from flooding a service with more requests when it's obviously already having problems, and it forces the developer to consider what action to take when their call fails with a CircuitBreakerOpenException. When these exceptions occur, events can be logged, monitoring systems (such as Zabbix) can be called, processing can be temporarily suspended, messages can be written to a dead letter queue, or any combination of the above and more – the precise strategy for each call depends on the balance between the need for resilience and the expense of delivery. An excellent implementation of a circuit breaker is Helpful.CircuitBreaker, which is very lightweight and easy to use. It's also available on NuGet.
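
I won't reproduce Helpful.CircuitBreaker's API here, but to show the shape of the behaviour being described, here's a generic, stripped-down sketch of the pattern. It is not the library's implementation; the class and thresholds are illustrative.

```csharp
// Generic sketch of the circuit breaker pattern described above; this is not
// Helpful.CircuitBreaker's API, just an illustration of the behaviour.
using System;
using System.Threading.Tasks;

public class CircuitBreakerOpenException : Exception { }

public class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _failureCount;
    private DateTime _openedAt = DateTime.MinValue;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> call)
    {
        // While open, fail fast instead of flooding a struggling service.
        if (_failureCount >= _failureThreshold &&
            DateTime.UtcNow - _openedAt < _openDuration)
        {
            throw new CircuitBreakerOpenException();
        }

        try
        {
            var result = await call();
            _failureCount = 0;               // success closes the breaker
            return result;
        }
        catch
        {
            if (++_failureCount >= _failureThreshold)
            {
                _openedAt = DateTime.UtcNow; // trip the breaker
            }
            throw;
        }
    }
}
```

The caller then catches CircuitBreakerOpenException and decides what to do: log, alert, suspend processing or push the message to a dead letter queue, as described above.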

From experience with Salesforce, the one thing that is guaranteed is a breaking change coming from a source you have no control over. This architecture helps you deal with this in two ways. Firstly, the logging of every payload allows you to see what’s changed straight away. Secondly, because this is hosted middleware in AWS it’s a cinch to fix and redeploy. This is one of the widely celebrated features of a microservice philosophy.

Business Logic

As much as possible, each 'piece' of business logic should sit on one side of the integration or the other – preferably the side where it was triggered. In reality there are often knock-on effects from changes on either side that need to be cascaded across the application boundary, and it can become difficult to decide exactly if and how the logic should be split. Whatever the split, one solution for triggering the remote logic is for entities to fall into a state where they are 'pending' some action that needs to be carried out on the opposite side of the integration; a flag for this is added to the payload to trigger the logic. The question is: should the pending flag be consumed by the target system or by the queue processor?

One benefit of leveraging the queue processor is that no concept of the integration is leaked to the target system. The queue processor can make sure that the correct processes are triggered in the target system before placing a message on the queue in the opposite direction to update the originating system from a pending status.
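
As a hedged sketch of the queue processor consuming the pending flag, something like the following shape works; the payload shape, flag name and interfaces are all illustrative.

```csharp
// Sketch of handling a 'pending' flag in the queue processor rather than in the
// target system. The SyncPayload shape, flag name and interfaces are illustrative.
using System.Threading.Tasks;

public class SyncPayload
{
    public string EntityJson { get; set; }
    public bool PendingRemoteAction { get; set; }
}

public class PendingActionHandler
{
    private readonly ITargetProcesses _target;   // triggers processes in the target system
    private readonly IReturnQueue _returnQueue;  // queue flowing back to the originating system

    public PendingActionHandler(ITargetProcesses target, IReturnQueue returnQueue)
    {
        _target = target;
        _returnQueue = returnQueue;
    }

    public async Task HandleAsync(SyncPayload payload)
    {
        if (!payload.PendingRemoteAction) return;

        // Trigger the remote logic, then tell the originating system the pending
        // status can be cleared - neither system knows about the other.
        await _target.TriggerProcessesAsync(payload.EntityJson);
        await _returnQueue.EnqueueStatusUpdateAsync(payload.EntityJson, "Completed");
    }
}

public interface ITargetProcesses
{
    Task TriggerProcessesAsync(string entityJson);
}

public interface IReturnQueue
{
    Task EnqueueStatusUpdateAsync(string entityJson, string newStatus);
}
```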

When hitting this problem for the first time, splitting this business logic out from the processor into another service (again deployed to an EC2 instance) maintains good separation of concerns, and it's the implementation I would suggest.

Wrapping Up

With the benefit of hindsight, it seems obvious that the integration strategy we first picked would never work well. There were obvious failures in a lot of places where we didn't identify the finer points of how integrations with Salesforce should work, and maybe there was a little too much blind trust placed in 'the expert 3rd party'.

That having been said, the result of these mistakes is an architecture that could easily be applied to any other integration. I’m sure some would view it as over-engineering but I think that’s only valid if you know both systems intimately and are happy that every breaking change is something you’ll be doing yourself. Even then, this approach maintains a good separation of concerns and allows you to decouple your domain concepts.

From Azure to Amazon Web Services

I’ve been building software with .NET since it first appeared and I’ve always been a fan. With the recent surge of cloud offerings I got right behind Microsoft and launched myself into the world of Azure without really considering the options too much. After all, my MSDN license gives me a stack of free usage. It wasn’t until my two most recent contracts, both of which used Amazon Web Services (AWS), that I really started to question whether Azure was enabling me as well as other options might do.

Just Write Code

Probably the most appealing thing about Azure is the ability to just write code and have it hosted for you without having to get all involved with the whole VM management side of things. It lowers the barrier for entry and allows you to get work deployed and available incredibly quickly. That Azure is software, platform and infrastructure as a service is one of its greatest strengths in my opinion – allowing you to grow your architecture only when you need to.

In contrast, AWS wants you to deploy a platform for your work before you can deploy even a "Hello World" service. Not only do you need an EC2 instance, but you'll also need to define some inbound and outbound rules and an IAM role so you don't have to keep passing credentials around. Not straightforward for the uninitiated.

All One Platform

Another bonus with Azure is that you can do everything you need to with the standard set of tools you already have at your disposal as a .NET engineer. If you make use of Visual Studio Online (http://www.visualstudio.com) then you’ll find that you can fully automate your builds, your deployments into Azure, and your automated testing. Continuous integration and continuous deployment without moving outside of the one tech stack means you’re less likely to find breaking changes made by different companies who have different ideas. Plus, once you’ve bought your MSDN license, it’s all there available for you.

AWS takes a different approach. There's an excellent API which will do everything you need – in fact, Amazon are quite vocal about their tendency to 'dog food' their own work – and a console for manually setting things up which is pretty intuitive, but those are really the only two things you get. If you want to automate deployment in AWS, you have to get your hands dirty writing code or scripts, or use a 3rd party. Chef, Chocolatey, PowerShell, Ruby, TeamCity, Go and countless other 3rd party tools and technologies are all going to become familiar to you. Many of these are open source (hippy cred to them), but some also originate from a Linux background and bring the complexities you'd expect.

A Decent Read

So given the great things Microsoft are doing with Azure, why am I finding myself drawn more and more to AWS?

The answer might be unexpected: help. Amazon have one of the best, most intuitive sets of online documentation I've seen. It's brilliant. No matter what you want to do in AWS, you'll find tutorials, explanations or examples. Sure, Microsoft has the MSDN library, but it's not the easiest place to find what you're after and it suffers from an overly rigid format. There are forums and blogs where Azure questions are answered in detail, but the platform seems to have made so many huge leaps forward in such a short time that it's difficult to find relevant information for the work you're doing today. The AWS library, on the other hand, is mostly up to date. There are a few topics that are a couple of updates behind, but they're still helpful – the platform has become stable enough that the documentation stays fresh.

To draw a direct comparison, after 3 years of deploying work into Azure, I’m still not completely happy with the architecture of my solutions. I’m making more decisions based on my chosen cloud than I would like (although designing for the cloud is expected) and compromising where I have to use workarounds to get the functionality I need. When I deploy to AWS I’m happy that I’m less likely to find something that just isn’t possible (even though there are lots of blogs with instructions on how to do it as of 18 months ago which are no longer correct).

Use Cases

I predominantly build software. More recently, much of what I've been building has been microservice based – an architecture that I feel lends itself better to AWS than Azure. But there are some fantastic things about Azure. The big data features are excellent – I would never try to reproduce some of the facilities already available in Azure in another environment; what would be the point? Also, like I said earlier, with Azure the simple things are very simple to do. You can build things quickly and get things moving with far less planning. But if you're building an enterprise level platform yourself, then you've probably already done all the planning and designing you need to. You're aware of where you're most likely heading and have the skills to set up a few routing rules.

The next piece of work I’m looking at will be to move a system I’ve been building in Azure for the last couple of years into AWS. I reckon I’ll need about a day.