What’s Slowing Your Business?

There are lots of problems that prevent businesses from responding to market trends as quickly as they’d like. Many are not IT related; some are. I’d like to discuss a few problems that I see over and over again, and maybe present some useful solutions. As you read this, please remember that there are always exceptions. But deciding that you have one of these exceptional circumstances is always easier when starting from a sensible basic idea.

Business-focused targeting.

For many kinds of work, quicker is better. For software development, quicker is better. But working faster isn’t the same thing as delivering faster.

I remember working as technical lead for a price comparison site in the UK, where once a week each department would read out a list of the things they’d achieved in the last week and how that had benefited the business. For many parts of the business there was a nice, easy line that could be drawn from what they did each week to a growth statistic (even if some seemed quite contrived). But the development team was still quite inexperienced, and struggling to do CI, never mind CD. For the less experienced devs, being told to “produce things quicker” had the opposite effect. The traditional carrot and stick doesn’t have the same impact on software development as on other functions, because a lot of what speeds up delivery seems counterintuitive:

  • Have two people working on each task (pair programming)
  • Focus on only one feature at a time
  • Write as much (or more) test code as functional code
  • Spend time discussing terminology and agreeing a ubiquitous language
  • Decouple from other systems
  • Build automated delivery pipelines

These are just a few examples of practices that get cut when someone wants the dev team to work faster. But in reality, having these things in place is what enables a dev team to work faster.

Development teams feel a lot of pressure to deliver, because they know how good they can be. They know how quickly software can be written, but it takes mature development practices to deliver quickly and maintain quality. Without the required automation, delivering quickly will almost always mean a reduction in quality and more time spent fixing bugs. Then there are the bugs created while fixing other bugs, and so on. Never mind the huge architectural spirals caused by not enough thought going into things at the start. In the world of software, slow and steady may lose the first round, but it sets the rest of the race up for a sure win.

Tightly coupling systems.

I can’t count how often I’ve heard someone say “We made a tactical decision to tightly couple with <insert some system>, because it will save us money in the long run.”

No.

Just no.

Please stop thinking this.

Is it impossible for highly coupled systems to be beneficial? No. Is yours one of these cases? Probably not.

There are so many hidden expenses incurred by tightly coupled designs that it almost never makes sense. The target system is quite often the one thing everything ends up being coupled with, because it’s probably the least flexible ‘off the shelf’ dinosaur, sold to the business without any technical review. There are probably not many choices for how to work with it. Well, the bottom line is: find a way, or get rid. Otherwise you end up with dozens of applications all tightly bound to one central monster app. Changes become a nightmare of breaking everyone else’s code. Deployments take entire weekends. License fees for the dinosaur go through the roof. Vendor lock-in turns into shackles and chains. Reality breaks down. Time reverses, and mullets become cool.

Maybe I exaggerated with the mullets.

Once you start down this path, you will gradually lose whatever technical individuals you have who really ‘get’ software delivery. The people who could make a real difference to your business will gradually go somewhere their skills can make a difference. New features will not only cost you more to implement but they’ll come with added risk to other systems.

If you are building two services which have highly related functionality, i.e. they’re in the same sub-domain (from a DDD perspective), then you might decide that they should be aware of each other on a conceptual level, and have some logic which spans both services, depends on both being ‘up’, and gets versioned together. This might be acceptable and might not lead to war or famine, but I’m making no promises.

It’s too hard to implement DevOps.

No, it isn’t.

Yes, you need at least someone who understands how to do it, but moving to a DevOps approach doesn’t mean implementing it across the board right away. That would be an obscene way forwards. Start with the next thing you need to build. Make it deployable, make it testable with integration tests written by the developer. Work out how to transform the configuration for different environments. Get it into production. Look at how you did it, decide what you can do better. Do it better with the next thing. Update the first thing. Learn why people use each different type of technology, and whether it’s relevant for you.

Also, it’s never too early to do DevOps. If you are building one ‘thing’ then it will be easier to work with if you are doing DevOps. If you have the full stack defined in a CI/CD pipeline and you can get all your changes tested in pre-production environments (even infra changes) then you’re winning from the start. Changes become easy.

If you have a development team who don’t want to do DevOps then you have a bigger problem. It’s likely that they aren’t the people who are going to make your business succeed.

Ops do routing, DBAs do databases.

Your developers should be building the entire stack. They should be building the deployment pipeline for the entire stack. During deployment, the pipeline should configure DNS, update routing tables, configure firewalls, apply WAF rules, deploy EC2 instances, install the built application, run database migration scripts, and run tests end to end to make sure the whole lot is done correctly. Anything other than this is just throwing a problem over the fence to someone else.

The irony is that the people doing the developers’ ‘dirty work’ think this is how they support the business, when in reality this is how they allow developers to build software that can never work in a deployed state. This is why software breaks when it gets moved to a different environment.

Ops, DBAs, and other technology specialists should be responsible for defining the overall patterns which get implemented, and the standards which must be met. The actual work should be done by the developer, if for no other reason than that when a developer needs a SQL script written, there will never be a DBA available. The same goes for any out-of-team dependency – they’re never available. This is one of the biggest blockers to progress in software development: waiting for other people to do their bit. It’s another form of tight coupling, building inter-dependent teams. It’s a people anti-pattern.

If your developers need help getting their heads around routing principles or database indexing, then get them in a room with your experts. Don’t get those people to do the dirty work for everyone else; that won’t scale.

BAU handle defects.

A defect found by a customer should go straight back to the team which built the software. If that team is no longer there, then whichever team was ‘given’ responsibility for that piece of software gets to fix the bug.

Development teams will go a long way to give themselves an easy life. That includes adding enough error handling, logging, and resilient design practices to make bug fixing a cinch, but only if they’re the ones who have to deal with the bugs.

Fundamental design flaws won’t get fixed unless they’re blocking the development team.

Everything else.

This isn’t an exhaustive list. Even now more things are springing to mind, but if I tried to call every one out then I’d have a book, not a blog post. The really unfortunate truth is that 90% of the time I see incredibly intelligent people at the development level being ignored by the business, by architects, even by each other, because even when a person hears someone saying ‘this is a good/bad idea’, seeing past their own preconceptions to understand that point of view is often incredibly difficult. Technologists all too often lack the soft skills required to make themselves heard and understood. It’s up to those who have made a career from their ‘soft skills’ to recognise that and pay extra attention. A drowning person doesn’t usually thrash about and make a noise.

Don’t Stream JSON Data (Part 2)

I’ve discussed the merits of JSON streaming in two prior posts: Large JSON Responses and Don’t Stream JSON Data. If you haven’t read these yet, take a quick look first; they’re not long reads.

I’m attracted to the highly scalable proposition of scaling out the consumer, so that many small requests can be made individually rather than one huge result being returned from the service. It places the complexity with the consumer, rather than with the service, which really shouldn’t be bothered about how it’s being used. At the service side, scaling happens with infrastructure, which is a pattern we should be embracing. Making multiple simultaneous requests from the consumer is reasonably straightforward in most languages.
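As a rough illustration of that consumer side, here’s a minimal sketch of firing off many requests at once with a cap on concurrency. The names (ThrottledFetcher, FetchAllAsync) are my own invention, not from the earlier posts; the real fetch would be an HTTP call rather than the delegate used here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledFetcher
{
    // Fetch one record per id, never running more than maxConcurrency
    // requests at the same time.
    public static async Task<IReadOnlyList<T>> FetchAllAsync<T>(
        IEnumerable<int> ids,
        Func<int, Task<T>> fetchOne,
        int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        // Start one task per id; each waits for a free slot before fetching.
        var tasks = ids.Select(async id =>
        {
            await gate.WaitAsync();
            try { return await fetchOne(id); }
            finally { gate.Release(); }
        }).ToList();
        return await Task.WhenAll(tasks);
    }
}
```

In a real consumer, fetchOne would wrap something like HttpClient.GetAsync for a single record’s endpoint, and maxConcurrency is the knob that stops you hammering the service.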

But let’s say our service isn’t deployed somewhere which is easily scalable and that simultaneous requests at a high enough rate to finish in a reasonable time would impact the performance of the service for other consumers. What then?

In this situation, we need to make our service go as fast as possible. One way to do this would be to pull all the data in one huge SQL query and build our response objects from that. It would certainly run as quickly as possible, but there are some issues:

  1. Complexity in embedded SQL strings is hard to manage.
  2. From a service developer’s point of view, SQL is hard to test.
  3. We’re using completely new logic to generate our objects which will need to be tested. In our example scenario (in Large JSON Responses) we already have tested, proven logic for building our objects but it builds one at a time.

Complexity and testability are pretty big issues, but I’m more interested in issue 3: ignoring and duplicating existing logic. APIs in front of legacy databases are often littered with crazy, unknowable logic tweaks; “if property A is between 3 and 10 then override property B with some constant, otherwise set property C to the value queried from some other table but just include the last 20 characters” – I’m sure you’ve seen the like. Getting this right the first time around was probably pretty tough going, so do you really want to go through it again?

We could use almost the same code as for our chunked response, but parallelise the querying of each record. Now our service method would look something like this:

public IEnumerable<Customer> GetByBirthYear(int birthYear)
{
    IEnumerable<int> customerIds = _customersRepository.GetIdsForBirthYear(birthYear);
    // ConcurrentBag<T> (System.Collections.Concurrent), because List<T>.Add
    // is not thread safe when called from Parallel.ForEach
    var customerList = new ConcurrentBag<Customer>();
    Parallel.ForEach(customerIds, id =>
    {
        Customer customer;
        try
        {
            customer = Get(id);
        }
        catch (Exception e)
        {
            customer = new Customer
            {
                Id = id,
                CannotRetrieveException = e
            };
        }
        customerList.Add(customer);
    });
    return customerList;
}

public Customer Get(int customerId)
{
    ...
}

Firstly, the loop through the customer IDs is no longer a plain foreach loop; we’ve added a call to Parallel.ForEach. This method of parallelisation is particularly clever in that it gradually increases the degree of parallelism to a level determined by available resources – it’s one of the easiest ways to achieve parallel processing. Secondly, we’re now populating a full collection of customers and returning the whole result in one go, because it’s simply not possible to yield return from inside the parallel lambda expression. It also means that responding with a chunked response is pretty redundant and probably adds a bit of unnecessary complexity.

This strategy will only work if all the code called by the Get() method is thread safe. Something to be very careful with is the database connection: SqlConnection is not thread safe.

Don’t keep your SqlConnection objects hanging around; new up a fresh object every time you want to query the database, unless you need to continue the current transaction. No matter how many SqlConnection objects you create, the number of underlying connections is limited by the server and by what’s configured in the connection string. A new connection will be requested from the pool, and will only be handed over when one is available.
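To make that pattern concrete, here’s a small sketch of the ‘new connection per query’ shape. FakeConnection is a stand-in I’ve invented for SqlConnection so the example runs on its own; with the real class the structure is identical, and Dispose() is what returns the underlying pooled connection.

```csharp
using System;

// Stand-in for SqlConnection: tracks how many connections are currently open.
public class FakeConnection : IDisposable
{
    public static int OpenCount;
    public void Open() => OpenCount++;
    public void Dispose() => OpenCount--; // real Dispose() returns the connection to the pool
}

public static class Repository
{
    // Hypothetical repository method: one short-lived connection per query.
    public static string GetCustomerName(int customerId)
    {
        using (var connection = new FakeConnection())
        {
            connection.Open();
            return $"customer-{customerId}"; // stand-in for the real SQL query
        }
    }
}
```

The point is the using block: the connection exists only for the duration of the query, so threads never share one, and pooling keeps the repeated construction cheap.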

So now we have an n+1 scenario where we’re querying the database possibly thousands of times to build our response. Even if we make these queries on several threads and the processing time is acceptable, all the complexity now lives in our service, so we can take advantage of the direct relationship with the database to make this even quicker.

Let’s say our Get() method needs to make 4 separate SQL queries to build a Customer record, each taking one integer value as an ID. It might look something like this:

public Customer Get(int customerId)
{
    var customer = _customerRepository.Get(customerId);
    customer.OrderHistory = _orderRepository.GetManyByCustomerId(customerId);
    customer.Address = _addressRepository.Get(customer.AddressId);
    customer.BankDetails = _bankDetailsRepository.Get(customer.BankDetailsId);
    return customer;
}

To stop each of these .Get() methods hitting the database, we can cache the data up front, one SQL query per repository class. This preserves our logic but presents a problem: assuming we’re using Microsoft SQL Server, there is a practical limit to the number of items we can put in an ‘IN’ clause, so we can’t just list thousands of customer IDs there (https://docs.microsoft.com/en-us/sql/t-sql/language-elements/in-transact-sql). If we can select by multiple IDs, then we can turn our n+1 scenario into just 5 queries.

It turns out that we can specify thousands of IDs in an ‘IN’ clause by using a sub-query. So our problem shifts to how to create a temporary table with all our customer IDs in it for that sub-query to select from. Unless you’re using a very old version of SQL Server, you can insert multiple rows with a basic ‘INSERT’ statement. For example:

INSERT INTO #TempCustomerIds (ID)
VALUES
(1),
(2),
(3),
(4),
(5),
(6)

This results in 6 rows in the table, with the values 1 through 6 in the ID column. However, we once again hit a limit: it’s only possible to insert 1,000 rows per insert statement this way.

Fortunately, we’re working one level above raw SQL, and we can work our way around this limitation. An example is in the code below.

public void LoadCache(IEnumerable<int> customerIds)
{
    string insertCustomerIdsQuery = string.Empty;
    foreach (IEnumerable<int> customerIdList in customerIds.ToPagedList(500))
    {
        insertCustomerIdsQuery +=
            $" INSERT INTO #TempCustomerIds (CustomerId) VALUES ({string.Join("),(", customerIdList)});";
    }
    string queryByCustomerId =
        $@"IF OBJECT_ID('tempdb..#TempCustomerIds') IS NOT NULL DROP TABLE #TempCustomerIds;
CREATE TABLE #TempCustomerIds (CustomerId int);

{insertCustomerIdsQuery}

{CustomerQuery.SelectBase} WHERE c.CustomerId IN (SELECT CustomerId FROM #TempCustomerIds);

IF OBJECT_ID('tempdb..#TempCustomerIds') IS NOT NULL DROP TABLE #TempCustomerIds;";
    var customers = _repo.FindAll(queryByCustomerId);
    foreach (var customer in customers)
    {
        Cache.Add(customer.CustomerId, customer);
    }
}

A few things from the code snippet above:

  • ToPagedList() is an extension method that breaks a list into multiple lists, each containing at most the number of items passed in. So .ToPagedList(500) will break a list of thousands of items into lists of 500. The idea is to use a number below the 1000-row limit for inserts. You could achieve the same thing in different ways.
  • The string insertCustomerIdsQuery is the result of concatenating all the insert statements together.
  • CustomerQuery.SelectBase is the select statement that would have had the ‘select by id’ predicate, with that predicate removed.
  • The main SQL statement first drops the temp table if it already exists, then creates it. We then insert all the IDs into that table, select all matching records whose IDs are in the temp table, and finally drop the temp table again.
  • Cache is a simple dictionary of customers by ID.
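ToPagedList() itself isn’t shown in the original post, so here’s one possible implementation, sketched purely for completeness (the name and shape are assumptions from how it’s used above):

```csharp
using System.Collections.Generic;

public static class PagingExtensions
{
    // Split a sequence into consecutive pages of at most pageSize items.
    public static IEnumerable<List<T>> ToPagedList<T>(this IEnumerable<T> source, int pageSize)
    {
        var page = new List<T>(pageSize);
        foreach (var item in source)
        {
            page.Add(item);
            if (page.Count == pageSize)
            {
                yield return page;              // emit a full page
                page = new List<T>(pageSize);   // start the next one
            }
        }
        if (page.Count > 0)
            yield return page;                  // emit the final partial page
    }
}
```

So a list of 1,050 IDs paged by 500 yields three pages of 500, 500, and 50 items, keeping every insert statement under the 1000-row limit.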

Using this method, each repository can have the data we expect to be present loaded into it before the request is made. It’s far more efficient to load these thousands of records in one go rather than making thousands of individual queries.

In our example, we retrieve addresses and bank details by the IDs held on the Customer objects. To support this, we need to read the bank detail IDs and address IDs from the cache of Customers before loading those caches. All the subsequent logic then runs unchanged, but blindingly fast, since it’s only accessing memory rather than making calls to the database.

Summing Up

The strategy for the fastest response is probably to hit the database with one big query, but there are downsides to doing this. Specifically, we don’t want lots of logic in a SQL query, and we’d like to re-use the code we’ve already written and tested for building individual records.

Loading all the IDs from the database and iterating through the existing code one record at a time works fine for small result sets where performance isn’t an issue, but if we’re expecting thousands of records and want a response within a few minutes, it isn’t enough.

Caching the data using a few SQL queries is far more efficient and means we can re-use any logic easily. Even most of the SQL is refactored out of the existing queries.

Running things in parallel will speed things up even more. If you’re careful with your use of database connections, the largest improvement will probably come from running the queries in parallel, as these are likely to be your longest-running processes.