This post was written by Chris O’Dell, co-author of the Team Guide to Software Releasability.
Friday saw another great Scale Summit event, which Skelton Thatcher Consulting were proud to sponsor. Scale Summit is an annual unconference that brings together the operations and software development communities with a particular interest in scalable, high-performance systems. There’s no schedule created in advance and no speakers – attendees make their own conference through the Open Space Technology format. Frank discussions with your peers lead to the advanced topics and in-depth conversations you might normally find only in the hallways of other events. There were so many good sessions to choose from.
The attendees suggest the topics for discussion and this year’s topics included:
- Current usage of containers
- DNS for Service Discovery
- Scaling deployments
- Is there something better than Jenkins or TeamCity?
- CloudFormation, Terraform, aaargh!
- Scaling / Maintaining culture as teams grow
- Cloud Regret?
- Greenfield in 2017
- On-call: “you built it, you run it” vs “your free time is yours”
Keynote: Sarah Wells – Microservices and Scale
To set the tone for the day, Sarah opened with a keynote about microservices and scale. Contrary to expectations, the talk did not focus on the app architecture and how it scales to handle traffic. Instead, the focus was on the scaling required within the tooling and teams to support an increasingly disparate platform. Her team gradually automated and shortened an 18-step manual release process, allowing them to increase deployments from once every few months to multiple times a day.
If you simply scale up your existing monitoring and alerting strategies along with your microservices, you’ll soon find yourselves inundated. What was a reasonable amount of noise for a handful of services becomes a deluge with a couple of hundred. The data needs to be aggregated and approached in terms of business impact. Rather than merely notifying that a data source is offline, an alert should also state what impact this has on end users, for example “Comments cannot be displayed on posts”. This helps you to prioritise.
Include correlation IDs in requests to allow a single user journey to be traced through the system. Add HTTP health checks that can be displayed on dashboards, and include detailed information in the response body to allow further diagnosis.
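As a minimal sketch of such a health check payload – the component names, check functions and JSON shape below are invented for illustration, not taken from any system mentioned in the session – the idea is to aggregate per-component checks into one dashboard-friendly response that carries the business impact alongside the technical detail:

```python
import json

# Hypothetical component checks; names and failure details are illustrative.
def check_database():
    return {"name": "database", "ok": True, "detail": "connection pool healthy"}

def check_comments_service():
    return {"name": "comments", "ok": False,
            "detail": "upstream timeout",
            "business_impact": "Comments cannot be displayed on posts"}

def health_response(checks):
    """Aggregate individual checks into one health-check body.

    Overall status is OK only when every component check passes; the body
    keeps per-component detail (including business impact) so an on-call
    engineer can triage without digging through logs.
    """
    results = [check() for check in checks]
    overall_ok = all(r["ok"] for r in results)
    return {
        "status": "OK" if overall_ok else "DEGRADED",
        "checks": results,
    }

body = health_response([check_database, check_comments_service])
print(json.dumps(body, indent=2))
```

Serving this body from an HTTP endpoint (and scraping it onto a dashboard) is left out for brevity; the point is that the response itself states who is affected, not just what is broken.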
These extra capabilities take time and engineering effort to build and support. A build engineering team should grow in proportion to the number of development teams it serves, to ensure those teams can continue to move quickly.
DNS for Service Discovery
The topic was raised by someone looking to get a handle on the 10k+ disparate services across their entire business. Their hope is that using DNS for service discovery will allow the sprawl to be mapped. Other attendees with smaller estates are using configuration management tooling such as Apache ZooKeeper for this purpose.
The question was raised as to whether DNS is even the right tool for service discovery. DNS is built to be heavily cached, and many systems don’t handle invalidation well (as an example, the JVM needs restarting to flush its DNS cache). DNS was never designed to handle systems in a high state of flux; it works better when the estate is more static. The use of load balancers was suggested to allow for fast handling of server failures.
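The staleness problem can be sketched with a tiny TTL cache – all the names and numbers below are made up for illustration. Until the TTL expires, the cache happily returns whatever address it last saw, even if the service has since moved:

```python
import time

class TTLResolverCache:
    """A minimal sketch of DNS-style caching applied to service discovery.

    Entries are kept until their TTL expires, so a consumer can keep
    resolving an address long after the backend has moved -- the staleness
    problem raised in the session.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable, so tests can fake time
        self._entries = {}          # name -> (address, expires_at)

    def put(self, name, address):
        self._entries[name] = (address, self.clock() + self.ttl)

    def resolve(self, name):
        address, expires_at = self._entries[name]
        if self.clock() >= expires_at:
            raise KeyError(f"{name}: cached entry expired, re-resolve")
        # May be stale: the service could have moved within the TTL window.
        return address
```

Even a “fresh” entry can point at a dead server, which is why load balancers – which detect failures via health checks rather than waiting for caches to expire – were preferred for fast failover.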
One team was using AWS tags for discovery, although only as a bootstrap, with the servers sitting behind a load balancer.
Consul was brought up as a solution for service discovery, but only one attendee had used it in anger. They stated that more than half of their outages were caused by the Consul cluster losing quorum. It appears the Raft protocol that Consul uses is particularly sensitive to asymmetric network latency.
Scaling deployments
As the number of deployments increases, how do the tooling and support systems cope? How are other companies handling multiple releases a day from a release-tooling standpoint?
One attendee mentioned that their company currently releases nearly a hundred times a day and that they wish to increase this by a factor of 10. Every commit is automatically built and pushed straight to production where a battery of monitoring picks up any issues. Interestingly, they don’t (yet) mandate automated testing as part of the pipeline and instead rely on a quick “revert and redeploy” feature.
For most attendees, a high level of automated testing gave them the confidence to deploy. Alongside this, most were happy for changes to go straight to production when supported by a gradual rollout plan, e.g. canary, blue/green and phased deployments. These techniques build confidence in the changes through real customer usage whilst also allowing quick recovery if issues are discovered (pilot instances can be removed from rotation). The downside of this approach is managing multiple versions: all changes must be backwards compatible and able to work alongside each other, including database migrations. The blog post “Database migrations done right” was mentioned. Extra care must be taken with caching when data varies between versions, and static resources will need to be served from another location to avoid requests for newly added images being routed to a server which doesn’t contain them.
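A canary rollout of this kind is often driven by deterministic bucketing rather than per-request randomness, so that each user stays pinned to one version while the two coexist. A minimal sketch, with invented function names and no claim to match any attendee’s actual tooling:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Decide whether a user falls into the canary slice.

    Hashing the user ID gives a stable, roughly uniform bucket in 0..65535,
    so the same user always gets the same answer at a given percentage.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]          # 0..65535
    return bucket < (65536 * percent) // 100

def choose_version(user_id: str, canary_percent: int) -> str:
    """Route a request to the canary or stable build."""
    return "canary" if in_canary(user_id, canary_percent) else "stable"
```

Ramping the rollout is then just raising `canary_percent`; removing a bad pilot is setting it back to zero, which maps onto the “remove pilot instances from rotation” recovery described above.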
Two key phrases from this session were “Rollbacks aren’t a thing – stop it!” and “Customers shouldn’t notice a software deployment”. The first is in the context of database migrations and the second refers to using feature toggles to separate feature release from application deployment.
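That separation can be sketched with a toggle check in the rendering path – the flag name, store and rendering function below are all hypothetical, purely to show the shape of the idea:

```python
class FeatureToggles:
    """A minimal in-memory feature-toggle store (illustrative only)."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)   # unknown flags default to off

    def enable(self, name: str):
        self._flags[name] = True

def render_post(toggles: FeatureToggles) -> str:
    """Render a post, showing the new UI only when its flag is on."""
    body = "post body"
    if toggles.is_enabled("new-comments-ui"):
        body += " + new comments UI"
    return body
```

The code for the new comments UI can be deployed to production days before the flag is flipped: the deployment itself changes nothing visible, which is exactly why customers shouldn’t notice it.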
Is there something better than Jenkins or TeamCity?
Jenkins was described as an explosion of builds, configurations and snowflake servers, often requiring a lot of time to maintain. Tactics included using Jenkins Pipelines and keeping the configuration in source control alongside the applications. Avoiding too many plugins was another recommendation.
Listening to the discussions, it seemed people were using Jenkins (and TeamCity) to perform a variety of tasks: CI, automated testing, package creation, deployment and general task running. The last of these seemed the most prevalent and painful. One person mentioned that they used a Jenkins build to “allow Marketing to refresh the website cache whenever they wanted”. For this use case Rundeck was suggested, as was HTCondor.
Buildkite, Zuul, Bazel and Concourse were all mentioned for CI and pipeline management, each with varying reports of support and success. GoCD only had a couple of users and was described as “OK”.
CloudFormation, Terraform, aaargh!
This well-attended session echoed a similar one from the previous year, where people were moving away from CloudFormation to Terraform in the hope of better wrangling their infrastructure whilst also gaining cloud agnosticism. Many people had made the leap, and this year they were airing their frustrations with Terraform. There’s no perfect tool, and trade-offs must be made.
Tips for Terraform included: avoid holding as much state as possible, break your stacks up to reduce the blast radius, and make use of tfvars files to template your configuration. No one was making use of community modules; instead, they were used only as reference for building their own. The feeling was that this may change over time as the community matures, much as it did with Chef cookbooks. Charity Majors’ blog is also highly recommended reading.
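As a sketch of the tfvars tip – the variable names and values below are invented for illustration – the pattern is to declare variables once and supply per-environment values from separate `.tfvars` files, so one configuration template serves every stack:

```hcl
# variables.tf -- declared once, shared by every environment (names illustrative)
variable "environment" {
  type = string
}

variable "instance_count" {
  type    = number
  default = 2
}

# production.tfvars -- per-environment values, selected at plan time with:
#   terraform plan -var-file=production.tfvars
environment    = "production"
instance_count = 6
```

Keeping each environment’s values in its own small file also supports the blast-radius tip: a mistake in one tfvars file can only affect the stack it is applied to.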
On-call: “you built it, you run it” vs “your free time is yours”
The participants were happy with the “builders” being on call – many were themselves the “builders”. There was a sense of pride in being the people who respond to issues in their own code, meaning they could feed those issues back into development and improve the product.
The discussions revolved around how to run on-call in a way that prevents burnout or hero behaviour, where one person picks up all the alerts. Some teams run an approach where all the developers are on-call all the time and investigate issues on a “best efforts” basis. In one case, a list of mobile numbers is taped to the monitor of a customer service assistant! This leads to background levels of stress and a feeling that you can never “switch off” or go on holiday.
A Tuesday-to-Tuesday weekly rota was recommended, with the flexibility to allow swaps if people needed cover for any last-minute events. This does require a team large enough to allow sufficient recovery time between rotas, but not so big that the techniques and tools are forgotten and must be re-learnt. About 6–8 people seemed to be the consensus.
Tools such as PagerDuty and OpsGenie were recommended for managing the rotas. Unfortunately, these tools are charged per person, which encourages smaller on-call teams. One attendee called this a “tax on success”. Tactics such as sharing a phone were suggested but not recommended.
The day was a good event, one I enjoyed even more than last year’s. I came away from each discussion with many new things to research and a rethink of my views on parts of the industry. In general, people are comfortable with tackling the scale of production systems and have turned their attention to scaling the supporting systems and teams around them. I also enjoyed the vegan sausage rolls for breakfast. I look forward to next year.