You build it, You run it (Why developers should also be on call)


Published by
Chris O'Dell (@ChrisAnnODell)
PUBLISHED: October 18, 2017 8:17 am

It’s all about Feedback Loops

The Continuous Delivery Pipeline is familiar to most developers.  It’s a collaborative process built upon loops of feedback at every stage.  A new feature will be defined between a Product Manager, a Developer and a Quality Analyst.  A pair of Developers will work together on the implementation following a Test Driven Development approach, using the tests as feedback to guide progress.  The new code is compiled with the master branch and all of the unit tests are run.  The application may be deployed to a test environment and a series of Acceptance Tests will be run.  All of this feedback gives the confidence to proceed to a Production deployment, at which point Ops take over and the developers pick up a new change.

This is wrong. The Developers’ involvement stops too soon.  The change is thrown over the wall to another team and the feedback cycles stop.

Bring down the wall

The feedback from the deployment itself and from the running application is vital for building a reliable and fit-for-purpose application.  In all other parts of the process we rely on feedback loops, but there is a tendency to drop this once the code is deployed.  It is only in Production, when the code is in the hands of the users, that any code matters.  It is this feedback which is the most important of all and yet we, as Developers, are not receiving it.

There could be organisational reasons for this separation – traditionally Developers write code and Operations maintain running services.  We are all aware that DevOps is a cultural shift to bring these two areas closer together and part of this includes Developers being involved in the feedback from Production. Collating and aggregating log files is a way of getting some of this feedback to the Developers.  Log data gives an insight into errors that improves debugging and highlights weak areas in the application.

Adding a metrics collection service provides another insight into the running application.  Data points can be sent from any part of the application and can be written around business requirements, for example the number of credit card payments versus debit card payments.  They are simple to add and can be used to answer questions such as “how many requests do we receive to a particular API endpoint, and can we deprecate it?”. Metric values can be rolled up into percentiles and outliers smoothed away, allowing feedback to focus on the majority use cases, not just alerts.
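
As a rough illustration of how little code this takes, here is a minimal sketch in Python.  The InMemoryMetrics class and the metric names (payments.credit, api.orders.get) are hypothetical stand-ins invented for this example; a real setup would send the data points to whichever metrics collection service you use rather than holding them in memory.

```python
from collections import Counter
from statistics import quantiles


class InMemoryMetrics:
    """Hypothetical stand-in for a metrics client; not a real library's API."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}

    def increment(self, name):
        # Count one occurrence of a business or technical event.
        self.counters[name] += 1

    def timing(self, name, milliseconds):
        # Record how long something took, e.g. a request to an endpoint.
        self.timings.setdefault(name, []).append(milliseconds)

    def percentile(self, name, pct):
        # Roll the recorded timings up into a percentile (pct from 1 to 99).
        cut_points = quantiles(self.timings[name], n=100, method="inclusive")
        return cut_points[pct - 1]


metrics = InMemoryMetrics()

# A business-level data point: which card type was used for each payment.
for card_type in ["credit", "debit", "credit", "credit"]:
    metrics.increment(f"payments.{card_type}")

# Response times for one endpoint, including a single slow outlier.
for ms in [12, 15, 11, 14, 13, 900]:
    metrics.timing("api.orders.get", ms)

credit_count = metrics.counters["payments.credit"]
debit_count = metrics.counters["payments.debit"]
median_ms = metrics.percentile("api.orders.get", 50)

print(f"credit={credit_count} debit={debit_count} median={median_ms}ms")
```

The median stays close to the typical response time even though one request took 900ms, which is the sense in which percentiles let you look at the majority case rather than the outlier.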

All of this telemetry feedback can be received in near real-time and goes a long way to improving the mental models we’ve built up of the production system.  But it is still a step removed from fully owning the application.

Harder, Better, Faster, Stronger

Owning and being directly responsible for the successful execution of something you build gives a different perspective as to what is important, what needs focus and what does not.  Issues that seem trivial in development become far more important to you personally when it’s your phone which rings.  Random memory leaks that require app reboots are easy to ignore when they occur during the day, but not so much when they happen during the night.  You’ve always known that this affects the users – that they have a terrible experience when they see the error page – but it was always difficult to track down and service was so easily restored.  But now you feel the pain too and prioritise the operational fixes.

Developers On Call is an evolution of Metrics Driven Development

When it comes to application errors it is the Developers who generally have the greatest context for fixing things.  As applications move to the cloud and on-premise infrastructure matures, over time application errors become more frequent than infrastructure ones.  As microservices become more common and messaging becomes the glue, it will take the Developers’ system knowledge to isolate the cause and get things running again as quickly as possible.

It is this level of ownership which brings the Purpose in the driving forces of Autonomy, Mastery and Purpose.  It combines with the desire to master the knowledge needed to make the application successful and the autonomy to prioritise operational fixes.  I believe it is this ownership which enables us to have pride in the systems we build.

I have more to say…

I have a full talk on this topic.  In it I also cover some tips for being on call, for running an on call rota and for avoiding burnout.  I will be giving the full talk at NDC London in January 2018.  You can have a sneak peek at the slides here:
