This post was written by Jovile Bartkeviciute.
After the success last year, the Operability.io returned to London on Monday 19th September. The two-day conference was single line, same as last year, and the speakers were hand-picked and carefully chosen by the organisers. Day 1 started with an opening speech by Marco Abis, with Mark Burgess delivering the first talk.
Buy the Team Guide to Software Operability book – operabilitybook.com
Here are some quotes and key ideas from each of the talks:
HOW DO YOU KNOW YOUR SYSTEM IS WORKING WELL? – Mark Burgess
- How do you know your system is working well? Generally, the answer is more complicated than the usual “The system is working well when the users are happy”. First, we need to define every single word in this sentence – what does “we”, “know”, “system”, “working” and “well” mean?
- Modularity is a red herring – we are taught that modularity is a good thing, but in practice it often runs flat. (Tidiness is not godliness)
- Basics of promise theory:
- The agent can only make a promise about it’s own behaviors.
- Obliging others is an ineffective strategy.
- An agent can only assess other’s promises from its own perspective.
- Any reliance (dependency) on another agent invalidates a promise.
- Clearly defined desired state and desired outcome allows the system to converge.
- Clear definitions are essential – need to have a clear set of promises.
- Local time trumps space, because space is non-local time!
- “Systems are made of people as well as machines”.
- When designing a system, we usually base it on what a human can do, but to scale it, we need to take the human out of the picture and design for the machines.
- Repairing systems quickly is the best thing you can do to know if your system is working well.
- A promise is not a guarantee – there are no guarantees.
- The Netflix approach keeps systems human-sized so that they always have clear human intent.
To conclude:
- Know – a relationship between the system curator and its agents
- System – a collection of agents collaborating by promises
- Working – promises made and kept
- Well – what stats about promise keeping?
If we cannot formalize these, we cannot answer the question.
Slides – http://www.slideshare.net/MarkBurgess32/september16
SERVICES: BEYOND MICROSERVICES – Kaimar Karu
- Popular view of operations team – “No” people.
- If the developers can do everything themselves, why do we need operations?
- Why operations? -good definition by Charity Majors
- Scalability
- Resiliency
- Availability
- Maintainability
- Simplicity in complex systems
- Instrumentation and visibility
- Graceful degradation
- Why operations? -good definition by Charity Majors
- Operations is still relevant – NoOps is not the best idea. While your system might be working fine without operations, when you will want to scale – you will most likely fail.
- “Blameless is hard to achieve, but we can start by being more blame-aware.”
- When you try to implement something new, you need to make sure that the changes match the vision of the entire organisation.
- Scientific method for experimenting with operations:
“WHY WOULD ANYONE DO OUT-OF-HOURS SUPPORT FOR FREE?” – Sarah Wells
Sarah Wells talked about why DevOps, what they did and the hard stuff.
- To do Devops properly you need to be ready to change the culture, which is the most difficult part.
- Main ideas of how they implemented DevOps:
- Automation and Infrastructure as code
- Collaboration – tech ops joined other teams.
- Created “Ops cops” – and dedicated Kanban line
- You build it – you run it – if you aren’t doing this, you aren’t doing DevOps
- Operations spend their entire career stopping developers from breaking stuff. Development teams needed to prove that they care to the operations. Trust has to be built.
- Tools and standards are better than platforms – they let you pick the things for you, platforms are too generic.
- Don’t treat every decision like it’s irreversible.
Slides – https://speakerdeck.com/sarahjwells/why-would-anyone-do-out-of-hours-support-for-free
ACHIEVING CLOUD-NATIVE OPERABILITY – Casey West
- Old and busted? – Not cloud native. New and good? – Cloud native.
- Definition:
- Continuous delivery, DevOps, and microservices descrive the why, how, and what of being cloud-native. In the most advanced expression of these concepts they are intertwined to the point of being inseparable.”
- “A microservice is an application small enough that an engineer new to the source code can reason about it in a day or less.” – Kenny Bastani
- “There is nothing more expensive than going to the wrong direction.”
- Cloud-Native Operability is:
- Microservices architecture
- Continuous delivery process
- Devops culture
- Platform automation
- Microservices are for people, not machines.
- It does not matter how beautiful your architecture is, how easy deployment is, or how great your culture is if production is tire fire.
- DevOps culture – you can’t buy this: collaboration, automation, learning, measuring, sharing.
- It is important that the people who build the system take responsibility for the operations.
Slides – https://speakerdeck.com/caseywest/operabilityio-2016-achieving-cloud-native-operability
TAMING YOUR CONTAINERS – Liz Rice
Liz Rice started her talk with a poem illustrating the power of naming things (and pets) clear and run a live demo on making a container image.
- Naming is important, it makes promises.
- Tags are good, but in your mind, read ‘latest’ as ‘default’, ‘latest’ – is not necessarily most recent.
- Using Docker tags to mess with people’s minds – https://medium.com/microscaling-systems/using-docker-tags-to-mess-with-peoples-minds-367bb2c93bd0#.u7ti5y6lq
- Labels – they are great, because they are immutable – people are more likely to be clear when using them.
THE ROAD TO GROWNOPS – Daniel Otte & Tom Shacham
Daniel Otte and Tom Shacham delivered the most creative talk of the day – they have prepared a double act retrospective on the topics of operability of the team, operability of the devops and operability of the community.
- To improve the culture of an organisation – ‘kill it with kindness’ always works!
- If, as a user, you see any bad things, errors, etc., – contribute, give criticism – this is how systems improve.
- GrownOps:
- Working together over delivering fast
- Grow up and get to know each other
- Contribute instead of criticise
- You are always in a state of partial knowledge
- Ask for information don’t state judgement
- Systems are built on trust
Slides – http://www.slideshare.net/DanielOtte3/the-road-to-grownops
LESSONS FROM DATABASE FAILURE – Colin Charles
Colin Charles talked about backups (and verification), replication (and failover), security (and enryption).
- Learning from database failures – “I think what we mostly learn from database outages is what NEVER TO DO AGAIN”
- Why replicate:
- scale out,
- (automatic) [master] failover
- geographical redundancy across multiple data centres
- online schema changes.
- In conclusion, to avoid system failure:
- Use semi-sync replication with failover solution that ensures you don’t failover too often.
- Make good backups. Test them. Save them.
- You’ll most definitely need to shard your data, use proven frameworks and get a proxy involved. Complete backups with multi-source replication when needed.
- Use mysqldump and xtrabackup together (and mydumper for parallel backup/restore; mysqlpump)
- Security is key: prevent SQL injections, encrypt your data at rest.
And the last talk of the day:
CENTRALISING THE RIGHT THINGS – Tom Booth
- Build a central team to empower and support others.
- Deploying continuously reduces risk and is better for users.
- A team should be in control of its own architecture and infrastructure.
- You need a team that enables great Ops not owns it directly.
- Give teams room to experiment, to find what works for them.
- Make the tools so easy to understand and so good that teams WANT to use them.
- Work together and not separately – human aspect cannot be underrated.
Overall, containers are becoming the norm and many of the speakers mentioned that Ops is still relevant and NoOps is not the best idea, human interaction and improving company culture is key.
>> Day 2 <<
Here is what other people thought:
Now onto Day 2!