This post was written by Jovile Bartkeviciute.
WHAT EVEN IS OPERABLE? By Andrew Clay Shafer (@littleidea)
Andrew Shafer was the first speaker of the day and opened the conference by explaining what operability is and why IT operations is, or can be, a huge competitive advantage.
- Context and purpose – need to know why you are doing what you are doing and different situations require different solutions.
- Need to know where the problem is – errors pile up and accomplishments disappear.
- Why some software succeeds is just a case of fashion and tribalism.
- ‘Continuous partial failure’ is okay. Broken gets fixed – shitty lives forever.
- Principles > Practices > Tools.
- The problem is not technical or the people, the problem is socio-technical.
- Operability is the intersection of capability and usability.
- “Back Pressure” – never use an unbounded queue.
To advance your career learn to speak in public and learn to write.
Operations is the “secret sauce”!
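The "never use an unbounded queue" point can be illustrated with a small sketch. This is not code from the talk: the `offer` helper and the queue size of 3 are assumptions chosen for illustration. The idea is that a bounded queue pushes back on the producer when consumers fall behind, instead of silently growing until memory runs out.

```python
import queue


def offer(q, item):
    """Try to enqueue without blocking; report whether the item was accepted."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # Back pressure: the caller is told "not now" and can slow down,
        # retry later, or shed load.
        return False


# Bounded queue: at most 3 pending items (the size is an arbitrary example).
work = queue.Queue(maxsize=3)

accepted = [offer(work, n) for n in range(5)]
print(accepted)  # [True, True, True, False, False]

# Once a consumer drains an item, the producer is let in again.
work.get_nowait()
print(offer(work, 99))  # True
```

Whether a rejected producer should block, retry or drop the item depends on the system; the sketch only shows that the limit makes the overload visible.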
INOPERABILITY.IO by Colin Humphreys (@hatofmonkeys)
Colin Humphreys explained how bad it can get when you do not take operability into consideration, sharing real-life experiences along the way.
- Launch issues are mainly caused by lack of budget and lack of team communication.
- “Surprisingly”, the budget for operations after the launch has to be more than $0.
- Heroism != Success
- PHP is more difficult than it looks
It was quite interesting to hear about the funny side of PHP: http://stackoverflow.com/questions/2382490/how-does-true-false-work-in-php
The talk was absolutely hilarious, but cannot be summarised – you HAD to be there.
The only “Silver Bullet” is a conversation.
PROCESSES: HOW SMALL TEAMS ACCOMPLISH BIG THINGS by Anthony Eden (@aeden)
Anthony Eden talked us through how a small team can accomplish big things. Communication and clearly defined and documented processes are essential.
- Triage first. Need to differentiate what is essential and the order of actions to be taken.
- Instead of trying to be 100% reliable, try to recover as quickly as possible. The main ingredients are an on-call rotation and a clearly defined process for handling a crisis.
- One of the main security problems is social engineering attacks. Very simple measures like password rotation and two-step authorisation can help resolve it!
- Customer service should be taken care of by the entire organisation – it is a great way to know where customer pain points are.
- Policies need to be born from experience, written down and shown to everyone.
Now, how to get started with it?
- Create a wiki page
- Evaluate history
- Write down recurring events
- Write down the steps
- Execute – have a game day!
Automate where possible
DISTRIBUTED: OF SYSTEMS AND TEAMS by Bridget Kromhout (@bridgetkromhout)
According to Bridget Kromhout, distributed systems are complex and distributed teams are even more so. She shared her experiences on how to make sure your distributed team is an advantage.
- People are more important than tools, but tools are also essential.
- Make sure to clearly state your expectations.
- Asking for help is important in building a team – it gives a gift of trust.
- Distribute decision making.
- OVERCOMMUNICATE – nothing is worse than a misunderstanding due to a lack of communication. Tell your team hi, say what you are doing, use “lol” or emoticons to express your feelings/mood.
- Co-working spaces do not necessarily work.
- Creating reality by writing words (i.e. coding) is as close to wizardry as it gets.
- Distributed teams are a competitive advantage:
  - Timezone-wise
  - Different backgrounds
  - No matter where you live, there is more talent somewhere else
If you use chat, make sure the conversation is searchable – six months from now you will need an answer even more than you do today.
IN GOD WE TRUST, EVERYONE ELSE CAN BRING DATA by Colin Hemmings (@thegonzohunter)
Colin Hemmings emphasized the importance of having usable data.
- DASHBOARDS. Seriously – dashboards for everything: globally, for teams, for the NOC, etc. It is an easy way to leverage data.
- You need to focus on the right things:
  - No point in working on the wrong things
  - Deliver value
- Stability beats features.
- Need to remove opinions and use data – you cannot know what a customer wants, you can make a good guess, but that is not enough.
Startups – “Yes, the lunatics are running the asylum”.
PRAGMATIC ALERT CORRELATION IN MODERN PRODUCTION ENVIRONMENTS by Elik Eizenberg (@elik_eizenberg)
Elik Eizenberg gave an overview of the most common alert notification practices and showed us a pragmatic way to correlate incidents and alerts.
- Outages evolve – after the first incident is recorded, many more incidents follow as a consequence.
- Many alerts fall through the cracks. Detection happens very late.
- The days with the most alerts do not necessarily have the most incidents. Customers are not necessarily affected by them.
- Humans should interact with actual issues and leave the alerts for an automated system.
- Service Hierarchy does not work – you always have some alerts, and since one alert turns the whole section red, you tend to ignore it until things get really bad, and by then you are too late.
- Stateful Alert Correlation is a solution to this.
- How do you know that alerts belong to the same incident?
  - Topology – has the same tag
  - Time – has the timestamp in the same 15 min window.
  - Modeling – you can use statistical analysis to differentiate.
If an alert looks like it could belong to a few separate incidents, just assign it to one as per predefined rules – it is okay to make a mistake in this instance.
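The topology and time rules can be sketched in a few lines. This is an illustration, not the speaker's implementation: the `Alert` and `Incident` types and the `correlate` function are hypothetical, and "same 15 min window" is interpreted here as "within 15 minutes of the previous alert in the same incident", which is only one possible reading. The statistical modelling rule is left out.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 15 * 60  # the 15-minute window from the talk


@dataclass
class Alert:
    name: str
    tag: str          # topology tag, e.g. a service or cluster name
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    tag: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self):
        return self.alerts[-1].timestamp


def correlate(alerts):
    """Group alerts into incidents by shared tag and time proximity."""
    incidents = []
    open_by_tag = {}  # the "state": most recent incident per tag
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = open_by_tag.get(alert.tag)
        # Predefined rule: an alert joins exactly one incident, namely the
        # latest one with its tag, unless that incident has gone quiet.
        if incident is None or alert.timestamp - incident.last_seen > WINDOW_SECONDS:
            incident = Incident(tag=alert.tag)
            incidents.append(incident)
            open_by_tag[alert.tag] = incident
        incident.alerts.append(alert)
    return incidents


alerts = [
    Alert("db latency", "db", 0),
    Alert("db connections", "db", 300),
    Alert("web 5xx", "web", 400),
    Alert("db disk", "db", 5000),
]
incidents = correlate(alerts)
print([(i.tag, len(i.alerts)) for i in incidents])
# [('db', 2), ('web', 1), ('db', 1)]
```

Four alerts collapse into three incidents, so a human looks at three issues rather than four notifications. A wrong assignment costs little here, which matches the point that it is okay to make a mistake in this instance.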
INFRASTRUCTURE AS CODE: AUTOMATING FOR AGILITY by Kief Morris (@kief)
Kief took us through the principles behind infrastructure automation, including the organizational and team angles.
In a drive towards automation you need to be explicit, because expectations might differ from what you are able to achieve.
- Avoid automating for wrong reasons
- Operations – the expectation is that automation is a single button push
- It’s not about the tools
- Chinese & Japanese manufacturers have not been focusing on spending less money on staff:
  - They focused on time
  - Market changes – can adapt easily
  - Did not have big inventory
- Automation won’t let you hide from authority
- You need technical knowledge to get good results
- Software is eating the world
- Iron Age of IT – … presents a problem
- Mistakes are expensive
- Cloud Era – change represents learning
- Automation – to make things fast, cheap, safe
- Discipline and building good habits
- Anti-fragile systems. We can make the systems strong. The secret ingredient – the people.
- Self-service vs. Empowerment
- Don’t just assume what your users need
- Make platforms of simple pieces.
- Different teams are trying to provide everything, which is not a good thing.
- Let the customers customise the platform.
Do one thing and do it very well.
Automating for the wrong reasons will make you sad.
TIME AND RELATIVE DIMENSIONS IN SYSTEMS by Anne Currie (
After the previous talk took us to the Iron Age, it was only fitting that Anne Currie looked in the other direction – the future.
- We no longer have the same constraints with hardware as in the past, but our assumptions about software have not changed to match
- We’re moving away from manual optimization specifications
- We can now prioritise speed over efficiency
- Containers – invented decades ago to improve the efficiency of data centres.
- Strengths of containers:
  - Containers make Devs happy and make Ops happy – they give additional resources and reduce the bills.
The future is ‘data centre as operating system’, containers, schedulers
Takeaways from the day
- Everybody loves Docker, but it will not solve all your problems.
- Communication is key.
- Slack is a thing now.
- Mickey Mouse the Sorcerer is still popular.
Really looking forward to Day 2!
2 thoughts on “Operability.io Conference 2015 – Day 1”
I said this in person, but I’ll say it again here: thank you so much for posting this summary! I find these so useful after conferences are over – I remember like 3-5 key moments and the rest is summarised here.
Thank you, it is great to hear that! 🙂