Operability.io Conference 2015 – Day 1

This post was written by Jovile Bartkeviciute.

WHAT EVEN IS OPERABLE? By Andrew Clay Shafer (@littleidea)

Andrew Shafer was the first speaker of the day and opened the conference by explaining what operability is and why IT operations is, or can be, a huge competitive advantage.

Key concepts:

Context and purpose – need to know why you are doing what you are doing and different situations require different solutions.

Need to know where the problem is – errors pile up and accomplishments disappear.

Why some software succeeds is just a case of fashion and tribalism.

‘Continuous partial failure’ is okay. Broken gets fixed – shitty lives forever.

Principles > Practices > Tools.

The problem is not technical or the people, the problem is socio-technical.

Operability is the intersection of capability and usability.

“Back Pressure” – never use an unbounded queue.
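The back pressure point is easy to show in code. Below is a minimal sketch (mine, not from the talk) using Python's standard `queue` module: the queue has a fixed capacity, and when it is full the producer is told "no" instead of the backlog growing forever. The `submit` helper and the capacity of 3 are assumptions made up for illustration.

```python
import queue


def submit(work_queue, item):
    """Try to enqueue an item; return False when the queue is full.

    Hypothetical helper: refusing work is the back pressure signal
    that tells the caller to slow down, retry later, or shed load.
    """
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False


# A bounded queue: at most 3 items may wait at any time (arbitrary capacity).
work_queue = queue.Queue(maxsize=3)

# Nothing consumes the queue here, so only the first 3 items are accepted.
accepted = [submit(work_queue, n) for n in range(5)]
print(accepted)  # [True, True, True, False, False]
```

What the caller does with a `False` (retry, drop, return an error upstream) is a design decision for each system; the important part is that the decision is made explicitly rather than hidden in an ever-growing queue.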

Quotations:

To advance your career learn to speak in public and learn to write.

Operations is the “secret sauce”!

INOPERABILITY.IO by Colin Humphreys (@hatofmonkeys)

Colin Humphreys explained how bad it can get when you do not take operability into consideration, sharing real-life experiences along the way.

Key Points:

Launch issues are mainly caused by lack of budget and lack of team communication.
“Surprisingly”, the budget for operations after the launch has to be more than $0.
Heroism != Success
PHP is more difficult than it looks

It was quite interesting to hear about the funny side of PHP: http://stackoverflow.com/questions/2382490/how-does-true-false-work-in-php

The talk was absolutely hilarious, but cannot be summarised – you HAD to be there.

Quotation:

The only “Silver Bullet” is a conversation.

PROCESSES: HOW SMALL TEAMS ACCOMPLISH BIG THINGS by Anthony Eden (@aeden)

Slides

Anthony Eden talked us through how a small team can accomplish big things. Communication and clearly defined and documented processes are essential.

Key Points:

Triage first. You need to identify what is essential and decide the order of actions to be taken.
Instead of aiming to be 100% reliable, try to recover as quickly as possible. The main ingredients are an on-call rotation and a clearly defined process for handling a crisis.
One of the main security problems is social engineering attacks. Very simple things like password rotation and two-step authentication can resolve it!
Customer service should be taken care of by the entire organisation – it is a great way to know where customer pain points are.
Policies need to be born from experience, written down and shown to everyone.

Now, how to get started with it?

Create a wiki page
Evaluate history
Write down recurring events
Write down the steps
Execute – have a game day!

Quotation:

Automate where possible

DISTRIBUTED: OF SYSTEMS AND TEAMS by Bridget Kromhout (@bridgetkromhout)

Slides

According to Bridget Kromhout, distributed systems are complex and distributed teams are even more so. She shared her experiences on how to make sure your distributed team is an advantage.

Key points:

People are more important than tools, but tools are also essential.
Make sure to clearly state your expectations.
Asking for help is important in building a team – it gives a gift of trust.
Distribute decision making.
OVERCOMMUNICATE – nothing is worse than a misunderstanding caused by a lack of communication. Tell your team hi, say what you are doing, use “lol” or emoticons to express your feelings/mood.
Co-working spaces do not necessarily work.
Creating reality by writing words (i.e. coding) is as close to wizardry as it gets.
Distributed teams are a competitive advantage:
- Timezone wise
- Different backgrounds
- No matter where you live, there is more talent somewhere else

Quotation:

If using a chat, make sure the conversation is searchable – six months from now you will need an answer even more than you do today.

IN GOD WE TRUST, EVERYONE ELSE CAN BRING DATA by Colin Hemmings (@thegonzohunter)

Colin Hemmings emphasized the importance of having usable data.

Key points:

DASHBOARDS. Seriously – dashboards for everything: globally, for teams, for the NOC, etc. It is an easy way to leverage data.
Statistically,
You need to focus on the right things:
No point in working on the wrong things
- Deliver value
- Stability beats features.
Need to remove opinions and use data – you cannot know what a customer wants, you can make a good guess, but that is not enough.

Quotation:

Startups – “Yes, the lunatics are running the asylum”.

PRAGMATIC ALERT CORRELATION IN MODERN PRODUCTION ENVIRONMENTS by Elik Eizenberg (@elik_eizenberg)

Elik Eizenberg gave an overview of the most common alert notification practices and showed us a pragmatic way to correlate incidents and alerts.

Key points:

Outages evolve – after the first incident is recorded, many more incidents follow as a consequence.
Many alerts fall through the cracks. Detection happens very late.
The days with the most alerts do not necessarily have the most incidents. Customers are not necessarily affected by them.
Humans should interact with actual issues and leave the alerts for an automated system.
Service Hierarchy does not work – you always have some alerts, and since one alert turns the whole section red, you tend to ignore it until things get really bad, and by then it is too late.
Stateful Alert Correlation is a solution to this.
How do you know that alerts belong to the same incident?
- Topology – has the same tag
- Time – has the timestamp in the same 15 min window.
- Modeling – you can use statistical analysis to differentiate.
- Training
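To make the first two criteria concrete, here is a rough sketch of my own (not Elik's implementation) that groups alerts into incidents when they share a tag and arrive within 15 minutes of the previous alert in that incident. The `Alert` and `Incident` classes, the `correlate` function and the sample data are all made up for illustration, and reading "same 15 min window" as "within 15 minutes of the last alert" is my assumption. Modeling and training are not covered.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumption: "same window" means within 15 minutes of the incident's last alert.
WINDOW = timedelta(minutes=15)


@dataclass
class Alert:
    tag: str            # topology: which service or component raised it
    timestamp: datetime
    message: str


@dataclass
class Incident:
    tag: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self):
        return self.alerts[-1].timestamp


def correlate(alerts):
    """Group alerts into incidents by shared tag and time proximity."""
    incidents = []
    open_by_tag = {}  # most recent incident per tag (the "state")
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = open_by_tag.get(alert.tag)
        # Start a new incident if the tag is new or the last alert is too old.
        if incident is None or alert.timestamp - incident.last_seen > WINDOW:
            incident = Incident(tag=alert.tag)
            incidents.append(incident)
            open_by_tag[alert.tag] = incident
        incident.alerts.append(alert)
    return incidents


start = datetime(2015, 9, 29, 10, 0)
alerts = [
    Alert("db", start, "replication lag"),
    Alert("db", start + timedelta(minutes=10), "slow queries"),
    Alert("web", start + timedelta(minutes=12), "5xx rate"),
    Alert("db", start + timedelta(minutes=40), "disk full"),
]
incidents = correlate(alerts)
summary = [(i.tag, len(i.alerts)) for i in incidents]
print(summary)  # [('db', 2), ('web', 1), ('db', 1)]
```

Four alerts collapse into three incidents here: the two early "db" alerts belong together, while the late "db" alert opens a new incident because the window has expired.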

Advice:

If an alert looks like it could belong to a few separate incidents, just assign it to one as per predefined rules – it is okay to make a mistake in this instance.

INFRASTRUCTURE AS CODE: AUTOMATING FOR AGILITY by Kief Morris (@kief)

Kief took us through the principles behind infrastructure automation, including the organizational and team angles.

In a drive towards automation you need to be explicit, because expectations may differ from what you are actually able to achieve.

Key points:

Avoid automating for wrong reasons
Operations – automation is expected to be a one-button push
It’s not about the tools
Chinese & Japanese manufacturers have not been focusing on spending less money on staff:
- They focused on time
- Customizing
- Market changes – can adapt easily
- Did not have big inventory
Automation won’t let you hide from authority
You need technical knowledge to get good results
Software is eating the world
Iron Age of IT – … presents a problem
Mistakes are expensive
Cloud Era – change represents learning
Automation – to make things fast, cheap, safe
Discipline and building good habits
Anti-fragile systems. We can make the systems strong. The secret ingredient – the people.
Self-service vs. Empowerment
Don’t just assume what your users need
Make platforms of simple pieces.
Different teams each trying to provide everything is not a good thing.
Letting the customers customise the platform.

Advice:

Do one thing and do it very well.

Automating for the wrong reasons will make you sad.

TIME AND RELATIVE DIMENSIONS IN SYSTEMS by Anne Currie (@anne_e_currie)

Slides

After the previous talk took us to the Iron Age, it was only fitting that Anne Currie looked in a different direction – the future.

Key points:

We no longer have the same constraints with hardware as in the past, but our assumptions about software have not changed to match
We’re moving away from manual optimization specifications
We can now prioritise speed over efficiency
Containers – invented decades ago to improve the efficiency of data centres.
Strengths of containers:
- Speed
- Lightweight
- Encapsulation
Containers make Devs happy and make Ops happy – they give additional resources and reduce the bills.

Quotation:

The future is ‘data centre as operating system’, containers, schedulers

Takeaways from the day

Everybody loves Docker, but it will not solve all your problems.
Communication is key.
Slack is a thing now.
Mickey Mouse the Sorcerer is still popular.

Really looking forward to Day 2!


2 thoughts on “Operability.io Conference 2015 – Day 1”

Jonathan Clarke says:

2015-09-29 at 14:32

I said this in person, but I’ll say it again here: thank you so much for posting this summary! I find these so useful after conferences are over – I remember like 3-5 key moments and the rest is summarised here.


1. Jovile B says:
  
  2015-10-06 at 09:18
  
  Thank you, it is great to hear that! 🙂
  