This post was written by Jovile Bartkeviciute.
BUILDING WORLD CLASS OPS TEAMS by Charity Majors (@mipsytipsy)
Charity Majors started the second day with her energetic and passionate talk about bootstrapping the ops team, highlighting the main problems and solutions.
- First, you need to work out whether you actually need an ops team. It does not necessarily have to be a new person in the company. And if you get an ops person, it does not necessarily mean that developers are off the hook.
- Operations engineering is a very specialized skill set. An ops engineer is not someone who does all the stuff that you do not want to do.
- You need an ops team if you have hard operational problems: extreme reliability, extreme scalability, extreme security requirements, solving operational problems for the whole Internet. If you need an ops team, congratulations! You are doing something right.
- Regarding the hiring process, you want a unicorn – everybody does! But you cannot have one, so you need to settle and sacrifice some expectations. Team building is like Jenga – you need to find the pieces that fit. People have potential, but hiring is always a risk.
- “Devops” means that all share responsibility – developers and operations.
- A big company is more of a safety net; in a startup, it all ends with you. The two need people with different skill sets: one person will thrive in a big corporation, another will not.
- When hiring, do not hire for the lack of weaknesses.
- Screen for learned helplessness. Startups cannot afford this.
- How to spot a bad ops engineer? – Tweaking indefinitely, adding complexity, refusing to admit they do not know things, being disconnected from the customer experience.
- How to lose good ops people? – Give all responsibility and no authority, stick them with all the tedious work, blameful attitude, no interesting operational problems.
Treat your people well.
Having people who want to work for you is a superpower.
HUMAN OPS – SCALING TEAMS AND HANDLING INCIDENTS by David Mytton (@davidmytton)
David Mytton talked about a procedure for handling incidents, with real-life examples.
- Cost of uptime is a capital expenditure.
- 100% uptime is impossible. You have to expect and prepare for the problems.
- When dealing with incidents, there are three stages: Prepare, Respond, Postmortem.
- Engineers should be able to solve most of the issues. If they cannot, the devops team is the backup.
- Having a good checklist is essential. It cuts through complexity, stress and fatigue (reducing human error) and removes the “ego” (the tendency to “wing it”; winging it is not good enough).
- The key is to have backup options – when one system is down, you need a clear protocol, and everybody needs to know which other system to use, for example, to communicate.
- Key credentials: making sure everybody has access and their own login.
- Understand disaster recovery.
- War Games – simulate an incident and have everybody on the team work to solve it.
- After getting an alert:
- Load the response checklist
- Log into ops War Room (no non-essential info in the chat room)
- Log into JIRA
- Begin investigation
- Understand why the issue happened and make sure it does not happen again.
Log absolutely everything.
Have a backup option for your backup option.
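The "after getting an alert" steps and the advice to log everything can be sketched in a few lines of Python. This is only an illustration, not the procedure shown in the talk: the `IncidentLog` class and `run_checklist` function are hypothetical names, and a real incident log would be written somewhere durable rather than kept in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps taken from the "after getting an alert" list above.
RESPONSE_CHECKLIST = [
    "Load the response checklist",
    "Log into ops War Room",
    "Log into JIRA",
    "Begin investigation",
]


@dataclass
class IncidentLog:
    """Hypothetical in-memory incident timeline (assumption: not persisted)."""
    entries: list = field(default_factory=list)

    def record(self, message: str) -> None:
        # Timestamp every entry in UTC so the postmortem can rebuild the timeline.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append(f"{stamp} {message}")


def run_checklist(log: IncidentLog, steps=RESPONSE_CHECKLIST) -> None:
    """Walk the checklist in order, logging each step as it is completed."""
    for number, step in enumerate(steps, start=1):
        log.record(f"step {number}/{len(steps)} done: {step}")


if __name__ == "__main__":
    log = IncidentLog()
    log.record("alert received")
    run_checklist(log)
    print("\n".join(log.entries))
```

Keeping the checklist as data means a War Games exercise can run against the same list the real response uses.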
EMPHATICALLY EMPATHETIC by Emma Jane Hogbin Westby (@emmajanehw)
The talk by Emma Jane Hogbin Westby was aimed at making the best out of team collaboration by using empathy to connect with each other.
- Empathy is a skill that can be practiced.
- Empathy and Sympathy are not the same thing.
- Three levels of empathy:
- Level 1: collecting stories, listening and referring back
- Level 2: The biggest obstacle is thinking that your way is the only way – there is more than one way to do a thing! Uncover motivators – understand why a person behaves one way or the other. Creative, outcome-based interactions – help me help you.
- Level 3: Foster creative problem solving. Seek to understand: complain about yourself from another person’s perspective, live the day through another person’s eyes.
- All of us have empathy, but some of us may not have enough courage to discover it.
- To get people to empathize with each other, a good idea is to make them work against a common enemy, which, if you cannot find a real issue, can be you.
Convert from resources to human beings.
EFFECTIVE INCIDENT COMMUNICATION by Scott Klein (@scootklein)
Scott Klein started his talk with a short breathing exercise and then moved on to how to turn a crisis into a positive outcome.
- During a critical event, your empathy shuts down, survival skills kick in.
- During a crisis, have a status page and keep it updated.
- What makes a good status page?
- Time stamps
- Mobile friendly
- Handling accountability
- Communicate early, even if you have no idea what the problem is.
- Do not give ETAs.
- Trolls and tickets:
- Do not mention names
- Be personal (Not “we”, but “I” am sorry)
- Give details, they give confidence.
- Close the loop.
- People share bad experiences more than good experiences.
- “Service recovery paradox” – customers are likely to be happier after a nicely handled incident than before it!
Status page should not be that thing the intern built last summer.
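As a rough illustration of the time stamp point, a status page can be modeled as a list of updates that each carry their own UTC time and are shown newest first. The `StatusUpdate` class, the `render` function and the sample messages are all assumptions made for this sketch, not something presented in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatusUpdate:
    """One entry on a hypothetical status page."""
    posted_at: datetime   # always timezone-aware UTC
    state: str            # e.g. "investigating", "resolved"
    message: str


def render(updates):
    """Return plain-text lines, newest update first, each with a time stamp."""
    ordered = sorted(updates, key=lambda u: u.posted_at, reverse=True)
    return [
        f"[{u.posted_at.strftime('%Y-%m-%d %H:%M UTC')}] {u.state}: {u.message}"
        for u in ordered
    ]


if __name__ == "__main__":
    # Illustrative data only: communicate early, even before the cause is known.
    updates = [
        StatusUpdate(datetime(2020, 1, 1, 10, 0, tzinfo=timezone.utc),
                     "investigating",
                     "I am sorry, some requests are failing. I am looking into it."),
        StatusUpdate(datetime(2020, 1, 1, 10, 45, tzinfo=timezone.utc),
                     "resolved",
                     "A bad config change was rolled back. Requests are healthy again."),
    ]
    print("\n".join(render(updates)))
```

Plain text lines like these are also easy to read on a phone, which fits the mobile-friendly point.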
LEADING A TEAM WITH VALUES by Rich Archbold (@rich_archbold)
Next was Rich Archbold, who explained how to create a high performing team.
- Core values.
- Core values have to fit with the culture of the team, be personal, specific, relevant every day and not a dogma.
- Value 1: Security, Availability, Performance, Scalability, Cost – prioritize for maximum impact.
- Value 2: Faster, Safer, Easier Shipping.
- Value 3: Zero Touch Ops
- Value 5: Run Less Software
Guinness Ambulance is a real thing.
UN-BROKEN LOGGING – THE FOUNDATION OF OPERABILITY by Matthew Skelton (@matthewpskelton)
Our co-founder Matthew Skelton talked about the importance of logging, which should not be sacrificed for other features.
- Logging is often forgotten.
- Log aggregation and search tools should be first class tools, right up there with source code management.
- Logging is another system component and should be tested as such.
- Log everything, not just errors, to explore how software actually works in production when things are going well.
- You need to allocate the right resources to logging, including budget.
- How to make logging awesome?
- Continuous event IDs
- Transaction tracing
- Log aggregation & search tools
- Design for logging
- Decoupled severity
Logging makes things work.
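Three of the ideas listed above (event IDs, transaction tracing and decoupled severity) can be sketched with Python's standard `logging` module. This is a minimal sketch under assumptions, not the approach shown in the talk: the `Event` enum, its ID numbers, the `SEVERITY` table and the `log_event` helper are all hypothetical.

```python
import logging
import uuid
from enum import Enum


class Event(Enum):
    """Hypothetical event IDs: one stable number per kind of event."""
    ORDER_RECEIVED = 1001
    PAYMENT_FAILED = 1002


# Decoupled severity: the level lives in a table that can change
# without touching the code that emits the event.
SEVERITY = {
    Event.ORDER_RECEIVED: logging.INFO,
    Event.PAYMENT_FAILED: logging.ERROR,
}

logger = logging.getLogger("shop")


def log_event(event: Event, transaction_id: str, **fields) -> str:
    """Log one event with its ID and the transaction ID used for tracing."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    line = f"event_id={event.value} event={event.name} txn={transaction_id} {details}".rstrip()
    logger.log(SEVERITY.get(event, logging.INFO), line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    txn = uuid.uuid4().hex  # the same ID is passed along for the whole transaction
    log_event(Event.ORDER_RECEIVED, txn, order=42)
    log_event(Event.PAYMENT_FAILED, txn, order=42, reason="card_declined")
```

Because every line carries `event_id` and `txn`, a log aggregation tool can search for one kind of event or follow a single transaction across components.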
TAKING THE OPERATING SYSTEM OUT OF OPERATIONS by Gareth Rushgrove (@garethr)
Gareth Rushgrove gave a fascinating talk about hypervisors and unikernels.
- Systems that have not been designed to be complex are becoming more complex.
- General Purpose OS is a dead concept. Reduce the complexity and only use what you need.
- We are running more software than we can understand.
- Applications start “fighting” with each other.
- Everyone wants to be the VMware of containers.
- Everyone who is not running a hypervisor is an application developer.
- Unikernel advantages: hypervisor/hardware isolation, smaller attack surface area, running less code, enforced immutability, no default remote access.
- Unikernels allow us to be specific about everything.
- Infrastructure is code.
- Collaborate on hard problems, rather than marveling about your Docker PaaS.
I’m limited here by the slide, rather than the insanity of our infrastructure!
SECURITY FOR NON-UNICORNS by Ben Hughes (@benjammingh)
Ben Hughes gave an inspiring talk about the biggest security vulnerabilities that are often overlooked.
- Small bugs turn into giant acorns that end up on the front page.
- Who can you trust? – Nobody.
- Bug Bounties – a good way to have your system examined for security issues.
- Curl & Bash – just don’t!
- Do not run things as root.
- Stop trusting files on the Internet.
- Check your public info – you can find many interesting things on Google Search.
- Docker needs to be secured: don’t use --privileged, use --cap-drop and --cap-drop <thing> to get the minimum capabilities, use Docker Notary, use GRSecurity, use SELinux.
“It’s got legit in the name. It must be secure, right?”
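One way to read the Docker and "do not run things as root" advice is to drop every capability, add back only what the container needs, and run as a non-root user. The sketch below only builds the `docker run` argument list and does not execute it. The `docker_run_args` helper, the image name and the user ID are assumptions for illustration; `--cap-drop`, `--cap-add` and `--user` are standard `docker run` flags.

```python
def docker_run_args(image, add_caps=(), user="1000:1000"):
    """Build a `docker run` command line with minimal capabilities.

    Hypothetical helper: never adds --privileged, drops all capabilities,
    and adds back only the ones named in `add_caps`.
    """
    args = [
        "docker", "run", "--rm",
        "--cap-drop=ALL",      # start from zero capabilities
        "--user", user,        # do not run the process as root
    ]
    for cap in add_caps:
        args.append(f"--cap-add={cap}")  # add back only what is needed
    args.append(image)
    return args


if __name__ == "__main__":
    # Example: a web server that only needs to bind to a low port.
    command = docker_run_args("example/app:1.0", add_caps=["NET_BIND_SERVICE"])
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Passing the command as a list rather than a shell string also avoids shell quoting problems if it is later handed to `subprocess.run`.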
Takeaways from the day
- Communication is still key.
- Checklists and logging make incident handling run smoothly.
- AIX OS is old and you should probably stop using it.
- MongoDB is hard to use.
- Unicorns are a theme!
Fun fact of the day – you can order sleeping bees on the Internet and deliver them to a “friend”.