This post was written by Jovile Bartkeviciute.
BUILDING WORLD CLASS OPS TEAMS by Charity Majors (@mipsytipsy)
Charity Majors started the second day with her energetic and passionate talk about bootstrapping the ops team, highlighting the main problems and solutions.
Key concepts:
- First, work out whether you actually need an ops team. It does not necessarily have to be a new person in the company, and getting an ops person does not mean that developers are off the hook.
- Operations engineering is a very specialized skill set. An ops engineer is not someone who does all the stuff that you do not want to do.
- You need an ops team if you have hard operational problems: extreme reliability, extreme scalability, extreme security requirements, solving operational problems for the whole Internet. If you need an ops team, congratulations! You are doing something right.
- Regarding the hiring process, you want a unicorn, and so does everybody else! But you cannot have one, so you need to settle and sacrifice some expectations. Team building is like Jenga: you need to find the pieces that fit. People have potential, but hiring is always a risk.
- “Devops” means that all share responsibility – developers and operations.
- A big company is more of a safety net; in a startup, it all ends with you. The two need people with different skill sets: one person will thrive in a big corporation, another will not.
- When hiring, do not hire for the lack of weaknesses.
- Screen for learned helplessness. Startups cannot afford this.
- How to spot a bad ops engineer? Tweaking indefinitely, adding complexity, refusing to admit to not knowing things, being disconnected from the customer experience.
- How to lose good ops people? – Give all responsibility and no authority, stick them with all the tedious work, blameful attitude, no interesting operational problems.
Quotations:
Treat your people well.
Having people who want to work for you is a superpower.
HUMAN OPS – SCALING TEAMS AND HANDLING INCIDENTS by David Mytton (@davidmytton)
David Mytton talked about a solid procedure for handling incidents, with real-life examples.
Key concepts:
- Cost of uptime is a capital expenditure.
- 100% uptime is impossible. You have to expect and prepare for the problems.
- When dealing with incidents, there are three stages: Prepare, Respond, Postmortem.
- Engineers should be able to solve most of the issues. If they cannot, the devops team is the backup.
- Having a good checklist is essential. It reduces complexity, stress and fatigue (which lessens human error) and removes the “Ego”, the tendency to “wing it”. Winging it is not good enough.
- The key is to have backup options: when one system is down, you need a clear protocol, and everybody must know which other system to use to, for example, communicate.
- Key credentials: making sure everybody has access and their own login.
- Understand disaster recovery.
- War Games: simulate an incident and have everybody in the team work on solving it.
- After getting an alert:
  - Load the response checklist
  - Log into ops War Room (no non-essential info in the chat room)
  - Log into JIRA
  - Begin investigation
- Understand why the issue happened and make sure it does not happen again.
Quotations:
Log absolutely everything.
Have a backup option for your backup option.
EMPHATICALLY EMPATHETIC by Emma Jane Hogbin Westby (@emmajanehw)
The talk by Emma Jane Hogbin Westby was aimed at making the best out of team collaboration by using empathy to connect with each other.
Key concepts:
- Empathy is a skill that can be practiced.
- Empathy and Sympathy are not the same thing.
- Three levels of empathy:
  - Level 1: Collecting stories, listening and referring back.
  - Level 2: The biggest obstacle is thinking that your way is the only way. There are several ways to do one thing! Uncover motivators: understand why a person behaves one way or another. Creative, outcome-based interactions: help me help you.
  - Level 3: Foster creative problem solving. Seek to understand: complain about yourself from another person’s perspective, live the day through another person’s eyes.
- All of us have empathy, but some of us may not have enough courage to discover it.
- To get people to empathize with each other, a good idea is to make them work against a common enemy, which, if you cannot find a real issue, can be you.
Quotations:
Convert from resources to human beings.
EFFECTIVE INCIDENT COMMUNICATION by Scott Klein (@scootklein)
Scott Klein started his talk with a short breathing exercise and then moved on to how to turn a crisis into a positive outcome.
Key concepts:
- During a critical event, your empathy shuts down, survival skills kick in.
- During a crisis, have a status page and keep it updated.
- What makes a good status page?
  - Time stamps
  - Mobile friendly
- Handling accountability
- Communicate early, even if you have no idea what the problem is.
- Do not give ETAs.
- Trolls and tickets:
  - Apologize
  - Do not mention names
  - Be personal (Not “we”, but “I” am sorry)
  - Give details, they give confidence.
  - Close the loop.
- People share bad experiences more than good experiences.
- “Service recovery paradox” – customers are likely to be happier after a nicely handled incident than before it!
Quotations:
Status page should not be that thing the intern built last summer.
LEADING A TEAM WITH VALUES by Rich Archbold (@rich_archbold)
Next was Rich Archbold, who explained how to create a high performing team.
Key concepts:
- Core values.
- Core values have to fit with the culture of the team, be personal, specific, relevant every day and not a dogma.
- Value 1: Security, Availability, Performance, Scalability, Cost – prioritize for maximum impact.
- Value 2: Faster, Safer, Easier Shipping.
- Value 3: Zero Touch Ops
- Value 5: Run Less Software
Quotations:
Guinness Ambulance is a real thing.
UN-BROKEN LOGGING – THE FOUNDATION OF OPERABILITY by Matthew Skelton (@matthewpskelton)
Our co-founder Matthew Skelton talked about the importance of logging, which should not be sacrificed for other features.
Key concepts:
- Logging is often forgotten.
- Log aggregation and search tools should be first-class tools, right up there with source code management.
- Logging is another system component and should be tested as such.
- Log everything, not just errors, to explore how software actually works in production when things are going well.
- You need to allocate the right resources to logging, including budget.
- How to make logging awesome?
  - Continuous event IDs
  - Transaction tracing
  - Log aggregation & search tools
  - Design for logging
  - Decoupled severity
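To make a few of these ideas more concrete, here is a minimal Python sketch of event IDs, transaction tracing and decoupled severity using the standard logging module. It is not taken from the talk: the Event enum, the SEVERITY table, the logger name "shop-demo" and the log_event and build_logger helpers are all hypothetical names invented for this illustration.

```python
import enum
import logging
import sys
import uuid


class Event(enum.Enum):
    """Hypothetical event IDs: stable, searchable identifiers for log sites."""
    ORDER_RECEIVED = 1001
    PAYMENT_FAILED = 1002


# Decoupled severity (assumed design): the level lives in a lookup table,
# not at the call site, so it can be changed without touching app code.
SEVERITY = {
    Event.ORDER_RECEIVED: logging.INFO,
    Event.PAYMENT_FAILED: logging.ERROR,
}


def log_event(logger, event, transaction_id, message):
    """Emit one log line carrying the event ID and the transaction ID."""
    logger.log(
        SEVERITY.get(event, logging.INFO),
        message,
        extra={
            "event_id": event.value,
            "event_name": event.name,
            "txn": transaction_id,
        },
    )


def build_logger(stream):
    """Create a logger whose format puts event and transaction IDs first."""
    logger = logging.getLogger("shop-demo")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s event=%(event_id)s(%(event_name)s) "
        "txn=%(txn)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    txn = uuid.uuid4().hex  # one ID shared by every line of this transaction
    log_event(log, Event.ORDER_RECEIVED, txn, "order accepted")
    log_event(log, Event.PAYMENT_FAILED, txn, "card declined")
```

Because every line carries the same txn value, a log search tool can pull up the whole transaction in one query, and the numeric event ID stays searchable even if the message wording changes.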
Quotations:
Logging makes things work.
TAKING THE OPERATING SYSTEM OUT OF OPERATIONS by Gareth Rushgrove (@garethr)
Gareth Rushgrove gave a fascinating talk about hypervisors and unikernels.
Key concepts:
- Systems that have not been designed to be complex are becoming more complex.
- General Purpose OS is a dead concept. Reduce the complexity and only use what you need.
- We are running more software than we can understand.
- Applications start “fighting” with each other.
- Everyone wants to be the VMware of containers.
- Everyone who is not running a hypervisor is an applications developer.
- Unikernel advantages: hypervisor/hardware isolation, smaller attack surface area, running less code, enforced immutability, no default remote access.
- Unikernels allow us to be specific about everything.
- Infrastructure is code.
- Collaborate on hard problems, rather than marveling at your Docker PaaS.
Quotations:
I’m limited here by the slide, rather than the insanity of our infrastructure!
SECURITY FOR NON-UNICORNS by Ben Hughes (@benjammingh)
Ben Hughes gave an inspiring talk about the biggest security vulnerabilities that are often overlooked.
Key concepts:
- Small bugs turn into giant acorns that end up on the front page.
- Who can you trust? Nobody.
- Bug Bounties – a good way to have your system examined for security issues.
- Curl & Bash – just don’t!
- Do not run things as root.
- Stop trusting files on Internet.
- Check your public info – you can find many interesting things on Google Search.
- Docker needs to be secured: don’t use --privileged, use --cap-drop and --cap-add <thing> to get the minimum capabilities, use Docker Notary, use grsecurity, use SELinux.
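One common alternative to piping curl straight into bash is to download the file first and compare its checksum against a digest published by the vendor before running anything. The talk summary does not name this technique, so treat the following Python sketch as an illustration only; the sha256_of and verify_download helpers and the sample script bytes are hypothetical.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, expected_sha256: str) -> bool:
    """True only if the bytes match the published digest."""
    expected = expected_sha256.strip().lower()
    # compare_digest avoids leaking where the two strings first differ
    return hmac.compare_digest(sha256_of(data), expected)


if __name__ == "__main__":
    # Stand-in for a downloaded installer script (made up for this example).
    script = b"#!/bin/sh\necho installing\n"
    # In practice the digest comes from the vendor over a separate channel.
    published = sha256_of(script)

    print(verify_download(script, published))
    print(verify_download(script + b"echo tampered\n", published))
```

A checksum only proves that the file is the one the publisher intended, not that the publisher is trustworthy, so it complements rather than replaces the other advice in this list.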
Quotations:
“It’s got legit in the name. It must be secure, right?”
Takeaways from the day
- Communication is still key.
- Checklists and logging make incident handling run smoothly.
- AIX OS is old and you should probably stop using it.
- MongoDB is hard to use.
- Unicorns are a theme!
Fun fact of the day – you can order sleeping bees on the Internet and deliver them to a “friend”.