Michael Kehoe

Day 2 of SRECon started with an introduction from Kurt and Betsy (the Program Chairs), followed by three sets of plenary talks.

The following is a set of notes I put together from the talks I went to today.

If You Don’t Know Where You’re Going, It Doesn’t Matter How Fast You Get There Nicole Forsgren & Jez Humble Slides found here.

The talk was about strategic planning and measuring your performance and success. This part of IT is very undervalued.


There were some nice points about the misuse of velocity and utilization as Key Performance Indicators (KPIs).

Security and SRE: Natural Force Multipliers Cory Scott Slides here

Hierarchy of Needs in SRE:

Monitoring & Incident Response
Postmortem & Analysis
Testing & Release Procedures
Capacity Planning
Product

Problem Statement:

High Rate of Change
- Trust but verify
- Embrace the Error Budget
- Inject Engineering Discipline
Testing in Production is the new normal
- Dark Canaries

Security Challenges are similar to SRE

Latency & perf impact
Cascading failure scenarios
Service Discovery

Security Challenges

Authentication
Authorization
Access Control Logic

Data center technologies can all be controlled from a single web page application.

Start with a known-good state
Asset management
Ensure visibility
Validate consistently and constantly

Takeaways or Giveaways

Your data pipeline is your security lifeblood
Human-in-the-loop is your last resort, not your first option
All security solutions must be scalable
Remove single points of security failure like you do for availability
Assume that an attacker can be anywhere in your system or flow
Capture and measure meaningful security telemetry

What it Really Means to Be an Effective Engineer Edmond Lau

See coleadership.com/srecon
Effort ≠ Impact
Impact = hours spent x (impact produced / hours spent)
Leverage = impact produced / hours spent
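
As a rough illustration of the leverage formula, here is a small sketch. The activity names, hours and impact scores are made up for the example and are not from the talk; "impact" is in arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    hours_spent: float
    impact_produced: float  # arbitrary units, hypothetical numbers

    @property
    def leverage(self) -> float:
        # Leverage = impact produced / hours spent
        return self.impact_produced / self.hours_spent

    @property
    def impact(self) -> float:
        # Impact = hours spent x leverage
        return self.hours_spent * self.leverage


activities = [
    Activity("manual deploys", hours_spent=10, impact_produced=5),
    Activity("automate deploys", hours_spent=4, impact_produced=20),
]

# Rank activities by leverage, highest first.
ranked = sorted(activities, key=lambda a: a.leverage, reverse=True)
for a in ranked:
    print(f"{a.name}: leverage={a.leverage:.1f}, impact={a.impact:.1f}")
```

The point of ranking by leverage rather than by hours is that more effort does not automatically mean more impact.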

What are the high-leverage activities for engineers?

Do the simple thing first
Effective engineers invest in iteration speed
Effective engineers validate their ideas early and often

What are the best practices for building good infrastructure for relationships?

Effective engineers explicitly design their alliances
Effective engineers explicitly share their assumptions

Work hard and get things done, focus on high-leverage activities, and build infrastructure for your relationships

The Day the DNS Died Jeremy Blosser Slides here

Impact

Sending mail
Application traffic
Metrics

Diagnosing blind (without metrics) is difficult!

resolv.conf is fiddly

Can only use first 3 entries
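
To make the three-entry limit concrete, here is a minimal sketch that splits the `nameserver` lines of a resolv.conf body into the ones the glibc resolver would consult and the ones it would ignore. The addresses are hypothetical, and the parsing is deliberately simplified.

```python
MAXNS = 3  # glibc's limit on nameserver entries (see resolv.conf(5))


def split_nameservers(resolv_conf_text: str):
    """Return (used, ignored) nameserver lists for a resolv.conf body."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments (lines starting with '#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:MAXNS], servers[MAXNS:]


sample = """\
# hypothetical addresses for illustration
nameserver 10.0.0.2
nameserver 10.0.0.3
nameserver 10.0.1.2
nameserver 10.0.1.3
"""

used, ignored = split_nameservers(sample)
print("used:", used)
print("ignored:", ignored)
```

A check like this could run in configuration management to warn when a fourth entry is silently being dropped.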

Diagnosis

Asymmetric DNS packet flow (94% packet loss)

The Cause

[Undocumented] Connection tracking

Response

Incident response was functional
Ability to respond was compromised
New DNS design required

New Design

Dedicated VPC for isolation
Open security groups with ACLs
Separate clusters for app/db vs MTA
Use DNSMasq for local caching

Lessons learnt

Not all cloud limits are apparent
Instrument and protect your support services
It’s always a DNS problem… except when it’s a firewall problem
resolv.conf is not agile

Stable and Accurate Health-Checking of Horizontally-Scaled Services Lorenzo Saino

See the new Fastly paper on load balancing: Balancing on the Edge: Transport Affinity without Network State (NSDI paper)

PoP Deployments

Space and power are at a premium

Architecture:

Building a smarter load balancer

Methods:

machine learning - classifier
signal processing - filter
control theory - controller

Design: multi-stage system

Denoising - remove noise from input signal
Anomaly detection - identify misbehaving instance
Hysteresis filter - stabilize output

Implementation: host signals go into a filter, which makes a decision about the global state of the host
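
The staged design can be sketched roughly as follows. This is a toy illustration and not Fastly's implementation: the class name, thresholds and smoothing factor are assumptions, denoising is done with a simple exponential moving average, and anomaly detection is reduced to a fixed threshold on the smoothed error rate.

```python
class HostHealthFilter:
    """Toy filter: EWMA denoising -> threshold check -> hysteresis."""

    def __init__(self, alpha=0.5, unhealthy_above=0.5, healthy_below=0.2):
        self.alpha = alpha                      # EWMA smoothing factor
        self.unhealthy_above = unhealthy_above  # trip threshold
        self.healthy_below = healthy_below      # recovery threshold (lower)
        self.smoothed = 0.0
        self.healthy = True

    def update(self, error_rate: float) -> bool:
        # Denoising: smooth the raw signal with an exponential moving average.
        self.smoothed = self.alpha * error_rate + (1 - self.alpha) * self.smoothed
        # Anomaly detection + hysteresis: mark the host unhealthy only when the
        # smoothed signal exceeds the upper threshold, and restore it only once
        # the signal has dropped below the lower threshold.
        if self.healthy and self.smoothed > self.unhealthy_above:
            self.healthy = False
        elif not self.healthy and self.smoothed < self.healthy_below:
            self.healthy = True
        return self.healthy


# Hypothetical per-interval error rates for one host.
samples = [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
f = HostHealthFilter()
states = [f.update(s) for s in samples]
print(states)
```

With these numbers the single early spike does not flip the host's state, and recovery is delayed until the smoothed signal is clearly low again, which is the stabilizing effect the hysteresis stage is meant to provide.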

Don’t Ever Change! Are Immutable Deployments Really Simpler, Faster, and Safer? Rob Hirschfeld

Immutable Patterns

Baseline + Config
Live Boot + Config
Image Deploy

Image creation

Do the configuration and capture the image into a portable format
This sounds like a lot of work and really slow
Yes, but it’s faster, safer and more scalable

Lessons Learned from Our Main Database Migrations at Facebook Yoshinori Matsunobu

User Database

Stores Social Graph
Massively Sharded
Low latency
Automated Operations
Pure Flash Storage

What is MyRocks

MySQL on top of RocksDB
Open source, distributed by MariaDB and Percona

MyRocks Features

Clustered Index
Bloom filter and Column family
Transactions, including consistency between binlog and RocksDB
Faster data loading, deletes and replication
Dynamic Options
TTL
Online logical and binary backup

MyRocks pros vs InnoDB

Much smaller space (half compared to compressed InnoDB)
Writes are faster
Much smaller bytes written

MyRocks cons vs InnoDB

Lacks several features
- No foreign keys, full-text indexes, or spatial indexes
Must use row-based binary logging format
Reads are slower than InnoDB
Too many tuning options

MyRocks migration - technical challenges

Migration

Creating MyRocks instances without downtime
Creating second MyRocks instance without downtime
Shadow traffic tests
Promoting new master

InnoDB vs MyRocks, from SRE PoV

Server is busier because of double density
RocksDB is a much newer database that changes rapidly
MyRocks/RocksDB relies on buffered I/O
For large transactions, COMMIT needs more work than InnoDB
There are too many tuning options
Faster writes means replication slaves lag less often

Issue: Mitigating stalls

We upgraded kernel to 4.6
Changed data loading queries (schema changes) to use MyRocks bulk loading feature
Commit stalls went from every few minutes to nearly zero

Issue: A few deleted rows re-appeared

Some of our secondary indexes had extra rows
Turned out to be a bug in RocksDB compactions: in rare cases under heavy deletions, tombstones might not have been handled correctly

Issue: Scanning outbound delete-markers

Counting from one of the empty tables started taking a few minutes

Lessons Learned

Learn how core components work
- RocksDB depends on Linux more than InnoDB does
- Understanding how Linux works helps to fix issues faster
Do not ignore outliers
- Many of our issues happened on only a handful of instances

Leveraging Multiple Regions to Improve Site Reliability: Lessons Learnt from Jet.com Andrew Duch

Lesson 1: Sorry I missed this
Lesson 2: Not everything has to be active-active
Lesson 3: Three is cheaper than two - you waste 50% of your capacity in an active-active model
Lesson 4: Practice, Practice, Practice
Lesson 5: Failover Automation needs to scale
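
One way to read Lesson 3 is as simple capacity arithmetic. The sketch below assumes load is spread evenly across regions and that the remaining regions must absorb the full peak load when one region is lost. The peak load figure is hypothetical and the sizing rule is my assumption, not something stated in the talk.

```python
def capacity_plan(regions: int, peak_load: float):
    """Capacity needed so that peak load still fits after losing one region."""
    if regions < 2:
        raise ValueError("need at least two regions to survive a region loss")
    per_region = peak_load / (regions - 1)  # survivors must carry everything
    total = per_region * regions
    idle_fraction = 1 - peak_load / total   # capacity unused in normal operation
    return per_region, total, idle_fraction


for n in (2, 3):
    per_region, total, idle = capacity_plan(n, peak_load=100.0)
    print(f"{n} regions: {per_region:.0f} per region, {total:.0f} total, {idle:.0%} idle")
```

Under these assumptions two regions need twice the peak capacity in total (half of it idle), while three regions need one and a half times (a third of it idle).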

Unfortunately I had to skip the final set of sessions this afternoon due to a conflict. By all accounts, the sessions this afternoon were great.

See everyone tomorrow for day 3!

SRECon Americas 2018 Day 2 Notes