System Design for SREs


Maybe you’re thinking: SREs shouldn’t care about system design, right? Right?

No, that’s not the case!

SREs should always care about system design and should be able to design complex, distributed systems.

With that in mind, I’ve decided to write this post.


My first tip is the following system design course. Click on the video below to access it: 👇

What are distributed systems?

  • Characteristics: no shared clock, no shared memory, shared resources, concurrency and consistency challenges, and the need to agree on a common format or protocol.
  • Benefits: higher reliability, fault tolerance, scalability, lower latency, higher performance, and cost effectiveness.

Performance metrics for system design

  • Scalability - the ability of a system to grow and manage increased traffic, i.e., to handle an increased volume of data or requests

  • Reliability - the probability that a system performs without failure over a given period of time; software reliability is slightly harder to define than hardware reliability

  • Mean Time Between Failures (MTBF) = (total elapsed time - total downtime) / number of failures. Example: (24 hours - 4 hours of downtime) / 4 failures = 5 hours MTBF

  • Availability - the amount of time a system is operational during a period of time; poorly designed software that requires downtime for updates is less available

Availability % = (available time / total time) x 100. Example: (23 hours / 24 hours) x 100 = 95.83%
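
As a quick sanity check, here is a minimal Python sketch of the two formulas above, using the same example figures (a 24-hour window, 4 hours of downtime, 4 failures):

def mtbf_hours(total_hours, downtime_hours, failures):
    # MTBF = (total elapsed time - total downtime) / number of failures
    return (total_hours - downtime_hours) / failures

def availability_pct(available_hours, total_hours):
    # Availability % = (available time / total time) x 100
    return available_hours / total_hours * 100

print(mtbf_hours(24, 4, 4))                  # 5.0 hours
print(round(availability_pct(23, 24), 2))    # 95.83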

Table of downtime for 9’s

  • 99% -> 3 days, 15 hours, 40 minutes of annual downtime
  • 99.9% -> 8 hours, 46 minutes of annual downtime
  • 99.99% -> 52 minutes, 36 seconds of annual downtime
  • 99.999% -> 5.26 minutes of annual downtime

Reliable vs Available system.

Example: think about an airplane. A plane must be reliable: once in the air, it cannot fail. For availability, you can rely on a backup plane:

  • A reliable system is always an available system
  • Availability can be maintained by redundancy, but the system may not be reliable
  • Reliable software will be more profitable because providing the same service requires fewer backup resources
  • Requirements will depend on the function of the software

While designing a scalable system, the most important aspect is defining how the data will be partitioned and replicated across servers. Let’s first define these terms before moving on:

Data partitioning 

It is the process of distributing data across a set of servers. It improves the scalability and performance of the system.

Data replication

It is the process of making multiple copies of data and storing them on different servers. It improves the availability and durability of the data across the system.

Data partition and replication strategies lie at the core of any distributed system. A carefully designed scheme for partitioning and replicating the data enhances the performance, availability, and reliability of the system and also defines how efficiently the system will be scaled and managed.

David Karger et al. first introduced Consistent Hashing in their 1997 paper and suggested its use in distributed caching.

Consistent Hashing

Later, Consistent Hashing was adopted and enhanced for use across many distributed systems.

Distributed systems can use Consistent Hashing to distribute data across nodes.

  • Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.

Consistent Hashing stores the data managed by a distributed system in a ring. Each node in the ring is assigned a range of data (a small code sketch of such a ring follows the Virtual nodes note below).

Virtual nodes

Adding and removing nodes in any distributed system is quite common. Existing nodes can die and may need to be decommissioned. Similarly, new nodes may be added to an existing cluster to meet growing demands. To efficiently handle these scenarios, Consistent Hashing makes use of virtual nodes (or Vnodes).
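
Here is a minimal, illustrative Python sketch of a consistent hash ring with virtual nodes; the node names, the MD5 hash function, and the vnode count are assumptions made for the example, not details from the original paper:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears on the ring many times as virtual nodes.
        self._ring = []                      # sorted list of (hash, physical node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise from the key's position to the next virtual node.
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))              # always maps to the same node

Because each physical node owns many small ranges, adding or removing a server only remaps the keys in those ranges, which is exactly the property described above.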

Consistent Hashing in System Design

Consistent Hashing helps with efficiently partitioning and replicating data; therefore, any distributed system that needs to scale up or down or wants to achieve high availability through data replication can utilize Consistent Hashing.

A few such examples could be:

  • Any system that works with a set of storage (or database) servers and needs to scale up or down based on usage, e.g., a system that needs more storage during Christmas because of high traffic.
  • Any distributed system that needs dynamic adjustment of its cache usage by adding or removing cache servers based on the traffic load.
  • Any system that wants to replicate its data shards to achieve high availability.

Consistent Hashing use cases

Amazon’s Dynamo and Apache Cassandra use Consistent Hashing to distribute and replicate data across nodes.

Efficiency - How well the system performs

Latency and throughput are often used as metrics:

  • Latency key takeaways:
    • Avoid network calls whenever possible
    • Replicate data across data centres for disaster recovery as well as performance
    • Use CDNs to reduce latency
    • Keep frequently accessed data in memory (caching) rather than reading it from disk

Manageability - the speed and difficulty involved in maintaining a system

  • Observability: how hard is it to track down bugs?
  • Difficulty of deploying updates
  • Want to abstract away infrastructure so product engineers don’t have to worry about it

Back-of-the-envelope estimates for scalable system design

Here follow some of the numbers every SRE should know:

Data Conversions

  • 8 bits -> 1 byte
  • 1,024 bytes -> 1 kilobyte
  • 1,024 kilobytes -> 1 megabyte
  • 1,024 megabytes -> 1 gigabyte
  • 1,024 gigabytes -> 1 terabyte

Common Data Types

  • Char -> 1 byte
  • Integer -> 4 bytes
  • UNIX timestamp -> 4 bytes

Time

  • 60 seconds x 60 minutes = 3,600 seconds per hour
  • 3,600 seconds x 24 hours = 86,400 seconds per day
  • 86,400 seconds x 30 days ≈ 2.5 million seconds per month

Traffic Estimates

  • Estimate total number of requests app will receive
  • Average Daily Active Users (DAU) x Average reads/writes per User

Samples:

  • 10 Million DAU x 30 photos viewed = 300 Million photo read requests
  • 10 Million DAU x 1 photo uploaded = 10 Million photo writes
  • 300 Million requests / 86,400 seconds ≈ 3,472 requests per second
  • 10 Million writes / 86,400 seconds ≈ 115 writes per second

Memory

Read requests per day x average request size x 0.2 (assume roughly 20% of daily data is worth caching)

  • 300 Million requests x 500 bytes = 150 gigabytes
  • 150 GB x 0.2 (20%) = 30 gigabytes
  • 30 GB x 3 (replication) = 90 gigabytes

Bandwidth

Requests per day x Request size

  • 300 Million requests x 1.5 MB = 450,000 gigabytes
  • 450,000 GB / 86,400 seconds ≈ 5.2 GB per second

Storage

  • Writes per day x Size of write x Time to store data
  • 10 Million writes x 1.5MB = 15 TB per day
  • 15 TB x 365 days x 10 years = 55 petabytes
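
Pulling the estimates above together in one small script (the DAU, request sizes, and retention period are the sample numbers from this section):

SECONDS_PER_DAY = 86_400

dau = 10_000_000
reads_per_day = dau * 30                     # 300 million photo views per day
writes_per_day = dau * 1                     # 10 million photo uploads per day

print(reads_per_day // SECONDS_PER_DAY)      # ~3,472 read requests per second
print(writes_per_day // SECONDS_PER_DAY)     # ~115 writes per second

# Memory: cache ~20% of daily read traffic at ~500 bytes each, replicated 3x
print(reads_per_day * 500 / 1e9 * 0.2 * 3)   # ~90 GB

# Bandwidth: 300 million requests per day at 1.5 MB each
print(round(reads_per_day * 1.5 / 1_000 / SECONDS_PER_DAY, 1))   # ~5.2 GB per second

# Storage: 10 million writes per day at 1.5 MB each, kept for 10 years
tb_per_day = writes_per_day * 1.5 / 1e6      # ~15 TB per day
print(round(tb_per_day * 365 * 10 / 1_000, 1))                   # ~54.8 PB (roughly 55 PB)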

Distributed Systems advanced concepts

Now I’ll cover some other important and advanced concepts/properties of distributed systems.

Idempotence

  • A POST should create only one record, no matter how many times it is retried

Use a client-generated GUID stored in a unique column on the table so that retries never create duplicates (see the sketch below).
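
A minimal sketch of that idea, with an in-memory dict standing in for a database table that has a unique index on the idempotency key (the names here are illustrative):

import uuid

records_by_key = {}                          # stands in for a table with a unique GUID column

def create_record(idempotency_key, payload):
    # A retried POST with the same key returns the existing record
    # instead of creating a duplicate one.
    if idempotency_key in records_by_key:
        return records_by_key[idempotency_key]
    record = {"id": str(uuid.uuid4()), **payload}
    records_by_key[idempotency_key] = record
    return record

key = str(uuid.uuid4())                      # generated once on the client
first = create_record(key, {"tweet": "hello"})
retry = create_record(key, {"tweet": "hello"})
assert first == retry                        # the retry did not create a duplicate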


CQRS - Command Query Responsibility Segregation (see the sketch below)

  • Commands - make a persistent change and return no information -> writes
  • Queries - make no change and return information -> reads
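
A tiny sketch of the command/query split; the function and store names are made up for the example:

tweets = {}                                  # the underlying store

def post_tweet(tweet_id, text):
    # Command: makes a persistent change and returns no information (a write).
    tweets[tweet_id] = text

def get_tweet(tweet_id):
    # Query: makes no change and returns information (a read).
    return tweets.get(tweet_id)

post_tweet("1", "hello CQRS")
print(get_tweet("1"))                        # hello CQRS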

Immutable by Default:

  • Make all data immutable
  • Reliable audit logs
  • Do not destroy data
  • Preserve metadata

Horizontal vs Vertical scaling

Vertical Scaling:

  • Easiest way to Scale an Application
  • Diminishing returns, limits to scalability
  • Single point of failure. :-(

Horizontal Scaling:

  • More complexity up front, but more efficient long term
  • Redundancy built in
  • Need load balancer to distribute traffic
  • Cloud providers make this easier

Buzzwords

  • Kubernetes
  • Docker
  • Hadoop
  • Snowflake

Diminishing returns

Picture a chart of time elapsed versus number of CPU cores (1, 2, 4, 8, ... N cores): each additional core reduces the elapsed time by less than the previous one.

Latency:

  • Use geographically distributed availability zones and CDNs to deliver content to users

Fault Tolerance

Failures are not avoidable in any system and will happen all the time, hence we need to build systems that can tolerate failures or recover from them.

In systems, failure is the norm rather than the exception.

  • “Anything that can go wrong will go wrong” – Murphy’s Law
  • “Complex systems contain changing mixtures of failures latent within them” – How Complex Systems Fail.

Fault Tolerance - Failure Metrics

Common failure metrics that get measured and tracked for any system.

  • Mean time to repair (MTTR): The average time to repair and restore a failed system.
  • Mean time between failures (MTBF): The average operational time between one device failure or system breakdown and the next.
  • Mean time to failure (MTTF): The average time a device or system is expected to function before it fails.
  • Mean time to detect (MTTD): The average time between the onset of a problem and when the organization detects it.
  • Mean time to investigate (MTTI): The average time between the detection of an incident and when the organization begins to investigate its cause and solution.
  • Mean time to restore service (MTRS): The average elapsed time from the detection of an incident until the affected system or component is again available to users.
  • Mean time between system incidents (MTBSI): The average elapsed time between the detection of two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS (MTBSI = MTBF + MTRS).
  • Failure rate: Another reliability metric, which measures the frequency with which a component or system fails. It is expressed as a number of failures over a unit of time.

System Design Components

Load balancers

  • Balance incoming traffic to multiple servers
  • Software or Hardware based
  • Used to improve reliability and scalability of application
  • Nginx, HAProxy, F5, Citrix

LB Routing Methods

Round Robin

  • Simplest type of routing
  • Can result in uneven traffic

Least Connections

  • Routes based on number of client connections to server
  • Useful for chat or streaming applications

Least Response Time

  • Routes based on how quickly servers respond

IP Hash

  • Routes the client to a server based on the client’s IP
  • Useful for sticky/stateful sessions (see the sketch below)
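
A quick sketch of two of these routing methods, assuming an illustrative list of three backend servers:

import hashlib
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: cycle through the servers regardless of load.
_rotation = itertools.cycle(servers)

def round_robin():
    return next(_rotation)

# IP hash: the same client IP always lands on the same server (sticky sessions).
def ip_hash(client_ip):
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print([round_robin() for _ in range(4)])     # ['app-1', 'app-2', 'app-3', 'app-1']
print(ip_hash("203.0.113.7"))                # deterministic for this client IP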

Load balancing can be applied at L4 or L7

Layer 4

  • Only has access to TCP and UDP data
  • Faster
  • Lack of information can lead to uneven traffic

Layer 7

  • Full access to HTTP protocol and data
  • SSL termination
  • Check for authentication
  • Smarter routing options

Caching

  • Improves application performance
  • Saves money

Numbers Programmers Should Know

Data Conversions

  • 8 bits -> 1 byte
  • 1,024 bytes -> 1 kilobyte
  • 1,024 kilobytes -> 1 megabyte
  • 1,024 megabytes -> 1 gigabyte
  • 1,024 gigabytes -> 1 terabyte

Common Data Types

  • Char - 1 byte
  • Integer - 4 bytes
  • UNIX timestamp - 4 bytes

Costs of Operations:

  • Read from disk
  • Read from memory
  • Local Area Network (LAN) round-trip
  • Cross-Continental Network

Latency Numbers (Memory / Disk / Network)

It is possible to read:

  • sequentially from an HDD at a rate of ~200 MB per second
  • sequentially from an SSD at a rate of ~1 GB per second
  • sequentially from main memory at a rate of ~100 GB per second (burst rate)
  • sequentially from 10 Gbps Ethernet at a rate of ~1000 MB per second

No more than 6-7 round trips between Europe and the US per second are possible, but approximately 2000 per second can be achieved within a datacenter.

Speed and performance - cache data in between the database and the app.

Reading from memory is much faster than reading from disk, roughly 50-200x faster.

Most apps have far more reads than writes, which is perfect for caching.

Caching Layers

  • DNS - google.com resolves to an IP that is cached in the browser
  • CDN - Cloudflare
  • Application
  • Database - frequently accessed data kept in a cache

Distributed Cache

  • Works the same as a traditional cache

  • Has built-in functionality to replicate data

  • Can shard data across servers and locate the proper server for each key

  • Active nodes (A, B, C) serve traffic

  • Passive nodes (A, B, C) are kept warm and can serve as a backup of the cache

Only store frequently accessed data so the cache is used efficiently.

TTL - Time to Live - time period before a cache entry is deleted

  • Used to prevent stale data

LRU/LFU

Least Recently Used (LRU)

  • Once the cache is full, evict the least recently accessed key to make room for the new key (see the sketch below)

Least Frequently Used (LFU)

  • Track the number of times each key is accessed
  • Drop the least used key when the cache is full
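
A minimal LRU cache sketch built on Python's OrderedDict (the capacity of 2 is only for the demo):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used key

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")                               # "a" becomes the most recently used
cache.put("c", 3)                            # evicts "b"
print(cache.get("b"))                        # None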

Caching Strategies

Cache aside - the most common strategy. Other strategies include:

  • Read through
  • Write through
  • Write back

Implementation:

Caching pseudocode - retrieval

cache = {}   # module-level cache shared across requests

def app_request(tweet_id):
    data = cache.get(tweet_id)
    if data:
        return data
    else:
        data = db_query(tweet_id)    # db_query is the app's database read layer
        cache[tweet_id] = data       # set data in cache for subsequent reads
        return data

How to write/update the cache:

def app_update(tweet_id, data):
    db_update(tweet_id, data)        # db_update is the app's database write layer
    cache.pop(tweet_id, None)        # remove from cache to avoid serving stale data

—————————————————————————————

Database Scaling

  • Ways to improve DB performance
  • Scale horizontally
  • Replication and partitioning
  • Most web apps are 95% reads
  • Twitter, Facebook, Google

Write once and read many times

Basic Scaling Techniques

  • Indexes
    • Index based on column
    • Speed up read performance
    • Writes and updates become slightly slower
    • More storage required for index
  • Denormalisation
    • Add redundant data to tables to reduce joins
    • Boosts read performance
    • Slows down writes
    • Risk of inconsistent data across tables
    • Code is harder to write
  • Connection pooling
    • Allow multiple application threads to use the same DB connections
    • Saves on overhead of independent DB connections
  • Caching
    • Cache sits in front of DB to handle serving content
    • Can’t cache everything
    • Redis / memcached
  • Vertical scaling - get a bigger server (more CPU, more memory, disk speed, network throughput)

Replication and Partitioning

  • Read replicas
    • Create replica servers to handle reads
    • Master server dedicated only to writes
    • Have to handle making sure new data reaches replicas
    • DB master/main plus read replicas - issues with data consistency / stale data
  • Horizontal partitioning - sharding (see the sketch after this list)
    • The schema of the table stays the same, but it is split across multiple DBs - Master 1 handles A-M, Master 2 handles N-Z
    • Downsides - hot keys, no joins across shards, could result in unbalanced servers and traffic
    • Instagram - Justin Bieber posts a picture and the servers go crazy
  • Vertical Partition
    • Divide up schema of database into separate tables
    • Generally divided by functionality
    • Best when most of the data in a row isn’t needed for most queries
    • Table (ID, Username, Email, Address)
    • Split into two tables: Table1 (ID, Username) and Table2 (UserID, Email, Address)
      • Instagram keeps counts for likes and dislikes in separate databases and tables as a vertical partition
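
A small sketch of the two horizontal approaches above, range-based and hash-based sharding (the shard names and ranges are illustrative):

def shard_by_range(username):
    # Range-based sharding: A-M on one master, N-Z on another.
    return "master-1 (A-M)" if username[:1].upper() <= "M" else "master-2 (N-Z)"

def shard_by_hash(user_id, num_shards=4):
    # Hash-based sharding spreads hot key ranges more evenly across shards.
    return f"shard-{user_id % num_shards}"

print(shard_by_range("alice"))               # master-1 (A-M)
print(shard_by_range("zoe"))                 # master-2 (N-Z)
print(shard_by_hash(123456))                 # shard-0

Range-based sharding keeps related keys together but is prone to hot keys; hash-based sharding spreads load more evenly but makes cross-shard joins and range queries harder.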

The Devil You Know - When to consider NoSQL

  • With SQL, the devil you know, you know upfront what you are sacrificing
  • Choose NoSQL when you know which trade-offs you need to make
  • For example, when you don’t need transactions or perfect consistency (as in many social media features)
  • Exercise: how would you design a YouTube clone?

Database Design and Scaling

Vertical scaling

  • Vertical scaling - get a bigger database server

Indexes

  • Index based on column
  • Speed up read performance
  • Writes and updates become slightly slower
  • More storage required for index

Denormalization

  • Add redundant data to tables to reduce joins
  • Boosts read performance
  • Slows down writes
  • Risk of inconsistent data across tables
  • Code is harder to write

Connection Pooling

  • Allow multiple application threads to use the same DB connections
  • Saves on the overhead of opening independent DB connections per thread (see the sketch below)
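
A minimal connection-pool sketch using a thread-safe queue, with SQLite in-memory connections standing in for a real database driver (real applications would usually rely on the pooling built into their DB driver or ORM):

import queue
import sqlite3

class ConnectionPool:
    def __init__(self, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # Open the connections once, up front, instead of once per request.
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self):
        return self._pool.get()              # blocks if every connection is in use

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
conn = pool.acquire()
print(conn.execute("SELECT 1").fetchone())   # (1,)
pool.release(conn)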

System Design Outline

This is a recommended approach to solving system design problems. Not all of these topics will be relevant for a given problem.

  1. Functional Requirements / Clarifications / Assumptions
  2. Non-Functional Requirements
    • Consistency vs Availability
    • Latency
      • How fast does this system need to be?
      • User-perceived latency
      • Data processing latency
    • Security
      • Potential attacks? How should they be mitigated?
    • Privacy
      • PII, PCI, or other user data
    • Data Integrity
      • How to recover from data corruption or loss?
    • Read:Write Ratio
  3. APIs
    • Public and/or internal APIs?
    • Are the APIs simple and easy to understand?
    • How are entities identified?
  4. Storage Schemas
    • SQL vs NoSQL
    • Message Queues
  5. System Design
  6. Scalability
    • How does the system scale? Consider both data size and traffic increases.
    • What are the bottlenecks? How should they be addressed?
    • What are the edge cases? What could go wrong? Assuming they occur, how should they be addressed?
    • How will we stress test this system?
    • Load Balancing
    • Auto-scaling / Replication
    • Caching
    • Partitioning
    • Replication
    • Business Continuity and Disaster Recovery (BCDR)
    • Internationalization / Localization
      • How to scale to multiple countries and languages? Don’t assume the US is English-only.

All that being covered, the mindset of an SRE should be simple:

You built it, you run it!

https://www.atlassian.com/br/incident-management/devops/sre

You can be an SRE, DevOps engineer, developer, QA, and product manager at the same time. Don’t put yourself in a silo. Think about your customers, users, colleagues, and the communities around your service. The team should be able to understand its service end-to-end: quality, cost, scope, security, compliance, etc.


Thank You and until the next one! 😉👍

Published on Jul 25, 2022 by Vinicius Moll
