System Design for SREs


Maybe you’re thinking: SREs shouldn’t care about system design, right? Right?

No, that’s not the case!

SREs should always care about system design, and should be able to design complex, distributed systems.

With that in mind, I’ve decided to write this post.

My first tip is this systems design course.


What are distributed systems?

  • Characteristics: no shared clock, no shared memory, shared resources, concurrency and consistency; nodes require agreement on a common format or protocol
  • Benefits: more reliable, fault tolerant, scalable, lower latency, increased performance, cost effective

Performance metrics for system design

  • Scalability - the ability of a system to grow and manage increased traffic; the system must be able to handle an increased volume of data or requests

  • Reliability - the probability that a system performs its intended function without failure during a period of time; slightly harder to define for software than for hardware

  • Mean Time Between Failures: MTBF = (total elapsed time - total downtime) / number of failures. Example: (24 hours - 4 hours of downtime) / 4 failures = 5 hour MTBF

  • Availability - Amount of time a system is operational during a period of time; poorly designed software requiring downtime for updates is less available

Availability % = (available time / total time) x 100. Example: (23 hours / 24 hours) x 100 = 95.83%

Table of downtime for 9’s

  • 99% -> 3 days, 15 hours, 36 minutes of annual downtime
  • 99.9% -> 8 hours, 46 minutes of annual downtime
  • 99.99% -> 52 minutes, 34 seconds of annual downtime
  • 99.999% -> 5.26 minutes of annual downtime
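These figures follow directly from the availability formula above; here is a minimal sketch that reproduces them (the function name `annual_downtime` and the 365-day year are my own assumptions):

```python
# Annual downtime allowed by an availability target, assuming a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_downtime(availability_pct):
    """Return the allowed downtime per year, in seconds."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    seconds = annual_downtime(nines)
    hours, rem = divmod(seconds, 3600)
    print(f"{nines}% -> {hours:.0f} h {rem / 60:.1f} min of downtime per year")
```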

Reliable vs available systems

Example: think about an airplane. A plane must be reliable - once in the air, it cannot fail. For availability, you can use a backup plane:

  • A reliable system is always an available system
  • Availability can be maintained by redundancy, but the system may not be reliable
  • Reliable software is more profitable because providing the same service requires fewer backup resources
  • Requirements will depend on the function of the software

While designing a scalable system, the most important aspect is defining how the data will be partitioned and replicated across servers. Let’s first define these terms before moving on:

Data partitioning 

It is the process of distributing data across a set of servers. It improves the scalability and performance of the system.

Data replication

It is the process of making multiple copies of data and storing them on different servers. It improves the availability and durability of the data across the system.

Data partition and replication strategies lie at the core of any distributed system. A carefully designed scheme for partitioning and replicating the data enhances the performance, availability, and reliability of the system and also defines how efficiently the system will be scaled and managed.

David Karger et al. first introduced Consistent Hashing in their 1997 paper and suggested its use in distributed caching.

Consistent Hashing

Later, Consistent Hashing was adopted and enhanced for use across many distributed systems.

Distributed systems can use Consistent Hashing to distribute data across nodes.

  • Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.

Consistent Hashing stores the data managed by a distributed system in a ring, and each node in the ring is assigned a range of data.

Virtual nodes

Adding and removing nodes in any distributed system is quite common. Existing nodes can die and may need to be decommissioned. Similarly, new nodes may be added to an existing cluster to meet growing demands. To efficiently handle these scenarios, Consistent Hashing makes use of virtual nodes (or Vnodes).
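As a sketch of the ideas above - a hash ring with virtual nodes, where removing a server only remaps the keys it owned - here is a minimal, illustrative implementation (class and method names are my own; real systems such as Cassandra use more elaborate schemes):

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes (vnodes).
class ConsistentHashRing:
    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, physical node) pairs

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node is placed on the ring many times (vnodes),
        # which spreads its load more evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#vn{i}"), node))

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        # The first vnode clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node here only remaps the keys that node owned; every other key keeps its placement, which is the property that makes scaling up or down cheap.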

Consistent Hashing in System Design

Consistent Hashing helps with efficiently partitioning and replicating data; therefore, any distributed system that needs to scale up or down or wants to achieve high availability through data replication can utilize Consistent Hashing.

A few such examples could be:

  • Any system working with a set of storage (or database) servers and needs to scale up or down based on the usage, e.g., the system could need more storage during Christmas because of high traffic.
  • Any distributed system that needs dynamic adjustment of its cache usage by adding or removing cache servers based on the traffic load.
  • Any system that wants to replicate its data shards to achieve high availability.

Consistent Hashing use cases

Amazon’s Dynamo and Apache Cassandra use Consistent Hashing to distribute and replicate data across nodes.

Efficiency - How well the system performs

Latency and throughput are often used as metrics.

  • Latency Key Takeaways:
    • Avoid network calls whenever possible
    • Replicate data across data centres for disaster recovery as well as performance
    • Use CDNs to reduce latency
    • Keep frequently accessed data in memory rather than seeking it from disk where possible (caching)

Manageability - Speed and difficulty involved with maintaining system

  • Observability, how hard to track bugs
  • Difficulty of deploying updates
  • Want to abstract away infrastructure so product engineers don’t have to worry about it

Back-of-envelope estimates for scalable system design

Here follow some of the numbers every SRE should know:

Data Conversions

  • 8 bits -> 1 byte
  • 1024 bytes -> 1 kilobyte
  • 1024 kilobytes -> 1 megabyte
  • 1024 megabytes -> 1 gigabyte
  • 1024 gigabytes -> 1 terabyte

Common Data Types

  • Char - 1 byte
  • Integer - 4 bytes
  • UNIX timestamp - 4 bytes


  • 60 seconds x 60 minutes = 3,600 seconds per hour
  • 3,600 x 24 hours = 86,400 seconds per day
  • 86,400 x 30 days = 2,592,000 (~2.5 million) seconds per month

Traffic Estimates

  • Estimate total number of requests app will receive
  • Average Daily Active Users (DAU) x Average reads/writes per User


  • 10 million DAU x 30 photos viewed = 300 million photo requests
  • 10 million DAU x 1 photo uploaded = 10 million photo writes
  • 300 million requests / 86,400 ≈ 3,472 requests per second
  • 10 million writes / 86,400 ≈ 115 writes per second


Memory estimates for caching: read requests per day x average request size x 0.2 (assuming the 80/20 rule - cache the most frequently accessed 20% of daily read volume)

  • 300 million requests x 500 bytes = 150 gigabytes
  • 150 GB x 0.2 (20%) = 30 gigabytes
  • 30 GB x 3 (replication) = 90 gigabytes


Bandwidth estimates: requests per day x request size

  • 300 million requests x 1.5 MB = 450,000 gigabytes
  • 450,000 GB / 86,400 ≈ 5.2 GB per second


Storage estimates: writes per day x size of write x time to store data

  • 10 million writes x 1.5 MB = 15 TB per day
  • 15 TB x 365 days x 10 years ≈ 55 petabytes
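The estimates above can be reproduced with a few lines of arithmetic; this is a back-of-envelope check of the same photo-sharing numbers (all figures come from the examples above):

```python
# Back-of-envelope check of the photo-sharing estimates above.
SECONDS_PER_DAY = 24 * 3600                   # 86,400

dau = 10_000_000                              # daily active users
read_requests = dau * 30                      # 30 photos viewed -> 300M reads/day
write_requests = dau * 1                      # 1 upload -> 10M writes/day

reads_per_second = read_requests // SECONDS_PER_DAY    # ~3,472
writes_per_second = write_requests // SECONDS_PER_DAY  # ~115

# Memory: cache 20% of daily read volume, replicated 3x.
cache_gb = read_requests * 500 * 0.2 / 1e9    # 500-byte average read -> 30 GB
cache_with_replication_gb = cache_gb * 3      # 90 GB

# Storage: writes/day x 1.5 MB object x 10 years of retention.
tb_per_day = write_requests * 1.5 / 1e6       # 15 TB/day
petabytes_10y = tb_per_day * 365 * 10 / 1000  # ~55 PB

print(reads_per_second, writes_per_second, cache_with_replication_gb, petabytes_10y)
```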

Distributed Systems advanced concepts

Now I’ll cover some other important and advanced concepts/properties of distributed systems.


  • POST should create only one record

Use a client-supplied ID, stored with a unique (GUID) constraint on the table, so that retries never create duplicates.
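Here is a minimal in-memory sketch of this idea - the client sends a unique request ID (a GUID) with the POST, and replays of the same request return the existing record instead of creating a duplicate (all names are illustrative; a real service would enforce uniqueness with a unique index on the table, not a dict):

```python
import uuid

# Idempotent create: replays of the same request ID do not create duplicates.
records_by_request_id = {}

def create_record(request_id, payload):
    if request_id in records_by_request_id:      # replayed POST
        return records_by_request_id[request_id]
    record = {"id": str(uuid.uuid4()), "data": payload}
    records_by_request_id[request_id] = record
    return record
```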


CQRS - Command Query Responsibility Segregation

  • Commands - persist a change and return no information -> writes
  • Queries - change nothing and return information -> reads
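A minimal sketch of the command/query split, assuming a toy in-memory user store (names are illustrative):

```python
# Toy CQRS split: commands write and return nothing,
# queries read and have no side effects.
_users = {}

def register_user(user_id, name):
    """Command: persists a change, returns no information."""
    _users[user_id] = {"name": name}

def get_user(user_id):
    """Query: returns information, changes nothing."""
    return _users.get(user_id)
```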

Immutable by Default:

  • Make all data immutable
  • Reliable audit logs
  • Do not destroy data
  • Preserve metadata

Horizontal vs Vertical scaling

Vertical Scaling:

  • Easiest way to Scale an Application
  • Diminishing returns, limits to scalability
  • Single point of failure. :-(

Horizontal Scaling:

  • More complexity up front, but more efficient long term
  • Redundancy built in
  • Need load balancer to distribute traffic
  • Cloud providers make this easier


Platforms built around horizontal scaling include:

  • Kubernetes
  • Docker
  • Hadoop
  • Snowflake

Diminishing returns

Time elapsed versus CPU cores: a chart of elapsed time against 1, 2, 4, 8, … N CPU cores shows diminishing returns - each additional core reduces elapsed time by less than the previous one.


  • Use geographically distributed Availability Zones and CDNs to deliver content to users

Fault Tolerance

Failures are not avoidable in any system and will happen all the time, hence we need to build systems that can tolerate failures or recover from them.

In systems, failure is the norm rather than the exception.

  • “Anything that can go wrong will go wrong” – Murphy’s Law
  • “Complex systems contain changing mixtures of failures latent within them” – How Complex Systems Fail.

Fault Tolerance - Failure Metrics

Common failure metrics that get measured and tracked for any system.

  • Mean time to repair (MTTR): The average time to repair and restore a failed system.
  • Mean time between failures (MTBF): The average operational time between one device failure or system breakdown and the next.
  • Mean time to failure (MTTF): The average time a device or system is expected to function before it fails.
  • Mean time to detect (MTTD): The average time between the onset of a problem and when the organization detects it.
  • Mean time to investigate (MTTI): The average time between the detection of an incident and when the organization begins to investigate its cause and solution.
  • Mean time to restore service (MTRS): The average elapsed time from the detection of an incident until the affected system or component is again available to users.
  • Mean time between system incidents (MTBSI): The average elapsed time between the detection of two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS (MTBSI = MTBF + MTRS).
  • Failure rate: Another reliability metric, which measures the frequency with which a component or system fails. It is expressed as a number of failures over a unit of time.
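The MTBF example from earlier (24 hours, 4 hours of downtime, 4 failures) and the standard steady-state availability formula, MTBF / (MTBF + MTTR), can be checked in a few lines (function names are my own):

```python
# Worked example of the failure metrics above.
def mtbf(elapsed_hours, downtime_hours, failures):
    """Mean time between failures: operational time / number of failures."""
    return (elapsed_hours - downtime_hours) / failures

def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability as the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

m = mtbf(24, 4, 4)                         # 5.0 hours, as in the example above
print(steady_state_availability(m, 1.0))   # with a 1-hour MTTR -> ~0.833
```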

System Design Components

Load balancers

  • Balance incoming traffic to multiple servers
  • Software or Hardware based
  • Used to improve reliability and scalability of application
  • Nginx, HAProxy, F5, Citrix

LB Routing Methods

Round Robin

  • Simplest type of routing
  • Can result in uneven traffic

Least Connections

  • Routes based on number of client connections to server
  • Useful for chat or streaming applications

Least Response Time

  • Routes based on how quickly servers respond

IP Hash

  • Routes client to server based on IP
  • Useful for stateful sessions
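Two of the routing methods above can be sketched in a few lines; this toy version assumes an in-process list of servers and tracks connection counts manually (all names are illustrative):

```python
import itertools

# Toy routing methods for a fixed server pool.
servers = ["s1", "s2", "s3"]

# Round robin: hand out servers in a fixed cycle.
_cycle = itertools.cycle(servers)
def round_robin():
    return next(_cycle)

# Least connections: pick the server with the fewest open connections,
# then count the new connection against it.
open_connections = {s: 0 for s in servers}
def least_connections():
    server = min(open_connections, key=open_connections.get)
    open_connections[server] += 1
    return server
```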

Load balancing can be applied at Layer 4 or Layer 7.

Layer 4

  • Only has access to TCP and UDP data
  • Faster
  • Lack of information can lead to uneven traffic

Layer 7

  • Full access to HTTP protocol and data
  • SSL termination
  • Check for authentication
  • Smarter routing options


Why use caching:

  • To improve performance of the application
  • To save money

Numbers Programmers Should Know


Costs of Operations:

  • Read from disk
  • Read from memory
  • Local Area Network (LAN) round-trip
  • Cross-Continental Network

Latency Numbers (Memory / Disk / Network)

It is possible to read:

  • sequentially from HDD at a rate of ~200 MB per second
  • sequentially from SSD at a rate of ~1 GB per second
  • sequentially from main memory at a rate of ~100 GB per second (burst rate)
  • sequentially from 10 Gbps Ethernet at a rate of ~1000 MB per second

No more than 6-7 round trips between Europe and the US per second are possible, but approximately 2000 per second can be achieved within a datacenter.

Speed and Performance - Cache data, in between database and app

Reading from memory is much faster than reading from disk - on the order of 50-200x faster.

Most apps have far more reads than writes, which makes them perfect for caching.

Caching Layers

  • DNS - IP addresses cached in the browser
  • CDN - e.g. Cloudflare
  • Application
  • Database - frequently accessed data kept in a cache

Distributed Cache

  • Works the same as a traditional cache

  • Has built-in functionality to replicate data

  • Can shard data across servers and locate the proper server for each key

  • Active nodes (A, B, C) serve requests

  • Passive nodes (A, B, C) are kept warm and can serve as a backup of the cache

Only store frequently accessed data, to use the cache efficiently.

TTL - Time to Live - time period before a cache entry is deleted

  • Used to prevent stale data


Cache eviction policies

Least Recently Used (LRU)

  • Once cache is full, remove last accessed key and add new key

Least Frequently Used (LFU)

  • Track number of times key is accessed
  • Drop least used when cache is full
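A minimal LRU cache along the lines described above, using Python's `OrderedDict` to track access order (the class name is my own):

```python
from collections import OrderedDict

# Minimal LRU cache: once full, the least recently accessed key is evicted.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # drop the least recently used key
```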

Caching Strategies

  • Cache aside - most common
  • Read through
  • Write through
  • Write back


Caching pseudocode - retrieval

cache = {}

def app_request(tweet_id):
    data = cache.get(tweet_id)
    if data:
        return data
    data = db_query(tweet_id)
    cache[tweet_id] = data  # set data in cache
    return data

How to write/update on cache:

def app_update(tweet_id, data):
    db_update(tweet_id, data)  # write to the database first
    cache.pop(tweet_id, None)  # remove from cache to avoid stale data


Database Scaling

  • Ways to improve DB performance
  • Scale horizontally
  • Replication, partitions
  • Most web apps are ~95% reads
  • Twitter, Facebook, Google

Write once and read many times

Basic Scaling Techniques

  • Indexes
    • Index based on column
    • Speed up read performance
    • Writes and updates become slightly slower
    • More storage required for index
  • Denormalisation
    • Add redundant data to tables to reduce joins
    • Boosts read performance
    • Slows down writes
    • Risk inconsistent data across tables
    • Code is harder to write
  • Connection pooling
    • Allow multiple application threads to use same DB connection
    • Saves on overhead of independent DB connections
  • Caching
    • Cache sits in front of DB to handle serving content
    • Can’t cache everything
    • Redis / memcached
  • Vertical scaling - get a bigger server (more CPU, more memory, disk speed, network throughput)

Replication and Partitioning

  • Read replicas
    • Create replica servers to handle reads
    • Master server dedicated only to writes
    • Have to handle making sure new data reaches replicas
    • DB master/ main and read replicas - Issues with consistency of data / stale data
  • Horizontal partitioning - sharding
    • Schema of table stays the same, but split across multiple DBs - Master 1 A-M, Master 2 N-Z
    • Downsides - hot keys, no joins across shards, could result in unbalanced servers and traffic
    • Instagram - Justin Bieber posts a picture and servers go crazy
  • Vertical Partition
    • Divide up schema of database into separate tables
    • Generally divided by functionality
    • Best when most data in a row isn’t needed for most queries
    • Table (ID, Username, Email, Address)
    • Split into 2 tables Table1(ID, username) Table2 (Email, Address, userID)
      • Instagram has counts for likes and dislikes on separate databases and tables for vertical partition
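The A-M / N-Z split above can be sketched as a routing function (shard names are illustrative):

```python
import string

# Horizontal partitioning by key range, as in the A-M / N-Z example above.
def shard_for(username):
    first = username[0].upper()
    if first not in string.ascii_uppercase:
        return "shard-other"                 # non-alphabetic usernames
    return "shard-1" if first <= "M" else "shard-2"
```

Range-based sharding is simple, but it is exactly the scheme that produces hot keys - a popular key range (or a celebrity user) concentrates traffic on one server, which is why hash-based schemes such as Consistent Hashing spread load more evenly.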

The Devil You Know - When to consider NoSQL

  • Up front, you know what you’re sacrificing
  • Choose NoSQL when you know which trade-offs you need to make
  • Use it if you don’t need transactions or perfect consistency (e.g. for social media)
  • Example: how to design a YouTube clone


System Design Outline

This is a recommended approach to solving system design problems. Not all of these topics will be relevant for a given problem.

  1. Functional Requirements / Clarifications / Assumptions
  2. Non-Functional Requirements
    • Consistency vs Availability
    • Latency
      • How fast does this system need to be?
      • User-perceived latency
      • Data processing latency
    • Security
      • Potential attacks? How should they be mitigated?
    • Privacy
      • PII, PCI, or other user data
    • Data Integrity
      • How to recover from data corruption or loss?
    • Read:Write Ratio
  3. APIs
    • Public and/or internal APIs?
    • Are the APIs simple and easy to understand?
    • How are entities identified?
  4. Storage Schemas
    • SQL vs NoSQL
    • Message Queues
  5. System Design
  6. Scalability
    • How does the system scale? Consider both data size and traffic increases.
    • What are the bottlenecks? How should they be addressed?
    • What are the edge cases? What could go wrong? Assuming they occur, how should they be addressed?
    • How will we stress test this system?
    • Load Balancing
    • Auto-scaling / Replication
    • Caching
    • Partitioning
    • Replication
    • Business Continuity and Disaster Recovery (BCDR)
    • Internationalization / Localization
      • How to scale to multiple countries and languages? Don’t assume the US is English-only.

All that being covered, the mindset of an SRE should be simple:

You built it, you run it!

You can be an SRE, a DevOps engineer, a developer, QA, and a product manager at the same time. Don’t put yourself in a silo. Think about your customers, users, colleagues, and the communities around your service. The team should understand its service end-to-end: quality, cost, scope, security, compliance, etc.



Thank You and until the next one! 😉👍

Published on Jul 25, 2022 by Vinicius Moll
