Maybe you’re thinking: SREs shouldn’t have to care about system design, right? Right?
No, that’s not the case!
SREs should always care about system design and should be able to design complex, distributed systems.
With that in mind, I’ve decided to write this post.
What are distributed systems?
Characteristics:
- No shared clock
- No shared memory
- Shared resources
- Concurrency and consistency
- Require agreement on a common format or protocol
Benefits:
- More reliable and fault tolerant
- Scalability
- Lower latency and increased performance
- Cost effective
Performance metrics for system design
- Scalability - the ability of a system to grow and manage increased traffic, i.e., to handle an increased volume of data or requests
- Reliability - the probability that a system works without failure over a given period of time; software reliability is slightly harder to define than hardware reliability
- Mean Time Between Failures (MTBF) = (total elapsed time - total downtime) / number of failures. Example: (24 hours - 4 hours of downtime) / 4 failures = 5 hours MTBF
- Availability - the amount of time a system is operational during a period of time; poorly designed software that requires downtime for updates is less available
- Availability % = (available time / total time) x 100. Example: (23 hours / 24 hours) x 100 = 95.83%
Table of downtime for the 9’s of availability
| Percentage | Annual downtime |
|---|---|
| 99% | 3 days, 15 hours, 40 minutes |
| 99.9% | 8 hours, 46 minutes |
| 99.99% | 52 minutes, 36 seconds |
| 99.999% | 5.26 minutes |
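To sanity-check these numbers, here is a minimal Python sketch (my own addition, not part of the original material) that converts an availability percentage into its annual downtime budget:

```python
# Convert an availability percentage into the maximum allowed annual downtime.
SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_downtime(availability_pct: float) -> str:
    downtime = SECONDS_PER_YEAR * (1 - availability_pct / 100)
    hours, rem = divmod(int(downtime), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{availability_pct}% available -> {hours}h {minutes}m {seconds}s of downtime per year"

for nines in (99.0, 99.9, 99.99, 99.999):
    print(annual_downtime(nines))
```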
Reliable vs Available system
Example: think about an airplane. A plane must be reliable: once in the air, it cannot fail. Availability, on the other hand, can be maintained with a backup plane:
- A reliable system is always an available system
- Availability can be maintained through redundancy, but such a system may not be reliable
- Reliable software is more profitable, because providing the same service requires fewer backup resources
- Requirements will depend on the function of the software
While designing a scalable system, the most important aspect is defining how the data will be partitioned and replicated across servers. Let’s first define these terms before moving on:
Data partitioning
It is the process of distributing data across a set of servers. It improves the scalability and performance of the system.
Data replication
It is the process of making multiple copies of data and storing them on different servers. It improves the availability and durability of the data across the system.
Data partition and replication strategies lie at the core of any distributed system. A carefully designed scheme for partitioning and replicating the data enhances the performance, availability, and reliability of the system and also defines how efficiently the system will be scaled and managed.
Consistent Hashing
David Karger et al. first introduced Consistent Hashing in their 1997 paper and suggested its use in distributed caching.
Later, Consistent Hashing was adopted and enhanced for use across many distributed systems.
Distributed systems can use Consistent Hashing to distribute data across nodes.
- Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.
Consistent Hashing stores the data managed by a distributed system on a ring, and each node on the ring is assigned a range of the data.
Virtual nodes
Adding and removing nodes in any distributed system is quite common. Existing nodes can die and may need to be decommissioned. Similarly, new nodes may be added to an existing cluster to meet growing demands. To handle these scenarios efficiently, Consistent Hashing makes use of virtual nodes (or vnodes): each physical node is assigned several positions on the ring, which spreads its load more evenly and keeps the amount of data that moves small when nodes join or leave.
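To make the mechanics concrete, here is a minimal Python sketch of a consistent hash ring with virtual nodes. It is an illustration of the technique only, not the implementation used by Dynamo, Cassandra, or any other real system:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes (vnodes)."""

    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes = vnodes_per_node
        self.ring = []    # sorted hash positions of all vnodes
        self.owners = {}  # hash position -> physical node

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each physical node gets many positions on the ring.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self.ring, pos)
            self.owners[pos] = node

    def remove_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring.remove(pos)
            del self.owners[pos]

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash.
        idx = bisect.bisect(self.ring, self._hash(key)) % len(self.ring)
        return self.owners[self.ring[idx]]

ring = ConsistentHashRing()
for node in ("node-a", "node-b", "node-c"):
    ring.add_node(node)
print(ring.get_node("user:42"))  # adding or removing a node only remaps a small share of keys
```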
Consistent Hashing in System Design
Consistent Hashing helps with efficiently partitioning and replicating data; therefore, any distributed system that needs to scale up or down or wants to achieve high availability through data replication can utilize Consistent Hashing.
A few such examples could be:
- Any system that works with a set of storage (or database) servers and needs to scale up or down based on usage, e.g., a system that needs more storage during the Christmas season because of high traffic.
- Any distributed system that needs dynamic adjustment of its cache usage by adding or removing cache servers based on the traffic load.
- Any system that wants to replicate its data shards to achieve high availability.
Consistent Hashing use cases
Amazon’s Dynamo and Apache Cassandra use Consistent Hashing to distribute and replicate data across nodes.
Efficiency - How well the system performs
Latency and throughput are often used as metrics.
- Latency Key Takeaways:
- Avoid Network calls whenever possible
- Replicate data across data centres for disaster recovery as well as performance
- Use CDNs to reduce latency
- Keep frequently accessed data in memory if possible rather than seeking from disk, caching
Manageability - the speed and difficulty involved in maintaining a system
- Observability: how hard is it to track down bugs?
- Difficulty of deploying updates
- Want to abstract away infrastructure so product engineers don’t have to worry about it
Back-of-the-envelope estimates for scalable system design
Here follow some of the numbers every SRE should know:
Data Conversions
| Value | Unit |
|---|---|
| 8 bits | 1 byte |
| 1024 bytes | 1 kilobyte |
| 1024 kilobytes | 1 megabyte |
| 1024 megabytes | 1 gigabyte |
| 1024 gigabytes | 1 terabyte |
Common Data Types
| Data type | Size |
|---|---|
| Char | 1 byte |
| Integer | 4 bytes |
| UNIX timestamp | 4 bytes |
Time
- 60 seconds x 60 minutes = 3,600 seconds per hour
- 3,600 seconds x 24 hours = 86,400 seconds per day
- 86,400 seconds x 30 days ≈ 2,500,000 seconds per month
Traffic Estimates
- Estimate total number of requests app will receive
- Average Daily Active Users (DAU) x Average reads/writes per User
Samples:
- 10 Million DAU x 30 photos viewed = 300 Million photo requests
- 10 Million DAU x 1 photo uploaded = 10 Million photo writes
- 300 Million requests / 86,400 seconds ≈ 3,472 requests per second
- 10 Million writes / 86,400 seconds ≈ 115 writes per second
Memory
Read requests per day x average request size x 0.2 (assume roughly the hottest 20% of data is cached)
- 300 Million requests x 500 bytes = 150 gigabytes
- 150 GB x 0.2 (20%) = 30 gigabytes
- 30 GB x 3 (replication) = 90 gigabytes
Bandwidth
Requests per day x Request size
- 300 Million requests x 1.5 MB = 450,000 gigabytes
- 450,000 GB / 86,400 seconds ≈ 5.2 GB per second
Storage
- Writes per day x size of write x time to store the data
- 10 Million writes x 1.5 MB = 15 TB per day
- 15 TB x 365 days x 10 years ≈ 55 petabytes
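Putting the estimates above together, here is a short Python sketch of the same back-of-the-envelope math; the DAU, request sizes, and replication factor are the example figures from this section, not real measurements:

```python
# Back-of-the-envelope capacity estimates for a photo-sharing app,
# using the example numbers from the section above.
DAU = 10_000_000            # daily active users
READS_PER_USER = 30         # photos viewed per user per day
WRITES_PER_USER = 1         # photos uploaded per user per day
PHOTO_SIZE_MB = 1.5
CACHED_FRACTION = 0.2       # keep the hottest 20% of reads in memory
REPLICATION = 3
SECONDS_PER_DAY = 86_400

reads_per_day = DAU * READS_PER_USER
writes_per_day = DAU * WRITES_PER_USER

print(f"Reads/s:   {reads_per_day / SECONDS_PER_DAY:,.0f}")
print(f"Writes/s:  {writes_per_day / SECONDS_PER_DAY:,.0f}")

cache_gb = reads_per_day * 500 / 1e9 * CACHED_FRACTION * REPLICATION  # 500-byte record per read
print(f"Cache:     {cache_gb:,.0f} GB")

bandwidth_gb_s = reads_per_day * PHOTO_SIZE_MB / 1000 / SECONDS_PER_DAY
print(f"Bandwidth: {bandwidth_gb_s:.1f} GB/s")

storage_pb = writes_per_day * PHOTO_SIZE_MB / 1e9 * 365 * 10
print(f"Storage (10 years): {storage_pb:.0f} PB")
```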
Distributed Systems advanced concepts
Now I’ll cover some other important and advanced concepts/properties of distributed systems.
Idempotence
- A retried POST should create only one record
- Use a client-generated ID (a GUID) with a unique constraint on the table, so retries never create duplicate rows
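As a rough illustration of the idea (my own sketch, with an in-memory dict standing in for a table with a unique constraint), an idempotent create lets the client supply a GUID and returns the existing record on retries:

```python
import uuid

# In-memory stand-in for a table with a unique constraint on a client-supplied GUID.
records_by_guid: dict[str, dict] = {}

def create_record(client_guid: str, payload: dict) -> dict:
    """Idempotent create: retrying with the same GUID never creates a duplicate."""
    if client_guid in records_by_guid:
        return records_by_guid[client_guid]       # duplicate request, return existing row
    record = {"id": client_guid, **payload}
    records_by_guid[client_guid] = record
    return record

guid = str(uuid.uuid4())                          # generated once on the client
first = create_record(guid, {"text": "hello"})
retry = create_record(guid, {"text": "hello"})    # the same POST, retried after a timeout
assert first is retry                             # still only one record
```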
CQRS - Command Query Responsibility Segregation
- Commands - persist a change and return no information -> writes
- Queries - change nothing and return information -> reads
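A minimal sketch of the idea in Python (illustrative only; the CreateTweet and GetTweet names are hypothetical, not from the original post):

```python
from dataclasses import dataclass
from typing import Optional

# Write side: commands mutate state and return nothing.
@dataclass
class CreateTweet:
    tweet_id: str
    text: str

def handle_command(store: dict, cmd: CreateTweet) -> None:
    store[cmd.tweet_id] = cmd.text            # persist the change, return no data

# Read side: queries return data and change nothing.
@dataclass
class GetTweet:
    tweet_id: str

def handle_query(store: dict, query: GetTweet) -> Optional[str]:
    return store.get(query.tweet_id)          # read-only

store: dict = {}
handle_command(store, CreateTweet("1", "hello CQRS"))
print(handle_query(store, GetTweet("1")))
```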
Immutable by Default:
- Make all data immutable
- Reliable audit logs
- Do not destroy data
- Preserve metadata
Horizontal vs Vertical scaling
Vertical Scaling:
- Easiest way to Scale an Application
- Diminishing returns, limits to scalability
- Single point of failure. :-(
Horizontal Scaling:
- More complexity up front, but more efficient long term
- Redundancy built in
- Need load balancer to distribute traffic
- Cloud providers make this easier
Buzzwords
- Kubernetes
- Docker
- Hadoop
- Snowflake
Diminishing returns
A chart of time elapsed versus number of CPU cores (1, 2, 4, 8, ..., N) shows that each additional core buys a smaller speedup than the last.
Latency:
- Use geographically distributed availability zones and CDNs to deliver content close to users
Fault Tolerance
Failures are not avoidable in any system and will happen all the time, hence we need to build systems that can tolerate failures or recover from them.
In systems, failure is the norm rather than the exception.
- “Anything that can go wrong will go wrong” – Murphy’s Law
- “Complex systems contain changing mixtures of failures latent within them” – How Complex Systems Fail.
Fault Tolerance - Failure Metrics
Common failure metrics that get measured and tracked for any system.
- Mean time to repair (MTTR): The average time to repair and restore a failed system.
- Mean time between failures (MTBF): The average operational time between one device failure or system breakdown and the next.
- Mean time to failure (MTTF): The average time a device or system is expected to function before it fails.
- Mean time to detect (MTTD): The average time between the onset of a problem and when the organization detects it.
- Mean time to investigate (MTTI): The average time between the detection of an incident and when the organization begins to investigate its cause and solution.
- Mean time to restore service (MTRS): The average elapsed time from the detection of an incident until the affected system or component is again available to users.
- Mean time between system incidents (MTBSI): The average elapsed time between the detection of two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS (MTBSI = MTBF + MTRS).
- Failure rate: Another reliability metric, which measures the frequency with which a component or system fails. It is expressed as a number of failures over a unit of time.
System Design Components
Load balancers
- Balance incoming traffic to multiple servers
- Software or Hardware based
- Used to improve reliability and scalability of application
- Nginx, HAProxy, F5, Citrix
LB Routing Methods
Round Robin
- Simplest type of routing
- Can result in uneven traffic
Least Connections
- Routes based on number of client connections to server
- Useful for chat or streaming applications
Least Response Time
- Routes based on how quickly servers respond
IP Hash
- Routes client to server based on IP
- Useful for stateful sessions
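As a toy illustration of two of these methods (my own sketch, not how Nginx or HAProxy implement them), round robin and IP hash can be expressed in a few lines of Python:

```python
import hashlib
from itertools import cycle

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Round robin: hand requests to servers in a fixed rotation; simple, but ignores load.
round_robin = cycle(servers)

def pick_round_robin() -> str:
    return next(round_robin)

# IP hash: the same client IP always lands on the same server (useful for sticky sessions).
def pick_ip_hash(client_ip: str) -> str:
    digest = int(hashlib.sha1(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print([pick_round_robin() for _ in range(4)])  # rotates through the pool
print(pick_ip_hash("203.0.113.7"))             # stable choice for this client
```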
Load balancing can be applied at Layer 4 or Layer 7
Layer 4
- Only has access to TCP and UDP data
- Faster
- Lack of information can lead to uneven traffic
Layer 7
- Full access to HTTP protocol and data
- SSL termination
- Check for authentication
- Smarter routing options
Caching
- Improves the performance of the application
- Saves money
Numbers Programmers Should Know
Costs of Operations:
- Read from disk
- Read from memory
- Local Area Network (LAN) round-trip
- Cross-Continental Network
Latency Numbers (Memory / Disk / Network)
It is possible to read:
- sequentially from HDD at a rate of ~200 MB per second
- sequentially from SSD at a rate of ~1 GB per second
- sequentially from main memory at a rate of ~100 GB per second (burst rate)
- sequentially from 10 Gbps Ethernet at a rate of ~1000 MB per second
No more than 6-7 round trips between Europe and the US per second are possible, but approximately 2000 per second can be achieved within a datacenter.
Speed and Performance - cache data in between the database and the app
- Reading from memory is much faster than reading from disk (50-200x faster)
- Most apps have far more reads than writes, which makes them perfect for caching
Caching Layers
- DNS - google.com resolves to an IP address that is cached in the browser
- CDN - e.g., Cloudflare
- Application
- Database - frequently accessed data kept in a cache
Distributed Cache
- Works the same way as a traditional cache
- Has built-in functionality to replicate data
- Can shard data across servers and locate the proper server for each key
- An active cluster (nodes A, B, C) can be paired with a passive, warmed-up cluster (A, B, C) that serves as a backup of the cache
- Only store frequently accessed data so the cache is used efficiently
TTL - Time to Live - time period before a cache entry is deleted
- Used to prevent stale data
LRU/LFU
Least Recently used
- Once the cache is full, evict the least recently accessed key to make room for the new key
Least Frequently used
- Track number of times key is accessed
- Drop least used when cache is full
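A minimal LRU cache sketch using Python's standard-library OrderedDict (illustrative only; in production this is usually delegated to Redis or Memcached eviction policies):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used key once capacity is reached."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value) -> None:
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # touch "a" so "b" becomes the LRU entry
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # None
```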
Caching Strategies
- Cache aside - most common
- Read through
- Write through
- Write back
Implementation:
Caching pseudocode - retrieval
cache = {}  # module-level cache shared across requests

def app_request(tweet_id):
    data = cache.get(tweet_id)
    if data:
        return data                 # cache hit
    else:
        data = db_query(tweet_id)   # cache miss: read from the database
        cache[tweet_id] = data      # set data in cache for the next read
        return data
How to write/update on cache:
def app_update(tweet_id, data):
    db_update(tweet_id, data)
    cache.pop(tweet_id, None)  # remove from cache to avoid serving stale data
Database Scaling
- Ways to improve DB performance: scale horizontally with replication and partitioning
- Most web apps are ~95% reads (e.g., Twitter, Facebook, Google)
- Write once, read many times
Basic Scaling Techniques
- Indexes
  - Index based on a column
  - Speeds up read performance
  - Writes and updates become slightly slower
  - More storage is required for the index
  - (A quick demonstration follows after this list.)
- Denormalisation
  - Add redundant data to tables to reduce joins
  - Boosts read performance
  - Slows down writes
  - Risk of inconsistent data across tables
  - Code is harder to write
- Connection pooling
  - Allows multiple application threads to use the same DB connection
  - Saves the overhead of opening independent DB connections
- Caching
  - Cache sits in front of the DB to handle serving content
  - Can’t cache everything
  - Redis / Memcached
- Vertical scaling - get a bigger server (more CPU, more memory, faster disk, more network throughput)
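Here is a small, self-contained demonstration of the index trade-off using Python's built-in sqlite3 module (my own sketch; exact timings will vary by machine):

```python
import sqlite3
import time

# Demonstrates how an index on a column speeds up reads
# (at the cost of extra storage and slightly slower writes).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (username, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(100_000)],
)

def timed_lookup() -> float:
    start = time.perf_counter()
    conn.execute("SELECT id FROM users WHERE username = ?", ("user99999",)).fetchone()
    return time.perf_counter() - start

scan_time = timed_lookup()                                         # full table scan
conn.execute("CREATE INDEX idx_users_username ON users (username)")
index_time = timed_lookup()                                        # index lookup
print(f"without index: {scan_time:.6f}s, with index: {index_time:.6f}s")
```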
Replication and Partitioning
- Read replicas
  - Create replica servers to handle reads
  - Master server dedicated only to writes
  - Have to make sure new data reaches the replicas
  - DB master/main plus read replicas - issues with consistency of data / stale reads
- Horizontal partitioning - sharding
  - Schema of the table stays the same, but it is split across multiple DBs, e.g., Master 1 holds A-M and Master 2 holds N-Z (see the sketch after this list)
  - Downsides - hot keys, no joins across shards, can result in unbalanced servers and traffic
  - Instagram - Justin Bieber posts a picture and the servers go crazy
- Vertical partitioning
  - Divide up the schema of the database into separate tables
  - Generally divided by functionality
  - Best when most of the data in a row isn’t needed for most queries
  - Table (ID, Username, Email, Address)
  - Split into two tables: Table1 (ID, Username) and Table2 (UserID, Email, Address)
  - Instagram keeps like counts in separate tables/databases as a vertical partition
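A minimal sketch (my own illustration) of routing keys to shards, using the A-M / N-Z split from the example above alongside a hash-based alternative:

```python
# Range-based sharding: usernames starting A-M go to master 1, N-Z to master 2.
def range_shard(username: str) -> str:
    return "master-1" if username[0].upper() <= "M" else "master-2"

# Hash-based sharding: spreads keys more evenly, but gives up range queries
# and can still suffer from hot keys (e.g., one very popular account).
def hash_shard(user_id: int, num_shards: int = 4) -> str:
    return f"shard-{user_id % num_shards}"

print(range_shard("alice"))   # master-1
print(range_shard("zoe"))     # master-2
print(hash_shard(123456))     # shard-0
```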
The Devil You Know - When to consider NoSQL
- Up front, you know what you are sacrificing
- Choose NoSQL when you know which trade-offs you need to make
- If you don’t need transactions or perfect consistency (e.g., for social media), NoSQL can be a good fit
System Design Outline
This is a recommended approach to solving system design problems. Not all of these topics will be relevant for a given problem.
- Functional Requirements / Clarifications / Assumptions
- Non-Functional Requirements
  - Consistency vs Availability
  - Latency
    - How fast does this system need to be?
    - User-perceived latency
    - Data processing latency
  - Security
    - Potential attacks? How should they be mitigated?
  - Privacy
    - PII, PCI, or other user data
  - Data Integrity
    - How to recover from data corruption or loss?
  - Read:Write Ratio
- APIs
  - Public and/or internal APIs?
  - Are the APIs simple and easy to understand?
  - How are entities identified?
- Storage Schemas
  - SQL vs NoSQL
- Message Queues
- System Design
  - Scalability
    - How does the system scale? Consider both data size and traffic increases.
    - What are the bottlenecks? How should they be addressed?
    - What are the edge cases? What could go wrong? Assuming they occur, how should they be addressed?
    - How will we stress test this system?
  - Load Balancing
  - Auto-scaling / Replication
  - Caching
  - Partitioning
  - Replication
- Business Continuity and Disaster Recovery (BCDR)
- Internationalization / Localization
  - How to scale to multiple countries and languages? Don’t assume the US is English-only.
All that being covered, the mindset of an SRE should be simple:
You built it, you run it!
https://www.atlassian.com/br/incident-management/devops/sre
You can be an SRE, a DevOps engineer, a developer, QA, and a product manager at the same time. Don’t put yourself in a silo. Think about your customers, users, colleagues, and the communities around your service. The team should be able to understand its service end to end: quality, cost, scope, security, compliance, etc.
References
- https://learning.oreilly.com/library/view/seeking-sre/9781491978856/part03.html
- https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html
Thank You and until the next one! 😉👍
Published on Jul 25, 2022 by Vinicius Moll