Introduction to System Design


System Design is the process of defining the architecture, components, modules, interfaces, and data for a system to fulfill specific business requirements, while ensuring scalability, maintainability, and performance.


Load Balancing


In System Design, Load Balancing refers to the practice of distributing incoming network traffic or workload across multiple servers or resources to optimize resource use and ensure high availability.

Load Balancing

To fully leverage scalability and redundancy, load balancing can occur at different layers: between the user and the web server, between the web server and an internal platform server, and between the internal platform server and the database, as illustrated in the following image:

Load Balancing at Different Layers

The typical process of load balancing involves the following steps:

  1. The load balancer receives a request from the client.
  2. The load balancer evaluates the request and routes it to a server based on the chosen load balancing algorithm.
  3. The selected server or resource processes the request and sends the response back to the load balancer.
  4. The load balancer receives the response and forwards it to the client.

Process of Load Balancing
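
To make this flow concrete, here is a minimal sketch in Python. The server names are made up, the forwarding step is simulated rather than a real network call, and the selection logic is a placeholder for the algorithms described in the next section.

    # Simplified load-balancing flow: receive, select, forward, respond.
    servers = ["server-1", "server-2", "server-3"]     # hypothetical backend pool

    def choose(pool):
        return pool[0]   # placeholder; any algorithm below can be swapped in here

    def handle(request):
        server = choose(servers)                       # step 2: pick a server
        response = f"{server} processed {request}"     # step 3: the server does the work
        return response                                # step 4: response goes back to the client

    print(handle("GET /index.html"))                   # step 1: a client request arrives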


Load Balancing Algorithms


A load balancing algorithm is a method used by a load balancer to determine how an incoming request should be distributed across multiple servers. Commonly used load balancing algorithms include Round Robin, Least Connections, Weighted Round Robin, Weighted Least Connections, IP Hash, Least Response Time, Random, and Least Bandwidth.


Round Robin


The Round Robin algorithm distributes requests evenly across multiple servers in a circular manner. This algorithm does not consider the current load or capabilities of each server. It is commonly used in environments where servers have similar capacity and performance, or in applications where each request can be handled independently.

Round Robin Load Balancing Algorithm
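
As a rough illustration, here is a minimal Round Robin selector in Python; the server addresses are made up.

    # Round Robin: cycle through the server list, ignoring load and capacity.
    import itertools

    servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical pool
    rotation = itertools.cycle(servers)

    for _ in range(5):
        print(next(rotation))   # 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1, 10.0.0.2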


Least Connections


The Least Connections algorithm distributes requests to servers with the fewest active connections. It takes into account the server’s current workload, helping to prevent any single server from becoming overwhelmed. This algorithm is particularly useful in scenarios where traffic or workload is unpredictable, servers have varying capabilities, or maintaining session state is important.

Least Connections Load Balancing Algorithm
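
A minimal sketch of the selection step, assuming the balancer tracks the number of active connections per server (the counts below are made up):

    # Least Connections: route to the server with the fewest active connections.
    active_connections = {"server-1": 12, "server-2": 4, "server-3": 9}

    def pick_least_connections(connections):
        return min(connections, key=connections.get)

    print(pick_least_connections(active_connections))   # server-2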


Weighted Round Robin


The Weighted Round Robin algorithm is an enhanced version of Round Robin, where each server is assigned a weight based on its capability and workload. Servers with higher weights process more requests, helping to prevent overloading less powerful servers. This algorithm is ideal for scenarios where servers have varying processing abilities, such as in a database cluster, where nodes with higher processing power can handle more queries.

Weighted Round Robin Load Balancing Algorithm
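
One simple way to sketch this in Python is to repeat each server in the rotation according to its weight; the server names and weights are hypothetical.

    # Weighted Round Robin: higher-weight servers appear more often in the rotation.
    import itertools

    weights = {"big-node": 3, "small-node": 1}

    rotation = itertools.cycle([name for name, w in weights.items() for _ in range(w)])

    for _ in range(8):
        print(next(rotation))   # big-node appears three times for every small-node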


Weighted Least Connections


The Weighted Least Connections algorithm is a combination of the Least Connections and the Weighted Round Robin algorithms. It takes into account the number of active connections of each server and the weight assigned to a server based on its capability. Requests are routed to servers based on the load factor, which is commonly calculated using the following formula: the number of active connections of a server divided by its weight.

$$ \text{Load Factor} = \frac{\text{Number of Active Connections}}{\text{Weight of the Server}} $$

Weighted Least Connections Load Balancing Algorithm
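
A minimal sketch of the selection step using the load factor defined above; the connection counts and weights are made up.

    # Weighted Least Connections: pick the server with the lowest
    # load factor = active connections / weight.
    servers = {
        "server-1": {"connections": 30, "weight": 5},
        "server-2": {"connections": 10, "weight": 1},
    }

    def pick_weighted_least_connections(state):
        return min(state, key=lambda s: state[s]["connections"] / state[s]["weight"])

    print(pick_weighted_least_connections(servers))   # server-1 (30/5 = 6 < 10/1 = 10)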


IP Hash


The IP Hash algorithm routes requests to servers based on a hash of the client’s IP address. The load balancer applies a hash function to the client’s IP address to calculate the hash value, which is then used to determine which server will handle the current request. If the distribution of client IP addresses is uneven, some servers may receive more requests than others, leading to an imbalanced load. This algorithm is ideal for scenarios where maintaining state is important, such as online shopping carts or user sessions.

IP Hash Load Balancing Algorithm
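
A minimal sketch of IP-hash selection in Python, using a deterministic hash so that the same client IP always maps to the same (hypothetical) server:

    # IP Hash: hash the client's IP address and map it onto the server pool.
    import hashlib

    servers = ["server-1", "server-2", "server-3"]

    def pick_by_ip(client_ip):
        digest = hashlib.sha256(client_ip.encode()).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    print(pick_by_ip("203.0.113.7"))   # the same IP returns the same server every time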


Least Response Time


The Least Response Time algorithm routes incoming requests to the server with the lowest response time, ensuring efficient resource utilization and optimal client experience. It is ideal for scenarios where low latency and fast response times are crucial, such as online gaming and financial trading.

Least Response Time Load Balancing Algorithm
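
A minimal sketch of the selection step, assuming the balancer keeps recent response-time measurements per server (the values below are made up):

    # Least Response Time: route to the server with the lowest observed response time.
    response_times_ms = {"server-1": 120.0, "server-2": 35.5, "server-3": 80.2}

    def pick_fastest(times):
        return min(times, key=times.get)

    print(pick_fastest(response_times_ms))   # server-2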


Random


The Random load balancing algorithm routes incoming requests to servers randomly. It is commonly used in scenarios where the load is relatively uniform and the servers have similar capabilities.

Random Load Balancing Algorithm
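
A minimal sketch of random selection over a hypothetical server pool:

    # Random: pick any server with equal probability.
    import random

    servers = ["server-1", "server-2", "server-3"]
    print(random.choice(servers))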


Least Bandwidth


The Least Bandwidth algorithm routes incoming requests to the server that is consuming the least amount of bandwidth. It is ideal for applications with high bandwidth usage, such as video streaming, file downloads, and large file transfers.

Least Bandwidth Load Balancing Algorithm
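
A minimal sketch of the selection step, assuming the balancer tracks current bandwidth usage per server (the figures below are made up):

    # Least Bandwidth: route to the server currently serving the least traffic.
    bandwidth_mbps = {"server-1": 420.0, "server-2": 95.0, "server-3": 310.0}

    def pick_least_bandwidth(usage):
        return min(usage, key=usage.get)

    print(pick_least_bandwidth(bandwidth_mbps))   # server-2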


Redundant Load Balancers


A Single Point of Failure (SPOF) is any component in a system or infrastructure that, if it fails, causes the entire system or a significant portion of it to become unavailable. For instance, if a load balancer is responsible for routing all incoming requests to servers, its failure would result in the entire system or application becoming inoperable. To mitigate this risk, redundant load balancers can be deployed.

For example, in an active-passive setup, two load balancers are used, where both are capable of routing traffic and detecting failures. The active load balancer handles all incoming requests, and if it fails, the passive load balancer takes over to distribute requests, ensuring continuous availability. This approach helps prevent the system from being dependent on a single point of failure, as illustrated in the following diagram.

Load Balancer Cluster
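
A minimal sketch of the active-passive idea, with a placeholder health probe standing in for a real health-check endpoint:

    # Active-passive failover: route through the active balancer unless its
    # health check fails, in which case the passive balancer takes over.
    def check_health(balancer):
        return balancer["healthy"]   # placeholder; a real probe would call a health endpoint

    def route(request, active, passive):
        balancer = active if check_health(active) else passive
        return f'{balancer["name"]} routed {request}'

    active = {"name": "lb-active", "healthy": False}    # simulate a failure of the active node
    passive = {"name": "lb-passive", "healthy": True}
    print(route("GET /", active, passive))              # lb-passive routed GET /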


API Gateway


An API Gateway is a server-side component that acts as a central entry point for clients to access a collection of microservices. It receives client requests, forwards them to the appropriate microservice, and then returns the response from the server to the client. The API Gateway is responsible for various tasks, such as request routing, authentication, and rate limiting.

API Gateway
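
As a rough illustration of request routing, here is a minimal path-prefix routing table in Python; the service names and addresses are hypothetical.

    # API Gateway routing: map URL path prefixes to backend microservices.
    routes = {
        "/users": "http://user-service:8080",
        "/orders": "http://order-service:8080",
        "/payments": "http://payment-service:8080",
    }

    def route_request(path):
        for prefix, backend in routes.items():
            if path.startswith(prefix):
                return backend          # the gateway would forward the request here
        raise ValueError(f"no route for {path}")

    print(route_request("/orders/42"))  # http://order-service:8080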

The key difference between an API Gateway and a load balancer lies in their core functions. An API Gateway focuses on routing requests to specific microservices. In contrast, a load balancer is responsible for distributing incoming traffic across multiple backend servers. Additionally, while an API Gateway typically deals with requests that target specific APIs identified by unique URLs, a load balancer generally handles requests directed to a single, well-known IP address, distributing those requests to one of several backend servers based on load-balancing algorithms.

The Difference Between API Gateway and Load Balancer


Key Characteristics of Distributed Systems

Scalability


Scalability refers to a system’s ability to handle increasing workloads. In system design, there are two primary types of scaling: horizontal and vertical. Horizontal scaling involves adding more machines to distribute the load across multiple servers, while vertical scaling typically involves upgrading the hardware of a single machine.

Vertical Scaling vs. Horizontal Scaling


Availability


In system design, availability refers to the ability of a system to remain operational even in the face of failures or high demand. Factors that affect availability include redundancy, failover mechanisms, and load balancing. Redundancy involves duplicating critical components to ensure that if one fails, another can take over. Failover mechanisms refer to the ability to quickly switch to a backup system during a failure. Load balancing distributes requests across multiple servers to prevent any single point from becoming overwhelmed.

In distributed systems, there is often a trade-off between availability and consistency. The three common types of consistency models are strong, weak, and eventual consistency. The strong consistency model ensures that all replicas have the same data at all times, which can reduce availability and performance. The weak consistency model allows for temporary inconsistencies between replicas, offering improved availability and performance. The eventual consistency model guarantees that all replicas will eventually converge to the same data, balancing consistency and availability over time.


Monitoring


Monitoring in distributed systems is crucial for identifying issues and ensuring the overall health of the system. It typically involves four key components: metrics collection, distributed tracing, logging, and anomaly detection.

Metrics collection involves gathering and analyzing key performance indicators such as latency, throughput, error rates, and resource utilization. This helps identify performance bottlenecks, potential issues, and areas for optimization. Common tools for metrics collection include Prometheus, Graphite, and InfluxDB.
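
As a small illustration of metrics collection, here is a sketch using the Python prometheus_client library; the metric names, the simulated workload, and the port are made up.

    # Expose a request counter and a latency histogram for Prometheus to scrape.
    import random
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests", "Total requests handled")   # exported as app_requests_total
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    def handle_request():
        REQUESTS.inc()
        with LATENCY.time():                       # records how long the block takes
            time.sleep(random.uniform(0.01, 0.1))  # simulated work

    if __name__ == "__main__":
        start_http_server(8000)                    # metrics at http://localhost:8000/metrics
        while True:
            handle_request()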

Distributed tracing is a technique for tracking and analyzing requests as they pass through various services, helping identify issues within specific services. Zipkin is a commonly used tool for distributed tracing.

Logging refers to the collection, centralization, and analysis of logs from all services in a distributed system. It provides valuable insights into system behavior, aiding in debugging and troubleshooting. Tools like the ELK Stack (Elasticsearch, Logstash, Kibana) are used for logging.

Anomaly detection involves monitoring for unusual behaviors or patterns and notifying the appropriate team when such events occur. Tools like Grafana can be used for anomaly detection in distributed systems.


Caching

Introduction to Caching


In system design, caching is a high-speed storage layer positioned between the application and the original data source, such as a database or a remote web service. The primary goal of caching is to minimize access to the original data source, thereby improving application performance. Caching can be implemented in various forms, including in-memory caching, disk caching, database caching, and CDN caching:

  • In-memory caching stores data directly in the computer’s memory, offering faster access than disk storage.
  • Disk caching stores data on disk, which is slower than memory but faster than fetching data from an external source.
  • Database caching stores data within the database itself, reducing the need to access external storage systems.
  • CDN caching stores data on a distributed network of servers, minimizing latency when accessing data from remote locations.

Here are some key caching terms:

  1. Cache Hit: Occurs when the requested data is found in the cache.
  2. Cache Miss: Happens when the requested data is not found in the cache, requiring a fetch from the original data source.
  3. Cache Eviction: The process of removing data from the cache, often to make room for new data based on a predefined cache eviction policy.
  4. Cache Staleness: Refers to the situation where the cached data is outdated compared to the original data source.
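
To make these terms concrete, here is a minimal in-memory cache in Python; slow_lookup is a hypothetical stand-in for the original data source.

    cache = {}

    def slow_lookup(key):
        return f"value-for-{key}"        # pretend this is an expensive database call

    def get(key):
        if key in cache:
            print(f"cache hit: {key}")
            return cache[key]
        print(f"cache miss: {key}")      # fetch from the original source and store it
        cache[key] = slow_lookup(key)
        return cache[key]

    get("user:1")   # cache miss
    get("user:1")   # cache hit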

Here are some of the most common types of caching:

Types of Caching


Caching Replacement Policies


When cached data becomes outdated, it should be removed. Therefore, specifying a cache replacement policy is crucial when implementing caching. Common cache replacement policies include LRU (Least Recently Used), LFU (Least Frequently Used), FIFO (First In, First Out), and Random Replacement.

  • LRU (Least Recently Used) removes the least recently accessed data when the cache becomes full. It assumes that data accessed recently is more likely to be accessed again in the future (see the sketch after this list).
  • LFU (Least Frequently Used) removes the least frequently accessed data when the cache is full. It assumes that data accessed more frequently is more likely to be accessed again in the future.
  • FIFO (First In, First Out) removes the oldest data when the cache becomes full. It assumes that the oldest data are least likely to be accessed in the future.
  • Random Replacement removes random data when the cache is full. This policy can be useful when the data access pattern is unpredictable.
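
Here is a minimal LRU cache sketch in Python built on collections.OrderedDict; the capacity and keys are arbitrary.

    # LRU eviction: the least recently used key is dropped when capacity is exceeded.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()

        def get(self, key):
            if key not in self.data:
                return None
            self.data.move_to_end(key)          # mark as most recently used
            return self.data[key]

        def put(self, key, value):
            self.data[key] = value
            self.data.move_to_end(key)
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)   # evict the least recently used entry

    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")            # "a" becomes the most recently used entry
    cache.put("c", 3)         # evicts "b"
    print(list(cache.data))   # ['a', 'c']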


Cache Invalidation


Cache Invalidation is the process of marking data in the cache as stale or directly removing it from the cache, ensuring that the cache doesn’t serve outdated or incorrect data. Common cache invalidation schemes include write-through cache, write-around cache, write-back cache, and write-behind cache.

  • Write-Through Cache: In a write-through scheme, data is written to both the cache and the data store simultaneously. This ensures that the cache always serves the most up-to-date data, but it introduces latency, as each write must complete in both places before returning (a sketch follows this list). Write-Through Cache
  • Write-Around Cache: In a write-around scheme, data is directly written to the underlying data store, bypassing the cache. This prevents the cache from becoming flooded with less frequently accessed data, while ensuring that the data store always holds the most recent data. However, this can result in a cache miss when requesting data that was recently written. Write-Around Cache
  • Write-Back Cache: In a write-back scheme, data is written only to the cache, and the write operation is immediately confirmed. The data is written to the data store only when the cached data is evicted. This ensures low latency and high throughput but may lead to data loss in the event of a crash, as the data is stored only in the cache. Write-Back Cache
  • Write-Behind Cache: Similar to the write-back scheme, the write-behind cache writes data to the underlying store after a specified delay. The key difference is that in a write-back cache, data is only written to the data store when necessary (e.g., upon eviction). In contrast, in a write-behind cache, data is written to the data store at regular intervals.
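
A minimal sketch contrasting write-through and write-back behavior; plain dictionaries stand in for the cache and the data store.

    cache, store = {}, {}
    dirty_keys = set()

    def write_through(key, value):
        cache[key] = value
        store[key] = value          # both writes complete before returning

    def write_back(key, value):
        cache[key] = value          # only the cache is updated now...
        dirty_keys.add(key)         # ...the store is updated later, on eviction or flush

    def flush():
        for key in dirty_keys:
            store[key] = cache[key]
        dirty_keys.clear()

    write_through("a", 1)
    write_back("b", 2)
    print(store)    # {'a': 1}  -- 'b' reaches the store only after a flush
    flush()
    print(store)    # {'a': 1, 'b': 2}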

Here are some of the most commonly used cache invalidation methods:

  • Purge: The purge method removes cached data immediately. When a purge request is received, the cached data is deleted, and the next time the key is requested, the cache fetches a fresh copy from the original data store, stores it, and returns it.
  • Refresh: The refresh method updates the cached data with the latest data from the original data store. This ensures the cache always holds the most up-to-date data.
  • Ban: The ban method blocks access to certain cached data. When a ban command is issued, the key is added to a ban list. From then on, every request is checked against the ban list. If the requested resource matches an invalidation rule, the cache treats it as stale, fetches a fresh copy from the original data store, and updates the cache. Unlike purge, which removes the cache entry immediately, ban simply marks the entry as stale, and the data remains in the cache until it is evicted.
  • Time To Live (TTL) Expiration: This method involves setting a time-to-live (TTL) value for cached data. Once the TTL expires, the data is considered stale. When a request is made, the cache checks the TTL of the cached data and serves it only if it hasn’t expired. If it has expired, the cache fetches a fresh copy from the original data store and returns it (see the sketch below).
  • Stale-While-Revalidate: This method serves the cached data to the client immediately when a request is received. Meanwhile, the cache asynchronously updates itself with the latest version of the data from the original data store. This approach ensures a quick response, even if the cached data is slightly outdated, and is commonly used in CDN caching.

Cache Invalidation Methods
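
A minimal sketch of TTL-based expiration; fetch_from_source is a hypothetical stand-in for the original data store, and the TTL value is arbitrary.

    import time

    TTL_SECONDS = 60
    cache = {}   # key -> (value, stored_at)

    def fetch_from_source(key):
        return f"fresh-{key}"

    def get(key):
        entry = cache.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.time() - stored_at < TTL_SECONDS:
                return value                     # still fresh: serve from the cache
        value = fetch_from_source(key)           # missing or expired: refetch
        cache[key] = (value, time.time())
        return value

    print(get("user:1"))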


Cache Read Strategies


Cache read strategies define how the cache behaves when a cache miss occurs. The most common cache read strategies are read-through and read-aside.

  • In a read-through cache strategy, the cache is responsible for fetching fresh data from the original data store when a cache miss occurs. If the requested data is not found in the cache, the cache automatically retrieves the data from the data store, stores it in the cache, and returns it to the client. Read-Through Cache
  • In a read-aside cache strategy, the application first requests the data from the cache. If the data is found, the cached data is used. However, if the data is not found, the application fetches the data from the underlying data store, updates the cache with the retrieved data, and then uses it (see the sketch after this list). Read-Aside Cache
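
A minimal sketch of the read-aside pattern, where the application itself handles the miss; in a read-through setup, the same fetch-and-fill logic would live inside the cache layer instead. The data-store contents are made up.

    cache = {}
    data_store = {"user:1": "Alice"}   # hypothetical backing store

    def get_read_aside(key):
        if key in cache:
            return cache[key]              # cache hit
        value = data_store.get(key)        # cache miss: the application fetches the data...
        cache[key] = value                 # ...and updates the cache itself
        return value

    print(get_read_aside("user:1"))   # miss, then cached
    print(get_read_aside("user:1"))   # hit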


Cache Coherence and Cache Consistency


Cache coherence refers to the consistency of data stored in multiple caches that are part of the same system, particularly in multi-core systems. In a distributed system, each cache may store a local copy of shared data. When one cache modifies its copy, all other caches holding a copy of that data must be updated or invalidated to maintain consistency. The most common protocols for achieving cache coherence are write-invalidate and write-update.

  • Write-Invalidate: When a cache writes to its copy of shared data, it broadcasts a message to all other caches, invalidating their copies. When another cache needs the updated data, it fetches the new data from memory or from the cache that made the update.
  • Write-Update: When a cache writes to its copy of shared data, it broadcasts a message to all other caches, prompting them to update their local copies accordingly.

Cache consistency focuses on maintaining the consistency of data between the cache and the original data source. Cache consistency models define the rules governing how data is updated and accessed in a distributed system with multiple caches. These models vary in their strictness and include strict consistency, sequential consistency, causal consistency, and eventual consistency.

  • Strict Consistency: In a strict consistency model, any write to a data item is immediately visible to all caches. While this ensures data is always up-to-date, it may lead to performance issues due to the significant synchronization overhead required.
  • Sequential Consistency: In a sequential consistency model, all operations on data items must appear in a specific, sequential order across all caches. This model ensures a predictable order of operations but may not guarantee the exact real-time visibility of changes.
  • Causal Consistency: In a causal consistency model, operations that are causally related (e.g., one operation depends on the outcome of another) are guaranteed to appear in order across multiple caches. Operations that are not causally related can occur in any order. This model strikes a balance between consistency and performance.
  • Eventual Consistency: In an eventual consistency model, updates to a data item will eventually propagate to all caches, but there is no guarantee regarding the order or timing of these updates. This model offers the best performance but the weakest consistency guarantees, making it ideal for distributed systems where performance and scalability are prioritized over strict consistency.


Caching Challenges


The main cache-related issues include Thundering Herd, Cache Penetration, Cache Stampede, and Cache Pollution.

  • Thundering Herd: This problem arises when a popular piece of data expires, leading to a sudden surge of requests to the original server, resulting in performance degradation. Solutions include staggered expiration times, cache locking, or background updates before expiration.

  • Cache Penetration: This occurs when multiple requests for non-existent data bypass the cache, querying the original data store directly. Solutions include negative caching (caching “not found” responses) or using a Bloom Filter to check the existence of data before querying the cache.

  • Cache Stampede: This happens when multiple requests for the same data arrive after the cached data expires, causing a heavy load on the original data source. Solutions typically involve request coalescing (letting one request fetch the data while others wait; see the sketch after this list) or implementing a read-through cache, where the cache itself fetches missing data. Cache Stampede

  • Cache Pollution: This occurs when less frequently accessed data displaces more frequently accessed data, reducing cache hit rates. To mitigate cache pollution, eviction policies such as LRU (Least Recently Used) or LFU (Least Frequently Used) can be implemented.
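
A minimal sketch of request coalescing with a per-key lock; expensive_fetch is a hypothetical stand-in for the original data store.

    # Only one thread rebuilds a missing cache entry; concurrent requests
    # for the same key wait on the lock instead of hitting the data store.
    import threading
    from collections import defaultdict

    cache = {}
    locks = defaultdict(threading.Lock)

    def expensive_fetch(key):
        return f"value-for-{key}"

    def get(key):
        if key in cache:
            return cache[key]
        with locks[key]:
            if key not in cache:             # re-check: another thread may have filled it
                cache[key] = expensive_fetch(key)
        return cache[key]

    print(get("popular-item"))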


Data Partitioning


In system design, data partitioning is a technique used to break large datasets into smaller, more manageable units called partitions. Each partition is independent and contains a subset of the overall data. Common data partitioning methods include horizontal partitioning and vertical partitioning.

  • Horizontal partitioning, also known as sharding, involves dividing a database into multiple shards, with each shard containing a subset of rows. These shards are typically stored on different database servers, allowing for parallel processing and improving query execution times.
  • Vertical partitioning divides a database into multiple partitions based on columns, with each partition containing a subset of the columns. This technique is commonly used when some fields are accessed more frequently than others, optimizing performance for specific queries.

Horizontal Partitioning vs. Vertical Partitioning


Data sharding splits a table horizontally. Here are some common sharding techniques:

  • Range-Based Sharding: Data is divided based on a specific range, such as numeric ranges or dates. For example, an e-commerce platform may partition the order table by order date. Range-Based Sharding
  • Hash-Based Sharding: Data is partitioned by applying a hash function to a partition key. The hash value determines which shard will store the data (see the sketch after this list). Hash-Based Sharding
  • Directory-Based Sharding: Data is partitioned based on a lookup table that tracks which shard contains which data. Directory-Based Sharding
  • Geographical Sharding: Data is divided based on geographical locations, such as countries or regions.
  • Hybrid-Based Sharding: A combination of multiple sharding strategies to optimize system performance. Hybrid-Based Sharding
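
A minimal sketch of hash-based sharding; the shard names and partition key are hypothetical.

    # Hash-based sharding: a hash of the partition key decides which shard stores a row.
    import hashlib

    SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

    def shard_for(partition_key):
        digest = hashlib.sha256(partition_key.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("order-10293"))   # the same key always maps to the same shard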


Proxy


A proxy is an intermediary server or software that sits between the client and the internet, typically used for tasks like filtering, caching, or security checks. Proxies can consolidate multiple client requests into a single request, a process known as collapsed forwarding. For instance, if multiple clients request the same resource, the proxy can cache the resource and serve it to those clients without having to forward the request to the origin server each time. There are two main types of proxies: forward proxy and reverse proxy.

  • A forward proxy acts on behalf of the client, hiding its identity by forwarding requests from the client to the server. It is often used to mask the client’s IP address, bypass geo-restrictions, or cache content for faster access.

  • A reverse proxy, on the other hand, acts on behalf of the server, intercepting incoming requests from clients and directing them to the appropriate backend server. This type of proxy is commonly used for load balancing, caching, and enhancing security by hiding the backend server details from clients.

Forward Proxy vs. Reverse Proxy


Replication


Replication Methods


CAP Theorem


Database Federation


Security


Distributed Messaging System


Distributed File System


Misc Concepts

Bloom Filters


Long-Polling, WebSockets, and Server-Sent Events


Quorum


Heartbeat


Leader and Follower


Message Queues vs. Service Bus


Stateful vs. Stateless Architecture


Event-Driven vs. Polling Architecture