What is a good design: something that is easy to change in the future
First Steps: Gather
A) Get the Context (what does the system do?), the Scope, and clarify the Functional Requirements
B) How the system will behave/Non-functional Requirements (aka the “ilities”): Reliability (available 24×7), Usability, Scalability (serve millions of users), Maintainability, Efficiency (short response latency)
C) Restrictions: Compliance, Cost, Time to market vs Features, Talent hiring
Next: Prioritize
Aspects:
Performance, Load, Scalability, Security, Configurability, Robustness (handling of timeouts/retries/external dependencies), Reliability, Maintainability, Usability, Reusability, Portability, Testability, Operability, Monitorability, Change Management
First make the system Maintainable, then Scalable, then Performant
Time to Market vs Features
Portability vs Scalability/Maintainability (if the app will never migrate to another platform, drop portability and favor scalability)
If the priorities are acceptable, proceed to designing the system. Keep in mind that architectures evolve over time and requirements will change; some changes will be expensive, so try not to over- or under-engineer.
Next: Will it Scale
Do we need to scale the requests, the data, or both? If requests stay within ~50K, vertical scaling will probably work (and prevents over-engineering). If it's more, we go for horizontal scaling. Horizontal scaling is not cheap, so decide only when the system is about to reach that threshold.
Tradeoffs: it is expensive, the CAP theorem comes into play, and latency between servers needs to be considered.
The first question for horizontal scaling is stateful vs stateless. If there is no state, we can scale out behind an LB. If there is state, we run into the CAP theorem.
CAP Implications: Strong vs Eventual Consistency (strong means every read sees the latest write; eventual means reads may briefly lag behind writes); the toy sketch below contrasts the two. Also weigh the read vs write mix.
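A toy sketch (not from the notes; all names are illustrative) of why state plus asynchronous replication gives eventual rather than strong consistency: a read from the replica is stale until replication catches up.

    # Primary applies writes immediately; the replica applies them
    # asynchronously, so a read from the replica can return stale data.
    import threading
    import time

    class Replica:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value

    class Primary:
        def __init__(self, replica, replication_lag=0.5):
            self.data = {}
            self.replica = replica
            self.lag = replication_lag

        def write(self, key, value):
            self.data[key] = value  # durable on the primary right away
            # replicate asynchronously -- the source of eventual consistency
            threading.Timer(self.lag, self.replica.apply, (key, value)).start()

    replica = Replica()
    primary = Primary(replica)
    primary.write("user:42", "new-address")
    print(replica.data.get("user:42"))  # None -> stale read (eventual)
    time.sleep(1)
    print(replica.data.get("user:42"))  # "new-address" -> replica converged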
Scale servers with an LB, and scale data with sharding/partitioning, which lets each shard scale independently. Data can be split by feature, by geo location, or by hashing; Consistent Hashing (sketched below) scales this further, since adding or removing a node remaps only a fraction of the keys. Another approach is to replicate the data, which is also useful for read-intensive applications.
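A minimal consistent-hashing sketch (a hypothetical helper, not a library API): each key maps to the first node clockwise on a hash ring, so adding or removing a shard remaps only the keys between it and its neighbor instead of rehashing everything.

    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, vnodes=100):
            self.ring = []  # sorted list of (hash, node)
            for node in nodes:
                for i in range(vnodes):  # virtual nodes smooth out the load
                    bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            # first ring position clockwise of the key's hash, wrapping around
            idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["shard-a", "shard-b", "shard-c"])
    print(ring.node_for("user:42"))  # the same key always lands on the same shard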
Next: Can it be Faster
If latency due to horizontal scaling is an issue and is costing the business, can we make it faster? Caching.
If caching is needed at the UI layer (static images/videos), use a CDN. If it's at the server layer, local server caching can help; otherwise use Distributed Caching.
Tradeoffs: caching adds cost and complications like cache invalidation and cache misses (see the cache-aside sketch below).
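A cache-aside sketch with a TTL; load_from_db and save_to_db are hypothetical placeholders for the real data source. Misses fall through to the database and populate the cache, and writes invalidate the entry, which is exactly the invalidation complication noted above.

    import time

    CACHE = {}  # key -> (value, expires_at)
    TTL_SECONDS = 60

    def load_from_db(key):
        return f"value-for-{key}"  # placeholder for the real (slow) read

    def save_to_db(key, value):
        pass  # placeholder for the real write

    def get(key):
        entry = CACHE.get(key)
        if entry and entry[1] > time.time():
            return entry[0]  # cache hit
        value = load_from_db(key)  # cache miss -> go to the source
        CACHE[key] = (value, time.time() + TTL_SECONDS)
        return value

    def put(key, value):
        save_to_db(key, value)
        CACHE.pop(key, None)  # invalidate so the next read refetches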
Example Architecture Pattern: EDA
With great decoupling comes harder tracing and more traffic.
Three components: Producer, Broker, and Consumer. Also called the Pub/Sub model. The event becomes the contract, and systems have to follow it.
Advantages: Loose coupling, Better Scalability, Dependency Inversion (services don't need to know about each other), Event Persistence. Events are immutable so they can be consumed by multiple services. Easy integration for any new service.
Trade-offs: latency added by the broker (and services reaching each other more often to get/update), eventual consistency (services read messages at their own pace), and tracing communication channels becomes difficult. Services can also break unknowingly when an event published by another service changes.
When to use: when data replication and parallel processing are required between services, and when services need loose coupling.
When a data change affects multiple parts of the system, e.g. updating a customer address in an insurance system must also update the user-management and insurance-quoting systems (sketched below). Use it when the producer does not care what happens to the event; otherwise the system has to issue orders in the form of commands.
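A minimal in-memory pub/sub sketch of the address-change example; the topic name and handlers are illustrative, and a real system would use a broker such as Kafka or RabbitMQ rather than a dict.

    from collections import defaultdict

    subscribers = defaultdict(list)  # topic -> list of consumer callbacks

    def subscribe(topic, handler):
        subscribers[topic].append(handler)

    def publish(topic, event):
        for handler in subscribers[topic]:  # the broker fans the event out
            handler(event)                  # the producer never sees consumers

    # consumers: each service reacts independently to the same immutable event
    subscribe("customer.address_changed",
              lambda e: print("user management updated:", e["address"]))
    subscribe("customer.address_changed",
              lambda e: print("quoting system re-rated:", e["customer_id"]))

    # producer: the insurance system only publishes; it issues no commands
    publish("customer.address_changed",
            {"customer_id": 42, "address": "221B Baker St"})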
Event-Carried State Transfer: the event carries the full state, reducing traffic between services and their runtime dependency on each other, which makes the system highly available, but at the cost of consistency while the duplicated data converges (see the payload comparison below).
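Illustrative payloads (hypothetical fields) showing the difference: a thin notification event forces consumers to call the owning service back for details, while an event-carried-state-transfer event duplicates the state so consumers can serve from their own copy.

    thin_event = {  # consumers must query the owner for the new state
        "type": "customer.address_changed",
        "customer_id": 42,
    }

    fat_event = {   # the full state travels with the event (ECST)
        "type": "customer.address_changed",
        "customer_id": 42,
        "address": {"street": "221B Baker St", "city": "London"},
    }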
Event Sourcing: when a change is about to happen, we first create an event object and then use it to change the data. If the system later blows up, we can recover the application state from the log. Example: Git, which is a combination of changes plus some snapshots. If every state change is in the log, there is no need for a relational DB to hold the current state; the system can run in memory. A minimal sketch follows.
Advantages: Auditable, Historic state, Debuggable, In-memory state
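A minimal event-sourcing sketch (names are illustrative): every change is appended to a log first, and the current state is just a replay of that log, in the spirit of the Git analogy above.

    events = []  # append-only log; events are immutable

    def apply_event(state, event):
        if event["type"] == "deposited":
            state["balance"] += event["amount"]
        elif event["type"] == "withdrawn":
            state["balance"] -= event["amount"]

    def record(event):
        events.append(event)       # 1) persist the change as an event
        apply_event(state, event)  # 2) then derive the in-memory state from it

    def replay(log):
        # rebuild the application state from the log, e.g. after a crash
        rebuilt = {"balance": 0}
        for event in log:
            apply_event(rebuilt, event)
        return rebuilt

    state = {"balance": 0}
    record({"type": "deposited", "amount": 100})
    record({"type": "withdrawn", "amount": 30})
    assert replay(events) == state == {"balance": 70}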
Example System Design: Twitter
Pick some Core Features: Post tweet (text/image/video/links), Follow/Unfollow, Like/Share/Retweet, Newsfeed sorted by relevance, Search tweets, Trending, Celebrity tweets
Non-Functional: HA, Scalable, Fast Response, Celebrity post to all followers, Most relevant tweet?
Estimate how much traffic and how much data. Read-heavy vs write-heavy.
Data usage:
1) Different types of tweet data need different backend databases (Cassandra to store tweets, Elasticsearch for search indexing and trending, blob storage for multimedia).
2) Average tweet size: 10 KB. Number of users: 300 million. Assuming roughly one tweet per user per day, total data per day: 10 KB x 300 million = 3 TB.
3) How to store 1 year of data? 3 TB/day x 365 ≈ 1.1 PB, so older data needs cheaper, tiered storage (back-of-envelope check below).
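A back-of-envelope check of the numbers above, under the stated assumption of roughly one 10 KB tweet per user per day:

    tweet_kb = 10
    users = 300_000_000
    per_day_tb = tweet_kb * users / 1_000_000_000  # KB -> TB (decimal units)
    print(per_day_tb)        # 3.0 TB per day
    print(per_day_tb * 365)  # 1095 TB, i.e. ~1.1 PB for one year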
Flow of posting a tweet:
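The notes stop here, so the following is one plausible write path rather than the original's: fan-out-on-write for ordinary users, falling back to fan-out-on-read for celebrities (consistent with the celebrity-post requirement above). All stores are in-memory stand-ins for the databases listed earlier.

    from collections import defaultdict
    import itertools

    tweets = {}                    # tweet_id -> (user_id, text); Cassandra above
    followers = defaultdict(set)   # user_id -> set of follower ids
    timelines = defaultdict(list)  # user_id -> precomputed feed of tweet ids
    _ids = itertools.count(1)
    CELEB_THRESHOLD = 1_000_000    # illustrative cutoff

    def post_tweet(user_id, text):
        tweet_id = next(_ids)
        tweets[tweet_id] = (user_id, text)  # 1) persist the tweet
        fans = followers[user_id]
        if len(fans) < CELEB_THRESHOLD:     # 2) fan-out on write
            for f in fans:
                timelines[f].append(tweet_id)
        # else: skip the push; celebrity tweets are merged into followers'
        # feeds at read time, since pushing to millions of feeds is too costly
        return tweet_id

    followers["alice"] = {"bob", "carol"}
    post_tweet("alice", "hello world")
    print(timelines["bob"])  # [1] -> the tweet reached the follower's feed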
Resiliency
The ability of a system to perform consistently, as expected, under varying conditions (redundancy, DR, observability, and continuous testing are ways to achieve this). While redundancy ensures continuity, reliability provides consistent performance.
Advantages: Enhanced uptime, Risk Mitigation and Sustainable performance
Disadvantages: High cost and complexity due to redundant systems, and complacency due to over-reliance on these principles
HA vs FT
High availability is the ability of a system to operate continuously with minimal risk of failure. On the other hand, fault tolerance is the ability of a system to continue operating without interruption, even if several components fail
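One concrete tactic behind these properties, echoing the robustness item in the aspects list earlier (timeouts/retries around external dependencies): retry with exponential backoff. call_payment_service below is a hypothetical flaky dependency.

    import time

    def with_retries(fn, attempts=3, base_delay=0.2):
        for attempt in range(attempts):
            try:
                return fn()
            except TimeoutError:
                if attempt == attempts - 1:
                    raise  # retries exhausted -> surface the fault
                time.sleep(base_delay * 2 ** attempt)  # back off: 0.2s, 0.4s, ...

    def call_payment_service():
        raise TimeoutError("upstream took too long")  # stand-in for a real call

    try:
        with_retries(call_payment_service)
    except TimeoutError as e:
        print("degraded gracefully:", e)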