Commit 2e9ddba
More System design
1 parent cb516de

10 files changed, +328 -11 lines
Interview/Algorithm/Array.md

Lines changed: 39 additions & 0 deletions
@@ -16,6 +16,7 @@
14. Maximum Swap
15. Meeting Room II
16. Sort Colors
17. 3 Sum

## Implementation

@@ -621,4 +622,42 @@ public:

### **3 SUM**

***Big O:*** O(n^2) time, O(1) extra space (excluding the output and the sort's internal stack)

```
Tips:

Sort + two pointers.
```

```c++
#include <algorithm>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<int>> threeSum(vector<int>& nums) {
        vector<vector<int>> ans;
        if ( nums.size() < 3 ) return ans;  // also avoids unsigned underflow below
        // Sorting lets the two pointers move monotonically and makes
        // duplicate skipping trivial.
        sort(nums.begin(), nums.end());
        int last = nums.size() - 1, left, right;
        for ( int index = 0; index <= last; ++index )
        {
            // Skip duplicate pivots so the same triplet is not emitted twice.
            if ( index > 0 && nums[index - 1] == nums[index] ) continue;
            left = index + 1;
            right = last;
            while ( left < right )
            {
                if ( nums[index] + nums[left] + nums[right] < 0 ) ++left;
                else if ( nums[index] + nums[left] + nums[right] > 0 ) --right;
                else
                {
                    ans.push_back({ nums[index], nums[left], nums[right] });
                    ++left;
                    // For a fixed pivot and left value there is only one valid
                    // right value, so skipping duplicate lefts is sufficient.
                    while ( left < right && nums[left] == nums[left - 1] ) ++left;
                }
            }
        }
        return ans;
    }
};
```

Interview/Algorithm/LeetCode_for_embedded_advanced.md

Lines changed: 1 addition & 0 deletions
@@ -81,6 +81,7 @@
31. Maximum Swap v
32. Meeting Room II v
33. Sort Colors v
34. 3 Sum v

***Math:***
1. Add Binary v

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
## CAP Theorem

```CAP theorem states that it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP): Consistency, Availability, and Partition tolerance.```

When we design a distributed system, the trade-off among CAP is almost the first thing we want to consider. The CAP theorem says that while designing a distributed system, we can pick only two of the following three options:

***Consistency:*** All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads.

***Availability:*** Every request receives a response indicating success or failure. Availability is achieved by replicating the data across different servers.

***Partition tolerance:*** The system continues to work despite message loss or partial failure. A partition-tolerant system can sustain any amount of network failure that doesn't result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

![CAP theorem](https://miro.medium.com/max/922/1*tmttEOAU9xacJgw6vrsAuA.jpeg)

We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failure; we can only build a system that has any two of these three properties. To be consistent, all nodes must see the same set of updates in the same order. But if the network suffers a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available.

## Reference

Grokking the System Design Interview by Educative.io
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
## Redundancy and Replication

### **Redundancy**

Redundancy is the ***duplication of critical components or functions of a system*** with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance. For example, if there is only one copy of a file stored on a single server, then losing that server means losing the file. Since losing data is seldom a good thing, we can create duplicate or redundant copies of the file to solve this problem.

```Redundancy plays a key role in removing the single points of failure in the system and provides backups if needed in a crisis. For example, if we have two instances of a service running in production and one fails, the system can fail over to the other one.```

### **Replication**

Replication means ***sharing information to ensure consistency between redundant resources***, such as software or hardware components, to improve reliability, fault tolerance, or accessibility.

Replication is widely used in many database management systems (DBMS), usually with a primary-replica relationship between the original and the copies. The primary server gets all the updates, which then ripple through to the replica servers. Each replica outputs a message stating that it has received the update successfully, thus allowing the sending of subsequent updates.
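
As a toy, single-process sketch of that flow (all names here are hypothetical, not from the notes): the primary applies each write locally, pushes it to every replica, and counts acknowledgements before accepting the next update.

```c++
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical replica: holds a copy of the data and acknowledges updates.
struct Replica {
    std::unordered_map<std::string, std::string> data;
    bool apply(const std::string& key, const std::string& value) {
        data[key] = value;
        return true;  // ack: update received successfully
    }
};

// Hypothetical primary: receives all writes and ripples them to replicas.
struct Primary {
    std::unordered_map<std::string, std::string> data;
    std::vector<Replica*> replicas;

    bool write(const std::string& key, const std::string& value) {
        data[key] = value;  // apply locally first
        size_t acks = 0;
        for (Replica* r : replicas)
            if (r->apply(key, value)) ++acks;
        // Only proceed to the next update once every replica has acked.
        return acks == replicas.size();
    }
};

int main() {
    Replica r1, r2;
    Primary primary;
    primary.replicas = { &r1, &r2 };
    if (primary.write("user:42", "alice"))
        std::cout << "update replicated to all replicas\n";
    std::cout << r1.data["user:42"] << "\n";  // reads can be served by replicas
}
```

A real system would send these updates over the network and decide whether to block on all acks (synchronous replication) or only some (asynchronous), which is where the consistency trade-offs from the CAP note appear.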

## Reference

Grokking the System Design Interview by Educative.io
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
## Consistent Hashing

Distributed Hash Tables (DHTs) are one of the fundamental components used in distributed scalable systems. Hash tables need a key, a value, and a hash function, where the hash function maps the key to a location where the value is stored.

Suppose we are designing a distributed caching system. Given 'n' cache servers, an intuitive hash function would be 'key % n'. It is simple and commonly used, but it has two major drawbacks:

1. It is NOT horizontally scalable. Whenever a new cache host is added to the system, all existing mappings are broken. It will be a pain point in maintenance if the caching system contains lots of data. Practically, it becomes difficult to schedule downtime to update all caching mappings.

2. It may NOT be load balanced, especially for non-uniformly distributed data. In practice, we can safely assume that the data will not be distributed uniformly. For the caching system, this translates into some caches becoming hot and saturated while others idle and are almost empty.

In such situations, consistent hashing is a good way to improve the caching system.

### **What is Consistent Hashing?**

Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It allows us to distribute data across a cluster in such a way that **reorganization is minimized when nodes are added or removed. Hence, the caching system will be easier to scale up or scale down.**

```In Consistent Hashing, when the hash table is resized (e.g. a new cache host is added to the system), only 'k/n' keys need to be remapped, where 'k' is the total number of keys and 'n' is the total number of servers. Recall that in a caching system using 'mod' as the hash function, all keys need to be remapped.```

In consistent hashing, objects are mapped to the same host if possible. When a host is removed from the system, the objects on that host are shared by the other hosts; when a new host is added, it takes its share from a few hosts without touching the others' shares.

### **How does it work?**

Like a typical hash function, consistent hashing maps a key to an integer. Suppose the output of the hash function is in the range of [0, 256]. Imagine that the integers in the range are placed on a ring such that the values wrap around.

Here's how consistent hashing works:

1. Given a list of cache servers, hash them to integers in the range.
2. To map a key to a server,
   1. Hash it to a single integer.
   2. Move clockwise on the ring until finding the first cache it encounters.
   3. That cache is the one that contains the key. See the animation below as an example: key1 maps to cache A; key2 maps to cache C.

![Consistent Hashing](https://uploads.toptal.io/blog/image/129309/toptal-blog-image-1551794743400-9a6fd84dca83745f8b6ca95a2890cdc2.png)

To add a new server, say D, keys that were originally residing at C will be split: some of them will be shifted to D, while the other keys will not be touched.

To remove a cache, or if a cache fails, say A, all keys that were originally mapped to A will fall to B, and only those keys need to be moved to B; other keys will not be affected.

For load balancing, as we discussed in the beginning, real data is essentially randomly distributed and thus may not be uniform. This can leave the keys on the caches unbalanced.

To handle this issue, we add "virtual replicas" for caches. Instead of mapping each cache to a single point on the ring, we map it to multiple points on the ring, i.e. replicas. This way, each cache is associated with multiple portions of the ring.

If the hash function "mixes well," then as the number of replicas increases, the keys will become more balanced.
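
One common way to realize this ring is a sorted map from hash points to server names, where lookup finds the first point clockwise from the key's hash and wraps around. Below is a minimal C++ sketch under that assumption; the use of std::hash and the '#i' suffix for virtual replicas are illustrative choices, not from the original text.

```c++
#include <functional>
#include <iostream>
#include <map>
#include <string>

class ConsistentHashRing {
    std::map<size_t, std::string> ring_;  // hash point -> server, kept sorted
    int replicas_;                        // virtual replicas per server
    std::hash<std::string> hash_;

public:
    explicit ConsistentHashRing(int replicas = 100) : replicas_(replicas) {}

    void addServer(const std::string& server) {
        // Map each server to multiple points ("virtual replicas") on the ring.
        for (int i = 0; i < replicas_; ++i)
            ring_[hash_(server + "#" + std::to_string(i))] = server;
    }

    void removeServer(const std::string& server) {
        for (int i = 0; i < replicas_; ++i)
            ring_.erase(hash_(server + "#" + std::to_string(i)));
    }

    // Move "clockwise" to the first server point at or after the key's hash.
    std::string serverFor(const std::string& key) const {
        if (ring_.empty()) return {};
        auto it = ring_.lower_bound(hash_(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }
};

int main() {
    ConsistentHashRing ring;
    ring.addServer("cacheA");
    ring.addServer("cacheB");
    ring.addServer("cacheC");
    std::cout << "key1 -> " << ring.serverFor("key1") << "\n";
    ring.removeServer("cacheA");  // only keys owned by cacheA's points move
    std::cout << "key1 -> " << ring.serverFor("key1") << "\n";
}
```

Removing a server only erases that server's own points, so only the keys that landed on those points move to the next server clockwise, which is exactly the 'k/n'-remapping behavior described above.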

### **Advanced Reading**

[The Ultimate Guide For Consistent Hashing](https://www.toptal.com/big-data/consistent-hashing)

This article first reviews the general concept of hashing and its purpose, then describes distributed hashing and the problems it entails, which leads into consistent hashing itself.

## Reference

Grokking the System Design Interview by Educative.io

Interview/SystemDesign/dataPartitioning.md

Lines changed: 62 additions & 0 deletions
@@ -1,5 +1,67 @@
## Data Partitioning

Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow vertically by adding beefier servers.

### **Partitioning Methods**

There are many different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are three of the most popular schemes used by various large-scale applications.

***a. Horizontal partitioning***

In this scheme, we put different rows into different tables. For example, if we are storing places in a table, we can decide that locations with ZIP codes less than 10000 are stored in one table and places with ZIP codes greater than 10000 are stored in a separate table. This is also called range-based partitioning, as we are storing different ranges of data in separate tables. Horizontal partitioning is also called Data Sharding.

The key problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully, the partitioning scheme will lead to unbalanced servers. In the previous example, splitting locations based on their ZIP codes assumes that places will be evenly distributed across the different ZIP codes. This assumption is not valid, as there will be far more places in a densely populated area like Manhattan than in its suburban cities.

***b. Vertical Partitioning***

In this scheme, we divide our data so that tables related to a specific feature are stored on their own server. For example, if we are building an Instagram-like application - where we need to store data related to users, the photos they upload, and the people they follow - we can decide to place user profile information on one DB server, friend lists on another, and photos on a third server.

Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, it may become necessary to further partition a feature-specific DB across various servers (e.g. it would not be possible for a single server to handle all the metadata queries for 10 billion photos from 140 million users).

***c. Directory-Based Partitioning***

A loosely coupled approach to work around the issues mentioned in the above schemes is to create a lookup service that knows your current partitioning scheme and abstracts it away from the DB access code. So, to find out where a particular data entity resides, we query the directory server that holds the mapping between each tuple key and its DB server. This loose coupling means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without impacting the application, as the sketch below illustrates.
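
A minimal sketch of such a lookup service, with all names hypothetical: the application always asks the directory which server owns a key, so moving a key is just an update to the directory's mapping.

```c++
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical directory: maps each tuple key to the DB server holding it.
class Directory {
    std::unordered_map<std::string, std::string> keyToServer_;

public:
    void assign(const std::string& key, const std::string& server) {
        keyToServer_[key] = server;  // changing the scheme = updating this map
    }

    std::string serverFor(const std::string& key) const {
        auto it = keyToServer_.find(key);
        return it != keyToServer_.end() ? it->second : "default-db";
    }
};

int main() {
    Directory dir;
    dir.assign("user:42", "db-1");
    std::cout << dir.serverFor("user:42") << "\n";  // db-1
    dir.assign("user:42", "db-7");                  // rebalanced transparently
    std::cout << dir.serverFor("user:42") << "\n";  // db-7; app code unchanged
}
```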

### **Partitioning Criteria**

***a. Key or Hash-based partitioning:***

Under this scheme, we apply a hash function to some key attribute of the entity we are storing; that yields the partition number. For example, if we have 100 DB servers and our ID is a numeric value that gets incremented by one each time a new record is inserted, the hash function could be 'ID % 100', which gives us the server number where we can store or read that record. This approach should ensure a uniform allocation of data among servers. The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function, which would require redistributing the data and downtime for the service. A workaround for this problem is to use Consistent Hashing.
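
A quick illustration of the 'ID % 100' example above before moving on (the numbers come from the text; the helper name is ours): note how changing the server count remaps almost every ID, which is exactly the redistribution problem just described.

```c++
#include <iostream>

// ID % n: the partition number for a record, given n DB servers.
int partitionFor(long long id, int serverCount) {
    return static_cast<int>(id % serverCount);
}

int main() {
    // With 100 servers, record 12345 lives on server 45.
    std::cout << partitionFor(12345, 100) << "\n";  // 45

    // Adding a single server changes the mapping for most IDs,
    // forcing a near-total redistribution of data.
    std::cout << partitionFor(12345, 101) << "\n";  // 23
}
```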

***b. List partitioning:***

In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we see which partition contains our key and store it there. For example, we can decide that all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries.

***c. Round-robin partitioning:***

This is a very simple strategy that ensures uniform data distribution. With 'n' partitions, the 'i'th tuple is assigned to partition (i mod n); e.g., with 4 partitions, tuples 0-3 go to partitions 0-3 and tuple 4 wraps back to partition 0.

***d. Composite partitioning:***

Under this scheme, we combine any of the above partitioning schemes to devise a new scheme - for example, first applying a list partitioning scheme and then a hash-based partitioning. Consistent hashing could be considered a composite of hash and list partitioning, where the hash reduces the key space to a size that can be listed.

### **Common Problems of Data Partitioning**

On a partitioned database, there are certain extra constraints on the different operations that can be performed. Most of these constraints arise because operations across multiple tables, or across multiple rows in the same table, no longer run on the same server. Below are some of the constraints and additional complexities introduced by partitioning:

***a. Joins and Denormalization:***

Performing joins on a database running on one server is straightforward, but once a database is partitioned and spread across multiple machines, it is often not feasible to perform joins that span database partitions. Such joins will not be performance efficient, since data has to be compiled from multiple servers. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table. Of course, the service now has to deal with all the perils of denormalization, such as data inconsistency.

***b. Referential integrity:***

Just as performing a cross-partition query on a partitioned database is not feasible, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult.

Most RDBMS do not support foreign key constraints across databases on different database servers, which means that applications requiring referential integrity on partitioned databases often have to enforce it in application code. In such cases, applications also have to run regular SQL jobs to clean up dangling references.

***c. Rebalancing:***

There could be many reasons we have to change our partitioning scheme:

1. The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition.
2. There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos.

In such cases, we either have to create more DB partitions or rebalance existing ones, which means changing the partitioning scheme and moving all existing data to new locations. Doing this without incurring downtime is extremely difficult. Using a scheme like directory-based partitioning does make rebalancing a more palatable experience, at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).

## Reference

Interview/SystemDesign/indexes.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
## Indexes

Indexes are well known when it comes to databases. Sooner or later there comes a time when database performance is no longer satisfactory, and one of the very first things you should turn to when that happens is database indexing.

The goal of creating an index on a particular table in a database is to make it faster to search through the table and find the row or rows that we want. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

```An index is like a table of contents that provides fast retrieval of content from a large database.```

### **Example: A Library Catalog**

A library catalog is a register that contains the list of books found in a library. The catalog is organized like a database table, generally with four columns: book title, writer, subject, and date of publication. There are usually two such catalogs: one sorted by book title and one sorted by writer name. That way, you can either think of a writer you want to read and then look through their books, or look up a specific book title you know you want to read in case you don't know the writer's name. These catalogs are like indexes for the database of books: they provide a sorted list of data that is easily searchable by relevant information.

Simply put, an index is a data structure that can be perceived as a table of contents pointing us to the location where the actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index.

Just like in a traditional relational data store, we can also apply this concept to larger datasets. The trick with indexes is that we must carefully consider how users will access the data. In the case of datasets that are many terabytes in size but have very small payloads (e.g., 1 KB), indexes are a necessity for optimizing data access. Finding a small payload in such a large dataset is a real challenge, since we can't possibly iterate over that much data in any reasonable time. Furthermore, it is very likely that such a large dataset is spread over several physical devices - this means we need some way to find the correct physical location of the desired data. Indexes are the best way to do this.

### **How do Indexes decrease write performance?**

An index can ***dramatically speed up data retrieval but may itself be large due to the additional keys, which slows down data insertion and updates***.

When adding rows or making updates to existing rows in a table with an active index, we not only have to write the data but also have to update the index. This decreases write performance, and the degradation applies to all insert, update, and delete operations on the table. For this reason, unnecessary indexes should be avoided, and indexes that are no longer used should be removed. To reiterate, adding indexes is about improving the performance of search queries.
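
A minimal in-memory sketch of both effects, with an assumed table layout (nothing here is from the original notes): the multimap index makes lookups by writer logarithmic instead of a full scan, while every insert pays the extra cost of maintaining that index.

```c++
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Row { int id; std::string title; std::string writer; };

class Table {
    std::vector<Row> rows_;                        // the actual data
    std::multimap<std::string, size_t> byWriter_;  // index: writer -> row position

public:
    void insert(const Row& row) {
        rows_.push_back(row);
        // The write must also maintain the index - this is the extra cost.
        byWriter_.emplace(row.writer, rows_.size() - 1);
    }

    // O(log n) via the index, instead of scanning every row.
    std::vector<Row> findByWriter(const std::string& writer) const {
        std::vector<Row> result;
        auto [lo, hi] = byWriter_.equal_range(writer);
        for (auto it = lo; it != hi; ++it) result.push_back(rows_[it->second]);
        return result;
    }
};

int main() {
    Table books;
    books.insert({1, "SICP", "Abelson"});
    books.insert({2, "TAOCP", "Knuth"});
    for (const Row& r : books.findByWriter("Knuth"))
        std::cout << r.id << " " << r.title << "\n";  // 2 TAOCP
}
```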

```If the goal of the database is to provide a data store that is often written to and rarely read from, then decreasing the performance of the more common operation (writing) is probably not worth the increase in read performance.```

## Reference

Grokking the System Design Interview by Educative.io
