| Requirement | Description |
|---|---|
| Search | Users can search for tweets using keywords, hashtags, or usernames. The search will display all related tweets from newest to oldest. |
| Fuzzy Search | Our system should also support Fuzzy Search, which helps users find results even if they misspell words. For example, searching for "facebok" might still show results for "Facebook". This makes searching easier and more forgiving. |
- Suppose a user wants to search for posts related to "quantum computing". If they type this and search, they will see all the posts that mention "quantum computing".
- A user searches for #Olympics2024Paris. They will see all the posts where this hashtag is mentioned.
- If a user misspells and searches for "quantom computing", they will still see the correct results for "quantum computing".
| Requirement | Description |
|---|---|
| Availability | Our system should be highly available, with a target of 99.9999% uptime. With millions of users searching simultaneously, it's critical that the search function never goes down. |
| Low Latency | Search results must load quickly. Any noticeable delay could lead to user frustration. |
| Scalability | The user base of the platform is expected to grow with time. The system should be able to scale up in the future smoothly. |
| Metric | Value |
|---|---|
| Daily Active Users (DAU) | 100 million |
| Monthly Active Users (MAU) | 400 million |
Since we are designing the search functionality on Twitter, we will calculate the throughput only for search operations. Every search operation reads data from our system.
There is only one primary way to read data from our system:
- Searching posts by keywords
Let's assume each user searches 5 times a day.
Total searches per day:
= DAU × 5 searches per user per day
= 100 million × 5
= 500 million searches per day

Searches per second:
= 500 million searches / (24 × 60 × 60) seconds per day
≈ 5,787 searches per second
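If you'd like to sanity-check these numbers, here is the same back-of-the-envelope math in a few lines of Python (all values are the assumptions stated above):

```python
# Back-of-the-envelope throughput estimate (assumptions from the text above).
DAU = 100_000_000            # daily active users
SEARCHES_PER_USER = 5        # assumed searches per user per day
SECONDS_PER_DAY = 24 * 60 * 60

searches_per_day = DAU * SEARCHES_PER_USER                # 500,000,000
searches_per_second = searches_per_day / SECONDS_PER_DAY

print(f"{searches_per_day:,} searches/day")       # 500,000,000 searches/day
print(f"{searches_per_second:,.0f} searches/sec") # ~5,787 searches/sec
```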
We are designing a search functionality for Twitter, but what data will it search? It will search tweet data. Let's look at how much storage these tweets take up.
Note: For simplicity, let's assume we're only dealing with text tweets since our search feature performs text searches anyway.
We know we have 100 million daily active users.
- Suppose each person tweets, on average, once a day.
- Let's assume each tweet has about 200 characters on average, which roughly translates to 200 bytes (considering 1 byte per character).
Total storage per day:
= 100 million users × 1 tweet per day × 200 bytes per tweet
= 20 billion bytes/day
= 20 GB/day

Storage for 10 years:
= 20 GB/day × 365 days/year × 10 years
= 73,000 GB = 73 TB
By memory, we mean the size of the cache.
Accessing data from the database is relatively slow; a cache lets us serve frequently accessed data much faster.
Let's assume that we store 5% of the most frequently accessed tweet data in the cache.
This means the memory required for the cache:
= 0.05 × 20 GB = 1 GB/day
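The storage and cache math in Python, using the same assumptions:

```python
# Storage and cache sizing (assumptions from the text above).
DAU = 100_000_000
TWEETS_PER_USER_PER_DAY = 1
BYTES_PER_TWEET = 200              # ~200 characters at 1 byte each

daily_storage_gb = DAU * TWEETS_PER_USER_PER_DAY * BYTES_PER_TWEET / 10**9
ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1000
daily_cache_gb = 0.05 * daily_storage_gb  # cache the hottest 5% of a day's data

print(daily_storage_gb)     # 20.0  (GB/day)
print(ten_year_storage_tb)  # 73.0  (TB over 10 years)
print(daily_cache_gb)       # 1.0   (GB/day of cache)
```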
Network/bandwidth estimation tells us how much data flows in and out of our system per second.

Incoming data in a day is ultimately saved in storage.
- Given that data stored in a day = 20 GB/day
- Therefore, incoming data in a day = 20 GB/day
- Hence, incoming data per second: 20 GB / (24 × 60 × 60) ≈ 231 KB/sec
Outgoing data represents the data read or searched from our system.
- Given that we have 500 million search requests per day
- Each search request returns the top 10 search results
- Each search result is a tweet with an average size of 200 bytes
- Therefore, outgoing data per day: 500 million × 10 × 200 bytes = 1,000 GB/day
- Hence, outgoing data per second: 1,000 GB / (24 × 60 × 60) ≈ 11.5 MB/sec
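Putting both bandwidth numbers together in Python:

```python
# Bandwidth in and out of the system (assumptions from the text above).
SECONDS_PER_DAY = 24 * 60 * 60

# Incoming: everything written in a day ends up in storage.
incoming_bytes_per_day = 20 * 10**9        # 20 GB/day
print(incoming_bytes_per_day / SECONDS_PER_DAY / 1000)   # ≈231 KB/sec

# Outgoing: 500M searches/day × 10 results × 200 bytes per tweet.
outgoing_bytes_per_day = 500_000_000 * 10 * 200          # 1,000 GB/day
print(outgoing_bytes_per_day / SECONDS_PER_DAY / 10**6)  # ≈11.6 MB/sec
```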
Let's understand what kind of API call goes to Twitter's servers when a user enters a keyword in the search bar and hits the search button.
The HTTP method tells the server what action to perform. Since we want to get a list of tweets related to our search keyword, we use the GET method.
The endpoint specifies where on the server to perform the action. For searching tweets, we use the /v1/search endpoint. We also pass our search keywords through this URL with the help of query parameters, like this:
/v1/search?query=quantum%20computing
Now, you might wonder, where did this %20 come from? Normally, you'd expect a space there between quantum and computing.
Imagine if we just sent a space in our URL like this:
/v1/search?query=quantum computing
This space could cause confusion, making it seem like our URL ends at quantum, and the rest of the part, computing, might be trimmed or ignored.
To avoid this confusion:
- We convert the space to %20 in the URL.
- This process is technically called encoding: we convert special characters that might cause issues, like spaces, into a format that won't confuse the server.
- The backend understands this encoding and converts the characters back into their original form, a process known as decoding.
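Python's standard library can demonstrate both steps (the keyword here is just our running example):

```python
from urllib.parse import quote, unquote

# Encoding: convert characters that might confuse the server (like spaces)
# into a safe percent-encoded form.
encoded = quote("quantum computing")
print(encoded)                        # quantum%20computing
print(f"/v1/search?query={encoded}")  # /v1/search?query=quantum%20computing

# Decoding: the backend turns it back into the original string.
print(unquote(encoded))               # quantum computing
```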
Let's say John goes to Twitter and searches for "quantum computing". What happens next looks something like this:
- User Search Request: John opens Twitter and searches for "quantum computing". As soon as he does this, a search request is sent to the API Gateway.
- Request Validation & Routing: The API Gateway validates this request and then passes it to the Search Service through a Load Balancer.
- Keyword Extraction & Database Search: The Search Service picks up the keywords "quantum" and "computing" from John's query. It then searches the Tweets Database for all tweets that contain these two keywords.
- Returning Search Results: Finally, all these tweets are returned to John as search results.
You might have guessed that this approach would work, but it is far from optimal. It scans the entire table for every search, which significantly slows down the search operation, especially when the system needs to support a high volume of queries, as Twitter does.
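To make the inefficiency concrete, here is a minimal sketch of this naive scan over a handful of hypothetical in-memory tweets:

```python
# A handful of hypothetical tweets, keyed by tweet ID.
tweets = {
    1001: "quantum supremacy milestone announced",
    1002: "new quantum computing breakthrough",
    1003: "an intro to quantum computing",
    1005: "quantum computing career paths",
}

def naive_search(keywords):
    """O(N) per query: every tweet is scanned for every keyword."""
    return [
        tweet_id
        for tweet_id, text in tweets.items()
        if all(kw in text.lower() for kw in keywords)
    ]

print(naive_search(["quantum", "computing"]))  # [1002, 1003, 1005]
```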
Let's now try to optimize this.
Previously, we saw that searching every tweet for certain keywords is slow and inefficient. How can we speed this up?
Imagine we have a magic list. Whenever we input a keyword into this list, it instantly gives us all the tweet IDs that contain this keyword. Essentially, there's no need to scan the entire table anymore.
This 'magic list' is what we call an Index DB. When John searches for "quantum computing," the Index DB quickly shows which tweets contain these words, without checking every tweet.
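Here is a minimal sketch of such an inverted index over the same hypothetical tweets; the tweet IDs match the walkthrough that follows:

```python
from collections import defaultdict

# The same hypothetical tweets as in the naive-scan sketch.
tweets = {
    1001: "quantum supremacy milestone announced",
    1002: "new quantum computing breakthrough",
    1003: "an intro to quantum computing",
    1005: "quantum computing career paths",
}

# Build the "magic list" (an inverted index): keyword -> set of tweet IDs.
index = defaultdict(set)
for tweet_id, text in tweets.items():
    for word in text.lower().split():
        index[word].add(tweet_id)

def indexed_search(keywords):
    """Look up each keyword's posting list, then intersect them."""
    postings = [index.get(kw, set()) for kw in keywords]
    return set.intersection(*postings) if postings else set()

print(sorted(index["quantum"]))                          # [1001, 1002, 1003, 1005]
print(sorted(indexed_search(["quantum", "computing"])))  # [1002, 1003, 1005]
```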
- John initiates a search:
  - John opens Twitter and searches for "quantum computing."
  - A search request is sent to the API Gateway.
- Request validation and forwarding:
  - The API Gateway validates the request.
  - It forwards the request to the Search Service via a Load Balancer.
- Keyword search in the Index DB:
  - The Search Service extracts the keywords "quantum" and "computing".
  - It queries the Index DB to find tweet IDs containing these words: "quantum" appears in tweets 1001, 1002, 1003, 1005; "computing" appears in tweets 1002, 1003, 1005.
  - The intersection of both lists is 1002, 1003, 1005 (tweets containing both words).
- Retrieving the tweets:
  - The Search Service fetches these specific tweets from the Tweets DB.
- Returning the results:
  - The tweets 1002, 1003, 1005 are sent back to John as search results.
With this optimized approach, searching becomes much faster, eliminating the need to scan every tweet in the database.
"To err is human" – this common saying is very true. Often, we make mistakes while typing. But you might notice that Twitter still gives us the correct results.
This is what we call Fuzzy Search. Fuzzy literally means something that is not clear or is ambiguous.
For example, suppose John searches for "quantom computing" instead of "quantum computing". He still receives results for "quantum computing."
Let's understand from the diagram below how the server corrects the incorrect word. Most of the flow is the same except that steps 3 and 4 are added.
- User Search Request:
  - John opens Twitter and searches for "quantom computing" (note the typo).
  - A search request is sent to the API Gateway.
- API Gateway Processing:
  - The API Gateway validates the request and forwards it to the Search Service via a Load Balancer.
- Keyword Extraction & Initial Search:
  - The Search Service picks up the keywords 'quantom' and 'computing' from John's query.
  - It checks whether these keywords exist in the Index DB.
  - It finds that there is no such word as 'quantom' in the Index DB.
- Query Correction:
  - The Search Service calls the Query Correction Service.
  - The Query Correction Service takes the incorrect word ('quantom') as input and returns the most likely correct word ('quantum').
  - Consider this part of the high-level design as a black box. We will discuss its inner workings in the Deep Dive Insights section.
- Search with Corrected Keywords:
  - The Search Service then searches the Index DB with the corrected keywords, 'quantum' and 'computing'.
  - It finds that 'quantum' appears in tweets 1001, 1002, 1003, 1005, and 'computing' in tweets 1002, 1003, 1005.
  - The Search Service takes the intersection of these, which is 1002, 1003, 1005, because these tweets contain both 'quantum' and 'computing'.
- Fetching Relevant Tweets:
  - The Search Service retrieves these specific tweets from the Tweets DB.
- Returning the Results:
  - Finally, all these tweets (1002, 1003, 1005) are returned to John as search results.

When John searches for "quantum computing," he receives a list of tweets. The most relevant tweet appears at the top of this list, with each subsequent tweet being slightly less relevant.
So, who does this sorting on the server side? The Ranking Service.
Let's understand this step-by-step. Most of the flow is the same as a standard search, but we've added one last step before returning the results, where we rank these tweets in order of relevance.
- John opens Twitter and searches for "quantom computing" (note the typo).
- A search request is sent to the API Gateway.
- The API Gateway validates the request.
- It then forwards the request to the Search Service via a Load Balancer.
- The Search Service extracts keywords:
- 'quantom' (misspelled)
- 'computing'
- It checks the Index DB for these keywords.
- The Index DB finds no entry for 'quantom.'
- The Search Service calls the Query Correction Service.
- The Query Correction Service detects the incorrect word ('quantom') and corrects it to 'quantum.'
- The Search Service searches the Index DB with the corrected keywords:
- 'quantum' appears in tweets 1001, 1002, 1003, 1005.
- 'computing' appears in tweets 1002, 1003, 1005.
- The Search Service takes the intersection of these results:
- Tweets 1002, 1003, and 1005 contain both 'quantum' and 'computing.'
- The Search Service fetches these tweets from the Tweets DB.
- The retrieved tweets are sent to the Ranking Service.
- The Ranking Service orders them from newest to oldest.
- Instead of adding ranking logic to the Search Service, a separate Ranking Service was created.
- This ensures decoupling, meaning if the ranking logic needs to change in the future, only the Ranking Service is updated.
- Finally, the ranked tweets are returned to John as search results.
By structuring the system this way, Twitter ensures efficient, accurate, and scalable tweet searches while keeping components flexible for future improvements.
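The recency ranking itself boils down to a sort. A minimal sketch, with hypothetical timestamps attached to the retrieved tweets:

```python
from datetime import datetime

# Hypothetical tweets retrieved for the query, with creation timestamps.
retrieved = [
    {"id": 1002, "created_at": datetime(2024, 5, 1, 10, 0)},
    {"id": 1003, "created_at": datetime(2024, 5, 3, 9, 30)},
    {"id": 1005, "created_at": datetime(2024, 5, 2, 18, 45)},
]

# Ranking Service: order results from newest to oldest.
ranked = sorted(retrieved, key=lambda t: t["created_at"], reverse=True)
print([t["id"] for t in ranked])  # [1003, 1005, 1002]
```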
## High-Level Design: Search Final Design
Let's take a look at the final design diagram for the search, including all the components—Fuzzy Search, Ranking, and Caches next to both databases to store frequently accessed data.
John opens Twitter and searches for "quantum computing." As soon as he does this, a search request is sent to the API Gateway.
- The API Gateway validates this request.
- It then forwards the request to the Search Service via a Load Balancer.
- The Search Service extracts the keywords 'quantom' and 'computing' from John's query.
- It first checks the cache for these keywords.
- On a cache miss, it checks the Index DB.
- The Index DB does not contain anything listed as 'quantom'.
- The Search Service contacts the Query Correction Service.
- This service detects 'quantom' as a likely typo and suggests the correct word: 'quantum'.
- The Search Service searches the cache again with 'quantum' and 'computing'.
- If the cache does not contain them, it checks the Index DB and updates the cache (a read-through mechanism; see the sketch after this walkthrough).
- The Index DB lookup results: 'quantum' appears in tweets 1001, 1002, 1003, 1005; 'computing' appears in tweets 1002, 1003, 1005.
- The intersection of both results: tweets 1002, 1003, 1005.
- The Search Service checks the Tweets DB cache for these tweet IDs.
- If the cache does not have them, it queries the Tweets DB and updates the cache (read-through mechanism).
- The Search Service sends the retrieved tweets to the Ranking Service.
- The Ranking Service orders them from newest to oldest.
- The Search Service returns the ranked tweets to John as search results.
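The read-through pattern used in the steps above is straightforward to sketch. Here is a minimal version with a hypothetical in-memory cache and Index DB:

```python
# Hypothetical stores: an in-memory cache in front of the Index DB.
cache = {}
index_db = {
    "quantum": {1001, 1002, 1003, 1005},
    "computing": {1002, 1003, 1005},
}

def get_tweet_ids(keyword):
    """Read-through: serve from cache if present; otherwise read the
    Index DB and populate the cache before returning."""
    if keyword in cache:
        return cache[keyword]            # cache hit
    ids = index_db.get(keyword, set())   # cache miss: go to the Index DB
    cache[keyword] = ids                 # warm the cache for next time
    return ids

print(get_tweet_ids("quantum"))  # first call: loaded from the Index DB
print(get_tweet_ids("quantum"))  # second call: served from the cache
```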
The table below provides a high-level comparison of when to use SQL vs NoSQL, marked with ✔ (preferred) and ✖ (not preferred).
| Criteria | SQL | NoSQL |
|---|---|---|
| Fast Data Access | ✖ | ✔ |
| Handles Large Scale | ✖ | ✔ |
| Fixed Structure Data | ✔ | ✖ |
| Flexible Structure Data | ✖ | ✔ |
| Complex Queries | ✔ | ✖ |
| Simple Queries | ✖ | ✔ |
| Frequent/Evolving Data Changes | ✖ | ✔ |
| Database | Deciding Factors | Decision |
|---|---|---|
| Tweets DB | Massive scale, simple queries (fetch tweets by ID), fast access, flexible structure | NoSQL |
| Index DB | Massive scale, simple key-based lookups (tweet IDs by keyword), fast access | NoSQL |
Note: The tables shown in the high-level design were simplified versions just to illustrate the overall flow. This lesson provides a more detailed schema.
- Database Type: NoSQL
- Common Queries: Find the tweet by tweet ID
- Indexing: Tweet ID
Note:
Because we have this common query to find the tweet by tweet ID, we create an index on the tweet ID. This sets a shortcut to quickly find the information by tweet ID.
- Database Type: NoSQL
- Common Queries: Grab all the tweet IDs for a keyword.
- Indexing: Keyword
Note:
Because we have this common query to grab all the tweet IDs for a keyword, we create an index on the keyword field. This sets a shortcut to quickly find the data by keyword.
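To make the schemas concrete, here is what a document in each store might look like. The field names are illustrative assumptions, not a fixed schema:

```python
# Tweets DB document, indexed by tweet_id (hypothetical field names).
tweet_doc = {
    "tweet_id": 1002,
    "user_id": 42,
    "text": "new quantum computing breakthrough",
    "created_at": "2024-05-01T10:00:00Z",
}

# Index DB document, indexed by keyword (hypothetical field names).
index_doc = {
    "keyword": "quantum",
    "tweet_ids": [1001, 1002, 1003, 1005],
}
```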
We've seen how the Query Correction Service can turn John's misspelled or incorrect keyword into the correct one. Let's dive deeper into exactly how this happens.
Imagine John types the incorrect keyword "Fery".
What do you think John actually meant? Here are some potential correct options:
- Ferry
- Fare
- Fair
The most likely option is Ferry because it's the closest to what he typed. "Closest" here means requiring the fewest changes.
To turn Fery into Ferry, you only need one change:
- Add one 'r': Fery → Ferry

For Fery → Fare, it requires two changes:
- Replace 'e' with 'a': Fery → Fary
- Replace 'y' with 'e': Fary → Fare

For Fery → Fair, it requires three changes:
- Replace 'e' with 'a': Fery → Fary
- Insert 'i' before 'r': Fary → Fairy
- Delete 'y': Fairy → Fair
So, Ferry is the closest as it needs the fewest changes—just one insertion.
A change could be:
- Inserting a character
- Deleting a character
- Replacing a character
This count of changes (aka closeness) is technically known as the Levenshtein Distance between the two words.
The option with the fewest changes (Ferry) is typically selected, as it's considered the closest match to the intended word.
The Query Correction Service does something similar:
- It calculates the Levenshtein Distance for all possible options from the incorrect keyword.
- It chooses the one with the minimum distance.
This is the most likely option that John intended to type.
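Here is a minimal dynamic-programming implementation of the Levenshtein Distance, along with the final pick-the-closest-candidate step. The candidate list is just our running example; how candidates are generated in practice is a separate problem:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]

def correct(word, candidates):
    """Pick the candidate with the smallest edit distance."""
    return min(candidates, key=lambda c: levenshtein(word, c))

print(levenshtein("fery", "ferry"))  # 1
print(levenshtein("fery", "fare"))   # 2
print(levenshtein("fery", "fair"))   # 3
print(correct("fery", ["ferry", "fare", "fair"]))  # ferry
```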
Let's explore something interesting. The design we've created mirrors a popular and very powerful search tool known as Elasticsearch.
Think of Elasticsearch as a powerful search engine. It can handle vast amounts of data and deliver real-time search results. It provides search functionality, including complex features like fuzzy search, and also helps rank search results by relevance.
Essentially, if we were to replace our entire design with Elasticsearch, it would function in much the same way.
Many popular websites, including Twitter, use Elasticsearch to manage their vast data and provide real-time search results.
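For a flavor of what this looks like, here is a sketch of a fuzzy query sent to Elasticsearch's standard _search API. The index name, field name, and node address are assumptions for illustration:

```python
import requests  # third-party HTTP client (pip install requests)

# Fuzzy match query against a hypothetical "tweets" index with a "text" field.
# "fuzziness": "AUTO" lets Elasticsearch tolerate small typos like "quantom".
query = {
    "query": {
        "match": {
            "text": {
                "query": "quantom computing",
                "fuzziness": "AUTO",
            }
        }
    }
}

resp = requests.post(
    "http://localhost:9200/tweets/_search",  # assumes a local Elasticsearch node
    json=query,
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```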