| 1 | +# Repeated DNA Sequences |
| 2 | + |
| 3 | +A DNA sequence consists of nucleotides represented by the letters ‘A’, ‘C’, ‘G’, and ‘T’ only. For example, “ACGAATTCCG” |
| 4 | +is a valid DNA sequence. |
| 5 | + |
| 6 | +Given a string, s, that represents a DNA sequence, return all the 10-letter-long sequences (continuous substrings of |
| 7 | +exactly 10 characters) that appear more than once in s. You can return the output in any order. |
| 8 | + |
| 9 | +Constraints: |
| 10 | +- 1 ≤ s.length ≤ 10^3 |
| 11 | +- s[i] is either 'A', 'C', 'G', or 'T'. |
| 12 | + |
| 19 | +--- |
| 20 | + |
| 21 | +## Solution |
| 22 | + |
| 23 | +### Naive Approach |
| 24 | +The naive approach to solving this problem is to extract every possible 10-letter-long substring of the given DNA |
| 25 | +sequence and compare it with the substrings seen so far, which we keep in a set. If a substring has been seen before, |
| 26 | +it appears more than once, so we add it to the result. |
| 27 | + |
| 28 | +Specifically, we start by iterating through the string and extracting every substring of length 10. For each substring, |
| 29 | +we check whether it has been seen before. If it has, we store it in a separate set that tracks repeated sequences. If |
| 30 | +not, we add it to the set of seen substrings. Finally, we return all repeated sequences as a list. This method is |
| 31 | +simple to understand but inefficient because every window is built and hashed as a new 10-character string, which |
| 32 | +costs extra time and memory at each position and makes it slow for large inputs. |
| 33 | + |
| 34 | +We extract all k-length (where k = 10) substrings from the given string s, which has length n. This means we extract |
| 35 | +(n−k+1) substrings. Each substring extraction takes O(k) time. Checking whether a substring is in a set takes O(1) on |
| 36 | +average, but in the worst case (hash collisions), it takes O(n−k+1) comparisons. Inserting a new substring into the set |
| 37 | +also takes O(1) on average and O(n−k+1) in the worst case. Therefore, the average-case time complexity is O((n−k)×k). |
| 38 | + |
| 39 | +The space complexity of this approach is O((n−k)×k) because, in the worst case, our set can contain (n−k+1) elements, |
| 40 | +and at each iteration of the traversal, we allocate memory for a new k-length substring. |
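
The naive approach described above can be sketched as follows. This is a minimal illustration rather than a reference implementation; the function name `find_repeated_dna_sequences_naive` and the parameter `k` are assumptions introduced here for the example.

```python
from typing import List


def find_repeated_dna_sequences_naive(s: str, k: int = 10) -> List[str]:
    # Hypothetical helper for illustration: two sets, one for every
    # substring seen so far and one for substrings seen more than once
    seen = set()
    repeated = set()

    # There are n - k + 1 windows; the range is empty when len(s) < k
    for start in range(len(s) - k + 1):
        window = s[start:start + k]  # O(k) slice creates a new string
        if window in seen:
            repeated.add(window)
        else:
            seen.add(window)

    return list(repeated)
```

For instance, calling it on `"AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"` yields `"AAAAACCCCC"` and `"CCCCCAAAAA"` in some order, while any input shorter than 10 characters yields an empty list.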
| 41 | + |
| 42 | +### Optimized approach using sliding window |
| 43 | +Consecutive 10-letter windows overlap in nine characters. So, instead of creating and hashing a new substring at every |
| 44 | +position, we can slide a window over the string and keep a hash of its contents. Rather than computing this hash from |
| 45 | +scratch for each window, we update the previous window's value as we slide forward on the string. This technique is |
| 46 | +commonly known as a rolling hash. |
| 47 | + |
| 48 | +The rolling hash can be divided into three main steps: |
| 49 | +- Initial hash calculation: Calculate the hash for the main string’s first window (substring). |
| 50 | +- Slide the window: Move the window one character forward. |
| 51 | +- Update the hash: Use the previous hash value to calculate the new hash without rescanning the whole substring. |
| 52 | + - Remove the hash contribution of the outgoing character. |
| 53 | + - Add the hash contribution of the incoming character. |
| 54 | + |
| 55 | +This optimized solution revolves around the rolling hash technique. First, we convert the characters 'A', 'C', 'G', and |
| 56 | +'T' into the numerical values 0, 1, 2, and 3, respectively. Then, we compute the hash for the first 10-letter substring. |
| 57 | +As we slide the window forward one character at a time, we remove the old character from the left, add the new |
| 58 | +character on the right, and update the hash to reflect this change. We use a set to track previously seen hash values. |
| 59 | +At each step, we check whether the computed hash has been seen before. If it has, we store the corresponding |
| 60 | +substring in the result. |
| 61 | + |
| 80 | +### A step-by-step solution construction |
| 81 | + |
| 82 | +#### Step 1: Encode characters into numbers |
| 83 | + |
| 84 | +Before processing DNA sequences in s, we must convert the characters 'A', 'C', 'G', and 'T' into numerical values. |
| 85 | +This allows us to perform mathematical operations, like computing hashes, more efficiently. We assign 'A' → 0, |
| 86 | +'C' → 1, 'G' → 2, and 'T' → 3. |
| 87 | +We’ll define this mapping in a dictionary, to_int. Then, we’ll convert each character in s into its corresponding numeric |
| 88 | +value and store it in a list, encoded_sequence. Let’s look at the code for this step: |
| 89 | + |
| 90 | +```python |
| 91 | +from typing import List |
| 92 | + |
| 93 | + |
| 94 | +def find_repeated_dna_sequences(dna_sequence: str) -> List[int]: |
| 95 | +    # Define a mapping of DNA characters to numerical values |
| 96 | +    to_int = {"A": 0, "C": 1, "G": 2, "T": 3} |
| 97 | + |
| 98 | +    # Convert each character in the input string to its corresponding number |
| 99 | +    encoded_sequence = [to_int[c] for c in dna_sequence] |
| 100 | + |
| 101 | +    # Return the encoded list of numbers (intermediate result for this step) |
| 102 | +    return encoded_sequence |
| 103 | +``` |
| 104 | + |
| 105 | +#### Step 2: Compute the first hash (rolling hash) |
| 106 | + |
| 107 | +Now that we have the numerical form of the DNA sequence in s, we can compute a rolling hash for the first 10-letter |
| 108 | +substring. Hashing allows us to efficiently compare substrings without repeatedly checking each character. |
| 109 | + |
| 110 | +For hashing, we’ll use the polynomial rolling hash: |
| 111 | + |
| 112 | +`hash = (c1 × a^(k-1)) + (c2 × a^(k-2)) + ... + (ck × a^0)` |
| 113 | + |
| 114 | +Here, ci is the numeric value of the i-th character of the substring, a is the size of our alphabet, i.e., 4 for the |
| 115 | +DNA sequence in s, and k is the length of the substring for which we are computing the hash, i.e., 10 in our case. |
| 116 | + |
| 117 | +By plugging in the values, we’ll compute the initial hash as follows: |
| 118 | + |
| 119 | +`h0 = (c1 × 4^9) + (c2 × 4^8) + ... + (c10 × 4^0)` |
| 120 | + |
| 121 | +Here, each character ci contributes one base-4 digit to the hash. As a result, h0 uniquely represents the first |
| 122 | +10-letter sequence, and we can update it as we slide through the DNA string. |
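
As a small worked example of the initial-hash formula, the sketch below evaluates it term by term for the window "ACGAATTCCG" and compares the result with the iterative (Horner-style) evaluation. The helper names `polynomial_hash` and `horner_hash` are assumptions made for this illustration only.

```python
TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def polynomial_hash(window: str, a: int = 4) -> int:
    # Direct evaluation: sum of c_i * a^(k - i) for i = 1..k
    k = len(window)
    return sum(TO_INT[c] * a ** (k - 1 - i) for i, c in enumerate(window))


def horner_hash(window: str, a: int = 4) -> int:
    # Iterative evaluation: shift the running value by one base-a digit,
    # then append the next digit
    h = 0
    for c in window:
        h = h * a + TO_INT[c]
    return h


window = "ACGAATTCCG"  # digits: 0 1 2 0 0 3 3 1 1 2
h0 = polynomial_hash(window)
```

Here, `h0` is 1×4^8 + 2×4^7 + 3×4^4 + 3×4^3 + 1×4^2 + 1×4^1 + 2×4^0 = 99286, which is the base-4 number 0120033112 read as an integer. Both functions return the same value.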
| 123 | + |
| 124 | +> Note: This hash is derived from the polynomial hash. If you want to dive deeper into how we mapped the general |
| 125 | +> polynomial hash to our case, here is more information: |
| 126 | +> For a sequence of numbers {n1, n2, n3, ..., nk}, where each ni represents a character converted into a number, the |
| 127 | +> polynomial hash function in base a is `hash = (n1 × a^(k-1)) + (n2 × a^(k-2)) + ... + (nk × a^0)`. In this equation, k |
| 128 | +> is the length of the substring for which we are computing the hash, and a is the size of our alphabet, i.e., 4 for the |
| 129 | +> DNA sequence in s. This formula treats the sequence as a number in base-a notation, similar to how we represent |
| 130 | +> numbers in base-10 or base-2. If we apply it to our case, k=10 as we are working with 10-letter substrings, and a=4 as |
| 131 | +> we have 4 possible characters (A, C, G, T). So, plugging in the values: `hash = (n1 × 4^9) + (n2 × 4^8) + ... + (n10 × 4^0)`. |
| 132 | +> As each ni is a valid base-4 digit (0 to 3) and no modulus is applied, every distinct 10-letter DNA sequence gets a distinct number between 0 and 4^10 − 1. |
| 133 | +
|
| 134 | + |
| 135 | +In the code implementation, we’ll define and initialize some variables to store each component of the equation above. |
| 136 | +We’ll define the constants k = 10 and a = 4. We’ll use the variable h (named rolling_hash_value in the code) to store the |
| 137 | +rolling hash value and a_k to compute 4^k. Then, we’ll use a loop to process the first 10 letters of s for computing the first hash. This loop will do the following: |
| 138 | +- Build a unique number (hash) for the first 10 letters iteratively. |
| 139 | +- Prepare a multiplier (a_k) for future hash updates. |
| 140 | + |
| 141 | +Let’s look at the code for this step: |
| 142 | + |
| 143 | +```python |
| 144 | +from typing import List |
| 145 | + |
| 146 | + |
| 147 | +def find_repeated_dna_sequences(dna_sequence: str) -> int: |
| 148 | +    # Define a mapping of DNA characters to numerical values |
| 149 | +    to_int = {"A": 0, "C": 1, "G": 2, "T": 3} |
| 150 | + |
| 151 | +    # Convert each character in the input string to its corresponding number |
| 152 | +    encoded_sequence = [to_int[c] for c in dna_sequence] |
| 153 | +    dna_sequence_substr_length = 10  # Window length k; this step assumes the input has at least k characters |
| 154 | +    base_a_encoding = 4  # Base-4 encoding |
| 155 | + |
| 156 | +    rolling_hash_value = 0 |
| 157 | +    a_k = 1  # Stores a^k for hash updates |
| 158 | + |
| 159 | +    # Compute the initial hash using base-4 multiplication |
| 160 | +    for i in range(dna_sequence_substr_length): |
| 161 | +        rolling_hash_value = rolling_hash_value * base_a_encoding + encoded_sequence[i] |
| 162 | +        a_k *= base_a_encoding  # Precompute a^k for later use in rolling hash updates |
| 163 | + |
| 164 | +    return rolling_hash_value  # Hash of the first window (intermediate result for this step) |
| 165 | +``` |
| 166 | + |
| 167 | +#### Step 3: Update the hash and use a set to track seen substrings |
| 168 | + |
| 169 | +After computing the initial hash, we slide a window through the string, efficiently updating the hash. Instead of |
| 170 | +recomputing the hash from scratch at every step, we adjust it by: |
| 171 | + |
| 172 | +- Removing the old character from the left. |
| 173 | +- Adding the new character on the right. |
| 174 | + |
| 175 | +Using a rolling hash, the update formula becomes: |
| 176 | + |
| 177 | +`new hash = (old hash × 4) − (leftmost digit × 4^10) + new digit` |
| 178 | + |
| 179 | +In the code implementation, we’ll use a loop to slide over s and update the hash value, h, for each new window. As we |
| 180 | +have already computed the hash for the first window, we’ll start our loop from index 1 of s. The variable start |
| 181 | +always marks the first index of the current window, and start + k - 1 marks its last index. So, to remove the |
| 182 | +contribution of the leftmost character of the previous window (at index start - 1) and add the contribution of the new |
| 183 | +rightmost character, we'll update h (rolling_hash_value in the code) as follows: |
| 184 | + |
| 185 | +`rolling_hash_value = (rolling_hash_value * base_a_encoding) - (encoded_sequence[start - 1] * a_k) + (encoded_sequence[start + dna_sequence_length - 1])` |
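
To see that this update agrees with recomputing each window's hash from scratch, here is a small sanity-check sketch. The helpers `hash_from_scratch` and `rolled_hashes` are hypothetical names introduced only for this example.

```python
from typing import List

TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def hash_from_scratch(digits: List[int], start: int, k: int, a: int = 4) -> int:
    # Recompute the base-a value of digits[start : start + k] in O(k)
    h = 0
    for d in digits[start:start + k]:
        h = h * a + d
    return h


def rolled_hashes(s: str, k: int = 10, a: int = 4) -> List[int]:
    # Return the hash of every k-length window, using the O(1) rolling update
    digits = [TO_INT[c] for c in s]
    if len(digits) < k:
        return []  # no complete window exists
    a_k = a ** k
    h = hash_from_scratch(digits, 0, k, a)
    hashes = [h]
    for start in range(1, len(digits) - k + 1):
        # Shift left by one digit, drop the outgoing digit, append the incoming one
        h = h * a - digits[start - 1] * a_k + digits[start + k - 1]
        hashes.append(h)
    return hashes
```

For any input, the i-th entry of `rolled_hashes(s)` should equal `hash_from_scratch(digits, i, 10)`, which is a convenient property to assert when testing the rolling update.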
| 186 | + |
| 187 | +We’ll use a set, `seen_hashes`, to track the hashes we’ve seen before. If a hash appears again, we add the corresponding |
| 188 | +substring to the result set, `output`. Let’s look at the code for this step: |
| 189 | + |
| 190 | +```python |
| 191 | +from typing import List |
| 192 | + |
| 193 | + |
| 194 | +def find_repeated_dna_sequences(dna_sequence: str) -> List[str]: |
| 195 | +    # Define a mapping of DNA characters to numerical values |
| 196 | +    to_int = {"A": 0, "C": 1, "G": 2, "T": 3} |
| 197 | + |
| 198 | +    # Convert each character in the input string to its corresponding number |
| 199 | +    encoded_sequence = [to_int[c] for c in dna_sequence] |
| 200 | +    dna_sequence_substr_length, dna_sequence_length = 10, len(dna_sequence)  # Window length and input length |
| 201 | +    base_a_encoding = 4  # Base-4 encoding |
| 202 | + |
| 203 | +    # An input shorter than the window has no 10-letter substring, so nothing can repeat |
| 204 | +    if dna_sequence_length < dna_sequence_substr_length: |
| 205 | +        return [] |
| 206 | + |
| 207 | +    rolling_hash_value, a_k = 0, 1  # a_k accumulates a^k for hash updates |
| 208 | +    # Compute the initial hash using base-4 multiplication |
| 209 | +    for i in range(dna_sequence_substr_length): |
| 210 | +        rolling_hash_value = rolling_hash_value * base_a_encoding + encoded_sequence[i] |
| 211 | +        a_k *= base_a_encoding  # Precompute a^k for later use in rolling hash updates |
| 212 | + |
| 213 | +    seen_hashes, output = {rolling_hash_value}, set()  # Seen hashes (seeded with the first window) and repeated sequences |
| 214 | +    # Sliding window approach to update the hash efficiently |
| 215 | +    for start in range(1, dna_sequence_length - dna_sequence_substr_length + 1): |
| 216 | +        # Remove the leftmost character and add the new rightmost character |
| 217 | +        rolling_hash_value = rolling_hash_value * base_a_encoding - encoded_sequence[start - 1] * a_k + encoded_sequence[start + dna_sequence_substr_length - 1] |
| 218 | + |
| 219 | +        # If this hash has been seen before, add the corresponding substring to the output |
| 220 | +        if rolling_hash_value in seen_hashes: |
| 221 | +            output.add(dna_sequence[start : start + dna_sequence_substr_length]) |
| 222 | +        else: |
| 223 | +            seen_hashes.add(rolling_hash_value) |
| 224 | + |
| 225 | +    return list(output)  # Convert set to list before returning |
| 226 | +``` |
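
Because the alphabet has exactly four symbols, each encoded character fits in two bits, so the same rolling update can also be expressed with shifts and a mask. The sketch below is a variant shown for comparison, not the implementation used in the steps above; the function name `find_repeated_dna_sequences_bitmask` and the parameter `k` are assumptions made for this example.

```python
from typing import List


def find_repeated_dna_sequences_bitmask(s: str, k: int = 10) -> List[str]:
    # Hypothetical variant: the hash is the last k characters packed
    # into 2 * k bits (2 bits per character)
    if len(s) < k:
        return []

    to_int = {"A": 0, "C": 1, "G": 2, "T": 3}
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2 * k bits
    h = 0
    seen, output = set(), set()

    for i, c in enumerate(s):
        # Shift in the new 2-bit digit; the mask drops the outgoing digit
        h = ((h << 2) | to_int[c]) & mask
        if i >= k - 1:  # a full window ends at index i
            if h in seen:
                output.add(s[i - k + 1 : i + 1])
            else:
                seen.add(h)

    return list(output)
```

Shifting left by two bits plays the role of multiplying by 4, and masking plays the role of subtracting the outgoing character's contribution, so the value of `h` for each full window matches the base-4 hash described earlier.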
| 227 | + |
| 228 | +### Solution Summary |
| 229 | +Let’s get a quick recap of the optimized solution: |
| 230 | + |
| 231 | +1. Encode the DNA sequence in s by converting 'A', 'C', 'G', and 'T' into numbers (0, 1, 2, 3) for easier computation. |
| 232 | + |
| 233 | +2. Use a set to store seen hashes and detect repeating sequences. |
| 234 | + |
| 235 | +3. Compute the rolling hash for the first 10-letter substring and store it in the set. |
| 236 | + |
| 237 | +4. Move the window one step forward and compute the hash of the new window. If this hash is already in the set, store |
| 238 | +   the corresponding substring in the result. Otherwise, add the hash to the set. |
| 239 | + |
| 240 | +5. Once the entire string has been traversed, return the result containing all the repeating 10-letter-long sequences. |
| 241 | + |
| 242 | +### Time Complexity |
| 243 | +Let’s break down and analyze the time complexity of this solution: |
| 244 | + |
| 245 | +- We go through the input string once to convert characters into numbers, which takes O(n) time. |
| 246 | +- We compute the first rolling hash in O(k) time (where k = 10). As k is a fixed small number, it is treated as O(1). |
| 247 | +- Then, we slide through the string once, updating the hash in O(1) time for each step. Overall, it takes O(n−k) time, |
| 248 | + which can be simplified to O(n) as k is a constant. |
| 249 | +- Checking and storing hashes in a set is O(1) on average. |
| 250 | + |
| 251 | +If we sum these up, the overall time complexity simplifies to: `O(n)+O(1)+O(n)+O(1)=O(n)` |
| 252 | + |
| 253 | +### Space Complexity |
| 254 | +Let’s break down and analyze the space complexity of this solution: |
| 255 | + |
| 256 | +- We store the encoded sequence as a list of numbers, which takes O(n) space. |
| 257 | +- We store hashes in a set that would have, at most, n−k+1 entries. This can be simplified to O(n) space. |
| 258 | +- We store repeated sequences in another set that takes, at most, n−k+1 unique sequences. This can be simplified to |
| 259 | + O(n) space. |
| 260 | + |
| 261 | +If we sum these up, the overall space complexity becomes: `O(n)+O(n)+O(n)=O(n)` |
| 262 | + |
| 263 | +--- |
| 264 | + |
| 265 | +## Tags |
| 266 | +- Hash Table |
| 267 | +- String |
| 268 | +- Bit Manipulation |
| 269 | +- Sliding Window |
| 270 | +- Rolling Hash |
| 271 | +- Hash Function |