Commit e84daec

authored
Merge pull request #97 from BrianLusina/feat/algorithms-sliding-window
feat(algorithms, sliding window): repeated dna sequences
2 parents 1aff4bb + 3f3c9e9 commit e84daec

22 files changed

+435
-0
lines changed

DIRECTORY.md

Lines changed: 2 additions & 0 deletions
@@ -101,6 +101,8 @@
101101
* [Test Longest Substring K Repeating Chars](https://github.com/BrianLusina/PythonSnips/blob/master/algorithms/sliding_window/longest_substring_with_k_repeating_chars/test_longest_substring_k_repeating_chars.py)
102102
* Longest Substring Without Repeating Characters
103103
* [Test Longest Substring Without Repeating Characters](https://github.com/BrianLusina/PythonSnips/blob/master/algorithms/sliding_window/longest_substring_without_repeating_characters/test_longest_substring_without_repeating_characters.py)
104+
* Repeated Dna Sequences
105+
* [Test Repeated Dna Sequences](https://github.com/BrianLusina/PythonSnips/blob/master/algorithms/sliding_window/repeated_dna_sequences/test_repeated_dna_sequences.py)
104106
* Sorting
105107
* Insertionsort
106108
* [Test Insertion Sort](https://github.com/BrianLusina/PythonSnips/blob/master/algorithms/sorting/insertionsort/test_insertion_sort.py)
Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
1+
# Repeated DNA Sequences
2+
3+
A DNA sequence consists of nucleotides represented by the letters ‘A’, ‘C’, ‘G’, and ‘T’ only. For example, “ACGAATTCCG”
4+
is a valid DNA sequence.
5+
6+
Given a string, s, that represents a DNA sequence, return all the 10-letter-long sequences (continuous substrings of
7+
exactly 10 characters) that appear more than once in s. You can return the output in any order.
8+
9+
Constraints:
10+
- 1 ≤ s.length ≤ 10^3
11+
- s[i] is either 'A', 'C', 'G', or 'T'.
12+
13+
Examples:
14+
15+
![Example one](images/repeated_dna_sequences_example_one.png)
16+
![Example two](images/repeated_dna_sequences_example_two.png)
17+
![Example three](images/repeated_dna_sequences_example_three.png)
18+
19+
---
20+
21+
## Solution
22+
23+
### Naive Approach
24+
The naive approach to solving this problem is to check every possible 10-letter-long substring of the given DNA
sequence against the substrings seen so far. Keeping the seen substrings in a set turns that check into a single
membership test instead of an inner loop over all earlier substrings.

Specifically, we iterate through the string and extract every substring of length 10. If the substring has already been
seen, we add it to a separate set that tracks repeated sequences. If not, we add it to the set of seen substrings.
Finally, we return all repeated sequences as a list. This method is simple to understand but inefficient because it
builds and hashes a new 10-character string for every window, which adds up for large inputs.
33+
34+
We extract all k-length (where k = 10) substrings from the given string s, which has length n. This means we extract
(n−k+1) substrings. Each substring extraction takes O(k) time, and hashing a substring for a set operation also takes
O(k). Checking whether a substring is in a set or inserting it needs O(1) comparisons on average, but in the worst case
(hash collisions), a single operation can take O(n−k+1) comparisons. Therefore, the average-case time complexity is
O((n−k+1)×k), which simplifies to O((n−k)×k).

The space complexity of this approach is O((n−k)×k) because, in the worst case, our set can contain (n−k+1) substrings
of length k, and each iteration of the traversal allocates a new k-length substring.
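
The naive approach described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `repeated_sequences_naive` and the parameter `k` are assumptions introduced here, and the result is sorted only to make the output deterministic.

```python
from typing import List


def repeated_sequences_naive(s: str, k: int = 10) -> List[str]:
    """Return every k-letter substring that occurs more than once in s."""
    seen = set()      # substrings encountered so far
    repeated = set()  # substrings encountered at least twice

    # There are n - k + 1 windows; range() is empty when s is shorter than k
    for start in range(len(s) - k + 1):
        window = s[start:start + k]  # O(k) copy per window
        if window in seen:
            repeated.add(window)
        else:
            seen.add(window)

    return sorted(repeated)


print(repeated_sequences_naive("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# ['AAAAACCCCC', 'CCCCCAAAAA']
```

Two sets are used so that a substring occurring three or more times is still reported only once.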
41+
42+
### Optimized approach using sliding window
43+
Consecutive 10-letter windows overlap in 9 characters, so we can slide over the string and update a hash of the window
instead of creating and hashing a new substring every time. Rather than computing the hash from scratch for each
window, we derive it from the hash of the previous window as we slide forward on the string. This technique is
commonly known as a rolling hash.
47+
48+
The rolling hash can be divided into three main steps:
49+
- Initial hash calculation: Calculate the hash for the main string’s first window (substring).
- Slide the window: Move the window one character forward.
- Update the hash: Use the previous hash value to calculate the new hash without rescanning the whole substring.
  - Remove the hash contribution of the outgoing character.
  - Add the hash contribution of the incoming character.
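
To make the three steps concrete, here is a small sketch that rolls a hash over decimal digits instead of nucleotides, so each window hash can be read directly as a number. The helper name `rolling_hashes` and the base-10 setting are assumptions chosen purely for illustration.

```python
from typing import List


def rolling_hashes(digits: List[int], k: int, base: int = 10) -> List[int]:
    """Return the polynomial hash of every k-length window of digits."""
    if len(digits) < k:
        return []

    # Step 1: initial hash of the first window
    window_hash = 0
    for d in digits[:k]:
        window_hash = window_hash * base + d
    hashes = [window_hash]

    top_weight = base ** k  # weight of the outgoing digit after the shift
    # Steps 2 and 3: slide by one position and update the hash in O(1)
    for start in range(1, len(digits) - k + 1):
        outgoing = digits[start - 1]
        incoming = digits[start + k - 1]
        window_hash = window_hash * base - outgoing * top_weight + incoming
        hashes.append(window_hash)

    return hashes


print(rolling_hashes([3, 1, 4, 1, 5], k=3))  # [314, 141, 415]
```

For the windows 3-1-4, 1-4-1, and 4-1-5, the update `314 * 10 - 3 * 1000 + 1` gives 141 without rescanning the window.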
54+
55+
This optimized solution revolves around the rolling hash technique. First, we convert the characters 'A', 'C', 'G', and
'T' into the numerical values 0, 1, 2, and 3, respectively. Then, we compute the rolling hash for the first 10-letter
substring. As we slide the window forward one character at a time, we remove the old character on the left, add the new
character on the right, and update the hash to reflect this change. We use a set to track previously seen hash values.
At each step, we check whether the computed hash has been seen before. If a hash appears again, we store the
corresponding substring in the result.
61+
62+
Let’s look at the following illustration to get a better understanding of the solution:
63+
64+
![Step 1](./images/repeated_dna_sequences_illustration_one.png)
65+
![Step 2](./images/repeated_dna_sequences_illustration_two.png)
66+
![Step 3](./images/repeated_dna_sequences_illustration_three.png)
67+
![Step 4](./images/repeated_dna_sequences_illustration_four.png)
68+
![Step 5](./images/repeated_dna_sequences_illustration_five.png)
69+
![Step 6](./images/repeated_dna_sequences_illustration_six.png)
70+
![Step 7](./images/repeated_dna_sequences_illustration_seven.png)
71+
![Step 8](./images/repeated_dna_sequences_illustration_eight.png)
72+
![Step 9](./images/repeated_dna_sequences_illustration_nine.png)
73+
![Step 10](./images/repeated_dna_sequences_illustration_ten.png)
74+
![Step 11](./images/repeated_dna_sequences_illustration_eleven.png)
75+
![Step 12](./images/repeated_dna_sequences_illustration_twelve.png)
76+
![Step 13](./images/repeated_dna_sequences_illustration_thirteen.png)
77+
![Step 14](./images/repeated_dna_sequences_illustration_fourteen.png)
78+
![Step 15](./images/repeated_dna_sequences_illustration_fifteen.png)
79+
80+
### A step-by-step solution construction
81+
82+
#### Step 1: Encode characters into numbers
83+
84+
Before processing the DNA sequence in s, we must convert the characters 'A', 'C', 'G', and 'T' into numerical values.
This allows us to perform mathematical operations, like computing hashes, more efficiently. We assign 'A' → 0,
'C' → 1, 'G' → 2, and 'T' → 3.

We’ll define this mapping in a dictionary, to_int. Then, we’ll convert each character in s into its corresponding
numeric value and store it in a list, encoded_sequence. Let’s look at the code for this step:
89+
90+
```python
from typing import List


def find_repeated_dna_sequences(dna_sequence: str) -> List[int]:
    # Define a mapping of DNA characters to numerical values
    to_int = {"A": 0, "C": 1, "G": 2, "T": 3}

    # Convert each character in the input string to its corresponding number
    encoded_sequence = [to_int[c] for c in dna_sequence]

    # Return the encoded list of numbers (intermediate result for this step)
    return encoded_sequence
```
104+
105+
#### Step 2: Compute the first hash (rolling hash)
106+
107+
Now that we have the numerical form of the DNA sequence in s, we can compute a rolling hash for the first 10-letter
108+
substring. Hashing allows us to efficiently compare substrings without repeatedly checking each character.
109+
110+
For hashing, we’ll use the polynomial rolling hash:
111+
112+
`hash = (c1 × a^(k-1)) + (c2 × a^(k-2)) + ... + (ck × a^0)`

Here, ci is the numeric value of the i-th character of the substring, a is the size of our alphabet, i.e., 4 for the
DNA sequence in s, and k is the length of the substring for which we are computing the hash, i.e., 10 in our case.

By plugging in the values, we’ll compute the initial hash as follows:

`h0 = (c1 × 4^9) + (c2 × 4^8) + ... + (c10 × 4^0)`

Here, each character ci contributes one base-4 digit to the hash. The result uniquely represents the window, and we
can update it as we slide through the DNA string.
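
As a worked example, the sketch below evaluates the initial hash for the sequence "ACGAATTCCG" from the problem statement in two ways: iteratively, and term by term as in the formula. The helper names `initial_hash` and `initial_hash_explicit` are assumptions used only for this illustration.

```python
TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def initial_hash(window: str, base: int = 4) -> int:
    """Polynomial hash of a window, built one digit at a time."""
    value = 0
    for ch in window:
        value = value * base + TO_INT[ch]
    return value


def initial_hash_explicit(window: str, base: int = 4) -> int:
    """Same hash, written term by term as c_i * base^(k - i) for i = 1..k."""
    k = len(window)
    return sum(TO_INT[ch] * base ** (k - 1 - i) for i, ch in enumerate(window))


window = "ACGAATTCCG"
print(initial_hash(window))           # 99286
print(initial_hash_explicit(window))  # 99286
```

The iterative form performs one multiplication and one addition per character, and it is the form used to build the first window's hash in the code for this step.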
123+
124+
> Note: This hash is derived from the polynomial hash. If you want to dive deeper into how we mapped the general
> polynomial hash to our case, here is more information:
> For a sequence of numbers {n1, n2, n3, ..., nk}, where each ni represents a character converted into a number, the
> polynomial hash function in base a is `hash = (n1 × a^(k-1)) + (n2 × a^(k-2)) + ... + (nk × a^0)`. In the equation
> above, k is the length of the substring for which we are computing the hash, and a is the size of our alphabet, i.e.,
> 4 for the DNA sequence in s. This formula treats the sequence as a number in base-a notation, similar to how we
> represent numbers in base-10 or base-2. If we apply it to our case, k=10 as we are working with 10-letter substrings,
> and a=4 as we have 4 possible characters (A, C, G, T). So, plugging in the values:
> `hash = (n1 × 4^9) + (n2 × 4^8) + ... + (n10 × 4^0)`
> This formula gives a unique number for each 10-letter DNA sequence: every digit is smaller than the base and no
> modulus is applied, so two different sequences cannot share a hash.
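
Because every digit is below the base and no modulus is applied, a base-4 hash of a 10-letter window can be decoded back into that window, which is one way to check that distinct windows get distinct hashes. The sketch below uses hypothetical helpers `encode` and `decode`; neither is part of the solution itself.

```python
TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}
TO_CHAR = "ACGT"  # index -> nucleotide, the inverse of TO_INT


def encode(window: str) -> int:
    """Base-4 polynomial hash of a window."""
    value = 0
    for ch in window:
        value = value * 4 + TO_INT[ch]
    return value


def decode(value: int, k: int = 10) -> str:
    """Recover the k-letter window from its base-4 hash."""
    chars = []
    for _ in range(k):
        value, digit = divmod(value, 4)  # peel off the least significant digit
        chars.append(TO_CHAR[digit])
    return "".join(reversed(chars))


print(decode(encode("ACGAATTCCG")))  # ACGAATTCCG
```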
133+
134+
135+
In the code implementation, we’ll define and initialize a variable for each component of the equation above: the
constants `dna_sequence_length` (k = 10) and `base_a_encoding` (a = 4), the variable `rolling_hash_value` (h) to store
the rolling hash value, and `a_k` to hold 4^k. Then, we’ll use a loop to process the first 10 letters of s and compute
the first hash. This loop does the following:
- Builds a unique number (hash) for the first 10 letters iteratively.
- Prepares a multiplier (a_k) for future hash updates.

Let’s look at the code for this step:
142+
143+
```python
def find_repeated_dna_sequences(dna_sequence: str) -> int:
    # Define a mapping of DNA characters to numerical values
    to_int = {"A": 0, "C": 1, "G": 2, "T": 3}

    # Convert each character in the input string to its corresponding number
    encoded_sequence = [to_int[c] for c in dna_sequence]
    dna_sequence_length = 10  # Length of DNA sequence to check
    base_a_encoding = 4  # Base-4 encoding

    rolling_hash_value = 0
    a_k = 1  # Stores a^k for hash updates

    # Compute the initial hash using base-4 multiplication
    # (this step assumes the input has at least 10 characters)
    for i in range(dna_sequence_length):
        rolling_hash_value = rolling_hash_value * base_a_encoding + encoded_sequence[i]
        a_k *= base_a_encoding  # Precompute a^k for later use in rolling hash updates

    # Return the hash of the first window (intermediate result for this step)
    return rolling_hash_value
```
166+
167+
#### Step 3: Update the hash and use a set to track seen substrings
168+
169+
After computing the initial hash, we slide a window through the string, efficiently updating the hash. Instead of
170+
recomputing the hash from scratch at every step, we adjust it by:
171+
172+
- Removing the old character from the left.
173+
- Adding the new character on the right.
174+
175+
Using a rolling hash, the update formula becomes:
176+
177+
`new hash = (old hash × 4) − (leftmost digit × 4^10) + new digit`
178+
179+
In the code implementation, we’ll use a loop to slide over s and update the hash value, h, for each new window. As we
have already computed the hash for the first window, we’ll start our loop from index 1 of s. The variable start
always marks the first index of the window, and the last index of the window is start + k - 1. So, to remove the
contribution of the outgoing leftmost character and add the contribution of the incoming rightmost character, we'll
update h as follows:
184+
185+
`rolling_hash_value = (rolling_hash_value * base_a_encoding) - (encoded_sequence[start - 1] * a_k) + (encoded_sequence[start + dna_sequence_length - 1])`
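
The update rule follows from the polynomial hash itself. As a short derivation, write the old window as c_1, ..., c_k and the new window as c_2, ..., c_{k+1}, with a = 4 and k = 10 in this problem:

```latex
\begin{aligned}
h_{\text{old}} &= \sum_{i=1}^{k} c_i \, a^{k-i} \\
a \cdot h_{\text{old}} &= c_1 a^{k} + \sum_{i=2}^{k} c_i \, a^{k+1-i} \\
h_{\text{new}} &= \sum_{i=2}^{k+1} c_i \, a^{k+1-i}
  = a \cdot h_{\text{old}} - c_1 a^{k} + c_{k+1}
\end{aligned}
```

Multiplying by a shifts every digit one place to the left, which is why the outgoing character is removed with weight a^k = 4^10 rather than 4^9.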
186+
187+
We’ll use a set, `seen_hashes`, to track hashes we’ve seen before. If a hash, h, appears again, we add the corresponding
188+
substring to the result, output. Let’s look at the code for this step:
189+
190+
```python
from typing import List


def find_repeated_dna_sequences(dna_sequence: str) -> List[str]:
    # Define a mapping of DNA characters to numerical values
    to_int = {"A": 0, "C": 1, "G": 2, "T": 3}

    # Convert each character in the input string to its corresponding number
    encoded_sequence = [to_int[c] for c in dna_sequence]
    # Window length to check and total length of the input
    dna_sequence_substr_length, dna_sequence_length = 10, len(dna_sequence)
    base_a_encoding = 4  # Base-4 encoding

    # With at most one full window, nothing can repeat
    if dna_sequence_length <= dna_sequence_substr_length:
        return []

    rolling_hash_value = 0
    a_k = 1  # Stores a^k for hash updates

    # Compute the initial hash using base-4 multiplication
    for i in range(dna_sequence_substr_length):
        rolling_hash_value = rolling_hash_value * base_a_encoding + encoded_sequence[i]
        a_k *= base_a_encoding  # Precompute a^k for later use in rolling hash updates

    seen_hashes, output = set(), set()  # Sets to track hashes and repeated sequences
    seen_hashes.add(rolling_hash_value)  # Store the initial hash

    # Sliding window approach to update the hash efficiently
    for start in range(1, dna_sequence_length - dna_sequence_substr_length + 1):
        # Remove the leftmost character and add the new rightmost character
        rolling_hash_value = (
            rolling_hash_value * base_a_encoding
            - encoded_sequence[start - 1] * a_k
            + encoded_sequence[start + dna_sequence_substr_length - 1]
        )

        # If this hash has been seen before, add the corresponding substring to the output
        if rolling_hash_value in seen_hashes:
            output.add(dna_sequence[start : start + dna_sequence_substr_length])
        else:
            seen_hashes.add(rolling_hash_value)

    return list(output)  # Convert set to list before returning
```
227+
228+
### Solution Summary
229+
Let’s get a quick recap of the optimized solution:
230+
231+
- Encode the DNA sequence in s by converting 'A', 'C', 'G', and 'T' into numbers (0, 1, 2, 3) for easier computation.
- Use a set to store seen hashes and detect repeating sequences.
- Compute the rolling hash for the first 10-letter substring and store it in the set.
- Move the window one step forward and compute the hash of the new window. If this hash is already in the set, store
  the corresponding substring in the result. Otherwise, add the hash to the set.
- Once the entire string has been traversed, return the result containing all the repeating 10-letter-long sequences.
241+
242+
### Time Complexity
243+
Let’s break down and analyze the time complexity of this solution:
244+
245+
- We go through the input string once to convert characters into numbers, which takes O(n)
246+
- We compute the first rolling hash in O(k) time (where k = 10). As k is a fixed small number, it is treated as O(1)
247+
- Then, we slide through the string once, updating the hash in O(1) time for each step. Overall, it takes O(n−k) time,
248+
which can be simplified to O(n) as k is a constant.
249+
- Checking and storing hashes in a set is O(1) on average.
250+
251+
If we sum these up, the overall time complexity simplifies to: `O(n)+O(1)+O(n)+O(1)=O(n)`
252+
253+
### Space Complexity
254+
Let’s break down and analyze the space complexity of this solution:
255+
256+
- We store the encoded sequence as a list of numbers, which takes O(n) space.
257+
- We store hashes in a set that would have, at most, n−k+1 entries. This can be simplified to O(n) space.
258+
- We store repeated sequences in another set that takes, at most, n−k+1 unique sequences. This can be simplified to
259+
O(n) space.
260+
261+
If we sum these up, the overall space complexity becomes: `O(n)+O(n)+O(n)=O(n)`
262+
263+
---
264+
265+
## Tags
266+
- Hash Table
267+
- String
268+
- Bit Manipulation
269+
- Sliding Window
270+
- Rolling Hash
271+
- Hash Function
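
The Bit Manipulation tag points at a common variant of the same idea: each nucleotide fits in 2 bits, so a 10-letter window fits in 20 bits, and the rolling update becomes a shift followed by a mask. The sketch below is an illustrative alternative under that assumption, not the implementation in this commit; the name `find_repeated_dna_sequences_bitmask` is hypothetical.

```python
from typing import List


def find_repeated_dna_sequences_bitmask(s: str, k: int = 10) -> List[str]:
    to_bits = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    if len(s) <= k:
        return []

    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    window_bits = 0
    seen, repeated = set(), set()

    for end, ch in enumerate(s):
        # Shift in the new nucleotide; the mask drops the outgoing one
        window_bits = ((window_bits << 2) | to_bits[ch]) & mask
        if end >= k - 1:  # a full window ends at index `end`
            if window_bits in seen:
                repeated.add(s[end - k + 1:end + 1])
            else:
                seen.add(window_bits)

    return sorted(repeated)  # sorted only to make the output deterministic


print(find_repeated_dna_sequences_bitmask("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"))
# ['AAAAACCCCC', 'CCCCCAAAAA']
```

Masking replaces the explicit subtraction of the outgoing character, since the two bits that fall outside the 20-bit window are simply discarded.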
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
1+
from typing import List


def find_repeated_dna_sequences_naive(dna_sequence: str) -> List[str]:
    """
    Finds all repeated 10-letter-long DNA sequences in a given string.

    A repeated DNA sequence is a contiguous 10-character substring that appears more than once in the given string.
    The function returns a list of all repeated DNA sequences found in the given string.
    Parameters:
        dna_sequence (str): The string to search for repeated DNA sequences.

    Returns:
        List[str]
    """
    if len(dna_sequence) <= 10:
        return []

    result_set = set()
    seen = set()
    # Only visit start indices that leave room for a full 10-character window
    for idx in range(len(dna_sequence) - 10 + 1):
        subsequence = dna_sequence[idx:idx + 10]
        if subsequence in seen:
            result_set.add(subsequence)
        else:
            seen.add(subsequence)

    return list(result_set)
31+
32+
def find_repeated_dna_sequences(dna_sequence: str) -> List[str]:
    """
    Finds all repeated 10-letter-long DNA sequences in a given string.

    A repeated DNA sequence is a contiguous 10-character substring that appears more than once in the given string.
    The function returns a list of all repeated DNA sequences found in the given string.
    Parameters:
        dna_sequence (str): The string to search for repeated DNA sequences.

    Returns:
        List[str]
    """
    to_int = {"A": 0, "C": 1, "G": 2, "T": 3}

    # Validate input contains only valid DNA bases
    if not all(c in to_int for c in dna_sequence):
        raise ValueError("DNA sequence contains invalid characters. Only A, C, G, T are allowed.")

    encoded_sequence = [to_int[c] for c in dna_sequence]

    # Window length to check and total length of the input
    dna_sequence_substr_length, dna_sequence_length = 10, len(dna_sequence)

    # With at most one full window, nothing can repeat
    if dna_sequence_length <= dna_sequence_substr_length:
        return []

    base_a_encoding = 4  # Base-4 encoding
    rolling_hash_value = 0
    seen_hashes, output = set(), set()
    a_k = 1  # Stores a^k for hash updates

    # Compute the initial hash using base-4 multiplication
    for i in range(dna_sequence_substr_length):
        rolling_hash_value = rolling_hash_value * base_a_encoding + encoded_sequence[i]
        a_k *= base_a_encoding  # Precompute a^k for later use in rolling hash updates

    seen_hashes.add(rolling_hash_value)  # Store the initial hash

    # Sliding window approach to update the hash efficiently
    for start in range(1, dna_sequence_length - dna_sequence_substr_length + 1):
        # Remove the leftmost character and add the new rightmost character
        rolling_hash_value = (
            rolling_hash_value * base_a_encoding
            - encoded_sequence[start - 1] * a_k
            + encoded_sequence[start + dna_sequence_substr_length - 1]
        )

        # If this hash has been seen before, add the corresponding substring to the output
        if rolling_hash_value in seen_hashes:
            output.add(dna_sequence[start:start + dna_sequence_substr_length])
        else:
            seen_hashes.add(rolling_hash_value)

    # Convert set to list before returning
    return list(output)
82+