Commit d55074b

Merge pull request #72 from 4ndrelim/branch-Trie
Branch trie
2 parents f7c27de + 41875ac commit d55074b

10 files changed: +381 -297 lines changed

README.md

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@ Gradle is used for development.
 - [Monotonic Queue](src/main/java/dataStructures/queue/monotonicQueue)
 - Segment Tree
 - [Stack](src/main/java/dataStructures/stack)
-- Trie
+- [Trie](src/main/java/dataStructures/trie)

 ## Algorithms
 - [Bubble Sort](src/main/java/algorithms/sorting/bubbleSort)
@@ -81,9 +81,9 @@ Gradle is used for development.
 * [Binary search tree](src/main/java/dataStructures/binarySearchTree)
 * AVL-tree
 * Orthogonal Range Searching
-* Trie
+* [Trie](src/main/java/dataStructures/trie)
 * B-Tree
-* * Red-Black Tree (Not covered in CS2040s but useful!)
+* Red-Black Tree (Not covered in CS2040s but useful!)
 * Kd-tree (**WIP**)
 * Interval tree (**WIP**)
 5. [Binary Heap](src/main/java/dataStructures/heap) (Max heap)

docs/assets/images/Trie.png

454 KB

src/main/java/algorithms/patternFinding/README.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ in text editors when searching for a pattern, in computational biology sequence
 in NLP problems, and even for looking for file patterns for effective file management.
 It is hence crucial that we develop an efficient algorithm.

+Typically, the algorithm returns a list of indices that denote the start of each occurrence of the pattern string.
+
 ![KMP](../../../../../docs/assets/images/kmp.png)
 Image Source: GeeksforGeeks

src/main/java/dataStructures/disjointSet/weightedUnion/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # Quick Union
-
+If you wish to skip ahead, jump to [weighted union](#Weighted-Union).
 ## Background
 Here, we consider a completely different approach. We consider the use of trees. Every element can be
 thought of as a tree node and starts off in its own component. Under this representation, it is likely
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Trie

## Background
A trie (pronounced as 'try'), also known as a prefix tree, is often used for handling textual data, especially in
scenarios involving prefixes. In fact, the term 'trie' comes from the word 'retrieval'.

Like most trees, a trie is composed of nodes and edges. But, unlike binary trees, a node can have more than
2 children. A trie stores words by breaking them down into characters and organising these characters within a
hierarchical tree. Each node represents a single character, except the root, which does not represent any character
but acts as a starting point for all the words stored. A path in the trie, which is a sequence of connected nodes
from the root, represents a prefix or a whole word. Shared prefixes of different words are represented by common paths.

To distinguish complete words from prefixes within the trie, nodes are often implemented with a boolean flag.
This flag is set to true for nodes that correspond to the final character of a complete word and false otherwise.

<div align="center">
    <img src="../../../../../docs/assets/images/Trie.png" alt="Trie" style="width:80%"/>
    <br/>
    <em>Source: <a href="https://java2blog.com/trie-data-structure-in-java/">Java2Blog</a></em>
</div>

## Complexity Analysis
Let the length of the longest word be _L_ and the number of words be _N_.

**Time**: O(_L_)
This is an upper bound for typical trie operations like insert, delete, and search,
since each operation iterates over at most every character of the word.

**Space**: O(_N*L_)
In the worst case, we can have minimal overlap between words and every character of every word needs to be captured
with a node.

A trie can be space-intensive. For a very large corpus of words, with the naive assumption of characters being
equally likely to occur in any position, another naive estimate of the size of the tree is O(_26^l_), where _l_ is
the average length of a word. Note, 26 is used since there are only 26 letters in the English alphabet.
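
One way to see where the O(_26^l_) estimate comes from, under the same naive assumption that every node down to depth _l_ has all 26 children, is to sum the nodes level by level (the root is level 0):

```latex
\sum_{i=0}^{l} 26^{i} \;=\; \frac{26^{\,l+1} - 1}{25} \;=\; O(26^{l})
```

For instance, with _l_ = 3 this gives 1 + 26 + 676 + 17576 = 18279 nodes. A real corpus is usually far sparser than this, so the figure is best read as a pessimistic ceiling rather than a typical size.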

## Operations
Here we briefly discuss the typical operations supported by a trie.

### Insert
Starting at the root, iterate over the characters and move down the trie to the respective nodes, creating missing
ones in the process. Once the end of the word is reached, set the boolean flag of the node representing the last
character to true.

### Search
Starting at the root, iterate over the characters and move down the trie to the respective nodes.
If at any point the required character node is missing, return false. Otherwise, continue traversing until the end of
the word and check if the current node has its boolean flag set to true. If not, the word is not captured in the trie.
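
The insert and search steps described above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: it assumes words contain only English letters and stores children in a fixed array of 26 slots instead of a map. The names `ArrayTrie`, `Node` and `indexOf` are hypothetical.

```java
class ArrayTrie {
    private static final int ALPHABET_SIZE = 26; // assumption: only 'a'..'z' after lower-casing

    private static class Node {
        final Node[] children = new Node[ALPHABET_SIZE];
        boolean isEnd; // true if the path from the root to this node spells a stored word
    }

    private final Node root = new Node();

    /** Walks down from the root, creating missing nodes, then flags the last node. */
    void insert(String word) {
        String lower = word.toLowerCase();
        for (char c : lower.toCharArray()) { // validate first so a bad word leaves no stray nodes
            if (indexOf(c) < 0) {
                throw new IllegalArgumentException("unsupported character: " + c);
            }
        }
        Node trav = root;
        for (char c : lower.toCharArray()) {
            int idx = indexOf(c);
            if (trav.children[idx] == null) {
                trav.children[idx] = new Node();
            }
            trav = trav.children[idx];
        }
        trav.isEnd = true;
    }

    /** Returns true only if every character node exists and the last one is flagged. */
    boolean search(String word) {
        Node trav = root;
        for (char c : word.toLowerCase().toCharArray()) {
            int idx = indexOf(c);
            if (idx < 0 || trav.children[idx] == null) {
                return false; // required character node is missing
            }
            trav = trav.children[idx];
        }
        return trav.isEnd;
    }

    /** Maps 'a'..'z' to 0..25, or -1 for anything else. */
    private static int indexOf(char c) {
        return (c >= 'a' && c <= 'z') ? c - 'a' : -1;
    }
}
```

An array gives constant-time child lookup without hashing, at the cost of 26 references per node even when most are unused. A map is the more flexible choice when the alphabet is large or sparse.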

### Delete
Starting at the root, iterate over the characters and move down the trie to the respective nodes.
If at any point the required character node is missing, then the word does not exist in the trie and the process
is terminated. Otherwise, continue traversing until the end of the word and set the boolean flag of the current node
to false.

### Delete With Pruning
Sometimes, a trie can become huge. Deleting old words would still leave redundant nodes hanging around. These can
accumulate over time, so it is crucial we prune away unused nodes.

Continuing off the delete operation, trace the path back to the root, and if any redundant nodes are found (nodes
that do not mark the end of a word and have no descendant nodes), remove them. Stop at the first node that is still
in use.
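
One possible way to sketch delete with pruning is recursively: each call clears or descends, then reports to its parent whether the node has become redundant. This is an illustrative variant under the same map-based node layout, not necessarily how the repository does it. The names `PruningTrie`, `remove` and `nodeCount` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class PruningTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node trav = root;
        for (char c : word.toLowerCase().toCharArray()) {
            trav = trav.children.computeIfAbsent(c, k -> new Node());
        }
        trav.isEnd = true;
    }

    boolean search(String word) {
        Node trav = root;
        for (char c : word.toLowerCase().toCharArray()) {
            trav = trav.children.get(c);
            if (trav == null) {
                return false;
            }
        }
        return trav.isEnd;
    }

    void deleteAndPrune(String word) {
        remove(root, word.toLowerCase(), 0); // the root itself is never unlinked
    }

    /** Returns true if {@code node} is now redundant and its parent should unlink it. */
    private boolean remove(Node node, String word, int depth) {
        if (depth == word.length()) {
            node.isEnd = false; // un-mark the word
        } else {
            char c = word.charAt(depth);
            Node child = node.children.get(c);
            if (child == null) {
                return false; // word is not in the trie, nothing to prune
            }
            if (remove(child, word, depth + 1)) {
                node.children.remove(c);
            }
        }
        // redundant = does not end a word and has no descendants
        return !node.isEnd && node.children.isEmpty();
    }

    /** Number of nodes excluding the root; only here to make pruning observable. */
    int nodeCount() {
        return count(root) - 1;
    }

    private int count(Node node) {
        int total = 1;
        for (Node child : node.children.values()) {
            total += count(child);
        }
        return total;
    }
}
```

After inserting "car" and "cart", deleting "cart" removes only the `t` node, because the `r` node still ends the word "car". Deleting "car" afterwards prunes the whole branch.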

### Augmentation
Just like how Orthogonal Range Searching can be done by augmenting the usual balanced BSTs, a trie can be augmented
with additional variables captured in the TrieNode to speed up queries of a certain kind. For instance, if one wishes
to quickly find out how many complete words stored in a trie have a given prefix, one can track the number of
descendant nodes whose boolean flag is set to true at each node.
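
The prefix-count augmentation might look like the sketch below. It assumes each node stores a counter of complete words in its subtree, counting the node itself, which is a slight variation on counting descendants only. The names `CountingTrie`, `wordCount` and `countWordsWithPrefix` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class CountingTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
        int wordCount; // complete words stored in this node's subtree, this node included
    }

    private final Node root = new Node();

    void insert(String word) {
        word = word.toLowerCase();
        if (search(word)) {
            return; // already stored; counting it again would inflate the counters
        }
        Node trav = root;
        trav.wordCount++;
        for (char c : word.toCharArray()) {
            trav = trav.children.computeIfAbsent(c, k -> new Node());
            trav.wordCount++; // every node on the path gains one word in its subtree
        }
        trav.isEnd = true;
    }

    boolean search(String word) {
        Node trav = root;
        for (char c : word.toLowerCase().toCharArray()) {
            trav = trav.children.get(c);
            if (trav == null) {
                return false;
            }
        }
        return trav.isEnd;
    }

    /** Answers the query in time proportional to the prefix length, with no subtree traversal. */
    int countWordsWithPrefix(String prefix) {
        Node trav = root;
        for (char c : prefix.toLowerCase().toCharArray()) {
            trav = trav.children.get(c);
            if (trav == null) {
                return 0;
            }
        }
        return trav.wordCount;
    }
}
```

The trade-off is one extra integer per node and a little bookkeeping on insert. A delete operation would need to decrement the same counters along the path.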

## Notes
### Applications
- [auto-completion](https://medium.com/geekculture/how-to-effortlessly-implement-an-autocomplete-data-structure-in-javascript-using-a-trie-ea87a7d5a804)
- [spell-checker](https://medium.com/@vithusha.ravirajan/enhancing-spell-checking-with-trie-data-structure-eb649ee0b1b5)
- [prefix matching](https://medium.com/@shenchenlei/how-to-implement-a-prefix-matcher-using-trie-tree-1aea9a01013)
- sorting large datasets of textual data
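
As a rough illustration of the sorting application, a trie whose children are visited in ascending character order yields its words in lexicographic order through a pre-order traversal. The sketch below assumes a `TreeMap` for the children so that keys stay sorted. The names `SortedTrie`, `sortedWords` and `collect` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedTrie {
    private static class Node {
        final Map<Character, Node> children = new TreeMap<>(); // iterates keys in ascending order
        boolean isEnd;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node trav = root;
        for (char c : word.toLowerCase().toCharArray()) {
            trav = trav.children.computeIfAbsent(c, k -> new Node());
        }
        trav.isEnd = true;
    }

    /** Returns the distinct stored words in lexicographic order. */
    List<String> sortedWords() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    /** Pre-order walk: emit the current word (if any) before descending into longer ones. */
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isEnd) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            path.append(c);
            collect(entry.getValue(), path, out);
            path.deleteCharAt(path.length() - 1); // backtrack before the next sibling
        }
    }
}
```

Because a trie stores each word once, duplicates in the input collapse. Tracking a per-node occurrence count would be one way to keep them.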
Lines changed: 146 additions & 65 deletions
@@ -1,12 +1,12 @@
 package dataStructures.trie;

+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
 /**
- * Implementation of Trie structure.
- * Supports the follwing common operations (see below for doc):
- * insert(String word)
- * search(String word)
- * startsWith(String prefix)
- * prune(String word)
+ * Implementation of a Trie; Here we consider strings (not case-sensitive)
  */
 public class Trie {
     private final TrieNode root;
@@ -16,98 +16,179 @@ public Trie() {
     }

     /**
-     * Insert a word into the trie; converts word to
-     * to lower-case characters before insertion.
-     *
-     * @param word the string to be inserted
+     * TrieNode implementation. Note, fields are set to public for decreased verbosity.
+     */
+    private class TrieNode {
+        // CHECKSTYLE:OFF: VisibilityModifier
+        public Map<Character, TrieNode> children; // or array of size 26 (assume not case-sensitive) to denote each char
+        // CHECKSTYLE:OFF: VisibilityModifier
+        public boolean isEnd; // a marker to indicate whether the path from the root to this node forms a known word
+
+        public TrieNode() {
+            children = new HashMap<Character, TrieNode>();
+            isEnd = false;
+        }
+    }
+
+    /**
+     * Inserts a word into the trie.
+     * @param word
      */
     public void insert(String word) {
-        word = word.toLowerCase();
-        System.out.printf("~~~~~~~Inserting '%s'~~~~~~~%n", word);
-        TrieNode node = root;
+        word = word.toLowerCase(); // ignore case-sensitivity
+        TrieNode trav = root;
         for (int i = 0; i < word.length(); i++) {
             char curr = word.charAt(i);
-            if (!node.containsKey(curr)) {
-                node.insertKey(curr);
+            if (!trav.children.containsKey(curr)) {
+                trav.children.put(curr, new TrieNode()); // recall, the edges represent the characters
             }
-            node = node.getNext(curr); // go to the subsequent node!
+            trav = trav.children.get(curr);
         }
-        node.makeEnd();
+        trav.isEnd = true; // set word
     }

     /**
-     * Search for a word (converted to lower-case) in the trie.
-     *
-     * @param word the string to look for
-     * @return boolean representing whether the word was found
+     * Searches for a word in the trie.
+     * @param word
+     * @return true if the word is found, false otherwise.
      */
     public boolean search(String word) {
-        word.toLowerCase();
-        System.out.printf("~~~~~~~Searching '%s'~~~~~~~%n", word);
-        TrieNode node = root;
+        word = word.toLowerCase();
+        TrieNode trav = root;
         for (int i = 0; i < word.length(); i++) {
             char curr = word.charAt(i);
-            if (node.containsKey(curr)) {
-                node = node.getNext(curr);
-            } else {
+            if (!trav.children.containsKey(curr)) {
                 return false;
             }
+            trav = trav.children.get(curr);
         }
-        return node.isEnd();
+        return trav.isEnd;
     }

     /**
-     * Search for a prefix (converted to lower-case) in the trie.
-     * Note: very similar in implementation to search method
-     * except the search here does not need to look for end flag
-     *
-     * @param prefix the string to look for
-     * @return boolean representing whether the prefix exists
+     * Deletes a word from the trie.
+     * @param word
      */
-    public boolean startsWith(String prefix) {
-        prefix = prefix.toLowerCase();
-        System.out.printf("~~~~~~~Looking for prefix '%s'~~~~~~~%n", prefix);
-        TrieNode node = root;
-        for (int i = 0; i < prefix.length(); i++) {
-            char curr = prefix.charAt(i);
-            if (node.containsKey(curr)) {
-                node = node.getNext(curr);
-            } else {
-                return false;
+    public void delete(String word) {
+        word = word.toLowerCase();
+        TrieNode trav = root;
+        for (int i = 0; i < word.length(); i++) {
+            char curr = word.charAt(i);
+            if (!trav.children.containsKey(curr)) {
+                return; // word does not exist in trie, so just return
             }
+            trav = trav.children.get(curr);
         }
-        return true;
+        trav.isEnd = false; // remove word from being tracked
     }

+    // ABOVE ARE STANDARD METHODS OF A TYPICAL TRIE IMPLEMENTATION
+    // BELOW IMPLEMENTS TWO MORE COMMON / USEFUL METHODS FOR TRIE; IN PARTICULAR, NOTE THE PRUNING METHOD
+
     /**
-     * Removes a word from the trie by toggling the end flag;
-     * if any of the end nodes (next nodes relative to current)
-     * do not hold further characters, repetitively prune the trie
-     * by removing these nodes from the hashmap of the current node.
-     * Note: This method is useful in optimizing searching for a set of known words
-     * especially when the data to be traversed has words that are similar in spelling/
-     * repeated words which might have been previously found.
-     *
-     * @param word the word to be removed
+     * Deletes a word from the trie, and also prunes redundant nodes. This is useful in keeping the trie compact.
+     * @param word
      */
-    public void prune(String word) {
-        word = word.toLowerCase();
-        System.out.printf("~~~~~~~Removing '%s'~~~~~~~%n", word);
-        TrieNode node = root;
-        TrieNode[] track = new TrieNode[word.length()];
+    public void deleteAndPrune(String word) {
+        word = word.toLowerCase(); // insert() stores lower-case keys, so normalise here as well
+        List<TrieNode> trackNodes = new ArrayList<>();
+        TrieNode trav = root;
         for (int i = 0; i < word.length(); i++) {
             char curr = word.charAt(i);
-            track[i] = node;
-            node = node.getNext(curr);
+            if (!trav.children.containsKey(curr)) {
+                return; // word does not exist in trie
+            }
+            trackNodes.add(trav);
+            trav = trav.children.get(curr);
         }
-        node.removeEnd();
+        trav.isEnd = false;
+
+        // now we start pruning
         for (int i = word.length() - 1; i >= 0; i--) {
             char curr = word.charAt(i);
-            if (track[i].getNext(curr).getCharacters().size() > 0) {
-                break; // done further nodes are required
+            TrieNode nodeBeforeCurr = trackNodes.get(i);
+            TrieNode nextNode = nodeBeforeCurr.children.get(curr);
+            if (!nextNode.isEnd && nextNode.children.size() == 0) { // node essentially doesn't track anything, remove
+                nodeBeforeCurr.children.remove(curr);
+            } else { // nextNode ends a word or still has children, i.e. it is still useful; stop pruning upwards
+                break;
+            }
+        }
+    }
+
+    /**
+     * Finds all words with the specified prefix.
+     * @param prefix
+     * @return a list of words.
+     */
+    public List<String> wordsWithPrefix(String prefix) {
+        prefix = prefix.toLowerCase(); // words are stored in lower-case, so normalise the prefix as well
+        List<String> ret = new ArrayList<>();
+        TrieNode trav = root;
+        for (int i = 0; i < prefix.length(); i++) {
+            char curr = prefix.charAt(i);
+            if (!trav.children.containsKey(curr)) {
+                return ret; // no words with this prefix
+            }
+            trav = trav.children.get(curr);
+        }
+        List<StringBuilder> allSuffix = getAllSuffixFromNode(trav);
+        for (StringBuilder sb : allSuffix) {
+            ret.add(prefix + sb.toString());
+        }
+        return ret;
+    }
+
+    /**
+     * Find all words in the trie.
+     * @return a list of words.
+     */
+    public List<String> getAllWords() {
+        List<StringBuilder> allWords = getAllSuffixFromNode(root);
+        List<String> ret = new ArrayList<>();
+        for (StringBuilder sb : allWords) {
+            ret.add(sb.toString());
+        }
+        return ret;
+    }
+
+    /**
+     * Helper method to get suffix from the node.
+     * @param node
+     * @return
+     */
+    private List<StringBuilder> getAllSuffixFromNode(TrieNode node) {
+        List<StringBuilder> ret = new ArrayList<>();
+        if (node.isEnd) {
+            ret.add(new StringBuilder(""));
+        }
+        for (char c : node.children.keySet()) {
+            TrieNode nextNode = node.children.get(c);
+            List<StringBuilder> allSuffix = getAllSuffixFromNode(nextNode);
+            for (StringBuilder sb : allSuffix) {
+                sb.insert(0, c); // insert c at the front
+                ret.add(sb);
+            }
+        }
+        return ret;
+    }
+
+    // BELOW IS A METHOD THAT IS USED FOR TESTING PURPOSES ONLY
+
+    /**
+     * Helper method for testing purposes.
+     * @param str
+     * @param pos
+     * @return
+     */
+    public Boolean checkNodeExistsAtPosition(String str, Integer pos) {
+        TrieNode trav = root;
+        for (int i = 0; i < pos; i++) {
+            char c = str.charAt(i);
+            if (trav.children.containsKey(c)) {
+                trav = trav.children.get(c);
             } else {
-                track[i].getCharacters().remove(curr);
+                return false;
             }
         }
+        return true;
     }
 }
