README.md: 21 additions & 15 deletions
@@ -7,11 +7,11 @@
LZW is an archive format that utilizes the power of the LZW compression algorithm. The LZW compression algorithm is a dictionary-based lossless algorithm. It's an old algorithm, suitable for beginners to practice on.
- Internal algorithm processes byte data. So it's applicable to any file types, besides text file. Although it may not be able to achieve substantial compression rate for some file types that are already compressed efficiently, such as PDF files and MP4 files. It treats data as byte stream, unaware of the text-level pattern, which makes it less compression-efficient compared to other more advanced compression algorithm.
+ Internal algorithm processes byte data. So it's applicable to any file types, besides text file. Although it may not be able to achieve substantial compression rate for some file types that are already compressed efficiently, such as PDF files and MP4 files. It treats data as byte stream, unaware of the text-level pattern, which makes it less compression-efficient compared to other more advanced compression algorithms.
The LZW compression algorithm is dynamic. It doesn't collect data statistics beforehand. Instead, it learns the data pattern while performing the compression, building a code table on the fly. The compression ratio approaches its maximum after enough input has been consumed. The algorithmic complexity is strictly linear in the size of the input. [A more in-depth algorithmic analysis can be found in the following sections.](#algorithmic-analysis)
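To make the description above concrete, here is a minimal sketch of the LZW compression loop, with the two branches labelled as in the cost-model analysis further below. It is only an illustration under assumptions: the `Code` type, the seeding of the dict with the 256 single-byte strings, and the use of `std::unordered_map` are choices made for this sketch, not the library's actual interface.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Code = std::uint32_t;  // assumed code type; the real code width may differ

// Sketch of the LZW compression loop: the code dict starts with the 256
// single-byte strings and grows while the input is being consumed.
std::vector<Code> lzw_compress(const std::string& data) {
    std::unordered_map<std::string, Code> code_dict;
    for (int b = 0; b < 256; ++b)
        code_dict.emplace(std::string(1, static_cast<char>(b)), static_cast<Code>(b));

    std::vector<Code> output;
    std::string current;  // longest prefix currently known to the dict
    for (char byte : data) {
        std::string extended = current + byte;   // str.concatenate
        if (code_dict.count(extended)) {         // dict.membership check
            current = extended;                  // branch A: keep extending (str.copy)
        } else {
            output.push_back(code_dict[current]);              // dict.lookup
            Code next = static_cast<Code>(code_dict.size());
            code_dict.emplace(extended, next);                 // dict.add
            current = std::string(1, byte);                    // branch B: restart (str.copy)
        }
    }
    if (!current.empty()) output.push_back(code_dict[current]);
    return output;
}
```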
- An alternative implementation that utilizes more efficient self-made customized `bitmap`, `dict` and `set` to replace C++ builtin general-purpose `std::bitset`, `std::unordered_map` and `std::set` can be found in the branch `assignment`. Future enhancement includes customized `hash` functions to replace builtin general-purpose `std::hash`.
+ An alternative implementation that utilizes more efficient self-made customized `bitmap`, `dict` and `set` to replace C++ builtin general-purpose `std::bitset`, `std::unordered_map` and `std::set` can be found in the branch [`assignment`](https://github.com/MapleCCC/liblzw/tree/assignment). Future enhancement includes customized `hash` functions to replace builtin general-purpose `std::hash`.
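The customized `hash` functions mentioned as a future enhancement are not implemented in this branch. Purely as an illustration of what a replacement for the general-purpose `std::hash<std::string>` could look like, here is a small FNV-1a-style hash functor for byte strings; the name `ByteStringHash` and its use as the map's hash parameter are hypothetical, not part of the repository.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical illustration: an FNV-1a hash functor that could replace
// std::hash<std::string> as the hash parameter of the code dict.
struct ByteStringHash {
    std::size_t operator()(const std::string& s) const noexcept {
        std::uint64_t h = 14695981039346656037ull;  // FNV-1a offset basis
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ull;  // FNV-1a prime
        }
        return static_cast<std::size_t>(h);
    }
};

// Usage sketch: a code dict keyed by byte strings, using the custom hash.
using CodeDict = std::unordered_map<std::string, std::uint32_t, ByteStringHash>;
```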
Contribution is welcome. When committing new code, make sure to apply the format specified in the `.clang-format` config file. Also remember to add `scripts/pre-commit.py` to `.git/hooks/pre-commit` as the pre-commit hook script.
The pre-commit hook script basically does two things:
1. Format staged C/C++ code
- 2. Transform LaTeX math equation in `README.raw.md` to image url in `README.md`
+ 2. Transform `LaTeX` math equation in `README.raw.md` to image url in `README.md`
- Besides relying on the pre-commit hook script, you can manually format code and transform math equations in README.md
+ Besides relying on the pre-commit hook script, you can manually format code and transform math equations in `README.md`.
```bash
- make reformat
- make transform-eqn
+ $ make reformat
+ $ make transform-eqn
```
The advantage of the pre-commit hook script, compared to manually triggering the scripts, is that it's convenient and non-disruptive, as it only introduces changes to staged files, not to all files in the repo.
@@ -116,15 +122,15 @@ The cost models for these two branches are, respectively:
Suppose the source text byte length is $N$. Among the $N$ bytes consumed by the algorithm, there are $M$ bytes for which the algorithm goes to branch $A$, and it goes to branch $B$ for the other $N-M$ bytes.
- For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership\_check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
+ For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership-check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
The total cost model of the compression process can then be summarized as:

$$
C_{\mathrm{total}} = N * (C(\mathrm{str.concatenate}) + C(\mathrm{dict.membership\_check})) \\
+ M * C(\mathrm{str.copy}) + (N - M) * (C(\mathrm{dict.lookup}) + C(\mathrm{dict.add}) + C(\mathrm{str.copy}))
$$

For input data that doesn't have many repeated byte patterns, $M$ is small compared to $N$ (i.e. $M \ll N$). The cost model approximates to:

$$
C_{\mathrm{total}} \approx N * (C(\mathrm{str.concatenate}) + C(\mathrm{dict.membership\_check}) + C(\mathrm{dict.lookup}) + C(\mathrm{dict.add}) + C(\mathrm{str.copy}))
$$

If the underlying data structure implementation of the code dict is a hash table, then $C(\mathrm{dict.membership\_check})$ and $C(\mathrm{dict.add})$ are both $O(1)$ operations. The total cost is then $O(N)$.
@@ -150,11 +156,11 @@ The cost model for these two branches is then:
Suppose the code stream length is $N$. Among the $N$ codes consumed by the algorithm, there are $M$ codes for which the algorithm goes to branch $A$, and it goes to branch $B$ for the other $N-M$ codes.
- For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership\_check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
+ For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership-check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
The probability of going to branch $B$ is relatively rare, so the major cost comes from branch $A$. The total cost model for the decompression algorithm is then:

It's the same as that of the compression algorithm! The total cost model for the decompression algorithm turns out to be identical to that of the compression algorithm: they are both linear, $O(N)$ (under the precondition that the underlying implementations of the string dict and the code dict scale sublinearly).
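For comparison, here is a minimal sketch of the decompression loop this cost model describes, with branch A handling codes already present in the string dict and branch B handling the rare self-referencing code. As with the compression sketch above, the containers and the `Code` type are assumptions made for illustration, not the library's actual interface.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

using Code = std::uint32_t;  // assumed code type, matching the compression sketch

// Sketch of the LZW decompression loop: the string dict is rebuilt on the fly,
// mirroring the code dict that the compressor built.
std::string lzw_decompress(const std::vector<Code>& codes) {
    std::unordered_map<Code, std::string> string_dict;
    for (int b = 0; b < 256; ++b)
        string_dict.emplace(static_cast<Code>(b), std::string(1, static_cast<char>(b)));

    std::string output;
    std::string previous;  // string decoded from the previous code
    for (Code code : codes) {
        std::string entry;
        if (string_dict.count(code)) {                       // branch A: known code
            entry = string_dict[code];
        } else if (!previous.empty() &&
                   code == static_cast<Code>(string_dict.size())) {  // branch B: rare case
            entry = previous + previous.front();
        } else {
            throw std::runtime_error("corrupted code stream");
        }
        output += entry;
        if (!previous.empty()) {
            // register the string that the compressor added at this point
            Code next = static_cast<Code>(string_dict.size());
            string_dict.emplace(next, previous + entry.front());
        }
        previous = entry;
    }
    return output;
}
```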
README.raw.md: 13 additions & 13 deletions
@@ -7,11 +7,11 @@
LZW is an archive format that utilizes the power of the LZW compression algorithm. The LZW compression algorithm is a dictionary-based lossless algorithm. It's an old algorithm, suitable for beginners to practice on.
- Internal algorithm processes byte data. So it's applicable to any file types, besides text file. Although it may not be able to achieve substantial compression rate for some file types that are already compressed efficiently, such as PDF files and MP4 files. It treats data as byte stream, unaware of the text-level pattern, which makes it less compression-efficient compared to other more advanced compression algorithm.
+ Internal algorithm processes byte data. So it's applicable to any file types, besides text file. Although it may not be able to achieve substantial compression rate for some file types that are already compressed efficiently, such as PDF files and MP4 files. It treats data as byte stream, unaware of the text-level pattern, which makes it less compression-efficient compared to other more advanced compression algorithms.
The LZW compression algorithm is dynamic. It doesn't collect data statistics beforehand. Instead, it learns the data pattern while performing the compression, building a code table on the fly. The compression ratio approaches its maximum after enough input has been consumed. The algorithmic complexity is strictly linear in the size of the input. [A more in-depth algorithmic analysis can be found in the following sections.](#algorithmic-analysis)
- An alternative implementation that utilizes more efficient self-made customized `bitmap`, `dict` and `set` to replace C++ builtin general-purpose `std::bitset`, `std::unordered_map` and `std::set` can be found in the branch `assignment`. Future enhancement includes customized `hash` functions to replace builtin general-purpose `std::hash`.
+ An alternative implementation that utilizes more efficient self-made customized `bitmap`, `dict` and `set` to replace C++ builtin general-purpose `std::bitset`, `std::unordered_map` and `std::set` can be found in the branch [`assignment`](https://github.com/MapleCCC/liblzw/tree/assignment). Future enhancement includes customized `hash` functions to replace builtin general-purpose `std::hash`.
## Installation
@@ -57,21 +57,21 @@ $ lzw decompress <ARCHIVE>
Contribution is welcome. When committing new code, make sure to apply the format specified in the `.clang-format` config file. Also remember to add `scripts/pre-commit.py` to `.git/hooks/pre-commit` as the pre-commit hook script.
The pre-commit hook script basically does two things:
1. Format staged C/C++ code
- 2. Transform LaTeX math equation in `README.raw.md` to image url in `README.md`
+ 2. Transform `LaTeX` math equation in `README.raw.md` to image url in `README.md`
- Besides relying on the pre-commit hook script, you can manually format code and transform math equations in README.md
+ Besides relying on the pre-commit hook script, you can manually format code and transform math equations in `README.md`.
```bash
- make reformat
- make transform-eqn
+ $ make reformat
+ $ make transform-eqn
```
The advantage of the pre-commit hook script, compared to manually triggering the scripts, is that it's convenient and non-disruptive, as it only introduces changes to staged files, not to all files in the repo.
@@ -126,19 +126,19 @@ $$
Suppose the source text byte length is $N$. Among the $N$ bytes consumed by the algorithm, there are $M$ bytes for which the algorithm goes to branch $A$, and it goes to branch $B$ for the other $N-M$ bytes.
- For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership\_check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
+ For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership-check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
The total cost model of the compression process can then be summarized as:
$$
- C_{\mathrm{total}} = N * (C(\mathrm{str.concatenate}) + C(\mathrm{dict.membership\_check})) \\
+ C_{\mathrm{total}} = N * (C(\mathrm{str.concatenate}) + C(\mathrm{dict.membership-check})) \\
+ M * C(\mathrm{str.copy}) + (N - M) * (C(\mathrm{dict.lookup}) + C(\mathrm{dict.add}) + C(\mathrm{str.copy}))
$$
For input data that doesn't have many repeated byte patterns, $M$ is small compared to $N$ (i.e. $M \ll N$). The cost model approximates to:
If the underlying data structure implementation of the code dict is a hash table, then $C(\mathrm{dict.membership\_check})$ and $C(\mathrm{dict.add})$ are both $O(1)$ operations. The total cost is then $O(N)$.
@@ -169,12 +169,12 @@ $$
Suppose the code stream length is $N$. Among the $N$ codes consumed by the algorithm, there are $M$ codes for which the algorithm goes to branch $A$, and it goes to branch $B$ for the other $N-M$ codes.
- For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership\_check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
+ For simplicity, we assume that $C(\mathrm{dict.lookup})$, $C(\mathrm{dict.add})$, $C(\mathrm{dict.membership-check})$, $C(\mathrm{str.concatenate})$, and $C(\mathrm{str.copy})$ are fixed costs that don't vary with string size. This assumption breaks down for large inputs, but such cases are rare, so it is a harmless simplification as long as the strings are of reasonable length.
The probability of going to branch $B$ is relatively rare, so the major cost comes from branch $A$. The total cost model for the decompression algorithm is then:
It's the same as that of the compression algorithm! The total cost model for the decompression algorithm turns out to be identical to that of the compression algorithm: they are both linear, $O(N)$ (under the precondition that the underlying implementations of the string dict and the code dict scale sublinearly).