Commit 9634037

Merge #14: Refactor: extract bencode tokenizer
ec6cc56 docs: update README (Jose Celano)
68d9915 refactor: rename json::BencodeParser to json::Generator (Jose Celano)
a3c7c4b refactor: remove parent parser mod (Jose Celano)
3052d6a refactor: rename BencodeTOkenizer to Tokenizer (Jose Celano)
331c76e refactor: reorganize modules (Jose Celano)
9e0db6c refactor: remove writer from tokenizer string parser (Jose Celano)
0a05544 refactor: remove old int and str parsers with writers (Jose Celano)
75ffdb4 refactor: remove writer from tokenizer integer parser (Jose Celano)
77ad5af refactor: remove writer from main tokenizer (Jose Celano)
f6a0584 refactor: duplicate integer and strig parser before removing writer (Jose Celano)
3a7ea5d refactor: extract mod tokenizer (Jose Celano)
63b9b73 refactor: extract struct BencodeTokenizer (Jose Celano)
83eeefd refactor: extract bencode tokenizer (Jose Celano)

Pull request description:

This refactoring changes the current implementation to extract the tokenizer. It splits the parser logic into two types:

- **Tokenizer**: It returns bencoded tokens.
- **Generator**: It iterates over bencoded tokens to generate the JSON.

**NOTES**

- It keeps the custom recursion (with an explicit stack) for the time being, instead of using explicit recursion like @da2ce7 did [here](#12 (comment)). I guess that could be changed later if we think it increases readability and maintainability.

**SUBTASKS**

- [x] Separate logic for tokenizer.
- [x] Extract tokenizer.
- [x] Remove `Writer` from the tokenizer. It's not needed.

**PERFORMANCE**

In the current version, bencoded strings are cached in memory before starting to write to the output (because we need the whole string to check whether it is valid UTF-8). In this PR, bencoded integers are also cached in memory because the whole integer value is a token. This should not be a problem since integers are short, unlike strings.

**FUTURE PRs**

We could:

- [ ] Implement the `Iterator` trait for the tokenizer.
- [ ] Use recursion for the generator like @da2ce7's proposal [here](#12).
- [ ] Implement another generator for TOML, for example. Check if this design can be easily extended to other output formats.

ACKs for top commit:

josecelano: ACK ec6cc56

Tree-SHA512: 9210211d802c8e19aef1f02f814b494c5919c7da81f299cf2c7f4d9fb12b4c63cbec4ac526996e6b1b3d69f75ca58894b9d64936bef2d9da851e70d51234c675
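The tokenizer/generator split described in the pull request can be illustrated with a minimal sketch. It is not the crate's code: `Token`, `Tokenizer`, `Frame`, and `generate_json` are hypothetical names, the input is a byte slice rather than a reader, integers are not validated (for example, leading zeros are accepted), and strings are written as plain, unescaped JSON strings instead of the crate's own output format. It only shows a tokenizer that returns whole integers and strings as tokens, and a generator that walks those tokens with an explicit stack.

```rust
/// Hypothetical token type; the real crate's tokens may differ.
#[derive(Debug, PartialEq)]
enum Token {
    Integer(String),
    Str(String),
    BeginList,
    BeginDict,
    End,
}

/// Hypothetical tokenizer over an in-memory byte slice.
struct Tokenizer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a [u8]) -> Self {
        Tokenizer { input, pos: 0 }
    }

    /// Returns the bytes up to (not including) `delim` and consumes the delimiter.
    fn take_until(&mut self, delim: u8) -> Result<&'a [u8], String> {
        let input = self.input;
        let rest = &input[self.pos..];
        let len = rest
            .iter()
            .position(|&b| b == delim)
            .ok_or("unexpected end of input")?;
        self.pos += len + 1;
        Ok(&rest[..len])
    }

    /// Returns the next token, or `None` at the end of the input.
    fn next_token(&mut self) -> Result<Option<Token>, String> {
        let input = self.input;
        let Some(&byte) = input.get(self.pos) else {
            return Ok(None);
        };
        let token = match byte {
            b'i' => {
                self.pos += 1;
                // The whole integer is held in memory: it is a single token.
                let digits = self.take_until(b'e')?;
                Token::Integer(String::from_utf8_lossy(digits).into_owned())
            }
            b'0'..=b'9' => {
                let len_bytes = self.take_until(b':')?;
                let len: usize = std::str::from_utf8(len_bytes)
                    .map_err(|e| e.to_string())?
                    .parse()
                    .map_err(|_| "invalid string length".to_string())?;
                let end = self.pos.saturating_add(len);
                let bytes = input
                    .get(self.pos..end)
                    .ok_or("unexpected end of input parsing string value")?;
                self.pos = end;
                Token::Str(String::from_utf8_lossy(bytes).into_owned())
            }
            b'l' => { self.pos += 1; Token::BeginList }
            b'd' => { self.pos += 1; Token::BeginDict }
            b'e' => { self.pos += 1; Token::End }
            other => return Err(format!("unrecognized first byte `{other}`")),
        };
        Ok(Some(token))
    }
}

/// One entry of the explicit stack: an open list or dictionary.
struct Frame {
    is_dict: bool,
    items: usize,
}

/// Hypothetical generator: iterates over tokens and builds the JSON text.
fn generate_json(input: &[u8]) -> Result<String, String> {
    let mut tokenizer = Tokenizer::new(input);
    let mut out = String::new();
    let mut stack: Vec<Frame> = Vec::new();
    while let Some(token) = tokenizer.next_token()? {
        if token == Token::End {
            let frame = stack.pop().ok_or("no matching start for list or dict end")?;
            out.push(if frame.is_dict { '}' } else { ']' });
            continue;
        }
        // Emit the separator required by the enclosing list or dictionary.
        if let Some(frame) = stack.last_mut() {
            if frame.is_dict && frame.items % 2 == 1 {
                out.push(':'); // a value follows its key
            } else if frame.items > 0 {
                out.push(',');
            }
            frame.items += 1;
        }
        match token {
            Token::Integer(value) => out.push_str(&value),
            Token::Str(value) => { out.push('"'); out.push_str(&value); out.push('"'); }
            Token::BeginList => { out.push('['); stack.push(Frame { is_dict: false, items: 0 }); }
            Token::BeginDict => { out.push('{'); stack.push(Frame { is_dict: true, items: 0 }); }
            Token::End => unreachable!("handled above"),
        }
    }
    if !stack.is_empty() {
        return Err("unexpected end of input inside list or dict".to_string());
    }
    Ok(out)
}
```

Under these assumptions, calling `generate_json(b"li1ei2ee")` walks five tokens (`BeginList`, two `Integer`s, `End`) and the stack never grows beyond the nesting depth of the input, which is the property the explicit-stack design is meant to keep.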
2 parents a2eb63c + ec6cc56 commit 9634037

17 files changed, +436 -567 lines changed

README.md

Lines changed: 7 additions & 35 deletions
@@ -65,12 +65,12 @@ Error: Unexpected end of input parsing integer; read context: input pos 3, lates

 ```console
 printf "3:ab" | cargo run
-Error: Unexpected end of input parsing string value; read context: input pos 4, latest input bytes dump: [51, 58, 97, 98] (UTF-8 string: `3:ab`); write context: output pos 0, latest output bytes dump: [] (UTF-8 string: ``)
+Error: Unexpected end of input parsing string value; read context: input pos 4, latest input bytes dump: [51, 58, 97, 98] (UTF-8 string: `3:ab`)
 ```

 ```console
 echo "i00e" | cargo run
-Error: Leading zeros in integers are not allowed, for example b'i00e'; read context: byte `48` (char: `0`), input pos 3, latest input bytes dump: [105, 48, 48] (UTF-8 string: `i00`); write context: byte `48` (char: `0`), output pos 2, latest output bytes dump: [48, 48] (UTF-8 string: `00`)
+Error: Leading zeros in integers are not allowed, for example b'i00e'; read context: byte `48` (char: `0`), input pos 3, latest input bytes dump: [105, 48, 48] (UTF-8 string: `i00`)
 ```

 Generating pretty JSON with [jq][jq]:
@@ -111,36 +111,10 @@ cargo add bencode2json

 There two ways of using the library:

-- With high-level parser wrappers.
-- With the low-level parsers.
+- With high-level wrappers.
+- With the low-level generators.

-Example using the high-level parser wrappers:
-
-```rust
-use bencode2json::{try_bencode_to_json};
-
-let result = try_bencode_to_json(b"d4:spam4:eggse").unwrap();
-
-assert_eq!(result, r#"{"<string>spam</string>":"<string>eggs</<string>string>"}"#);
-```
-
-Example using the low-level parser:
-
-```rust
-use bencode2json::parsers::{BencodeParser};
-
-let mut output = String::new();
-
-let mut parser = BencodeParser::new(&b"4:spam"[..]);
-
-parser
-    .write_str(&mut output)
-    .expect("Bencode to JSON conversion failed");
-
-println!("{output}"); // It prints the JSON string: "<string>spam</string>"
-```
-
-More [examples](./examples/).
+See [examples](./examples/).

 ## Test

@@ -167,21 +141,19 @@ cargo cov
 ## Performance

 In terms of memory usage this implementation consumes at least the size of the
-biggest bencoded string. The string parser keeps all the string bytes in memory until
-it parses the whole string, in order to convert it to UTF-8, when it's possible.
+biggest bencoded integer or string. The string and integer parsers keeps all the bytes in memory until
+it parses the whole value.

 The library also wraps the input and output streams in a [BufReader](https://doc.rust-lang.org/std/io/struct.BufReader.html)
 and [BufWriter](https://doc.rust-lang.org/std/io/struct.BufWriter.html) because it can be excessively inefficient to work directly with something that implements [Read](https://doc.rust-lang.org/std/io/trait.Read.html) or [Write](https://doc.rust-lang.org/std/io/trait.Write.html).

 ## TODO

-- [ ] More examples of using the library.
 - [ ] Counter for number of items in a list for debugging and errors.
 - [ ] Fuzz testing: Generate random valid bencoded values.
 - [ ] Install tracing crate. Add verbose mode that enables debugging.
 - [ ] Option to check if the final JSON it's valid at the end of the process.
 - [ ] Benchmarking for this implementation and the original C implementation.
-- [ ] Optimize string parser. We can stop trying to convert the string to UTF-8 when we find a non valid UTF-8 char.

 ## Alternatives

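The README's Performance section says the library wraps its input and output in `BufReader` and `BufWriter`. A minimal sketch of why that matters for a byte-oriented parser is shown here, using only the standard library. The function name `copy_byte_by_byte` is hypothetical and is not part of the crate; it only reproduces the access pattern (one byte read, one byte written) that would otherwise hit the underlying stream on every call.

```rust
use std::io::{self, BufReader, BufWriter, Read, Write};

/// Copies `input` to `output` one byte at a time and returns the byte count.
///
/// Hypothetical helper: it stands in for a parser that reads and writes
/// single bytes. The buffers turn those tiny operations into a few large
/// reads and writes on the wrapped streams.
fn copy_byte_by_byte<R: Read, W: Write>(input: R, output: W) -> io::Result<u64> {
    let reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);
    let mut count: u64 = 0;

    // `bytes()` yields one `io::Result<u8>` per byte, served from the buffer.
    for byte in reader.bytes() {
        writer.write_all(&[byte?])?;
        count += 1;
    }

    // Flush explicitly so write errors are reported instead of being lost on drop.
    writer.flush()?;
    Ok(count)
}
```

Any `Read` and `Write` pair can be passed in, for example `io::stdin()` and `io::stdout()`, or a byte slice and a `Vec<u8>` in tests.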
examples/parser_file_in_file_out.rs

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ use std::{
     io::{Read, Write},
 };

-use bencode2json::parsers::BencodeParser;
+use bencode2json::generators::json::Generator;
 use clap::{Arg, Command};

 fn main() {
@@ -61,7 +61,7 @@ fn main() {
         std::process::exit(1);
     };

-    if let Err(e) = BencodeParser::new(input).write_bytes(&mut output) {
+    if let Err(e) = Generator::new(input).write_bytes(&mut output) {
         eprintln!("Error: {e}");
         std::process::exit(1);
     }

examples/parser_stdin_stdout.rs

Lines changed: 2 additions & 2 deletions
@@ -7,13 +7,13 @@
 //! It prints "spam".
 use std::io;

-use bencode2json::parsers::BencodeParser;
+use bencode2json::generators::json::Generator;

 fn main() {
     let input = Box::new(io::stdin());
     let mut output = Box::new(io::stdout());

-    if let Err(e) = BencodeParser::new(input).write_bytes(&mut output) {
+    if let Err(e) = Generator::new(input).write_bytes(&mut output) {
         eprintln!("Error: {e}");
         std::process::exit(1);
     }

examples/parser_string_in_string_out.rs

Lines changed: 2 additions & 2 deletions
@@ -5,13 +5,13 @@
 //! ```
 //!
 //! It prints "spam".
-use bencode2json::parsers::BencodeParser;
+use bencode2json::generators::json::Generator;

 fn main() {
     let input = "4:spam".to_string();
     let mut output = String::new();

-    if let Err(e) = BencodeParser::new(input.as_bytes()).write_str(&mut output) {
+    if let Err(e) = Generator::new(input.as_bytes()).write_str(&mut output) {
         eprintln!("Error: {e}");
         std::process::exit(1);
     }

examples/parser_string_in_vec_out.rs

Lines changed: 2 additions & 2 deletions
@@ -5,13 +5,13 @@
 //! ```
 //!
 //! It prints "spam".
-use bencode2json::parsers::BencodeParser;
+use bencode2json::generators::json::Generator;

 fn main() {
     let input = "4:spam".to_string();
     let mut output = Vec::new();

-    if let Err(e) = BencodeParser::new(input.as_bytes()).write_bytes(&mut output) {
+    if let Err(e) = Generator::new(input.as_bytes()).write_bytes(&mut output) {
         eprintln!("Error: {e}");
         std::process::exit(1);
     }

examples/parser_vec_in_string_out.rs

Lines changed: 2 additions & 2 deletions
@@ -5,13 +5,13 @@
 //! ```
 //!
 //! It prints "spam".
-use bencode2json::parsers::BencodeParser;
+use bencode2json::generators::json::Generator;

 fn main() {
     let input = b"4:spam".to_vec();
     let mut output = String::new();

-    if let Err(e) = BencodeParser::new(&input[..]).write_str(&mut output) {
+    if let Err(e) = Generator::new(&input[..]).write_str(&mut output) {
         eprintln!("Error: {e}");
         std::process::exit(1);
     }

examples/parser_vec_in_vec_out.rs

Lines changed: 2 additions & 2 deletions
@@ -5,13 +5,13 @@
 //! ```
 //!
 //! It prints "spam".
-use bencode2json::parsers::BencodeParser;
+use bencode2json::generators::json::Generator;

 fn main() {
     let input = b"4:spam".to_vec();
     let mut output = Vec::new();

-    if let Err(e) = BencodeParser::new(&input[..]).write_bytes(&mut output) {
+    if let Err(e) = Generator::new(&input[..]).write_bytes(&mut output) {
         eprintln!("Error: {e}");
         std::process::exit(1);
     }
Lines changed: 21 additions & 21 deletions
@@ -9,7 +9,7 @@ use thiserror::Error;

 use crate::rw;

-use super::BencodeType;
+use super::generators::BencodeType;

 /// Errors that can occur while parsing a bencoded value.
 #[derive(Debug, Error)]
@@ -27,55 +27,55 @@ pub enum Error {
     /// The main parser peeks one byte ahead to know what kind of bencoded value
     /// is being parsed. If the byte read after peeking does not match the
     /// peeked byte, it means the input is being consumed somewhere else.
-    #[error("Read byte after peeking does match peeked byte; {0}; {1}")]
-    ReadByteAfterPeekingDoesMatchPeekedByte(ReadContext, WriteContext),
+    #[error("Read byte after peeking does match peeked byte; {0}")]
+    ReadByteAfterPeekingDoesMatchPeekedByte(ReadContext),

     /// Unrecognized first byte for new bencoded value.
     ///
     /// The main parser peeks one byte ahead to know what kind of bencoded value
     /// is being parsed. This error is raised when the peeked byte is not a
     /// valid first byte for a bencoded value.
-    #[error("Unrecognized first byte for new bencoded value; {0}; {1}")]
-    UnrecognizedFirstBencodeValueByte(ReadContext, WriteContext),
+    #[error("Unrecognized first byte for new bencoded value; {0}")]
+    UnrecognizedFirstBencodeValueByte(ReadContext),

     // Integers
     /// Unexpected byte parsing integer.
     ///
     /// The main parser parses integers by reading bytes until it finds the
     /// end of the integer. This error is raised when the byte read is not a
     /// valid byte for an integer bencoded value.
-    #[error("Unexpected byte parsing integer; {0}; {1}")]
-    UnexpectedByteParsingInteger(ReadContext, WriteContext),
+    #[error("Unexpected byte parsing integer; {0}")]
+    UnexpectedByteParsingInteger(ReadContext),

     /// Unexpected end of input parsing integer.
     ///
     /// The input ends before the integer ends.
-    #[error("Unexpected end of input parsing integer; {0}; {1}")]
-    UnexpectedEndOfInputParsingInteger(ReadContext, WriteContext),
+    #[error("Unexpected end of input parsing integer; {0}")]
+    UnexpectedEndOfInputParsingInteger(ReadContext),

     /// Leading zeros in integers are not allowed, for example b'i00e'.
-    #[error("Leading zeros in integers are not allowed, for example b'i00e'; {0}; {1}")]
-    LeadingZerosInIntegersNotAllowed(ReadContext, WriteContext),
+    #[error("Leading zeros in integers are not allowed, for example b'i00e'; {0}")]
+    LeadingZerosInIntegersNotAllowed(ReadContext),

     // Strings
     /// Invalid string length byte, expected a digit.
     ///
     /// The string parser found an invalid byte for the string length. The
     /// length can only be made of digits (0-9).
-    #[error("Invalid string length byte, expected a digit; {0}; {1}")]
-    InvalidStringLengthByte(ReadContext, WriteContext),
+    #[error("Invalid string length byte, expected a digit; {0}")]
+    InvalidStringLengthByte(ReadContext),

     /// Unexpected end of input parsing string length.
     ///
     /// The input ends before the string length ends.
-    #[error("Unexpected end of input parsing string length; {0}; {1}")]
-    UnexpectedEndOfInputParsingStringLength(ReadContext, WriteContext),
+    #[error("Unexpected end of input parsing string length; {0}")]
+    UnexpectedEndOfInputParsingStringLength(ReadContext),

     /// Unexpected end of input parsing string value.
     ///
     /// The input ends before the string value ends.
-    #[error("Unexpected end of input parsing string value; {0}; {1}")]
-    UnexpectedEndOfInputParsingStringValue(ReadContext, WriteContext),
+    #[error("Unexpected end of input parsing string value; {0}")]
+    UnexpectedEndOfInputParsingStringValue(ReadContext),

     // Lists
     /// Unexpected end of input parsing list. Expecting first list item or list end.
@@ -121,7 +121,7 @@ pub enum Error {
     NoMatchingStartForListOrDictEnd(ReadContext, WriteContext),
 }

-/// The reader context when the error ocurred.
+/// The reader context when the error occurred.
 #[derive(Debug)]
 pub struct ReadContext {
     /// The read byte that caused the error if any.
@@ -157,7 +157,7 @@ impl fmt::Display for ReadContext {
     }
 }

-/// The writer context when the error ocurred.
+/// The writer context when the error occurred.
 #[derive(Debug)]
 pub struct WriteContext {
     /// The written byte that caused the error if any.
@@ -197,7 +197,7 @@ impl fmt::Display for WriteContext {
 mod tests {

     mod for_read_context {
-        use crate::parsers::error::ReadContext;
+        use crate::error::ReadContext;

         #[test]
         fn it_should_display_the_read_context() {
@@ -237,7 +237,7 @@ mod tests {
     }

     mod for_write_context {
-        use crate::parsers::error::WriteContext;
+        use crate::error::WriteContext;

         #[test]
         fn it_should_display_the_read_context() {

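The error changes above drop the `WriteContext` from the tokenizer-level variants, so their messages end after the read context. The sketch below approximates that shape with the standard library only. It is an assumption-laden illustration: the crate derives `Display` with `thiserror`, while this version implements it by hand, and the `ReadContext` field names (`byte`, `pos`, `latest_bytes`) are hypothetical. The message layout is copied from the error output shown in the README diff.

```rust
use std::fmt;

/// Reader context attached to an error. Field names are hypothetical.
#[derive(Debug)]
struct ReadContext {
    /// The read byte that caused the error, if any.
    byte: Option<u8>,
    /// Position in the input when the error occurred.
    pos: u64,
    /// The latest bytes read from the input.
    latest_bytes: Vec<u8>,
}

impl fmt::Display for ReadContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "read context:")?;
        if let Some(byte) = self.byte {
            write!(f, " byte `{}` (char: `{}`),", byte, byte as char)?;
        }
        write!(
            f,
            " input pos {}, latest input bytes dump: {:?} (UTF-8 string: `{}`)",
            self.pos,
            self.latest_bytes,
            String::from_utf8_lossy(&self.latest_bytes)
        )
    }
}

/// Two of the variants from the diff, now carrying only a `ReadContext`.
#[derive(Debug)]
enum Error {
    LeadingZerosInIntegersNotAllowed(ReadContext),
    UnexpectedEndOfInputParsingStringValue(ReadContext),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::LeadingZerosInIntegersNotAllowed(ctx) => write!(
                f,
                "Leading zeros in integers are not allowed, for example b'i00e'; {ctx}"
            ),
            Error::UnexpectedEndOfInputParsingStringValue(ctx) => {
                write!(f, "Unexpected end of input parsing string value; {ctx}")
            }
        }
    }
}

impl std::error::Error for Error {}
```

With these definitions, formatting an `UnexpectedEndOfInputParsingStringValue` built from the bytes of `3:ab` at input position 4 produces the same text as the updated README example for `printf "3:ab" | cargo run`, without the trailing `write context` part.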