Commit 60516d6

Merge pull request #19 from jqnatividad/CSVWrangling-increased-accuracy
refactor: all accuracy benchmarks are now > 90%
2 parents 195c493 + 55fc319 commit 60516d6

3 files changed: +805 -86 lines changed

README.md

Lines changed: 13 additions & 13 deletions
@@ -164,11 +164,11 @@ The table below shows the dialect detection success ratio. Accuracy is measured

 | Data set | `csv-nose` | `CSVsniffer MADSE` | `CSVsniffer` | `CleverCSV` | `csv.Sniffer` | DuckDB `sniff_csv` |
 |:---------|:-----------|:-------------------|:-------------|:------------|:--------------|:-------------------|
-| POLLOCK | **96.62%** | 95.27% | 96.55% | 95.17% | 96.35% | 84.14% |
+| POLLOCK | **97.30%** | 95.27% | 96.55% | 95.17% | 96.35% | 84.14% |
 | W3C-CSVW[^2] | **99.55%** | 94.52% | 95.39% | 61.11% | 97.69% | 99.08% |
-| CSV Wrangling | **87.15%** | 90.50% | 89.94% | 87.99% | 84.26% | 91.62% |
-| CSV Wrangling CODEC | **86.62%** | 90.14% | 90.14% | 89.44% | 84.18% | 92.25% |
-| CSV Wrangling MESSY | **84.92%** | 89.60% | 89.60% | 89.60% | 83.06% | 91.94% |
+| CSV Wrangling | **92.74%** | 90.50% | 89.94% | 87.99% | 84.26% | 91.62% |
+| CSV Wrangling CODEC | **91.55%** | 90.14% | 90.14% | 89.44% | 84.18% | 92.25% |
+| CSV Wrangling MESSY | **90.48%** | 89.60% | 89.60% | 89.60% | 83.06% | 91.94% |

 [^2]: csv-nose is optimized for the [W3C CSV on the Web Test Suite](https://w3c.github.io/csvw/tests/) - reaching 99.55% accuracy.

@@ -192,25 +192,25 @@ The F1 score is the harmonic mean of precision and recall, providing a balanced

 | Data set | `csv-nose` | `CSVsniffer MADSE` | `CSVsniffer` | `CleverCSV` | `csv.Sniffer` | DuckDB `sniff_csv` |
 |:---------|:-----------|:-------------------|:-------------|:------------|:--------------|:-------------------|
-| POLLOCK | **0.966** | 0.976 | 0.972 | 0.965 | 0.943 | 0.904 |
+| POLLOCK | **0.973** | 0.976 | 0.972 | 0.965 | 0.943 | 0.904 |
 | W3C-CSVW | **0.995** | 0.967 | 0.967 | 0.748 | 0.730 | 0.986 |
-| CSV Wrangling | **0.872** | 0.950 | 0.945 | 0.935 | 0.724 | 0.956 |
-| CSV Wrangling CODEC | **0.866** | 0.948 | 0.948 | 0.944 | 0.728 | 0.959 |
-| CSV Wrangling MESSY | **0.849** | 0.943 | 0.943 | 0.943 | 0.705 | 0.956 |
+| CSV Wrangling | **0.927** | 0.950 | 0.945 | 0.935 | 0.724 | 0.956 |
+| CSV Wrangling CODEC | **0.916** | 0.948 | 0.948 | 0.944 | 0.728 | 0.959 |
+| CSV Wrangling MESSY | **0.905** | 0.943 | 0.943 | 0.943 | 0.705 | 0.956 |

 ### Component Accuracy

 csv-nose's delimiter and quote detection accuracy on each dataset:

 | Data set | Delimiter Accuracy | Quote Accuracy |
 |:---------|:-------------------|:---------------|
-| POLLOCK | 96.62% | 100.00% |
+| POLLOCK | 97.30% | 100.00% |
 | W3C-CSVW | 99.55% | 100.00% |
-| CSV Wrangling | 89.94% | 96.65% |
-| CSV Wrangling CODEC | 89.44% | 96.48% |
-| CSV Wrangling MESSY | 88.10% | 96.03% |
+| CSV Wrangling | 93.30% | 99.44% |
+| CSV Wrangling CODEC | 92.25% | 99.30% |
+| CSV Wrangling MESSY | 91.27% | 99.21% |

-> NOTE: See [PERFORMANCE.md](docs/PERFORMANCE.mdPERFORMANCE.md) for details on accuracy breakdowns and known limitations.
+> NOTE: See [PERFORMANCE.md](docs/PERFORMANCE.md) for details on accuracy breakdowns and known limitations.

 ### Benchmark Setup
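
The README hunks above report two metrics: accuracy as a success ratio and the F1 score as the harmonic mean of precision and recall. A minimal sketch of both calculations follows. The function names and the sample inputs are hypothetical and are not taken from the csv-nose code or from the benchmark tables; how the benchmark defines precision and recall per file is not shown in this diff.

```rust
/// F1 as the harmonic mean of precision and recall.
/// Returns 0.0 when both inputs are zero to avoid dividing by zero.
fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    }
}

/// Success ratio as a percentage: correctly detected files over all files.
fn accuracy_percent(correct: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        100.0 * correct as f64 / total as f64
    }
}

fn main() {
    // Hypothetical inputs, chosen only to exercise the formulas.
    println!("F1 = {:.3}", f1_score(0.9, 0.6));
    println!("accuracy = {:.2}%", accuracy_percent(9, 10));
}
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a tool with high precision but low recall scores well below the arithmetic average of the two.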

src/benchmark.rs

Lines changed: 4 additions & 1 deletion
@@ -261,7 +261,7 @@ fn parse_delimiter(name: &str) -> u8 {
         "space" => b' ',
         "vslash" | "pipe" => b'|',
         "colon" => b':',
-        "nsign" => 0xA7, // Section sign (§)
+        "nsign" => b'#', // Number sign (#)
         "slash" => b'/',
         _ => b',', // Default to comma
     }
@@ -435,6 +435,9 @@ mod tests {
         assert_eq!(parse_delimiter("space"), b' ');
         assert_eq!(parse_delimiter("vslash"), b'|');
         assert_eq!(parse_delimiter("colon"), b':');
+        // "nsign" is the CSVsniffer annotation name for number sign (#, 0x23).
+        // It must NOT map to 0xA7 (§, section sign), which is a different Unicode character.
+        assert_eq!(parse_delimiter("nsign"), b'#');
     }
 
     #[test]
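
The `nsign` change in `src/benchmark.rs` can be reproduced in isolation with the sketch below. It includes only the match arms visible in the hunk, so any arms that sit outside the diff context are an omission here, not a statement about the real function. The checks in `main` are illustrative additions: they show that `#` is the single ASCII byte 0x23, whereas `§` is code point U+00A7 and takes two bytes in UTF-8, so a lone 0xA7 byte does not stand for a complete character in UTF-8 input.

```rust
/// Maps a CSVsniffer annotation name to a delimiter byte.
/// Only the arms visible in the diff hunk are reproduced here.
fn parse_delimiter(name: &str) -> u8 {
    match name {
        "space" => b' ',
        "vslash" | "pipe" => b'|',
        "colon" => b':',
        "nsign" => b'#', // number sign (#, 0x23)
        "slash" => b'/',
        _ => b',', // default to comma
    }
}

fn main() {
    // The corrected mapping: "nsign" is the number sign, an ASCII byte.
    assert_eq!(parse_delimiter("nsign"), 0x23);
    assert_eq!('#' as u32, 0x23);

    // The old value 0xA7 is the code point of the section sign,
    // which needs two bytes when encoded as UTF-8.
    assert_eq!('§' as u32, 0xA7);
    assert_eq!("§".len(), 2);

    println!("nsign -> {:?}", parse_delimiter("nsign") as char);
}
```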
