@@ -164,11 +164,11 @@ The table below shows the dialect detection success ratio. Accuracy is measured

 | Data set | `csv-nose` | `CSVsniffer MADSE` | `CSVsniffer` | `CleverCSV` | `csv.Sniffer` | DuckDB `sniff_csv` |
 | :---------| :-----------| :-------------------| :-------------| :------------| :--------------| :-------------------|
-| POLLOCK | **96.62%** | 95.27% | 96.55% | 95.17% | 96.35% | 84.14% |
+| POLLOCK | **97.30%** | 95.27% | 96.55% | 95.17% | 96.35% | 84.14% |
 | W3C-CSVW[^2] | **99.55%** | 94.52% | 95.39% | 61.11% | 97.69% | 99.08% |
-| CSV Wrangling | **87.15%** | 90.50% | 89.94% | 87.99% | 84.26% | 91.62% |
-| CSV Wrangling CODEC | **86.62%** | 90.14% | 90.14% | 89.44% | 84.18% | 92.25% |
-| CSV Wrangling MESSY | **84.92%** | 89.60% | 89.60% | 89.60% | 83.06% | 91.94% |
+| CSV Wrangling | **92.74%** | 90.50% | 89.94% | 87.99% | 84.26% | 91.62% |
+| CSV Wrangling CODEC | **91.55%** | 90.14% | 90.14% | 89.44% | 84.18% | 92.25% |
+| CSV Wrangling MESSY | **90.48%** | 89.60% | 89.60% | 89.60% | 83.06% | 91.94% |

 [^2]: csv-nose is optimized for the [W3C CSV on the Web Test Suite](https://w3c.github.io/csvw/tests/) - reaching 99.55% accuracy.

@@ -192,25 +192,25 @@ The F1 score is the harmonic mean of precision and recall, providing a balanced

 | Data set | `csv-nose` | `CSVsniffer MADSE` | `CSVsniffer` | `CleverCSV` | `csv.Sniffer` | DuckDB `sniff_csv` |
 | :---------| :-----------| :-------------------| :-------------| :------------| :--------------| :-------------------|
-| POLLOCK | **0.966** | 0.976 | 0.972 | 0.965 | 0.943 | 0.904 |
+| POLLOCK | **0.973** | 0.976 | 0.972 | 0.965 | 0.943 | 0.904 |
 | W3C-CSVW | **0.995** | 0.967 | 0.967 | 0.748 | 0.730 | 0.986 |
-| CSV Wrangling | **0.872** | 0.950 | 0.945 | 0.935 | 0.724 | 0.956 |
-| CSV Wrangling CODEC | **0.866** | 0.948 | 0.948 | 0.944 | 0.728 | 0.959 |
-| CSV Wrangling MESSY | **0.849** | 0.943 | 0.943 | 0.943 | 0.705 | 0.956 |
+| CSV Wrangling | **0.927** | 0.950 | 0.945 | 0.935 | 0.724 | 0.956 |
+| CSV Wrangling CODEC | **0.916** | 0.948 | 0.948 | 0.944 | 0.728 | 0.959 |
+| CSV Wrangling MESSY | **0.905** | 0.943 | 0.943 | 0.943 | 0.705 | 0.956 |

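As a rough illustration of the definition quoted in the hunk header above (F1 as the harmonic mean of precision and recall), the calculation can be sketched in Python. This is only a minimal sketch: the function name `f1_score` and the input values are hypothetical, and how the benchmark itself counts precision and recall is not shown in this diff.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the benchmark results.
    print(round(f1_score(0.98, 0.97), 3))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a tool with high precision but low recall (or the reverse) still ends up with a low F1.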
 ### Component Accuracy

 csv-nose's delimiter and quote detection accuracy on each dataset:

 | Data set | Delimiter Accuracy | Quote Accuracy |
 | :---------| :-------------------| :---------------|
-| POLLOCK | 96.62% | 100.00% |
+| POLLOCK | 97.30% | 100.00% |
 | W3C-CSVW | 99.55% | 100.00% |
-| CSV Wrangling | 89.94% | 96.65% |
-| CSV Wrangling CODEC | 89.44% | 96.48% |
-| CSV Wrangling MESSY | 88.10% | 96.03% |
+| CSV Wrangling | 93.30% | 99.44% |
+| CSV Wrangling CODEC | 92.25% | 99.30% |
+| CSV Wrangling MESSY | 91.27% | 99.21% |

-> NOTE: See [PERFORMANCE.md](docs/PERFORMANCE.mdPERFORMANCE.md) for details on accuracy breakdowns and known limitations.
+> NOTE: See [PERFORMANCE.md](docs/PERFORMANCE.md) for details on accuracy breakdowns and known limitations.

 ### Benchmark Setup
