docs/blog/csv_base_jumping.md
25 additions & 5 deletions
@@ -314,17 +314,37 @@ Of course we are sampling from an untractable distribution that is far from unif
### Binary search
-Being able to safely jump through a CSV file means we can support approximate random access. We cannot jump exactly to the nth row of the file, but we can jump approximately near it.
+Being able to safely jump through CSV files means we have approximate random access. We cannot jump exactly to the nth row of the file, but we can jump near it, through byte offset arithmetic.
This ultimately means we can perform [binary search](https://en.wikipedia.org/wiki/Binary_search) on sorted CSV data in quasi-logarithmic time. Indeed, binary search can tolerate approximate jumps in the data, provided you are careful to uphold the search's invariants.
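To make the invariant concrete, here is a minimal Python sketch of a binary search driven by byte offsets. It is not how `xan bisect` is implemented: the helper names (`parse`, `row_at_or_after`, `bisect_rows`) are hypothetical, and the sketch assumes the file is sorted lexicographically on the searched column and that no quoted cell contains a raw newline, so that jumping to a byte and skipping to the next line break is enough to realign on a row.

```python
import csv
import io


def parse(line: bytes) -> list[str]:
    """Parse one CSV line (assumption: one row per physical line)."""
    return next(csv.reader([line.decode("utf-8")]))


def row_at_or_after(f, pos: int, data_start: int) -> tuple[int, bytes]:
    """Return (offset, line) for the first row starting at or after `pos`."""
    if pos <= data_start:
        f.seek(data_start)
    else:
        # Start one byte early: if that byte ends a row, the next row
        # starts exactly at `pos`; otherwise we skip the partial row.
        f.seek(pos - 1)
        f.readline()
    offset = f.tell()
    return offset, f.readline()


def bisect_rows(f, column: str, target: str) -> list[list[str]]:
    """Return all rows whose `column` equals `target` in a sorted CSV.

    `f` is a seekable binary file object. Keys are compared as plain
    strings, so the file must be sorted lexicographically (assumption).
    """
    f.seek(0)
    col = parse(f.readline()).index(column)
    data_start = f.tell()
    lo, hi = data_start, f.seek(0, io.SEEK_END)

    # Invariant: for every byte position p < lo, the first row at or
    # after p has a key < target; for p >= hi, it has a key >= target
    # (or there is no row left).
    while lo < hi:
        mid = (lo + hi) // 2
        offset, line = row_at_or_after(f, mid, data_start)
        if not line or parse(line)[col] >= target:
            hi = mid
        else:
            lo = offset + 1  # everything up to this row is too small

    # `lo` now points just before the first candidate row: scan forward.
    _, line = row_at_or_after(f, lo, data_start)
    rows = []
    while line:
        row = parse(line)
        if row[col] != target:
            break
        rows.append(row)
        line = f.readline()
    return rows
```

Each probe costs one seek and at most two line reads, and the loop halves a range of bytes rather than a range of rows, so the number of probes grows with the logarithm of the file size in bytes.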
-Sorted CSV data can therefore be seen as a read-only database index, in a sense.
+Sorted CSV data can therefore be seen as a read-only database index, if you squint hard enough.
-This is still experimental and will only be released in the near future but this is what the upcoming `xan bisect` command promises to do:
+This is what the `xan bisect` command (available since version `0.56.0`) does:
```bash
-# Could be used thusly: xan bisect <column> <value> file.csv
-xan bisect id 4534 sorted-by-id.csv
+# Searching for rows with specific id:
+xan bisect --search id 4534 sorted-by-id.csv
+
+# Enumerating all rows with a name starting with A:
+xan bisect name A sorted-by-name.csv | xan slice -E '!name.startswith("A")'
```
+Of course, since binary search is `O(log n)`, it is one order of magnitude faster than linear search using `xan search` or `xan filter`:
```bash
+# `sorted-people.csv` is a ~12M rows ~1GB CSV file stored on SSD
+
+# Searching for a specific row using linear search through the
+# `xan search` command.
+# To remain fair, we search for a row in the first 20% of the file and stop
+# as soon as the row is found (using the --limit 1 flag):
+time xan search --exact --limit 1 -s name Chloe sorted-people.csv
+0.143s
+
+# The same, but using `xan bisect`:
+time xan bisect --search name Chloe sorted-people.csv
```