Commit d33471b

Amending xan bisect section in blog post
1 parent db79120 commit d33471b

1 file changed

+25
-5
lines changed

docs/blog/csv_base_jumping.md

Lines changed: 25 additions & 5 deletions
@@ -314,17 +314,37 @@ Of course we are sampling from an untractable distribution that is far from unif
314314

315315
### Binary search
316316

317-
Being able to safely jump through a CSV file means we can support approximate random access. We cannot jump exactly to the nth row of the file, but we can jump approximately near it.
317+
Being able to safely jump through CSV files means we have approximate random access. We cannot jump exactly to the nth row of the file, but we can jump near it using byte offset arithmetic.
318318

319319
This ultimately means we can perform [binary search](https://en.wikipedia.org/wiki/Binary_search) on sorted CSV data in quasi-logarithmic time. Indeed, binary search can tolerate approximate jumps in the data, as long as you are careful to uphold the search's invariants.
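
To make the idea of binary search over approximate jumps concrete, here is a minimal Python sketch. It is not how `xan bisect` is implemented: the function names `parse_row` and `bisect_csv` are hypothetical, and it assumes that no quoted field contains a newline, so that "skip to the next newline" is enough to realign on a row boundary. It also assumes the file is sorted on the target column using plain string comparison.

```python
import csv
import os


def parse_row(line):
    """Parse one raw CSV line (bytes) into a list of fields."""
    return next(csv.reader([line.decode("utf-8")]))


def bisect_csv(path, column, target):
    """Return the first row whose `column` equals `target`, or None.

    Assumptions (not taken from xan): rows are sorted ascending on `column`
    by plain string comparison, and no field contains a newline.
    """
    with open(path, "rb") as f:
        header = parse_row(f.readline())
        col = header.index(column)

        # Invariants:
        # - `lo` is the start of a row (or EOF) and every row starting
        #   before `lo` has a key < target;
        # - every row starting strictly after `hi` has a key >= target.
        lo = f.tell()
        hi = os.path.getsize(path)

        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()  # approximate jump: drop the partial row
            row_start = f.tell()
            line = f.readline()

            if not line or row_start > hi:
                # No unexplored row starts after `mid`.
                hi = mid
            elif parse_row(line)[col] < target:
                # This row and everything before it is too small.
                lo = f.tell()
            else:
                # This row and everything after it is large enough.
                hi = mid

        # The first row with key >= target now starts at or just after `lo`.
        f.seek(lo)
        for line in iter(f.readline, b""):
            fields = parse_row(line)
            if fields[col] == target:
                return dict(zip(header, fields))
            if fields[col] > target:
                break
    return None
```

The point of the sketch is that the search never needs to know which row number it landed on. It only needs each probe to end on a valid row start, and to update `lo` and `hi` in a way that keeps both invariants true.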
320320

321-
Sorted CSV data can therefore be seen as a read-only database index, in a sense.
321+
Sorted CSV data can therefore be seen as a read-only database index, if you squint hard enough.
322322

323-
This is still experimental and will only be released in the near future but this is what the upcoming `xan bisect` command promises to do:
323+
This is what the `xan bisect` command (available since version `0.56.0`) does:
324324

325325
```bash
326-
# Could be used thusly: xan bisect <column> <value> file.csv
327-
xan bisect id 4534 sorted-by-id.csv
326+
# Searching for rows with specific id:
327+
xan bisect --search id 4534 sorted-by-id.csv
328+
329+
# Enumerating all rows with a name starting with A:
330+
xan bisect name A sorted-by-name.csv | xan slice -E '!name.startswith("A")'
331+
```
332+
333+
Of course, since binary search runs in `O(log n)` time, it is much faster than linear search using `xan search` or `xan filter`, by roughly an order of magnitude in the benchmark below:
334+
335+
```bash
336+
# `sorted-people.csv` is a ~12M-row, ~1GB CSV file stored on an SSD
337+
338+
# Searching for a specific row using linear search through the
339+
# `xan search` command.
340+
# To remain fair, we search for a row in the first 20% of the file and stop
341+
# as soon as the row is found (using the --limit 1 flag):
342+
time xan search --exact --limit 1 -s name Chloe sorted-people.csv
343+
0.143s
344+
345+
# The same, but using `xan bisect`:
346+
time xan bisect --search name Chloe sorted-people.csv
347+
0.017s
328348
```
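
As a rough sanity check on the benchmark above, the following Python sketch compares how many rows each strategy has to look at for a ~12M-row file. The numbers are back-of-the-envelope estimates under the assumption of an idealized binary search. They are not measurements of `xan`, and wall-clock times will presumably also include process startup and I/O costs.

```python
import math

rows = 12_000_000  # approximate row count of the benchmark file

# Worst-case number of halving steps for an idealized binary search.
bisect_probes = math.ceil(math.log2(rows))

# Rows a linear scan reads before reaching a match 20% into the file.
linear_rows = rows // 5

print(bisect_probes, linear_rows)
```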
329349

330350
## Caveat emptor
