To access the expression language's [cheatsheet](./docs/moonblade/cheatsheet.md), run `xan help cheatsheet`. To display the full list of available [functions](./docs/moonblade/functions.md), run `xan help functions`. Finally, to display the list of available [aggregation functions](./docs/moonblade/aggs.md), run `xan help aggs`.
## Learning
If you speak French, here is a quick rundown of the tool by our friends from [CERES](https://ceres.sorbonne-universite.fr/): [https://ceres.sorbonne-universite.fr/test_outil_xan/](https://ceres.sorbonne-universite.fr/test_outil_xan/)
*Documented use-cases*
* [Merging frequency tables, three ways](./docs/cookbook/frequency_tables.md)
* [Parsing and visualizing dates with xan](./docs/cookbook/dates.md)
* [Joining files by URL prefixes](./docs/cookbook/urls.md)
For a sense of what can be achieved with `xan`, see [PIPELINES](./docs/PIPELINES.md), a page summarizing a variety of complex but detailed pipelines that real people have used in real life to solve their problems with the tool.
## Available commands
- [**help**](./docs/cmd/help.md): Get help regarding the expression language
- [Comprehensive list of window aggregation functions](./docs/moonblade/window.md)
- [Scraping DSL](./docs/moonblade/scraping.md)
## News
For news about the tool's evolution, feel free to read:
* [Paginating urls to download](#paginating-urls-to-download)
* [Making sure a crawler was logged in by reading files in parallel](#making-sure-a-crawler-was-logged-in-by-reading-files-in-parallel)
* [Parsing logs using `xan separate`](#parsing-logs-using-xan-separate)
## Paginating urls to download
Let's say you want to download the latest 50 pages from [Hacker News](https://news.ycombinator.com). Fortunately, our [`minet`](https://github.com/medialab/minet) tool knows how to efficiently download a bunch of urls fed through a CSV file.
The idea here is to generate CSV data out of thin air and transform it into a url list to feed to the `minet fetch` command:
```bash
xan range --start 1 50 --inclusive | \
xan select --evaluate '"https://news.ycombinator.com/?p=" ++ n as url' | \
minet fetch url --input -
```
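As a quick cross-check, the same url list can be generated with nothing but coreutils. This is only an illustrative equivalent of the `xan range`/`xan select` pipeline above, not part of the original recipe:

```bash
# Illustrative equivalent of the xan pipeline above, using plain coreutils:
printf 'url\n'                      # CSV header
for p in $(seq 1 50); do            # pages 1 through 50, inclusive
  printf 'https://news.ycombinator.com/?p=%s\n' "$p"
done
```

Either way, the output is a one-column CSV (header `url` plus 50 rows) that can be piped to a downloader.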
## Making sure a crawler was logged in by reading files in parallel
Let's say one column of your CSV file contains paths to files, relative to some `downloaded` folder, and you want to make sure all of them contain some string (maybe you crawled some website and want to make sure you were correctly logged in by searching for an occurrence of your username):
```bash
xan progress files.csv | \
xan filter -p 'pathjoin("downloaded", path) | read | !contains(_, /yomguithereal/i)' > not-logged.csv
```
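If you only need a rough, xan-free cross-check of the same idea, `grep -L` lists the files that do *not* contain a match. The file names and contents below are made up for the sketch:

```bash
# Sketch: list downloaded files that do NOT mention the username
# (file names and contents are hypothetical).
mkdir -p downloaded
printf 'welcome back, yomguithereal\n' > downloaded/a.html
printf 'please log in\n' > downloaded/b.html
grep -L -i 'yomguithereal' downloaded/*.html
```

The `xan` version above additionally keeps the CSV rows (and their other columns) for the offending files, which plain `grep` cannot do.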
## Parsing logs using `xan separate`
33
+
34
+
<!-- show plots -->
```bash
xan from -f txt ~/Downloads/toflit18.log.gz | \
xan rename log | \
xan separate 0 -rc '- - \[([^\]]+)\] "([^"]+)" (\d+) \d+ "[^"]*" "([^"]+)"' \
  --keep --into datetime,http_call,http_status,user_agent | \
xan map -O 'datetime.datetime("%d/%b/%Y:%H:%M:%S %z") as datetime, http_call.split(" ")[1] as url' > toflit18-log.csv

xan search -s url -e / toflit18-log.csv.gz | xan plot -LT datetime --count
xan plot -LT datetime --count toflit18-log.csv.gz --ignore
```
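To sanity-check the extraction regex outside of `xan`, here is a `sed -E` equivalent applied to a fabricated log line (the sample line is an assumption modeled on the common access-log format, not taken from the original logs):

```bash
# Hypothetical access-log line, used only to exercise the regex above:
line='- - [20/Jan/2024:12:34:56 +0000] "GET /toflit18/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
# Capture datetime, request, status and user agent, joined with commas:
echo "$line" | \
  sed -E 's/- - \[([^]]+)\] "([^"]+)" ([0-9]+) [0-9]+ "[^"]*" "([^"]+)"/\1,\2,\3,\4/'
# → 20/Jan/2024:12:34:56 +0000,GET /toflit18/ HTTP/1.1,200,Mozilla/5.0
```

The second capture group still holds the full request line (`GET /toflit18/ HTTP/1.1`), which is why the pipeline above splits `http_call` on a space to recover the url.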
<!--

xan filter 'http_status == 200 && col("path", 1).endswith(".pdf")' report-files.csv | \
xan map -p 'col("path", 1) | pjoin("files", _) | fmt("pdftotext {} -", _) | shell(_).trim() as text' | \
xan select ndoc,uid,title,lastModified,link,text | \
xan rename -s lastModified last_modified > final.csv

xan parallel cat \
  --progress \
  -S media \
  -B -1 \
  -P '
    select -f scripts/harmonization.moonblade |
    map "date_published.ym().try() || `N/A` as month" |
    search -Bri -s headline,description,text
      --patterns scripts/climate_week/queries.csv
      --pattern-column pattern
      --name-column name |
    groupby month -C -5: "sum(_)" |
    sort -s month' \
  */articles.csv.gz | \
xan transform media '_.split("/")[0]' > $BASE_DIR/matches.csv

-->
```
time xan bisect --search name Chloe sorted-people.csv
0.017s
```
As an aside, this is very similar to what the [`look`](https://man7.org/linux/man-pages/man1/look.1.html) unix command does, but for lines instead of CSV data.
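For readers unfamiliar with `look`: it binary-searches a sorted file for lines beginning with a given prefix, which is exactly the trick exploited here. A rough (linear) illustration with `grep` on made-up data:

```bash
# Roughly what `look Chloe names.txt` returns, shown with a linear grep
# (look itself binary-searches, which is the whole point on large files).
# The names below are hypothetical:
printf 'Alice\nBob\nChloe Martin\nDavid\n' > names.txt   # already sorted
grep '^Chloe' names.txt
```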
## Caveat emptor
The technique demonstrated by this article is far from a silver bullet and suffers from some drawbacks. Here is an unabridged list of those drawbacks: