@@ -538,6 +538,18 @@ Displaying 1 col from 5 rows of <stdin>
 
 To access the expression language's [cheatsheet](./docs/moonblade/cheatsheet.md), run `xan help cheatsheet`. To display the full list of available [functions](./docs/moonblade/functions.md), run `xan help functions`. Finally, to display the list of available [aggregation functions](./docs/moonblade/aggs.md), run `xan help aggs`.
 
+## Learning
+
+If you speak French, here is a quick rundown of the tool by our friends from [CERES](https://ceres.sorbonne-universite.fr/): [https://ceres.sorbonne-universite.fr/test_outil_xan/](https://ceres.sorbonne-universite.fr/test_outil_xan/)
+
+*Documented use-cases*
+
+* [Merging frequency tables, three ways](./docs/cookbook/frequency_tables.md)
+* [Parsing and visualizing dates with xan](./docs/cookbook/dates.md)
+* [Joining files by URL prefixes](./docs/cookbook/urls.md)
+
+For a sense of what can be achieved with `xan`, see this page summarizing a variety of complex but detailed pipelines that have been used in real-life by real people to solve their problems, using the tool: [PIPELINES](./docs/PIPELINES.md).
+
 ## Available commands
 
 - [**help**](./docs/cmd/help.md): Get help regarding the expression language
@@ -738,13 +750,6 @@ They also respect typical environment variables related to ANSI colouring, such
 - [Comprehensive list of window aggregation functions](./docs/moonblade/window.md)
 - [Scraping DSL](./docs/moonblade/scraping.md)
 
-## Cookbook
-
-* [Merging frequency tables, three ways](./docs/cookbook/frequency_tables.md)
-* [Parsing and visualizing dates with xan](./docs/cookbook/dates.md)
-* [Joining files by URL prefixes](./docs/cookbook/urls.md)
-* [Miscellaneous](./docs/cookbook/misc.md)
-
 ## News
 
 For news about the tool's evolutions feel free to read:
+* [Paginating urls to download](#paginating-urls-to-download)
+* [Making sure a crawler was logged in by reading files in parallel](#making-sure-a-crawler-was-logged-in-by-reading-files-in-parallel)
+* [Parsing logs using `xan separate`](#parsing-logs-using-xan-separate)
+
+## Paginating urls to download
+
+Let's say you want to download the latest 50 pages from [Hacker News](https://news.ycombinator.com). Fortunately our [`minet`](https://github.com/medialab/minet) tool knows how to efficiently download a bunch of urls fed through a CSV file.
+
+The idea here is to generate CSV data out of thin air and to transform it into an url list to be fed to the `minet fetch` command:
+
+```bash
+xan range --start 1 50 --inclusive | \
+xan select --evaluate '"https://news.ycombinator.com/?p=" ++ n as url' | \
+minet fetch url --input -
+```
+
+## Making sure a crawler was logged in by reading files in parallel
+
+Let's say one column of your CSV file is containing paths to files, relative to some `downloaded` folder, and you want to make sure all of them contain some string (maybe you crawled some website and want to make sure you were correctly logged in by searching for some occurrence of your username):
+
+```bash
+xan progress files.csv | \
+xan filter -p 'pathjoin("downloaded", path) | read | !contains(_, /yomguithereal/i)' > not-logged.csv
+```
+
+## Parsing logs using `xan separate`
+
+<!-- show plots -->
+
+```bash
+xan from -f txt ~/Downloads/toflit18.log.gz | xan rename log | xan separate 0 -rc '- - \[([^\]]+)\] "([^"]+)" (\d+) \d+ "[^"]*" "([^"]+)"' --keep --into datetime,http_call,http_status,user_agent | xan map -O 'datetime.datetime("%d/%b/%Y:%H:%M:%S %z") as datetime, http_call.split(" ")[1] as url' > toflit18-log.csv
+
+xan search -s url -e / toflit18-log.csv.gz | xan plot -LT datetime --count
+xan plot -LT datetime --count toflit18-log.csv.gz --ignore