To access the expression language's [cheatsheet](./docs/moonblade/cheatsheet.md), run `xan help cheatsheet`. To display the full list of available [functions](./docs/moonblade/functions.md), run `xan help functions`. Finally, to display the list of available [aggregation functions](./docs/moonblade/aggs.md), run `xan help aggs`.
## Learning
If you speak French, here is a quick rundown of the tool by our friends from [CERES](https://ceres.sorbonne-universite.fr/): [https://ceres.sorbonne-universite.fr/test_outil_xan/](https://ceres.sorbonne-universite.fr/test_outil_xan/)
*Documented use-cases*
* [Merging frequency tables, three ways](./docs/cookbook/frequency_tables.md)
* [Parsing and visualizing dates with xan](./docs/cookbook/dates.md)
* [Joining files by URL prefixes](./docs/cookbook/urls.md)
For a sense of what can be achieved with `xan`, see [PIPELINES](./docs/PIPELINES.md), a page summarizing a variety of complex but detailed pipelines that real people have used in real life to solve their problems with the tool.
## Available commands
- [**help**](./docs/cmd/help.md): Get help regarding the expression language
- [Comprehensive list of window aggregation functions](./docs/moonblade/window.md)
- [Scraping DSL](./docs/moonblade/scraping.md)
## News
For news about the tool's evolution, feel free to read:
* [Paginating urls to download](#paginating-urls-to-download)
* [Making sure a crawler was logged in by reading files in parallel](#making-sure-a-crawler-was-logged-in-by-reading-files-in-parallel)
* [Parsing logs using `xan separate`](#parsing-logs-using-xan-separate)
## Paginating urls to download
Let's say you want to download the latest 50 pages from [Hacker News](https://news.ycombinator.com). Fortunately, our [`minet`](https://github.com/medialab/minet) tool knows how to efficiently download a bunch of urls fed through a CSV file.
The idea here is to generate CSV data out of thin air and transform it into a url list to feed to the `minet fetch` command:
```bash
xan range --start 1 50 --inclusive | \
xan select --evaluate '"https://news.ycombinator.com/?p=" ++ n as url' | \
minet fetch url --input -
```
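As a quick cross-check, the same url list can be generated with nothing but coreutils. This is only an illustrative equivalent of the `xan range`/`xan select` pipeline above, not part of the original recipe:

```bash
# Illustrative equivalent of the xan pipeline above, using plain coreutils:
printf 'url\n'                      # CSV header
for p in $(seq 1 50); do            # pages 1 through 50, inclusive
  printf 'https://news.ycombinator.com/?p=%s\n' "$p"
done
```

Either way, the output is a one-column CSV (header `url` plus 50 rows) that can be piped to a downloader.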
## Making sure a crawler was logged in by reading files in parallel
Let's say one column of your CSV file contains paths to files, relative to some `downloaded` folder, and you want to make sure all of them contain some string (maybe you crawled some website and want to make sure you were correctly logged in by searching for an occurrence of your username):
```bash
xan progress files.csv | \
xan filter -p 'pathjoin("downloaded", path) | read | !contains(_, /yomguithereal/i)' > not-logged.csv
```
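If you only need a rough, xan-free cross-check of the same idea, `grep -L` lists the files that do *not* contain a match. The file names and contents below are made up for the sketch:

```bash
# Sketch: list downloaded files that do NOT mention the username
# (file names and contents are hypothetical).
mkdir -p downloaded
printf 'welcome back, yomguithereal\n' > downloaded/a.html
printf 'please log in\n' > downloaded/b.html
grep -L -i 'yomguithereal' downloaded/*.html
```

The `xan` version above additionally keeps the CSV rows (and their other columns) for the offending files, which plain `grep` cannot do.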
## Parsing logs using `xan separate`
33
+
34
+
<!-- show plots -->
```bash
xan from -f txt ~/Downloads/toflit18.log.gz | \
xan rename log | \
xan separate 0 -rc '- - \[([^\]]+)\] "([^"]+)" (\d+) \d+ "[^"]*" "([^"]+)"' \
  --keep --into datetime,http_call,http_status,user_agent | \
xan map -O 'datetime.datetime("%d/%b/%Y:%H:%M:%S %z") as datetime, http_call.split(" ")[1] as url' > toflit18-log.csv

xan search -s url -e / toflit18-log.csv.gz | xan plot -LT datetime --count
xan plot -LT datetime --count toflit18-log.csv.gz --ignore
```
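To sanity-check the extraction regex outside of `xan`, here is a `sed -E` equivalent applied to a fabricated log line (the sample line is an assumption modeled on the common access-log format, not taken from the original logs):

```bash
# Hypothetical access-log line, used only to exercise the regex above:
line='- - [20/Jan/2024:12:34:56 +0000] "GET /toflit18/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
# Capture datetime, request, status and user agent, joined with commas:
echo "$line" | \
  sed -E 's/- - \[([^]]+)\] "([^"]+)" ([0-9]+) [0-9]+ "[^"]*" "([^"]+)"/\1,\2,\3,\4/'
# → 20/Jan/2024:12:34:56 +0000,GET /toflit18/ HTTP/1.1,200,Mozilla/5.0
```

The second capture group still holds the full request line (`GET /toflit18/ HTTP/1.1`), which is why the pipeline above splits `http_call` on a space to recover the url.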
<!--

xan filter 'http_status == 200 && col("path", 1).endswith(".pdf")' report-files.csv | \
xan map -p 'col("path", 1) | pjoin("files", _) | fmt("pdftotext {} -", _) | shell(_).trim() as text' | \
xan select ndoc,uid,title,lastModified,link,text | \
xan rename -s lastModified last_modified > final.csv

xan parallel cat \
  --progress \
  -S media \
  -B -1 \
  -P '
    select -f scripts/harmonization.moonblade |
    map "date_published.ym().try() || `N/A` as month" |
    search -Bri -s headline,description,text
      --patterns scripts/climate_week/queries.csv
      --pattern-column pattern
      --name-column name |
    groupby month -C -5: "sum(_)" |
    sort -s month' \
  */articles.csv.gz | \
xan transform media '_.split("/")[0]' > $BASE_DIR/matches.csv

-->
```
time xan bisect --search name Chloe sorted-people.csv
0.017s
```
As an aside, this is very similar to what the [`look`](https://man7.org/linux/man-pages/man1/look.1.html) unix command does, but for lines instead of CSV data.
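For readers unfamiliar with `look`: it binary-searches a sorted file for lines beginning with a given prefix, which is exactly the trick exploited here. A rough (linear) illustration with `grep` on made-up data:

```bash
# Roughly what `look Chloe names.txt` returns, shown with a linear grep
# (look itself binary-searches, which is the whole point on large files).
# The names below are hypothetical:
printf 'Alice\nBob\nChloe Martin\nDavid\n' > names.txt   # already sorted
grep '^Chloe' names.txt
```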
## Caveat emptor
The technique demonstrated by this article is far from a silver bullet and suffers from some drawbacks. Here is an unabridged list of those drawbacks: