Commit 8ab64b8

parallel cats
1 parent c892312 commit 8ab64b8

File tree

1 file changed (+103 −1 lines)

docs/PIPELINES.md

Lines changed: 103 additions & 1 deletion
@@ -8,6 +8,7 @@ Curated collection of unhinged `xan` pipelines.
* [Making sure a crawler was logged in by reading files in parallel](#making-sure-a-crawler-was-logged-in-by-reading-files-in-parallel)
* [Parsing logs using `xan separate`](#parsing-logs-using-xan-separate)
* [Running subprocesses to extract raw text from PDF files](#running-subprocesses-to-extract-raw-text-from-pdf-files)
* [Matching multiple queries in a press articles corpus, in parallel](#matching-multiple-queries-in-a-press-articles-corpus-in-parallel)

## Paginating urls to download

@@ -33,7 +34,7 @@ The `xan range` command produces a CSV looking like this:
| ... |

Then the `xan select --evaluate` part uses the following expression to transform the file on the fly:

```python
# We append the content of the "n" column to the given url
"https://news.ycombinator.com/?p=" ++ n as url
```
@@ -138,6 +139,107 @@ We need to use `col("path", 1)` in our expressions because of course there are t

We also use the `xan rename` command at the end, because mixing camelCase and snake_case is an unforgivable fashion *faux-pas*.

## Matching multiple queries in a press articles corpus, in parallel

We have a corpus of several GBs of CSV files containing press articles from various French media outlets.

We need to match a bunch of regex patterns in each article to plot time series of the relevance of climate change-related concepts across time.

Here is our `queries.csv` file:

| name                 | pattern                                                      |
| -------------------- | ------------------------------------------------------------ |
| query_climatique     | \bclimatique                                                 |
| query_effet_de_serre | effet\s+de\s+serre\|couche\s+d[’']ozone                      |
| query_biodiversite   | \bbiodiversit[ée]                                            |
| query_transition     | transitions?\s+(?:[ée]cologique\|[ée]n[ée]rg[ée]tique)       |
| query_durable        | d[ée]veloppement\s+durable\|[ée]n[ée]rgies?\s+renouvelables? |
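
Since the `\|` in the table cells is only the markdown-escaped alternation pipe, the patterns can be sanity-checked directly, for instance in Python (the sample headlines below are made up for illustration):

```python
import re

# The patterns from queries.csv, with the markdown `\|` unescaped back to `|`
patterns = {
    "query_climatique": r"\bclimatique",
    "query_effet_de_serre": r"effet\s+de\s+serre|couche\s+d[’']ozone",
    "query_biodiversite": r"\bbiodiversit[ée]",
    "query_transition": r"transitions?\s+(?:[ée]cologique|[ée]n[ée]rg[ée]tique)",
    "query_durable": r"d[ée]veloppement\s+durable|[ée]n[ée]rgies?\s+renouvelables?",
}

# Made-up headlines, matched case-insensitively like the pipeline does
samples = {
    "query_climatique": "Réchauffement climatique : un nouveau rapport",
    "query_effet_de_serre": "Les gaz à effet de serre en hausse",
    "query_transition": "Financer la transition écologique",
}
for name, text in samples.items():
    assert re.search(patterns[name], text, re.IGNORECASE)
```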
Here is our `xan` pipeline:

```bash
xan parallel cat \
  --progress \
  --source-column media \
  --buffer-size -1 \
  --preprocess '
    map "date_published.ym().try() || `N/A` as month" |
    search --breakdown --regex --ignore-case -s headline,description,text
      --patterns queries.csv
      --pattern-column pattern
      --name-column name |
    groupby month --along-columns "query_*" "sum(_)" |
    sort -s month' \
  */articles.csv.gz | \
xan transform media '_.split("/")[0]' > $BASE_DIR/matches.csv
```

*Regarding `parallel cat`*

`xan parallel cat` consumes a bunch of CSV files (here everything matching `*/articles.csv.gz`), applies some preprocessing to each file (given through the `--preprocess` flag here) and redirects everything to the standard output.

<p align="center">
  <img src="https://i.redd.it/io23lob82pp61.jpg" alt="parallel cats" width="250px" />
</p>

`--progress` displays a progress bar, while `--source-column` adds a new column to the output recording which file each row came from (here each CSV file is in fact the collection of all articles from one media outlet, so it is important for us to know which outlet each resulting row belongs to).

When running a `xan parallel cat` command, output rows are flushed to stdout regularly to avoid exhausting memory. This means, however, that the command must lock access to stdout while serializing results, to avoid race conditions between threads. This ultimately means that output rows might come in arbitrary order. Here, because we are using `xan search --breakdown`, we know beforehand that each media outlet will only produce one row per month in the output. We can therefore afford to hold all of a media outlet's breakdown rows before flushing them, ensuring the output order is consistent (i.e. resulting rows of different media outlets are not interleaved). This is what the `--buffer-size -1` flag does.
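
To see why buffering matters, here is a toy Python model (hypothetical code, not xan's actual implementation): without buffering, rows coming from different files can interleave; when a worker holds all of a file's rows and writes them under the lock in one go, they stay contiguous.

```python
import io
import threading

def parallel_cat(files, out, buffer_whole_file):
    """Toy model of parallel workers writing rows to a shared output.

    `files` is a list of (source_name, rows) pairs; a lock guards `out`.
    """
    lock = threading.Lock()

    def worker(name, rows):
        if buffer_whole_file:
            # Like --buffer-size -1: hold every row of the file, flush once
            chunk = "".join(f"{name},{row}\n" for row in rows)
            with lock:
                out.write(chunk)
        else:
            # Default behaviour: flush rows regularly, so rows from
            # different files may end up interleaved in the output
            for row in rows:
                with lock:
                    out.write(f"{name},{row}\n")

    threads = [threading.Thread(target=worker, args=f) for f in files]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

out = io.StringIO()
parallel_cat([("a", ["1", "2"]), ("b", ["1", "2"])], out, buffer_whole_file=True)
lines = out.getvalue().splitlines()
# With buffering, each file's rows are guaranteed to be contiguous in `lines`
```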

*Regarding the preprocessing*

Here is the preprocessing (the leading `xan` can be omitted in commands fed to `--preprocess`):

```bash
map "date_published.ym().try() || `N/A` as month" |
search --breakdown --regex --ignore-case -s headline,description,text
  --patterns queries.csv
  --pattern-column pattern
  --name-column name |
groupby month --along-columns "query_*" "sum(_)" |
sort -s month
```

First we create a column giving the month of each article's publication date, because we are going to aggregate search results on it. For instance `2023-01-01T02:45:07+01:00` will become `2023-01`.
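
In Python terms, the `map` expression does something along these lines (a sketch of the behaviour, not xan's actual code):

```python
from datetime import datetime

def ym(iso_datetime, fallback="N/A"):
    """Keep only the year and month of an ISO datetime, mimicking xan's
    `.ym()` plus the `.try() || `N/A`` fallback for unparsable dates."""
    try:
        return datetime.fromisoformat(iso_datetime).strftime("%Y-%m")
    except ValueError:
        return fallback

ym("2023-01-01T02:45:07+01:00")  # "2023-01"
```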

Then we apply the search, feeding it the patterns from `queries.csv` through the `--patterns` flag. `--pattern-column` tells the command which column of `queries.csv` contains the actual regex pattern, while `--name-column` indicates the associated name that the `--breakdown` flag will use to produce the output columns.

Now let's consider the following file:

| group | text                   |
| ----- | ---------------------- |
| one   | the cat eats the mouse |
| one   | the sun is shining     |
| two   | a cat is nice          |

Using `search --breakdown` on it with patterns `the` and `cat` will produce the following result:

| group | text                   | the | cat |
| ----- | ---------------------- | --- | --- |
| one   | the cat eats the mouse | 2   | 1   |
| one   | the sun is shining     | 1   | 0   |
| two   | a cat is nice          | 0   | 1   |

We add one column per pattern and tally its number of occurrences.
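
Conceptually, `--breakdown` computes something like the following Python sketch (hypothetical code, not xan's implementation):

```python
import re

# One count column per pattern, appended to each row
patterns = {"the": "the", "cat": "cat"}
rows = [
    {"group": "one", "text": "the cat eats the mouse"},
    {"group": "one", "text": "the sun is shining"},
    {"group": "two", "text": "a cat is nice"},
]
for row in rows:
    for name, pattern in patterns.items():
        row[name] = len(re.findall(pattern, row["text"]))
# rows[0] == {"group": "one", "text": "the cat eats the mouse", "the": 2, "cat": 1}
```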

Now the `groupby --along-columns` part lets us run the same aggregation over a selection of columns. So the following command, on our previous example:

```bash
groupby group --along-columns the,cat 'sum(_)'
```

would produce the following result:

| group | the | cat |
| ----- | --- | --- |
| one   | 3   | 1   |
| two   | 0   | 1   |
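
That aggregation boils down to a per-group sum over each selected column, which can be sketched in Python like so:

```python
from collections import defaultdict

# Toy version of `groupby group --along-columns the,cat "sum(_)"`: the same
# aggregation (a sum) is run over each selected column, per group
rows = [
    {"group": "one", "the": 2, "cat": 1},
    {"group": "one", "the": 1, "cat": 0},
    {"group": "two", "the": 0, "cat": 1},
]
along_columns = ("the", "cat")
sums = defaultdict(lambda: dict.fromkeys(along_columns, 0))
for row in rows:
    for column in along_columns:
        sums[row["group"]][column] += row[column]
# sums == {"one": {"the": 3, "cat": 1}, "two": {"the": 0, "cat": 1}}
```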

Finally we use the `sort` command to make sure rows are sorted by month, and that's it (lol).

*Regarding the final transformation*

The last `xan transform` invocation turns a file path into a proper media name. For instance `lemonde/articles.csv.gz` will become `lemonde`.
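
The `_.split("/")[0]` expression behaves like its Python namesake:

```python
# Keep only the first path segment, i.e. the media outlet's directory name
path = "lemonde/articles.csv.gz"
media = path.split("/")[0]  # "lemonde"
```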
<!--