Commit 8ab64b8

parallel cats
1 parent c892312 commit 8ab64b8

File tree

1 file changed (+103 −1 lines)

docs/PIPELINES.md

Lines changed: 103 additions & 1 deletion
@@ -8,6 +8,7 @@ Curated collection of unhinged `xan` pipelines.
* [Making sure a crawler was logged in by reading files in parallel](#making-sure-a-crawler-was-logged-in-by-reading-files-in-parallel)
* [Parsing logs using `xan separate`](#parsing-logs-using-xan-separate)
* [Running subprocesses to extract raw text from PDF files](#running-subprocesses-to-extract-raw-text-from-pdf-files)
* [Matching multiple queries in a press articles corpus, in parallel](#matching-multiple-queries-in-a-press-articles-corpus-in-parallel)

## Paginating urls to download

@@ -33,7 +34,7 @@ The `xan range` command produces a CSV looking like this:
| ... |

Then the `xan select --evaluate` part uses the following expression to transform the file on the fly:

```python
# We append the content of the "n" column to the given url
"https://news.ycombinator.com/?p=" ++ n as url
```
@@ -138,6 +139,107 @@ We need to use `col("path", 1)` in our expressions because of course there are t

We also use the `xan rename` command at the end, because mixing camelCase and snake_case is an unforgivable fashion *faux-pas*.

## Matching multiple queries in a press articles corpus, in parallel

We have a corpus of several GBs of CSV files containing press articles from various French media outlets.

We need to match a bunch of regex patterns in each article to plot time series of the relevance of climate change-related concepts across time.

Here is our `queries.csv` file:

| name                 | pattern                                                      |
| -------------------- | ------------------------------------------------------------ |
| query_climatique     | \bclimatique                                                 |
| query_effet_de_serre | effet\s+de\s+serre\|couche\s+d[’']ozone                      |
| query_biodiversite   | \bbiodiversit[ée]                                            |
| query_transition     | transitions?\s+(?:[ée]cologique\|[ée]n[ée]rg[ée]tique)       |
| query_durable        | d[ée]veloppement\s+durable\|[ée]n[ée]rgies?\s+renouvelables? |
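
Since the `\|` in the table cells is only the markdown-escaped alternation pipe, the patterns can be sanity-checked directly, for instance in Python (the sample headlines below are made up for illustration):

```python
import re

# The patterns from queries.csv, with the markdown `\|` unescaped back to `|`
patterns = {
    "query_climatique": r"\bclimatique",
    "query_effet_de_serre": r"effet\s+de\s+serre|couche\s+d[’']ozone",
    "query_biodiversite": r"\bbiodiversit[ée]",
    "query_transition": r"transitions?\s+(?:[ée]cologique|[ée]n[ée]rg[ée]tique)",
    "query_durable": r"d[ée]veloppement\s+durable|[ée]n[ée]rgies?\s+renouvelables?",
}

# Made-up headlines, matched case-insensitively like the pipeline does
samples = {
    "query_climatique": "Réchauffement climatique : un nouveau rapport",
    "query_effet_de_serre": "Les gaz à effet de serre en hausse",
    "query_transition": "Financer la transition écologique",
}
for name, text in samples.items():
    assert re.search(patterns[name], text, re.IGNORECASE)
```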
Here is our `xan` pipeline:

```bash
xan parallel cat \
  --progress \
  --source-column media \
  --buffer-size -1 \
  --preprocess '
    map "date_published.ym().try() || `N/A` as month" |
    search --breakdown --regex --ignore-case -s headline,description,text
      --patterns queries.csv
      --pattern-column pattern
      --name-column name |
    groupby month --along-columns "query_*" "sum(_)" |
    sort -s month' \
  */articles.csv.gz | \
xan transform media '_.split("/")[0]' > $BASE_DIR/matches.csv
```

*Regarding `parallel cat`*

`xan parallel cat` consumes a bunch of CSV files (here everything matching `*/articles.csv.gz`), applies some preprocessing to each file (given through the `--preprocess` flag here) and redirects everything to the standard output.

<p align="center">
  <img src="https://i.redd.it/io23lob82pp61.jpg" alt="parallel cats" width="250px" />
</p>

`--progress` displays a progress bar, while `--source-column` adds a new column to the output recording which file each row came from (here each CSV file is in fact the collection of all articles from one media outlet, so it is important for us to know which outlet each resulting row belongs to).

When running a `xan parallel cat` command, output rows are flushed to stdout regularly to avoid exhausting memory. This means, however, that the command must lock access to stdout while serializing results, to avoid race conditions between threads. This ultimately means that output rows might come in arbitrary order. Here, because we are using `xan search --breakdown`, we know beforehand that each media outlet will only produce one row per month in the output. We can therefore afford to hold all of a media outlet's breakdown rows before flushing them, ensuring the output order is consistent (i.e. resulting rows of different media outlets are not interleaved). This is what the `--buffer-size -1` flag does.
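
To see why buffering matters, here is a toy Python model (hypothetical code, not xan's actual implementation): without buffering, rows coming from different files can interleave; when a worker holds all of a file's rows and writes them under the lock in one go, they stay contiguous.

```python
import io
import threading

def parallel_cat(files, out, buffer_whole_file):
    """Toy model of parallel workers writing rows to a shared output.

    `files` is a list of (source_name, rows) pairs; a lock guards `out`.
    """
    lock = threading.Lock()

    def worker(name, rows):
        if buffer_whole_file:
            # Like --buffer-size -1: hold every row of the file, flush once
            chunk = "".join(f"{name},{row}\n" for row in rows)
            with lock:
                out.write(chunk)
        else:
            # Default behaviour: flush rows regularly, so rows from
            # different files may end up interleaved in the output
            for row in rows:
                with lock:
                    out.write(f"{name},{row}\n")

    threads = [threading.Thread(target=worker, args=f) for f in files]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

out = io.StringIO()
parallel_cat([("a", ["1", "2"]), ("b", ["1", "2"])], out, buffer_whole_file=True)
lines = out.getvalue().splitlines()
# With buffering, each file's rows are guaranteed to be contiguous in `lines`
```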

*Regarding the preprocessing*

Here is the preprocessing (the leading `xan` can be omitted in commands fed to `--preprocess`):

```bash
map "date_published.ym().try() || `N/A` as month" |
search --breakdown --regex --ignore-case -s headline,description,text
  --patterns queries.csv
  --pattern-column pattern
  --name-column name |
groupby month --along-columns "query_*" "sum(_)" |
sort -s month
```

First we create a column giving the month of each article's publication date, because we are going to aggregate search results on it. For instance `2023-01-01T02:45:07+01:00` will become `2023-01`.
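
In Python terms, the `map` expression does something along these lines (a sketch of the behaviour, not xan's actual code):

```python
from datetime import datetime

def ym(iso_datetime, fallback="N/A"):
    """Keep only the year and month of an ISO datetime, mimicking xan's
    `.ym()` plus the `.try() || `N/A`` fallback for unparsable dates."""
    try:
        return datetime.fromisoformat(iso_datetime).strftime("%Y-%m")
    except ValueError:
        return fallback

ym("2023-01-01T02:45:07+01:00")  # "2023-01"
```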

Then we apply the search, feeding it the patterns from `queries.csv` through the `--patterns` flag. `--pattern-column` tells the command which column of `queries.csv` contains the actual regex pattern, while `--name-column` indicates the associated name that the `--breakdown` flag will use to produce the output columns.

Now let's consider the following file:

| group | text                   |
| ----- | ---------------------- |
| one   | the cat eats the mouse |
| one   | the sun is shining     |
| two   | a cat is nice          |

Using `search --breakdown` on it with patterns `the` and `cat` will produce the following result:

| group | text                   | the | cat |
| ----- | ---------------------- | --- | --- |
| one   | the cat eats the mouse | 2   | 1   |
| one   | the sun is shining     | 1   | 0   |
| two   | a cat is nice          | 0   | 1   |

We add one column per pattern and tally its number of occurrences.
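
Conceptually, `--breakdown` computes something like the following Python sketch (hypothetical code, not xan's implementation):

```python
import re

# One count column per pattern, appended to each row
patterns = {"the": "the", "cat": "cat"}
rows = [
    {"group": "one", "text": "the cat eats the mouse"},
    {"group": "one", "text": "the sun is shining"},
    {"group": "two", "text": "a cat is nice"},
]
for row in rows:
    for name, pattern in patterns.items():
        row[name] = len(re.findall(pattern, row["text"]))
# rows[0] == {"group": "one", "text": "the cat eats the mouse", "the": 2, "cat": 1}
```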

Now the `groupby --along-columns` part lets us run the same aggregation over a selection of columns. So the following command, on our previous example:

```bash
groupby group --along-columns the,cat 'sum(_)'
```

would produce the following result:

| group | the | cat |
| ----- | --- | --- |
| one   | 3   | 1   |
| two   | 0   | 1   |
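
That aggregation boils down to a per-group sum over each selected column, which can be sketched in Python like so:

```python
from collections import defaultdict

# Toy version of `groupby group --along-columns the,cat "sum(_)"`: the same
# aggregation (a sum) is run over each selected column, per group
rows = [
    {"group": "one", "the": 2, "cat": 1},
    {"group": "one", "the": 1, "cat": 0},
    {"group": "two", "the": 0, "cat": 1},
]
along_columns = ("the", "cat")
sums = defaultdict(lambda: dict.fromkeys(along_columns, 0))
for row in rows:
    for column in along_columns:
        sums[row["group"]][column] += row[column]
# sums == {"one": {"the": 3, "cat": 1}, "two": {"the": 0, "cat": 1}}
```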

Finally we use the `sort` command to make sure rows are sorted by month, and that's it (lol).

*Regarding the final transformation*

The last `xan transform` invocation turns a file path into a proper media name. For instance `lemonde/articles.csv.gz` will become `lemonde`.
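
The `_.split("/")[0]` expression behaves like its Python namesake:

```python
# Keep only the first path segment, i.e. the media outlet's directory name
path = "lemonde/articles.csv.gz"
media = path.split("/")[0]  # "lemonde"
```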
<!--