xan transform media '_.split("/")[0]' > $BASE_DIR/matches.csv
```

*Regarding `parallel cat`*

`xan parallel cat` consumes a bunch of CSV files (here everything matching `*/articles.csv.gz`), applies some preprocessing to each file (as given through the `--preprocess` flag) and redirects everything to the standard output.
Then `--progress` means we want to display a progress bar, and `--source-column` means we want to add a new column to the output recording which file each row came from (here each CSV file is in fact the collection of all articles from one media, so it is important for us to track which media each resulting row came from).
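To give an idea of the mechanics, here is a minimal Python sketch of this fan-in pattern (the file names and contents are made up for the example, and this is of course not how `xan` itself is implemented):

```python
# Read several CSV "files" concurrently, tag each row with the file it
# came from (like --source-column does), and merge into a single output.
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for */articles.csv files
files = {
    "lemonde/articles.csv": "title\nA\nB\n",
    "lefigaro/articles.csv": "title\nC\n",
}

def read_rows(item):
    path, content = item
    reader = csv.DictReader(io.StringIO(content))
    return [{**row, "source": path} for row in reader]

with ThreadPoolExecutor() as pool:
    merged = [row for rows in pool.map(read_rows, files.items()) for row in rows]

print(len(merged))  # 3 rows, each carrying its source file
```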

When running a `xan parallel cat` command, output rows are flushed regularly to stdout to avoid overflowing memory. This means, however, that the command must lock access to stdout to serialize the results and avoid race conditions between threads, which ultimately means that output rows might come in some arbitrary order. Here, because we are using `xan search --breakdown`, we know beforehand that each media will only get one row per month in the output. So we can afford to hold all breakdown rows of a media before flushing them, in order to ensure the output order is consistent (meaning that the resulting rows of each media are not interleaved in the output). We therefore use the `--buffer-size -1` flag.
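The difference can be sketched in Python (a toy model, not `xan`'s actual implementation): each worker buffers all of its rows first, then appends them to the shared output under a lock in a single operation, so rows from different sources never interleave.

```python
# Each worker builds all of its rows first, then flushes them to the
# shared output in one locked operation: a source's rows stay contiguous.
from threading import Lock, Thread

output = []
lock = Lock()

def worker(media, months):
    rows = [(media, month) for month in months]  # buffer everything
    with lock:                                   # single locked flush
        output.extend(rows)

threads = [
    Thread(target=worker, args=("media_a", ["2023-01", "2023-02"])),
    Thread(target=worker, args=("media_b", ["2023-01"])),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Whatever order the threads ran in, media_a's two rows are adjacent.
```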

*Regarding the preprocessing*

Here is the preprocessing (the `xan` part can be omitted in a command fed to `--preprocess`):

```bash
map "date_published.ym().try() || `N/A` as month" |
```

First we create a column indicating the month of each article's publication date, because we are going to use it to aggregate the search results. For instance `2023-01-01T02:45:07+01:00` will become `2023-01`.
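In Python terms, the expression does something like this (a hypothetical equivalent for illustration, not `xan`'s actual code):

```python
# Extract the "YYYY-MM" part of an ISO 8601 timestamp, falling back
# to "N/A" when the date is missing or malformed (the `.try()` part).
import re

def year_month(date_string):
    match = re.match(r"(\d{4}-\d{2})", date_string or "")
    return match.group(1) if match else "N/A"

print(year_month("2023-01-01T02:45:07+01:00"))  # 2023-01
print(year_month(""))                           # N/A
```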

Then we apply the search, feeding the patterns from `queries.csv` using the `--patterns` flag. `--pattern-column` lets us tell which column of `queries.csv` contains the actual regex pattern, while `--name-column` indicates an associated name that will be used by the `--breakdown` flag to produce the output columns.

Now let's consider the following file:

| group | text                   |
| ----- | ---------------------- |
| one   | the cat eats the mouse |
| one   | the sun is shining     |
| two   | a cat is nice          |

Using `search --breakdown` on it with patterns `the` and `cat` will produce the following result:

| group | text                   | the | cat |
| ----- | ---------------------- | --- | --- |
| one   | the cat eats the mouse | 2   | 1   |
| one   | the sun is shining     | 1   | 0   |
| two   | a cat is nice          | 0   | 1   |

We add one column per query and tally the number of occurrences of each pattern.
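The tallying logic can be illustrated in Python (a toy re-implementation of the idea, with the patterns hard-coded):

```python
# One output column per named pattern, counting its occurrences
# in the "text" column of each row.
import re

rows = [
    {"group": "one", "text": "the cat eats the mouse"},
    {"group": "one", "text": "the sun is shining"},
    {"group": "two", "text": "a cat is nice"},
]
patterns = {"the": re.compile("the"), "cat": re.compile("cat")}

for row in rows:
    for name, pattern in patterns.items():
        row[name] = len(pattern.findall(row["text"]))

print(rows[0])  # {'group': 'one', 'text': 'the cat eats the mouse', 'the': 2, 'cat': 1}
```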

Now the `groupby --along-columns` part lets us run the same aggregation over a selection of columns. So the following command, run on our previous example:

```bash
groupby group --along-columns the,cat 'sum(_)'
```

Would produce the following result:

| group | the | cat |
| ----- | --- | --- |
| one   | 3   | 1   |
| two   | 0   | 1   |

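A Python sketch of that aggregation (illustrative only): group rows by the `group` column, then run the same sum over each of the selected columns.

```python
# Group rows by "group", then sum each selected column independently.
from collections import defaultdict

rows = [
    {"group": "one", "the": 2, "cat": 1},
    {"group": "one", "the": 1, "cat": 0},
    {"group": "two", "the": 0, "cat": 1},
]

sums = defaultdict(lambda: {"the": 0, "cat": 0})
for row in rows:
    for column in ("the", "cat"):
        sums[row["group"]][column] += row[column]

print(dict(sums))  # {'one': {'the': 3, 'cat': 1}, 'two': {'the': 0, 'cat': 1}}
```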
Finally we use the `sort` command to make sure rows are sorted by month, and that's it (lol).

*Regarding the final transformation*

The last `xan transform` invocation is here to transform a file path into a proper media name. For instance `lemonde/articles.csv.gz` will become `lemonde`.
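In Python, the `_.split("/")[0]` expression used in that invocation amounts to:

```python
# Keep what precedes the first slash of the path: the media folder name.
def media_name(path):
    return path.split("/")[0]

print(media_name("lemonde/articles.csv.gz"))  # lemonde
```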