Commit 2bc271f

Reorganizing learning docs
1 parent d33471b commit 2bc271f

4 files changed: +78 -45 lines changed
README.md

Lines changed: 13 additions & 8 deletions
@@ -46,6 +46,7 @@ Finally, `xan` can be used to display CSV files in the terminal, for easy explor
 * [Pre-built binaries](#pre-built-binaries)
 * [Installing completions](#installing-completions)
 * [Quick tour](#quick-tour)
+* [Learning](#learning)
 * [Available commands](#available-commands)
 * [General flags and IO model](#general-flags-and-io-model)
 * [Getting help](#getting-help)
@@ -57,7 +58,6 @@ Finally, `xan` can be used to display CSV files in the terminal, for easy explor
 * [Compressed files](#compressed-files)
 * [Regarding color](#regarding-color)
 * [Expression language reference](#expression-language-reference)
-* [Cookbook](#cookbook)
 * [News](#news)
 * [How to cite?](#how-to-cite)
 * [Frequently Asked Questions](#frequently-asked-questions)
@@ -538,6 +538,18 @@ Displaying 1 col from 5 rows of <stdin>
 
 To access the expression language's [cheatsheet](./docs/moonblade/cheatsheet.md), run `xan help cheatsheet`. To display the full list of available [functions](./docs/moonblade/functions.md), run `xan help functions`. Finally, to display the list of available [aggregation functions](./docs/moonblade/aggs.md), run `xan help aggs`.
 
+## Learning
+
+If you speak French, here is a quick rundown of the tool by our friends from [CERES](https://ceres.sorbonne-universite.fr/): [https://ceres.sorbonne-universite.fr/test_outil_xan/](https://ceres.sorbonne-universite.fr/test_outil_xan/)
+
+*Documented use-cases*
+
+* [Merging frequency tables, three ways](./docs/cookbook/frequency_tables.md)
+* [Parsing and visualizing dates with xan](./docs/cookbook/dates.md)
+* [Joining files by URL prefixes](./docs/cookbook/urls.md)
+
+For a sense of what can be achieved with `xan`, see this page summarizing a variety of complex but detailed pipelines that have been used in real life by real people to solve their problems, using the tool: [PIPELINES](./docs/PIPELINES.md).
+
 ## Available commands
 
 - [**help**](./docs/cmd/help.md): Get help regarding the expression language
@@ -738,13 +750,6 @@ They also respect typical environment variables related to ANSI colouring, such
 - [Comprehensive list of window aggregation functions](./docs/moonblade/window.md)
 - [Scraping DSL](./docs/moonblade/scraping.md)
 
-## Cookbook
-
-* [Merging frequency tables, three ways](./docs/cookbook/frequency_tables.md)
-* [Parsing and visualizing dates with xan](./docs/cookbook/dates.md)
-* [Joining files by URL prefixes](./docs/cookbook/urls.md)
-* [Miscellaneous](./docs/cookbook/misc.md)
-
 ## News
 
 For news about the tool's evolutions feel free to read:

docs/PIPELINES.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# `xan` pipelines
+
+Curated collection of unhinged `xan` pipelines.
+
+## Summary
+
+* [Paginating urls to download](#paginating-urls-to-download)
+* [Making sure a crawler was logged in by reading files in parallel](#making-sure-a-crawler-was-logged-in-by-reading-files-in-parallel)
+* [Parsing logs using `xan separate`](#parsing-logs-using-xan-separate)
+
+## Paginating urls to download
+
+Let's say you want to download the latest 50 pages from [Hacker News](https://news.ycombinator.com). Fortunately our [`minet`](https://github.com/medialab/minet) tool knows how to efficiently download a bunch of URLs fed through a CSV file.
+
+The idea here is to generate CSV data out of thin air and to transform it into a URL list to be fed to the `minet fetch` command:
+
+```bash
+xan range --start 1 50 --inclusive | \
+xan select --evaluate '"https://news.ycombinator.com/?p=" ++ n as url' | \
+minet fetch url --input -
+```
+
+## Making sure a crawler was logged in by reading files in parallel
+
+Let's say one column of your CSV file contains paths to files, relative to some `downloaded` folder, and you want to make sure all of them contain some string (maybe you crawled some website and want to make sure you were correctly logged in by searching for some occurrence of your username):
+
+```bash
+xan progress files.csv | \
+xan filter -p 'pathjoin("downloaded", path) | read | !contains(_, /yomguithereal/i)' > not-logged.csv
+```
+
+## Parsing logs using `xan separate`
+
+<!-- show plots -->
+
+```bash
+xan from -f txt ~/Downloads/toflit18.log.gz | xan rename log | xan separate 0 -rc '- - \[([^\]]+)\] "([^"]+)" (\d+) \d+ "[^"]*" "([^"]+)"' --keep --into datetime,http_call,http_status,user_agent | xan map -O 'datetime.datetime("%d/%b/%Y:%H:%M:%S %z") as datetime, http_call.split(" ")[1] as url' > toflit18-log.csv
+
+xan search -s url -e / toflit18-log.csv.gz | xan plot -LT datetime --count
+xan plot -LT datetime --count toflit18-log.csv.gz --ignore
+```
+
+<!--
+
+xan filter 'http_status == 200 && col("path", 1).endswith(".pdf")' report-files.csv | xan map -p 'col("path", 1) | pjoin("files", _) | fmt("pdftotext {} -", _) | shell(_).trim() as text' | xan select ndoc,uid,title,lastModified,link,text | xan rename -s lastModified last_modified > final.csv
+
+xan parallel cat \
+--progress \
+-S media \
+-B -1 \
+-P '
+select -f scripts/harmonization.moonblade |
+map "date_published.ym().try() || `N/A` as month" |
+search -Bri -s headline,description,text
+--patterns scripts/climate_week/queries.csv
+--pattern-column pattern
+--name-column name |
+groupby month -C -5: "sum(_)" |
+sort -s month' \
+*/articles.csv.gz | \
+xan transform media '_.split("/")[0]' > $BASE_DIR/matches.csv
+
+-->

docs/blog/csv_base_jumping.md

Lines changed: 2 additions & 0 deletions
@@ -347,6 +347,8 @@ time xan bisect --search name Chloe sorted-people.csv
 0.017s
 ```
 
+As an aside, this is very similar to what the [`look`](https://man7.org/linux/man-pages/man1/look.1.html) unix command does, but for lines instead of CSV data.
+
 ## Caveat emptor
 
 The technique demonstrated by this article is far from a silver bullet and suffers from some drawbacks. Here is an unabridged list of those drawbacks:

docs/cookbook/misc.md

Lines changed: 0 additions & 37 deletions
This file was deleted.