Commit 5655db7

File preprocessing (#114)
1 parent 4129314 commit 5655db7

6 files changed: +106 -76 lines changed

.github/workflows/check-pr-links.yml

Lines changed: 12 additions & 48 deletions
@@ -9,59 +9,23 @@ jobs:
   linkChecker:
     runs-on: ubuntu-latest
     steps:
-      - name: Clone repository
-        uses: actions/checkout@v5
-        with:
-          fetch-depth: 0
-
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
-        with:
-          node-version: "20"
+      - uses: actions/checkout@v5
 
-      - name: Setup pnpm
-        uses: pnpm/action-setup@v4
+      - name: Build site
+        uses: withastro/action@v5
         with:
-          version: latest
-
-      - name: Check out master branch
-        run: git checkout master
-
-      - name: Install dependencies for master
-        run: pnpm install --frozen-lockfile
-
-      - name: Build site from master
-        run: pnpm build
-
-      - name: Dump all links from master
-        id: dump_links_from_master
-        uses: lycheeverse/lychee-action@v2
-        with:
-          args: '--dump --root-dir ${{ github.workspace }}/dist --exclude-all-private dist'
-          output: ./links-master.txt
-
-      - name: Stash untracked files
-        run: git stash push --include-untracked
-
-      - name: Check out feature branch
-        run: git fetch origin ${{ github.ref }} && git checkout FETCH_HEAD
-
-      - name: Apply stashed changes
-        run: git stash pop || true
-
-      - name: Install dependencies for feature branch
-        run: pnpm install --frozen-lockfile
-
-      - name: Build site from feature branch
-        run: pnpm build
-
-      - name: Append links-master.txt to .lycheeignore
-        run: cat links-master.txt >> .lycheeignore
+          package-manager: pnpm@latest
 
-      - name: Check links in PR changes
+      - name: Check links
         uses: lycheeverse/lychee-action@v2
         with:
-          args: '--root-dir ${{ github.workspace }}/dist --exclude-all-private dist'
+          # Remap live URLs to build directory because the links are potentially not live (not yet on master)
+          args: |
+            --root-dir $PWD/dist
+            --exclude-all-private
+            --remap 'https://lychee\.cli\.rs/(.*)/ file://'$PWD'/dist/$1/index.html'
+            dist/
+            src/
           fail: true
 
       - name: Suggestions
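
Note on the `--remap` rule in this workflow: the sketch below approximates the rewrite with `sed`, to show which local file a live URL ends up pointing at. It is an illustration only, not how lychee applies the rule internally. The `remap_url` helper and the `/repo` root (a stand-in for `$PWD`) are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: emulates the workflow's --remap pattern with sed -E.
# $1 = URL to rewrite, $2 = repository root (stand-in for $PWD in the workflow).
remap_url() {
  local url="$1" root="$2"
  printf '%s\n' "$url" |
    sed -E "s#https://lychee\.cli\.rs/(.*)/#file://${root}/dist/\1/index.html#"
}

remap_url "https://lychee.cli.rs/guides/preprocessing/" "/repo"
remap_url "https://lychee.cli.rs/404/" "/repo"
```

Under this approximation the second call yields a path ending in `dist/404/index.html`, which matches the `.lycheeignore` comment about the 404 page being remapped to an invalid path.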

.lycheeignore

Lines changed: 13 additions & 4 deletions
@@ -1,8 +1,17 @@
-https://api.reacher.email/v0/check_email
 file:///home/user/website/
 ^https://www/$
 ^https://web/$
-# 404 page returns a 404, d'oh
-https://lychee.cli.rs/404/
-# Errors with "Too Many Requests"
+
+# URL is used with POST
+https://api.reacher.email/v0/check_email
+
+# 404 page is directly in dist/404.html but we've remapped it to an invalid path
+dist/404/index.html$
+
+# Code examples in base-url.mdx which don't exist
+/docs/about.php$
+/docs/recipes/guide.php$
+
+# Websites with aggressive rate limiting / bot detection
 https://www.nongnu.org/atool
+https://builtwith.com/

astro.config.mjs

Lines changed: 1 addition & 0 deletions
@@ -42,6 +42,7 @@ export default defineConfig({
     "guides/config",
     "guides/cli",
     "guides/output",
+    "guides/preprocessing",
   ],
 },
 {

src/content/docs/guides/getting-started.mdx

Lines changed: 5 additions & 18 deletions
@@ -23,7 +23,7 @@ You can install Lychee using various package managers.
     <Code code="docker pull lycheeverse/lychee" lang="sh" />
   </TabItem>
   <TabItem label="NixOS">
-    <Code code="nix-env -iA nixos.lychee" lang="sh" />
+    <Code code="nix-shell -p lychee" lang="sh" />
   </TabItem>
   <TabItem label="FreeBSD">
     <Code code="pkg install lychee" lang="sh" />
@@ -206,24 +206,11 @@ In this command, we ignore the case when globbing, so it matches
 - `~/projects/rust_game_/README`
 - `~/projects/python_script_/Readme.markdown`
 
-### Check Links From Epub File
+### Check other file formats
 
-If you have [atool](https://www.nongnu.org/atool) installed, you can check links inside `.epub` files as well!
-
-```bash
-acat -F zip {file.epub} "_.xhtml" "_.html" | lychee -
-```
-
-:::caution[Attention]
-lychee parses other file formats as plaintext and extracts links using [linkify](https://github.com/robinst/linkify).
-This generally works well if there are no format- or encoding
-specifics, but in case you need dedicated support for a new file format, please
-consider [creating an issue](https://github.com/lycheeverse/lychee/issues).
-:::
-
-[atool]: https://www.nongnu.org/atool
-[linkify]: https://github.com/robinst/linkify
-[issue]: https://github.com/lycheeverse/lychee/issues
+By preprocessing files it is possible to do link checking on
+files which aren't officially supported by lychee.
+See [file preprocessing](/guides/preprocessing).
 
 ## GitHub Action
 

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+---
+title: File preprocessing
+---
+
+Out of the box, lychee supports HTML, Markdown and plain text formats.
+More precisely, HTML files are parsed as HTML5 with the use of the [html5ever] parser.
+Markdown files are treated as [CommonMark] with the use of [pulldown-cmark].
+
+For any other file format lychee falls back to a "plain text" mode.
+This means that [linkify] attempts to extract URLs on a best-effort basis.
+If invalid UTF-8 characters are encountered, the input file is skipped,
+because it is assumed that the file is in a binary format lychee cannot understand.
+
+lychee allows file preprocessing with the `--preprocess` flag.
+For each input file the command specified with `--preprocess` is invoked instead of reading the input file directly.
+The examples below show how to preprocess common file formats.
+In most cases it's necessary to create a helper script for preprocessing,
+as no parameters can be supplied from the CLI directly.
+
+```bash
+lychee files/* --preprocess ./preprocess.sh
+```
+
+The referenced `preprocess.sh` script could look like this:
+
+```bash
+#!/usr/bin/env bash
+
+case "$1" in
+*.pdf)
+    exec pdftohtml -i -s -stdout "$1"
+    # Alternatives:
+    # exec pdftotext "$1" -
+    # exec pdftk "$1" output - uncompress | grep -aPo '/URI *\(\K[^)]*'
+    ;;
+*.odt|*.docx|*.epub|*.ipynb)
+    exec pandoc "$1" --to=html --wrap=none --markdown-headings=atx
+    ;;
+*.odp|*.pptx|*.ods|*.xlsx)
+    # libreoffice can't print to stdout unfortunately
+    libreoffice --headless --convert-to html "$1" --outdir /tmp
+    file=$(basename "$1")
+    file="/tmp/${file%.*}.html"
+    sed '/<body/,$!d' "$file" # discard content before body which contains libreoffice URLs
+    rm "$file"
+    ;;
+*.adoc|*.asciidoc)
+    asciidoctor -a stylesheet! "$1" -o -
+    ;;
+*.csv)
+    # specify --delimiter if values not delimited by ","
+    exec csvtk csv2json "$1"
+    ;;
+*)
+    # identity function, output input without changes
+    exec cat "$1"
+    ;;
+esac
+```
+
+For more examples and information take a look at [lychee-all],
+a repository dedicated to collecting use cases for file preprocessing.
+Feel free to open up an issue if you are missing a specific file format or have questions.
+
+[linkify]: https://github.com/robinst/linkify
+[html5ever]: https://github.com/servo/html5ever
+[CommonMark]: https://commonmark.org/
+[pulldown-cmark]: https://github.com/pulldown-cmark/pulldown-cmark/
+[lychee-all]: https://github.com/lycheeverse/lychee-all
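
Note on trying out a preprocess script: a script like the one added in this file can be exercised by hand before it is passed to `--preprocess`. The sketch below assumes the command receives the input path as its only argument and that its standard output is what gets scanned for links. The `preprocess` function and the `.gz` branch are hypothetical; they were chosen only so the example needs no external converters.

```bash
#!/usr/bin/env bash
# Sketch only: a tiny preprocessor written as a function so it can be tested by hand.
# Assumption: the input path arrives as "$1" and the converted text goes to stdout.
preprocess() {
  case "$1" in
    *.gz)
      # hypothetical branch: decompress so plain-text extraction can see the content
      gzip -dc "$1"
      ;;
    *)
      # fallback: pass the file through unchanged
      cat "$1"
      ;;
  esac
}

# Dry run against two throwaway files
tmp=$(mktemp -d)
printf 'see https://example.com/a\n' > "$tmp/plain.txt"
printf 'see https://example.com/b\n' | gzip -c > "$tmp/packed.txt.gz"
preprocess "$tmp/plain.txt"
preprocess "$tmp/packed.txt.gz"
rm -r "$tmp"
```

Piping the output of such a dry run into `grep -o 'https\?://[^ ]*'` gives a rough preview of the URLs that would be visible after preprocessing.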

src/content/docs/recipes/base-url.mdx

Lines changed: 6 additions & 6 deletions
@@ -66,15 +66,15 @@ Here's what happens to different types of links:
 
 <Code
   code={`<!-- Original links -->
-<a href="./guide.html">Guide</a>
-<a href="../about.html">About</a>
-<a href="https://other.com">External</a>
+<a href="./guide.php">Guide</a>
+<a href="../about.php">About</a>
+<a href="https://example.com">Absolute</a>
 
 <!-- After --base-url https://example.com/docs/ -->
 
-<a href="https://example.com/docs/guide.html">Guide</a>
-<a href="https://example.com/about.html">About</a>
-<a href="https://other.com">External</a>`} lang={fileLang}
+<a href="https://example.com/docs/guide.php">Guide</a>
+<a href="https://example.com/about.php">About</a>
+<a href="https://example.com">Absolute</a>`} lang={fileLang}
   title="Link Resolution Example" />
 
 ## Common Use Cases
