| 1 | +--- |
| 2 | +title: File preprocessing |
| 3 | +--- |
| 4 | + |
| 5 | +Out of the box, lychee supports HTML, Markdown, and plain text formats. |
| 6 | +HTML files are parsed as HTML5 using the [html5ever] parser. |
| 7 | +Markdown files are parsed as [CommonMark] using [pulldown-cmark]. |
| 8 | + |
| 9 | +For any other file format, lychee falls back to a "plain text" mode. |
| 10 | +In this mode, [linkify] extracts URLs on a best-effort basis. |
| 11 | +If invalid UTF-8 is encountered, the input file is skipped, |
| 12 | +because lychee assumes the file is in a binary format it cannot understand. |
| 13 | + |
| 14 | +lychee allows file preprocessing with the `--preprocess` flag. |
| 15 | +For each input file, lychee invokes the command given with `--preprocess` instead of reading the file directly; the file path is passed as the first argument and the command's standard output is used as the input. |
| 16 | +The examples below show how to preprocess common file formats. |
| 17 | +In most cases a helper script is necessary, |
| 18 | +as no additional parameters can be passed to the command from the CLI directly. |
| 19 | + |
| 20 | +```bash |
| 21 | +lychee files/* --preprocess ./preprocess.sh |
| 22 | +``` |
| 23 | + |
| 24 | +The referenced `preprocess.sh` script could look like this: |
| 25 | + |
| 26 | +```bash |
| 27 | +#!/usr/bin/env bash |
| 28 | + |
| 29 | +case "$1" in |
| 30 | +*.pdf) |
| 31 | + exec pdftohtml -i -s -stdout "$1" |
| 32 | + # Alternatives: |
| 33 | + # exec pdftotext "$1" - |
| 34 | + # exec pdftk "$1" output - uncompress | grep -aPo '/URI *\(\K[^)]*' |
| 35 | + ;; |
| 36 | +*.odt|*.docx|*.epub|*.ipynb) |
| 37 | + exec pandoc "$1" --to=html --wrap=none --markdown-headings=atx |
| 38 | + ;; |
| 39 | +*.odp|*.pptx|*.ods|*.xlsx) |
| 40 | + # libreoffice can't print to stdout unfortunately |
| 41 | + libreoffice --headless --convert-to html "$1" --outdir /tmp |
| 42 | + file=$(basename "$1") |
| 43 | + file="/tmp/${file%.*}.html" |
| 44 | + sed '/<body/,$!d' "$file" # discard content before body which contains libreoffice URLs |
| 45 | + rm "$file" |
| 46 | + ;; |
| 47 | +*.adoc|*.asciidoc) |
| 48 | + asciidoctor -a stylesheet! "$1" -o - |
| 49 | + ;; |
| 50 | +*.csv) |
| 51 | + # specify --delimiter if values not delimited by "," |
| 52 | + exec csvtk csv2json "$1" |
| 53 | + ;; |
| 54 | +*) |
| 55 | + # identity function, output input without changes |
| 56 | + exec cat "$1" |
| 57 | + ;; |
| 58 | +esac |
| 59 | +``` |
| 60 | + |
| 61 | +For more examples and information, take a look at [lychee-all], |
| 62 | +a repository dedicated to collecting use cases for file preprocessing. |
| 63 | +Feel free to open an issue if a specific file format is missing or if you have questions. |
| 64 | + |
| 65 | +[linkify]: https://github.com/robinst/linkify |
| 66 | +[html5ever]: https://github.com/servo/html5ever |
| 67 | +[CommonMark]: https://commonmark.org/ |
| 68 | +[pulldown-cmark]: https://github.com/pulldown-cmark/pulldown-cmark/ |
| 69 | +[lychee-all]: https://github.com/lycheeverse/lychee-all |
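
The `case` dispatch in `preprocess.sh` can be checked without any converter installed. The following is a minimal sketch, not part of lychee: `converter_for` is a hypothetical helper that mirrors the patterns of the script above and only prints the name of the tool that the matching branch would run.

```shell
#!/usr/bin/env bash

# Hypothetical dry-run helper (an assumption for illustration, not a lychee feature):
# print the converter that the case statement in preprocess.sh would pick for a path.
converter_for() {
  case "$1" in
  *.pdf) echo "pdftohtml" ;;
  *.odt | *.docx | *.epub | *.ipynb) echo "pandoc" ;;
  *.odp | *.pptx | *.ods | *.xlsx) echo "libreoffice" ;;
  *.adoc | *.asciidoc) echo "asciidoctor" ;;
  *.csv) echo "csvtk" ;;
  # Everything else is passed through unchanged.
  *) echo "cat" ;;
  esac
}
```

For example, `converter_for files/report.pdf` prints `pdftohtml`. The patterns are matched case-sensitively by default, so `REPORT.PDF` would fall through to the `cat` branch; if upper-case extensions occur in the input set, the script probably needs extra patterns for them.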