Preprocess files

A collection of use-cases and scripts to make lychee compatible with additional file formats.

Your use-case or file format is not yet documented? Feel free to file an issue or create a PR!

Out of the box, lychee supports the HTML and Markdown file formats. More precisely, HTML files are parsed as HTML5 using the html5ever parser, and Markdown files are treated as CommonMark using pulldown-cmark.

For any other file format, lychee falls back to a "plain text" mode, in which linkify attempts to extract URLs on a best-effort basis. If invalid UTF-8 characters are encountered, the input file is skipped, because lychee assumes that the file is in a binary format it cannot understand.
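To get a rough idea of which files would be skipped in plain-text mode, one can check beforehand whether they are valid UTF-8. The following is a minimal sketch using iconv. The function name is_valid_utf8 is hypothetical, and it is an assumption that iconv rejects the same byte sequences lychee does.

```shell
#!/bin/sh
# Hypothetical helper (not part of lychee): succeeds if the file decodes as UTF-8.
# Assumption: iconv's validity check approximates lychee's own UTF-8 check.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Report every file given on the command line that is not valid UTF-8.
for f in "$@"; do
    is_valid_utf8 "$f" || printf 'not valid UTF-8: %s\n' "$f"
done
```

Files reported by such a check are candidates for one of the conversions described below.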

Preprocess files

lychee allows file preprocessing with the --preprocess flag. For each input file, the command specified with --preprocess is invoked instead of reading the input file directly. The following sections show examples of how to preprocess common file formats. In the snippets, "$1" is the path of the input file, and the converted result is written to stdout. In most cases it's necessary to create a helper script for preprocessing, as no parameters can be supplied from the CLI directly.

lychee files/* --preprocess ./preprocess.sh

Take a look at preprocess.sh to see how this is done.
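As an illustration only, a helper script could pick a converter based on the file extension. The sketch below is hypothetical and is not the repository's actual preprocess.sh. The function names extension_of and preprocess and the extension-to-tool mapping are assumptions, while the commands themselves are the ones described in the sections below.

```shell
#!/bin/sh
# Hypothetical dispatcher sketch; the real preprocess.sh may look different.

# Print the lowercased extension of a path (the text after the last dot).
extension_of() {
    _name="${1##*/}"    # strip leading directories
    printf '%s\n' "${_name##*.}" | tr '[:upper:]' '[:lower:]'
}

# Convert one input file and write the result to stdout.
preprocess() {
    case "$(extension_of "$1")" in
        epub|docx|odt|ipynb)
            pandoc "$1" --to=html --wrap=none --markdown-headings=atx ;;
        adoc|asciidoc)
            asciidoctor -a stylesheet! "$1" -o - ;;
        pdf)
            pdftohtml -i -s -stdout "$1" ;;
        csv)
            csvtk csv2json "$1" ;;
        *)
            # Unknown format: pass the file through unchanged.
            cat "$1" ;;
    esac
}

# Run only when a file path is given as the first argument.
if [ "$#" -gt 0 ]; then
    preprocess "$1"
fi
```

Passing unknown formats through with cat keeps the default behaviour for files that need no conversion.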

Converting file formats

epub, docx, odt, xlsx, ipynb

pandoc is a powerful conversion tool which allows us to convert many file types into HTML.

pandoc "$1" --to=html --wrap=none --markdown-headings=atx

Jupyter files (ipynb) can alternatively be converted with nbconvert.

jupyter nbconvert --to html "$1" --stdout --template basic

odp, pptx, ods, xlsx

LibreOffice can convert documents to various formats. Unfortunately, it does not support printing the result to stdout directly as of 2025, as the --cat option is not compatible with --convert-to. This makes usage a bit clumsy. Additionally, LibreOffice includes URLs in the head which we discard with sed.

libreoffice --headless --convert-to html "$1" --outdir /tmp
file=$(basename "$1")
file="/tmp/${file%.*}.html"
sed '/<body/,$!d' "$file" # discard content before body which contains libreoffice URLs
rm "$file"
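The snippet above writes to a fixed path under /tmp, so two inputs with the same base name could overwrite each other if they were converted at the same time. One possible variation, sketched here under that assumption, uses a fresh temporary directory per invocation. The function names converted_path and convert_with_libreoffice are hypothetical, and silencing LibreOffice's own progress output is a choice of this sketch.

```shell
#!/bin/sh
# Hypothetical variation of the LibreOffice snippet using a private temp directory.

# Print the path of the HTML file LibreOffice writes for input $1 into directory $2.
converted_path() {
    _name="${1##*/}"                        # strip leading directories
    printf '%s/%s.html\n' "$2" "${_name%.*}"  # swap the extension for .html
}

convert_with_libreoffice() {
    _outdir=$(mktemp -d) || return 1
    # Discard LibreOffice's progress messages so only the HTML reaches stdout.
    libreoffice --headless --convert-to html "$1" --outdir "$_outdir" >/dev/null 2>&1
    # Discard content before <body>, which contains LibreOffice URLs.
    sed '/<body/,$!d' "$(converted_path "$1" "$_outdir")"
    rm -rf "$_outdir"
}

# Run only when a file path is given as the first argument.
if [ "$#" -gt 0 ]; then
    convert_with_libreoffice "$1"
fi
```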

AsciiDoc

Using asciidoctor we can convert AsciiDoc to HTML. We use -a stylesheet! to disable generation of stylesheets and associated URLs.

asciidoctor -a stylesheet! "$1" -o -

PDF

Using poppler-utils we can convert PDFs to HTML:

pdftohtml -i -s -stdout "$1"

# or to text
pdftotext "$1" -

Alternatively, pdftk can be used to extract URI directives from PDFs. Source: https://unix.stackexchange.com/a/531883

pdftk "$1" output - uncompress | grep -aPo '/URI *\(\K[^)]*'

CSV

Although CSV seems like a simple data format, lychee cannot understand value separators.

url,name
https://github.com/lycheeverse/lychee,Hello there

In the above example lychee might mistakenly detect https://github.com/lycheeverse/lychee,Hello as a URL. CSV separators can be customised and there is no way for lychee to know which separator is used. Because of that, it's the user's responsibility to transform CSV into a format lychee can understand.
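When the separator is known and no quoted field contains it, a naive transformation can be enough: print every field on its own line so that each URL is followed by a line break. The sketch below assumes exactly that. The function name split_csv_fields is hypothetical, and quoted fields containing the separator are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print each CSV field of file $1 on its own line.
# $2 is the separator character (defaults to a comma).
# Assumption: no quoted field contains the separator.
split_csv_fields() {
    tr "${2:-,}" '\n' < "$1"
}

# Run only when a file path is given as the first argument.
if [ "$#" -gt 0 ]; then
    split_csv_fields "$1"
fi
```

For files with quoting or escaping, a CSV-aware tool is the safer choice.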

csvtk

csvtk is a toolkit to work with CSV data. Apart from many advanced features, it allows us to convert and pretty-print CSV. One possible way to transform the data is:

csvtk csv2json "$1"

Tests

To ensure preprocess.sh works as expected, we have a test script test.sh which is also run in CI. The examples in this README should be tested this way. Whenever possible, we try to create the test files programmatically with produce-files.sh. When this is not easily possible, we create them manually.
