Ignore a specific directory with HTML files to check #1909

honzajavorek · 2025-11-10T10:31:26Z

Hi, my site contains generated content that I can't really influence. It's different every day, and I don't want to check the links there because I can't fix them anyway.

My specific use case

I automatically read a feed of job postings from several job boards and display them in one section of my website. There are a ton of outbound links, many of them sit behind aggressive anti-scraping (and thus also anti-lychee) protections, the jobs expire throughout the day, etc.

I thought I could ignore those parts with exclude_path, but it doesn't seem to work. I'd like to ask what I'm doing wrong, whether I misunderstood the option, or what the best solution to my problem is.

My site is a static site, plain HTML, rendered in a public folder. My command is ./lychee public. The problematic pages look like this:

[public/jobs/praha/index.html]:
[404] https://www.jobs.cz/rpd/2000811023/?utm_source=juniorguru | Error (cached)
[406] http://www.zebra.com/ | Error (cached)
[403] http://www.sap.com/ | Error (cached)
[404] https://www.jobs.cz/rpd/2000757067/?utm_source=juniorguru | Error (cached)

[public/jobs/index.html]:
[403] http://www.sap.com/ | Network error: Forbidden
[404] https://www.jobs.cz/rpd/2000811023/?utm_source=juniorguru | Network error: Not Found
[406] http://www.zebra.com/ | Network error: Not Acceptable
[404] https://www.jobs.cz/rpd/2000811023/?utm_source=juniorguru | Error (cached)
[404] https://www.jobs.cz/rpd/2000757067/?utm_source=juniorguru | Network error: Not Found

...

My TOML looks like this, but I still get the output above:

exclude_path = [
    "/jobs/index\\.html$",
    "/jobs/[^/]+/index\\.html$",
]

I'd be grateful for any guidance.

mre · 2025-11-11T09:52:40Z

Maintainer

You're very close.

Excluded paths need to be specified from the root directory where you called lychee from:

exclude_path = [
    "public/jobs/index\\.html$",
    "public/jobs/[^/]+/index\\.html$",
]

That might be a little surprising. After all, you already provided public as the input directory, so lychee should be smart enough to figure this out, right? But remember that you can provide multiple inputs, at which point it's no longer clear which input an excluded path refers to. That's why you always have to write out the full path or use a wildcard to match all directories:

exclude_path = [
    ".*/jobs/index\\.html$",
    ".*/jobs/[^/]+/index\\.html$",
]

Thanks for using lychee.

10 replies

thomas-zahner Nov 17, 2025
Maintainer

"Good thinking, but the regex needs to fully match the path and this does not."

I've just double-checked this, and that's actually not true for lychee 0.21.0. An exclude_path entry applies whenever the regex matches anywhere in the path; it does not have to match the whole path. So, for example, a value of jobs excludes both of your examples.
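To make the difference between "matches somewhere" and "matches the whole path" concrete, here is a minimal sketch in Python. It is only an illustration: it uses Python's re module rather than lychee's own regex engine, and the function names is_excluded and is_excluded_full_match are made up for this example. For simple patterns like these, both engines should agree.

```python
import re

# Patterns from the TOML in the question (TOML "\\." is the regex "\.").
PATTERNS = [r"/jobs/index\.html$", r"/jobs/[^/]+/index\.html$"]


def is_excluded(path, patterns=PATTERNS):
    """Unanchored search: a pattern may match anywhere in the path.

    This mirrors the behaviour described for lychee 0.21.0 (assumption:
    Python's `re` behaves like lychee's engine for these patterns).
    """
    return any(re.search(p, path) for p in patterns)


def is_excluded_full_match(path, patterns=PATTERNS):
    """Stricter variant: a pattern must match the entire path."""
    return any(re.fullmatch(p, path) for p in patterns)


if __name__ == "__main__":
    for path in (
        "public/jobs/index.html",
        "public/jobs/praha/index.html",
        "public/index.html",
    ):
        print(path, is_excluded(path), is_excluded_full_match(path))
```

With unanchored search, the original patterns already cover both job pages even without the public prefix. With full matching, they would need the public/ or .*/ prefix suggested earlier.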

I can really recommend that you use --dump-inputs to debug and understand which files lychee excludes. This skips the actual link checking and just prints the files which lychee would check.

@honzajavorek Which version of lychee do you use? There was a bug with --dump-inputs in previous versions and I recommend that you use the latest 0.21.0.

I've reproduced your use case with your provided TOML file with lychee 0.21.0 and it works:

➜ tree public
public
└── jobs
    ├── index.html
    └── praha
        └── index.html

3 directories, 2 files
➜ cat lychee.toml
exclude_path = [
    "/jobs/index\\.html$",
    "/jobs/[^/]+/index\\.html$",
]
➜ lychee public --dump-inputs
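If lychee isn't at hand, the effect of a set of patterns on a directory tree can be roughly approximated with a few lines of Python. This is a sketch under assumptions, not a replacement for --dump-inputs: the helper name remaining_inputs is hypothetical, it only looks at *.html files, and it uses Python's re module instead of lychee's regex engine.

```python
import re
from pathlib import Path


def remaining_inputs(root, exclude_patterns):
    """Return the HTML files under `root` that match none of the patterns.

    Patterns are applied as unanchored searches against the POSIX-style
    path, which is how exclude_path is described for lychee 0.21.0.
    """
    compiled = [re.compile(p) for p in exclude_patterns]
    kept = []
    for path in sorted(Path(root).rglob("*.html")):
        text = path.as_posix()
        if not any(rx.search(text) for rx in compiled):
            kept.append(text)
    return kept


if __name__ == "__main__":
    patterns = [r"/jobs/index\.html$", r"/jobs/[^/]+/index\.html$"]
    for kept_path in remaining_inputs("public", patterns):
        print(kept_path)
```

For the two-file tree shown above, this sketch returns an empty list, because both index.html files match one of the patterns.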

honzajavorek Nov 27, 2025
Author

I'm sorry, I didn't have time to dig into it, but I'm still struggling with it. Your question about the version gave me an idea, though: I can't reproduce the issue locally, while on CI it behaves as if I hadn't excluded anything. I'll check the versions and dig deeper today.

honzajavorek Nov 27, 2025
Author

lychee-v0.18.1 🤦‍♂️

  check-links:
    executor: python-js
    steps:
      - attach_workspace:
          at: "~"
      - run:
          name: Download Lychee
          command: wget "https://github.com/lycheeverse/lychee/releases/download/lychee-v0.18.1/lychee-x86_64-unknown-linux-musl.tar.gz" -O "lychee.tar.gz"
      - run:
          name: Extract Lychee
          command: tar -xzvf "lychee.tar.gz" && chmod +x "lychee"
      - run:
          name: Lychee version
          command: ./lychee --version
      - run:
          name: Check links
          command: ./lychee public

thomas-zahner Nov 27, 2025
Maintainer

Ah that explains the issue 👍

I can recommend that you try the dedicated GitHub action instead of setting up lychee manually, though your approach of course is also fine.

So in summary, updating to lychee 0.21.0 fixes the problem and makes exclude_path work as expected. Also, the regular expressions for exclude_path do not have to match the full path, they just have to produce a match. So the docs are accurate and up to date. It probably was different in the past, e.g. with 0.18.1.

Answer selected by thomas-zahner

honzajavorek Nov 27, 2025
Author

I would use the dedicated GitHub action if I didn't have my whole stack on CircleCI 😅 I'm still verifying that everything is fine after the upgrade, but I think we've got it. Thanks very much - this helped me find the issue. It didn't cross my mind at all that there could be a version discrepancy until you asked about it, even though it's like the first thing one should check 🤦‍♂️ I wrote myself a small script that'll make sure I upgrade to new versions on CI as well.
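The script mentioned above isn't shown in the thread. As a rough illustration of the same idea, a CI step could compare the installed version against a minimum and fail early. The sketch below assumes that ./lychee --version prints a string containing an X.Y.Z version number (for example "lychee 0.21.0"); the names MINIMUM, parse_version, is_new_enough and installed_version_output are hypothetical.

```python
import re
import subprocess

# Assumption: 0.21.0 is the version reported as working in this thread.
MINIMUM = (0, 21, 0)


def parse_version(text):
    """Extract the first X.Y.Z triple from text such as 'lychee 0.21.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_new_enough(version_output, minimum=MINIMUM):
    """Compare the parsed version with the minimum, component by component."""
    return parse_version(version_output) >= minimum


def installed_version_output(binary="./lychee"):
    """Run `<binary> --version` and return its stdout (needs the binary)."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

In a CI step, one could call is_new_enough(installed_version_output()) and exit with a non-zero status when it returns False, so that a stale pinned download is caught before the link check runs.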
