extract_links returns 0 results when anchor tags contain newlines between tag name and attributes #129

@haroldparis

Bug Description

extract_links() returns an empty set when the HTML page contains anchor tags with a newline (or other whitespace) between the tag name and its attributes — e.g. <a\nhref="..."> instead of the more common <a href="...">.

Root Cause

In courlan/core.py, the regex used to find anchor tags is:

FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)

This pattern requires a literal space after <a. However, the HTML5 specification allows any ASCII whitespace (space, tab, newline, form feed, carriage return) between a tag name and its attributes. A page producing <a\nclass="..." href="..."> is perfectly valid HTML5, but the regex finds zero matches.
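The failure can be confirmed against the current pattern alone, without going through extract_links():

```python
import re

# Current pattern from courlan/core.py: requires a literal space after "<a"
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)

html = '<html><body><a\nhref="https://example.com/page/">link</a></body></html>'
# The newline between "<a" and "href" prevents any match
print(FIND_LINKS_REGEX.findall(html))  # []
```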

Minimal Reproducible Example

from courlan import extract_links

# Standard format — works correctly
html_standard = '<html><body><a href="https://example.com/page/">link</a></body></html>'
print(extract_links(html_standard, "https://example.com/"))
# {'https://example.com/page/'}

# Newline between tag name and attributes — returns empty set
html_newline = '<html><body><a\nhref="https://example.com/page/">link</a></body></html>'
print(extract_links(html_newline, "https://example.com/"))
# set()  ← expected: {'https://example.com/page/'}

Context

This was discovered on a WordPress site where a caching plugin outputs every <a> tag with a newline before the attributes (i.e. every single anchor tag on the page uses <a\nhref=...> format). As a result, trafilatura's focused_crawler reports "0 links found – 0 valid links" and returns only the seed URL, making the crawler completely ineffective on that site.

Interestingly, lxml has no problem parsing the same HTML:

from lxml import html
tree = html.fromstring('<html><body><a\nhref="https://example.com/page/">link</a></body></html>')
print(tree.xpath('//a/@href'))
# ['https://example.com/page/']

Suggested Fix

Replace the literal space with \s+ (one or more whitespace characters) in FIND_LINKS_REGEX:

# Before
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)

# After
FIND_LINKS_REGEX = re.compile(r"<a\s+[^<>]+?>", re.I)

This is backward-compatible: \s+ still matches a plain space but also handles tabs, newlines, and carriage returns — all of which are valid whitespace per the HTML5 spec.
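A quick sketch checking the proposed pattern against the whitespace variants HTML5 allows (separators chosen here for illustration):

```python
import re

# Proposed pattern: \s+ accepts any run of whitespace after the tag name
FIND_LINKS_REGEX = re.compile(r"<a\s+[^<>]+?>", re.I)

for sep in (" ", "\n", "\t", "\r\n"):
    html = f'<a{sep}href="https://example.com/page/">link</a>'
    match = FIND_LINKS_REGEX.search(html)
    print(repr(sep), "->", "match" if match else "no match")
# every separator matches, including the plain-space case
```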

Environment

  • courlan version: $(python3 -c "import courlan; print(courlan.version)" 2>/dev/null || echo "latest")
  • Python 3.12
