extract_links returns 0 results when anchor tags contain newlines between tag name and attributes #129
Bug Description
extract_links() returns an empty set when the HTML page contains anchor tags with a newline (or other whitespace) between the tag name and its attributes — e.g. <a\nhref="..."> instead of the more common <a href="...">.
Root Cause
In courlan/core.py, the regex used to find anchor tags is:
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)
This pattern requires a literal space after <a. However, the HTML5 specification allows any ASCII whitespace (space, tab, newline, form feed, carriage return) between a tag name and its attributes. A page producing <a\nclass="..." href="..."> is perfectly valid HTML5, but the regex finds zero matches.
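The mismatch can be checked with the standard library alone, without courlan installed. The following is a minimal sketch: the pattern is copied from the line quoted above, while the helper name matches_by_separator and the sample URL are illustrative assumptions, not part of courlan.

```python
import re

# Pattern as quoted above from courlan/core.py.
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)

# The five ASCII whitespace characters HTML5 accepts after a tag name.
ASCII_WHITESPACE = {
    "space": " ",
    "tab": "\t",
    "newline": "\n",
    "form feed": "\f",
    "carriage return": "\r",
}


def matches_by_separator(pattern):
    """Hypothetical helper: report, per separator, whether the pattern finds the tag."""
    return {
        name: bool(pattern.search(f'<a{sep}href="https://example.com/page/">link</a>'))
        for name, sep in ASCII_WHITESPACE.items()
    }


results = matches_by_separator(FIND_LINKS_REGEX)
for name, found in results.items():
    print(f"{name:16} {'match' if found else 'no match'}")
```

Only the plain-space variant is found; the other four separators, all valid in HTML5, are skipped.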
Minimal Reproducible Example
from courlan import extract_links
# Standard format — works correctly
html_standard = '<html><body><a href="https://example.com/page/">link</a></body></html>'
print(extract_links(html_standard, "https://example.com/"))
# {'https://example.com/page/'}
# Newline between tag name and attributes — returns empty set
html_newline = '<html><body><a\nhref="https://example.com/page/">link</a></body></html>'
print(extract_links(html_newline, "https://example.com/"))
# set() ← expected: {'https://example.com/page/'}
Context
This was discovered on a WordPress site where a caching plugin outputs every <a> tag with a newline before the attributes (i.e. every single anchor tag on the page uses <a\nhref=...> format). As a result, trafilatura's focused_crawler reports "0 links found – 0 valid links" and returns only the seed URL, making the crawler completely ineffective on that site.
Interestingly, lxml has no problem parsing the same HTML:
from lxml import html
tree = html.fromstring('<html><body><a\nhref="https://example.com/page/">link</a></body></html>')
print(tree.xpath('//a/@href'))
# ['https://example.com/page/']
Suggested Fix
Replace the literal space with \s+ (one or more whitespace characters) in FIND_LINKS_REGEX:
# Before
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)
# After
FIND_LINKS_REGEX = re.compile(r"<a\s+[^<>]+?>", re.I)
This is backward-compatible: \s+ still matches a plain space but also handles tabs, newlines, and carriage returns, all of which are valid whitespace per the HTML5 spec.
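The backward-compatibility claim can be sanity-checked by running both patterns over the same inputs. This is a sketch under stated assumptions: OLD_REGEX and NEW_REGEX are illustrative names for the two patterns shown above, and the sample tags are made up for the comparison rather than taken from courlan's test suite.

```python
import re

OLD_REGEX = re.compile(r"<a [^<>]+?>", re.I)    # pattern before the change
NEW_REGEX = re.compile(r"<a\s+[^<>]+?>", re.I)  # pattern proposed above

# Illustrative inputs (assumed, not from courlan's tests).
samples = [
    '<a href="/x">',                  # plain space
    '<A HREF="/x">',                  # upper case, plain space
    '<a\nhref="/x">',                 # newline after the tag name
    '<a\t\r\n class="c" href="/x">',  # mixed whitespace run
    '<abbr title="t">',               # different tag starting with "a"
    '<area\nhref="/x">',              # different tag starting with "a"
    '<a>',                            # anchor without attributes
]

old_hits = [s for s in samples if OLD_REGEX.search(s)]
new_hits = [s for s in samples if NEW_REGEX.search(s)]

# Everything the old pattern found is still found by the new one.
print(set(old_hits) <= set(new_hits))
# Tags that only the new pattern picks up.
print([s for s in new_hits if s not in old_hits])
```

On these inputs the new pattern finds everything the old one did, adds the two whitespace variants, and still ignores abbr, area, and attribute-less anchors, because \s+ requires at least one whitespace character directly after <a.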
Environment
- courlan version: $(python3 -c "import courlan; print(courlan.version)" 2>/dev/null || echo "latest")
- Python 3.12