extract_links returns 0 results when anchor tags contain newlines between tag name and attributes

## Bug Description

`extract_links()` skips anchor tags that have a newline (or other non-space whitespace) between the tag name and its attributes, e.g. `<a\nhref="...">` instead of the more common `<a href="...">`. When every anchor on the page is written that way, it returns an empty set.

## Root Cause

In `courlan/core.py`, the regex used to find anchor tags is:

```python
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)
```

This pattern requires a **literal space** after `<a`. However, [the HTML5 specification](https://html.spec.whatwg.org/#before-attribute-name-state) allows any ASCII whitespace (space, tab, newline, form feed, carriage return) between a tag name and its attributes. A page producing `<a\nclass="..." href="...">` is perfectly valid HTML5, but the regex finds zero matches.
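
The mismatch can be reproduced with the standard `re` module alone, without going through `extract_links()`. The following is a minimal sketch: the pattern is copied from the snippet quoted above, and the two sample strings are illustrative rather than taken from the affected site.

```python
import re

# Pattern copied from the issue text above (courlan/core.py).
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)

# Illustrative inputs: same anchor, different whitespace after the tag name.
space_tag = '<a href="https://example.com/page/">link</a>'
newline_tag = '<a\nhref="https://example.com/page/">link</a>'

space_matches = FIND_LINKS_REGEX.findall(space_tag)
newline_matches = FIND_LINKS_REGEX.findall(newline_tag)

print(space_matches)    # ['<a href="https://example.com/page/">']
print(newline_matches)  # []
```

Since the anchor never matches, no `href` is ever extracted from it downstream.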

## Minimal Reproducible Example

```python
from courlan import extract_links

# Standard format — works correctly
html_standard = '<html><body><a href="https://example.com/page/">link</a></body></html>'
print(extract_links(html_standard, "https://example.com/"))
# {'https://example.com/page/'}

# Newline between tag name and attributes — returns empty set
html_newline = '<html><body><a\nhref="https://example.com/page/">link</a></body></html>'
print(extract_links(html_newline, "https://example.com/"))
# set()  ← expected: {'https://example.com/page/'}
```

## Context

This was discovered on a WordPress site where a caching plugin emits every `<a>` tag with a newline before the attributes (`<a\nhref=...>`). As a result, `trafilatura`'s `focused_crawler` reports `"0 links found – 0 valid links"` and returns only the seed URL, which makes the crawler completely ineffective on that site.

Interestingly, lxml has no problem parsing the same HTML:

```python
from lxml import html
tree = html.fromstring('<html><body><a\nhref="https://example.com/page/">link</a></body></html>')
print(tree.xpath('//a/@href'))
# ['https://example.com/page/']
```

## Suggested Fix

Replace the literal space with `\s+` (one or more whitespace characters) in `FIND_LINKS_REGEX`:

```python
# Before
FIND_LINKS_REGEX = re.compile(r"<a [^<>]+?>", re.I)

# After
FIND_LINKS_REGEX = re.compile(r"<a\s+[^<>]+?>", re.I)
```

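This is backward-compatible: `\s+` still matches a plain space and additionally accepts tabs, newlines, form feeds, and carriage returns, all of which are valid whitespace per the HTML5 spec. In Python, `\s` also matches a few characters the spec does not treat as whitespace (vertical tab and, for `str` patterns, Unicode whitespace); `[ \t\n\f\r]+` is the stricter alternative if that matters.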
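
A quick way to check the proposed pattern is to run it against each of the five whitespace characters the HTML standard allows, plus a tag that merely starts with `a`. This is a sketch under the assumption that the pattern is adopted exactly as written above; the names `PROPOSED_REGEX` and `HTML_WHITESPACE` are made up for the example and do not exist in `courlan`.

```python
import re

# Proposed pattern from this issue (hypothetical name, not courlan's own).
PROPOSED_REGEX = re.compile(r"<a\s+[^<>]+?>", re.I)

# The five ASCII whitespace characters defined by the HTML standard.
HTML_WHITESPACE = [" ", "\t", "\n", "\f", "\r"]

# Map each separator to whether the anchor tag is found.
results = {
    repr(ws): bool(PROPOSED_REGEX.search(f'<a{ws}href="https://example.com/page/">'))
    for ws in HTML_WHITESPACE
}
print(results)

# Tags that only begin with "a" must still be ignored.
abbr_match = PROPOSED_REGEX.search('<abbr title="example">ex</abbr>')
print(abbr_match)  # None
```

Every separator should map to `True`, while `<abbr>` stays unmatched because `\s+` still requires at least one whitespace character directly after `<a`.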

## Environment

- `courlan` version: not recorded (the version command in the original report was left unexpanded; presumably the latest release)
- Python 3.12
