Description
crawl4ai version
0.6.4
Expected Behavior
JsonCssExtractionStrategy should correctly process sibling CSS selectors such as "+ tr span.age", which target elements (here, <span class="age">) inside the immediately following sibling of a base element (e.g., <tr class="athing"> on Hacker News). With this selector, it should select the element from the next sibling row and extract the requested attribute or text, such as the title attribute containing the timestamp.
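To make the expected semantics concrete, here is a minimal sketch using only the Python standard library. It does not use Crawl4AI, and the markup is a hypothetical, simplified stand-in for the Hacker News table. It only illustrates what "+ tr span.age" is expected to match relative to each tr.athing row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Hacker News table markup.
SAMPLE = """
<table>
  <tr class="athing" id="1"><td><span class="titleline">First story</span></td></tr>
  <tr><td><span class="age" title="2025-06-01T10:00:00">2 hours ago</span></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing" id="2"><td><span class="titleline">Second story</span></td></tr>
  <tr><td><span class="age" title="2025-06-01T09:30:00">3 hours ago</span></td></tr>
</table>
"""


def has_class(el, name):
    """True if `name` is one of the element's space-separated classes."""
    return name in el.get("class", "").split()


def next_sibling_age_titles(table):
    """Emulate `+ tr span.age` relative to each `tr.athing` row."""
    rows = list(table)  # direct children, in document order
    titles = []
    for i, row in enumerate(rows):
        if row.tag != "tr" or not has_class(row, "athing"):
            continue
        title = ""  # mirrors the schema's "default": ""
        # `+` matches only the immediately following sibling element.
        if i + 1 < len(rows) and rows[i + 1].tag == "tr":
            for span in rows[i + 1].iter("span"):
                if has_class(span, "age"):
                    title = span.get("title", "")
                    break
        titles.append(title)
    return titles


titles = next_sibling_age_titles(ET.fromstring(SAMPLE))
print(titles)
```

Each base row should yield the title of the span.age found in the row directly after it. This is the value the time field is expected to contain.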
Current Behavior
When a sibling selector such as "+ tr span.age" is applied to a base element like <tr class="athing">, JsonCssExtractionStrategy fails to select any elements. As a result, no data is extracted, and the expected fields (e.g., the timestamps in the time field) are missing or empty in the output.
Is this reproducible?
Yes
Inputs Causing the Bug
- URL(s):
https://news.ycombinator.com/ (Hacker News homepage)
- Settings used:
Crawler: AsyncWebCrawler
Extraction Strategy: JsonCssExtractionStrategy with a schema containing sibling selectors.
- Input data:
Base selector: "tr.athing"
Field selector: "+ tr span.age" (for extracting the title attribute of <span class="age">)
Steps to Reproduce
1. Set up a Crawl4AI crawler with AsyncWebCrawler and JsonCssExtractionStrategy.
2. Define a schema with:
   Base selector: "tr.athing"
   Field:
   {
     "name": "time",
     "selector": "+ tr span.age",
     "type": "attribute",
     "attribute": "title"
   }
3. Run the crawler on https://news.ycombinator.com/.
4. Observe that the extracted data either omits the time field or contains empty values for it, despite the presence of <span class="age"> elements in the sibling <tr>.
Code snippets
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

# Define the schema with a sibling selector
schema = {
    "name": "HackerNewsScraper",
    "baseSelector": "tr.athing",
    "fields": [
        {
            "name": "time",
            "selector": "+ tr span.age",
            "type": "attribute",
            "attribute": "title",
            "default": "",
        }
    ],
}


async def main():
    # Initialize the crawler with the extraction strategy
    strategy = JsonCssExtractionStrategy(schema=schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com/", config=config)
        # Print the extracted content (the 'time' fields come back empty)
        print(result.extracted_content)


asyncio.run(main())
OS
Windows
Python version
3.12.9
Browser
Chrome
Browser version
126.0.6478.127
Error logs & Screenshots (if applicable)
No explicit errors are raised; however, the extracted data for the time field is consistently empty or missing.
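Because nothing is raised, the failure is easy to miss. The sketch below is one way to make it visible, assuming extracted_content is a JSON string encoding a list of records. The helper name count_empty_fields and the sample payload are hypothetical and only mirror the output described in this report.

```python
import json


def count_empty_fields(extracted_content, field="time"):
    """Return (empty_count, total) for `field` in a JSON list of records."""
    records = json.loads(extracted_content)
    # A record counts as empty if the field is missing or falsy ("" or None).
    empty = sum(1 for rec in records if not rec.get(field))
    return empty, len(records)


# Hypothetical payload shaped like the output described in this report.
sample = json.dumps([{"time": ""}, {"time": ""}, {}])
empty, total = count_empty_fields(sample)
print(f"{empty}/{total} records have an empty or missing 'time' field")
```

Running the same check against result.extracted_content from the snippet above would show whether every record is affected or only some.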