[Bug]: JSONCSSSelector Fails to Handle Sibling CSS Selectors #1254

@itsskofficial

Description

crawl4ai version

0.6.4

Expected Behavior

The JSONCSSSelector should correctly process sibling CSS selectors such as "+ tr span.age", which target elements like <span class="age"> inside the immediately following sibling of a base element (e.g., on Hacker News). Specifically, a selector like "+ tr span.age" should match the <span class="age"> element in the next sibling row and extract the requested attribute or text, such as the title attribute that holds the timestamp.
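To make the expected result concrete, here is a minimal sketch that emulates "+ tr span.age" relative to each "tr.athing" row using only the Python standard library. The markup is a simplified, well-formed stand-in for Hacker News rows (an assumption, not the real page), and `extract_times` is a hypothetical helper for illustration, not part of crawl4ai.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for two Hacker News story rows (assumed markup).
HTML = """
<table>
  <tr class="athing" id="1"><td><a href="https://example.com/a">Story A</a></td></tr>
  <tr><td><span class="age" title="2025-06-01T10:00:00">2 hours ago</span></td></tr>
  <tr class="athing" id="2"><td><a href="https://example.com/b">Story B</a></td></tr>
  <tr><td><span class="age" title="2025-06-01T08:30:00">4 hours ago</span></td></tr>
</table>
"""


def extract_times(markup: str) -> list[dict]:
    """Emulate the field selector `+ tr span.age` for each `tr.athing` base row."""
    table = ET.fromstring(markup.strip())
    rows = list(table)  # direct children of <table>, in document order
    items = []
    for index, row in enumerate(rows):
        classes = (row.get("class") or "").split()
        if row.tag != "tr" or "athing" not in classes:
            continue
        # `+ tr` is the immediately following sibling element, if it is a <tr>.
        sibling = rows[index + 1] if index + 1 < len(rows) else None
        time = ""  # mirrors the schema's "default": ""
        if sibling is not None and sibling.tag == "tr":
            span = sibling.find(".//span[@class='age']")
            if span is not None:
                time = span.get("title", "")
        items.append({"time": time})
    return items


print(extract_times(HTML))
```

Each base row yields one item whose time field comes from the row directly after it, which is the behavior this report expects from the sibling selector.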

Current Behavior

When a sibling selector such as "+ tr span.age" is applied to a base element like <tr class="athing">, the JSONCSSSelector fails to select any elements. As a result, no data is extracted, and the expected fields (e.g., the timestamp in the time field) are missing or empty in the output.

Is this reproducible?

Yes

Inputs Causing the Bug

- URL(s): 
https://news.ycombinator.com/ (Hacker News homepage)

- Settings used:
Crawler: AsyncWebCrawler
Extraction Strategy: JsonCssExtractionStrategy with a schema containing sibling selectors.

- Input data:
Base selector: "tr.athing"
Field selector: "+ tr span.age" (for extracting the title attribute of <span class="age">)

Steps to Reproduce

1. Set up a Crawl4AI crawler with AsyncWebCrawler and JsonCssExtractionStrategy.

2. Define a schema with:
Base selector: "tr.athing"
Field:
{
    "name": "time",
    "selector": "+ tr span.age",
    "type": "attribute",
    "attribute": "title"
}

3. Run the crawler on https://news.ycombinator.com/.

4. Observe that the extracted data does not include the time field or contains empty values for it, despite the presence of <span class="age"> elements in the sibling <tr>.
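The check in step 4 can be automated with a small sketch like the one below. It assumes extracted_content is a JSON string encoding a list of objects; `count_empty_times` and the sample payload are hypothetical and only illustrate the shape of the buggy output.

```python
import json


def count_empty_times(extracted_content: str) -> tuple[int, int]:
    """Return (items_with_missing_or_empty_time, total_items)."""
    items = json.loads(extracted_content)
    empty = sum(1 for item in items if not item.get("time"))
    return empty, len(items)


# Hypothetical payload shaped like the reported output: `time` is empty or absent.
sample = json.dumps([{"time": ""}, {"time": ""}, {}])
print(count_empty_times(sample))
```

If the first number equals the second, every extracted item lacks a timestamp, which matches the behavior described in this report.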

Code snippets

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

# Define the schema with a sibling selector
schema = {
    "name": "HackerNewsScraper",
    "baseSelector": "tr.athing",
    "fields": [
        {
            "name": "time",
            "selector": "+ tr span.age",
            "type": "attribute",
            "attribute": "title",
            "default": ""
        }
    ]
}


async def main():
    # Pass the extraction strategy through a CrawlerRunConfig rather than a plain dict
    strategy = JsonCssExtractionStrategy(schema=schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    # Initialize the crawler and run it with the extraction strategy
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com/", config=config)

        # Print the extracted content (expected to have empty 'time' fields)
        print(result.extracted_content)


asyncio.run(main())

OS

Windows

Python version

3.12.9

Browser

Chrome

Browser version

126.0.6478.127

Error logs & Screenshots (if applicable)

No explicit errors are raised; however, the extracted data for the time field is consistently empty or missing.
