[Bug]: Token count discrepancy: Local LLMs process filtered HTML while cloud LLMs process unfiltered HTML for raw HTML input #1499

@shawarr

Description

crawl4ai version

0.7.4

Expected Behavior

Bug Description
When processing raw HTML content (not URLs), there is a significant discrepancy in token usage between local LLMs (Ollama) and cloud LLMs (Groq, DeepSeek). Local LLMs appear to receive the filtered/compressed version of the HTML, while cloud LLMs receive the full unfiltered version.

Expected Behavior
Both local and cloud LLMs should process the same filtered version of HTML content when using input_format="fit_markdown" and content filtering configurations.

Current Behavior

  • Local LLM (Ollama): ~4k tokens for the same HTML content
  • Cloud LLMs (Groq/DeepSeek): 70-80k tokens for the same HTML content
  • Note: This issue only occurs with raw HTML input. URL crawling works consistently across all LLM providers.
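One way to make the discrepancy visible independently of any provider is to log an approximate token count for both the raw HTML and the filtered markdown before extraction runs. The sketch below uses the rough chars/4 heuristic; `estimate_tokens` and `compare_inputs` are illustrative helpers, not part of crawl4ai:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English-ish text.
    return max(1, len(text) // 4)

def compare_inputs(raw_html: str, fit_markdown: str) -> dict:
    # Log both counts before the LLM call so the gap is visible
    # regardless of which provider is configured.
    counts = {
        "raw_html_tokens": estimate_tokens(raw_html),
        "fit_markdown_tokens": estimate_tokens(fit_markdown),
    }
    counts["ratio"] = counts["raw_html_tokens"] / counts["fit_markdown_tokens"]
    return counts
```

If both providers were receiving `fit_markdown`, the observed usage should track `fit_markdown_tokens` (~4k here) in both cases; the cloud providers' 70-80k usage instead tracks the raw HTML count.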

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    CacheMode,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    LLMConfig,
    LLMExtractionStrategy,
    PruningContentFilter,
)

async def run_crawler_for_text(raw_html: str):
    try:
        extraction_strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(
                provider="ollama/llama3.1:8b",
                base_url="http://localhost:11434/",
                # Swap in a cloud provider to reproduce the discrepancy:
                # provider="openai/deepseek-chat",
                # base_url="https://api.deepseek.com/v1",
                # api_token="xyz",
                temperature=0.0,
                top_p=0.0,
            ),
            instruction="""
instruction for llm
""",
            # LLM config options
            extraction_type="schema",
            extra_args={"temperature": 0.0},
            verbose=True,
            input_format="fit_markdown",
            apply_chunking=True,
            chunk_token_threshold=3200,
            force_json_response=True,
            overlap_rate=0.3,
        )
        # Filter config
        pruning_filter = PruningContentFilter(threshold_type="fixed", threshold=0.05)
        run_config = CrawlerRunConfig(
            extraction_strategy=extraction_strategy,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=pruning_filter,
                content_source="cleaned_html",
                options={"ignore_links": True, "ignore_images": True},
            ),
            exclude_external_links=False,
            exclude_all_images=False,
            exclude_social_media_links=True,
            exclude_external_images=False,
            verbose=True,
            cache_mode=CacheMode.BYPASS,
        )
        async with AsyncWebCrawler() as crawler:
            # The "raw:" prefix feeds the HTML string directly instead of fetching a URL.
            result = await crawler.arun(url="raw:" + raw_html, config=run_config)
        return result
    except Exception as exc:
        print(f"Crawl failed: {exc}")
        raise

OS

macOS

Python version

3.13.5

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response
