Closed
Labels
Bug: Something isn't working
Needs Triage: Needs attention of maintainers
Description
crawl4ai version
0.7.4
Expected Behavior
Bug Description
When processing raw HTML content (not URLs), token usage differs sharply between local LLMs (Ollama) and cloud LLMs (Groq, DeepSeek). Local LLMs appear to receive a filtered/compressed version of the HTML, while cloud LLMs appear to receive the full, unfiltered version.
Expected Behavior
Both local and cloud LLMs should process the same filtered version of HTML content when using input_format="fit_markdown" and content filtering configurations.
Current Behavior
Actual Behavior
- Local LLM (Ollama): ~4k tokens for the same HTML content
- Cloud LLMs (Groq/DeepSeek): 70-80k tokens for the same HTML content
- Note: This issue only occurs with raw HTML input. URL crawling works consistently across all LLM providers.
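The gap above is roughly 17 to 20 times (70-80k versus about 4k tokens). A minimal sketch for sanity-checking such a ratio outside crawl4ai, assuming a crude 4-characters-per-token heuristic (an assumption, not any provider's real tokenizer). `estimate_tokens` and `compression_ratio` are hypothetical helper names, not part of crawl4ai:

```python
import math

CHARS_PER_TOKEN = 4  # assumption: crude average, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate using a fixed chars-per-token ratio."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def compression_ratio(unfiltered: str, filtered: str) -> float:
    """Return how many times larger the unfiltered input is than the filtered one."""
    filtered_tokens = estimate_tokens(filtered)
    if filtered_tokens == 0:
        raise ValueError("filtered text is empty")
    return estimate_tokens(unfiltered) / filtered_tokens


if __name__ == "__main__":
    unfiltered = "x" * 300_000  # stand-in for unfiltered content
    filtered = "x" * 16_000     # stand-in for pruned fit_markdown
    print(estimate_tokens(unfiltered), estimate_tokens(filtered))
    print(round(compression_ratio(unfiltered, filtered), 2))
```

A ratio in this range between the two providers would be consistent with one of them receiving unpruned content, though the heuristic is too coarse to prove it.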
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
from crawl4ai import CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


async def run_crawler_for_text(raw_html: str):
    try:
        extraction_strategy = LLMExtractionStrategy(
            llm_config=LLMConfig(
                provider="ollama/llama3.1:8b",
                base_url="http://localhost:11434/",
                # Cloud provider used for comparison:
                # provider="openai/deepseek-chat",
                # base_url="https://api.deepseek.com/v1",
                # api_token="xyz",
                temperature=0.0,
                top_p=0.0,
            ),
            instruction="""
            instruction for llm
            """,
            # LLM config options
            extraction_type="schema",
            extra_args={"temperature": 0.0},
            verbose=True,
            input_format="fit_markdown",
            apply_chunking=True,
            chunk_token_threshold=3200,
            force_json_response=True,
            overlap_rate=0.3,
        )
        # Filter configs
        pruning_filter = PruningContentFilter(threshold_type="fixed", threshold=0.05)
        run_config = CrawlerRunConfig(
            extraction_strategy=extraction_strategy,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=pruning_filter,
                content_source="cleaned_html",
                options={"ignore_links": True, "ignore_images": True},
            ),
            exclude_external_links=False,
            exclude_all_images=False,
            exclude_social_media_links=True,
            exclude_external_images=False,
            verbose=True,
            cache_mode=CacheMode.BYPASS,
        )
        # The report's snippet ends here; the crawler call that uses
        # run_config is not shown.
    except Exception:
        # Placeholder so the snippet parses; the original handler is not shown.
        raise
OS
macOS
Python version
3.13.5
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
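To narrow down which input each provider receives, the markdown variants produced for the raw HTML can be compared directly. The following is a sketch, not a confirmed diagnosis: it assumes the `run_config` built in the snippet above, passes the HTML with crawl4ai's `raw:` prefix, and reads `raw_markdown` and `fit_markdown` from the result. `as_raw_url`, `describe_inputs`, and `inspect_markdown` are hypothetical helper names introduced for illustration.

```python
def as_raw_url(html: str) -> str:
    """Prefix raw HTML so crawl4ai treats it as inline content, not a URL."""
    return "raw:" + html


def describe_inputs(raw_markdown: str, fit_markdown: str) -> dict:
    """Summarise the size of each candidate LLM input for a side-by-side check."""
    return {
        "raw_markdown_chars": len(raw_markdown),
        "fit_markdown_chars": len(fit_markdown),
        "fit_is_smaller": len(fit_markdown) < len(raw_markdown),
    }


async def inspect_markdown(raw_html: str, run_config) -> dict:
    """Crawl raw HTML and report markdown sizes; requires crawl4ai installed."""
    # Imported lazily so the helpers above work without the dependency.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=as_raw_url(raw_html), config=run_config)
    md = result.markdown
    # Either field may be empty, so fall back to an empty string.
    return describe_inputs(md.raw_markdown or "", md.fit_markdown or "")
```

It could be run with `asyncio.run(inspect_markdown(raw_html, run_config))`. If `fit_markdown` comes out far smaller than `raw_markdown` while the cloud provider still reports 70-80k tokens, that would suggest the filtered text is not what reaches that provider.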