Building an agentic scraper using Firecrawl; this project shows the step-by-step implementation.

Darshan174/AI-Stocks-news-scraper-using-Firecrawl


v1.0.0: This initial version establishes a basic automated pipeline to fetch financial news and perform sentiment analysis using a local LLM. It demonstrates the fundamental integration between the Firecrawl scraping API and the Ollama inference engine.

Core Components:

scraper.py: A Python script that calls the Firecrawl /scrape endpoint to convert a Yahoo Finance URL into Markdown format.

analyst.py: A processing script that reads the Markdown data and uses Llama 3.2 (via Ollama) to generate summaries and "Bullish/Bearish" verdicts.
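The scrape step can be sketched roughly as follows. This is not the repo's actual scraper.py; the `/v1/scrape` endpoint version, the response shape, and the helper names are assumptions based on Firecrawl's public API:

```python
import json
import urllib.request

# Assumed Firecrawl endpoint; the API version prefix may differ.
FIRECRAWL_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"

def build_payload(page_url: str) -> dict:
    """Request body asking Firecrawl for Markdown of the main content only."""
    return {"url": page_url, "formats": ["markdown"], "onlyMainContent": True}

def scrape_to_markdown(page_url: str, api_key: str) -> str:
    """POST the target URL to Firecrawl and return the Markdown body."""
    req = urllib.request.Request(
        FIRECRAWL_ENDPOINT,
        data=json.dumps(build_payload(page_url)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    # Assumed response layout: {"data": {"markdown": "..."}}
    return body["data"]["markdown"]
```

The Markdown string would then be written to market_news.md for analyst.py to consume.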

Current Limitations: The "Junk Text" Observation: During development, I observed that even with onlyMainContent: true, the generated Markdown file (market_news.md) contains noise such as "Skip to navigation" links and accessibility tags. This inflates the token count sent to the LLM and can occasionally confuse sentiment extraction.

  • This limitation serves as the primary motivation for Phase 1 (Structured Extraction), where we will implement JSON schemas to filter out everything except the actual news data.
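Until the schema-based extraction lands, a stopgap could filter the known junk lines before the Markdown reaches the LLM. This is a hypothetical cleaner, not code from the repo; the noise patterns are guesses at the navigation text observed above:

```python
import re

# Illustrative patterns for the navigation/accessibility noise; extend as
# new junk shows up in market_news.md.
NOISE_PATTERNS = [
    r"(?i)^skip to (navigation|content)\b.*$",
    r"(?i)^(sign in|notifications)\b.*$",
]

def strip_noise(markdown: str) -> str:
    """Drop lines matching known navigation/accessibility junk."""
    kept = []
    for line in markdown.splitlines():
        if any(re.match(p, line.strip()) for p in NOISE_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)
```

A line-level blocklist like this is brittle, which is exactly why Phase 1 moves to structured extraction instead.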

Token Inefficiency (Image & Link Noise): In market_news.md, notice the large blocks of text describing images, such as "Asian shares were moderately higher...". analyst.py reads this entire file and sends it to Ollama, wasting tokens on photo descriptions and raw URL strings that don't help with sentiment analysis.
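One way to quantify and reduce that waste is to strip Markdown image embeds and bare URLs before the text is sent to Ollama. A minimal sketch, assuming standard Markdown syntax for images and links (the function names are illustrative, not from the repo):

```python
import re

def strip_media_noise(markdown: str) -> str:
    """Remove image embeds and URLs that waste LLM tokens."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", markdown)   # ![alt](url) images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # [text](url) -> text
    text = re.sub(r"https?://\S+", "", text)               # bare URLs
    return text

def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English."""
    return len(text) // 4
```

Comparing rough_tokens() before and after cleaning gives a quick sense of how much of the prompt budget the noise was consuming.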

"Hardcoded" Bottleneck: The target URL is hardcoded in scraper.py, which makes the pipeline linear and manual. If you wanted to check news for "Tesla" or "Crypto" instead of general market news, you would have to open the Python file and change the code.

  • A true agent needs to Search for URLs dynamically based on your request, rather than relying on a hardcoded link.
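The dynamic-search step could look something like this. Firecrawl exposes a search API, but the exact endpoint path, payload, and response shape here are assumptions, and `find_news_urls` is a hypothetical helper:

```python
import json
import urllib.request

# Assumed Firecrawl search endpoint; verify against the current API docs.
SEARCH_ENDPOINT = "https://api.firecrawl.dev/v1/search"

def build_search_query(topic: str, limit: int = 3) -> dict:
    """Turn a user topic such as 'Tesla' into a search payload."""
    return {"query": f"{topic} stock news", "limit": limit}

def find_news_urls(topic: str, api_key: str) -> list:
    """Hypothetical dynamic lookup replacing the hardcoded Yahoo Finance link."""
    req = urllib.request.Request(
        SEARCH_ENDPOINT,
        data=json.dumps(build_search_query(topic)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        results = json.load(resp)
    # Assumed response layout: {"data": [{"url": ...}, ...]}
    return [item["url"] for item in results.get("data", [])]
```

Each returned URL would then feed into the existing scrape-and-analyze steps, turning the linear script into a request-driven loop.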

"Unstructured" Output: Currently, analyst.py simply prints a block of text to the terminal. While the summary is helpful for a human to read, another computer program (like a stock-trading bot) couldn't easily "understand" it.

  • This is why Phase 1 (v2.0.0) introduces JSON schemas: to turn that text into data a machine can use.
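As a sketch of what "machine-usable" could mean here, the LLM would be prompted to reply in JSON and the reply validated before anything downstream consumes it. The field names and verdict values below are illustrative, not the repo's actual Phase 1 schema:

```python
import json

# Illustrative required fields for a structured verdict.
REQUIRED_FIELDS = {"ticker", "verdict", "summary"}
ALLOWED_VERDICTS = {"Bullish", "Bearish", "Neutral"}

def validate_verdict(raw: str) -> dict:
    """Parse the LLM's JSON reply and enforce the expected shape."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"bad verdict: {data['verdict']!r}")
    return data
```

With validation in place, a consumer like a trading bot can branch on data["verdict"] directly instead of parsing free-form prose.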
