Building an agentic scraper using Firecrawl; this project shows the step-by-step implementation.

Darshan174/AI-Stocks-news-scraper-using-Firecrawl


v1.0.0: This initial version establishes a basic automated pipeline to fetch financial news and perform sentiment analysis using a local LLM. It demonstrates the fundamental integration between the Firecrawl scraping API and the Ollama inference engine.

Core Components:

scraper.py: A Python script that calls the Firecrawl /scrape endpoint to convert a Yahoo Finance URL into Markdown format.

analyst.py: A processing script that reads the Markdown data and uses Llama 3.2 (via Ollama) to generate summaries and "Bullish/Bearish" verdicts.
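The scrape step can be sketched roughly as follows. This is not the repo's actual scraper.py; the `/v1/scrape` endpoint version, the response shape, and the helper names are assumptions based on Firecrawl's public API:

```python
import json
import urllib.request

# Assumed Firecrawl endpoint; the API version prefix may differ.
FIRECRAWL_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"

def build_payload(page_url: str) -> dict:
    """Request body asking Firecrawl for Markdown of the main content only."""
    return {"url": page_url, "formats": ["markdown"], "onlyMainContent": True}

def scrape_to_markdown(page_url: str, api_key: str) -> str:
    """POST the target URL to Firecrawl and return the Markdown body."""
    req = urllib.request.Request(
        FIRECRAWL_ENDPOINT,
        data=json.dumps(build_payload(page_url)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    # Assumed response layout: {"data": {"markdown": "..."}}
    return body["data"]["markdown"]
```

The Markdown string would then be written to market_news.md for analyst.py to consume.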

Current Limitations: The "Junk Text" Observation: During development, I observed that even with onlyMainContent: true, the generated Markdown file (market_news.md) contains noise such as "Skip to navigation" links and accessibility tags. This inflates the token count sent to the LLM and can occasionally confuse sentiment extraction.

  • This limitation serves as the primary motivation for Phase 1 (Structured Extraction), where we will implement JSON schemas to filter out everything except the actual news data.
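Until the schema-based extraction lands, a stopgap could filter the known junk lines before the Markdown reaches the LLM. This is a hypothetical cleaner, not code from the repo; the noise patterns are guesses at the navigation text observed above:

```python
import re

# Illustrative patterns for the navigation/accessibility noise; extend as
# new junk shows up in market_news.md.
NOISE_PATTERNS = [
    r"(?i)^skip to (navigation|content)\b.*$",
    r"(?i)^(sign in|notifications)\b.*$",
]

def strip_noise(markdown: str) -> str:
    """Drop lines matching known navigation/accessibility junk."""
    kept = []
    for line in markdown.splitlines():
        if any(re.match(p, line.strip()) for p in NOISE_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)
```

A line-level blocklist like this is brittle, which is exactly why Phase 1 moves to structured extraction instead.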

Token Inefficiency (Image & Link Noise): In market_news.md, notice the large blocks of text describing images, such as "Asian shares were moderately higher...". analyst.py reads this entire file and sends it to Ollama, wasting tokens on photo descriptions and raw URL strings that don't help with sentiment analysis.
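One way to quantify and reduce that waste is to strip Markdown image embeds and bare URLs before the text is sent to Ollama. A minimal sketch, assuming standard Markdown syntax for images and links (the function names are illustrative, not from the repo):

```python
import re

def strip_media_noise(markdown: str) -> str:
    """Remove image embeds and URLs that waste LLM tokens."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", markdown)   # ![alt](url) images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # [text](url) -> text
    text = re.sub(r"https?://\S+", "", text)               # bare URLs
    return text

def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English."""
    return len(text) // 4
```

Comparing rough_tokens() before and after cleaning gives a quick sense of how much of the prompt budget the noise was consuming.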

"Hardcoded" Bottleneck: The target URL is hardcoded in scraper.py, which makes the pipeline linear and manual. If you wanted to check news for "Tesla" or "Crypto" instead of general market news, you would have to open the Python file and change the code.

  • A true agent needs to Search for URLs dynamically based on your request, rather than relying on a hardcoded link.
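The dynamic-search step could look something like this. Firecrawl exposes a search API, but the exact endpoint path, payload, and response shape here are assumptions, and `find_news_urls` is a hypothetical helper:

```python
import json
import urllib.request

# Assumed Firecrawl search endpoint; verify against the current API docs.
SEARCH_ENDPOINT = "https://api.firecrawl.dev/v1/search"

def build_search_query(topic: str, limit: int = 3) -> dict:
    """Turn a user topic such as 'Tesla' into a search payload."""
    return {"query": f"{topic} stock news", "limit": limit}

def find_news_urls(topic: str, api_key: str) -> list:
    """Hypothetical dynamic lookup replacing the hardcoded Yahoo Finance link."""
    req = urllib.request.Request(
        SEARCH_ENDPOINT,
        data=json.dumps(build_search_query(topic)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        results = json.load(resp)
    # Assumed response layout: {"data": [{"url": ...}, ...]}
    return [item["url"] for item in results.get("data", [])]
```

Each returned URL would then feed into the existing scrape-and-analyze steps, turning the linear script into a request-driven loop.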

"Unstructured" Output: Currently, analyst.py simply prints a block of text to the terminal. While the summary is helpful for a human to read, another computer program (like a stock-trading bot) couldn't easily "understand" it.

  • This is why Phase 1 (v2.0.0) introduces JSON schemas: to turn that text into data a machine can use.
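As a sketch of what "machine-usable" could mean here, the LLM would be prompted to reply in JSON and the reply validated before anything downstream consumes it. The field names and verdict values below are illustrative, not the repo's actual Phase 1 schema:

```python
import json

# Illustrative required fields for a structured verdict.
REQUIRED_FIELDS = {"ticker", "verdict", "summary"}
ALLOWED_VERDICTS = {"Bullish", "Bearish", "Neutral"}

def validate_verdict(raw: str) -> dict:
    """Parse the LLM's JSON reply and enforce the expected shape."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"bad verdict: {data['verdict']!r}")
    return data
```

With validation in place, a consumer like a trading bot can branch on data["verdict"] directly instead of parsing free-form prose.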
