This repository contains my submission for the Full Stack Developer Intern assignment at BeyondChats. The goal of this project is to scrape blog articles, store them, update them using LLMs, and display both original and updated versions through a simple frontend.
The work is divided into three phases as mentioned in the assignment.
- Frontend: https://llm-rewrite-article.vercel.app/
- Backend: https://llm-rewrite-article.onrender.com/api/articles
The live frontend displays original and LLM-updated articles side by side for direct comparison.
- Node.js
- Express.js
- MySQL
- React
The backend handles scraping, database operations, and automation logic. The frontend is a small React app used to display articles.
- Scrapes the 5 oldest articles from the BeyondChats blogs section.
- Stores article data in a MySQL database.
- Exposes CRUD APIs to manage articles.
- Scraper → stores articles in the DB.
- APIs (see the route sketch below):
  - GET /articles
  - GET /articles/:id
  - POST /articles
  - PUT /articles/:id
  - DELETE /articles/:id
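For illustration, these endpoints can be wired up in Express roughly as follows. The controller module and function names here are hypothetical placeholders, not the exact ones in this repository.

```js
// routes/articles.routes.js — illustrative sketch; real file and controller names may differ.
const express = require("express");
const router = express.Router();
const controller = require("../controllers/articles.controller"); // hypothetical path

router.get("/articles", controller.getAllArticles);       // list all articles
router.get("/articles/:id", controller.getArticleById);   // fetch a single article
router.post("/articles", controller.createArticle);       // insert a new article
router.put("/articles/:id", controller.updateArticle);    // update an existing article
router.delete("/articles/:id", controller.deleteArticle); // remove an article

module.exports = router;
```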
The scraper first identifies the last page of the blogs section and starts collecting articles from there. If the last page contains fewer than 5 articles, it fetches one additional previous page and combines results to complete the required count.
When combining multiple pages, articles from previous pages are read bottom-up to preserve chronological order and ensure the oldest articles are selected.
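A simplified sketch of this selection logic is shown below. The helper names and the within-page ordering assumptions are illustrative and do not come from the actual scraper code.

```js
// Illustrative pagination logic; the real scraper's selectors and helpers differ.
async function collectOldestArticles(fetchPage, lastPage, count = 5) {
  // The last page of the blog listing holds the oldest posts.
  const collected = [...(await fetchPage(lastPage))];

  // If the last page has fewer than `count` posts, pull one previous page,
  // reading it bottom-up so its oldest posts are taken first.
  if (collected.length < count && lastPage > 1) {
    const previousPage = await fetchPage(lastPage - 1);
    const needed = count - collected.length;
    collected.push(...previousPage.reverse().slice(0, needed));
  }

  return collected.slice(0, count);
}
```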
Phase 2 is implemented as a Node.js script that runs as a background/batch job. It is intentionally not exposed as an API endpoint, since the task is automation-oriented and easier to debug as a standalone script.
The goal of this phase is to update existing articles by learning from similar articles that rank higher on search engines.
- Fetch original articles from the internal Articles API.
- For each article, search Google using the article title.
- From the search results, pick the first two links that point to blog/article pages published by other websites.
- Scrape the main content from these two external articles.
- Send the original article and the reference articles to an LLM with a controlled prompt.
- Store the newly generated article using the existing CRUD APIs.
- Save reference links separately and display them at the bottom of the updated article.
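A condensed sketch of this per-article loop is shown below. The helper functions passed in stand in for the actual service modules in this repository; names and signatures are illustrative only.

```js
// Illustrative outline of the Phase 2 batch loop.
async function rewriteAllArticles(services) {
  const {
    fetchOriginalArticles, // GET /articles (original rows only)
    searchByTitle,         // search API call
    pickBlogLikeLinks,     // heuristic link filter
    scrapeMainContent,     // scrape one external article
    rewriteWithLLM,        // controlled-prompt rewrite
    saveUpdatedArticle,    // POST via the existing CRUD API
  } = services;

  const originals = await fetchOriginalArticles();

  for (const article of originals) {
    try {
      const results = await searchByTitle(article.title);
      const referenceLinks = pickBlogLikeLinks(results, 2); // first two blog/article links
      if (referenceLinks.length < 2) continue;              // skip: not enough references

      const references = await Promise.all(referenceLinks.map(scrapeMainContent));
      const rewritten = await rewriteWithLLM(article, references);
      await saveUpdatedArticle(article, rewritten, referenceLinks);
    } catch (err) {
      // A failure on one article must not stop the whole batch.
      console.error(`Skipping "${article.title}":`, err.message);
    }
  }
}
```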
Instead of scraping Google HTML directly (which is brittle and prone to blocking), a search API is used to fetch search results in a stable and predictable way.
This keeps the focus of the assignment on content processing and automation logic, rather than dealing with anti-bot protections.
Only results that look like blog or article pages are considered; basic heuristics are applied to filter blog/article-style links from the search results.
External article content is lightly truncated to keep LLM input manageable.
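The filtering and truncation described above could look roughly like the sketch below. The blocked hosts, URL patterns, and character limit are illustrative assumptions, not the exact values used in this project.

```js
// Illustrative heuristics; the real patterns and limits may differ.
const BLOCKED_HOSTS = ["youtube.com", "linkedin.com", "facebook.com", "reddit.com"];

function looksLikeArticle(url) {
  // Reject obvious non-article destinations and keep links whose paths
  // resemble blog posts (e.g. contain "blog"/"article" or a long slug).
  const { hostname, pathname } = new URL(url);
  if (BLOCKED_HOSTS.some((host) => hostname.includes(host))) return false;
  return /blog|article|post/i.test(pathname) || pathname.split("-").length > 3;
}

function truncateContent(text, maxChars = 6000) {
  // Keep LLM input manageable by trimming very long reference articles.
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```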
The LLM is used as a controlled rewriting step, not as a content generator from scratch.
The prompt is designed to:
- Preserve the original intent of the article.
- Improve clarity, structure, and depth.
- Align tone and formatting with the reference articles.
- Avoid copying or closely paraphrasing reference content.
The LLM is instructed to return clean, structured markdown without inline citations. Reference links are added separately at the end of the article.
The LLM is treated as a rewriting tool with strict constraints, not as a free-form content generator.
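One way to express these constraints in a prompt is sketched below. The wording is a paraphrase of the intent described above, not the exact prompt used in the script.

```js
// Illustrative prompt builder; the actual prompt wording differs.
function buildRewritePrompt(original, references) {
  return [
    "You are rewriting an existing blog article.",
    "Preserve the original intent and key points.",
    "Improve clarity, structure, and depth.",
    "Match the tone and formatting of the reference articles,",
    "but do not copy or closely paraphrase their content.",
    "Return clean, structured markdown without inline citations.",
    "",
    `ORIGINAL ARTICLE:\n${original.content}`,
    "",
    ...references.map((ref, i) => `REFERENCE ${i + 1}:\n${ref}`),
  ].join("\n");
}
```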
Phase 2 requires calling an LLM API to rewrite existing articles based on reference content. The pipeline was intentionally designed to be provider-agnostic, with all LLM logic isolated inside a single service layer.
The initial implementation used OpenAI-compatible APIs. However, during local development, the OpenAI free-tier account had $0 in available credits, which resulted in consistent `insufficient_quota` errors. Even though the integration and retry logic were correct, requests were rejected before execution.
A similar issue was encountered while testing Google Gemini, where valid API keys were loaded successfully but requests failed due to account-level API restrictions.
These issues were related to API access and billing, not code correctness.
To ensure the Phase 2 pipeline could run end-to-end during local testing, the LLM provider was switched to Groq, which offers a free, reliable, OpenAI-compatible API for open models (e.g. LLaMA 3.3).
Reasons for choosing Groq:
- Free-tier availability without billing setup
- OpenAI-compatible API format (minimal code changes)
- Fast and stable responses for long-form text rewriting
The original `llm.service.js` was replaced with `groq.service.js`, while keeping the same function signature. This allowed the rest of the pipeline to remain unchanged.
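A minimal sketch of what such a service layer can look like is shown below, calling Groq's OpenAI-compatible chat completions endpoint. The endpoint, model name, environment variable, and exported function name are assumptions based on Groq's public API and may not match this repository exactly; it also assumes Node 18+ for the global `fetch`.

```js
// groq.service.js — illustrative sketch; endpoint, model, and names are assumptions.
const GROQ_URL = "https://api.groq.com/openai/v1/chat/completions";

async function rewriteArticle(prompt) {
  const response = await fetch(GROQ_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile", // assumed model name
      messages: [{ role: "user", content: prompt }],
      temperature: 0.4,
    }),
  });

  if (!response.ok) {
    throw new Error(`Groq API error: ${response.status}`);
  }

  const data = await response.json();
  // OpenAI-compatible response shape: first choice's message content.
  return data.choices[0].message.content;
}

module.exports = { rewriteArticle };
```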
- The Phase 2 script successfully:
  - Fetches original articles
  - Searches and scrapes reference articles
  - Calls the Groq LLM to rewrite content
  - Stores updated articles along with reference links
- All external API calls include defensive checks and graceful failure handling.
- If any step fails for a specific article, the pipeline skips it and continues.
This approach ensures correctness, transparency, and a working end-to-end flow without relying on paid API credits.
Updated articles are stored in the same `articles` table:
- Original articles have `is_updated = 0`.
- Updated articles have `is_updated = 1`.
- Reference links are stored as JSON and rendered separately on the frontend.
This avoids unnecessary schema complexity while keeping the relationship clear.
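For illustration, an updated article posted through the existing CRUD API could look roughly like this. The field names are assumptions based on the description above, not the exact schema.

```js
// Illustrative payload; exact field names may differ from the real schema.
async function saveUpdatedArticle(apiBase, original, rewrittenMarkdown, referenceLinks) {
  await fetch(`${apiBase}/articles`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      title: original.title,                             // same title as the original row
      content: rewrittenMarkdown,                        // LLM output in markdown
      is_updated: 1,                                     // original rows keep is_updated = 0
      reference_links: JSON.stringify(referenceLinks),   // stored as JSON, shown on the frontend
    }),
  });
}
```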
- If fewer than two suitable reference articles are found, the article is skipped.
- Failures in one article do not stop the entire script.
- The pipeline prioritizes clarity and determinism over aggressive automation.
The goal of this phase is correctness and explainability, not maximum throughput.
The backend diagram below shows the high-level data flow across all three phases. It highlights how articles are scraped, stored, updated using an LLM, and finally displayed on the frontend.
The focus of the diagram is clarity of data movement rather than low-level implementation details.
The frontend acts as a presentation layer that fetches article data from the backend API. Upon page load, the React application sends a GET request to /api/articles. The backend responds with both original and updated articles.
The frontend groups articles by title and displays the original and rewritten versions side by side, along with reference links. No data mutation occurs on the frontend.
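A simplified sketch of that fetch-and-group step is shown below. The component structure and field names are illustrative, not the exact ones used in the live frontend.

```jsx
// Illustrative React component; the real UI and field names may differ.
import { useEffect, useState } from "react";

function ArticleComparison() {
  const [groups, setGroups] = useState({});

  useEffect(() => {
    fetch("/api/articles")
      .then((res) => res.json())
      .then((articles) => {
        // Group original and updated rows by title for side-by-side display.
        const byTitle = {};
        for (const article of articles) {
          byTitle[article.title] = byTitle[article.title] || {};
          byTitle[article.title][article.is_updated ? "updated" : "original"] = article;
        }
        setGroups(byTitle);
      });
  }, []);

  return (
    <div>
      {Object.entries(groups).map(([title, pair]) => (
        <section key={title}>
          <h2>{title}</h2>
          <article>{pair.original?.content}</article>
          <article>{pair.updated?.content}</article>
        </section>
      ))}
    </div>
  );
}

export default ArticleComparison;
```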
- React-based frontend to display articles.
- Shows both original and updated versions.
- Simple, responsive UI focused on readability rather than heavy styling.
- Install MySQL locally.
- Create a database (for example: `beyondchats`).
- Run the schema file: `mysql -u root -p beyondchats < backend/db/schema.sql`
Database credentials are managed using environment variables.
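A `.env` file might look roughly like this; the variable names below are illustrative guesses, so check the backend code for the exact keys it reads.

```
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=your_password
DB_NAME=beyondchats
PORT=5000
SEARCH_API_KEY=your_search_api_key
GROQ_API_KEY=your_groq_api_key
```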
- Clone the repository.
- Install dependencies: `npm install`
- Create a `.env` file.
- Run the scraper.
- Start the API server.
- Run the rewrite script.
- Start the frontend.
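These steps map roughly to commands like the following. The script and directory names are illustrative, not the exact ones in this repo; check `package.json` for the real scripts.

```
# Illustrative commands only; actual file/script names may differ.
cd backend && npm install                    # install backend dependencies
node scrapers/scrapeArticles.js              # Phase 1: scrape the 5 oldest articles
npm start                                    # start the Express API server
node scripts/rewriteArticles.js              # Phase 2: rewrite articles with the LLM
cd ../frontend && npm install && npm start   # Phase 3: start the React frontend
```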
The scraper intentionally uses simple and defensive selectors to avoid overfitting to the current HTML structure of the blog.
Pagination is handled explicitly to ensure the scraper always targets the oldest content, even when the last page contains fewer articles.
I checked for an RSS/Atom feed as a cleaner way to fetch articles, but chose to scrape the blogs section directly to stay aligned with the assignment requirement of scraping from the last page.
Backend code follows a basic service–controller pattern to keep responsibilities separated and the logic easy to follow.
Original articles are stored immutably. Updated articles are inserted as separate records, enabling clean side-by-side comparison without mutating source data.
The focus of this project is correctness, clarity, and following the assignment requirements without over-engineering.

