LLM Engineering Lab

Overview

A portfolio of production-grade LLM systems built from scratch — spanning RAG at 800k-document scale, open-source and frontier model fine-tuning, multi-agent orchestration, and serverless cloud deployment. Every project is end-to-end: real data, real evaluation, real infrastructure.

Skills demonstrated across this repo:

  • RAG: vector ingestion, embedding search, retrieval-augmented generation at scale (800k documents)
  • Fine-tuning: QLoRA fine-tuning of open-source LLMs (Llama 3.2 3B) on Colab and frontier model fine-tuning via the OpenAI API
  • Agents and orchestration: tool-use agents, autonomous planning loops, multi-agent coordination, real-time deal detection and notification
  • Cloud deployment: Modal serverless deployment, HuggingFace Hub integration, async batch jobs
  • Multimodal: vision and voice capabilities across select projects
  • Evaluation: rigorous benchmarking across a dozen model families on the same held-out test set

Projects escalate in complexity. Earlier ones establish core patterns like structured prompting, retrieval, and tool use. The flagship project (LLM Price Predictor) brings everything together into one end-to-end system: data curation at scale, LLM preprocessing, model training, RAG, ensemble inference, and an autonomous agent that scans for deals, prices them, and notifies the user in real time.

Skills by project

| Project | Skills |
|---|---|
| LLM Price Predictor | Data curation pipeline, LLM batch preprocessing, QLoRA fine-tuning (Llama 3.2 3B), frontier model fine-tuning (OpenAI API), RAG (ChromaDB + sentence-transformers), multi-model ensemble, agent system design, tool-use agents, autonomous planning loops, serverless deployment (Modal), HuggingFace Hub, async job management, structured outputs (Pydantic), LLM evaluation and benchmarking, Weights & Biases, push notifications (Pushover) |
| Expert Knowledge Worker (RAG Chatbot) | RAG (LangChain + ChromaDB), document chunking, vector embeddings, query rewriting, LLM reranking, hierarchical RAG, retrieval evaluation (MRR, nDCG), LLM-as-judge evaluation, structured outputs, Gradio UI |
| Web Summary Tool | Prompt engineering, web scraping, LLM inference (OpenAI + Ollama), multi-backend integration, persona and tone control |
| Company Brochure Generator | Multi-step LLM chaining, web scraping, structured prompting, streaming generation, Gradio UI, multi-language output |
| Tech Tutor | System prompt design, persona control, multi-backend (OpenAI + Ollama), streaming responses, Gradio UI |
| Multi-Agent Conversation | Multi-agent orchestration, shared state management, prompt-as-contract, role prompting, turn-based agent loops |
| Sales Intake Copilot | Conversational AI, structured output generation, prompt engineering, streaming responses, Gradio UI |
| Flight Booking Agentic Tool | Tool use and function calling, JSON tool schemas, stateful backend (SQLite), multi-step agentic loop, TTS (text-to-speech), image generation, Gradio UI |
| Meeting Minute Generator | Audio transcription (Whisper), LLM summarisation, prompt-as-contract, multimodal (audio + text), structured output |
| Synthetic A/B Dataset Generator | Structured data generation, schema-as-contract prompting, LLM-generated metadata, Gradio UI |
| LLM Code Performance Benchmark | LLM benchmarking, code generation, multi-model and multi-provider evaluation (OpenAI, Anthropic, Ollama, OpenRouter), performance measurement |


Projects

LLM Price Predictor

A system that scans the web for deals in real time, prices each one using a multi-model ensemble, and sends a push notification when it finds something worth buying. Under the hood it covers the full ML lifecycle: 820k-item data curation, LLM-powered preprocessing, training and benchmarking across a dozen model architectures, RAG over 800k documents, serverless cloud deployment, and an autonomous agent that ties it all together in a live Gradio dashboard.

Business problem

Product pricing at scale is hard. Prices depend on brand, category, material, and dozens of other factors buried in unstructured text. A model that can estimate price from a product description has direct applications in marketplace pricing tools, procurement automation, and catalogue quality checks.

The real question is not whether an LLM can do this, but which approach gives the best accuracy per cost. That requires a rigorous, apples-to-apples comparison across model families.

What it does

The system is split into seven stages, each with its own module:

  • Data curation (data_curation_orchestration.py + parser.py)

    • loads Amazon product data across 8 categories from the McAuley-Lab dataset using parallel workers
    • filters items to a $0.50-$999.49 price range, removes those under 600 characters, and strips part numbers and boilerplate
    • deduplicates by title and full text, then resamples using quadratic weighting to reduce the dominance of low-priced items
    • produces a full dataset (~820k items) and a lite version (~23k items), both pushed to HuggingFace Hub
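The quadratic reweighting can be sketched in a few lines. A minimal sketch, assuming weight grows with the square of price (the repo's exact formula is not shown here, so that proportionality is an illustrative assumption):

```python
import random

def quadratic_weights(prices, cap=999.49):
    # Assumed weighting: weight ~ price^2, so cheap items (which dominate
    # the raw data) are sampled far less often than their raw counts suggest.
    return [min(p, cap) ** 2 for p in prices]

def resample(items, prices, k, seed=42):
    # Weighted sampling with replacement, seeded for reproducibility.
    rng = random.Random(seed)
    return rng.choices(items, weights=quadratic_weights(prices), k=k)

items = ["cable", "headphones", "laptop"]
prices = [5.0, 50.0, 500.0]
sample = resample(items, prices, k=1000)
# The $500 item now dominates the sample despite equal raw counts.
```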
  • Batch preprocessing (preprocessing_orchestration.py + batch.py)

    • submits product descriptions to Groq's async batch API, which handles job submission, polling, and result retrieval
    • generates structured summaries (Title, Category, Brand, Description, Details) using the Preprocessor class backed by an LLM
    • saves state to disk (batches.pkl) so jobs can be resumed after interruption without data loss
    • pushes summarised items back to HuggingFace Hub
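The resume-after-interruption behaviour follows a simple persist-and-poll shape. A minimal sketch, where `submit_fn` and the job IDs are stand-ins for the real Groq batch API rather than its actual client calls:

```python
import pickle
from pathlib import Path

STATE_FILE = Path("batches.pkl")  # mirrors the batches.pkl mentioned above

def submit_or_resume(submit_fn, n_jobs=3):
    """Submit jobs once and save their IDs; on a later call (e.g. after a
    notebook restart) resume from disk instead of resubmitting."""
    if STATE_FILE.exists():
        return pickle.loads(STATE_FILE.read_bytes())  # resume path
    batch_ids = [submit_fn(i) for i in range(n_jobs)]
    STATE_FILE.write_bytes(pickle.dumps(batch_ids))   # persist before polling
    return batch_ids

if STATE_FILE.exists():
    STATE_FILE.unlink()  # start clean for the demo

ids_first = submit_or_resume(lambda i: f"job-{i}")
ids_resumed = submit_or_resume(lambda i: f"new-{i}")  # resumed, not resubmitted
```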
  • Fine-tuning preparation (prompt_prep_fine_tunning.py)

    • tokenises summaries with the Llama-3.2-3B tokeniser and truncates to a 110-token cap
    • generates prompt-completion pairs in SFT format and pushes them to HuggingFace Hub
  • Modelling and evaluation (src/pricer/modeling/)

    • trains and evaluates multiple model families on the same test split, comparing all on MAE, MSE, and R²
    • open-source fine-tuning uses QLoRA: the base model is loaded in 4-bit NF4 quantisation (~2 GB on a T4 GPU), with LoRA adapters trained on attention layers in lite mode and also MLP layers in full mode
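The quantisation and adapter settings described above, collected as a plain config dict. Key names follow common bitsandbytes/peft conventions; the target-module lists are assumptions based on standard Llama layer names, not read from the repo:

```python
# Assumed module names for illustration; only the rank values and the
# "attention-only in lite mode" split come from the project description.
QLORA_CONFIG = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",          # 4-bit NF4 quantisation (~2 GB on a T4)
    "lora_r": {"lite": 32, "full": 256},   # LoRA rank per mode
    "target_modules": {
        "lite": ["q_proj", "k_proj", "v_proj", "o_proj"],       # attention only
        "full": ["q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj"],          # attention + MLP
    },
}
```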
  • RAG pipeline (src/pricer/RAG/)

    • rag_ingest.py encodes all 800k training products with sentence-transformers/all-MiniLM-L6-v2 and stores them in a ChromaDB vectorstore (build once, ~70 min)
    • rag_pipeline.py retrieves the 5 most similar products for each test item and passes them as context to GPT-5.1, grounding predictions in real comparable products
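At its core the retrieval step is a nearest-neighbour lookup over embeddings. A minimal sketch with tiny hand-made vectors standing in for MiniLM embeddings and a plain list standing in for ChromaDB:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=5):
    # Rank the whole store by similarity and keep the k nearest products.
    scored = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["title"] for item in scored[:k]]

store = [
    {"title": "USB-C cable 2m", "vec": [0.9, 0.1, 0.0]},
    {"title": "Gaming laptop",  "vec": [0.0, 0.2, 0.9]},
    {"title": "USB-C cable 1m", "vec": [0.8, 0.2, 0.1]},
]
neighbours = top_k([1.0, 0.0, 0.0], store, k=2)
```

In the real pipeline the retrieved neighbours (with their known prices) become the context passed to the frontier model.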
  • Ensemble (src/pricer/modeling/ensemble_benchmark.py)

    • combines three predictors: GPT-5.1+RAG (80%), a fine-tuned specialist deployed on Modal (10%), and the DNN (10%)
    • the DNN is the same 10-layer residual network from DNN_benchmark.py; modeling/deep_neural_network.py handles training while agents/deep_neural_network.py is the inference-only version used at runtime
    • the RAG model leads the blend while the specialist and DNN act as anchors that dampen outliers
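The blend itself is a weighted sum over the three predictors, using the 80/10/10 split above:

```python
WEIGHTS = {"rag": 0.80, "specialist": 0.10, "dnn": 0.10}

def ensemble_price(predictions):
    """Weighted blend: the RAG model leads while the specialist and DNN
    act as anchors. predictions maps model name -> dollar estimate."""
    return sum(WEIGHTS[name] * price for name, price in predictions.items())

blended = ensemble_price({"rag": 100.0, "specialist": 120.0, "dnn": 80.0})
# 0.8 * 100 + 0.1 * 120 + 0.1 * 80 = 100.0
```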
  • Agent system (src/pricer/agents/)

    • production-ready wrappers around each model: FrontierAgent (GPT-5.1+RAG), NeuralNetworkAgent (DNN), SpecialistAgent (Modal), EnsembleAgent (combines all three)
    • EnsembleAgent first rewrites the product description with a lightweight local LLM via agents/preprocessor.py (litellm + Ollama by default) before passing the cleaned text to each pricing model — this mirrors the LLM preprocessing done at training time and keeps the input distribution consistent at inference
    • AutonomousPlanningAgent is an LLM-driven orchestrator: GPT-5.1 receives three tools (scan_the_internet_for_bargains, estimate_true_value, notify_user_of_deal) and decides autonomously which deals to evaluate, runs the ensemble on each, picks the best opportunity, and triggers a notification. The orchestration logic lives in the model, not in code
    • DealAgentFramework is the persistent backend used by the Gradio UI: it wraps PlanningAgent, maintains a memory.json of previously surfaced deals across runs, and exposes a get_plot_data() method that fetches embeddings from ChromaDB and reduces them to 3D with t-SNE for visualisation
    • ScannerAgent scrapes RSS deal feeds, filters out previously seen URLs, and uses GPT structured outputs to select the 5 deals with the clearest prices and best descriptions
    • MessagingAgent uses Claude to write the notification message and delivers it via the Pushover API
    • run the full autonomous workflow from the terminal: uv run python llm_price_predictor/src/pricer/agents/run_agentic_workflow.py
    • agents/items.py is a lightweight Item for inference only; the full version with fine-tuning fields lives in data_prep/items.py
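The three tools handed to the orchestrating model can be declared as OpenAI-style function schemas. A sketch under stated assumptions: only the tool names come from the repo, while the descriptions and parameter shapes are hypothetical:

```python
# Hypothetical parameter shapes; only the three tool names are from the repo.
TOOL_SPECS = [
    ("scan_the_internet_for_bargains",
     "Scrape deal feeds and return candidate deals.", {}),
    ("estimate_true_value",
     "Run the pricing ensemble on a product description.",
     {"description": {"type": "string"}}),
    ("notify_user_of_deal",
     "Send a push notification about the best deal found.",
     {"description": {"type": "string"},
      "deal_price": {"type": "number"},
      "estimated_value": {"type": "number"}}),
]

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": name,
            "description": desc,
            "parameters": {
                "type": "object",
                "properties": props,
                "required": list(props),
            },
        },
    }
    for name, desc, props in TOOL_SPECS
]
```

Handing the model this list (rather than hardcoding a call order) is what lets the orchestration logic live in the model, not in code.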

Agent hierarchy diagram

  • Gradio UI (src/pricer/deployment/price_is_right.py)
    • a live dashboard that runs the deal-finding agent on load and refreshes every 5 minutes
    • streams agent logs in real time to the UI using a background thread and queue, with ANSI colours converted to HTML via log_utils.py
    • displays found deals in a dataframe; clicking a row manually re-sends the push notification for that deal
    • renders a 3D scatter plot of the vectorstore embeddings (coloured by product category) using t-SNE dimensionality reduction
    • uses PlanningAgent (deterministic hardcoded pipeline) rather than AutonomousPlanningAgent (LLM-driven loop); the terminal workflow and the UI therefore behave slightly differently in how they orchestrate the scan
    • requires: ChromaDB vectorstore already ingested, deep_neural_network.pth weights present, Modal deployment live, and Pushover credentials set
    • run with: uv run python llm_price_predictor/src/pricer/deployment/price_is_right.py

The Price is Right - Gradio UI

Models benchmarked

LLM Price Predictor — model comparison

Ensemble model result


The final ensemble achieves a mean absolute error of $29.95 and R² of 86.3% on a held-out test set of 10,000 Amazon products.

| Model | Type |
|---|---|
| Constant / Linear / Random Forest / XGBoost | Traditional ML baselines |
| Neural Network (8-layer MLP) | Deep learning |
| Deep Neural Network (10-layer ResNet, log-space) | Deep learning |
| GPT-4.1 Nano (zero-shot) | Frontier LLM, pre-trained |
| GPT-4.1 Nano (fine-tuned) | Frontier LLM, fine-tuned |
| Llama-3.2-3B (base, no fine-tuning) | Open-source LLM, pre-trained |
| Llama-3.2-3B (fine-tuned) | Open-source LLM, fine-tuned |
| GPT-5.1 + RAG | Frontier LLM with retrieval augmentation |
| Ensemble (GPT-5.1+RAG + fine-tuned specialist + DNN) | Multi-model ensemble |

Core data model: Item

Item (src/pricer/data_prep/items.py) is the Pydantic model that carries a product through every stage of the pipeline. Fields are populated progressively as the item moves from raw ingestion through preprocessing, prompt generation, and fine-tuning.

| Field | Type | Populated at | Purpose |
|---|---|---|---|
| title | str | Curation | Product title |
| category | str | Curation | Amazon product category |
| price | float | Curation | Ground truth label |
| full | str (opt) | Curation | Raw concatenated product text (pre-summary) |
| weight | float (opt) | Curation | Sampling weight (quadratic, used for resampling) |
| summary | str (opt) | Preprocessing | LLM-generated structured summary (Title / Category / Brand / Description / Details) |
| prompt | str (opt) | Fine-tuning prep | Full prompt: "What does this cost to the nearest dollar?\n\n{summary}\n\nPrice is $" |
| completion | str (opt) | Fine-tuning prep | Target completion: "{price}.00" (rounded for train/val, exact for test) |
| id | int (opt) | Preprocessing | Stable ID for matching batch API results back to items |

Key methods:

  • make_prompts(tokenizer, max_tokens, do_round): tokenises the summary, truncates to max_tokens if needed, and writes prompt and completion. Use do_round=True for train/val (rounded price) and False for test (exact price).
  • test_prompt() -> str: strips the completion from the prompt, returning only the question half for inference.
  • push_to_hub / from_hub: serialises and deserialises full Item lists to/from HuggingFace Hub across train, validation, and test splits.
  • push_prompts_to_hub: pushes only the {"prompt", "completion"} pairs needed for SFT training.
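A simplified stand-in for the Item lifecycle, showing how make_prompts and test_prompt fit together. This is a sketch only: a dataclass replaces the real Pydantic model, and whitespace splitting replaces the Llama tokenizer:

```python
from dataclasses import dataclass
from typing import Optional

QUESTION = "What does this cost to the nearest dollar?"

@dataclass
class Item:
    title: str
    category: str
    price: float
    summary: Optional[str] = None
    prompt: Optional[str] = None
    completion: Optional[str] = None

    def make_prompts(self, max_tokens=110, do_round=True):
        # Truncate the summary to the token cap (whitespace split stands in
        # for the real tokenizer), then write prompt and completion.
        words = (self.summary or "").split()[:max_tokens]
        self.completion = f"{round(self.price)}.00" if do_round else f"{self.price}"
        self.prompt = f"{QUESTION}\n\n{' '.join(words)}\n\nPrice is ${self.completion}"

    def test_prompt(self) -> str:
        # Strip the completion, keeping only the question half for inference.
        return self.prompt[: -len(self.completion)]

item = Item("USB cable", "Electronics", 7.49,
            summary="Title: USB cable Brand: Acme")
item.make_prompts(do_round=True)   # train/val: rounded price
```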

Design notes

  • Stage-based orchestration: each pipeline stage is an independent module that can be re-run or swapped without touching the others. This matters when iterating on a single layer, such as testing a different prompt format for fine-tuning.

  • LLM preprocessing: raw Amazon descriptions are noisy, full of boilerplate and HTML artefacts. Running an LLM batch step to produce clean structured summaries before any model sees the data is what makes fine-tuning effective and the benchmark comparison fair. All models consume the same cleaned input.

  • Weighted resampling: the raw dataset skews heavily toward low-priced items. Quadratic resampling at curation time makes the price distribution more uniform, which prevents models from gaming MAE by always predicting a low price.

  • Shared evaluation: all models go through the same Tester class (src/pricer/agents/evaluator.py), which handles numeric extraction from raw string outputs. This keeps the comparison fair across traditional models, deep learning, and generative LLMs alike.

  • Resumable async jobs: Groq batch processing and OpenAI fine-tuning can run for up to 24 hours. Both use a persist-and-poll pattern, saving state to disk so jobs survive notebook restarts and network interruptions.

Pipeline configuration (current defaults)

  • Preprocessing model: Groq batch API (LLM summarisation)
  • Vectoriser: HashingVectorizer (5,000 binary features) for NN / DNN models
  • Fine-tuning base model: meta-llama/Llama-3.2-3B (QLoRA: 4-bit NF4, LoRA-R 32 lite / 256 full, attention layers only in lite mode)
  • Frontier fine-tuning model: gpt-4.1-nano-2025-04-14
  • Token cutoff (prompts): 110 tokens
  • Dataset splits: 800k / 10k / 10k (full), 20k / 1k / 1k (lite)
  • Evaluation sample size: 200 test items per model run

Run locally

Modelling and evaluation scripts run locally from the project root. Data curation, batch preprocessing, and fine-tuning require credentials for HuggingFace, Groq, and OpenAI. The ensemble and agentic workflow additionally require MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, and PUSHOVER_USER / PUSHOVER_TOKEN.

  1. Set your environment variables in a .env file: HF_TOKEN, OPENAI_API_KEY, GROQ_API_KEY
  2. Run any benchmark:
    • Neural network: uv run python llm_price_predictor/src/pricer/modeling/NN_benchmark.py
    • Deep neural network: uv run python llm_price_predictor/src/pricer/modeling/DNN_benchmark.py
    • Frontier LLM (zero-shot): uv run python llm_price_predictor/src/pricer/modeling/LLM_pretuned_benchmark.py
    • Llama base model (local, Apple Silicon): uv run python llm_price_predictor/src/pricer/modeling/basemodel_llama_eval_benchmark_local.py
    • Llama fine-tuned model (local, Apple Silicon): uv run python llm_price_predictor/src/pricer/modeling/llama_finetunning_eval_local.py
    • GPT-5.1 + RAG (requires vectorstore built first): uv run python llm_price_predictor/src/pricer/modeling/openai_gpt5_1_rag_benchmark.py
    • Ensemble (requires vectorstore + DNN weights + Modal deployment): uv run python llm_price_predictor/src/pricer/modeling/ensemble_benchmark.py
    • Autonomous deal-finding agent (requires all of the above + Pushover credentials): uv run python llm_price_predictor/src/pricer/agents/run_agentic_workflow.py
    • Gradio UI (requires vectorstore + DNN weights + Modal deployment + Pushover credentials): uv run python llm_price_predictor/src/pricer/deployment/price_is_right.py
  3. Llama benchmarks and fine-tuning that require a CUDA GPU run in Google Colab (Runtime → T4 GPU). These scripts load from the items_prompts_full / items_prompts_lite HuggingFace datasets (prompt-formatted, distinct from items_full / items_lite used by other models):
    • Llama base-model evaluation: llm_price_predictor/src/pricer/modeling/basemodel_llama_eval_benchmark_colab.py
    • Llama QLoRA fine-tuning: llm_price_predictor/src/pricer/modeling/llama_finetunning_training_colab.py
    • Llama fine-tuned model evaluation: llm_price_predictor/src/pricer/modeling/llama_finetunning_eval_colab.py
      • Requires HF_TOKEN and WANDB_API_KEY in Colab Secrets (Tools → Secrets)
      • Logs training metrics to Weights & Biases; optionally pushes checkpoints to HuggingFace Hub
      • Uses 4-bit NF4 quantisation by default; LITE_MODE=True runs a single epoch on the lite dataset for quick iteration

Insurellm Expert Assistant

A lightweight Retrieval-Augmented Generation (RAG) assistant for answering questions about a company knowledge base (Insurellm). It combines a document-ingestion pipeline, a vector database, a Gradio chat UI that shows both the assistant response and the retrieved source context side-by-side for transparency, and an evaluation dashboard that measures retrieval and answer quality against a labelled test set.

Business problem

Internal knowledge is often spread across Markdown docs, notes, and operational writeups. Team members need quick answers, but manually searching across files is slow and inconsistent. A plain chatbot is also risky because it can answer without grounding in the actual company documentation.

This project addresses that by retrieving relevant knowledge-base chunks first, then answering with the LLM using that retrieved context.

What it does

The project is split into two core workflows:

  • Ingestion (src/implementation/ingest.py)

    • loads Markdown files from a knowledge-base/ directory (grouped by subfolders)
    • tags each document with metadata (including doc_type)
    • chunks documents using a recursive text splitter
    • creates embeddings and stores them in a persistent Chroma vector database (vector_db/)
    • rebuilds the collection when re-ingesting
  • Question answering (src/implementation/answer.py + app.py)

    • retrieves relevant chunks from the vector DB for a user question
    • combines prior user messages with the current question to improve retrieval context
    • injects retrieved context into a system prompt
    • generates a grounded answer with a chat model
    • returns both:
      • the assistant answer
      • the retrieved source chunks (displayed in the UI)
  • Gradio UI (app.py)

    • chat interface for user questions
    • side panel showing retrieved context and source metadata for inspection
    • simple conversational loop with message history

Notes on the design (why it's structured this way)

This is a compact but practical RAG pattern for internal assistants:

  • Separate ingestion from answering

    • document parsing / chunking / embedding happens once (or on refresh)
    • chat-time requests only perform retrieval + generation, which keeps the UI responsive
  • Grounded answers with visible evidence

    • the assistant answer is shown alongside the retrieved context
    • this makes debugging and trust assessment easier than a "black-box" chat response
  • Conversation-aware retrieval

    • retrieval uses the current question plus prior user turns, which helps when users ask follow-up questions with ellipsis (e.g., "what about pricing?")
  • Prompt-as-contract

    • the system prompt explicitly frames the assistant as an Insurellm representative and instructs it to use context when relevant and admit uncertainty when needed

Retrieval / ingestion configuration (current defaults)

  • Chat model: gpt-4.1-nano
  • Embedding model: text-embedding-3-large
  • Vector store: Chroma (persistent local directory)
  • Retrieval depth: k = 10
  • Chunking: chunk_size=500, chunk_overlap=200
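The chunking defaults above correspond to a sliding character window. A simplified sketch; the real recursive splitter also respects paragraph and sentence boundaries, which this version does not:

```python
def chunk_text(text, chunk_size=500, chunk_overlap=200):
    """Slide a chunk_size window over the text, stepping by
    chunk_size - chunk_overlap so adjacent chunks share context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# 1,100 characters -> 3 chunks of 500 chars, each overlapping the last by 200.
text = "".join(str(i % 10) for i in range(1100))
chunks = chunk_text(text)
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.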

Interface

Core answering function:

answer_question(question: str, history: list[dict] = []) -> tuple[str, list[Document]]

  • question: latest user question
  • history: prior conversation turns in message-dict format ([{role, content}, ...])
  • returns:
    • answer (LLM response string)
    • docs (retrieved LangChain Document objects)

UI chat callback (Gradio):

chat(history) -> (history, formatted_context_html)

  • appends the assistant response to the chat history
  • formats retrieved documents into a "Relevant Context" panel with sources

Expected project structure (conceptual)

  • app.py — Gradio UI entry point
  • src/implementation/answer.py — retrieval + answer generation
  • src/implementation/ingest.py — ingestion and vector DB build
  • knowledge-base/ — Markdown source documents (subfolders by category)
  • knowledge-base/summaries/ — LLM-generated category summaries for hierarchical RAG (generated at ingest time)
  • vector_db/ — persisted Chroma database (generated)

Evaluation system

A standalone evaluation suite (evaluator.py + src/evaluation/) measures both retrieval and answer quality against a labelled test set (tests.jsonl).

Retrieval evaluation — for each test question, the system retrieves the top-k chunks and computes:

  • MRR (Mean Reciprocal Rank): rank position of the first chunk containing each expected keyword
  • nDCG (Normalized Discounted Cumulative Gain): position-weighted keyword coverage across the result list
  • Keyword coverage: percentage of expected keywords found anywhere in the retrieved results
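The two ranking metrics reduce to short functions. A sketch under stated assumptions: the gain and normalisation choices here are standard textbook forms, not necessarily the repo's exact implementation:

```python
import math

def mrr(ranked_chunks, keyword):
    """Reciprocal rank of the first chunk containing the keyword (0 if absent)."""
    for rank, chunk in enumerate(ranked_chunks, start=1):
        if keyword.lower() in chunk.lower():
            return 1 / rank
    return 0.0

def ndcg(ranked_chunks, keywords):
    """Position-weighted keyword coverage: each keyword's first hit at rank i
    contributes 1/log2(i+1), normalised by an ideal list where the keywords
    appear at the top ranks."""
    dcg = 0.0
    for kw in keywords:
        for rank, chunk in enumerate(ranked_chunks, start=1):
            if kw.lower() in chunk.lower():
                dcg += 1 / math.log2(rank + 1)
                break
    ideal = sum(1 / math.log2(i + 1) for i in range(1, len(keywords) + 1))
    return dcg / ideal if ideal else 0.0

chunks = ["Our pricing page lists three tiers.",
          "Coverage details are in the policy doc."]
first_rank = mrr(chunks, "coverage")              # found in the 2nd chunk
score = ndcg(chunks, ["pricing", "coverage"])     # both keywords, ideal order
```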

Answer evaluation — the generated answer is compared against a reference answer by an LLM judge (gpt-4.1-nano) using structured outputs, scoring three dimensions on a 1–5 scale:

  • Accuracy: factual correctness relative to the reference answer
  • Completeness: coverage of all key information from the reference answer
  • Relevance: how directly the answer addresses the question without unnecessary additions

Results are displayed in a Gradio dashboard (evaluator.py) with colour-coded metrics (green / amber / red) and a per-category bar chart. A CLI mode (eval.py <test_row_number>) is also available for inspecting individual test cases.

RAG Evaluation Dashboard

Run locally

  1. Create and activate your environment
  2. Install dependencies (LangChain, Chroma, Gradio, OpenAI, dotenv, etc.)
  3. Set environment variables (at minimum OPENAI_API_KEY)
  4. Add your company Markdown files under knowledge-base/
  5. Build the vector store:
    • uv run python src/implementation/ingest.py
  6. Launch the chat UI:
    • uv run python app.py
  7. Launch the evaluation dashboard (optional):
    • uv run python evaluator.py

Demo app (Gradio)

A lightweight Gradio interface is included to demonstrate the full RAG loop (chat → retrieve → answer + visible context). It is intended as a local demo, but the architecture maps cleanly to internal support / knowledge-assistant use cases.

Optimised RAG pipeline (optimised_ingest.py + optimised_answer.py)

A drop-in replacement for the LangChain baseline that removes framework abstractions and improves both ingestion quality and retrieval precision. No LangChain dependency — uses the OpenAI SDK, ChromaDB, and LiteLLM directly.

Ingestion (optimised_ingest.py)

Instead of a fixed-size recursive character splitter, an LLM reads each document and decides how to chunk it. Each chunk is returned as a structured object with three fields:

  • headline: a short label optimised to match likely query phrasing
  • summary: a few sentences synthesising what the chunk answers
  • original_text: the verbatim source passage

All three fields are concatenated and embedded together, so each vector encodes both the dense original content and the LLM-generated surface forms most likely to be retrieved. Documents are processed in parallel using multiprocessing.Pool for throughput.
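The embedding input is simply the three fields joined together. A minimal sketch (the join format is an assumption; the field names come from the description above):

```python
def embedding_text(chunk):
    """Concatenate headline, summary, and original_text so one vector
    encodes the verbatim passage plus the LLM-generated surface forms."""
    return "\n\n".join(
        [chunk["headline"], chunk["summary"], chunk["original_text"]]
    )

text = embedding_text({
    "headline": "Insurellm pricing tiers",
    "summary": "Explains the subscription tiers and what each includes.",
    "original_text": "Insurellm offers Bronze, Silver and Gold plans...",
})
```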

The ingestion pipeline also implements hierarchical RAG through a summary generation step that runs before chunking. For each subfolder in knowledge-base/, an LLM reads all documents in that category and produces a single aggregated summary_{category}.md file saved to knowledge-base/summaries/. These summaries are designed to answer holistic questions — totals, counts, averages, and rankings — that are hard to answer from individual fine-grained chunks alone. Summary files are stored as single, unsplit documents in the same ChromaDB collection as regular chunks, so they surface automatically via semantic search when a query requires cross-document aggregation. No changes to the retrieval pipeline are needed: the reranker naturally promotes summaries for holistic queries and demotes them for specific fact lookups.

Retrieval and answering (optimised_answer.py)

The baseline does a single vector lookup on the raw user question. The optimised pipeline adds three stages before the final answer:

  1. Query rewriting — the user's question is rewritten into a tighter KB query, stripping conversational noise and sharpening retrieval intent
  2. Chunk merging — results from the original and rewritten query (RETRIEVAL_K = 20 each) are deduplicated into a single pool
  3. LLM reranking — a dedicated reranker call receives the merged pool and returns a RankOrder structured output, re-ordering chunks by relevance before the top FINAL_K = 10 are passed to the answer model

Retry logic via tenacity wraps all LLM calls to handle rate limits gracefully.
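Stage 2's merge is order-preserving deduplication across the two retrieval pools. A sketch; chunks are keyed by their text here, while the repo's exact dedup key is an assumption:

```python
def merge_pools(original_hits, rewritten_hits):
    """Combine results from the original and rewritten queries,
    dropping duplicates while preserving first-seen order."""
    seen, merged = set(), []
    for chunk in original_hits + rewritten_hits:
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    return merged

pool = merge_pools(["a", "b", "c"], ["b", "d"])  # -> single pool for the reranker
```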

To switch to the optimised pipeline, app.py only requires a one-line import change — everything else (Gradio UI, chat loop, context display) works identically:

# LangChain baseline
from src.implementation.answer import answer_question

# Optimised pipeline
from src.implementation.optimised_answer import answer_question

The same applies to src/evaluation/eval.py: only the module name changes, the imported functions stay the same:

# LangChain baseline
from src.implementation.answer import answer_question, fetch_context

# Optimised pipeline
from src.implementation.optimised_answer import answer_question, fetch_context

Web summary

A small, deployable Python utility that turns a webpage URL into a concise Markdown summary using an LLM. It’s designed to be embedded into internal workflows where people need quick, repeatable briefs from unstructured web content.

Business problem

Stakeholders often need to extract decision-relevant information from long webpages (announcements, reports, research posts). Manual summarisation is slow, inconsistent, and doesn’t scale across many sources.

What it does

Given a URL, web_summary_tool(...):

  • fetches and extracts the readable text from the webpage
  • applies a safety cap (max_chars) to prevent oversized prompts
  • generates a short summary in Markdown via a chat model
  • can run via either a hosted API (OpenAI) or a local open-source model (Ollama)
  • either returns the summary text or renders it (controlled by show)
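The max_chars safety cap is a plain truncation applied before the extracted text reaches the model; a one-line sketch:

```python
def cap_page_text(text, max_chars=25000):
    # Crude guardrail from the interface above: never send the model
    # more than max_chars of extracted page text.
    return text[:max_chars]

capped = cap_page_text("word " * 10000)  # 50,000 chars in, 25,000 out
```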

Tone / “personality” control

The tool accepts a chat_personality parameter to adapt tone and framing to the audience. This is useful when the same underlying content needs to be summarised differently depending on context (e.g., a terse executive brief vs. a more detailed analyst-style summary). The output remains Markdown so it can drop cleanly into docs, notes, or downstream pipelines.

Interface

web_summary_tool(url, chat_personality="…", openai_model="…", ollama_model="…", max_chars=25000, show=True, run_open_ai=True, run_ollama=True)

  • url: webpage to summarise
  • openai_model: hosted chat model used when run_open_ai=True
  • ollama_model: local model used when run_ollama=True
  • max_chars: crude guardrail to limit prompt size
  • show: if True, displays Markdown; if False, returns results as strings
  • run_open_ai: enable OpenAI backend (requires OPENAI_API_KEY)
  • run_ollama: enable Ollama backend (requires Ollama installed and running locally)

Brochure generator

A reusable Python utility that turns a company website into a short, readable Markdown brochure using an LLM. It’s designed for fast prospecting: generating a consistent “who they are / what they do / why they matter” brief for customers, investors, or recruits — with Markdown output that drops cleanly into docs, notes, CRMs, or downstream workflows.

Business problem

When evaluating companies (for sales, investing, partnerships, or job applications), the relevant information is spread across multiple pages (About, Products, Careers, Customers). Manually collecting and synthesising this is slow, noisy, and inconsistent — especially when you need to do it repeatedly across many companies.

What it does

Given a company name and homepage URL, brochure_generator(...):

  • collects candidate links from the homepage
  • uses a chat model to select a small set of brochure-relevant pages (e.g., About, Products, Careers)
  • fetches the text content for the homepage + selected pages
  • generates a short brochure in Markdown (no code blocks) covering:
    • what the company does and who it serves
    • products / services and key differentiators (if present)
    • culture and hiring signals (if present)
  • optionally translates the brochure into a target language (preserving Markdown structure)
  • returns the final Markdown string (optionally streaming it during generation in interactive environments)

Notes on the design (why it’s structured this way)

This project is a minimal “agentic” workflow: instead of a single giant prompt, it chains multiple LLM calls with a clear intermediate artefact.

  • Step 1: page selection (planning / routing)
    The model first decides which pages are worth reading for a brochure (About, Products, Careers, Customers). This reduces noise versus scraping everything.

  • Step 2: content synthesis (generation)
    A second call writes the brochure using the retrieved page text as evidence, producing a consistent output format in Markdown.

This two-stage pattern (select → generate) generalises well beyond brochures, for example:

  • marketing copy generation from a website + product pages
  • investor-style briefs from public company pages
  • recruitment briefs from About + Careers pages
  • tutorials / internal docs generated from specs + docs pages
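The select → generate chain above can be sketched with a stubbed model call. Here call_llm, page_text_for, and the prompt wording are illustrative stand-ins, not the project's actual API:

```python
def two_stage_brochure(links, page_text_for, call_llm):
    """Minimal select -> generate chain with a clear intermediate artefact
    (the list of relevant links) between the two LLM calls."""
    # Stage 1: planning/routing — ask the model which pages are worth reading.
    relevant = call_llm(f"Pick brochure-relevant pages from: {links}")
    # Stage 2: synthesis — write the brochure from the fetched page text.
    evidence = "\n\n".join(page_text_for(link) for link in relevant)
    return call_llm(f"Write a short Markdown brochure using:\n{evidence}")

brochure = two_stage_brochure(
    ["/about", "/blog/cat-pics", "/careers"],
    page_text_for=lambda link: f"text of {link}",
    # Stubbed model: routes on the prompt to mimic the two call types.
    call_llm=lambda prompt: (["/about", "/careers"]
                             if prompt.startswith("Pick")
                             else "# Acme\nWhat they do..."),
)
```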

Interface

brochure_generator(company_name, url, model="gpt-4.1-mini", max_pages=6, translate=False, language="Spanish")

  • company_name: label used to frame the brochure narrative
  • url: company homepage to crawl
  • model: chat model used for both link selection and brochure generation
  • max_pages: maximum number of “relevant” pages to fetch in addition to the landing page
  • translate: if True, returns the brochure in the requested language
  • language: target language for translation (e.g., "Spanish", "French")

Demo app (Gradio)

A lightweight Gradio UI is included to demonstrate how the utility can be embedded in an interactive tool (local demo; not production hosted). It calls brochure_generator(...) under the hood and renders the brochure as Markdown.

  • entry point: ./company_sales_brochure_generator/src/app.py
  • run locally:
    • ensure OPENAI_API_KEY is set (via .env or environment)
    • start the app: uv run python company_sales_brochure_generator/src/app.py

Tech AI tutor

A small, reusable Python utility that answers questions about data work (data engineering, data science, machine learning, and general software concepts) and explains code in clear Markdown using an LLM. It’s designed for fast learning loops: ask a question, paste a snippet, get a memorable explanation you can drop into notes, docs, or study material.

Business problem

People working in data roles constantly encounter unfamiliar concepts, jargon, and code patterns (model behaviour, pipeline logic, SQL idioms, ML tooling). Searching the web often yields fragmented answers, and generic AI responses can be either overly technical or overly “tutorial-ish”. What’s missing is a consistent, high-signal tutor that can explain precisely and memorably on demand.

What it does

Given a question (and optionally a code snippet), tech_tutor(...):

  • produces a concise, high-signal explanation aimed at a competent coder new to the specific topic
  • uses a single movie-based analogy thread (configured via favourite_movie) to make the concept stick without overshooting into fan-fiction
  • supports both concept explanations and “what does this code do?” walkthroughs (plus practical gotchas)
  • returns Markdown suitable for pasting into notes / docs, and can optionally render it when running interactively
  • can run via either a hosted API (OpenAI) or a local open-source model (Ollama)

Tone / “storytelling” control

The tutor is deliberately designed to be more memorable than a standard technical answer. The analogy is not a decorative add-on: it’s used as the backbone of the explanation, with short technical “translations” to keep the answer rigorous. This makes it useful for learning, interview prep, and quickly internalising new patterns.

Interface

tech_tutor(question, code=None, favourite_movie="…", openai_model="…", ollama_model="…", temperature=0.7, show=True, run_open_ai=True, run_ollama=True, ollama_base_url="http://localhost:11434/v1")

  • question: the concept or code question to answer
  • code: optional code snippet to explain
  • favourite_movie: the story universe used for the analogy thread
  • openai_model: hosted chat model used when run_open_ai=True
  • ollama_model: local model used when run_ollama=True
  • temperature: creativity level (higher = more playful analogies)
  • show: if True, renders Markdown in interactive environments; otherwise returns strings
  • run_open_ai: enable OpenAI backend (requires OPENAI_API_KEY)
  • run_ollama: enable Ollama backend (requires Ollama installed and running locally)
  • ollama_base_url: OpenAI-compatible local endpoint for Ollama
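A rough sketch of how the tutor might assemble its request, assuming the system prompt carries the analogy contract and both backends speak the OpenAI chat format (Ollama via its OpenAI-compatible endpoint, e.g. `OpenAI(base_url=ollama_base_url, api_key="ollama")`). `build_messages` is an illustrative helper, not the repo's actual internals:

```python
def build_messages(question, code=None, favourite_movie="The Matrix"):
    """Assemble the chat messages for either backend."""
    system = (
        "You are a technical tutor for a competent coder new to this topic. "
        f"Explain using one sustained analogy from {favourite_movie}, with "
        "short technical translations to keep the answer rigorous. "
        "Answer in Markdown."
    )
    # Append the optional code snippet to the question in a fenced block
    user = question if code is None else f"{question}\n\n```\n{code}\n```"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```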

Demo app (Gradio)

A lightweight Gradio UI is included to demonstrate the tutor in an interactive setting (local demo; not production hosted). It supports streaming responses, switching between OpenAI and Ollama backends, and optionally pasting code alongside the question.

  • entry point: ./tech_tutor/src/app.py
  • run locally:
    • ensure OPENAI_API_KEY is set (via .env or environment)
    • start the app: uv run python -m tech_tutor.src.app

Multi-Agent Conversation

A small Python project that orchestrates a turn-based, three-agent “review panel” conversation. Each agent plays a business-relevant role — a sceptical Staff Data Scientist, a pragmatic Product Manager, and a Tech Lead who synthesises the debate into a shippable plan. It’s designed as a learning lab for multi-agent prompting, shared state management, and prompt-as-contract discipline.

Business problem

Multi-agent workflows often fail in subtle ways: stale context, role drift, duplicated state updates, and inconsistent turn-taking. These failures are easy to miss in demos but break reliability in real use cases such as decision reviews, red/blue teaming, and structured critique → synthesis pipelines.

What it does

Given a topic, agentic_conversation(...):

  • initializes a shared conversation transcript (the single source of truth)
  • runs a turn-based loop where agents respond in sequence using role-specific system prompts
  • appends each response back into the shared state so subsequent turns condition on the evolving dialogue
  • produces a transcript that can be inspected, logged, or adapted into downstream workflows (e.g., “debate → decision memo”)

Notes on the design (why it’s structured this way)

This project is intentionally small, but it surfaces core multi-agent engineering pitfalls:

  • State is the source of truth: each turn must be generated from the latest transcript, not a frozen prompt string.
  • Prompt contracts: each agent is constrained to a stable role, tone, and response length to reduce drift.
  • Turn-taking discipline: one agent speaks at a time, and state updates happen exactly once per turn to avoid duplication.
  • Synthesis as a deliverable: the Tech Lead role is explicitly responsible for converging toward actionable next steps.

Interface

agentic_conversation(topic: str, conversation_length: int = 5)

  • topic: discussion topic to evaluate in a business context
  • conversation_length: number of full rounds (Alex → Blake → Charlie) to run
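The loop described above can be sketched as follows, with `respond` standing in for a role-prompted LLM call (the real project presumably wires in per-agent system prompts; this stub only shows the state discipline):

```python
def agentic_conversation(topic, conversation_length=5, respond=None):
    agents = ["Alex", "Blake", "Charlie"]  # DS, PM, Tech Lead roles
    transcript = [f"Topic: {topic}"]       # shared state: single source of truth
    for _ in range(conversation_length):
        for name in agents:
            # Each turn conditions on the latest transcript, not a frozen prompt
            reply = respond(name, list(transcript))
            # Exactly one state update per turn avoids duplication
            transcript.append(f"{name}: {reply}")
    return transcript
```

The key invariants from the design notes are visible here: one speaker per turn, one append per turn, and every turn reads the evolving transcript.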

Code

Entry point script: ./agentic_conversation/src/multi-agent-chat.py


Sales Intake Copilot

A lightweight B2B “sales intake” chatbot that qualifies a lead in a few turns and produces an internal handoff note for a human sales rep. It’s designed to demonstrate a business-realistic pattern: conversational intake on the front-end, structured operational artefacts on the back-end.

Business problem

In many B2B workflows, inbound leads arrive with incomplete context. Sales teams waste time in back-and-forth messages to extract basic qualification details (use case, timing, size, decision ownership), and handoffs between marketing → SDR → AE are often inconsistent or missing key information.

What it does

Given a user message, the chatbot:

  • responds naturally to the user and asks a small number of targeted qualifying questions
  • captures key lead attributes (use case, industry, company size, timeline, budget, authority)
  • produces an internal “handoff note” in a consistent template so a human rep can take over quickly
  • avoids inventing details

Interface

sales_assistant_stream(message, history)

  • message: the latest user message
  • history: prior turns in Gradio “messages” format ([{role, content}, ...])
  • model: the chat model used to generate the reply + handoff note is set internally (currently gpt-4.1-mini), not passed as a parameter
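A plausible sketch of how the stream handler folds Gradio "messages" history plus the latest user message into an OpenAI-style request; the system prompt text and `build_request` helper are illustrative, not the repo's actual contract:

```python
SYSTEM_PROMPT = (
    "You are a B2B sales intake assistant. Ask a small number of targeted "
    "qualifying questions (use case, industry, company size, timeline, "
    "budget, authority). Never invent details. When qualification is "
    "complete, append an internal handoff note in the fixed template."
)

def build_request(message, history):
    """Prepend the system contract, then replay history, then the new turn."""
    return ([{"role": "system", "content": SYSTEM_PROMPT}]
            + history
            + [{"role": "user", "content": message}])
```

Because Gradio's "messages" format already matches the OpenAI chat schema, history can be passed through unchanged.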

Demo app (Gradio)

A lightweight Gradio UI is included to demonstrate the intake flow in an interactive setting (local demo; not production hosted). It calls sales_assistant_stream(...) under the hood.

  • entry point: ./sales_chatbot_assistant/src/app.py
  • run locally:
    • ensure OPENAI_API_KEY is set (via .env)
    • start the app: uv run python -m sales_chatbot_assistant.src.app

Flight Booking Agentic Tool

A small Gradio app that demonstrates tool-calling with a real stateful backend: the assistant can quote return ticket prices from SQLite and create mock bookings with booking IDs and departure times. It’s designed as a minimal “agentic” pattern: structured tool schemas + a tool router + a multi-step loop that keeps the model and tool outputs in sync.

Business problem

In many customer support or sales workflows, users ask simple, repeatable questions (“what’s the price to Tokyo?”) and then want to take an action (“book it”) without a human operator. Pure chat responses are not enough: you need deterministic retrieval and a reliable way to write state (even if mocked) while keeping the conversational experience intact.

What it does

Given a chat history, the agent:

  • calls get_ticket_price to retrieve prices from a SQLite prices table
  • asks for confirmation before booking, then calls book_ticket to insert a new row into a bookings table (autoincrement booking IDs)
  • returns a one-sentence reply to the user, plus:
    • an autoplay TTS audio version of the reply
    • an optional destination image generated from the first city referenced in tool calls

Notes on the design (why it’s structured this way)

This project is intentionally small, but it captures the core mechanics you need for reliable tool use:

  • Prompt-as-contract: the system prompt enforces one-sentence answers and “confirm before booking”.
  • Tool schemas as interfaces: JSON schemas constrain the model’s tool-call arguments (destination_city, optional depart_at).
  • Tool-call loop discipline: the app executes tool calls, appends both the tool request and tool results back into messages, and re-calls the model until it returns a final response (supports multi-step tool usage).
  • Stateful backend: SQLite provides deterministic retrieval and a persistent booking record (mock but real state).
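The loop discipline above can be sketched with a SQLite-backed lookup and a stubbed model call; `model_call` stands in for the real chat-completion request, and the reply shape is a simplified version of the tool-call format:

```python
import json
import sqlite3

def get_ticket_price(conn, destination_city):
    """Deterministic retrieval from the prices table."""
    row = conn.execute(
        "SELECT price FROM prices WHERE city = ?", (destination_city,)
    ).fetchone()
    return {"destination_city": destination_city,
            "price": row[0] if row else None}

def run_tool_loop(messages, model_call, conn):
    """Execute tool calls and re-call the model until it answers directly."""
    while True:
        reply = model_call(messages)
        if not reply.get("tool_calls"):
            return reply["content"]  # final one-sentence answer
        messages.append(reply)  # keep the tool request in the transcript
        for call in reply["tool_calls"]:
            args = json.loads(call["arguments"])
            result = get_ticket_price(conn, args["destination_city"])
            # Tool results go back into messages so the model stays in sync
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
```

Appending both the tool request and the tool result before re-calling the model is what makes multi-step tool usage work.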

Interface

booking_agent(history) -> (history, voice_audio_bytes, image)

  • history: Gradio “messages” format ([{role, content}, ...])
  • voice_audio_bytes: TTS audio bytes for autoplay
  • image: PIL image for the destination (optional)

Demo app (Gradio)

Flight booking agent — Demo

A lightweight Gradio Blocks UI is included to demonstrate the full loop (chat → tool call → response), with audio + image outputs.

  • entry point: ./price_ticket_agentic_tool/src/flight_booking_agent.py
  • run locally:
    • ensure OPENAI_API_KEY is set (via .env or environment)
    • start the app: uv run python price_ticket_agentic_tool/src/flight_booking_agent.py

Meeting minute generator

A Python utility that turns meeting audio into structured Markdown minutes using an LLM. It’s designed for workflows where meetings happen frequently, recordings exist, and teams need consistent documentation without relying on manual note-taking.

Business problem

Minutes are a core operational artefact: they capture decisions, context, and action items. When they are missing or inconsistent, teams lose accountability, repeat discussions, and waste time rebuilding context for stakeholders who weren’t in the room.

What it does

Given an audio recording, meeting_minute_generator(...):

  • transcribes the meeting audio
  • generates minutes in Markdown with a fixed, contract-driven structure:
    • summary (attendees/date/location if stated)
    • key discussion points (controlled granularity)
    • takeaways
    • action items with owners and due dates (if stated)
  • avoids inventing details: missing information is explicitly marked as Not specified
  • saves the exact transcript used for each run for traceability and debugging
  • renders Markdown in notebooks or prints clean output in terminal runs

Notes on the design (why it’s structured this way)

This tool prioritises reliability over creativity:

  • Prompt-as-contract to enforce consistent format and detail level
  • Low-temperature generation to reduce run-to-run variability
  • Faithfulness guardrails to avoid invented metadata or action items
  • Transcript persistence to diagnose whether issues are transcription- or summarisation-driven

Synthetic A/B Dataset Generator

A lightweight Gradio app that generates a compact synthetic A/B conversion dataset (CSV) plus a Markdown “dataset card” using an LLM. It’s designed for quick demo datasets: small, usable tables with a clear schema, controlled treatment effect, and immediately readable documentation. Ideal for model testing and benchmarking.

Business problem

Teams often need realistic A/B-style datasets for prototyping dashboards, testing analytics pipelines, teaching experimentation concepts, or building demos. Real production data is sensitive, slow to access, and rarely shareable. Synthetic data solves this — but only if it is structured, consistent, and documented enough to be usable.

What it does

Given a set of knobs, the generator:

  • uses a schema-as-contract to force a fixed set of columns and allowed values
  • generates a CSV dataset with a control and treatment variant and a binary conversion outcome
  • produces a Markdown dataset card summarising the dataset (shape, column dictionary, allocation and conversion rates, observed lift)
  • saves both artefacts to disk (.csv and _metadata.md) and renders the dataset card in the UI
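The dataset-card numbers described above (allocation, per-variant conversion rates, observed lift) reduce to a small computation over rows that satisfy the schema contract (`variant` in {control, treatment}, binary `converted`). A sketch, assuming absolute lift:

```python
def ab_summary(rows):
    """Summarise an A/B dataset: sample size, rates, and observed lift."""
    def rate(variant):
        group = [r for r in rows if r["variant"] == variant]
        return sum(r["converted"] for r in group) / len(group)
    control, treatment = rate("control"), rate("treatment")
    return {
        "n": len(rows),
        "control_rate": control,
        "treatment_rate": treatment,
        "observed_lift": treatment - control,  # absolute lift
    }
```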

Demo app (Gradio)

Synthetic A/B Dataset Generator Demo

  • entry point: ./synthetic_data_generator/src/ab_data_generator.py
  • run locally:
    • ensure OPENAI_API_KEY is set (via .env or environment)
    • start the app: uv run python synthetic_data_generator/src/ab_data_generator.py

LLM Code Performance Benchmark

A Python benchmark that compares LLMs on a practical “speedup” task: translating a Python workload into high-performance C++ and measuring the runtime improvement. It supports both hosted models (OpenAI / Anthropic) and open-source models (via local Ollama or OpenRouter), and saves each model’s generated C++ as an artefact for inspection and reproducibility.

Business problem

Many teams have Python code that is correct but too slow in production or in critical research workflows. Rewriting hot paths in C++ is a classic solution, but it is time-consuming and requires specialist expertise.

LLMs can generate C++ ports quickly, but performance and correctness vary by model and by task. This creates a model-selection problem: for a given workload, which model produces the fastest correct implementation — and how often does it fail?

What it does

Given a Python benchmark script, the tool:

  • runs the Python code as the baseline and captures:
    • the computed result
    • the measured execution_time
  • asks each target LLM to port the Python into C++ with a performance-first prompt contract
  • writes each model output to {model}_main.cpp (safe filename) as a persistent artefact
  • compiles and executes the generated binary
  • parses the C++ program output to extract:
    • Result: ...
    • Execution Time: ... seconds
  • reports speedup as python_runtime / cpp_runtime per model
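The parsing and speedup steps above can be sketched directly; the output format (`Result: ...`, `Execution Time: ... seconds`) comes from the prompt contract, while the function names here are illustrative:

```python
import re

def parse_cpp_output(stdout):
    """Extract the result and runtime from the generated binary's stdout."""
    result = re.search(r"Result:\s*(.+)", stdout)
    runtime = re.search(r"Execution Time:\s*([\d.eE+-]+)\s*seconds", stdout)
    if not result or not runtime:
        return None  # treat unparseable output as a model failure
    return {"result": result.group(1).strip(),
            "runtime": float(runtime.group(1))}

def speedup(python_runtime, cpp_runtime):
    """Speedup factor relative to the Python baseline."""
    return python_runtime / cpp_runtime
```

Returning `None` on unparseable output keeps failure attribution with the model's C++, in line with the failure modes below.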

It also distinguishes failure modes during evaluation:

  • LLM compile error: model produced invalid / non-compilable C++
  • LLM runtime error: binary compiled but crashed / exited non-zero

This is useful when comparing models, because a “fast” model that fails often is not a good production choice.

Open-source model inclusion

Open-source models can be compared alongside hosted models using OpenAI-compatible clients:

  • Ollama (local) for running models on your machine with an OpenAI-style API endpoint
  • OpenRouter for hosted access to open models behind a unified API

This makes it easy to benchmark “paid vs local vs open” on the same workload and hardware.

Notes on the design (why it’s structured this way)

This is deliberately minimal and practical:

  • Python is the reference: baseline behaviour defines correctness.
  • Prompt-as-contract: models must return only C++ code, optimised for speed.
  • Artefact persistence: every model’s C++ is saved so you can diff, audit, and reuse.
  • Failure-aware benchmarking: compile/runtime failures are attributed to the model output, not silently mixed into system errors.
  • Per-task selection: different optimisation tasks can favour different models; this benchmark is meant to be re-run per workload type.

Interface

Batch benchmark: python_to_cpp_performance(models=[...], python="...", ui_launch=False) -> dict

  • returns Python baseline result/runtime plus, per model:
    • status (ok, llm_compile_error, llm_runtime_error, skipped_no_client, etc.)
    • parsed result/runtime when successful
    • speedup factor vs Python

Demo app (Gradio)

A lightweight Gradio UI is included for interactive use: paste Python code, select a model, and generate the C++ port as a quick inspection / iteration loop (single-model conversion, not the full benchmark loop).

  • entry point: python_to_cpp_performance(ui_launch=True)
  • run locally:
    • ensure required keys are set (OPENAI_API_KEY and/or ANTHROPIC_API_KEY, plus optional OPENROUTER_API_KEY)
    • launch: call the function with ui_launch=True from inside a uv run python session (opens in browser)
