Fine-tuning LLMs (Mistral, Llama) for financial sentiment analysis using QLoRA. The goal is to get better sentiment classification than FinBERT while being able to generate explanations — something traditional classifiers can't do.
After working with classical ML for a while, I wanted to try fine-tuning actual LLMs for a domain-specific task. Financial sentiment seemed like a good fit because:
- FinBERT exists as a strong baseline to compare against
- The domain has nuanced language that base LLMs get wrong ("revenue missed but guidance raised" — positive or negative?)
- QLoRA makes it possible to fine-tune 7B models on a single GPU
I waited for Llama 3 and Mistral to mature before starting, so I could use recent models with good tokenizers.
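The single-GPU claim is easy to sanity-check on a napkin. A rough VRAM estimate, under my assumptions (4-bit NF4 base weights, ~4M trainable LoRA parameters stored in bf16; activations and optimizer state excluded):

```python
# Back-of-envelope VRAM estimate for QLoRA on a 7B model.
# Assumptions: 4-bit NF4 base weights, ~4M trainable LoRA params in bf16.
base_params = 7e9
base_weights_gb = base_params * 0.5 / 1e9  # 4 bits = 0.5 bytes/param -> 3.5 GB
lora_params = 4e6
lora_gb = lora_params * 2 / 1e9            # bf16 = 2 bytes/param -> 0.008 GB
print(f"base: {base_weights_gb:.1f} GB, adapters: {lora_gb:.3f} GB")
```

Activations, gradients, and optimizer state add several more GB on top, which is why batch size still matters even with 4-bit weights.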
- Load Financial PhraseBank or FiQA dataset
- Format examples as instruction prompts (text → sentiment label)
- Fine-tune with LoRA (only ~4M params out of 7B — 0.06%)
- Evaluate against base model and FinBERT
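The prompt-formatting step above might look roughly like this. The template is illustrative (the real one lives in the repo's configs), but it shows the shape: instruction, input text, expected label.

```python
def format_example(text: str, label: str = "") -> str:
    """Render a PhraseBank-style (text, label) pair as an SFT prompt.

    Hypothetical template; the repo's actual template may differ.
    Leave `label` empty at inference time so the model completes it.
    """
    return (
        "### Instruction:\n"
        "Classify the sentiment of this financial text as "
        "positive, negative, or neutral.\n\n"
        f"### Input:\n{text}\n\n"
        f"### Response:\n{label}"
    )

print(format_example("Operating profit rose to EUR 13.1 mn.", "positive"))
```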
Based on published benchmarks for similar QLoRA setups:
| Model | PhraseBank Acc (expected) |
|---|---|
| FinBERT (baseline) | ~87% |
| Mistral-7B (no fine-tuning) | ~72% |
| Mistral + LoRA | ~88-90% |
| Llama-3-8B (no fine-tuning) | ~74% |
| Llama-3 + LoRA | ~89-91% |
I haven't completed a full training run yet — the pipeline is set up and ready but I need to do the actual fine-tuning on a GPU instance (Colab A100 or similar, ~2 hours with 4-bit quantization). The numbers above are targets based on what similar setups achieve in the literature.
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
huggingface-cli login  # need access to gated models
```

```bash
# with config file
python scripts/train.py --config configs/lora_mistral.yaml

# or CLI args
python scripts/train.py --base-model mistralai/Mistral-7B-v0.1 --epochs 3 --batch-size 4
```

```python
from src.inference.predictor import SentimentPredictor

predictor = SentimentPredictor("./outputs/finllm-sentiment-lora")
result = predictor.predict("Apple beat revenue estimates by 3%")
# {'sentiment': 'positive', 'confidence': 0.92}
```

There's also a Streamlit demo (`streamlit run streamlit_app/app.py`), but it currently uses a keyword-based placeholder; the actual model isn't hooked up yet.
- CUDA OOM is real: Had to drop batch size from 8 to 4 and crank up gradient accumulation. The `paged_adamw_32bit` optimizer helped a lot with memory.
- Prompt format matters a lot: Spent weeks trying different templates. Simple classification prompts work best for training, but JSON format is better at inference time.
- Pad token debugging: Mistral doesn't have a pad token by default. Setting it to EOS and making sure `pad_token_id` is set everywhere took embarrassingly long to figure out. Loss was going to NaN and I had no idea why.
- Financial language is tricky: "Revenue missed expectations but guidance raised"; models disagree on this one. These edge cases are where fine-tuning actually helps vs just prompting.
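The pad-token fix boils down to something like the helper below. It's a sketch: the stand-in object mimics a tokenizer so it runs without downloading the gated model, but the same logic applies to a real `AutoTokenizer`.

```python
from types import SimpleNamespace

def ensure_pad_token(tokenizer, model_config=None):
    """Reuse EOS as PAD when the tokenizer (e.g. Mistral's) defines none."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    # The model config must agree, or padded batches can drive the loss to NaN.
    if model_config is not None:
        model_config.pad_token_id = tokenizer.pad_token_id
    return tokenizer

# Stand-in mimicking a tokenizer that ships without a pad token:
tok = SimpleNamespace(pad_token=None, pad_token_id=None,
                      eos_token="</s>", eos_token_id=2)
ensure_pad_token(tok)
print(tok.pad_token, tok.pad_token_id)  # → </s> 2
```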
- Try DPO (Direct Preference Optimization) instead of SFT — might help with the ambiguous cases
- Use a bigger evaluation set. PhraseBank is small and FiQA even smaller
- The inference pipeline is kind of hacky (regex-based output parsing). Would be cleaner with structured generation
- Batch inference is just a loop right now, should properly batch the generation
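For reference, the "hacky" regex-based output parsing amounts to roughly this (illustrative pattern, not the repo's exact code):

```python
import re

def parse_sentiment(generation: str) -> dict:
    """Pull a sentiment label out of free-form model output via regex fallback."""
    m = re.search(r"\b(positive|negative|neutral)\b", generation.lower())
    return {"sentiment": m.group(1) if m else "unknown"}

print(parse_sentiment("Response: The sentiment is Positive."))
# → {'sentiment': 'positive'}
```

Structured generation (constraining the decoder to emit exactly one of the three labels) would make the `unknown` fallback unnecessary, which is the cleanup the list above points at.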