# Scaffold AI Frequently Asked Questions (FAQ)
Kevin Mastascusa edited this page Jun 30, 2025 · 2 revisions

Last updated: 2025-06-30
### What is Scaffold AI?

Scaffold AI is a curriculum recommendation tool designed to help educators integrate sustainability and climate-resilience topics into academic programs. It uses state-of-the-art AI techniques, including retrieval-augmented generation (RAG), semantic search, and large language models (LLMs).
### Who is it for?

Educators, curriculum designers, and researchers seeking literature-backed, transparent curriculum recommendations for sustainability and engineering education.
### What are the system requirements?

- Python 3.11+ (3.11 recommended)
- 16GB+ RAM
- NVIDIA GPU (recommended but not required)
- Windows, Linux, or macOS
### How do I install Scaffold AI?

- Clone the repo: `git clone https://github.com/kevinmastascusa/scaffold_ai.git`
- Create and activate a virtual environment.
- Install dependencies: `pip install -r requirements.txt`
- Run the setup script: `python setup.py`
- Place your PDF files in the `data/` directory.

See the Local Setup Guide for detailed steps.
### Do I need a Hugging Face token?

Yes, for most LLM models.

- Get your token at https://huggingface.co/settings/tokens
- Add it to your environment or `.env` file as `HUGGINGFACE_TOKEN=your_token_here`

See the Hugging Face Migration Guide.
### How are PDFs processed?

- PDFs are split into page-based chunks (one chunk per complete page).
- Text is cleaned, Unicode-normalized, and analyzed for technical terms.
- Math-aware and Unicode-aware chunking is available for advanced use.
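The cleaning and page-chunking steps above can be sketched with the standard library. The function names here are illustrative, not the pipeline's actual API; NFKC normalization is one common choice for Unicode cleanup (e.g. expanding the "ﬁ" ligature), though the project may use a different form.

```python
import unicodedata

def clean_page_text(raw: str) -> str:
    """Normalize Unicode (e.g. ligatures) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split())

def chunk_pages(pages: list[str]) -> list[dict]:
    """One chunk per complete page, as described above; empty pages are skipped."""
    return [
        {"page": i + 1, "text": clean_page_text(page)}
        for i, page in enumerate(pages)
        if page.strip()
    ]
```

Keeping the page number with each chunk is what makes citations back to the source PDF possible later in the pipeline.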
### How are merged words handled?

PDF extraction can merge words (e.g., "environmentalsustainability").

- The pipeline detects and splits these using domain-specific rules and wordninja.
- See `outputs/combined_words_analysis_report.txt`.
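The splitting idea can be illustrated with a greedy dictionary-based sketch. The real pipeline uses wordninja plus domain-specific rules; this simplified stand-in just shows how a merged token can be recovered against a small sustainability vocabulary (the vocabulary contents here are invented for the example).

```python
def split_merged(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-prefix split of a merged word against a vocabulary.

    Simplified illustration only; wordninja uses word frequencies instead.
    """
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j].lower() in vocab:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]  # no prefix matched: give up and keep the word whole
    return parts

# Toy domain vocabulary (illustrative)
vocab = {"environmental", "sustainability", "climate", "resilience"}
```

A frequency-based splitter like wordninja handles general English better, but a domain vocabulary catches technical compounds that frequency lists miss.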
### How does retrieval and answering work?

- Each chunk is embedded using `all-MiniLM-L6-v2` (sentence-transformers).
- Chunks are indexed with FAISS for efficient vector search.
- Queries are embedded and matched to relevant chunks.
- Top matches are reranked with a cross-encoder.
- The LLM generates a grounded answer using only retrieved content.
- Citations to source documents are included (citation layer in progress).
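The embed-and-match steps above can be sketched with a toy bag-of-words similarity. The real system uses `all-MiniLM-L6-v2` dense embeddings, a FAISS index, and a cross-encoder reranker; this stdlib-only sketch only demonstrates the ranking idea, and the function names are invented for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real pipeline uses all-MiniLM-L6-v2."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (FAISS does this at scale)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

The same interface scales up by swapping the toy embedding for a sentence-transformer and the sort for an approximate-nearest-neighbor index, which is exactly the role FAISS plays in the pipeline.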
### Which LLM is used?

- Default: `mistralai/Mistral-7B-Instruct-v0.2` (Hugging Face)
- Alternatives: OpenHermes, TinyLlama, and others (see model_summary.md)
### Can I use a different model?

Yes!

- Edit `LLM_MODEL` in `scaffold_core/config.py`.
- Make sure the model supports text generation and that you have access.
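Concretely, swapping models is a one-line change. The default value below comes from this FAQ; the commented alternative is one example of a Hugging Face model ID, and any surrounding contents of `config.py` are not shown here.

```python
# scaffold_core/config.py (illustrative excerpt)

# Default model from this FAQ:
LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

# Example alternative (make sure your token has access to the model):
# LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

After changing the value, re-run your query script so the new model is downloaded and loaded.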
### Can I run tests?

Yes.

- Run `python scaffold_core/scripts/run_tests.py` for comprehensive tests.
- Generate a detailed report with `python scaffold_core/scripts/generate_test_report.py`.
- See `documentation/query_system_test_report.md`.
### What if I run out of memory?

- Use a smaller model (e.g., TinyLlama)
- Reduce the batch size or `max_length` in `config.py`
- Run on CPU if GPU memory is insufficient
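The memory-saving tips above might look like the following in `config.py`. Only `LLM_MODEL` and `max_length` are named in this FAQ; `batch_size` and `device`, and all the specific values, are assumptions for illustration.

```python
# Illustrative memory-saving overrides for scaffold_core/config.py.
# batch_size and device are hypothetical setting names.
LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # smaller model
batch_size = 1      # smaller batches lower peak memory use
max_length = 512    # shorter generations need less memory
device = "cpu"      # fall back to CPU if GPU memory is insufficient
```

Try the smaller model first; it usually gives the largest memory reduction for the least quality loss on short curriculum queries.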
### What if a model fails to load or access is denied?

- Check your Hugging Face token and model access permissions.
- Request access to gated models as described in the migration guide.
- Make sure you have internet connectivity.
### Is there a user interface?

A pilot UI for querying and feedback is planned for the early testing phase.

- Track progress in GitHub Issue #8.
### How can I contribute?

- Fork the repo and submit a pull request.
- See `CONTRIBUTING.md` (to be added).
- Open an issue for bugs or feature requests.
### Where can I get help?

- Open a GitHub issue.
- Contact the project maintainers via GitHub.