Adding RAG and new output mode #2

Open

mistrjirka wants to merge 4 commits into stratosphereips:main from mistrjirka:main

Conversation


mistrjirka commented Oct 18, 2025

Description

This PR adds two major features to bsy-clippy:

  1. Vector Database (RAG) Support: Implements Retrieval-Augmented Generation using vector embeddings to handle large stdin inputs more effectively
  2. Hide Thinking Mode: Adds a --hide-thinking flag to hide LLM reasoning (<think> tags) and show a spinner instead

Motivation and Context

Vector Database Feature

When processing large files through stdin, sending the entire content to the LLM can:

  • Exceed context window limits
  • Result in less relevant responses
  • Waste tokens on irrelevant sections

The vector database solves this by:

  • Chunking text intelligently (paragraph-aware with overlap)
  • Creating semantic embeddings using BAAI/bge-small-en-v1.5
  • Retrieving only the most relevant chunks for each query
  • Using HNSW indexing for fast similarity search
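
The retrieval flow above can be sketched in miniature. The sketch below is illustrative only, not code from this PR: a simple bag-of-words embedder stands in for BAAI/bge-small-en-v1.5, and a brute-force cosine search stands in for the HNSW index (which hnswlib would replace at scale):

```python
import numpy as np

def build_vocab(texts: list[str]) -> dict[str, int]:
    """Map each word seen in the corpus to a vector dimension."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Bag-of-words vector, L2-normalised so a dot product gives cosine similarity."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute-force cosine)."""
    vocab = build_vocab(chunks)
    scores = np.stack([embed(c, vocab) for c in chunks]) @ embed(query, vocab)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

chunks = [
    "The Solar System contains eight planets.",
    "Artificial Intelligence is a branch of computer science.",
    "Python is a programming language.",
]
print(retrieve(chunks, "artificial intelligence", k=1))
# -> ['Artificial Intelligence is a branch of computer science.']
```

Only the top-k chunks are sent to the LLM, which is how the feature avoids exceeding the context window.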

Hide Thinking Feature

Some users prefer cleaner output without the LLM's reasoning process. This feature:

  • Hides <think>...</think> sections completely
  • Shows an animated spinner in stream mode to indicate processing
  • Works in both batch and stream modes
  • Compatible with all other features (vector mode, interactive mode, etc.)
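
A minimal sketch of the batch-mode filtering idea (an illustration, not the PR's actual implementation; in stream mode the real code must additionally handle tags split across streamed chunks):

```python
import re

# Match a complete <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from a complete response."""
    return THINK_RE.sub("", text).strip()

raw = "<think>\nOkay, the user asked...\n</think>\n\nHi! 😊"
print(strip_thinking(raw))  # -> Hi! 😊
```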

API Key Handling Fix

Fixed an issue where localhost endpoints (127.0.0.1, localhost) required an API key even though local LLM servers like Ollama don't need authentication.
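
The fix can be illustrated with a small helper (hypothetical names; the actual check in the PR may differ in detail):

```python
from urllib.parse import urlparse

# Hosts that are assumed to run unauthenticated local LLM servers (e.g. Ollama).
LOCAL_HOSTS = {"127.0.0.1", "localhost", "::1"}

def requires_api_key(base_url: str) -> bool:
    """Only remote endpoints need an API key; local servers accept any request."""
    host = urlparse(base_url).hostname or ""
    return host not in LOCAL_HOSTS

print(requires_api_key("http://127.0.0.1:11434/v1"))  # -> False
print(requires_api_key("https://api.openai.com/v1"))  # -> True
```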

Dependencies

New dependencies added to requirements.txt and pyproject.toml:

  • fastembed>=0.3.0 - CPU-based text embeddings
  • hnswlib>=0.8.0 - Fast approximate nearest neighbor search
  • numpy>=1.24.0 - Array operations for embeddings

Type of change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (localhost API key handling)
  • This change requires a documentation update

How Has This Been Tested?

All tests were run on Python 3.13.7 with Ollama (qwen3:1.7b) on localhost.

Test 1: Vector Mode Without Chat Continuation

Command:

cat /tmp/test.txt | python bsy-clippy.py --vector --profile localollama -u "What is AI" --mode batch

Expected: Should build vector index, answer the question, and exit (no interactive mode).

Result: ✅ PASSED


Creating vector embeddings for 1 chunks...
Vector index ready (1 chunks, 384 dimensions)
Vector database ready with 1 chunks
<think>
Okay, the user asked "What is AI?" and provided some context. Let me see. The context mentions that the Solar System has eight planets, AI is a branch of computer science, and Python is a programming language.

So, the user wants a short explanation of AI. From the context, AI is defined as a branch of computer science. The other info about planets and Python doesn't directly relate to AI, but maybe the user wants a concise answer. I should focus on the main point from the context. Make sure it's brief and to the point. Avoid any extra info not in the context. So, the answer is "Artificial Intelligence is a branch of computer science." That's concise and uses the relevant context. Check if it's very short and brief. Yes, it's just a sentence. Alright, that should do it.
</think>

Artificial Intelligence is a branch of computer science.

Test 2: Hide-Thinking Flag in Batch Mode

Command:

cat /tmp/test.txt | python bsy-clippy.py --profile localollama -u "Explain this" --mode batch --hide-thinking

Expected: Should process input and show only the answer (no <think> tags).

Result: ✅ PASSED

The Solar System has eight planets (Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune). 
AI is a branch of computer science focused on creating intelligent machines. 
Python is a programming language used for tasks like web development, data analysis, and automation.

Test 3: Hide-Thinking Flag in Stream Mode

Command:

echo "test" | python bsy-clippy.py --profile localollama -u "Say hello" --hide-thinking

Expected: Should show animated spinner during thinking, then display only the final answer.

Result: ✅ PASSED

                  

Hello!

Note: the whitespace above marks where the spinner was displayed and then cleared.

Test 4: Normal Mode (Default Behavior)

Command:

echo "test" | python bsy-clippy.py --profile localollama -u "Say hi" --mode batch

Expected: Should show colored <think> tags with the reasoning process.

Result: ✅ PASSED

<think>
Okay, the user said "Say hi" and then wrote "test". I need to respond in a brief and short way. 
Let's see, the main action is to say hi, but the test part is probably a prompt. 
So a simple "Hi!" would work. Maybe add a smiley to keep it friendly. 
But since it's short, just "Hi!" is better. Make sure it's concise.
</think>

Hi! 😊

Test 5: Vector Mode With Chat Continuation

Command:

cat /tmp/test.txt | python bsy-clippy.py --vector --profile localollama -u "What is AI" -c

Expected: Should answer question first, then enter interactive mode with RAG enabled.

Result: ✅ PASSED - Answers question, then prompts "You can now ask questions about the input data." and enters interactive mode.

Test 6: Localhost API Key Handling

Command:

unset OPENAI_API_KEY
echo "test" | python bsy-clippy.py --profile localollama -u "Say hello"

Expected: Should work without API key for localhost endpoints.

Result: ✅ PASSED - No error, processes request successfully.

Test 7: Remote Endpoint API Key Requirement

Command:

unset OPENAI_API_KEY
echo "test" | python bsy-clippy.py --base-url https://api.openai.com/v1 -u "test"

Expected: Should show error requiring API key for remote endpoints.

Result: ✅ PASSED

[Error] OPENAI_API_KEY is not set. Create a .env file with OPENAI_API_KEY=<token> or export it.

Test Configuration

  • Python Version: 3.13.7
  • LLM: Ollama with qwen3:1.7b model
  • Endpoint: http://127.0.0.1:11434/v1
  • Test Data: /tmp/test.txt containing:
    The Solar System contains eight planets.
    Artificial Intelligence is a branch of computer science.
    Python is a programming language.
    

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have tested all new features with multiple scenarios
  • Dependencies are properly documented in requirements.txt and pyproject.toml
  • Backward compatibility maintained (all existing functionality still works)

Additional Notes

Architecture Changes

Vector Index Implementation

  • Class: VectorIndex - Manages embeddings and HNSW index
  • Function: chunk_text() - Intelligent paragraph-aware chunking with configurable overlap
  • Function: build_vector_index() - Wrapper to create index from text
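
A simplified version of the paragraph-aware chunking idea (a sketch assuming individual paragraphs fit within the chunk size; the PR's chunk_text() may differ in detail):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks up to chunk_size characters,
    carrying the tail of the previous chunk forward as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # keep overlap for context continuity
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

parts = chunk_text("First paragraph.\n\nSecond paragraph.", chunk_size=20, overlap=6)
print(len(parts))  # -> 2
```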

Spinner Implementation

  • Class: Spinner - Threaded spinner with animated frames
  • Frames: Uses Braille patterns (⠋ ⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏) for smooth animation
  • Thread Safety: Daemon thread with proper cleanup
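
A minimal sketch of such a spinner (illustrative; the PR's Spinner class may differ):

```python
import itertools
import sys
import threading
import time

class Spinner:
    """Braille-frame spinner on a daemon thread, cleared from the line on stop."""
    FRAMES = "⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏"

    def __init__(self, interval: float = 0.08):
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._spin, daemon=True)

    def _spin(self) -> None:
        for frame in itertools.cycle(self.FRAMES):
            if self._stop.is_set():
                break
            sys.stderr.write(f"\r{frame} thinking...")
            sys.stderr.flush()
            time.sleep(self.interval)
        sys.stderr.write("\r" + " " * 20 + "\r")  # clear the spinner line

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

spinner = Spinner(interval=0.02)
spinner.start()
time.sleep(0.1)  # stand-in for waiting on the LLM's <think> phase
spinner.stop()
```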

Refactoring

  • Function: handle_stdin_with_vector() - Separated logic for vector-enabled stdin processing
  • Function: handle_stdin_without_vector() - Normal stdin processing
  • Main function: Reduced from 120+ lines to ~60 lines by extracting helper functions

New CLI Arguments

--vector                Enable RAG mode for large stdin inputs
--chunk-size N          Chunk size for vector database (default: 500 characters)
--retrieve-chunks N     Number of chunks to retrieve (default: 4)
--hide-thinking         Hide <think> sections and show spinner instead
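
These flags could be registered with argparse roughly as follows (a sketch using the names and defaults listed above; everything else is assumed):

```python
import argparse

parser = argparse.ArgumentParser(prog="bsy-clippy")
parser.add_argument("--vector", action="store_true",
                    help="Enable RAG mode for large stdin inputs")
parser.add_argument("--chunk-size", type=int, default=500, metavar="N",
                    help="Chunk size for vector database (characters)")
parser.add_argument("--retrieve-chunks", type=int, default=4, metavar="N",
                    help="Number of chunks to retrieve per query")
parser.add_argument("--hide-thinking", action="store_true",
                    help="Hide <think> sections and show a spinner instead")

args = parser.parse_args(["--vector", "--chunk-size", "800"])
print(args.vector, args.chunk_size, args.retrieve_chunks)  # -> True 800 4
```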

Performance Considerations

  • Vector embedding is CPU-based (no GPU required)
  • HNSW index provides O(log n) search complexity
  • Spinner runs in background thread (no blocking)
  • Chunk overlap ensures context preservation across boundaries
