OgbujiPT 0.10.0 Phase 1: Foundation & Rearchitecture by uogbuji · Pull Request #93 · OoriData/OgbujiPT

uogbuji · 2025-11-24T18:59:22Z

First major step toward transforming OgbujiPT into a general-purpose LLMOps knowledge bank system, as outlined in discussion #92. This phase focuses on establishing a solid foundation through code reorganization and introducing core retrieval capabilities.

For a quick intro to the new changes, a good start is the demo/pg-hybrid dir.

Major Changes

Code Reorg

Module restructuring: Moved modules into logical packages:
- llm_wrapper.py → llm/wrapper.py
- embedding/ → store/postgres/ (pgvector modules)
- embedding/qdrant.py → store/qdrant/collection.py
- text_helper.py → text/splitter.py
- html_helper.py → text/html.py
New package structure:
- pylib/retrieval/ - Search strategies (dense, sparse, hybrid)
- pylib/memory/ - Knowledge base interfaces and metadata
- pylib/store/ - Storage backends (postgres, qdrant)
- pylib/llm/ - LLM wrapper functionality
- pylib/text/ - Text processing utilities

New Retrieval Capabilities

Sparse retrieval (retrieval/sparse.py): BM25 implementation for keyword-based search
Dense retrieval (retrieval/dense.py): Wrapper for embedding-based semantic search
Hybrid search (retrieval/hybrid.py): Reciprocal Rank Fusion (RRF) combining multiple strategies
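
For readers new to Reciprocal Rank Fusion, here is a minimal sketch of the general technique, not the code in retrieval/hybrid.py. The function name, the input shape (plain lists of document IDs) and the constant `k=60` (the value commonly used in the RRF literature) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top hit of each list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a dense and a sparse retriever
dense_hits = ['doc_a', 'doc_b', 'doc_c']
sparse_hits = ['doc_b', 'doc_d', 'doc_a']

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because RRF uses only ranks, it can combine strategies whose raw scores are on different scales, such as cosine similarity and BM25.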

Memory/Knowledge Base Foundation

KBBackend protocol (memory/base.py): Protocol-based interface for knowledge base backends (vector stores, graph DBs, etc.)
SearchStrategy protocol: Interface for pluggable search strategies
SearchResult dataclass: Unified result format across all backends
Metadata support (memory/metadata.py): Foundation for enriched item metadata
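
To illustrate what a protocol-based interface buys here, the sketch below uses `typing.Protocol` for structural typing. The field names, method signature and the `SubstringSearch` class are assumptions for illustration. The real definitions in memory/base.py may differ, and are likely async.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class SearchResult:
    # Hypothetical fields; a unified result carries content, a score and metadata
    content: str
    score: float
    metadata: dict[str, Any] = field(default_factory=dict)


@runtime_checkable
class SearchStrategy(Protocol):
    # Any object with a matching `search` method satisfies the protocol
    def search(self, query: str, limit: int = 5) -> list[SearchResult]: ...


class SubstringSearch:
    """Toy strategy: no inheritance from SearchStrategy is needed."""

    def __init__(self, docs):
        self.docs = list(docs)

    def search(self, query, limit=5):
        hits = [SearchResult(d, 1.0) for d in self.docs if query.lower() in d.lower()]
        return hits[:limit]


strategy = SubstringSearch(['Hybrid search demo', 'Unrelated note'])
print(isinstance(strategy, SearchStrategy))
results = strategy.search('hybrid')
print([r.content for r in results])
```

The point of the design is that backends and strategies plug in by shape rather than by subclassing, so third-party stores can conform without importing a base class.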

PostgreSQL Enhancements

Sparse vector support (store/postgres/pgvector_sparse.py): BM25-compatible sparse vector storage using pgvector
Hybrid search demos (demo/pg-hybrid/): Complete examples showing dense + sparse retrieval
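
As background on the BM25 scoring behind the sparse side, here is a dependency-free sketch of Okapi BM25 with the common defaults `k1=1.5` and `b=0.75`, using the always-positive IDF variant. It is not the code in retrieval/sparse.py or pgvector_sparse.py. Whitespace tokenization and the function name are simplifying assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ['postgres vector search', 'keyword search with bm25', 'cooking recipes']
scores = bm25_scores('bm25 search', docs)
print(scores)
```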

Removed Deprecated Code

pylib/prompting/ - Old prompting utilities (replaced by direct LLM wrapper usage)
pylib/word_loom.py - Deprecated prompt loading system
pylib/memoization/ - Moved to store/postgres/pgmemo.py
Various demo files that are no longer relevant

Updated Imports

All imports across the codebase have been updated to reflect the new structure. This includes:

Demo scripts (chat_web_selects.py, chat_doc_folder.py, etc.)
Test files
Internal module imports

Testing

Updated test paths to match new module structure
Existing tests continue to pass with updated imports
New retrieval functionality demonstrated in demo/pg-hybrid/

Documentation & Examples

demo/pg-hybrid/README.md: Comprehensive guide to hybrid search
demo/pg-hybrid/hybrid_search.ipynb: Interactive Jupyter notebook tutorial
demo/pg-hybrid/chat_with_hybrid_kb.py: Full conversational RAG example with hybrid search
demo/pg-hybrid/hybrid_search_demo.py: Standalone hybrid search demonstration

Migration Notes

For users upgrading:

Import paths have changed:
- ogbujipt.llm_wrapper → ogbujipt.llm.wrapper
- ogbujipt.embedding.* → ogbujipt.store.postgres.* or ogbujipt.store.qdrant.*
- ogbujipt.text_helper → ogbujipt.text.splitter
- ogbujipt.html_helper → ogbujipt.text.html
New capabilities available:
- Use ogbujipt.retrieval.hybrid.HybridSearch for combining dense + sparse search
- Use ogbujipt.retrieval.sparse.BM25Search for keyword-based retrieval
- Implement KBBackend protocol for custom storage backends
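
For codebases with many old imports, the path changes above can be applied mechanically. This is a rough sketch, assuming a plain regex rewrite is good enough for your code. `migrate_imports` is a hypothetical helper, not part of OgbujiPT, and imported names that moved within a module still need manual review.

```python
import re

# Old -> new module paths, taken from the migration notes above.
# The qdrant rule must come before the general embedding rule.
RENAMES = [
    (r'\bogbujipt\.llm_wrapper\b', 'ogbujipt.llm.wrapper'),
    (r'\bogbujipt\.embedding\.qdrant\b', 'ogbujipt.store.qdrant.collection'),
    (r'\bogbujipt\.embedding\b', 'ogbujipt.store.postgres'),
    (r'\bogbujipt\.text_helper\b', 'ogbujipt.text.splitter'),
    (r'\bogbujipt\.html_helper\b', 'ogbujipt.text.html'),
]


def migrate_imports(source: str) -> str:
    """Rewrite pre-0.10.0 module paths in a source string."""
    for pattern, replacement in RENAMES:
        source = re.sub(pattern, replacement, source)
    return source


# Illustrative input only; check the imported names against the new modules
old_source = (
    'from ogbujipt.embedding.pgvector import DataDB\n'
    'import ogbujipt.html_helper\n'
)
print(migrate_imports(old_source))
```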

Next Steps (Future PRs)

This foundation enables future work on:

GraphRAG support using Onya
Query classification/routing
Multi-backend aggregation
Observability and query logging
Maintenance and pruning strategies
MCP provider/server

Checklist

Code reorganization complete
Retrieval abstraction layer implemented
Hybrid search working with PostgreSQL
All imports updated
Tests updated and passing
Documentation and examples added
Deprecated code removed
CI/CD workflows updated

Major breaking changes for the transformation into a focused LLMOps knowledge bank library.

## Removed (907 lines)

- Entire prompting module (basic.py, model_style.py) - obsolete with modern chat templates
- word_loom.py - TOML template system no longer needed
- Prompting test suite (4 test files)
- 3 demo files showcasing removed features

## Reorganized Module Structure

- `pylib/embedding/` → `pylib/store/postgres/` (pgvector implementations)
- `pylib/embedding/qdrant.py` → `pylib/store/qdrant/collection.py`
- `pylib/memoization/pgmemo.py` → `pylib/store/postgres/pgmemo.py`
- `pylib/llm_wrapper.py` → `pylib/llm/wrapper.py`
- `pylib/text_helper.py` → `pylib/text/splitter.py`
- `pylib/html_helper.py` → `pylib/text/html.py`
- `test/embedding/` → `test/store/`

## New Module Structure

Created directory structure for 0.10.0 features:

- `pylib/memory/` - KB abstractions & unified API
- `pylib/store/` - Storage backends (organized by backend type)
- `pylib/retrieval/` - Retrieval strategies
- `pylib/ingestion/` - Data pipelines
- `pylib/maintenance/` - KB health & pruning
- `pylib/observability/` - Logging, tracing, metrics
- `pylib/mcp/` - Model Context Protocol

## New Foundation

Created base KB abstractions (memory/base.py, memory/metadata.py):

- Protocol-based interfaces (PEP 544) for flexibility
- SearchResult, KBBackend, SearchStrategy protocols
- ItemMetadata with RBAC support
- Metadata filter builders (functional approach)

## Dependencies

Added to pyproject.toml:

- onya (GraphRAG)
- chonkie (document chunking)
- rank-bm25 (sparse retrieval)
- structlog (structured logging)
- tenacity (retry logic)
- httpx (async HTTP)
- mcp (Model Context Protocol)

## Import Updates

Updated 400+ import statements across:

- pylib/ modules (cross-references)
- demo/ files (6 files)
- test/ files (5 files + fixtures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Remove circular imports in pgvector.py (DataDB, MessageDB now imported directly)
- Update test fixture paths: test/embedding → test/store
- Update qdrant test mock path to new module structure
- All 11 non-database tests passing

Database tests (15) skip gracefully when Postgres/Qdrant unavailable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>



Implement InMemoryDataDB and InMemoryMessageDB as drop-in replacements for PostgreSQL-backed stores. Tests now run instantly without external dependencies, following the "own your I/O boundaries" principle.

- In-memory stores available to users for prototyping/embedded use
- All 14 pgvector tests pass without PostgreSQL (0.52s vs multi-second setup)
- Integration test markers added for optional PostgreSQL testing
- Abstracted terminology (setup/cleanup vs create_table/drop_table)

Helps fix CI test failures requiring PostgreSQL setup.

uogbuji · 2025-11-27T06:53:40Z

Ended up on a bit of an extended detour today. CI tests were failing because they required a running PostgreSQL server with the pgvector extension:

ERROR test/store/test_pgvector_data.py - OSError: Connect call failed ('127.0.0.1', 5432)

14 tests were blocked by this dependency, requiring complex CI setup (DB services, connection management, etc.). I could dip into the well of more and more mocking, but this can become brittle and unwieldy, especially trying to mock database operations directly.

I always like the reasoning from the Hynek article "'Don’t Mock What You Don’t Own' in 5 Minutes". The better approach is "own your I/O boundaries"—create your own abstraction and provide alternative implementations rather than always mocking third-party libraries.
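
To make the "own your I/O boundaries" idea concrete, here is a minimal sketch, with all names hypothetical. The code under test depends only on a small protocol we own, so a test can pass in an in-memory implementation instead of mocking a database driver.

```python
import asyncio
from typing import Protocol


class SnippetStore(Protocol):
    """The boundary we own: the only storage surface application code sees."""

    async def insert(self, content: str) -> None: ...

    async def count(self) -> int: ...


class InMemorySnippetStore:
    """Alternative implementation for tests and prototyping; no server needed."""

    def __init__(self):
        self._items = []

    async def insert(self, content):
        self._items.append(content)

    async def count(self):
        return len(self._items)


async def ingest(store: SnippetStore, snippets):
    # Application logic written against the protocol, not a third-party driver
    for snippet in snippets:
        await store.insert(snippet)
    return await store.count()


total = asyncio.run(ingest(InMemorySnippetStore(), ['alpha', 'beta']))
print(total)
```

A database-backed class with the same two methods would slot into `ingest` unchanged, which is what lets the same tests run against either implementation.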

Solution

I'd always wanted to have production-ready in-memory vector stores as a lightweight option for end users in prototyping, demos, and embedded use. Might as well get to that now, with the added benefit of enabling fast, dependency-free tests of the OgbujiPT store components.

Changes

1. In-Memory Vector Store Implementation (`pylib/store/memory.py`)

InMemoryDataDB - Drop-in replacement for DataDB (document/snippet storage)
InMemoryMessageDB - Drop-in replacement for MessageDB (chat/message storage)

Features:

Full API compatibility with PostgreSQL versions
Numpy-based cosine similarity calculations
Complete metadata filtering support
Windowing support for message history
UUID string-to-object conversion
Abstracted lifecycle methods (setup()/cleanup() with compatibility aliases)
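
For a sense of what similarity search in an in-memory store involves, here is a minimal sketch of cosine-similarity ranking. It is not the code in pylib/store/memory.py. Plain Python stands in for NumPy to keep the example dependency-free, and the function names and the (content, vector) item shape are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, limit=5):
    """Rank (content, vector) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), content) for content, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy 2-d "embeddings"; real ones come from an embedding model
items = [
    ('greeting', [1.0, 0.0]),
    ('farewell', [0.0, 1.0]),
    ('salutation', [0.9, 0.1]),
]
hits = top_k([1.0, 0.0], items, limit=2)
for score, content in hits:
    print(content, round(score, 4))
```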

2. Test Infrastructure Updates (`test/store/conftest.py`)

Default fixtures now use in-memory stores (no external dependencies)
PG_DB* fixtures added for optional PostgreSQL integration tests
Automatic selection based on test file type (data vs message)

3. Pytest Configuration via `pyproject.toml`

Tip of these changes: bd97864

In-memory DB Usage Example

```python
from ogbujipt.store.memory import InMemoryDataDB

# Run inside an async function; `model` is an embedding model created elsewhere

# Use in-memory store (no PostgreSQL required)
db = InMemoryDataDB(embedding_model=model, collection_name='my_docs')
await db.setup()

await db.insert('Hello world', metadata={'source': 'greeting'})
async for result in db.search('greeting', limit=5):
    print(result.content, result.score)

await db.cleanup()
```

Testing

All existing PG tests pass with the in-memory implementation. Note: there's still a qdrant test I need to nurse back to health. To run the default in-memory tests, or the optional PostgreSQL integration tests:

```shell
# Default: fast in-memory tests
pytest test/store/ -v

# With PostgreSQL (requires running instance)
pytest test/store/ -v -m integration
```

Follow-up Work

Add demo files paralleling demo/pg-hybrid using in-memory stores
Update test/store/README.md with usage examples
Add docstring examples for in-memory stores

n.b. 1: Finally added a CONTRIBUTING.md

n.b. 2: Oh yes; I got help from Claude Code in this, but as always under a ton of supervision and constraint.

Introduce two new demo scripts: `chat_with_memory.py` for simulating conversations with in-memory message storage and semantic search, and `simple_search_demo.py` for demonstrating vector search capabilities without database setup. Additionally, a README.md file is added to provide an overview of the in-memory vector store demos, including prerequisites and usage patterns.

- `chat_with_memory.py`: Simulates a conversation, showcases message storage, retrieval, and semantic search.
- `simple_search_demo.py`: Demonstrates basic vector search with filtering and metadata.
- `README.md`: Overview of demos, installation instructions, and comparison with PostgreSQL-based solutions.

These additions enhance the usability and accessibility of the in-memory vector store for prototyping and learning purposes.

uogbuji and others added 5 commits November 13, 2025 18:27

Rephase for PG-based storage & retrieval logic. demo/pg-hybrid is a good starting point.

61c8663

Pace lint

776998b

Workflow updates

68a921c

uogbuji self-assigned this Nov 24, 2025

uogbuji added 7 commits November 24, 2025 14:32

Workflow updates

ace030a

Add reranker model support in hybrid search

cd5398b

Pace lint

c5336fb

Demo housekeeping, and also add one for 'chat my docs' via hybrid retrieval

d25f5a6

Pace lint

68ad9b8

Missing changes

bd97864

uogbuji merged commit 6e7c80d into main Nov 29, 2025
4 checks passed

uogbuji deleted the feature/kb-rearch branch November 29, 2025 22:14

uogbuji merged 13 commits into main from feature/kb-rearch
