@trxvorr commented Dec 19, 2025

Description

Implements a new module that processes unstructured textual documents and maps them into the existing Semantic Layer. This enables converting extracted text (from OCR or other sources) into RDF triples and integrating them with the SemanticModel.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update
  • Configuration change
  • Infrastructure/build change

Related Issue(s)

Fixes #108

Changes Made

  • Added src/intugle/text_processor/ module with:
    • models.py: Pydantic models for Entity, Relationship, RDFTriple, RDFGraph with RDF-star support
    • processor.py: Main TextToSemanticProcessor class for text-to-RDF conversion
    • extractors/base.py: Abstract BaseExtractor interface for pluggable NLP backends
    • extractors/llm_extractor.py: LLM-based entity and relationship extraction using LangChain
    • rdf/builder.py: RDFBuilder class for constructing RDF triples from extracted data
    • mapper.py: SemanticMapper for aligning RDF entities to existing SemanticModel nodes
  • Added SemanticModel.overlay() method for integrating RDF graphs
  • Added intugle text-to-semantic CLI command
  • Exposed TextToSemanticProcessor at package level
  • Added 34 unit tests for all components
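
To make the RDF-star support mentioned for models.py more concrete, here is a minimal sketch of a triple that carries annotations about the statement itself. It uses standard-library dataclasses rather than the Pydantic models in this PR, and every name in it (Triple, AnnotatedTriple, to_turtle_star, the ex: prefix) is hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Triple:
    """A plain subject-predicate-object statement (hypothetical model)."""
    subject: str
    predicate: str
    obj: str

    def to_turtle(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj} ."


@dataclass
class AnnotatedTriple:
    """A triple plus RDF-star annotations that describe the statement."""
    triple: Triple
    annotations: Dict[str, str] = field(default_factory=dict)

    def to_turtle_star(self) -> List[str]:
        # Assert the base statement first, then attach each annotation
        # to the quoted triple using the << ... >> syntax.
        t = self.triple
        quoted = f"<< {t.subject} {t.predicate} {t.obj} >>"
        lines = [t.to_turtle()]
        for predicate, value in self.annotations.items():
            lines.append(f"{quoted} {predicate} {value} .")
        return lines


# Illustrative data taken from the invoice sentence used in this PR.
issued = AnnotatedTriple(
    Triple("ex:Invoice123", "ex:issuedBy", "ex:VendorA"),
    {"ex:issuedOn": '"2024-03-04"^^xsd:date'},
)
turtle_lines = issued.to_turtle_star()
print("\n".join(turtle_lines))
```

The point of the quoted-triple form is that the date qualifies the "issued by" statement rather than the invoice or the vendor alone.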

Testing

Test Configuration

  • Python Version: 3.10+
  • OS: Windows/Linux/macOS
  • LLM Provider: OpenAI (gpt-4o-mini default)

Test Cases

  • Unit tests pass locally
  • Manual testing completed
  • CLI verification completed

Test Commands

uv run pytest tests/text_processor/ -v
uv run intugle text-to-semantic --help

Screenshots/Examples

from intugle import TextToSemanticProcessor, SemanticModel

# Step 1: Convert unstructured text to RDF triples
text_input = """
Invoice 123 was issued by Vendor A on March 4, 2024 for an amount of $5,400.
"""
processor = TextToSemanticProcessor(model="gpt-4o-mini", output_format="rdf_star")
rdf_graph = processor.parse(text_input)

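# Step 2: Map RDF to the existing semantic model
# data_input is assumed to be the dataset input already used to build the SemanticModel (defined elsewhere)
semantic_model = SemanticModel(data_input)
semantic_model.overlay(rdf_graph, match_threshold=0.85)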
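
One plausible reading of match_threshold is a minimum similarity score that an extracted entity label must reach before it is aligned to an existing node. The sketch below illustrates that idea with difflib from the standard library. It is an assumption about the behaviour, not the SemanticMapper implementation; best_match and the candidate names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def _normalize(label: str) -> str:
    # Compare labels case-insensitively and treat underscores as spaces.
    return label.lower().replace("_", " ").strip()


def best_match(label: str, candidates: List[str], threshold: float = 0.85) -> Optional[str]:
    """Return the most similar candidate, or None if none reaches the threshold."""
    best_name: Optional[str] = None
    best_score = 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, _normalize(label), _normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


# Hypothetical node names in an existing semantic model.
nodes = ["vendors", "invoices"]
matched = best_match("Invoice", nodes)
unmatched = best_match("Shipment", nodes)
print(matched, unmatched)
```

A higher threshold trades recall for precision: fewer extracted entities are attached to existing nodes, but those that are attached are less likely to be wrong.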

Checklist

  • My code follows the code style of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or linter errors
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published
  • I have updated the relevant notebooks (if applicable)
  • I have checked my code and corrected any misspellings

Documentation Updates

  • README.md updated
  • Docstrings added/updated
  • Documentation site updated (if needed)
  • Notebook examples updated (if applicable)
  • CHANGELOG updated (if applicable)

Breaking Changes

  • This PR introduces breaking changes
  • Migration guide provided (if applicable)

Performance Impact

  • Performance benchmarks run
  • No significant performance impact
  • Performance improvement
  • Performance regression

Additional Context

This implementation uses LLM-based extraction (via LangChain) for flexible entity and relationship discovery. The architecture supports pluggable extractors for future backends (spaCy, Hugging Face, etc.).
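
As a rough illustration of what a pluggable extractor design can look like, the sketch below defines an abstract interface and a trivial rule-based backend. The class and method names (Extractor, RegexInvoiceExtractor, extract) are hypothetical and are not claimed to match BaseExtractor in this PR; the regex backend simply stands in for an LLM, spaCy, or Hugging Face backend.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Tuple

# (subject, predicate, object) as plain strings, for illustration only.
ExtractedTriple = Tuple[str, str, str]


class Extractor(ABC):
    """Hypothetical interface every extraction backend would implement."""

    @abstractmethod
    def extract(self, text: str) -> List[ExtractedTriple]:
        ...


class RegexInvoiceExtractor(Extractor):
    """Toy backend that recognizes one sentence pattern with a regex."""

    _pattern = re.compile(r"Invoice (\w+) was issued by ([\w ]+?) on")

    def extract(self, text: str) -> List[ExtractedTriple]:
        return [
            (f"Invoice {m.group(1)}", "issued_by", m.group(2))
            for m in self._pattern.finditer(text)
        ]


def run_extraction(extractor: Extractor, text: str) -> List[ExtractedTriple]:
    # The caller depends only on the interface, so backends can be swapped freely.
    return extractor.extract(text)


sentence = "Invoice 123 was issued by Vendor A on March 4, 2024 for an amount of $5,400."
triples = run_extraction(RegexInvoiceExtractor(), sentence)
print(triples)
```

Because the processor would depend only on the abstract interface, adding a new backend means adding one subclass without touching the RDF building or mapping steps.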

Deployment Notes

No special deployment requirements. The CLI command becomes available after installation.

Development

Successfully merging this pull request may close these issues.

[FEATURE] Unstructured text processor