feat: automated semantic mapping #85
Open
`sem_map.py` implements an automated semantic mapping pipeline that turns unstructured {id, text} documents into structured fields based on natural-language “concepts.” It uses an LLM for dynamic schema generation and Palimpzest to execute semantic extraction.

Core Functionality

- Dynamic schema generation: Converts concept phrases into typed {name, type, desc} columns using one of two strategies:
  - FLAT: Maps each concept directly to a canonical field.
  - HIERARCHY_FIRST: May decompose a concept into more granular sub-fields when useful.
- Semantic extraction (`sem_map`): Runs `pz.sem_map` over the dataset.
- Tagification & stats (`expand_sem_map_results_to_tags`): A post-processing step that flattens extracted entities into boolean tags and calculates selectivity statistics for each tag.

The script also includes a runnable usage example that loads text data, generates schemas, uses pz to execute semantic extraction, and tagifies the resulting columns.
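
For reviewers who want a concrete picture of the tagification step, here is a minimal sketch in plain Python. It is not the code in this PR: the helper name `tagify`, the `field:value` tag naming, and the sample records are all hypothetical, and selectivity is assumed here to mean the fraction of documents for which a tag is true. It does not call Palimpzest.

```python
from collections import Counter


def tagify(records, field):
    """Flatten a list-valued extracted field into boolean tag columns.

    Hypothetical helper for illustration only. Returns (rows, selectivity):
    rows are copies of the input records with one boolean "field:value" key
    per distinct value; selectivity maps each tag to the fraction of records
    that carry it (an assumed definition).
    """
    # Collect the distinct values seen in `field`; missing or None counts as empty.
    values = sorted({v for r in records for v in (r.get(field) or [])})
    rows = []
    counts = Counter()
    for r in records:
        present = set(r.get(field) or [])
        row = dict(r)  # copy so the input records are left untouched
        for v in values:
            tag = f"{field}:{v}"
            row[tag] = v in present
            counts[tag] += int(row[tag])
        rows.append(row)
    n = len(records)
    selectivity = {tag: counts[tag] / n for tag in counts} if n else {}
    return rows, selectivity


# Made-up records standing in for extraction output.
docs = [
    {"id": 1, "disease": ["asthma", "copd"]},
    {"id": 2, "disease": ["asthma"]},
    {"id": 3, "disease": []},
    {"id": 4, "disease": None},
]
rows, selectivity = tagify(docs, "disease")
```

Dividing by the total number of records, including those with no extracted value, is a design choice in this sketch; the script may normalize differently.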