@zhuohangu
Collaborator

sem_map.py implements an automated semantic mapping pipeline that turns unstructured {id, text} documents into structured fields based on natural-language “concepts.” It uses an LLM to generate the schema dynamically and Palimpzest (pz) to execute the semantic extraction.

Core Functionality

  • Dynamic schema generation: Converts concept phrases into typed {name, type, desc} columns using one of two strategies:

    • FLAT: Direct mapping of concepts to canonical fields.

    • HIERARCHY_FIRST: May decompose a concept into more granular sub-fields when useful.

  • Semantic extraction (sem_map): Runs pz.sem_map over the dataset to populate the generated columns.

  • Tagification and stats (expand_sem_map_results_to_tags): A post-processing step that "flattens" the extracted entities into boolean tags and calculates selectivity statistics for each tag.

  • Usage example: The script also includes a runnable example that loads text data, generates schemas, runs the semantic extraction with Palimpzest, and tagifies the resulting columns.
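
To make the tagification step concrete, here is a minimal sketch of what "flattening into boolean tags" and "selectivity" could look like. It is not the code in sem_map.py: the function name `tagify_sketch`, the `column:value` tag naming, the lowercasing of values, and the definition of selectivity as the fraction of records carrying a tag are all assumptions made for illustration.

```python
from collections import Counter


def tagify_sketch(records, columns):
    """Hypothetical helper: flatten extracted values into boolean tags.

    Returns (tagged_records, selectivity), where selectivity is assumed to
    mean the fraction of records for which a tag is True.
    """

    def as_values(value):
        # Treat None as "nothing extracted" and scalars as one-element lists.
        if value is None:
            return []
        if isinstance(value, (list, tuple, set)):
            return [str(v) for v in value]
        return [str(value)]

    per_record = []
    counts = Counter()
    for rec in records:
        tags = set()
        for col in columns:
            for v in as_values(rec.get(col)):
                v = v.strip().lower()  # assumed normalization
                if v:
                    tags.add(f"{col}:{v}")
        per_record.append(tags)
        counts.update(tags)  # each tag counted at most once per record

    all_tags = sorted(counts)
    tagged = [
        {"id": rec["id"], **{t: (t in tags) for t in all_tags}}
        for rec, tags in zip(records, per_record)
    ]
    n = len(records)
    selectivity = {t: counts[t] / n for t in all_tags} if n else {}
    return tagged, selectivity


# Toy extraction output; the column names are made up for this example.
records = [
    {"id": 1, "disease": ["Asthma", "COPD"], "drug": "albuterol"},
    {"id": 2, "disease": ["asthma"], "drug": None},
]
tagged, selectivity = tagify_sketch(records, ["disease", "drug"])
```

Every record ends up with the same set of boolean columns, which makes the output easy to filter on; the selectivity map then indicates how discriminative each tag is across the dataset.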

@zhuohangu zhuohangu requested a review from mdr223 December 18, 2025 23:01