Commit e642823

Post-process: strip "Unknown" placeholder values from conference metadata and optional string fields
1 parent 1546c64 commit e642823

3 files changed: +25 -3 lines changed

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -5,6 +5,13 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.1.7] - 2026-02-28
+
+### Fixed
+
+- Post-process: omit `sectionTitle` when empty instead of writing `""` (violates schema `minLength: 1`)
+- Post-process: strip "Unknown" placeholder values from conference metadata and optional string fields
+
 ## [0.1.6] - 2026-02-26
 
 ### Changed
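
The first 0.1.7 changelog entry can be illustrated with a minimal sketch. It assumes a schema rule of `minLength: 1` on `sectionTitle`, as stated in the entry. The helper names `build_section` and `violates_min_length` are hypothetical and are not part of poster2json; the second one is only a stand-in for a real schema validator.

```python
def build_section(title, content):
    """Return a section dict that carries sectionTitle only when it is non-empty."""
    entry = {"sectionContent": content}
    if title:  # "" and None are both falsy, so the key is simply left out
        entry["sectionTitle"] = title
    return entry


def violates_min_length(section, key="sectionTitle", min_length=1):
    """Hypothetical stand-in for a schema check: key is present but too short."""
    return key in section and len(section[key]) < min_length


# Pre-0.1.7 shape: an empty title is written out explicitly.
old_style = {"sectionTitle": "", "sectionContent": "Contact: lab@example.org"}

# 0.1.7 shape: the key is omitted when there is no title.
new_style = build_section("", "Contact: lab@example.org")
titled = build_section("Methods", "We sampled 40 posters.")
```

Under that assumption, `old_style` fails the length check while `new_style` passes, because an absent optional key has nothing to validate.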

poster2json/extract.py

Lines changed: 17 additions & 2 deletions
@@ -1184,7 +1184,10 @@ def _postprocess_json(data: dict, raw_text: str = "") -> dict:
             content.strip() if isinstance(content, str) else ""
         )
         if content and len(content) > 10:
-            cleaned_sections.append({"sectionTitle": title, "sectionContent": content})
+            entry = {"sectionContent": content}
+            if title:
+                entry["sectionTitle"] = title
+            cleaned_sections.append(entry)
     # Recover uncaptured raw text as untitled section(s).
     # The LLM sometimes drops footer content (contact info, URLs).
     # Compare raw text lines against section content and reclaim
@@ -1223,7 +1226,6 @@ def _postprocess_json(data: dict, raw_text: str = "") -> dict:
 
     if uncaptured and len(" ".join(uncaptured)) > 10:
         cleaned_sections.append({
-            "sectionTitle": "",
             "sectionContent": "\n".join(uncaptured),
         })
 
@@ -1247,6 +1249,19 @@ def _postprocess_json(data: dict, raw_text: str = "") -> dict:
 
     result = enrich_json_with_identifiers(result, raw_text)
 
+    # Strip "Unknown" placeholder values the LLM likes to hallucinate.
+    # These violate metadata quality expectations — better to omit than guess.
+    _UNKNOWN_RE = re.compile(r"^unknown\b", re.IGNORECASE)
+    if "conference" in result and isinstance(result["conference"], dict):
+        for key in list(result["conference"]):
+            val = result["conference"][key]
+            if isinstance(val, str) and _UNKNOWN_RE.match(val.strip()):
+                del result["conference"][key]
+    # Top-level optional string fields
+    for key in ("conferenceLocation", "publisher", "researchField"):
+        if key in result and isinstance(result[key], str) and _UNKNOWN_RE.match(result[key].strip()):
+            del result[key]
+
     return result
 
 

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 [tool.poetry]
 
 name = "poster2json"
-version = "0.1.6"
+version = "0.1.7"
 description = "Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models"
 
 packages = [{ include = "poster2json" }]
