Commit b6d4dd8

fix: remove arbitrary limits, fix hardcoded languages, and fix summarizer bugs
Stage 1 quality improvements from the Arbitrary Limits & Dead Code audit:

Reference file truncation removed:
- codebase_scraper.py: remove code[:500] truncation at 5 locations — reference files now contain complete code blocks for copy-paste usability
- unified_skill_builder.py: remove issues[:20], releases[:10], body[:500], and code_snippet[:300] caps in reference files — full content preserved

Enhancement summarizer rewrite:
- enhance_skill_local.py: replace arbitrary [:5] code block cap with character-budget approach using target_ratio * content_chars
- Fix intro boundary bug: track code block state so intro never ends inside a code block, which was desynchronizing the parser
- Remove dead _target_lines variable (assigned but never used)
- Heading chunks now also respect the character budget

Hardcoded language fixes:
- unified_skill_builder.py: test examples use ex["language"] instead of always "python" for syntax highlighting
- how_to_guide_builder.py: add language field to HowToGuide dataclass, set from workflow at creation, used in AI enhancement prompt

Test fixes:
- test_enhance_skill_local.py: rename test to test_code_blocks_not_arbitrarily_capped, fix assertion to count actual blocks (```count // 2), use target_ratio=0.9

Documentation:
- Add Stage 1 plan, implementation summary, review, and corrected docs
- Update CHANGELOG.md with all changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
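The intro boundary fix described in the commit message can be illustrated with a minimal sketch. It is not the code from `enhance_skill_local.py`: the function name `safe_intro_end` and the parameter `max_intro_lines` are hypothetical. The sketch only shows the idea of tracking fence state so that the intro never ends inside a code block.

```python
FENCE = "`" * 3  # a Markdown code fence marker, built here to keep the example self-contained


def safe_intro_end(lines: list[str], max_intro_lines: int) -> int:
    """Return the index at which the intro may end without splitting a fenced block.

    Hypothetical helper: the intro is allowed to run past max_intro_lines
    when that limit falls inside a code block.
    """
    in_code_block = False
    for i, line in enumerate(lines):
        # Every fence marker toggles the "inside a code block" state.
        if line.strip().startswith(FENCE):
            in_code_block = not in_code_block
        # End the intro only when the line limit is reached AND no block is open.
        if i + 1 >= max_intro_lines and not in_code_block:
            return i + 1
    # The limit was never reached outside a block: the whole input is the intro.
    return len(lines)
```

With a limit that lands in the middle of a fenced block, the returned index moves forward to just after the closing fence, so whatever parses the remainder starts with a consistent fence state.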
1 parent b81d55f commit b6d4dd8

10 files changed, +1190 -21 lines changed

CHANGELOG.md

Lines changed: 33 additions & 0 deletions
@@ -5,6 +5,39 @@ All notable changes to Skill Seeker will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### 📄 B2: Microsoft Word (.docx) Support & Stage 1 Quality Improvements

### Added
- **Microsoft Word (.docx) support** — New `skill-seekers word --docx <file>` command and `skill-seekers create document.docx` auto-detection. Full pipeline: mammoth → HTML → BeautifulSoup → sections → SKILL.md + references/
- `word_scraper.py` — `WordToSkillConverter` class (~600 lines) with heading/code/table/image/metadata extraction
- `arguments/word.py` — `add_word_arguments()` + `WORD_ARGUMENTS` dict
- `parsers/word_parser.py` — WordParser for unified CLI parser registry
- `tests/test_word_scraper.py` — comprehensive test suite (~300 lines)
- **`.docx` auto-detection** in `source_detector.py` — `create document.docx` routes to word scraper
- **`--help-word`** flag in create command for Word-specific help
- **Word support in unified scraper** — `_scrape_word()` method for multi-source scraping
- **`skill-seekers-word`** entry point in pyproject.toml
- **`docx` optional dependency group** — `pip install skill-seekers[docx]` (mammoth + python-docx)

### Fixed
- **Reference file code truncation removed** — `codebase_scraper.py` no longer truncates code blocks to 500 chars in reference files (5 locations fixed)
- **Enhancement code block limit replaced with token budget** — `enhance_skill_local.py` `summarize_reference()` now uses character-budget approach instead of arbitrary `[:5]` code block cap
- **Dead variable removed** — `_target_lines` in `enhance_skill_local.py:309` was assigned but never used
- **Intro boundary code block desync fixed** — `summarize_reference()` intro section could split inside a code block, desynchronizing the parser; now tracks code block state and ensures safe boundary
- **Test assertion corrected** — `test_code_blocks_not_arbitrarily_capped` now correctly counts code blocks (```count // 2) instead of raw marker count
- **Hardcoded `python` language in unified_skill_builder.py** — Test examples now use detected language (`ex["language"]`) instead of always `python`; code snippets no longer truncated to 300 chars
- **Hardcoded `python` language in how_to_guide_builder.py** — Added `language` field to `HowToGuide` dataclass, flows from test extractor → workflow → guide → AI prompt
- **GitHub reference file limits removed** — `unified_skill_builder.py` no longer caps issues at 20, releases at 10, or release bodies at 500 chars in reference files
- **GitHub scraper reference limits removed** — `github_scraper.py` no longer caps open_issues at 20 or closed_issues at 10
- **PDF scraper fixes** — Real API/LOCAL enhancement (was stub); removed `[:3]` reference file limit
- **Word scraper code detection** — Detect mammoth monospace `<p><br>` blocks as code (not `<pre>/<code>`)
- **Language detector method** — Fixed `detect_from_text` → `detect_from_code` in word scraper

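Two of the fixes listed above are easy to get wrong, so a small sketch may help: counting fenced blocks as marker pairs rather than raw markers, and labelling each fence with the detected language instead of a fixed `python`. The function names and the `"code"` key are assumptions made for illustration; only `ex["language"]` appears in the changelog.

```python
FENCE = "`" * 3  # a Markdown code fence marker


def count_code_blocks(markdown: str) -> int:
    """Count fenced blocks: each block contributes an opening and a closing marker."""
    return markdown.count(FENCE) // 2


def render_example(ex: dict) -> str:
    """Render one example with its own language tag.

    `ex["language"]` follows the changelog entry; the "code" key is hypothetical.
    """
    language = ex.get("language", "")
    return f"{FENCE}{language}\n{ex['code']}\n{FENCE}"
```

Counting raw markers would report four for two blocks, which is the kind of mismatch the corrected test assertion avoids.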
### Changed
- **Enhancement summarizer architecture** — Character-budget approach respects `target_ratio` for both code blocks and heading chunks, replacing hard limits with proportional allocation
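
As a rough illustration of the character-budget idea, the sketch below keeps chunks in order until a budget of `target_ratio * content_chars` would be exceeded. It is an assumption-laden sketch, not the `summarize_reference()` implementation: the function name `select_within_budget` and the stop-at-first-overflow policy are hypothetical choices.

```python
def select_within_budget(chunks: list[str], content_chars: int, target_ratio: float) -> list[str]:
    """Keep leading chunks whose combined length fits the character budget.

    Hypothetical helper: the budget scales with the input size instead of
    being a fixed item count such as the first five code blocks.
    """
    budget = int(target_ratio * content_chars)
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        # Stop at the first chunk that would overflow the budget.
        if used + len(chunk) > budget:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept
```

Because the budget is proportional to the content size, a long reference file keeps more blocks than a short one at the same `target_ratio`, which is the behavior a fixed cap cannot provide.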

## [3.1.3] - 2026-02-24

### 🐛 Hotfix — Explicit Chunk Flags & Argument Pipeline Cleanup
