Commit b81d55f

feat(B2): add Microsoft Word (.docx) support

Implements ROADMAP task B2: full .docx scraping support via mammoth + python-docx, producing SKILL.md + references/ output identical to other source types.

New files:
- src/skill_seekers/cli/word_scraper.py: WordToSkillConverter class + main() entry point (~600 lines); mammoth → BeautifulSoup pipeline; handles headings, code detection (incl. monospace <p><br> blocks), tables, images, metadata extraction
- src/skill_seekers/cli/arguments/word.py: add_word_arguments() + WORD_ARGUMENTS dict
- src/skill_seekers/cli/parsers/word_parser.py: WordParser for unified CLI parser registry
- tests/test_word_scraper.py: comprehensive test suite (~300 lines)

Modified files:
- src/skill_seekers/cli/main.py: registered "word" command module
- src/skill_seekers/cli/source_detector.py: .docx auto-detection + _detect_word() classmethod
- src/skill_seekers/cli/create_command.py: _route_word() + --help-word
- src/skill_seekers/cli/arguments/create.py: WORD_ARGUMENTS + routing
- src/skill_seekers/cli/arguments/__init__.py: export word args
- src/skill_seekers/cli/parsers/__init__.py: register WordParser
- src/skill_seekers/cli/unified_scraper.py: _scrape_word() integration
- src/skill_seekers/cli/pdf_scraper.py: fix: real enhancement instead of stub; remove [:3] reference file limit; capture run_workflows return
- src/skill_seekers/cli/github_scraper.py: fix: remove arbitrary open_issues[:20] / closed_issues[:10] reference file limits
- pyproject.toml: skill-seekers-word entry point + docx optional dep
- tests/test_cli_parsers.py: update parser count 21→22

Bug fixes applied during real-world testing:
- Code detection: detect monospace <p><br> blocks as code (mammoth renders Courier paragraphs this way, not as <pre>/<code>)
- Language detector: fix wrong method name detect_from_text → detect_from_code
- Description inference: pass None from main() so extract_docx() can infer description from Word document subject/title metadata
- Bullet-point guard: exclude prose starting with •/-/* from code scoring
- Enhancement: implement real API/LOCAL enhancement (was stub)
- pip install message: add quotes around skill-seekers[docx]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1 parent e42aade commit b81d55f
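
The code-detection and bullet-point-guard fixes listed in the commit message can be illustrated with a small sketch. This is not the code from word_scraper.py, which is not shown in this diff: the function names, the hint regex, and the "at least half the lines" threshold are all assumptions made for illustration. It operates on the text lines that a `<p>` with `<br>` breaks would yield after HTML parsing.

```python
import re

# Hypothetical markers; the real scorer in word_scraper.py may differ.
BULLET_PREFIXES = ("•", "-", "*")
CODE_HINTS = re.compile(r"[{}();=]|^\s*(def|class|import|from|return)\b")


def starts_like_bullet(line: str) -> bool:
    """True for prose bullets such as '• item' or '- item', but not '-x' or '*args'."""
    stripped = line.lstrip()
    return stripped.startswith(BULLET_PREFIXES) and stripped[1:2] in ("", " ", "\t")


def looks_like_code(lines: list[str]) -> bool:
    """Score the lines of one monospace paragraph; bullet prose never counts as code."""
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return False
    # Bullet-point guard: a block that opens with a bullet marker is prose.
    if starts_like_bullet(non_empty[0]):
        return False
    # Assumed threshold: at least half of the lines carry a code hint.
    hits = sum(1 for ln in non_empty if CODE_HINTS.search(ln))
    return hits * 2 >= len(non_empty)
```

The guard runs before scoring, so a bulleted sentence that happens to contain parentheses or an equals sign is not misclassified.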

17 files changed: +2215 -68 lines changed

AGENTS.md

Lines changed: 24 additions & 50 deletions
@@ -12,10 +12,12 @@ This file provides essential guidance for AI coding agents working with the Skil
 
 | Attribute | Value |
 |-----------|-------|
-| **Current Version** | 3.0.0 |
+| **Current Version** | 3.1.3 |
 | **Python Version** | 3.10+ (tested on 3.10, 3.11, 3.12, 3.13) |
 | **License** | MIT |
 | **Package Name** | `skill-seekers` (PyPI) |
+| **Source Files** | 169 Python files |
+| **Test Files** | 101 test files |
 | **Website** | https://skillseekersweb.com/ |
 | **Repository** | https://github.com/yusufkaraaslan/Skill_Seekers |
 
@@ -55,7 +57,7 @@ This file provides essential guidance for AI coding agents working with the Skil
 ```
 /mnt/1ece809a-2821-4f10-aecb-fcdf34760c0b/Git/Skill_Seekers/
 ├── src/skill_seekers/ # Main source code (src/ layout)
-│   ├── cli/ # CLI tools and commands (~42k lines)
+│   ├── cli/ # CLI tools and commands (~70 modules)
 │   │   ├── adaptors/ # Platform adaptors (Strategy pattern)
 │   │   │   ├── base.py # Abstract base class (SkillAdaptor)
 │   │   │   ├── claude.py # Claude AI adaptor
@@ -70,12 +72,6 @@ This file provides essential guidance for AI coding agents working with the Skil
 │   │   │   ├── qdrant.py # Qdrant vector DB adaptor
 │   │   │   ├── weaviate.py # Weaviate vector DB adaptor
 │   │   │   └── streaming_adaptor.py # Streaming output adaptor
-│   │   ├── storage/ # Cloud storage backends
-│   │   │   ├── base_storage.py # Storage interface
-│   │   │   ├── s3_storage.py # AWS S3 support
-│   │   │   ├── gcs_storage.py # Google Cloud Storage
-│   │   │   └── azure_storage.py # Azure Blob Storage
-│   │   ├── parsers/ # CLI argument parsers
 │   │   ├── arguments/ # CLI argument definitions
 │   │   ├── presets/ # Preset configuration management
 │   │   ├── main.py # Unified CLI entry point
@@ -85,6 +81,7 @@ This file provides essential guidance for AI coding agents working with the Skil
 │   │   ├── pdf_scraper.py # PDF extraction
 │   │   ├── unified_scraper.py # Multi-source scraping
 │   │   ├── codebase_scraper.py # Local codebase analysis
+│   │   ├── enhance_command.py # AI enhancement command
 │   │   ├── enhance_skill_local.py # AI enhancement (local mode)
 │   │   ├── package_skill.py # Skill packager
 │   │   ├── upload_skill.py # Upload to platforms
@@ -101,8 +98,8 @@ This file provides essential guidance for AI coding agents working with the Skil
 │   │   ├── source_manager.py # Config source management
 │   │   └── tools/ # MCP tool implementations
 │   │       ├── config_tools.py # Configuration tools
-│   │       ├── scraping_tools.py # Scraping tools
 │   │       ├── packaging_tools.py # Packaging tools
+│   │       ├── scraping_tools.py # Scraping tools
 │   │       ├── source_tools.py # Source management tools
 │   │       ├── splitting_tools.py # Config splitting tools
 │   │       ├── vector_db_tools.py # Vector database tools
@@ -124,7 +121,7 @@ This file provides essential guidance for AI coding agents working with the Skil
 │   ├── workflows/ # YAML workflow presets
 │   ├── _version.py # Version information (reads from pyproject.toml)
 │   └── __init__.py # Package init
-├── tests/ # Test suite (98 test files)
+├── tests/ # Test suite (101 test files)
 ├── configs/ # Preset configuration files
 ├── docs/ # Documentation (80+ markdown files)
 │   ├── integrations/ # Platform integration guides
@@ -134,17 +131,6 @@ This file provides essential guidance for AI coding agents working with the Skil
 │   ├── blog/ # Blog posts
 │   └── roadmap/ # Roadmap documents
 ├── examples/ # Usage examples
-│   ├── langchain-rag-pipeline/ # LangChain example
-│   ├── llama-index-query-engine/ # LlamaIndex example
-│   ├── pinecone-upsert/ # Pinecone example
-│   ├── chroma-example/ # Chroma example
-│   ├── weaviate-example/ # Weaviate example
-│   ├── qdrant-example/ # Qdrant example
-│   ├── faiss-example/ # FAISS example
-│   ├── haystack-pipeline/ # Haystack example
-│   ├── cursor-react-skill/ # Cursor IDE example
-│   ├── windsurf-fastapi-context/ # Windsurf example
-│   └── continue-dev-universal/ # Continue.dev example
 ├── .github/workflows/ # CI/CD workflows
 ├── pyproject.toml # Main project configuration
 ├── requirements.txt # Pinned dependencies
@@ -259,7 +245,7 @@ pytest tests/ -v -m "not slow and not integration"
 
 ### Test Architecture
 
-- **98 test files** covering all features
+- **101 test files** covering all features
 - **1880+ tests** passing
 - CI Matrix: Ubuntu + macOS, Python 3.10-3.12
 - Test markers defined in `pyproject.toml`:
@@ -316,22 +302,19 @@ mypy src/skill_seekers --show-error-codes --pretty
 - **Ignored rules:** E501, F541, ARG002, B007, I001, SIM114
 - **Import sorting:** isort style with `skill_seekers` as first-party
 
-### MyPy Configuration (from mypy.ini)
-
-```ini
-[mypy]
-python_version = 3.10
-warn_return_any = False
-warn_unused_configs = True
-disallow_untyped_defs = False
-check_untyped_defs = True
-ignore_missing_imports = True
-no_implicit_optional = True
-show_error_codes = True
-
-# Gradual typing - be lenient for now
-disallow_incomplete_defs = False
-disallow_untyped_calls = False
+### MyPy Configuration (from pyproject.toml)
+
+```toml
+[tool.mypy]
+python_version = "3.10"
+warn_return_any = true
+warn_unused_configs = true
+disallow_untyped_defs = false
+disallow_incomplete_defs = false
+check_untyped_defs = true
+ignore_missing_imports = true
+show_error_codes = true
+pretty = true
 ```
 
 ### Code Conventions
@@ -662,17 +645,6 @@ Preset configs are in `configs/` directory:
 - `astrovalley_unified.json` - Astrovalley
 - `configs/integrations/` - Integration-specific configs
 
-### Configuration Documentation
-
-Preset configs are in `configs/` directory:
-- `godot.json` - Godot Engine
-- `blender.json` / `blender-unified.json` - Blender Engine
-- `claude-code.json` - Claude Code
-- `httpx_comprehensive.json` - HTTPX library
-- `medusa-mercurjs.json` - Medusa/MercurJS
-- `astrovalley_unified.json` - Astrovalley
-- `configs/integrations/` - Integration-specific configs
-
 ---
 
 ## Key Dependencies
@@ -700,6 +672,8 @@ Preset configs are in `configs/` directory:
 | `python-dotenv` | >=1.1.1 | Environment variables |
 | `jsonschema` | >=4.25.1 | JSON validation |
 | `PyYAML` | >=6.0 | YAML parsing |
+| `langchain` | >=1.2.10 | LangChain integration |
+| `llama-index` | >=0.14.15 | LlamaIndex integration |
 
 ### Optional Dependencies
 
@@ -852,4 +826,4 @@ Skill Seekers uses JSON configuration files to define scraping targets. Example
 
 *This document is maintained for AI coding agents. For human contributors, see README.md and CONTRIBUTING.md.*
 
-*Last updated: 2026-02-16*
+*Last updated: 2026-02-24*

pyproject.toml

Lines changed: 9 additions & 0 deletions
@@ -109,6 +109,12 @@ azure = [
     "azure-storage-blob>=12.19.0",
 ]
 
+# Word document (.docx) support
+docx = [
+    "mammoth>=1.6.0",
+    "python-docx>=1.1.0",
+]
+
 # RAG vector database upload support
 chroma = [
     "chromadb>=0.4.0",
@@ -146,6 +152,8 @@ embedding = [
 
 # All optional dependencies combined (dev dependencies now in [dependency-groups])
 all = [
+    "mammoth>=1.6.0",
+    "python-docx>=1.1.0",
     "mcp>=1.25,<2",
     "httpx>=0.28.1",
     "httpx-sse>=0.4.3",
@@ -186,6 +194,7 @@ skill-seekers-resume = "skill_seekers.cli.resume_command:main"
 skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
 skill-seekers-github = "skill_seekers.cli.github_scraper:main"
 skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
+skill-seekers-word = "skill_seekers.cli.word_scraper:main"
 skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
 skill-seekers-enhance = "skill_seekers.cli.enhance_command:main"
 skill-seekers-enhance-status = "skill_seekers.cli.enhance_status:main"
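
Because mammoth and python-docx are declared as an optional `docx` extra, a scraper would plausibly check for them before running and point users at the quoted install command mentioned in the commit message. The sketch below is an assumption about how such a check could look; `DOCX_EXTRA`, `missing_docx_dependencies`, and `install_hint` are hypothetical names, and only the package names and the `skill-seekers[docx]` target come from this commit.

```python
import importlib.util

# Import name -> PyPI distribution name, matching the docx extra above.
DOCX_EXTRA = {"mammoth": "mammoth", "docx": "python-docx"}


def missing_docx_dependencies() -> list[str]:
    """Return the distributions from the docx extra that cannot be imported."""
    return [
        dist
        for module, dist in DOCX_EXTRA.items()
        if importlib.util.find_spec(module) is None
    ]


def install_hint(missing: list[str]) -> str:
    """Build a user-facing message for the given missing distributions."""
    if not missing:
        return "Word support is available."
    # The quotes keep shells such as zsh from treating [docx] as a glob pattern.
    return (
        f"Missing: {', '.join(missing)}. "
        'Install with: pip install "skill-seekers[docx]"'
    )
```

Calling `install_hint(missing_docx_dependencies())` at startup would give a single actionable message instead of a raw ImportError.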

src/skill_seekers/cli/arguments/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,7 @@
 from .scrape import add_scrape_arguments, SCRAPE_ARGUMENTS
 from .github import add_github_arguments, GITHUB_ARGUMENTS
 from .pdf import add_pdf_arguments, PDF_ARGUMENTS
+from .word import add_word_arguments, WORD_ARGUMENTS
 from .analyze import add_analyze_arguments, ANALYZE_ARGUMENTS
 from .unified import add_unified_arguments, UNIFIED_ARGUMENTS
 from .package import add_package_arguments, PACKAGE_ARGUMENTS
@@ -38,11 +39,13 @@
     "add_package_arguments",
     "add_upload_arguments",
     "add_enhance_arguments",
+    "add_word_arguments",
     # Data
     "COMMON_ARGUMENTS",
     "SCRAPE_ARGUMENTS",
     "GITHUB_ARGUMENTS",
     "PDF_ARGUMENTS",
+    "WORD_ARGUMENTS",
     "ANALYZE_ARGUMENTS",
     "UNIFIED_ARGUMENTS",
     "PACKAGE_ARGUMENTS",

src/skill_seekers/cli/arguments/create.py

Lines changed: 19 additions & 1 deletion
@@ -389,6 +389,18 @@
     },
 }
 
+# Word document specific (from word.py)
+WORD_ARGUMENTS: dict[str, dict[str, Any]] = {
+    "docx": {
+        "flags": ("--docx",),
+        "kwargs": {
+            "type": str,
+            "help": "DOCX file path",
+            "metavar": "PATH",
+        },
+    },
+}
+
 # Multi-source config specific (from unified_scraper.py)
 CONFIG_ARGUMENTS: dict[str, dict[str, Any]] = {
     "merge_mode": {
@@ -471,6 +483,7 @@ def get_source_specific_arguments(source_type: str) -> dict[str, dict[str, Any]]
         "github": GITHUB_ARGUMENTS,
         "local": LOCAL_ARGUMENTS,
         "pdf": PDF_ARGUMENTS,
+        "word": WORD_ARGUMENTS,
         "config": CONFIG_ARGUMENTS,
     }
     return source_args.get(source_type, {})
@@ -507,12 +520,13 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
         - 'github': Universal + github-specific
         - 'local': Universal + local-specific
         - 'pdf': Universal + pdf-specific
+        - 'word': Universal + word-specific
         - 'advanced': Advanced/rare arguments
         - 'all': All 120+ arguments
 
     Args:
         parser: ArgumentParser to add arguments to
-        mode: Help mode (default, web, github, local, pdf, advanced, all)
+        mode: Help mode (default, web, github, local, pdf, word, advanced, all)
     """
     # Positional argument for source
     parser.add_argument(
@@ -543,6 +557,10 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
         for arg_name, arg_def in PDF_ARGUMENTS.items():
             parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
 
+    if mode in ["word", "all"]:
+        for arg_name, arg_def in WORD_ARGUMENTS.items():
+            parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
+
     if mode in ["config", "all"]:
         for arg_name, arg_def in CONFIG_ARGUMENTS.items():
             parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

src/skill_seekers/cli/arguments/word.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+"""Word document command argument definitions.
+
+This module defines ALL arguments for the word command in ONE place.
+Both word_scraper.py (standalone) and parsers/word_parser.py (unified CLI)
+import and use these definitions.
+
+Shared arguments (name, description, output, enhance-level, api-key,
+dry-run, verbose, quiet, workflow args) come from common.py / workflow.py
+via ``add_all_standard_arguments()``.
+"""
+
+import argparse
+from typing import Any
+
+from .common import add_all_standard_arguments
+
+# Word-specific argument definitions as data structure
+# NOTE: Shared args (name, description, output, enhance_level, api_key, dry_run,
+# verbose, quiet, workflow args) are registered by add_all_standard_arguments().
+WORD_ARGUMENTS: dict[str, dict[str, Any]] = {
+    "docx": {
+        "flags": ("--docx",),
+        "kwargs": {
+            "type": str,
+            "help": "Direct DOCX file path",
+            "metavar": "PATH",
+        },
+    },
+    "from_json": {
+        "flags": ("--from-json",),
+        "kwargs": {
+            "type": str,
+            "help": "Build skill from extracted JSON",
+            "metavar": "FILE",
+        },
+    },
+}
+
+
+def add_word_arguments(parser: argparse.ArgumentParser) -> None:
+    """Add all word command arguments to a parser.
+
+    Registers shared args (name, description, output, enhance-level, api-key,
+    dry-run, verbose, quiet, workflow args) via add_all_standard_arguments(),
+    then adds Word-specific args on top.
+
+    The default for --enhance-level is overridden to 0 (disabled) for Word.
+    """
+    # Shared universal args first
+    add_all_standard_arguments(parser)
+
+    # Override enhance-level default to 0 for Word
+    for action in parser._actions:
+        if hasattr(action, "dest") and action.dest == "enhance_level":
+            action.default = 0
+            action.help = (
+                "AI enhancement level (auto-detects API vs LOCAL mode): "
+                "0=disabled (default for Word), 1=SKILL.md only, 2=+architecture/config, 3=full enhancement. "
+                "Mode selection: uses API if ANTHROPIC_API_KEY is set, otherwise LOCAL (Claude Code)"
+            )
+
+    # Word-specific args
+    for arg_name, arg_def in WORD_ARGUMENTS.items():
+        flags = arg_def["flags"]
+        kwargs = arg_def["kwargs"]
+        parser.add_argument(*flags, **kwargs)
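
The data-driven registration and the enhance-level override in word.py can be tried in isolation. The sketch below is a stand-alone approximation, not the project's code: `add_standard_arguments_stub` is a hypothetical stand-in for `add_all_standard_arguments()`, its default of 2 is an assumption, and the override uses the public `set_defaults()` call instead of walking `parser._actions` as the diff does.

```python
import argparse
from typing import Any

# Same shape as WORD_ARGUMENTS in the diff: flags plus add_argument() kwargs.
WORD_ARGUMENTS: dict[str, dict[str, Any]] = {
    "docx": {
        "flags": ("--docx",),
        "kwargs": {"type": str, "help": "Direct DOCX file path", "metavar": "PATH"},
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {"type": str, "help": "Build skill from extracted JSON", "metavar": "FILE"},
    },
}


def add_standard_arguments_stub(parser: argparse.ArgumentParser) -> None:
    """Hypothetical stand-in for add_all_standard_arguments(); default 2 is assumed."""
    parser.add_argument("--enhance-level", type=int, choices=[0, 1, 2, 3], default=2)


def build_word_parser() -> argparse.ArgumentParser:
    """Register shared args, override one default, then add the Word-specific args."""
    parser = argparse.ArgumentParser(prog="skill-seekers-word")
    add_standard_arguments_stub(parser)
    # set_defaults() also updates the default stored on the existing action.
    parser.set_defaults(enhance_level=0)
    for arg_def in WORD_ARGUMENTS.values():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
    return parser
```

Note that `set_defaults()` changes only the default value; rewriting the help text, as the diff does, still requires access to the action object.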
