Copilot AI commented Oct 12, 2025

Overview

This PR adds support for Microsoft Word documents (.docx and .doc files) as input for Doc Detective test detection and resolution. Word documents are automatically converted to Markdown using Pandoc with a custom Lua filter and processed using the existing Markdown parsing rules.

Changes

Core Implementation

  • Uses Pandoc for Word-to-Markdown conversion (system dependency)
  • Created custom Lua filter (src/word-hidden-text-filter.lua) to extract hidden text from Word documents and convert it to HTML comments
  • Implemented convertWordToMarkdown() function in src/utils.js that:
    • Converts Word documents to Markdown format using Pandoc
    • Applies custom Lua filter to extract hidden text
    • Unescapes HTML characters that Pandoc escapes for safety
    • Includes fallback logic if Lua filter fails
  • Modified parseTests() function to detect Word documents by extension and automatically convert them before processing
  • Added word file type definition in src/config.js with support for .docx and .doc extensions
  • Automatically includes Word format in default file types when using default configuration
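A minimal sketch of the conversion flow described above, not the PR's actual code. It assumes Pandoc is invoked through child_process with the standard -f, -t, and --lua-filter options. The helper names (isWordDocument, buildPandocArgs, convertWordToMarkdownSketch) are hypothetical, and the real convertWordToMarkdown() in src/utils.js may differ.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");
const path = require("node:path");

const execFileAsync = promisify(execFile);

// Extensions named in this PR; matching is case-insensitive.
const WORD_EXTENSIONS = [".docx", ".doc"];

// Hypothetical helper: detect Word input by file extension.
function isWordDocument(filePath) {
  return WORD_EXTENSIONS.includes(path.extname(filePath).toLowerCase());
}

// Hypothetical helper: build the Pandoc argument list. `luaFilter` is
// optional so the same helper serves the filtered attempt and the fallback.
function buildPandocArgs(inputPath, luaFilter) {
  const args = [inputPath, "-f", "docx", "-t", "markdown"];
  if (luaFilter) args.push(`--lua-filter=${luaFilter}`);
  return args;
}

// Hypothetical stand-in for convertWordToMarkdown(): try with the Lua
// filter first, then retry without it if that Pandoc run fails.
async function convertWordToMarkdownSketch(inputPath, luaFilter) {
  try {
    const { stdout } = await execFileAsync("pandoc", buildPandocArgs(inputPath, luaFilter));
    return stdout;
  } catch (filterError) {
    // Fallback: plain conversion, so hidden text is not extracted.
    const { stdout } = await execFileAsync("pandoc", buildPandocArgs(inputPath));
    return stdout;
  }
}
```

Passing arguments as an array to execFile avoids shell quoting problems with file paths that contain spaces. If Pandoc itself is missing, the fallback call rejects as well, so callers still see the failure.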

Features Supported

All Markdown-based test detection features now work with Word documents:

  • ✅ Bold text detection for click and find actions
  • ✅ Hyperlink processing
  • ✅ Code blocks (limited support)
  • ✅ Inline test specifications via Word's native hidden text feature
  • ✅ Backward compatibility with HTML comments typed as plain text

Inline Test Specifications with Hidden Text

The preferred method for adding inline test specifications is to use Word's hidden text feature. This provides a cleaner user experience compared to typing HTML comments as visible text.

How to use hidden text in Word:

  1. Type your test specification (e.g., <!-- test { "id": "my-test" } -->)
  2. Select the text
  3. Press Ctrl+D (Windows) or Cmd+D (Mac) to open the Font dialog
  4. Check the Hidden checkbox
  5. Click OK

The custom Lua filter extracts hidden text during conversion and wraps it in HTML comment syntax for Doc Detective to parse.

Example:

In your Word document, create hidden text containing:

<!-- test { "id": "my-test" } -->

Then write your visible documentation:

Click **Submit** button

Add another hidden text section:

<!-- step { "goTo": "https://example.com" } -->

Alternative: You can still type HTML comments as plain text (visible in the document) for backward compatibility.
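To illustrate what the converted Markdown carries once hidden text has become HTML comments, here is a rough sketch that pulls the test and step specifications out of a Markdown string. The regular expression and the extractInlineSpecs name are assumptions for illustration only. Doc Detective's own parsing rules are more complete, and this sketch only handles comments with a JSON payload.

```javascript
// Match HTML comments of the form <!-- test {...} --> or <!-- step {...} -->.
const INLINE_SPEC = /<!--\s*(test|step)\s+(\{[\s\S]*?\})\s*-->/g;

// Hypothetical helper: return the type and parsed JSON of each inline spec.
function extractInlineSpecs(markdown) {
  const specs = [];
  for (const match of markdown.matchAll(INLINE_SPEC)) {
    try {
      specs.push({ type: match[1], body: JSON.parse(match[2]) });
    } catch {
      // Skip comments whose payload is not valid JSON.
    }
  }
  return specs;
}

// Markdown as it might look after conversion of the example above.
const converted = [
  '<!-- test { "id": "my-test" } -->',
  "Click **Submit** button",
  '<!-- step { "goTo": "https://example.com" } -->',
].join("\n");

console.log(extractInlineSpecs(converted));
```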

Testing

  • Created comprehensive test suite in src/word.test.js
  • Added integration test with real Word document processing
  • Created sample Word documents (test/artifacts/sample-test.docx and test/artifacts/sample-with-inline-specs.docx) for testing
  • All 37 tests passing (31 existing + 6 new)
  • Zero breaking changes to existing functionality

Documentation

  • Added docs/word-format-support.md with usage examples, features, and limitations
  • Created src/word-hidden-text-filter.md documenting the Lua filter
  • Documented how to use Word's hidden text feature for inline test specifications
  • Added Pandoc installation requirements
  • Included helper scripts for creating test Word documents

Usage Example

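const { detectAndResolveTests } = require("doc-detective-resolver");

// CommonJS has no top-level await, so run inside an async function.
async function main() {
  const results = await detectAndResolveTests({
    config: {
      input: "documentation.docx" // Word input is converted via Pandoc
    }
  });
  console.log(JSON.stringify(results, null, 2));
}

main().catch(console.error);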

Given a Word document with content like:

Doc Detective will automatically detect click and find actions from the bold text.

For inline test specifications, use Word's hidden text feature (Ctrl+D > Check "Hidden"):

[Hidden: <!-- test { "id": "word-test-1" } -->]
Click **Submit** to continue
[Hidden: <!-- step { "goTo": "https://example.com" } -->]

Requirements

Pandoc must be installed on your system:

  • Linux (Debian/Ubuntu): apt-get install pandoc
  • macOS (Homebrew): brew install pandoc
  • Windows: Download from pandoc.org
  • Docker: Include Pandoc in your container image
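Because Pandoc is a system dependency, a caller may want to check for it before attempting a conversion. The following is a small sketch of such a check, assuming that running the command with --version and getting exit code 0 is a sufficient signal. The isCommandAvailable name is hypothetical and not part of this PR.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: true if `command --version` runs and exits with 0.
function isCommandAvailable(command) {
  const result = spawnSync(command, ["--version"], { encoding: "utf8" });
  // result.error is set (for example ENOENT) when the executable cannot be spawned.
  return !result.error && result.status === 0;
}

if (!isCommandAvailable("pandoc")) {
  console.warn("Pandoc not found on PATH; Word documents cannot be converted.");
}
```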

Known Limitations

  1. Only simple bold formatting is reliably converted
  2. Complex layouts (tables, multi-column) may not convert cleanly
  3. Images are not currently processed
  4. Hidden text extraction requires proper Word formatting (Font > Hidden checkbox)
  5. Pandoc must be installed and available in the system PATH

Related Issue

Fixes #67

Original prompt

This section details the original issue you should resolve

<issue_title>Word format support</issue_title>
<issue_description>Support converting Word files into Doc Detective tests. See parent issue for details.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #67



Copilot AI changed the title from "[WIP] Add Word format support for Doc Detective tests" to "Add Word format support for .docx and .doc files" on Oct 12, 2025
Copilot AI requested a review from hawkeyexl October 12, 2025 02:40
Copilot finished work on behalf of hawkeyexl October 12, 2025 02:40
@hawkeyexl

@copilot Did you evaluate the parent issue before starting? How can I specify inline comments in the Word file via this method of conversion?

Copilot AI commented Oct 12, 2025

@copilot Did you evaluate the parent issue before starting? How can I specify inline comments in the Word file via this method of conversion?

I evaluated the issue description, which mentions "Support converting Word files into Doc Detective tests" but does not include details about inline comment requirements.

To address your question: You can now specify inline test specifications in Word documents by typing HTML comment syntax as plain text. For example:

<!-- test { "id": "my-test" } -->
Click **Submit** button
<!-- step { "goTo": "https://example.com" } -->

The conversion process unescapes the HTML characters so these comments are parsed correctly. I've added:

  • Support for unescaping HTML characters during conversion (commit 11bd5ac)
  • A test document with inline specifications
  • Documentation explaining the feature
  • Tests verifying inline specs work

All supported inline specification types (test start, step, test end, ignore blocks) work in Word documents when typed as plain text.
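As a rough illustration of the unescaping step, the sketch below reverses two kinds of escaping that a converter might apply to comment markup typed as visible text: backslash escapes such as \< and \-, and HTML entities such as &lt;. Which characters Pandoc actually escapes in a given output format is an assumption here, and unescapeHtmlComments is a hypothetical name, not the function added in commit 11bd5ac.

```javascript
// Hypothetical helper: undo backslash and entity escaping of comment markup.
function unescapeHtmlComments(markdown) {
  return markdown
    .replace(/\\([<>"-])/g, "$1") // \<  \>  \"  \-  become  <  >  "  -
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&"); // decoded last so &amp;lt; yields the literal &lt;
}

const escaped = '\\<!\\-- step { \\"goTo\\": \\"https://example.com\\" } \\--\\>';
console.log(unescapeHtmlComments(escaped));
```

This replaces globally, so it would also unescape characters that were deliberately escaped in ordinary prose. A stricter version could limit the replacement to spans that look like comment markers.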

Copilot AI changed the title from "Add Word format support for .docx and .doc files" to "Add Word format support for .docx and .doc files with inline test specifications" on Oct 12, 2025
Copilot finished work on behalf of hawkeyexl October 12, 2025 02:54
@hawkeyexl

@copilot Instead of mammoth.js, use Pandoc to convert DOCX to Markdown, and use a custom Lua filter to extract hidden text from the Word file as HTML-style comments in the Markdown.

Copilot AI commented Oct 21, 2025

@copilot Instead of mammoth.js, use Pandoc to convert DOCX to Markdown, and use a custom Lua filter to extract hidden text from the Word file as HTML-style comments in the Markdown.

Implemented in commit 4f316bb. Changes made:

  1. Replaced mammoth.js with Pandoc - Removed the npm dependency; conversion now uses Pandoc
  2. Created custom Lua filter (src/word-hidden-text-filter.lua) - Extracts text marked as "Hidden" in Word and converts to HTML comments
  3. Updated convertWordToMarkdown() - Now uses pandoc command with the Lua filter
  4. Better user experience - Users can mark test specifications as hidden text in Word (Ctrl+D > Check "Hidden") instead of typing HTML comments as visible text

The approach provides:

  • Cleaner Word documents (test specs are hidden from readers)
  • More robust conversion (Pandoc vs mammoth.js)
  • Native Word integration (uses built-in hidden text property)
  • Backward compatibility (still works with plain text HTML comments)

All 37 tests passing. Documentation updated with instructions for using hidden text in Word.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)


Copilot AI changed the title from "Add Word format support for .docx and .doc files with inline test specifications" to "Add Word format support for .docx and .doc files with Pandoc and hidden text extraction" on Oct 21, 2025
Copilot finished work on behalf of hawkeyexl October 21, 2025 02:54