
feat: add client.parse() for the Data Extraction API (/extraction/parse) #12

Merged
nickwinder merged 13 commits into main from feat/task-135-parse-ga on May 29, 2026

Conversation

@nickwinder
Collaborator

@nickwinder nickwinder commented May 27, 2026

Summary

Adds first-class TypeScript client support for the Data Extraction API (/extraction/parse), which is now generally available. Mirrors the Python sibling PR.

Changes Made

Public surface

  • New client.parse() covering all four processing modes (text, structure, understand, agentic) and both output formats (spatial element list, whole-document markdown).
  • Two convenience wrappers — client.parseToMarkdown() (markdown-only return) and client.parseElements() (spatial-only return, with mode='text' excluded at the type level since the API rejects that combination).
  • Typed ParseResponse envelope with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting) — if (element.type === 'table') { ... } narrows correctly via the type discriminator.
  • NutrientClient accepts a new optional extractApiKey constructor option (string or async getter) for the Data Extraction product key, which is separate from the Processor key. parse() prefers extractApiKey over apiKey when set; every non-parse method keeps using apiKey. Falls back to apiKey when extractApiKey is omitted so tenants with a single global DWS key still work. Calling /extraction/parse with a Processor-only key returns 403.
  • New ExtractionCredits type module to surface the extraction-credit billing bucket separately from the processor-credit bucket. README, CHANGELOG, and JSDoc all make the distinction explicit.
  • Public type exports for the new surface (ParseResponse, ParseResponseSpatial, ParseResponseMarkdown, ParseInstructions, ParseOptions, ParseMode, ParseOutputFormat, the discriminated ParseElement union, all variant types, ExtractionCredits).
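
The `type`-discriminated narrowing described in the list above can be sketched with stand-in types. This is a minimal illustration rather than the package's actual definitions: only the `type` tag values come from this PR description, and every other field (`text`, `rows`) is a hypothetical placeholder.

```typescript
// Simplified stand-ins for three of the six element variants. Only the `type`
// tag values come from the PR description; `text` and `rows` are hypothetical.
interface ParagraphElement { type: 'paragraph'; text: string }
interface TableElement { type: 'table'; rows: string[][] }
interface PictureElement { type: 'picture' }

type ParseElement = ParagraphElement | TableElement | PictureElement;

// Describe one element. Inside each case the compiler only allows access to
// the fields of the matching variant.
function describeElement(element: ParseElement): string {
  switch (element.type) {
    case 'paragraph':
      return `paragraph (${element.text.length} chars)`;
    case 'table':
      return `table (${element.rows.length} rows)`;
    case 'picture':
      return 'picture';
    default: {
      // Exhaustiveness check: this stops compiling if a variant is unhandled.
      const unreachable: never = element;
      return unreachable;
    }
  }
}

const sampleElements: ParseElement[] = [
  { type: 'paragraph', text: 'Invoice 42' },
  { type: 'table', rows: [['a', 'b'], ['c', 'd']] },
  { type: 'picture' },
];
const descriptions = sampleElements.map(describeElement);
```

The `never` assignment in the default branch is a common way to make the compiler flag a newly added variant that no branch handles yet.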

Types & codegen

  • Vendored the upstream OpenAPI spec as dws-data-extraction-spec.yml (sibling to the existing dws-api-spec.yml).
  • New npm run generate:types:extract script (peer to the existing generate:types) that runs openapi-typescript against the vendored spec into src/generated/extract-types.ts.
  • src/types/parse.ts derives its schema primitives (ParseMode, ParseElement and all six element subtypes, Bounds, PageRef, Word, TableCell, KeyValuePair, KeyValueEntity, Metrics, Usage, Configuration, ParseErrorResponse, ParagraphRole) from the generated components['schemas'] rather than being hand-rolled — spec drift now flows through automatically. Four types stay hand-composed where they add something the spec doesn't express: ParseOutputOptions / ParseInstructions (spec marks includeWords as required but it has a server-side default), ParseResponseSpatial / ParseResponseMarkdown (cross-field discriminated narrowing so if (output.markdown !== undefined) works without per-call ?. access), and ParseOptions (adds the client-only apiVersion header).

Docs

  • Full README rewrite of the Data Extraction section leading with use cases (RAG ingestion, search indexing, content migration, form/invoice extraction, layout-aware document understanding) followed by mode + output-format selector tables and two worked recipes.
  • New "Setup — separate Extract API key" section in the README explaining the dual-key constructor.
  • docs/METHODS.md and LLM_DOC.md updated to document the new surface and the dual-key requirement.

Live verification (against prod)

Ran a full param sweep against the prod API using examples/assets/sample.pdf (6 pages), covering every documented (mode, output_format) combination, the spec-rejected case, every ParseOptions param, all four input shapes, and a client-side error path:

| # | Scenario | Outcome | Cost (rem) | ms |
|---|---|---|---|---|
| 01 | text + markdown | OK md 1922c | 6 | 2931 |
| 02 | text + spatial | EXPECTED 400 ValidationError (per spec) | | 512 |
| 03 | structure + markdown | OK md 2560c | 1.5 | 619 |
| 04 | structure + spatial | OK 72 elts (paragraph, table) | 9 | 982 |
| 05 | understand + markdown | OK md 5608c | 9 | 19370 |
| 06 | understand + spatial | OK 124 elts (picture, paragraph, handwriting, keyValueRegion) | 54 | 19398 |
| 07 | agentic + markdown | OK md 6975c | 18 | 39849 |
| 08 | agentic + spatial | OK 122 elts (picture, paragraph, handwriting) | 108 | 37582 |
| 09 | structure+spatial includeWords=true | OK 72 elts | 9 | 749 |
| 10 | structure+md language=['eng','deu'] | OK md 2560c | 1.5 | 601 |
| 11 | text+md apiVersion=2026-05-25 header | OK md 1922c | 6 | 708 |
| 12 | input: Buffer | OK md 1922c | 6 | 758 |
| 13 | input: URL string | OK md 2893c | 1 | |
| 14 | input: { type: 'url', url } object | OK md 2893c | 1 | |
| 15 | input: missing local file | EXPECTED client-side ValidationError (no network) | | 0 |

Separate dual-key smoke also covered routing end-to-end:

  • { apiKey: processor, extractApiKey: extract } → getAccountInfo() / extractText() / parse() / parseToMarkdown() all succeed.
  • { apiKey: processor } (no extractApiKey) → getAccountInfo() still works, parseToMarkdown() returns AuthenticationError HTTP 403 from the Extract product — exactly the failure mode the new option exists to prevent.

The Data Extraction API (`POST /extraction/parse`) ships on a separate
OpenAPI document from the existing DWS Processor API. Vendor the public spec
so the new typed client surface is anchored to a checked-in source of truth.

The Processor API spec stays at `dws-api-spec.yml`; the Data Extraction spec
lives alongside it at `dws-data-extraction-spec.yml`.

Introduce hand-written types mirroring the public Data Extraction OpenAPI 3.1
contract (version 2026-05-25):

- ParseMode (text | structure | understand | agentic)
- ParseOutputFormat (spatial | markdown), ParseOutputOptions
- ParseInstructions and ParseOptions request shapes
- ParseResponseSpatial / ParseResponseMarkdown discriminated by output payload
- Per-element types: ParagraphElement, FormulaElement, PictureElement,
  TableElement (with ParseTableCell), KeyValueRegionElement (with
  KeyValuePair / KeyValueEntity), HandwritingElement, and shared
  ParseElementBase / ParseBounds / ParsePageRef / ParseWord
- ParseErrorResponse with structured failingPaths
- ParseMetrics, ParseUsage (carrying data_extraction_credits), ParseConfiguration

The Data Extraction API bills against a separate extraction-credits bucket
from the processor API; type JSDoc makes the distinction explicit so client
code does not conflate the two billing buckets.

Wires the new endpoint into RequestTypeMap / ResponseTypeMap so the existing
HTTP layer stays type-safe end-to-end.

…wrappers

Adds first-class client methods for the Data Extraction API:

- parse(input, options?) — full-fidelity call against POST /extraction/parse,
  supporting local files, buffers, streams, and URL inputs. Handles multipart
  upload for binary inputs and JSON body for URL-only requests.
- parseToMarkdown(input, mode?) — convenience wrapper returning the whole-
  document Markdown string directly. Defaults to mode='text' (cheapest).
- parseElements(input, mode?, includeWords?) — convenience wrapper returning
  the typed spatial-elements array. Defaults to mode='structure'.

Threads x-nutrient-api-version through the HTTP layer when the caller pins
a specific API version.
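
The request-shape rules in this commit message (JSON body for URL-only requests, multipart for binary inputs, and the `x-nutrient-api-version` header only when a version is pinned) can be sketched as a small planning function. This is an illustration under assumptions, not the client's HTTP layer: the `ParseInput` shape, the `PlannedRequest` descriptor, and the body field names `url`, `file`, and `instructions` are hypothetical, and only the header name comes from the text above.

```typescript
// Hypothetical input shapes: a URL reference or in-memory bytes.
type ParseInput =
  | { kind: 'url'; url: string }
  | { kind: 'bytes'; data: Uint8Array; filename: string };

// A plain description of the request to send; not a real HTTP call.
interface PlannedRequest {
  encoding: 'json' | 'multipart';
  headers: Record<string, string>;
  // JSON string for URL inputs; named form fields for binary inputs.
  body: string | { file: Uint8Array; filename: string; instructions: string };
}

function planParseRequest(
  input: ParseInput,
  instructions: Record<string, unknown>,
  apiVersion?: string,
): PlannedRequest {
  const headers: Record<string, string> = {};
  // Only send the version header when the caller pins a version.
  if (apiVersion !== undefined) {
    headers['x-nutrient-api-version'] = apiVersion;
  }
  if (input.kind === 'url') {
    headers['Content-Type'] = 'application/json';
    return {
      encoding: 'json',
      headers,
      body: JSON.stringify({ url: input.url, instructions }),
    };
  }
  // Binary input: the instructions travel as a serialised form field.
  return {
    encoding: 'multipart',
    headers,
    body: {
      file: input.data,
      filename: input.filename,
      instructions: JSON.stringify(instructions),
    },
  };
}

const pinned = planParseRequest(
  { kind: 'url', url: 'https://example.com/sample.pdf' },
  { mode: 'text' },
  '2026-05-25',
);
const unpinned = planParseRequest(
  { kind: 'bytes', data: new Uint8Array([37, 80, 68, 70]), filename: 'sample.pdf' },
  {},
);
```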

JSDoc on every new method makes the billing distinction explicit: the
Data Extraction API bills against extraction credits, a separate bucket
from the processor API credits used by the rest of NutrientClient.

The full set of new types is re-exported from the package root.

Adds 19 unit tests around the new /extraction/parse surface:

- Request shape: multipart vs JSON, apiVersion header forwarding, option
  serialisation (language, output, includeWords), default behaviour.
- Mode coverage: all four modes (text, structure, understand, agentic)
  round-trip through the instructions payload.
- Output coverage: spatial elements and whole-document Markdown variants
  validated end-to-end, including extraction-credit accounting on the
  response (data_extraction_credits, not processor credits).
- Error paths: HTTP-layer ValidationError propagation, file-input
  preflight failures surfaced before the request leaves the process.
- Convenience wrappers: parseToMarkdown and parseElements default modes
  and includeWords forwarding, plus defensive output-mismatch errors.

Adds examples/src/parse_smoke.ts — a live operator-runnable smoke test
that prints a parsed summary plus extraction-credit usage. Documents
the build/pack/install/run recipe in the file header.

- README: new "Data Extraction (/extraction/parse)" section with mode/
  credit table, request examples for spatial + Markdown outputs, URL
  input, convenience wrappers, and a pointer to the smoke example.
- docs/METHODS.md: new entries for parse, parseToMarkdown, parseElements
  inserted alongside the existing extract* convenience methods.
- LLM_DOC.md: inject the same three method signatures so coding agents
  steered by this rule file know about parse and the extraction-credits
  bucket.
- CHANGELOG.md: Unreleased entry covering the new client surface, the
  newly-exported public types, the live smoke script, and an explicit
  call-out that /extraction/parse bills against extraction credits
  (separate from processor API credits).

Every doc surface that mentions cost says "extraction credits" explicitly
so downstream readers cannot conflate the two billing buckets.

- CHANGELOG: correct path to live smoke script
- METHODS.md: fix dangling sentence on parseElements compile-time guard

Factor the inline extraction-credit billing shape out of ParseUsage into a
standalone ExtractionCredits interface in src/types/extraction_credits.ts,
mirroring the Python client's type-factoring approach.

ParseUsage.data_extraction_credits now references ExtractionCredits instead
of an anonymous inline type, making the billing object reusable if future
endpoints surface the same shape.

ExtractionCredits is re-exported from the package root alongside the other
parse types.

Lead with the "Designed for" preamble naming the three canonical workflows
(RAG/search indexing, form/invoice extraction, layout-aware understanding)
before describing modes and output formats.

Broaden the @param input description to explicitly mention non-PDF inputs
(Office documents, images), matching the actual endpoint capability rather
than implying PDF-only like sign().

Update the @example block to show a form/invoice extraction recipe alongside
the RAG recipe, and replace the generic paragraph-walk with a keyValueRegion
traversal that a form-extraction caller can copy directly.
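
A keyValueRegion traversal of the kind this commit describes might look like the sketch below. The element shapes are deliberately flattened and hypothetical: the real `KeyValuePair` / `KeyValueEntity` types in the generated spec may be structured differently, and `collectFields` is an illustrative helper, not a client method.

```typescript
// Hypothetical, flattened element shapes for illustration only.
interface KeyValueRegionElement {
  type: 'keyValueRegion';
  pairs: { key: string; value: string }[];
}
interface ParagraphElement { type: 'paragraph'; text: string }
type ParseElement = KeyValueRegionElement | ParagraphElement;

// Collapse every key/value region into one plain object.
function collectFields(elements: ParseElement[]): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const element of elements) {
    if (element.type !== 'keyValueRegion') continue; // skip other variants
    for (const pair of element.pairs) {
      fields[pair.key] = pair.value; // a repeated key overwrites the earlier value
    }
  }
  return fields;
}

const invoiceFields = collectFields([
  { type: 'paragraph', text: 'ACME Corp' },
  {
    type: 'keyValueRegion',
    pairs: [
      { key: 'Invoice No', value: 'INV-001' },
      { key: 'Total', value: '42.00' },
    ],
  },
]);
```
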
Restructure the README's /extraction/parse section to lead with use cases
(RAG ingestion, form/invoice extraction, layout-aware understanding) before
the mode table and code, matching the Python client's documentation approach.

Add:
- "Choosing an output format" table (markdown vs spatial, with shape and
  best-for columns).
- "Modes — when to use which" table with credit costs and decision guidance.
- Two worked recipes: RAG ingestion (PDF → Markdown → embed) and
  form/invoice extraction (PDF → spatial elements → structured object),
  each with the convenience-wrapper alternative shown alongside.
- Explicit note that the endpoint accepts PDFs, Office documents, and
  images — not PDFs only.
- Mention of the new ExtractionCredits type in the exported-types list.

Update METHODS.md parse/parseToMarkdown/parseElements entries to match:
lead with use-case positioning, add a parameters table, align examples
with the recipe pattern from the README.

@nickwinder added the enhancement label May 27, 2026
@nickwinder self-assigned this May 27, 2026
@nickwinder marked this pull request as ready for review May 28, 2026 00:33

DWS Extract is a separate product from DWS Processor with its own API key
and credit pool. Calling /extraction/parse with the Processor key returns
403. Add an optional `extractApiKey` constructor option (string or async
getter) that parse() prefers over apiKey when set; every non-parse method
keeps using apiKey. Falls back to apiKey when extractApiKey is omitted,
so tenants with a single global DWS key still work.

The routing happens via a per-call options copy that swaps apiKey to the
extract key — leaves this.options untouched and covers both the multipart
file-input path and the JSON url-input path.
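
The per-call options copy described above can be sketched as follows. This is a simplified model under assumptions, not the client's code: `ClientOptions` is reduced to the two key fields named in this PR, and `optionsForParse` is a hypothetical helper name.

```typescript
// A key is either a string or an async getter, as described in this PR.
type KeySource = string | (() => Promise<string>);

// Reduced stand-in for the constructor options; other fields are omitted.
interface ClientOptions {
  apiKey: KeySource;
  extractApiKey?: KeySource;
}

// Return a copy whose apiKey is the extract key when one is configured,
// falling back to the original apiKey otherwise. The input is not mutated.
function optionsForParse(options: ClientOptions): ClientOptions {
  return { ...options, apiKey: options.extractApiKey ?? options.apiKey };
}

const dualKey: ClientOptions = { apiKey: 'processor-key', extractApiKey: 'extract-key' };
const routed = optionsForParse(dualKey);
const singleKey = optionsForParse({ apiKey: 'global-key' });
```

Because the swap happens on a copy, non-parse methods that read the stored options keep seeing the Processor key.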

Drop the bundled parse smoke script — its dual-key dance and pack/install
recipe were superseded by the unit-test coverage of the request shape,
response handling, and routing. Live verification against a real account
belongs to ad-hoc developer sessions, not committed scaffolding.

Mirrors PR #47 on the Python sibling client.

Add `npm run generate:types:extract` that runs openapi-typescript against
the vendored dws-data-extraction-spec.yml into src/generated/extract-types.ts,
peer to the existing `generate:types` flow for the Processor spec.

Rewrite src/types/parse.ts so the schema primitives derive from the
generated `components['schemas']` rather than being hand-rolled:

- ParseMode, ParseOutputFormat
- ParseElement and the six element subtypes (ParagraphElement,
  FormulaElement, PictureElement, TableElement, KeyValueRegionElement,
  HandwritingElement)
- ParseElementBase, ParseBounds, ParsePageRef, ParseWord
- ParseTableCell, KeyValuePair, KeyValueEntity
- ParseMetrics, ParseUsage, ParseConfiguration
- ParseErrorResponse, ParseErrorDetails, ParseErrorFailingPath
- ParagraphRole (now `NonNullable<ParagraphElement['role']>`)

Keep four types hand-composed where they add something the spec doesn't
express:

- ParseOutputOptions / ParseInstructions — the spec marks
  `OutputOptions.includeWords` as required, but the server has a default
  and clients shouldn't be forced to pass it.
- ParseResponseSpatial / ParseResponseMarkdown — cross-field discriminated
  narrowing (`elements?: undefined` / `markdown?: undefined`) the spec's
  ParseOutput doesn't model, letting callers write
  `if (output.markdown !== undefined)` without per-call `?.` access.
- ParseOptions — adds the client-only `apiVersion` header concern that
  isn't a body field in the spec.
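
The two hand-composition techniques named in this list can be sketched with stand-in types. The generated schema shapes below are simplified assumptions (the real ones come from `src/generated/extract-types.ts`); only the `includeWords` relaxation and the `elements?: undefined` / `markdown?: undefined` pattern are taken from the text above.

```typescript
// Stand-in for the generated OutputOptions schema, where includeWords is required.
interface GeneratedOutputOptions {
  format: 'spatial' | 'markdown';
  includeWords: boolean;
}

// Relax includeWords to optional, since the server applies a default.
type ParseOutputOptions =
  Omit<GeneratedOutputOptions, 'includeWords'> & { includeWords?: boolean };

// Cross-field discrimination: each variant declares the other's payload
// as `?: undefined`, so a presence check narrows the union.
interface SpatialOutput { elements: { type: string }[]; markdown?: undefined }
interface MarkdownOutput { markdown: string; elements?: undefined }
type ParseOutput = SpatialOutput | MarkdownOutput;

function summarizeOutput(output: ParseOutput): string {
  if (output.markdown !== undefined) {
    // Narrowed to MarkdownOutput: markdown is a plain string here.
    return `markdown: ${output.markdown.length} chars`;
  }
  // Narrowed to SpatialOutput: elements is non-optional, no `?.` needed.
  return `spatial: ${output.elements.length} elements`;
}

const minimalOptions: ParseOutputOptions = { format: 'markdown' }; // includeWords omitted
const markdownSummary = summarizeOutput({ markdown: '# Title' });
const spatialSummary = summarizeOutput({ elements: [{ type: 'paragraph' }, { type: 'table' }] });
```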

Net: ~210 lines of hand-rolled type definitions deleted, replaced with
one-line aliases that re-route through the generated schema. The public
surface (every exported name) is unchanged.

…spec re-export

Most APIs in this client (sign, ocr, watermark, redact, etc.) don't have a
dedicated `src/types/<api>.ts` file — they reach types via
`components['schemas']['X']` from `src/generated/api-types.ts`. The
`src/types/parse.ts` and `src/types/extraction_credits.ts` files added on
this branch were an outlier: most of their content was thin one-line
aliases over the generated extract spec.

Collapse to the rest-of-codebase pattern:

- Delete `src/types/parse.ts` (was 254 lines, mostly aliases).
- Delete `src/types/extraction_credits.ts` (single hand-rolled interface that
  duplicated the generated `Usage.data_extraction_credits` shape).
- Move the 5 hand-composed types into `src/types/http.ts` (it already
  imports `ParseInstructions` / `ParseResponse` to type the endpoint maps):
  `ParseOutputOptions`, `ParseInstructions`, `ParseOptions`,
  `ParseResponseSpatial`, `ParseResponseMarkdown`, plus the derived
  `ExtractionCredits` alias. Each carries the JSDoc explaining why it's
  hand-composed instead of derived.
- Drop the 23 cosmetic spec-alias exports from the package root. Consumers
  who need element-subtype types reach them via the new
  `extractComponents['schemas']['ParagraphElement']` namespace re-export,
  mirroring how Processor types are exposed via the existing `components`
  namespace.
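
The indexed-access pattern this list refers to can be sketched against a stand-in for the generated namespace. The schema contents here are hypothetical (the `role` values and the `cost` field are invented for illustration); what the sketch shows is how consumers can derive named types from `extractComponents['schemas'][...]` without a dedicated alias file.

```typescript
// Stand-in for the generated namespace. The real one is emitted by
// openapi-typescript; the field contents below are hypothetical.
interface extractComponents {
  schemas: {
    ParagraphElement: { type: 'paragraph'; role?: 'title' | 'body' | null };
    Usage: { data_extraction_credits: { cost: number } };
  };
}

// Spec-derived types are reached by indexing into the namespace.
type ParagraphElement = extractComponents['schemas']['ParagraphElement'];
type ParagraphRole = NonNullable<ParagraphElement['role']>;
type ExtractionCredits =
  extractComponents['schemas']['Usage']['data_extraction_credits'];

// NonNullable strips both null and undefined from the role type.
function roleOrDefault(paragraph: ParagraphElement, fallback: ParagraphRole): ParagraphRole {
  return paragraph.role ?? fallback;
}

const credits: ExtractionCredits = { cost: 1.5 };
const explicitRole = roleOrDefault({ type: 'paragraph', role: 'title' }, 'body');
const defaultedRole = roleOrDefault({ type: 'paragraph', role: null }, 'body');
```
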

The package's public surface still exports the 7 hand-composed types
(`ParseOutputOptions`, `ParseInstructions`, `ParseOptions`, `ParseResponse`,
`ParseResponseSpatial`, `ParseResponseMarkdown`, `ExtractionCredits`) by
name. Internal consumers (`src/client.ts`, the parse unit tests) shift to
`extractComponents['schemas']['X']` for spec-derived types.

Net: -290 lines on the type-definition surface, no behaviour change.

Five findings from review:

1. Empty-string `extractApiKey` bypassed constructor validation.
   `apiKey` uses `!options.apiKey` (falsy, catches `''`); the new
   `extractApiKey` validator only checked `!== undefined` plus the type
   guard, so `extractApiKey: ''` passed, propagated into the per-call
   options as `apiKey: ''`, and produced `Authorization: Bearer ` with no
   token — surfacing as a confusing server-side 401 instead of a
   constructor-time `ValidationError`. Add an explicit empty-string check.

2. `extractErrorMessage` in `src/http.ts` checked snake_case (`error_message`,
   `error_description`) and generic message fields but not `errorMessage`
   (camelCase) — the field DWS Extract returns on every 4xx/5xx. Result:
   the server's specific message (e.g. `"invalid mode: 'vlm'"`) was
   silently replaced by the generic `HTTP <status>: <statusText>` string.
   Add `errorMessage` to the priority list.

3. `parse()` accepted `mode='text' + output.format='spatial'` and let the
   server reject with 400. The Python sibling client adds a client-side
   `ValidationError` for this case (after reviewer feedback). The TS
   `parseElements()` wrapper blocked it at the type level via `Exclude`,
   but the low-level `parse()` did not. Add a pre-flight runtime guard.

4. `RequestTypeMap` JSDoc on `/extraction/parse` claimed `instructions`
   was optional for multipart upload, but the type definition marks it
   required and the implementation always passes it (an empty object when
   no options are supplied). Update the comment to match the type.

5. `parse()` `@param options.language` JSDoc described the field as
   "string or array of ISO 639-2 codes". The underlying spec also accepts
   lowercase language names (`'english'`, `'german'`) and `+`-joined
   multilingual strings (`'eng+spa'`). Document all four accepted forms.
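
Findings 1 and 3 both add client-side checks that fail before any request is sent. A minimal sketch of the two guards follows, assuming a local `ValidationError` class and hypothetical helper names (`validateExtractApiKey`, `assertModeSupportsFormat`); the client's own error class and function names may differ.

```typescript
// Local stand-in for the client's ValidationError.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'ValidationError';
  }
}

type ParseMode = 'text' | 'structure' | 'understand' | 'agentic';
type ParseOutputFormat = 'spatial' | 'markdown';
// Type-level counterpart of the runtime guard, as used by parseElements().
type SpatialMode = Exclude<ParseMode, 'text'>;

// Finding 1: reject an empty string as well as wrong types.
function validateExtractApiKey(key: unknown): void {
  if (key === undefined) return; // omitted: parse() falls back to apiKey
  if (typeof key === 'function') return; // async getter
  if (typeof key !== 'string' || key === '') {
    throw new ValidationError('extractApiKey must be a non-empty string or a function');
  }
}

// Finding 3: text mode cannot produce spatial output, so fail pre-flight.
function assertModeSupportsFormat(mode: ParseMode, format: ParseOutputFormat): void {
  if (mode === 'text' && format === 'spatial') {
    throw new ValidationError("mode 'text' does not support output format 'spatial'");
  }
}

// Small helper for the usage lines below: returns the thrown error's name.
function thrownName(action: () => void): string | undefined {
  try {
    action();
    return undefined;
  } catch (error) {
    return (error as Error).name;
  }
}

const spatialDefault: SpatialMode = 'structure';
const emptyKeyResult = thrownName(() => validateExtractApiKey(''));
const validKeyResult = thrownName(() => validateExtractApiKey('extract-key'));
const textSpatialResult = thrownName(() => assertModeSupportsFormat('text', 'spatial'));
const structureSpatialResult = thrownName(() => assertModeSupportsFormat(spatialDefault, 'spatial'));
```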

Adds three unit tests (empty-string `extractApiKey`, `errorMessage`
extraction, text+spatial pre-flight rejection). 292 tests pass.
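
The `errorMessage` fix from finding 2 amounts to adding one field to a priority list. The sketch below illustrates the idea; the exact field order and the function signature in `src/http.ts` are not shown in this PR, so both are assumptions here.

```typescript
// Candidate fields in priority order. The ordering is an assumption; the PR
// only states that `errorMessage` was added to the existing list.
const MESSAGE_FIELDS = ['errorMessage', 'error_message', 'error_description', 'message'];

// Prefer a server-supplied message; otherwise fall back to the generic
// `HTTP <status>: <statusText>` string.
function extractErrorMessage(body: unknown, status: number, statusText: string): string {
  if (typeof body === 'object' && body !== null) {
    const record = body as Record<string, unknown>;
    for (const field of MESSAGE_FIELDS) {
      const value = record[field];
      if (typeof value === 'string' && value.length > 0) {
        return value;
      }
    }
  }
  return `HTTP ${status}: ${statusText}`;
}

const specificMessage = extractErrorMessage({ errorMessage: "invalid mode: 'vlm'" }, 400, 'Bad Request');
const genericMessage = extractErrorMessage({}, 500, 'Internal Server Error');
```
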
@nickwinder nickwinder requested a review from HungKNguyen May 28, 2026 05:28
@nickwinder nickwinder merged commit cf56d22 into main May 29, 2026
10 checks passed

Labels

enhancement New feature or request

2 participants