feat: add client.parse() for the Data Extraction API (/extraction/parse) by nickwinder · Pull Request #12 · PSPDFKit-labs/nutrient-dws-client-typescript

nickwinder · 2026-05-27T09:05:29Z

Summary

Adds first-class TypeScript client support for the Data Extraction API (/extraction/parse), which is now generally available. Mirrors the Python sibling PR.

Changes Made

Public surface

New client.parse() covering all four processing modes (text, structure, understand, agentic) and both output formats (spatial element list, whole-document markdown).
Two convenience wrappers — client.parseToMarkdown() (markdown-only return) and client.parseElements() (spatial-only return, with mode='text' excluded at the type level since the API rejects that combination).
Typed ParseResponse envelope with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting) — if (element.type === 'table') { ... } narrows correctly via the type discriminator.
NutrientClient accepts a new optional extractApiKey constructor option (string or async getter) for the Data Extraction product key, which is separate from the Processor key. parse() prefers extractApiKey over apiKey when set; every non-parse method keeps using apiKey. Falls back to apiKey when extractApiKey is omitted so tenants with a single global DWS key still work. Calling /extraction/parse with a Processor-only key returns 403.
New ExtractionCredits type module to surface the extraction-credit billing bucket separately from the processor-credit bucket. README, CHANGELOG, and JSDoc all make the distinction explicit.
Public type exports for the new surface (ParseResponse, ParseResponseSpatial, ParseResponseMarkdown, ParseInstructions, ParseOptions, ParseMode, ParseOutputFormat, the discriminated ParseElement union, all variant types, ExtractionCredits).
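The narrowing behaviour described for the element union can be sketched as follows. This is a minimal illustration under stated assumptions: the `Sketch*` types, the `cells` and `pairs` field names, and both helper functions are hypothetical stand-ins, not the shapes exported by the client.

```typescript
// Hypothetical, simplified element shapes for illustration only. The real
// variants come from the Data Extraction schema and carry more fields.
type SketchParagraph = { type: 'paragraph'; text: string };
type SketchTable = { type: 'table'; cells: { text: string }[] };
type SketchKeyValueRegion = {
  type: 'keyValueRegion';
  pairs: { key: string; value: string }[];
};
type SketchElement = SketchParagraph | SketchTable | SketchKeyValueRegion;

// Gather key/value pairs from every keyValueRegion element.
function collectFields(elements: SketchElement[]): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const element of elements) {
    if (element.type === 'keyValueRegion') {
      // Narrowed by the `type` discriminator: `pairs` is visible only here.
      for (const pair of element.pairs) {
        fields[pair.key] = pair.value;
      }
    }
  }
  return fields;
}

// Count table cells; `cells` is reachable only after narrowing to 'table'.
function countTableCells(elements: SketchElement[]): number {
  let total = 0;
  for (const element of elements) {
    if (element.type === 'table') {
      total += element.cells.length;
    }
  }
  return total;
}

// Made-up sample data, not taken from a real API response.
const sampleElements: SketchElement[] = [
  { type: 'paragraph', text: 'Invoice' },
  { type: 'keyValueRegion', pairs: [{ key: 'Invoice No', value: 'INV-001' }] },
  { type: 'table', cells: [{ text: 'Item' }, { text: 'Price' }] },
];

const sampleFields = collectFields(sampleElements);
const sampleCellCount = countTableCells(sampleElements);
```

Accessing `element.pairs` outside the `if` would be a compile error, which is the property the discriminated union is meant to provide.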

Types & codegen

Vendored the upstream OpenAPI spec as dws-data-extraction-spec.yml (sibling to the existing dws-api-spec.yml).
New npm run generate:types:extract script (peer to the existing generate:types) that runs openapi-typescript against the vendored spec into src/generated/extract-types.ts.
src/types/parse.ts derives its schema primitives (ParseMode, ParseElement and all six element subtypes, Bounds, PageRef, Word, TableCell, KeyValuePair, KeyValueEntity, Metrics, Usage, Configuration, ParseErrorResponse, ParagraphRole) from the generated components['schemas'] rather than being hand-rolled — spec drift now flows through automatically. Four types stay hand-composed where they add something the spec doesn't express: ParseOutputOptions / ParseInstructions (spec marks includeWords as required but it has a server-side default), ParseResponseSpatial / ParseResponseMarkdown (cross-field discriminated narrowing so if (output.markdown !== undefined) works without per-call ?. access), and ParseOptions (adds the client-only apiVersion header).
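The cross-field narrowing mentioned above can be illustrated with a small sketch. The `Sketch*` names and the element shape are assumptions made for the example; only the `elements?: undefined` / `markdown?: undefined` idea comes from the PR text.

```typescript
// Each variant declares the other variant's payload as `?: undefined`, so a
// single undefined-check on one field narrows the whole union.
type SketchSpatialOutput = { elements: { type: string }[]; markdown?: undefined };
type SketchMarkdownOutput = { markdown: string; elements?: undefined };
type SketchOutput = SketchSpatialOutput | SketchMarkdownOutput;

function describeOutput(output: SketchOutput): string {
  if (output.markdown !== undefined) {
    // Narrowed to SketchMarkdownOutput: `markdown` is a plain string here.
    return `markdown:${output.markdown.length}`;
  }
  // Narrowed to SketchSpatialOutput: `elements` is required, no `?.` needed.
  return `elements:${output.elements.length}`;
}

const markdownSummary = describeOutput({ markdown: '# Hi' });
const spatialSummary = describeOutput({ elements: [{ type: 'paragraph' }] });
```

Without the `?: undefined` members, TypeScript would reject `output.markdown` on the union because the property would not exist on the spatial variant.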

Docs

Full README rewrite of the Data Extraction section leading with use cases (RAG ingestion, search indexing, content migration, form/invoice extraction, layout-aware document understanding) followed by mode + output-format selector tables and two worked recipes.
New "Setup — separate Extract API key" section in the README explaining the dual-key constructor.
docs/METHODS.md and LLM_DOC.md updated to document the new surface and the dual-key requirement.

Live verification (against prod)

Ran a full param sweep against the prod API using examples/assets/sample.pdf (6 pages), covering every documented (mode, output_format) combination, the spec-rejected case, every ParseOptions param, all four input shapes, and a client-side error path:

| # | Scenario | Outcome | Cost (rem) | ms |
|---|---|---|---|---|
| 01 | `text` + `markdown` | OK md 1922c | 6 | 2931 |
| 02 | `text` + `spatial` | EXPECTED 400 `ValidationError` (per spec) | - | 512 |
| 03 | `structure` + `markdown` | OK md 2560c | 1.5 | 619 |
| 04 | `structure` + `spatial` | OK 72 elts (paragraph, table) | 9 | 982 |
| 05 | `understand` + `markdown` | OK md 5608c | 9 | 19370 |
| 06 | `understand` + `spatial` | OK 124 elts (picture, paragraph, handwriting, keyValueRegion) | 54 | 19398 |
| 07 | `agentic` + `markdown` | OK md 6975c | 18 | 39849 |
| 08 | `agentic` + `spatial` | OK 122 elts (picture, paragraph, handwriting) | 108 | 37582 |
| 09 | `structure`+`spatial` `includeWords=true` | OK 72 elts | 9 | 749 |
| 10 | `structure`+`md` `language=['eng','deu']` | OK md 2560c | 1.5 | 601 |
| 11 | `text`+`md` `apiVersion=2026-05-25` header | OK md 1922c | 6 | 708 |
| 12 | input: `Buffer` | OK md 1922c | 6 | 758 |
| 13 | input: URL string | OK md 2893c | 1 | - |
| 14 | input: `{ type: 'url', url }` object | OK md 2893c | 1 | - |
| 15 | input: missing local file | EXPECTED client-side `ValidationError` (no network) | - | 0 |

Separate dual-key smoke also covered routing end-to-end:

{ apiKey: processor, extractApiKey: extract } → getAccountInfo() / extractText() / parse() / parseToMarkdown() all succeed.
{ apiKey: processor } (no extractApiKey) → getAccountInfo() still works, parseToMarkdown() returns AuthenticationError HTTP 403 from the Extract product — exactly the failure mode the new option exists to prevent.

The Data Extraction API (`POST /extraction/parse`) ships on a separate OpenAPI document from the existing DWS Processor API. Vendor the public spec so the new typed client surface is anchored to a checked-in source of truth. The Processor API spec stays at `dws-api-spec.yml`; the Data Extraction spec lives alongside it at `dws-data-extraction-spec.yml`.

Introduce hand-written types mirroring the public Data Extraction OpenAPI 3.1 contract (version 2026-05-25):

- ParseMode (text | structure | understand | agentic)
- ParseOutputFormat (spatial | markdown), ParseOutputOptions
- ParseInstructions and ParseOptions request shapes
- ParseResponseSpatial / ParseResponseMarkdown discriminated by output payload
- Per-element types: ParagraphElement, FormulaElement, PictureElement, TableElement (with ParseTableCell), KeyValueRegionElement (with KeyValuePair / KeyValueEntity), HandwritingElement, and shared ParseElementBase / ParseBounds / ParsePageRef / ParseWord
- ParseErrorResponse with structured failingPaths
- ParseMetrics, ParseUsage (carrying data_extraction_credits), ParseConfiguration

The Data Extraction API bills against a separate extraction-credits bucket from the processor API; type JSDoc makes the distinction explicit so client code does not conflate the two billing buckets.

Wires the new endpoint into RequestTypeMap / ResponseTypeMap so the existing HTTP layer stays type-safe end-to-end.

…wrappers

Adds first-class client methods for the Data Extraction API:

- parse(input, options?): full-fidelity call against POST /extraction/parse, supporting local files, buffers, streams, and URL inputs. Handles multipart upload for binary inputs and JSON body for URL-only requests.
- parseToMarkdown(input, mode?): convenience wrapper returning the whole-document Markdown string directly. Defaults to mode='text' (cheapest).
- parseElements(input, mode?, includeWords?): convenience wrapper returning the typed spatial-elements array. Defaults to mode='structure'.

Threads x-nutrient-api-version through the HTTP layer when the caller pins a specific API version.

JSDoc on every new method makes the billing distinction explicit: the Data Extraction API bills against extraction credits, a separate bucket from the processor API credits used by the rest of NutrientClient.

The full set of new types is re-exported from the package root.
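The wrapper defaults and the type-level exclusion of `mode='text'` for spatial output might look roughly like the sketch below. It only builds the instructions payload; `SketchInstructions`, `markdownInstructions`, and `spatialInstructions` are hypothetical names, and the payload layout (`mode` plus `output.format` / `output.includeWords`) is inferred from the option names mentioned in this PR rather than copied from the client source.

```typescript
type SketchMode = 'text' | 'structure' | 'understand' | 'agentic';
// Spatial output is rejected for mode 'text', so remove it at the type level.
type SketchSpatialMode = Exclude<SketchMode, 'text'>;

interface SketchInstructions {
  mode: SketchMode;
  output: { format: 'markdown' | 'spatial'; includeWords?: boolean };
}

// Markdown wrapper: defaults to the cheapest mode, 'text'.
function markdownInstructions(mode: SketchMode = 'text'): SketchInstructions {
  return { mode, output: { format: 'markdown' } };
}

// Spatial wrapper: defaults to 'structure'; 'text' is not assignable to `mode`.
function spatialInstructions(
  mode: SketchSpatialMode = 'structure',
  includeWords?: boolean,
): SketchInstructions {
  const output: SketchInstructions['output'] = { format: 'spatial' };
  // Leave the key out when unset so the server-side default applies.
  if (includeWords !== undefined) {
    output.includeWords = includeWords;
  }
  return { mode, output };
}

const defaultMarkdown = markdownInstructions();
const defaultSpatial = spatialInstructions();
const spatialWithWords = spatialInstructions('understand', true);
// spatialInstructions('text'); // would fail to compile
```

Because the exclusion lives in the parameter type, a caller passing `'text'` gets a compile error instead of a server-side 400.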

Adds 19 unit tests around the new /extraction/parse surface:

- Request shape: multipart vs JSON, apiVersion header forwarding, option serialisation (language, output, includeWords), default behaviour.
- Mode coverage: all four modes (text, structure, understand, agentic) round-trip through the instructions payload.
- Output coverage: spatial elements and whole-document Markdown variants validated end-to-end, including extraction-credit accounting on the response (data_extraction_credits, not processor credits).
- Error paths: HTTP-layer ValidationError propagation, file-input preflight failures surfaced before the request leaves the process.
- Convenience wrappers: parseToMarkdown and parseElements default modes and includeWords forwarding, plus defensive output-mismatch errors.

Adds examples/src/parse_smoke.ts, a live operator-runnable smoke test that prints a parsed summary plus extraction-credit usage. Documents the build/pack/install/run recipe in the file header.

- README: new "Data Extraction (/extraction/parse)" section with mode/credit table, request examples for spatial + Markdown outputs, URL input, convenience wrappers, and a pointer to the smoke example.
- docs/METHODS.md: new entries for parse, parseToMarkdown, parseElements inserted alongside the existing extract* convenience methods.
- LLM_DOC.md: inject the same three method signatures so coding agents steered by this rule file know about parse and the extraction-credits bucket.
- CHANGELOG.md: Unreleased entry covering the new client surface, the newly-exported public types, the live smoke script, and an explicit call-out that /extraction/parse bills against extraction credits (separate from processor API credits).

Every doc surface that mentions cost says "extraction credits" explicitly so downstream readers cannot conflate the two billing buckets.

- CHANGELOG: correct path to live smoke script
- METHODS.md: fix dangling sentence on parseElements compile-time guard

Factor the inline extraction-credit billing shape out of ParseUsage into a standalone ExtractionCredits interface in src/types/extraction_credits.ts, mirroring the Python client's type-factoring approach. ParseUsage.data_extraction_credits now references ExtractionCredits instead of an anonymous inline type, making the billing object reusable if future endpoints surface the same shape. ExtractionCredits is re-exported from the package root alongside the other parse types.

@example

Lead with the "Designed for" preamble naming the three canonical workflows (RAG/search indexing, form/invoice extraction, layout-aware understanding) before describing modes and output formats. Broaden the @param input description to explicitly mention non-PDF inputs (Office documents, images), matching the actual endpoint capability rather than implying PDF-only like sign(). Update the @example block to show a form/invoice extraction recipe alongside the RAG recipe, and replace the generic paragraph-walk with a keyValueRegion traversal that a form-extraction caller can copy directly.

Restructure the README's /extraction/parse section to lead with use cases (RAG ingestion, form/invoice extraction, layout-aware understanding) before the mode table and code, matching the Python client's documentation approach.

Add:

- "Choosing an output format" table (markdown vs spatial, with shape and best-for columns).
- "Modes: when to use which" table with credit costs and decision guidance.
- Two worked recipes: RAG ingestion (PDF → Markdown → embed) and form/invoice extraction (PDF → spatial elements → structured object), each with the convenience-wrapper alternative shown alongside.
- Explicit note that the endpoint accepts PDFs, Office documents, and images, not PDFs only.
- Mention of the new ExtractionCredits type in the exported-types list.

Update METHODS.md parse/parseToMarkdown/parseElements entries to match: lead with use-case positioning, add a parameters table, align examples with the recipe pattern from the README.

DWS Extract is a separate product from DWS Processor with its own API key and credit pool. Calling /extraction/parse with the Processor key returns 403. Add an optional `extractApiKey` constructor option (string or async getter) that parse() prefers over apiKey when set; every non-parse method keeps using apiKey. Falls back to apiKey when extractApiKey is omitted, so tenants with a single global DWS key still work. The routing happens via a per-call options copy that swaps apiKey to the extract key — leaves this.options untouched and covers both the multipart file-input path and the JSON url-input path. Drop the bundled parse smoke script — its dual-key dance and pack/install recipe were superseded by the unit-test coverage of the request shape, response handling, and routing. Live verification against a real account belongs to ad-hoc developer sessions, not committed scaffolding. Mirrors PR #47 on the Python sibling client.
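The per-call options copy described above could be sketched like this. `SketchClientOptions` and `optionsForParse` are hypothetical names used for illustration; the only behaviour taken from the commit message is that the extract key replaces `apiKey` on a copy, with a fallback to `apiKey`, while the stored options stay untouched.

```typescript
// A key is either a literal string or an async getter, as in the PR text.
type SketchKey = string | (() => Promise<string>);

interface SketchClientOptions {
  apiKey: SketchKey;
  extractApiKey?: SketchKey;
}

// Build the options used for a single parse() call. The input object is
// never mutated, so non-parse methods keep seeing the Processor key.
function optionsForParse(options: SketchClientOptions): SketchClientOptions {
  return { ...options, apiKey: options.extractApiKey ?? options.apiKey };
}

// Placeholder key strings, not real credentials.
const dualKey: SketchClientOptions = {
  apiKey: 'processor-key',
  extractApiKey: 'extract-key',
};
const singleKey: SketchClientOptions = { apiKey: 'global-key' };

const dualKeyForParse = optionsForParse(dualKey);
const singleKeyForParse = optionsForParse(singleKey);
```

Since both the multipart path and the JSON path would receive the same copied options object, one swap covers both transports.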

Add `npm run generate:types:extract` that runs openapi-typescript against the vendored dws-data-extraction-spec.yml into src/generated/extract-types.ts, peer to the existing `generate:types` flow for the Processor spec.

Rewrite src/types/parse.ts so the schema primitives derive from the generated `components['schemas']` rather than being hand-rolled:

- ParseMode, ParseOutputFormat
- ParseElement and the six element subtypes (ParagraphElement, FormulaElement, PictureElement, TableElement, KeyValueRegionElement, HandwritingElement)
- ParseElementBase, ParseBounds, ParsePageRef, ParseWord
- ParseTableCell, KeyValuePair, KeyValueEntity
- ParseMetrics, ParseUsage, ParseConfiguration
- ParseErrorResponse, ParseErrorDetails, ParseErrorFailingPath
- ParagraphRole (now `NonNullable<ParagraphElement['role']>`)

Keep four types hand-composed where they add something the spec doesn't express:

- ParseOutputOptions / ParseInstructions: the spec marks `OutputOptions.includeWords` as required, but the server has a default and clients shouldn't be forced to pass it.
- ParseResponseSpatial / ParseResponseMarkdown: cross-field discriminated narrowing (`elements?: undefined` / `markdown?: undefined`) the spec's ParseOutput doesn't model, letting callers write `if (output.markdown !== undefined)` without per-call `?.` access.
- ParseOptions: adds the client-only `apiVersion` header concern that isn't a body field in the spec.

Net: ~210 lines of hand-rolled type definitions deleted, replaced with one-line aliases that re-route through the generated schema. The public surface (every exported name) is unchanged.
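To make the alias pattern concrete, here is a minimal sketch. `SketchGeneratedComponents` is a hand-written stand-in for what openapi-typescript would emit; its field names and the `role` values are hypothetical, and the `Omit`-based relaxation is one possible way to express the optional `includeWords`, not necessarily the one used in this PR.

```typescript
// Stand-in for the generated file. Field names and role values are made up
// so the derivations below have something to point at.
interface SketchGeneratedComponents {
  schemas: {
    ParseMode: 'text' | 'structure' | 'understand' | 'agentic';
    OutputOptions: { format: 'spatial' | 'markdown'; includeWords: boolean };
    ParagraphElement: {
      type: 'paragraph';
      text: string;
      role?: 'title' | 'body' | null;
    };
  };
}
type SketchSchemas = SketchGeneratedComponents['schemas'];

// One-line aliases: a spec change flows through on regeneration.
type DerivedParseMode = SketchSchemas['ParseMode'];
type DerivedParagraphElement = SketchSchemas['ParagraphElement'];
// Strip `null | undefined` from the optional, nullable `role` field.
type DerivedParagraphRole = NonNullable<DerivedParagraphElement['role']>;

// Hand-composed: the generated type requires includeWords, so relax it.
type RelaxedOutputOptions = Omit<SketchSchemas['OutputOptions'], 'includeWords'> & {
  includeWords?: boolean;
};

const derivedMode: DerivedParseMode = 'structure';
const derivedRole: DerivedParagraphRole = 'title';
const relaxedOutput: RelaxedOutputOptions = { format: 'markdown' };
```

Assigning `null` to `derivedRole` or omitting `format` from `relaxedOutput` would both fail to compile, which shows the aliases still track the generated shapes.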

…spec re-export

Most APIs in this client (sign, ocr, watermark, redact, etc.) don't have a dedicated `src/types/<api>.ts` file; they reach types via `components['schemas']['X']` from `src/generated/api-types.ts`. The `src/types/parse.ts` and `src/types/extraction_credits.ts` files added on this branch were an outlier: most of their content was thin one-line aliases over the generated extract spec.

Collapse to the rest-of-codebase pattern:

- Delete `src/types/parse.ts` (was 254 lines, mostly aliases).
- Delete `src/types/extraction_credits.ts` (single hand-rolled interface that duplicated the generated `Usage.data_extraction_credits` shape).
- Move the 5 hand-composed types into `src/types/http.ts` (it already imports `ParseInstructions` / `ParseResponse` to type the endpoint maps): `ParseOutputOptions`, `ParseInstructions`, `ParseOptions`, `ParseResponseSpatial`, `ParseResponseMarkdown`, plus the derived `ExtractionCredits` alias. Each carries the JSDoc explaining why it's hand-composed instead of derived.
- Drop the 23 cosmetic spec-alias exports from the package root. Consumers who need element-subtype types reach them via the new `extractComponents['schemas']['ParagraphElement']` namespace re-export, mirroring how Processor types are exposed via the existing `components` namespace.

The package's public surface still exports the 7 hand-composed types (`ParseOutputOptions`, `ParseInstructions`, `ParseOptions`, `ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`, `ExtractionCredits`) by name. Internal consumers (`src/client.ts`, the parse unit tests) shift to `extractComponents['schemas']['X']` for spec-derived types.

Net: -290 lines on the type-definition surface, no behaviour change.

Five findings from review:

1. Empty-string `extractApiKey` bypassed constructor validation. `apiKey` uses `!options.apiKey` (falsy, catches `''`); the new `extractApiKey` validator only checked `!== undefined` plus the type guard, so `extractApiKey: ''` passed, propagated into the per-call options as `apiKey: ''`, and produced `Authorization: Bearer ` with no token, surfacing as a confusing server-side 401 instead of a constructor-time `ValidationError`. Add an explicit empty-string check.
2. `extractErrorMessage` in `src/http.ts` checked snake_case (`error_message`, `error_description`) and generic message fields but not `errorMessage` (camelCase), the field DWS Extract returns on every 4xx/5xx. Result: the server's specific message (e.g. `"invalid mode: 'vlm'"`) was silently replaced by the generic `HTTP <status>: <statusText>` string. Add `errorMessage` to the priority list.
3. `parse()` accepted `mode='text' + output.format='spatial'` and let the server reject with 400. The Python sibling client adds a client-side `ValidationError` for this case (after reviewer feedback). The TS `parseElements()` wrapper blocked it at the type level via `Exclude`, but the low-level `parse()` did not. Add a pre-flight runtime guard.
4. `RequestTypeMap` JSDoc on `/extraction/parse` claimed `instructions` was optional for multipart upload, but the type definition marks it required and the implementation always passes it (an empty object when no options are supplied). Update the comment to match the type.
5. `parse()` `@param options.language` JSDoc described the field as "string or array of ISO 639-2 codes". The underlying spec also accepts lowercase language names (`'english'`, `'german'`) and `+`-joined multilingual strings (`'eng+spa'`). Document all four accepted forms.

Adds three unit tests (empty-string `extractApiKey`, `errorMessage` extraction, text+spatial pre-flight rejection). 292 tests pass.
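Findings 1 to 3 each come down to a small guard. The sketch below shows one way such guards could look; all names are hypothetical, the order of the message fields is an assumption (the commit only says `errorMessage` was added to the list), and `SketchValidationError` stands in for the client's own `ValidationError`.

```typescript
// Assumed priority order: camelCase first, then the snake_case and generic fields.
const SKETCH_MESSAGE_FIELDS = ['errorMessage', 'error_message', 'error_description', 'message'];

// Prefer a server-provided message; fall back to the generic HTTP string.
function sketchExtractErrorMessage(body: unknown, status: number, statusText: string): string {
  if (typeof body === 'object' && body !== null) {
    const record = body as Record<string, unknown>;
    for (const field of SKETCH_MESSAGE_FIELDS) {
      const value = record[field];
      if (typeof value === 'string' && value.length > 0) {
        return value;
      }
    }
  }
  return `HTTP ${status}: ${statusText}`;
}

class SketchValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'SketchValidationError';
  }
}

// Pre-flight guard: reject the unsupported combination before any request.
function assertParseCombination(mode: string, format: string): void {
  if (mode === 'text' && format === 'spatial') {
    throw new SketchValidationError("mode 'text' does not support output.format 'spatial'");
  }
}

// Constructor-time check: undefined is allowed, the empty string is not.
function assertExtractApiKey(key: unknown): void {
  if (key === undefined) {
    return;
  }
  if (key === '') {
    throw new SketchValidationError('extractApiKey must not be an empty string');
  }
  if (typeof key !== 'string' && typeof key !== 'function') {
    throw new SketchValidationError('extractApiKey must be a string or a function');
  }
}

const specificMessage = sketchExtractErrorMessage(
  { errorMessage: "invalid mode: 'vlm'" },
  400,
  'Bad Request',
);
const genericMessage = sketchExtractErrorMessage({}, 500, 'Internal Server Error');
```

Failing early on the client keeps the error close to the caller's mistake, instead of surfacing later as a 400 or 401 from the server.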

nickwinder added 9 commits May 27, 2026 11:36

docs: fix smoke script path and parseElements doc fragment

17e419e

nickwinder added the enhancement New feature or request label May 27, 2026

nickwinder self-assigned this May 27, 2026

nickwinder marked this pull request as ready for review May 28, 2026 00:33

nickwinder added 4 commits May 28, 2026 14:10

nickwinder requested a review from HungKNguyen May 28, 2026 05:28

HungKNguyen approved these changes May 29, 2026

nickwinder merged commit cf56d22 into main May 29, 2026
10 checks passed

nickwinder merged 13 commits into main from feat/task-135-parse-ga
