SplitPdfHook ignores custom HTTPClient configuration #171

@trey-trimble-posh

Description

When providing a custom httpClient to UnstructuredClient, the SplitPdfHook ignores it and creates its own HTTPClientExtension with default settings. This makes it impossible to configure socket-level timeouts for split PDF operations.

Current Behavior

In SplitPdfHook.ts, the sdkInit method discards the user-provided client:

sdkInit(opts: SDKInitOptions): SDKInitOptions {
  const { baseURL } = opts;  // client is ignored
  this.client = new HTTPClientExtension();  // creates new client with defaults
  // ...
  return { baseURL: baseURL, client: this.client };
}

As a result, split PDF requests fall back to undici's default headersTimeout of 5 minutes in Node.js, even when the user has configured longer timeouts through a custom HTTPClient.

Expected Behavior

The SplitPdfHook should respect the user-provided httpClient configuration, in one of the following ways:

  1. Passing the custom client through to HTTPClientExtension
  2. Using the custom client's fetcher for split PDF requests
  3. Exposing timeout configuration options that apply to split PDF operations
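
For illustration, option 1 could look roughly like the sketch below. It is not the SDK's actual code: HTTPClient, HTTPClientExtension, SDKInitOptions and Fetcher are simplified stand-ins declared inline so the snippet runs on its own, and the HTTPClientExtension constructor that accepts an inner client is an assumption about how the pass-through might be wired.

```typescript
// Simplified stand-ins for the SDK's types (assumptions, not the real definitions).
type Fetcher = (input: string, init?: Record<string, unknown>) => Promise<unknown>;

class HTTPClient {
  readonly fetcher: Fetcher;

  constructor(options: { fetcher?: Fetcher } = {}) {
    // Placeholder default so the sketch runs without network access.
    this.fetcher = options.fetcher ?? (async (input) => `default:${input}`);
  }

  request(input: string, init?: Record<string, unknown>): Promise<unknown> {
    return this.fetcher(input, init);
  }
}

interface SDKInitOptions {
  baseURL: string | null;
  client: HTTPClient;
}

// Hypothetical variant: wraps the caller's client instead of always
// building a new one with default settings.
class HTTPClientExtension extends HTTPClient {
  constructor(inner?: HTTPClient) {
    super(inner ? { fetcher: (input, init) => inner.request(input, init) } : {});
  }
}

class SplitPdfHook {
  client: HTTPClient | undefined;

  sdkInit(opts: SDKInitOptions): SDKInitOptions {
    const { baseURL, client } = opts;
    // Keep the user-provided client in the chain rather than discarding it.
    this.client = new HTTPClientExtension(client);
    return { baseURL, client: this.client };
  }
}
```

With this shape, any fetcher (and therefore any undici dispatcher) configured on the user's client would still be used for the requests the hook issues, while the hook keeps its own wrapper for intercepting split PDF calls.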

Reproduction

import { Agent, fetch as undiciFetch } from 'undici';
import { UnstructuredClient } from 'unstructured-client';
import { HTTPClient } from 'unstructured-client/lib/http';

const agent = new Agent({
  headersTimeout: 20 * 60 * 1000, // 20 minutes
  bodyTimeout: 20 * 60 * 1000,
});

const customFetch = (input, init) => undiciFetch(input, { ...init, dispatcher: agent });
const httpClient = new HTTPClient({ fetcher: customFetch });

const client = new UnstructuredClient({
  security: { apiKeyAuth: 'xxx' },
  httpClient,  // This is ignored when splitPdfPage=true
  timeoutMs: 20 * 60 * 1000,
});

// A large PDF sent with splitPdfPage=true will time out at 5 minutes despite the config above

Workaround

The only workaround at present is to set a global undici dispatcher via setGlobalDispatcher(), which affects every fetch call in the process.
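
A minimal sketch of that workaround, reusing the timeout values from the reproduction. It assumes the installed undici package is compatible with the copy bundled in Node.js, so that the global fetch picks up the dispatcher registered here.

```typescript
import { Agent, setGlobalDispatcher } from 'undici';

// Process-wide setting: every fetch routed through the global dispatcher
// gets these timeouts, not only the split PDF requests.
setGlobalDispatcher(
  new Agent({
    headersTimeout: 20 * 60 * 1000, // 20 minutes
    bodyTimeout: 20 * 60 * 1000,
  }),
);
```

This has to run once at startup, before the first partition request is sent.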

Environment

  • unstructured-client: 0.25.1
  • Node.js: 20.x
