Commit 86d5f75

feat: Add URL support to all PDF tools
1 parent 4b2fbc3 commit 86d5f75

7 files changed, +175 -60 lines changed

README.md

Lines changed: 16 additions & 12 deletions
@@ -96,31 +96,35 @@ This server equips your AI agent with the following tools for PDF interaction:
 
 - 📄 **`read_pdf_all_text`:**
   - **Description:** Reads all text content and basic information (metadata,
-    page count) from a specified PDF file.
-  - **Input:** `{ "path": "string" }` (Relative path to the PDF file)
+    page count) from a specified PDF file, either local or via URL.
+  - **Input:** `{ "path": "string" }` OR `{ "url": "string" }` (Provide either
+    relative path OR URL)
   - **Output:** An object containing `text`, `numPages`, `numRenderedPages`,
     `info`, `metadata`, and `version` from the PDF.
 
 - 📑 **`read_pdf_page_text`:**
-  - **Description:** Reads text content from specific pages of a PDF file.
-  - **Input:** `{ "path": "string", "pages": "number[] | string" }` (Relative
-    path and an array of 1-based page numbers like `[1, 3, 5]` or a string range
-    like `'1,3-5,7'`)
+  - **Description:** Reads text content from specific pages of a PDF file,
+    either local or via URL.
+  - **Input:** `{ "path": "string", "pages": "..." }` OR
+    `{ "url": "string", "pages": "..." }` (Provide path OR URL, plus page
+    numbers/ranges)
   - **Output:** An object containing an array `pages` (each element has `page`
     number and extracted `text`) and optionally `missingPages` if some requested
     pages couldn't be processed.
 
 - ℹ️ **`get_pdf_metadata`:**
-  - **Description:** Reads metadata (like author, title, creator, producer,
-    dates) and general info from a PDF file without extracting all text content
-    explicitly in the output (though it's parsed internally).
-  - **Input:** `{ "path": "string" }` (Relative path to the PDF file)
+  - **Description:** Reads metadata and general info from a PDF file, either
+    local or via URL.
+  - **Input:** `{ "path": "string" }` OR `{ "url": "string" }` (Provide either
+    relative path OR URL)
   - **Output:** An object containing `info`, `metadata`, `numPages`, and
     `version`.
 
 - #️⃣ **`get_pdf_page_count`:**
-  - **Description:** Quickly gets the total number of pages in a PDF file.
-  - **Input:** `{ "path": "string" }` (Relative path to the PDF file)
+  - **Description:** Quickly gets the total number of pages in a PDF file,
+    either local or via URL.
+  - **Input:** `{ "path": "string" }` OR `{ "url": "string" }` (Provide either
+    relative path OR URL)
   - **Output:** An object containing `numPages`.
 
 ---
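
Reviewer note: the README change above describes an "either `path` OR `url`" input contract. The sketch below illustrates that contract without any library, under stated assumptions: `PdfSource` and `parsePdfSource` are hypothetical names that do not appear in this commit, empty strings are treated as absent, and only `http(s)` URLs are accepted. The commit itself validates with Zod, so its exact behavior may differ.

```typescript
// Hypothetical discriminated union for the two accepted input shapes.
type PdfSource =
  | { kind: 'path'; path: string }
  | { kind: 'url'; url: string };

// True for a non-empty string; empty strings are treated as "not provided".
function isNonEmptyString(value: unknown): value is string {
  return typeof value === 'string' && value.length > 0;
}

// Accepts exactly one of `path` or `url` and rejects everything else.
// Assumption: URLs must start with http:// or https://.
function parsePdfSource(args: unknown): PdfSource {
  if (typeof args !== 'object' || args === null) {
    throw new Error('Arguments must be an object.');
  }
  const record = args as { path?: unknown; url?: unknown };
  const hasPath = isNonEmptyString(record.path);
  const hasUrl = isNonEmptyString(record.url);
  // Both present or both absent violates the "either, but not both" rule.
  if (hasPath === hasUrl) {
    throw new Error("Either 'path' or 'url' must be provided, but not both.");
  }
  if (hasPath) {
    return { kind: 'path', path: record.path as string };
  }
  const url = record.url as string;
  if (!/^https?:\/\//i.test(url)) {
    throw new Error('URL must start with http:// or https://.');
  }
  return { kind: 'url', url: url };
}
```

Returning a tagged union rather than two optional fields lets callers branch on `kind` without re-checking which field is set.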

memory-bank/activeContext.md

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -11,24 +11,29 @@ implementing the core PDF reading tools based on the `filesystem-mcp` template.
 - Updated `package.json` with new project name (`@shtse8/pdf-reader-mcp`),
   version, description, and added `pdf-parse` dependency.
 - Ran `npm install`.
-- Created handler files for the four PDF tools:
+- Created handler files for the four PDF tools (initially local path only):
   - `src/handlers/readPdfAllText.ts`
   - `src/handlers/readPdfPageText.ts`
   - `src/handlers/getPdfMetadata.ts`
   - `src/handlers/getPdfPageCount.ts`
 - Refactored handlers to follow the `ToolDefinition` export pattern found in
   `filesystem-mcp` (instead of using `defineHandler`).
 - Integrated the new tool definitions into `src/handlers/index.ts`.
-- Updated `README.md` to reflect the PDF Reader functionality and tools.
+- Updated `README.md` to reflect the PDF Reader functionality and tools
+  (initially local path only).
+- Removed unused filesystem handlers (e.g., listFiles, editFile) from
+  `src/handlers/index.ts` and deleted corresponding `.ts` files.
+- **Added URL support:** Modified all PDF handlers and Zod schemas to accept
+  either a local `path` or a remote `url`. Updated `README.md` again.
 - Updated Memory Bank files (`techContext.md`, `systemPatterns.md`,
   `projectbrief.md`, `productContext.md`) with initial PDF Reader context.
 - Removed unused filesystem handlers (e.g., listFiles, editFile) from
   `src/handlers/index.ts` and deleted corresponding `.ts` files.
 
 ## 3. Next Steps
 
-- Update `memory-bank/progress.md` to reflect handler removal.
-- Build the project (`npm run build`) again after removing handlers.
+- Update `memory-bank/progress.md` to reflect URL support.
+- Build the project (`npm run build`) again after adding URL support.
 - Consider adding basic tests for the PDF handlers.
 - Commit the initial implementation to the Git repository.
 - Potentially test the server using `@modelcontextprotocol/inspector` or by
@@ -41,3 +46,4 @@ implementing the core PDF reading tools based on the `filesystem-mcp` template.
 - `read_pdf_page_text` uses the `pagerender` callback for potentially better
   accuracy on specific pages.
 - Removed inherited filesystem tools to focus solely on PDF functionality.
+- Added support for fetching PDFs via URL using `fetch`.

memory-bank/progress.md

Lines changed: 10 additions & 8 deletions
@@ -4,7 +4,7 @@
 
 - **Project Setup:** Cloned from `filesystem-mcp`, dependencies installed
   (`pdf-parse` added).
-- **Core Tool Handlers:**
+- **Core Tool Handlers (Support both local path and URL):**
   - `read_pdf_all_text`: Implemented, integrated.
   - `read_pdf_page_text`: Implemented (using `pagerender`), integrated.
   - `get_pdf_metadata`: Implemented, integrated.
@@ -18,14 +18,15 @@
 
 ## 2. What's Left to Build/Verify
 
-- **Compilation:** Need to run `npm run build` to check for TypeScript errors.
+- **Compilation:** Need to run `npm run build` again after adding URL support.
 - **Runtime Testing:**
   - Verify the server starts correctly.
-  - Test each PDF tool with actual PDF files (various types if possible) using
+  - Test each PDF tool with both local paths and URLs using
     `@modelcontextprotocol/inspector` or a live agent.
   - Specifically test `read_pdf_page_text` with different page ranges and edge
     cases.
-  - Verify error handling (e.g., file not found, corrupted PDF).
+  - Verify error handling (e.g., file not found, URL fetch errors, corrupted
+    PDF).
 - **Testing Framework:** Consider adding automated tests (e.g., using Jest or
   Vitest) for handlers.
 - **Refinement:** Review code for potential improvements or edge cases missed.
@@ -35,16 +36,17 @@
 
 ## 3. Current Status
 
-Initial implementation of the core PDF reading tools is complete. Documentation
-updated. Ready for build and testing.
+Implementation of core PDF reading tools (with URL support) is complete.
+Documentation updated. Ready for final build and testing.
 
 ## 4. Known Issues/Risks
 
 - **`pdf-parse` Limitations:** The accuracy of text extraction, especially for
   complex layouts or scanned PDFs, depends heavily on `pdf-parse`. Page number
   detection in `pagerender` might need verification (1-based vs 0-based).
-- **Error Handling:** Current error handling is basic; more specific error types
-  or details might be needed based on testing.
+- **Error Handling:** Basic error handling for file access and URL fetching
+  implemented. More specific PDF parsing errors might need refinement based on
+  testing.
 - **Performance:** Performance on very large PDF files hasn't been tested.
 - **Inherited Filesystem Tools:** Removed. The server now focuses exclusively on
   PDF reading tools. Documentation reflects this.

src/handlers/getPdfMetadata.ts

Lines changed: 35 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,12 @@ import type { ToolDefinition } from './index.js';
 
 // Define the Zod schema for input arguments
 const GetPdfMetadataArgsSchema = z.object({
-  path: z.string().min(1, 'Path cannot be empty.'),
-}).strict();
+  path: z.string().min(1).optional().describe("Relative path to the local PDF file."),
+  url: z.string().url().optional().describe("URL of the PDF file."),
+}).strict().refine(
+  (data) => (data.path && !data.url) || (!data.path && data.url), // Ensure either path or url is provided, but not both
+  { message: "Either 'path' or 'url' must be provided, but not both." }
+);
 
 // Infer TypeScript type for arguments
 type GetPdfMetadataArgs = z.infer<typeof GetPdfMetadataArgsSchema>;
@@ -25,10 +29,29 @@ const handleGetPdfMetadataFunc = async (args: unknown) => {
     throw new McpError(ErrorCode.InvalidParams, 'Argument validation failed');
   }
 
-  const safePath = resolvePath(parsedArgs.path);
+  const { path: relativePath, url } = parsedArgs;
+  let dataBuffer: Buffer;
+  let sourceDescription: string = 'unknown source'; // Initialize
 
   try {
-    const dataBuffer = await fs.readFile(safePath);
+    // Fetch or read the PDF buffer
+    if (relativePath) {
+      sourceDescription = `'${relativePath}'`;
+      const safePath = resolvePath(relativePath);
+      dataBuffer = await fs.readFile(safePath);
+    } else if (url) {
+      sourceDescription = `'${url}'`;
+      const response = await fetch(url);
+      if (!response.ok) {
+        throw new McpError(ErrorCode.InternalError, `Failed to fetch PDF from ${url}. Status: ${response.status} ${response.statusText}`);
+      }
+      const arrayBuffer = await response.arrayBuffer();
+      dataBuffer = Buffer.from(arrayBuffer);
+    } else {
+      throw new McpError(ErrorCode.InvalidParams, "Missing 'path' or 'url'.");
+    }
+
+    // Now parse the buffer
     // We only need metadata, but pdf-parse reads everything anyway
     const data = await pdf(dataBuffer);
 
@@ -41,11 +64,15 @@ const handleGetPdfMetadataFunc = async (args: unknown) => {
       version: data.version,
     };
   } catch (error: any) {
-    let errorMessage = `Failed to read or parse PDF for metadata at '${parsedArgs.path}'.`;
-    if (error.code === 'ENOENT') {
-      errorMessage = `File not found at '${parsedArgs.path}'. Resolved to: ${safePath}`;
+    if (error instanceof McpError) throw error; // Re-throw known MCP errors
+
+    let errorMessage = `Failed to read or parse PDF for metadata from ${sourceDescription}.`;
+    // Keep ENOENT check for local files
+    if (relativePath && error.code === 'ENOENT') {
+      const safePath = resolvePath(relativePath); // Resolve again for error message
+      errorMessage = `File not found at '${relativePath}'. Resolved to: ${safePath}`;
     } else if (error instanceof Error) {
-      errorMessage += ` Reason: ${error.message}`;
+      errorMessage += ` Reason: ${error.message}`;
     } else {
       errorMessage += ` Unknown error: ${String(error)}`;
     }
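
Reviewer note: the catch block above builds one of three messages depending on whether the failure was a missing local file, an ordinary `Error`, or something else. The sketch below restates that branching as a pure function so it can be checked in isolation. `FailureContext` and `describePdfFailure` are hypothetical names, not part of this commit, and the resolved path is passed in rather than recomputed with `resolvePath`.

```typescript
// Hypothetical context describing where the PDF came from.
interface FailureContext {
  // Quoted path or URL, e.g. "'docs/a.pdf'".
  sourceDescription: string;
  // Present only for local files.
  local?: { relativePath: string; resolvedPath: string };
}

// Builds the user-facing message for a failed read or parse.
function describePdfFailure(ctx: FailureContext, error: unknown): string {
  const base = `Failed to read or parse PDF from ${ctx.sourceDescription}.`;
  // Node file-system errors carry a string `code` such as 'ENOENT'.
  const code =
    typeof error === 'object' && error !== null
      ? (error as { code?: unknown }).code
      : undefined;
  // The ENOENT branch only applies when the source was a local path.
  if (ctx.local && code === 'ENOENT') {
    return `File not found at '${ctx.local.relativePath}'. Resolved to: ${ctx.local.resolvedPath}`;
  }
  if (error instanceof Error) {
    return `${base} Reason: ${error.message}`;
  }
  return `${base} Unknown error: ${String(error)}`;
}
```

Keeping the message logic free of I/O is one possible way to share it across the handlers, which currently repeat the same catch block.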

src/handlers/getPdfPageCount.ts

Lines changed: 35 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,12 @@ import type { ToolDefinition } from './index.js';
 
 // Define the Zod schema for input arguments
 const GetPdfPageCountArgsSchema = z.object({
-  path: z.string().min(1, 'Path cannot be empty.'),
-}).strict();
+  path: z.string().min(1).optional().describe("Relative path to the local PDF file."),
+  url: z.string().url().optional().describe("URL of the PDF file."),
+}).strict().refine(
+  (data) => (data.path && !data.url) || (!data.path && data.url), // Ensure either path or url is provided, but not both
+  { message: "Either 'path' or 'url' must be provided, but not both." }
+);
 
 // Infer TypeScript type for arguments
 type GetPdfPageCountArgs = z.infer<typeof GetPdfPageCountArgsSchema>;
@@ -25,10 +29,29 @@ const handleGetPdfPageCountFunc = async (args: unknown) => {
     throw new McpError(ErrorCode.InvalidParams, 'Argument validation failed');
   }
 
-  const safePath = resolvePath(parsedArgs.path);
+  const { path: relativePath, url } = parsedArgs;
+  let dataBuffer: Buffer;
+  let sourceDescription: string = 'unknown source'; // Initialize
 
   try {
-    const dataBuffer = await fs.readFile(safePath);
+    // Fetch or read the PDF buffer
+    if (relativePath) {
+      sourceDescription = `'${relativePath}'`;
+      const safePath = resolvePath(relativePath);
+      dataBuffer = await fs.readFile(safePath);
+    } else if (url) {
+      sourceDescription = `'${url}'`;
+      const response = await fetch(url);
+      if (!response.ok) {
+        throw new McpError(ErrorCode.InternalError, `Failed to fetch PDF from ${url}. Status: ${response.status} ${response.statusText}`);
+      }
+      const arrayBuffer = await response.arrayBuffer();
+      dataBuffer = Buffer.from(arrayBuffer);
+    } else {
+      throw new McpError(ErrorCode.InvalidParams, "Missing 'path' or 'url'.");
+    }
+
+    // Now parse the buffer
     // We only need the page count, but pdf-parse reads everything
     const data = await pdf(dataBuffer);
 
@@ -37,11 +60,15 @@ const handleGetPdfPageCountFunc = async (args: unknown) => {
       numPages: data.numpages,
     };
   } catch (error: any) {
-    let errorMessage = `Failed to read or parse PDF for page count at '${parsedArgs.path}'.`;
-    if (error.code === 'ENOENT') {
-      errorMessage = `File not found at '${parsedArgs.path}'. Resolved to: ${safePath}`;
+    if (error instanceof McpError) throw error; // Re-throw known MCP errors
+
+    let errorMessage = `Failed to read or parse PDF for page count from ${sourceDescription}.`;
+    // Keep ENOENT check for local files
+    if (relativePath && error.code === 'ENOENT') {
+      const safePath = resolvePath(relativePath); // Resolve again for error message
+      errorMessage = `File not found at '${relativePath}'. Resolved to: ${safePath}`;
     } else if (error instanceof Error) {
-      errorMessage += ` Reason: ${error.message}`;
+      errorMessage += ` Reason: ${error.message}`;
     } else {
       errorMessage += ` Unknown error: ${String(error)}`;
     }

src/handlers/readPdfAllText.ts

Lines changed: 36 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,12 @@ import type { ToolDefinition } from './index.js'; // Import the internal interfa
 
 // 1. Define the Zod schema for input arguments
 const ReadPdfAllTextArgsSchema = z.object({
-  path: z.string().min(1, 'Path cannot be empty.'),
-}).strict(); // Use strict to prevent unexpected arguments
+  path: z.string().min(1).optional().describe("Relative path to the local PDF file."),
+  url: z.string().url().optional().describe("URL of the PDF file."),
+}).strict().refine(
+  (data) => (data.path && !data.url) || (!data.path && data.url), // Ensure either path or url is provided, but not both
+  { message: "Either 'path' or 'url' must be provided, but not both." }
+);
 
 // Infer TypeScript type for arguments
 type ReadPdfAllTextArgs = z.infer<typeof ReadPdfAllTextArgsSchema>;
@@ -26,10 +30,29 @@ const handleReadPdfAllTextFunc = async (args: unknown) => {
     throw new McpError(ErrorCode.InvalidParams, 'Argument validation failed');
   }
 
-  const safePath = resolvePath(parsedArgs.path);
-
+  const { path: relativePath, url } = parsedArgs;
+  let dataBuffer: Buffer;
+  let sourceDescription: string = 'unknown source'; // Initialize here
   try {
-    const dataBuffer = await fs.readFile(safePath);
+    if (relativePath) {
+      sourceDescription = `'${relativePath}'`;
+      const safePath = resolvePath(relativePath);
+      dataBuffer = await fs.readFile(safePath);
+    } else if (url) {
+      sourceDescription = `'${url}'`;
+      const response = await fetch(url);
+      if (!response.ok) {
+        // Use InternalError or a more generic code if NetworkError doesn't exist
+        throw new McpError(ErrorCode.InternalError, `Failed to fetch PDF from ${url}. Status: ${response.status} ${response.statusText}`);
+      }
+      const arrayBuffer = await response.arrayBuffer();
+      dataBuffer = Buffer.from(arrayBuffer);
+    } else {
+      // This should be caught by Zod refine, but as a safeguard:
+      throw new McpError(ErrorCode.InvalidParams, "Missing 'path' or 'url'.");
+    }
+
+    // Now parse the buffer
     const data = await pdf(dataBuffer);
 
     // pdf-parse returns numpages, numrender, info, metadata, text, version
@@ -45,12 +68,15 @@ const handleReadPdfAllTextFunc = async (args: unknown) => {
       version: data.version,
     };
   } catch (error: any) {
-    // Provide a more specific error message if possible
-    let errorMessage = `Failed to read or parse PDF at '${parsedArgs.path}'.`;
-    if (error.code === 'ENOENT') {
-      errorMessage = `File not found at '${parsedArgs.path}'. Resolved to: ${safePath}`;
+    if (error instanceof McpError) throw error; // Re-throw known MCP errors
+
+    let errorMessage = `Failed to read or parse PDF from ${sourceDescription}.`; // Remove default value here, already initialized
+    // Keep ENOENT check for local files
+    if (relativePath && error.code === 'ENOENT') {
+      const safePath = resolvePath(relativePath); // Resolve again for error message
+      errorMessage = `File not found at '${relativePath}'. Resolved to: ${safePath}`;
     } else if (error instanceof Error) {
-      errorMessage += ` Reason: ${error.message}`;
+      errorMessage += ` Reason: ${error.message}`;
     } else {
       errorMessage += ` Unknown error: ${String(error)}`;
     }
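
Reviewer note: the handler comment lists the fields `pdf-parse` returns (`numpages`, `numrender`, `info`, `metadata`, `text`, `version`), while the README documents the tool output as `text`, `numPages`, `numRenderedPages`, `info`, `metadata`, and `version`. The sketch below shows one plausible mapping between the two. The interface and function names are hypothetical, the field types are assumptions, and the `numrender` to `numRenderedPages` pairing is inferred from the names because the return statement is not visible in this hunk.

```typescript
// Fields the handler comment attributes to pdf-parse; types are assumed.
interface PdfParseResultLike {
  numpages: number;
  numrender: number;
  info: unknown;
  metadata: unknown;
  text: string;
  version: string;
}

// Output fields listed in the README for `read_pdf_all_text`.
interface ReadPdfAllTextOutput {
  text: string;
  numPages: number;
  numRenderedPages: number;
  info: unknown;
  metadata: unknown;
  version: string;
}

// Hypothetical mapper: renames the lower-case counters to camelCase
// and passes the remaining fields through unchanged.
function toReadPdfAllTextOutput(data: PdfParseResultLike): ReadPdfAllTextOutput {
  return {
    text: data.text,
    numPages: data.numpages,
    numRenderedPages: data.numrender,
    info: data.info,
    metadata: data.metadata,
    version: data.version,
  };
}
```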
