Skip to content

feat: add document-to-markdown MCP tool, improve KG structuring & retrieval, and extend blob storage#28

Open
Jurij89 wants to merge 3 commits intomainfrom
feat/pdf-to-markdown-tool
Open

feat: add document-to-markdown MCP tool, improve KG structuring & retrieval, and extend blob storage#28
Jurij89 wants to merge 3 commits intomainfrom
feat/pdf-to-markdown-tool

Conversation

@Jurij89
Copy link
Contributor

@Jurij89 Jurij89 commented Feb 4, 2026

  • Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
  • Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF
  • Extend blob storage to support subfolders for better organization

…rieval, and extend blob storage:

- Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
- Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF
- Extend blob storage to support subfolders for better organization
@Jurij89 Jurij89 requested a review from Copilot February 4, 2026 16:06
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds document conversion capabilities to the DKG essentials plugin, enhances the DKG agent's knowledge structuring abilities, and improves blob storage organization.

Changes:

  • Adds a new document-to-markdown MCP tool that converts PDF, DOCX, and PPTX files to markdown using Mistral OCR
  • Significantly expands the DKG agent system prompt with detailed instructions for structuring knowledge graphs, conducting SPARQL queries, and extracting deep knowledge from documents
  • Extends blob storage to support subfolder organization through automatic parent directory creation

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts Implements the core document-to-markdown conversion tool with Mistral OCR integration, image extraction, and blob storage
packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts Comprehensive test suite covering tool registration, input validation, file type validation, and blob integration
packages/plugin-dkg-essentials/src/index.ts Exports and initializes the new document-to-markdown plugin
packages/plugin-dkg-essentials/src/createFsBlobStorage.ts Adds parent directory creation to support nested blob paths
packages/plugin-dkg-essentials/package.json Adds Mistral AI SDK and undici dependencies
packages/plugin-dkg-essentials/src/plugins/dkg-tools.ts Updates tool description for clarity
apps/agent/src/shared/chat.ts Major expansion of agent system prompt with detailed knowledge extraction, SPARQL query patterns, and user communication guidelines

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 417 to 421
const imageFilename = `${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
const imageBlobId = `${BLOB_PREFIX}/${folderId}/${imageFilename}`;
const imageStream = Readable.toWeb(Readable.from(image.data));
await ctx.blob.put(imageBlobId, imageStream, {
name: imageFilename,
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable imageFilename is redundant and reduces code clarity. The comment indicates image.id already contains the complete filename. The variable should be removed and image.id used directly in subsequent code (line 418) to eliminate unnecessary indirection.

Suggested change
const imageFilename = `${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
const imageBlobId = `${BLOB_PREFIX}/${folderId}/${imageFilename}`;
const imageStream = Readable.toWeb(Readable.from(image.data));
await ctx.blob.put(imageBlobId, imageStream, {
name: imageFilename,
const imageBlobId = `${BLOB_PREFIX}/${folderId}/${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
const imageStream = Readable.toWeb(Readable.from(image.data));
await ctx.blob.put(imageBlobId, imageStream, {
name: image.id,

Copilot uses AI. Check for mistakes.
Comment on lines 321 to 323
let timeoutId: NodeJS.Timeout;
const timeoutPromise = new Promise<never>((_, reject) => {
timeoutId = setTimeout(() => {
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable timeoutId is declared with let but assigned inside the Promise constructor, which means it may be undefined when clearTimeout(timeoutId) is called on line 345. This creates a race condition where if the Promise rejects/resolves before the setTimeout callback assigns the value, the cleanup will fail. Declare and assign timeoutId before creating timeoutPromise to ensure it's always defined for cleanup.

Copilot uses AI. Check for mistakes.
Comment on lines +250 to +253
// Create a base64 string that decodes to > 50MB
// 50MB = 52,428,800 bytes, base64 encoding increases size by ~33%
// So we need about 70MB of base64 data
const largeSizeBytes = 51 * 1024 * 1024; // 51MB
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on lines 250-251 is misleading. The test creates a 51MB buffer and then base64-encodes it (line 255), which actually creates ~68MB of base64 data. However, the tool validates the decoded size (line 520 in document-to-markdown.ts), not the base64 size. The comment should clarify that the 51MB buffer will decode to 51MB, which exceeds the 50MB limit, making the base64 size discussion irrelevant to this validation.

Suggested change
// Create a base64 string that decodes to > 50MB
// 50MB = 52,428,800 bytes, base64 encoding increases size by ~33%
// So we need about 70MB of base64 data
const largeSizeBytes = 51 * 1024 * 1024; // 51MB
// Create a base64 string whose decoded content is > 50MB
// 50MB = 52,428,800 bytes; the size check is performed on the decoded bytes
// So a 51MB buffer is sufficient to exceed the 50MB limit when decoded
const largeSizeBytes = 51 * 1024 * 1024; // 51MB decoded

Copilot uses AI. Check for mistakes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how should I feel about changing so many things in package.lock? was this really necessary ?

* Validate that MISTRAL_API_KEY is set
* @throws Error if API key is missing
*/
function validateMistralApiKey(): string {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why mistral ? maybe lets have some generic llm support or an interface where we can implement multiple llm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mistral was selected because it is best performing for pdf->markdown conversion.

but i agree, the way we create new LLM clients is very "all over the place".

// Apply page range filter if specified
let pages = response.pages;
if (options?.pageStart !== undefined || options?.pageEnd !== undefined) {
const start = (options.pageStart ?? 1) - 1; // Convert to 0-indexed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should 1 be a constant? e.g. DEFAULT_PAGE_START = 1.

Comment on lines 483 to 487
pageStart: z
.number()
.optional()
.describe("First page to process (1-indexed)"),
pageEnd: z
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some validation for input params, like page start not being bigger than page end, can it be negative, etc.

@@ -194,31 +194,348 @@ export const DEFAULT_SYSTEM_PROMPT = `
You are a DKG Agent that helps users interact with the OriginTrail Decentralized Knowledge Graph (DKG) using available Model Context Protocol (MCP) tools.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should revisit the whole prompt and try to narrow it down somewhat? This could influence the context of the llm quite a lot, especially for other tool which heavily rely on the llm for final processing, like retrieval for example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried the prompt quite a bit and works pretty well. Also checked and it's only 4k tokens - with nmost models having now 200k or 400k context window, I don't see this as an issue I think?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm In this context I see document-to-markdown tool instructions even if that is a plugin which is not always included? or am I missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tool is part of the dkg essentials plugin, which contains basic tools that enable the DKG agent (DKG interaction tools + this basic document processing) - so yes, the plugin is always included and comes out of the box.

- Extract DocumentConversionProvider interface for pluggable OCR backends
- Move Mistral-specific code to providers/mistral.ts
- Add provider registry with factory pattern (createProvider, getAvailableProviders)
- Split into modular structure: types, validation, blob-integration, providers
- Support runtime provider selection via DOCUMENT_CONVERSION_PROVIDER env var
- Add createDocumentToMarkdownPlugin() factory for custom configuration
- Update tests with provider registry and mock provider tests
- Update documentation with provider abstraction usage
`Document-to-markdown plugin initialized with provider: ${provider.name}`,
);

mcp.registerTool(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could register an api endpoint too

Comment on lines +103 to +112
} catch (error) {
// Log warning but don't fail plugin initialization
// Provider errors will surface when the tool is actually used
console.warn(
`Document conversion provider "${providerName}" initialization deferred: ` +
`${error instanceof Error ? error.message : String(error)}`,
);
// Create a lazy provider that initializes on first use
provider = createLazyProvider(providerName);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why load a lazy provider if the tool will fail without a valid provider later on? We should fail fast here with a constructive error, it's easier to debug and less code to maintain

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants