Initial Checks
- I confirm that I'm using the latest version of MCP Python SDK
- I confirm that I searched for my issue in https://github.com/modelcontextprotocol/python-sdk/issues before opening this issue
Description
Something does not square for me about the MCP Protocol standard, as owned and operated by Anthropic.
Here is the bug. I have a simple 20-line function that receives an image URL and applies a custom OCR toolkit. It is well-documented code with a simple parameter structure, and it runs locally (in this case). For those who do not want to read the code, there is no magic here: the result is a simple dictionary with a text field containing the OCR string extracted from the image. The optional flag add_tokens adds a further JSON dictionary of metadata about each token, add_typographical adds pixel data about the image font, and filter_as_block and confidence_threshold are simple quality filters. References to CloudNode (our proprietary internal infrastructure) contribute nothing here except to properly download the file (to a BytesIO object). Nothing surprising in any way.
As I sit here at this coffee shop this morning, the MCP server has begun failing in two MCP Protocol ways.
def ocr(
    image_url: str,
    add_tokens=True,
    add_typographical=False,
    filter_as_block=False,
    confidence_threshold: float = 0.50,
) -> dict:
    """
    Extract text from an image using Remarkable OCR.

    This tool performs optical character recognition on the provided
    image using the RemarkableOCR engine. It supports multiple output formats
    and filtering options to customize the OCR results for different use cases.

    Args:
        image_url (str): Image URL (http://, https://, file://, or uri://)
        add_tokens (bool, optional): Include raw OCR tokens in response.
            Defaults to True. When True, adds detailed token information.
        add_typographical (bool, optional): Include typographical statistics.
            Defaults to False. When True, adds enriched typographical data.
        filter_as_block (bool, optional): Filter text using block assumptions.
            Defaults to False. When True, applies block-based text filtering.
        confidence_threshold (float, optional): Minimum confidence threshold
            for OCR recognition. Defaults to 0.50.

    Returns:
        dict: CloudNode format response containing:
            - ok (bool): Operation success status
            - msg (str): Status message or error description
            - data (dict): OCR result with:
                - text (str): Readable text lines extracted from the image
                - tokens (list, optional): Raw OCR tokens if add_tokens=True
                - typographical (dict, optional): Typographical statistics if add_typographical=True

    Note:
        The image input should be a URL using CloudNode's supported protocols.
        RemarkableOCR handles all image format conversion internally.
    """
- The LLM is constructing tool calls with parameters set as, for instance, add_tokens=false, instead of the proper Python boolean False. The downstream OCR MCP server tool (my code) is interpreting this as the string 'false', which passes a reasonable if add_tokens: check, and so the OCR software adds the tokens to the result (and succeeds, as the code is well-developed and properly maintained). However, this is an incorrect invocation of the MCP tool by the LLM as instructed by the MCP Protocol communication. I must presume that not only does the MCP server directly specify that its software is running Python (in this case), but that the docstring of the function is clearly sufficient for a large LLM (Big Pickle by Zen) not to make this error. The only improvement to my OCR code would be type enforcement, i.e., add_tokens: bool = False in the ocr() function declaration. (A minimal repro follows the invocation log below.)
cloudnode-home_ocr [image_url=file:///twain.jpg, filter_as_block=true, add_tokens=false, add_typographical=false]
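For anyone skimming, here is a minimal sketch of the truthiness trap described above, with no MCP machinery involved; the trimmed ocr() signature at the end is only an illustration of the type-enforcement fix (defaults kept as in the original tool), not the full implementation.

```python
# What the tool body effectively saw, per the invocation log above: the
# JSON literal false arriving as the string "false" rather than a Python bool.
add_tokens = "false"

# Any non-empty string is truthy, so a reasonable `if add_tokens:` check
# still takes the "add the tokens" branch.
if add_tokens:
    print("tokens get added anyway")  # this prints

print(bool("false"))  # True -- the string, not the boolean, is being tested

# The fix suggested above: annotate every parameter so the SDK has a
# schema to validate/coerce against before the tool body runs.
def ocr(
    image_url: str,
    add_tokens: bool = True,
    add_typographical: bool = False,
    filter_as_block: bool = False,
    confidence_threshold: float = 0.50,
) -> dict:
    ...
```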
- The results returned by the ocr() function are significantly large when add_tokens=True (roughly 500 characters for each token), and dumping all this information into the LLM context breaks the LLM session. This behavior between LLMs and MCPs continues to confuse me as to why this failure mode does not have a mature design pattern already. Forgive me for being agog, but this is day-one engineering, and you are responsible for an industry design pattern (and we already see Amazon's fingers in a lot of stewing problems). Yes, technically, the responsibility for this error lies with the LLM itself choosing to dump the result to context (OpenCode executed), and the LLM 'should' be aware enough to write this data to a temporary file which is then selectively read in a subagent session owning its own context, ultimately summarizing its results to the primary agent so it can continue in its original context. But, kindly, why on Earth would the MCP Protocol standards (and OpenCode) not be designed to do this by default, from the program-management onset, to create provenance and control over API and external functions? Be serious. There is no reason the LLM could not communicate the maximum extent of its context so that the MCP Protocol knows the response needs to be cached on disk instead, or that this could not be a core configuration of any MCP server on the client side for simple bookkeeping, reconstruction, and provenance. (This surely would have led OpenCode to design their agent systems to expect on-disk file interactions and caching as well, and you guys are all smarter than me.) As for the problem as it manifests on the OpenCode side today (without such changes to the MCP Protocol), a standard subagent could still simply dump the entire file back into the primary agent's context. And, yes, OpenCode agent paradigms are, of course, not the responsibility of the MCP Protocol, nor its team. But the MCP Protocol is responsible on the client side, so not allowing out-of-the-box configurations that intercept the response and dump it to a temporary file, instead of filtering it up to the agent, is a design choice of the MCP Protocol. To really highlight the unsolved security issues inherent in the MCP + agent paradigms of the industry today: in this case, the failure-mode response of the LLM (Big Pickle) is, unbelievably, to attempt to execute custom Python code to solve the OCR problem. This makes no sense within a robust execution environment for software implementation, and I am surely not the first person to raise this discussion here. A rough sketch of the spill-to-disk pattern I am describing follows this list. Thanks. You guys are the best.
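As a stopgap that requires no protocol changes, here is a rough sketch of the spill-to-disk pattern argued for above, written against the ocr() tool defined earlier. The wrapper name ocr_to_file, the tokens_path field, and the size threshold are all hypothetical choices for illustration, not anything the SDK provides.

```python
import json
import tempfile

# Arbitrary cutoff: serialized payloads above this many characters get written
# to disk instead of being returned inline into the model's context.
MAX_INLINE_CHARS = 4_000

def ocr_to_file(image_url: str, **kwargs) -> dict:
    """Hypothetical wrapper around the ocr() tool above: keep the bulky token
    list out of the LLM context by spilling it to a temp file and returning
    only a path plus a short summary."""
    result = ocr(image_url, **kwargs)
    tokens = result.get("data", {}).get("tokens")
    if tokens and len(json.dumps(tokens)) > MAX_INLINE_CHARS:
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False
        ) as fh:
            json.dump(tokens, fh)
        result["data"]["tokens_path"] = fh.name  # where the full payload lives
        result["data"]["tokens"] = f"{len(tokens)} tokens written to {fh.name}"
    return result
```

The same interception logic could just as easily live on the client side if the protocol or OpenCode exposed a response-size hook, which is the out-of-the-box configuration I am asking for.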
Google is hell-bent on returning Reddit forums instead of proper results (God forbid, GitHub issues), but here is the same problem from a different user six days ago. Not only do the standard solutions require total control of the MCP server tool code itself, they also require intense, standardized data handling on the server side across the entire tool suite, PLUS the LLM doing intense heavy lifting with foreign files and pagination. You guys are making these design choices for Amazon at Anthropic and calling this open source, which it clearly has no reckoning to be. And the best Google can do is cough up an OpenAI-cohort chat between future bots. Can we agree on that as an industry? Thanks. You guys are the best. https://www.reddit.com/r/mcp/comments/1q2lm5f/comment/nxgfv5b/
Property of Starlight LLC (as is every piece of content created by my software)
Python & MCP Python SDK
Python 3.13.9
MCP 1.25
OpenCode 1.1.8
Update 11:48 AM
Indeed, having improved my own code to enforce types, I find there is no transparency into the MCP coercion logic. As you can see here, the LLM display output is similar to before (using the lowercase boolean and no quotes), but the parameters passed to the MCP tool are now booleans. So there is no native inspection into what the MCP Protocol is doing for many of the most important LLM use cases, especially for naive users. Indeed, during debugging the MCP server hit a timeout, and the LLM began writing custom Python code to execute its own filesystem reads and third-party OCR packages (pytesseract, in this case), which is an insane failure mode. You cannot go to an enterprise manager and tell them this, so I question what this industry is really doing. Every vibe-coding joke on the web starts with intern analysts putting OpenAI API calls into try/catch statements on JS pages. Be serious.
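On the transparency point, the closest workaround I have is logging what actually arrives in the tool body after whatever coercion the SDK applies. A rough sketch using only the standard library follows; the decorator stacking under @mcp.tool() is an assumption about how a FastMCP-based server is wired, not something documented here, and the wire-level JSON is still not visible this way.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mcp-arg-inspect")

def log_arguments(fn):
    """Log the value and type of every keyword argument the tool receives,
    i.e. the post-coercion view of what the MCP layer handed to the tool."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for name, value in kwargs.items():
            log.info("%s(%s=%r) [%s]", fn.__name__, name, value, type(value).__name__)
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical wiring, if the server is a FastMCP app:
# @mcp.tool()
# @log_arguments
# def ocr(image_url: str, add_tokens: bool = True, ...) -> dict: ...
```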
Thanks. You guys are the best.