This allows llm to use tools to process user-uploaded images. #1761

paintstar · 2025-11-20T14:05:45Z

paintstar
Nov 20, 2025

In the original code, the framework would hand over the user-uploaded image to the visual model to generate a description, and then hand over the user's question and description together to the large language model for resolution.

async def join_minio_file_description_to_query(minio_files, query):
    final_query = query
    if minio_files and isinstance(minio_files, list):
        file_descriptions = []
        for file in minio_files:
            if isinstance(file, dict) and "description" in file and file["description"]:
                file_descriptions.append(file["description"])

        if file_descriptions:
            final_query = "User provided some reference files:\n"
            final_query += "\n".join(file_descriptions) + "\n\n"
            final_query += f"User wants to answer questions based on the above information: {query}"
    return final_query

However, this prevents the large language model from using tools to manipulate the image because it simply cannot access the image.

In fact, during the previous preprocessing, the file information did indeed store its URL. If we pass both the URL and description to the large language model, it can then call the configured mcp tool. First, the image URL is passed to the mcp tool, which then processes the image.

Something like this:

async def join_minio_file_information_to_query(minio_files, query):
    final_query = query
    if minio_files and isinstance(minio_files, list):
        file_descriptions = []
        file_url = []   # add file_url
        for file in minio_files:
            if isinstance(file, dict) and "description" in file and file["description"]:
                file_descriptions.append(file["description"])
                file_url.append(file["url"])   # append the url

        if file_descriptions:
            final_query = "User provided some reference files:\n"
            final_query += "\n" + "Reference file URLs:\n"   # show the URLs to llm
            final_query += "\n".join(file_url)
            final_query += "\n" + "Reference file description:\n"
            final_query += "\n".join(file_descriptions) + "\n\n"
            final_query += f"User wants to answer questions based on the above information: {query}"

    return final_query

I have confirmed that the above changes can achieve the desired effect.
I'm not sure if this aligns with nexent's development strategy. If I submit a pull request, will it be accepted? Or is there a better approach that I'm unaware of?

Jasonxia007 · 2025-11-25T03:08:44Z

Jasonxia007
Nov 25, 2025
Collaborator

Hi there! That's a pretty good point, and we have been dealing with it for some days. However, we have essentially completed the development of this part, so this PR is unnecessary.
We would turn those key behaviour on files (such as describe images into text) into a more general agent tool, then append its output onto the query.

We appreciate the suggestions you've raised above, and we warmly welcome you to directly submit PRs or raise suggestions in the future. Our developers will always be free to review them and provide feedback on merging.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This allows llm to use tools to process user-uploaded images. #1761

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

This allows llm to use tools to process user-uploaded images. #1761

Uh oh!

paintstar Nov 20, 2025

Replies: 1 comment

Uh oh!

Uh oh!

Jasonxia007 Nov 25, 2025 Collaborator

paintstar
Nov 20, 2025

Jasonxia007
Nov 25, 2025
Collaborator